System Status

Current Status - High Performance Computing (HPC)

This service is not reporting an issue.

Planned Maintenance - High Performance Computing (HPC)

No system maintenance is planned over the next 30 days.

High Performance Computing (HPC) (Last 90 days)

Color bar indicators flow left to right, from oldest to most recent status.

System History - High Performance Computing (HPC)

From most recent to oldest, this list shows all outages, degradations, and planned maintenance for this service.

Degradation - High Performance Computing (HPC)

Started 2024-12-06 11:00:00, Duration 1 Hour 45 Minutes
OIT Research Services Web Application
Task Number: INC4364939

The issue was related to a password change and has been resolved.

The OIT Research Services Web Application is reporting a database error - staff are investigating.

Degradation - High Performance Computing (HPC)

Started 2024-11-20 00:00:28, Duration 9 Hours 14 Minutes
HPC Cluster Overloaded
Task Number: INC4357219

New logins to the HPC cluster are hanging after printing the message of the day. This appears to be due to jobs overloading the GPFS file system. Staff are working to resolve the issue.

Jobs causing the overload were terminated - access should be back to normal.

Planned maintenance - High Performance Computing (HPC)

Started 2024-10-26 08:00:00, Duration 4 Hours
File transfer node - servxfer - update

The HPC file transfer node (servxfer) will be unavailable Saturday 26 October between 8am and noon for an OS update. 

Planned maintenance - High Performance Computing (HPC)

Started 2024-10-16 09:00:00, Duration 1 Hour
Open OnDemand Update

Open OnDemand will be updated to a new version that adds RStudio support. The existing server will be taken offline at 9am Wednesday 16 October and the new server will be online by 10am. The URL for access will remain the same following the update.

Degradation - High Performance Computing (HPC)

Started 2024-07-27 18:30:00, Duration 1 Day 17 Hours
HPC login issue

ssh login to login.hpc.ncsu.edu has been restored.

ssh logins to login.hpc.ncsu.edu are currently failing - the issue is under investigation.

The CentOS 7.9 login node (login01.hpc.ncsu.edu) is available, and Open OnDemand also remains available for RHEL 9 logins - https://servood.hpc.ncsu.edu/
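For example, a user could reach the cluster through the alternate login node while the primary address was down (a minimal sketch; unityid is a placeholder for your own username):

    ssh unityid@login01.hpc.ncsu.edu

Open OnDemand sessions were unaffected and remained reachable at the URL above.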

Planned maintenance - High Performance Computing (HPC)

Started 2024-07-25 08:00:00, Duration 2 Hours
LSF Maintenance

The LSF manager node will be moved to a RHEL 9 host starting at 8am Thursday 25 July. There should be minimal impact, but there could be a delay in new jobs starting as state information is updated on the new LSF manager.

Degradation - High Performance Computing (HPC)

Started 2024-07-11 15:00:00, Duration 3 Hours
Research Storage not available
Task Number: INC4273115

During a planned software update on a file server - an update that is normally non-disruptive - the research storage file systems became unavailable on HPC nodes running RHEL 9.2. These file systems have been remounted and appear to be working normally at this time.

Staff are investigating whether this software update caused the issue and, if so, why.

/rs1 and /rsstu are currently unavailable on RHEL 9.2 nodes on the HPC cluster.
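As a quick check after such a remount, a user on an affected node could confirm both file systems are visible again (a minimal sketch using standard tools; not part of the original notice):

    df -h /rs1 /rsstu    # each path should report a mounted file system rather than an error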

Degradation - High Performance Computing (HPC)

Started 2024-03-23 09:40:00, Duration 20 Minutes
Data Center Power Issue

There was a data center power issue this morning (Saturday) that caused 52 nodes in the Hazel cluster to temporarily lose power. Jobs that were running on those nodes at the time of the power outage were lost. The outage occurred at approximately 9:40am.

Planned maintenance - High Performance Computing (HPC)

Started 2024-03-16 08:00:00, Duration 1 Day 9 Hours
Ethernet Switch Replacement

From 8am Saturday March 16 through 5pm Sunday March 17, the HPC cluster Hazel will be unavailable to allow for replacement of its core Ethernet switches.

LSF will be upgraded to version 10.1.0.14 following the switch replacement.

Planned maintenance - High Performance Computing (HPC)

Started 2024-02-12 17:00:00, Duration 3 Days 19 Hours
GPU node network change
Task Number: INC4200795

GPU nodes (except for A30 nodes) have been closed in LSF to allow jobs to finish prior to network work scheduled for Thursday 15 February, which will add a top-of-rack Ethernet switch to aggregate GPU node connections to the core HPC switches.

Outage - High Performance Computing (HPC)

Started 2024-02-04 20:00:00, Duration 12 Hours 30 Minutes
LSF issue
Task Number: INC4195251

Working with IBM, staff removed the huge number of jobs in the gpu queue and LSF has returned to normal operation.

An excessive number of jobs were submitted to LSF, causing the system to fail with the error message:

Failed in an LSF library call: Internal library error

A ticket has been opened with IBM. Currently LSF commands including bsub, bjobs, and bkill fail with this error.
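Once LSF returned to normal, any lightweight scheduler query was enough to confirm the commands were answering again rather than returning the library error (a minimal sketch; any of the affected b* commands would do):

    bjobs -u all | head    # should print a job listing instead of the internal library error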

Degradation - High Performance Computing (HPC)

Started 2023-12-16 00:00:01, Duration 12 Hours 44 Minutes
Open OnDemand unavailable
Task Number: INC4175797

The SSL cert for servood.hpc.ncsu.edu has been updated and the Open OnDemand service has been restored.

The renewal cert for servood.hpc.ncsu.edu has issues and Open OnDemand is currently unavailable.
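One way to confirm which certificate servood.hpc.ncsu.edu is serving, and whether the renewal took effect, is a standard openssl check (a minimal sketch; not part of the original notice):

    echo | openssl s_client -connect servood.hpc.ncsu.edu:443 -servername servood.hpc.ncsu.edu 2>/dev/null | openssl x509 -noout -dates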

Degradation - High Performance Computing (HPC)

Started 2023-11-06 17:00:00, Duration 24 Days 23 Hours
Hazel open with one core Ethernet switch

As of Dec 1, the redundant pair of core Ethernet switches has been restored.

A bug was identified in the switch software that caused the failure when the time changed from daylight to standard time.

Access to Hazel via login nodes and Open OnDemand has been restored.

However, the cluster is operating with a single core Ethernet switch rather than a redundant pair.

Outage - High Performance Computing (HPC)

Started 2023-11-05 05:50:00, Duration 1 Day 11 Hours 10 Minutes
Login nodes not reachable
Task Number: INC4159894

Login node and Open OnDemand access has been opened for the HPC cluster. The cluster is running with a single core Ethernet switch rather than the usual redundant pair.

One of the core Ethernet switches for the cluster is working. Queued jobs are running.

At this time the login nodes remain closed as we work with the network switch vendor to identify the cause of the partial switch failure yesterday morning. The partial nature of the failure prevented the redundant switch from taking over. There is concern that investigating the cause of yesterday's event could cause another disruption, which is why the login nodes currently remain closed.

Planned maintenance - High Performance Computing (HPC)

Started 2023-10-01 00:30:00, Duration 1 Day 7 Hours 30 Minutes
HPC Partner Storage Unavailable
Task Number: INC4144582

To finalize the move of HPC Partner Storage (/gpfs_backup) to its new location, it will be unavailable starting Sunday October 1 until the morning of Monday October 2.

Degradation - High Performance Computing (HPC)

Started 2023-09-28 17:00:00, Duration 19 Hours
HPC Hazel Globus Endpoint Unavailable
Task Number: INC4144582

Service accounts used by Globus were removed by an automated process. The process has been updated to ensure the service accounts are not impacted by future updates.

There are reports of issues reaching the Globus endpoint for the Hazel cluster. Staff are investigating.

Degradation - High Performance Computing (HPC)

Started 2023-07-04 09:30:00, Duration 2 Days 4 Hours
LSF Restart

A routine restart of LSF is taking considerably longer to complete than expected. 

An incident has been opened with IBM support to assist with resolution of this issue.

Following discussion with IBM support, configuration changes were applied to allow faster processing of the batch event history necessary for restoring the state of LSF. Based on progress processing the history, it appears it may complete before midnight Wednesday 5 July.

The rate of recovery slowed considerably, and only 24M of 55M lines of the event log had been read by morning. After additional work with IBM we are attempting to recover only the last 12M lines of the event log. Some job information may be lost due to this approach if a job has been queued or running for many weeks.

The shortened event log was loaded successfully and LSF b* commands were restored to operation about 13:30 on 6 July.

Planned maintenance - High Performance Computing (HPC)

Started 2023-02-17 17:00:00, Duration 1 Hour
LSF maintenance

About 5pm on Friday February 17, LSF will be temporarily unavailable as a patch is applied in an attempt to address a GPU scheduling issue.

Planned maintenance - High Performance Computing (HPC)

Started 2023-02-15 17:00:00, Duration 1 Hour
LSF Maintenance between 5-6pm Feb 15

IBM has recommended a patch be applied to LSF to potentially address an issue with scheduling GPU jobs. This will occur between 5-6pm on Feb 15. LSF will be unavailable for several minutes during the window while it restarts after the patch is applied.

Degradation - High Performance Computing (HPC)

Started 2023-02-03 17:30:00, Duration 14 Hours 45 Minutes
LSF communication issue
Task Number: INC4034859

A network configuration adjustment was made on the LSF manager node, and communication with compute nodes appears to be working normally again.

Currently compute nodes on the Hazel cluster are closed in LSF due to an internal LSF communication issue.

Jobs can be submitted, but will pend until compute nodes are open and available.

Running jobs have continued running.
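In a situation like this, users can see why a submitted job is pending and check which hosts are closed using standard LSF queries (a minimal sketch, assuming the usual LSF client tools):

    bjobs -p    # list your pending jobs with the reason each is pending
    bhosts      # show host status; hosts marked closed will not start new jobs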

Planned maintenance - High Performance Computing (HPC)

Started 2023-01-04 17:00:00, Duration 4 Days 15 Hours
HPC Service moving to new Linux Cluster

As of about 8am Jan 9th, Hazel is available for use.

There are a limited number of cores currently available. Additional cores will be added over the coming days and weeks. As partner nodes are integrated into the cluster, partner queues will return. There are currently no GPU nodes; as GPU nodes are moved and integrated, GPU queues will return.

The Henry2 HPC Linux cluster will be permanently retired at 5pm January 4th. The HPC service is expected to be available for use on the new Hazel HPC Linux cluster by 8am January 9th.

No HPC service will be available between 5pm Jan 4th and 8am Jan 9th.

Any jobs running at 5pm Jan 4th will be lost.

Data in the /home, /usr/local, /share, and /gpfs_archive file systems will be relocated from Henry2 to Hazel during the transition.

Degradation - High Performance Computing (HPC)

Started 2022-10-30 18:30:00, Duration 16 Hours
Group Issue
Task Number: INC3991558

Following the database move yesterday, the group information on the Henry2 cluster was not created correctly. Staff are working to correct the issue.

Group information from Saturday has been restored on the cluster. The normal update should occur at 6:30pm today.

Degradation - High Performance Computing (HPC)

Started 2022-10-15 18:30:00, Duration 1 Day 13 Hours 50 Minutes
Group File Issue
Task Number: INC3984718

The group file generated for the Henry2 cluster over the weekend was incorrect. Staff are working to build and sync a correct group file.

The corrected file has been synced to the cluster - group issues should be resolved.

Degradation - High Performance Computing (HPC)

Started 2022-09-23 12:00:00, Duration 5 Hours
/gpfs_share unavailable

The hardware error has been corrected and /gpfs_share is available again on the Henry2 cluster.

The storage array providing /gpfs_share is experiencing an issue, resulting in /gpfs_share being unavailable. Staff are investigating.

Planned maintenance - High Performance Computing (HPC)

Started 2021-05-02 21:00:00, Duration 2 Hours
network maintenance

During this maintenance, some network configuration changes to the HPC network will be applied, which might cause a very short network disruption. A disruption is not expected, but users should be aware of the possibility.

Planned maintenance - High Performance Computing (HPC)

Started 2021-03-21 09:00:00, Duration 3 Hours
HPC Web App Database Upgrade
Task Number: CHG0030634

On Sunday March 21st between 9am and noon, the database that supports the OIT Research Services web application will be upgraded. The web application will not be available during this time.