System Status

Current Status - High Performance Computing (HPC)

This service is not reporting an issue.

Planned Maintenance - High Performance Computing (HPC)

No system maintenance is planned over the next 30 days.

High Performance Computing (HPC) (Last 90 days)

Color bar indicators flow left to right, from oldest to most recent status.

System History - High Performance Computing (HPC)

From most recent to oldest, this list shows all outages, degradations, and planned maintenance for this service.

Degradation - High Performance Computing (HPC)

Started 2024-12-06 11:00:00, Duration 1 Hour 45 Minutes
OIT Research Services Web Application
Task Number: INC4364939

The issue was related to a password change and has been resolved.

The OIT Research Services Web Application is reporting a database error - staff are investigating.

Degradation - High Performance Computing (HPC)

Started 2024-11-20 00:00:28, Duration 9 Hours 14 Minutes
HPC Cluster Overloaded
Task Number: INC4357219

New logins to the HPC cluster are hanging after printing the message of the day. This appears to be due to jobs overloading the GPFS file system. Staff are working to resolve the issue.

Jobs causing the overload were terminated - access should be back to normal.

Planned maintenance - High Performance Computing (HPC)

Started 2024-10-26 08:00:00, Duration 4 Hours
File transfer node - servxfer - update

The HPC file transfer node (servxfer) will be unavailable Saturday 26 October between 8am and noon for an OS update. 

Planned maintenance - High Performance Computing (HPC)

Started 2024-10-16 09:00:00, Duration 1 Hour
Open OnDemand Update

Open OnDemand will be updated to a new version that adds RStudio support. The existing server will be taken offline at 9am Wednesday 16 October and the new server will be online by 10am. The URL for access will remain the same following the update.

Degradation - High Performance Computing (HPC)

Started 2024-07-27 18:30:00, Duration 1 Day 17 Hours
HPC login issue

ssh login to login.hpc.ncsu.edu has been restored.

ssh logins to login.hpc.ncsu.edu are currently failing - the issue is under investigation.

The CentOS 7.9 login node (login01.hpc.ncsu.edu) is available, and Open OnDemand also remains available for RHEL 9 logins - https://servood.hpc.ncsu.edu/
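For example, a user could reach the cluster through the alternate login node while the primary address was down (a minimal sketch; unityid is a placeholder for your own username):

    ssh unityid@login01.hpc.ncsu.edu

Open OnDemand sessions were unaffected and remained reachable at the URL above.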

Planned maintenance - High Performance Computing (HPC)

Started 2024-07-25 08:00:00, Duration 2 Hours
LSF Maintenance

The LSF manager node will be moved to a RHEL 9 host starting at 8am Thursday 25 July. There should be minimal impact, but there could be a delay in new jobs starting as state information is updated on the new LSF manager.

Degradation - High Performance Computing (HPC)

Started 2024-07-11 15:00:00, Duration 3 Hours
Research Storage not available
Task Number: INC4273115

During a planned software update on a file server - an update that is normally non-disruptive - the research storage file systems became unavailable on HPC nodes running RHEL 9.2. These file systems have been remounted and appear to be working normally at this time.

Staff are investigating whether this software update caused the issue and, if so, why.

/rs1 and /rsstu are currently unavailable on RHEL 9.2 nodes on the HPC cluster.
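As a quick check after such a remount, a user on an affected node could confirm both file systems are visible again (a minimal sketch using standard tools; not part of the original notice):

    df -h /rs1 /rsstu    # each path should report a mounted file system rather than an error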

Degradation - High Performance Computing (HPC)

Started 2024-03-23 09:40:00, Duration 20 Minutes
Data Center Power Issue

There was a data center power issue this morning (Saturday) that caused 52 nodes in the Hazel cluster to temporarily lose power. Jobs that were running on those nodes at the time of the power outage were lost. The outage occurred at approximately 9:40am.

Planned maintenance - High Performance Computing (HPC)

Started 2024-03-16 08:00:00, Duration 1 Day 9 Hours
Ethernet Switch Replacement

From 8am Saturday March 16 through 5pm Sunday March 17, the HPC cluster Hazel will be unavailable to allow for replacement of its core Ethernet switches.

LSF will be upgraded to version 10.1.0.14 following the switch replacement.

Planned maintenance - High Performance Computing (HPC)

Started 2024-02-12 17:00:00, Duration 3 Days 19 Hours
GPU node network change
Task Number: INC4200795

GPU nodes (except for A30 nodes) have been closed in LSF to allow jobs to finish prior to network work scheduled for Thursday 15 February, which will add a top-of-rack Ethernet switch to aggregate GPU node connections to the core HPC switches.

Outage - High Performance Computing (HPC)

Started 2024-02-04 20:00:00, Duration 12 Hours 30 Minutes
LSF issue
Task Number: INC4195251

Working with IBM, staff removed the huge number of jobs in the gpu queue and LSF has returned to normal operation.

An excessive number of jobs were submitted to LSF, causing the system to fail with the error message:

Failed in an LSF library call: Internal library error

A ticket has been opened with IBM. Currently LSF commands including bsub, bjobs, and bkill fail with this error.
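Once LSF returned to normal, any lightweight scheduler query was enough to confirm the commands were answering again rather than returning the library error (a minimal sketch; any of the affected b* commands would do):

    bjobs -u all | head    # should print a job listing instead of the internal library error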

Degradation - High Performance Computing (HPC)

Started 2023-12-16 00:00:01, Duration 12 Hours 44 Minutes
Open OnDemand unavailable
Task Number: INC4175797

The SSL cert for servood.hpc.ncsu.edu has been updated and the Open OnDemand service has been restored.

The renewal cert for servood.hpc.ncsu.edu has issues and Open OnDemand is currently unavailable.
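One way to confirm which certificate servood.hpc.ncsu.edu is serving, and whether the renewal took effect, is a standard openssl check (a minimal sketch; not part of the original notice):

    echo | openssl s_client -connect servood.hpc.ncsu.edu:443 -servername servood.hpc.ncsu.edu 2>/dev/null | openssl x509 -noout -dates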

Degradation - High Performance Computing (HPC)

Started 2023-11-06 17:00:00, Duration 24 Days 23 Hours
Hazel open with one core Ethernet switch

As of Dec 1, the redundant pair of core Ethernet switches has been restored.

A bug was identified in the switch software that caused the failure when the time changed from daylight to standard time.

Access to Hazel via login nodes and Open OnDemand has been restored.

However, the cluster is operating with a single core Ethernet switch rather than a redundant pair.

Outage - High Performance Computing (HPC)

Started 2023-11-05 05:50:00, Duration 1 Day 11 Hours 10 Minutes
Login nodes not reachable
Task Number: INC4159894

Login node and Open OnDemand access has been opened for the HPC cluster. The cluster is running with a single core Ethernet switch rather than the usual redundant pair.

One of the core Ethernet switches for the cluster is working. Queued jobs are running.

At this time the login nodes remain closed as we work with the network switch vendor to identify the cause of the partial switch failure yesterday morning. The partial nature of the failure prevented the redundant switch from taking over. There is concern that investigating the cause of yesterday's event could cause another disruption, which is why the login nodes currently remain closed.

Planned maintenance - High Performance Computing (HPC)

Started 2023-10-01 00:30:00, Duration 1 Day 7 Hours 30 Minutes
HPC Partner Storage Unavailable
Task Number: INC4144582

To finalize the move of HPC Partner Storage (/gpfs_backup) to its new location, it will be unavailable starting Sunday October 1 until the morning of Monday October 2.

Degradation - High Performance Computing (HPC)

Started 2023-09-28 17:00:00, Duration 19 Hours
HPC Hazel Globus Endpoint Unavailable
Task Number: INC4144582

Service accounts used by Globus were removed by an automated process. The process has been updated to ensure the service accounts are not impacted by future updates.

There are reports of issues reaching the Globus endpoint for the Hazel cluster. Staff are investigating.

Degradation - High Performance Computing (HPC)

Started 2023-07-04 09:30:00, Duration 2 Days 4 Hours
LSF Restart

A routine restart of LSF is taking considerably longer to complete than expected. 

An incident has been opened with IBM support to assist with resolution of this issue.

Following discussion with IBM support, configuration changes were applied to allow faster processing of the batch event history necessary for restoring the state of LSF. Based on progress processing the history, it appears it may complete before midnight Wednesday 5 July.

The rate of recovery slowed considerably, and only 24M of 55M lines of the event log had been read by morning. After additional work with IBM we are attempting to recover only the last 12M lines of the event log. Some job information may be lost due to this approach if a job has been queued or running for many weeks.

The shortened event log was loaded successfully and LSF b* commands were restored to operation about 13:30 on 6 July.

Planned maintenance - High Performance Computing (HPC)

Started 2023-02-17 17:00:00, Duration 1 Hour
LSF maintenance

About 5pm on Friday February 17, LSF will be temporarily unavailable as a patch is applied in an attempt to address a GPU scheduling issue.

Planned maintenance - High Performance Computing (HPC)

Started 2023-02-15 17:00:00, Duration 1 Hour
LSF Maintenance between 5-6pm Feb 15

IBM has recommended a patch be applied to LSF to potentially address an issue with scheduling GPU jobs. This will occur between 5-6pm on Feb 15. LSF will be unavailable for several minutes during the window while it restarts after the patch is applied.

Degradation - High Performance Computing (HPC)

Started 2023-02-03 17:30:00, Duration 14 Hours 45 Minutes
LSF communication issue
Task Number: INC4034859

A network configuration adjustment was made on the LSF manager node, and communication with compute nodes appears to be working normally again.

Currently compute nodes on the Hazel cluster are closed in LSF due to an internal LSF communication issue.

Jobs can be submitted, but will pend until compute nodes are open and available.

Running jobs have continued running.
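In a situation like this, users can see why a submitted job is pending and check which hosts are closed using standard LSF queries (a minimal sketch, assuming the usual LSF client tools):

    bjobs -p    # list your pending jobs with the reason each is pending
    bhosts      # show host status; hosts marked closed will not start new jobs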

Planned maintenance - High Performance Computing (HPC)

Started 2023-01-04 17:00:00, Duration 4 Days 15 Hours
HPC Service moving to new Linux Cluster

As of about 8am Jan 9th, Hazel is available for use.

There are a limited number of cores currently available. Additional cores will be added over the coming days and weeks. As partner nodes are integrated into the cluster, partner queues will return. There are currently no GPU nodes; as GPU nodes are moved and integrated, GPU queues will return.

The Henry2 HPC Linux cluster will be permanently retired at 5pm January 4th. The HPC service is expected to be available for use on the new Hazel HPC Linux cluster by 8am January 9th.

No HPC service will be available between 5pm Jan 4th and 8am Jan 9th.

Any jobs running at 5pm Jan 4th will be lost.

Data in the /home, /usr/local, /share, and /gpfs_archive file systems will be relocated from Henry2 to Hazel during the transition.

Degradation - High Performance Computing (HPC)

Started 2022-10-30 18:30:00, Duration 16 Hours
Group Issue
Task Number: INC3991558

Following the database move yesterday, the group information on the Henry2 cluster was not created correctly. Staff are working to correct the issue.

Group information from Saturday has been restored on the cluster. The normal update should occur at 6:30pm today.

Degradation - High Performance Computing (HPC)

Started 2022-10-15 18:30:00, Duration 1 Day 13 Hours 50 Minutes
Group File Issue
Task Number: INC3984718

The group file generated for the Henry2 cluster over the weekend was incorrect. Staff are working to build and sync a correct group file.

The corrected file has been synced to the cluster - group issues should be resolved.

Degradation - High Performance Computing (HPC)

Started 2022-09-23 12:00:00, Duration 5 Hours
/gpfs_share unavailable

The hardware error has been corrected and /gpfs_share is available again on the Henry2 cluster.

The storage array providing /gpfs_share is experiencing an issue, resulting in /gpfs_share being unavailable. Staff are investigating.

Planned maintenance - High Performance Computing (HPC)

Started 2021-05-02 21:00:00, Duration 2 Hours
network maintenance

During this maintenance, some network configuration changes to the HPC network will be applied, which might cause a very short network disruption. A disruption is not expected, but users should be aware of the possibility.

Planned maintenance - High Performance Computing (HPC)

Started 2021-03-21 09:00:00, Duration 3 Hours
HPC Web App Database Upgrade
Task Number: CHG0030634

On Sunday March 21st between 9am and noon, the database that supports the OIT Research Services web application will be upgraded. The web application will not be available during this time.