Current Status - High Performance Computing (HPC)
Planned Maintenance - High Performance Computing (HPC)
High Performance Computing (HPC) (Last 90 days)
System History - High Performance Computing (HPC)
Degradation - High Performance Computing (HPC)
Issue was related to a password change and has been resolved.
The OIT Research Services Web Application is reporting a database error. Staff are investigating.
Degradation - High Performance Computing (HPC)
New logins to the HPC cluster are hanging after printing the message of the day. This appears to be due to jobs overloading the GPFS file system. Staff are working to resolve the issue.
The jobs causing the overload were terminated, and access should be back to normal.
Planned maintenance - High Performance Computing (HPC)
The HPC file transfer node (servxfer) will be unavailable Saturday 26 October between 8am and noon for an OS update.
Planned maintenance - High Performance Computing (HPC)
Open OnDemand will be updated to a new version, which will add RStudio support. The existing server will be taken offline at 9am Wednesday 16 October, and the new server will be online by 10am. The URL for access will remain the same following the update.
Degradation - High Performance Computing (HPC)
ssh login to login.hpc.ncsu.edu has been restored.
ssh logins to login.hpc.ncsu.edu are currently failing; the issue is under investigation.
The CentOS 7.9 login node login01.hpc.ncsu.edu is available, and Open OnDemand also remains available for RHEL 9 logins: https://servood.hpc.ncsu.edu/
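For users affected by an event like this, a minimal workaround sketch (the hostnames and URL are taken from the notice above; the username is a placeholder for your own account):

    # Log in via the alternate CentOS 7.9 login node instead of login.hpc.ncsu.edu
    ssh your_username@login01.hpc.ncsu.edu

    # Or use Open OnDemand in a web browser for RHEL 9 access:
    #   https://servood.hpc.ncsu.edu/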
Planned maintenance - High Performance Computing (HPC)
The LSF manager node will be moved to a RHEL 9 host starting at 8am Thursday 25 July. There should be minimal impact, but there could be a delay in new jobs starting as state information is updated on the new LSF manager.
Degradation - High Performance Computing (HPC)
During a planned software update on a file server, which is normally non-disruptive, the research storage file systems became unavailable on HPC nodes running RHEL 9.2. These file systems have been remounted and appear to be working normally at this time.
Staff are investigating whether the software update caused the issue and, if so, why.
/rs1 and /rsstu are currently unavailable on RHEL 9.2 nodes in the HPC cluster.
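As a quick way to confirm the remount from a shell on an affected node, the paths named above can be checked with standard tools; a minimal sketch:

    # Report the file systems backing the research storage paths;
    # an error, or output showing the root file system instead,
    # would indicate the mounts are still missing.
    df -h /rs1 /rsstu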
Degradation - High Performance Computing (HPC)
There was a data center power issue this morning (Saturday) that caused 52 nodes in the Hazel cluster to temporarily lose power. Jobs that were running on those nodes at the time of the power outage were lost. The outage occurred at approximately 9:40am.
Planned maintenance - High Performance Computing (HPC)
From 8am Saturday March 16 through 5pm Sunday March 17, the Hazel HPC cluster will be unavailable to allow for replacement of its core Ethernet switches.
LSF will be upgraded to version 10.1.0.14 following the switch replacement.
Planned maintenance - High Performance Computing (HPC)
GPU nodes (except A30 nodes) have been closed in LSF to allow jobs to finish prior to network work scheduled for Thursday 15 February, which will add a top-of-rack Ethernet switch to aggregate GPU node connections to the core HPC switches.
Outage - High Performance Computing (HPC)
Working with IBM, the excess jobs in the gpu queue were removed from the system, and LSF has returned to normal operation.
An excessive number of jobs were submitted to LSF, which is causing the system to fail with the error message:
Failed in an LSF library call: Internal library error
A ticket has been opened with IBM. Currently, LSF commands including bsub, bjobs, and bkill fail with this error.
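For reference, an affected session would have looked roughly like this (the error text is quoted from the notice above; the session itself is illustrative, not a captured log):

    $ bjobs
    Failed in an LSF library call: Internal library error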
Degradation - High Performance Computing (HPC)
The SSL certificate for servood.hpc.ncsu.edu has been updated, and the Open OnDemand service has been restored.
The renewal certificate for servood.hpc.ncsu.edu has issues, and Open OnDemand is currently unavailable.
Degradation - High Performance Computing (HPC)
As of Dec 1, the redundant pair of core Ethernet switches has been restored.
A bug was identified in the switch software that caused the failure when the time changed from daylight to standard time.
Access to Hazel via login nodes and Open OnDemand has been restored.
However, the cluster is operating with a single core Ethernet switch rather than a redundant pair.
Outage - High Performance Computing (HPC)
Login node and Open OnDemand access has been opened for the HPC cluster. The cluster is running with a single core Ethernet switch rather than the usual redundant pair.
One of the core Ethernet switches for the cluster is working. Queued jobs are running.
At this time the login nodes remain closed as we work with the network switch vendor to identify the cause of the partial switch failure yesterday morning. The partial nature of the failure prevented the redundant switch from taking over. There is concern that investigating the cause of yesterday's event could cause another disruption, which is why the login nodes currently remain closed.
Planned maintenance - High Performance Computing (HPC)
To finalize the move of HPC Partner Storage (/gpfs_backup) to its new location, it will be unavailable from Sunday October 1 until the morning of Monday October 2.
Degradation - High Performance Computing (HPC)
Service accounts used by Globus were removed by an automated process. The process has been updated to ensure the service accounts are not impacted by future updates.
There are reports of issues reaching the Globus endpoint for the Hazel cluster. Staff are investigating.
Degradation - High Performance Computing (HPC)
A routine restart of LSF is taking considerably longer to complete than expected.
An incident has been opened with IBM support to assist with resolution of this issue.
Following discussion with IBM support, configuration changes were applied to allow faster processing of the batch event history necessary for restoring the state of LSF. Based on progress processing the history, it appears that it may complete before midnight Wednesday 5 July.
The rate of recovery slowed considerably, and only 24M of the 55M lines of the event log had been read by morning. After additional work with IBM, we are attempting to recover only the last 12M lines of the event log. With this approach, some job information may be lost for jobs that have been queued or running for many weeks.
The shortened event log was loaded successfully, and the LSF b* commands were restored to operation at about 13:30 on 6 July.
Planned maintenance - High Performance Computing (HPC)
At about 5pm on Friday February 17, LSF will be temporarily unavailable as a patch is applied in an attempt to address a GPU scheduling issue.
Planned maintenance - High Performance Computing (HPC)
IBM has recommended a patch be applied to LSF to potentially address an issue with scheduling GPU jobs. This will occur between 5pm and 6pm on February 15. LSF will be unavailable for several minutes during the window while it restarts after the patch is applied.
Degradation - High Performance Computing (HPC)
A network configuration adjustment was made on the LSF manager node, and communication with compute nodes appears to be working normally again.
Compute nodes on the Hazel cluster are currently closed in LSF due to an internal LSF communication issue.
Jobs can be submitted, but they will pend until compute nodes are open and available (see the example below).
Running jobs have continued running.
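A minimal sketch of the behavior described above, using standard LSF commands (the queue name and job script are hypothetical placeholders):

    # Submission is accepted even while compute nodes are closed...
    bsub -q example_queue ./myjob.sh   # example_queue and myjob.sh are placeholders

    # ...but the job will sit in PEND state (STAT column) until nodes reopen
    bjobs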
Planned maintenance - High Performance Computing (HPC)
As of about 8am Jan 9th, Hazel is available for use.
There is a limited number of cores currently available; additional cores will be added over the coming days and weeks. As partner nodes are integrated into the cluster, the partner queues will return. There are also currently no GPU nodes; as the GPU nodes are moved and integrated, the GPU queues will return.
The Henry2 HPC Linux Cluster will be permanently retired at 5pm January 4th. The HPC service is expected to be available on the new Hazel HPC Linux Cluster by 8am January 9th.
No HPC service will be available between 5pm Jan 4th and 8am Jan 9th.
Any jobs running at 5pm Jan 4th will be lost.
Data in /home, /usr/local, /share, and /gpfs_archive file systems will be relocated from Henry2 to Hazel during the transition.
Degradation - High Performance Computing (HPC)
Following the database move yesterday, the group information on the Henry2 cluster was not created correctly. Staff are working to correct the issue.
The group information from Saturday has been restored on the cluster. The normal update should occur at 6:30pm today.
Degradation - High Performance Computing (HPC)
The group file generated for the Henry2 cluster over the weekend was incorrect. Staff are working to build and sync a correct group file.
The corrected file has been synced to the cluster; the group issues should be resolved.
Degradation - High Performance Computing (HPC)
The hardware error has been corrected, and /gpfs_share is available again on the Henry2 cluster.
The storage array providing /gpfs_share is experiencing an issue, resulting in /gpfs_share being unavailable. Staff are investigating.
Planned maintenance - High Performance Computing (HPC)
During this maintenance, some configuration changes will be applied to the HPC network, which might cause a very brief network disruption. A disruption is not expected, but users should be aware of the possibility.
Planned maintenance - High Performance Computing (HPC)
On Sunday March 21st between 9am and noon, the database that supports the OIT Research Services web application will be upgraded. The web application will not be available during this time.