Docker Support
Docker support has been added to the HPC production cluster, with four nodes ready to handle Docker jobs.
To use Docker on the cluster, add "#BSUB -q docker" to your script or "-q docker" to your bsub command.
Pulled images are only kept for a week, so it is recommended to pull the image every time you run a container.
For example, to request a Docker job with 16GB of memory and 4 cores for 1 hour in a script:

#BSUB -M 16000
#BSUB -W 1:00
#BSUB -n 4
#BSUB -q docker

docker pull hello-world
docker run hello-world
or using the bsub command (the command is quoted so that both docker invocations run inside the job rather than the second one running on the login node):

bsub -M 16000 -W 1:00 -n 4 -q docker "docker pull hello-world && docker run hello-world"
LibreOffice
LibreOffice has been installed on the login nodes for all users. It is an open source office suite that can replace Microsoft Office; Word, Excel, and PowerPoint documents are fully supported. For a better experience, it is recommended to use a graphical session such as Citrix (https://connect.research.cchmc.org/vpn/index.html).
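As a minimal sketch, assuming LibreOffice's standard soffice command is on the PATH of the login nodes (the file names below are placeholders), documents can be opened or converted from a shell:

# Open a Word document in LibreOffice Writer (requires a graphical session)
soffice --writer report.docx &

# Convert a spreadsheet to PDF without opening the GUI
soffice --headless --convert-to pdf data.xlsx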
Scratch Space
Traditionally, scratch space for the cluster (/scratch/<username>) has been provided as a shared volume without per-user limits.
While this model has worked well in the past, as scratch needs have expanded we have encountered issues where a single individual's usage can fill this volume and cause failures for other cluster users. In an effort to remediate cross-user impact, the HPC cluster team will be introducing additional controls around the use of the /scratch/<username> space:
- Individual user folders (/scratch/<username>) will have a default 100GB limit applied for space usage.
- Individual user folders will allow you to bypass the 100GB limit and burst up to 5TB of usage for up to 7 days.
- After 7 days above the 100GB limit, writes will be blocked on the individual user folder.
- The burst limit clock will reset once your usage drops below the 100GB limit for 12 hours.
- The 60 day cleanup policy still remains in effect for files in your scratch space.
As has always been the case, the /scratch space should only be used for temporary files during processing; any longer-term files should be kept in a user's home directory or in dedicated project drives hosted on enterprise storage. If you feel that this new model will not meet the requirements of your workflow, please reach out to the HPC cluster team at help-cluster@bmi.cchmc.org with details of your needs.
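As a minimal sketch, assuming your scratch folder lives at /scratch/$USER and standard Linux tools are available on the login nodes, you can check your current usage and spot files approaching the 60-day cleanup window:

# Report total space used by your scratch folder
du -sh /scratch/$USER

# List files not modified in the last 60 days (subject to the cleanup policy)
find /scratch/$USER -type f -mtime +60 -ls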
This change is scheduled to be put in place on Monday, August 16th.
Memory and CPU limits for Login Nodes
The nodes bmiclusterp2 (bmiclusterp) and bmiclusterp3 serve as the central login point for all users to submit jobs to the cluster. It is important that users not run high-CPU/memory processes on these nodes, as doing so affects the operations of other users. To prevent such disruptions, a script checks for such processes, terminates them, and notifies the owner of the process. The current thresholds are 75% of one CPU and 10% of total system memory (currently 10GB of RAM).
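As an illustrative sketch only (not the actual monitoring script), a check along these lines could flag processes that exceed the thresholds using standard Linux tools:

#!/bin/bash
# Illustrative: flag processes above roughly 75% of one CPU or 10% of total memory.
ps -eo pid,user,%cpu,%mem,comm --no-headers | awk '
    $3 > 75 || $4 > 10 {
        printf "PID %s (%s, owned by %s): %.1f%% CPU, %.1f%% MEM\n", $1, $5, $2, $3, $4
    }'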
Job Submission Issues
On July 17, there were several reports of users receiving permission denied errors when submitting jobs to the cluster. After services were restarted, the issue appeared to be resolved. However, similar behavior was encountered the next morning and additional investigation was performed. The issue was finally tracked down to a misconfiguration of the network interface on bmiclusterp, which was triggered by a system patch cycle. The configuration change has been made and added to the checklist for future reboots/system changes. We apologize for the inconvenience and appreciate your patience throughout.