Submitted by wei2dt on
Quota Query Tool
Following last month's change to the scratch quota usage limits, it became apparent that the standard df/du commands in Linux cannot communicate the more complex scratch quota structure. A new tool has been added to the HPC cluster that lets you query current scratch usage in real time and see details on base quota usage, burst usage, and grace periods. From the login nodes, simply issue the scratchusage command to query your current utilization. If you see "Base quota is not currently in violation", any previous grace periods are no longer in effect and you are free to burst above the base limit for up to 7 days.
For usage under the 100GB base quota:
bash-4.2$ scratchusage
Scratch Usage: 0GB
Hard Limit: 5120GB
100GB Base quota is not currently in violation.
For usage over the 100GB base quota:
bash-4.2$ scratchusage
Scratch Usage: 108GB
Hard Limit: 5120GB
Base quota of 100GB has been exceeded:
 - Base quota was last exceeded on: Tue Aug 17 09:03:15 EDT 2021
 - Scratch locked for further writes on: Tue Aug 24 09:03:15 EDT 2021
**FYI: You must reduce usage on your scratch volume below 100GB for 12 hours to reset your usage clock.

Details on the policies and workings of the scratch volume can be found at: https://hpc.research.cchmc.org/node/14
Effective Use of a Compute Node
When submitting a job to the cluster, the most effective approach is to use cores on a single compute host. This is "shared memory mode": the RAM is shared evenly across all requested cores, and there is no network delay when sharing data between cores.
For example, a job's core allocation may look like this:
16*bmi-200m5-10: all 16 cores are on a single compute host, and communication between cores is as efficient as possible because it all occurs within a single node.
4*bmi-200m5-01,4*bmi-200m5-02,4*bmi-200m5-03: 4 cores are requested on each of three separate compute hosts, so inter-core communication incurs a network delay. Some programs will also ignore the cores on the last two hosts, in which case processing runs on 4 processors instead of 16.
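An allocation split across three hosts like this can arise from a request that limits how many cores land on each host. As a sketch only, a submission along these lines could produce it (the core count, ptile value, and application name are illustrative assumptions, not from the original article):

```shell
# Request 12 cores, at most 4 per host: LSF will spread the job
# over 3 compute hosts (12 / 4 = 3).
# "my_mpi_app" is a placeholder for your own program.
bsub -n 12 -R "span[ptile=4]" ./my_mpi_app
```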
The best practice to ensure you are requesting the right mode is to use the -R "span" flag on the bsub command line or in a #BSUB directive. For example:
-R "span[hosts=1]"
This will ensure all requested cores land on one single node for the most efficient processing. Please note, the host spanning specification cannot exceed the size of the largest node on the cluster or the job will not dispatch. To view node sizes on the cluster, review the "MAX" column of the bhosts command output from a login node. Should you require more cores than any single node provides, set -R "span[hosts=?]" to a value that spreads the job over the fewest number of hosts.
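Putting the pieces above together, a complete single-node job script might look like the following sketch (the job name, output file, core count, and application name are placeholders for illustration):

```shell
#!/bin/bash
#BSUB -J smp_example          # job name (placeholder)
#BSUB -n 16                   # request 16 cores
#BSUB -R "span[hosts=1]"      # keep all 16 cores on one compute host
#BSUB -o smp_example.%J.out   # output file; %J expands to the job ID

# Run a threaded (shared memory) application on all 16 cores.
# "my_threaded_app" is a placeholder for your own program.
./my_threaded_app --threads 16
```

Submit the script from a login node with: bsub < jobscript.sh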
Reminder! Password Change and the RDS share
When you change your CCHMC password, also update your credentials on the research cluster; otherwise, repeated attempts to use your previous password can cause your account to become locked. To avoid this issue, simply run the "rds credentials" command from a login node to refresh your stored credentials, then disconnect and reconnect the RDS share.
Job Submission Delays
The cluster team is aware of an intermittent issue where job submissions may take an extended amount of time to be accepted and scheduled by the system. We are collecting data and working with IBM on identifying where this issue may stem from. We will provide additional updates as they are available.
Monthly Newsletters Archived on hpc.research.cchmc.org
These newsletters will be archived on the hpc.research.cchmc.org support site so that folks who may have missed previous mailings can review them. To review past topics, just point your browser to https://hpc.research.cchmc.org/node/57
