January 2022

GPU-Accelerated AI/ML Now Available

A new queue and node type is available on the HPC cluster for AI/ML training workloads. Each node in the queue has four NVIDIA A100 Tensor Core GPUs (80GB each), 128 cores, and a 1TB NVMe tmp drive, providing an optimal platform for running training models. These systems are highly specialized and run a different OS version, so they cannot use the existing cluster modules; as a result, access to the queue is by request. Please submit a ticket to help-cluster@bmi.cchmc.org for access to the queue (amdgpu) and to collaborate in preparing the environment for your GPU-accelerated training needs.

Instructions for using the queue:

For interactive jobs:

bsub -q amdgpu -W 2:00 -n 8 -M 32000 -gpu "num=x" -Is bash
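In this example, -W sets the wall-clock limit, -n the number of cores, -M the memory request, and -Is requests an interactive shell; replace x in "num=x" with the number of GPUs you need.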

For batch scripts:

#BSUB -q amdgpu

#BSUB -gpu "num=x"

Note: if more GPUs are required, "num=x" can be set to any value up to 4.
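For reference, a complete submission script might look like the sketch below; the job name, wall time, core/memory values, and training command are placeholders, so adjust them to your workload.

#!/bin/bash
# Example LSF script for the amdgpu queue (all resource values are illustrative)
#BSUB -J gpu_training
#BSUB -q amdgpu
#BSUB -W 8:00
#BSUB -n 8
#BSUB -M 64000
#BSUB -gpu "num=2"

# Placeholder for your actual training command
python train_model.py

The script can then be submitted with "bsub < myscript.sh".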

In addition to this new class of GPU nodes, please also leverage the existing gpu-v100 queue for inferencing or smaller GPU-accelerated workloads; it is open for general submission without consultation.
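As a rough sketch (the resource values here are illustrative and assume the same submission syntax shown above), an interactive session on that queue could be requested with:

bsub -q gpu-v100 -W 2:00 -n 4 -M 16000 -gpu "num=1" -Is bash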

Proxyless Data Uploads/Downloads 

Successful and streamlined data transfers in the HPC have been a challenge due to the complexity and authentication requirements of the proxy infrastructure. Working with the CCHMC security team, we have devised a solution that allows streamlined data collaboration over common protocols without the need to manage proxy configurations. The change is scheduled for the morning of February 2nd, after which all HPC users will be able to take advantage of proxyless data transfers.

To use direct access over the protocols listed below, the target location must be in the HPC whitelist before you can establish a connection. To check whether a site is in the whitelist, use the command "iswhitelisted <url>" (this command will only be accurate starting February 2nd, once the changes are in place). If the site you are trying to access is not in the whitelist, please submit a request to help-cluster@bmi.cchmc.org with the site you would like to access.
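For example, once the change is in place, checking a site and then pulling data directly might look like the following; the URL is purely illustrative, so substitute your own target.

# Confirm the target site is in the whitelist (accurate starting Feb 2nd)
iswhitelisted https://ftp.ncbi.nlm.nih.gov

# If the site is whitelisted, the transfer can go direct over HTTPS with no proxy settings
wget https://ftp.ncbi.nlm.nih.gov/example/dataset.tar.gz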

The HPC team has pre-populated the whitelist with all sites used via the proxy in the last 6 months and has verified that the most popular collaboration sites (nih.gov, osc.edu, etc.) are included.

The following common data exchange protocols will be opened for proxyless access:

  • SFTP (22)
  • FTP (21 & 20)
  • HTTP (80)
  • HTTPS (443)
  • Aspera (3001-30020) (UDP & TCP)

*Note: Existing proxy-based access will remain in place and can continue to be used into the future without having to whitelist the site.

Reduced Cluster Core Count Continues

The HPC cluster continues to operate at a reduced core count since degraded performance related to storage contention appeared at the beginning of December. We made this tradeoff to keep data access consistent and reliable both in and out of the cluster, accepting a lower core count in exchange. We continue to monitor the submission queues and adjust node availability as needed to ensure job submission times remain within an acceptable range.

Over the last year, the workloads being run in the HPC have outpaced the performance characteristics of the network-attached storage (bmiisi). While the cluster core count has doubled, core speeds have increased, and GPU-capable computation has been added, the storage infrastructure has not seen a similar increase in performance. Most of our focus has been on providing economical capacity tiers to hold the ever larger datasets that need to be maintained for the research community here at CCHMC.

For the next 6 months the HPC team will continue to monitor and tune the environment within these existing constraints to provide the maximum throughput to the community. Starting next financial year (July 2022), we expect a significant investment in the storage infrastructure that will accelerate existing capabilities and support next-generation computation and AI. We appreciate your continued understanding.