Policies

The High Performance Computing (HPC) cluster maintained by the Division of Biomedical Informatics is for authorized users only. If you have any questions regarding usage or would like to submit feedback, please email the BMI Help Desk.

Also, please make sure you are subscribed to the bmi-cluster-users mailing list so that you receive important announcements regarding downtimes, policy changes, etc. You can subscribe yourself by browsing the following web page: http://mailman.cchmc.org/mailman/listinfo.

By using this cluster, you acknowledge that you understand and accept the policies and scheduling limits outlined below.

  • Cluster walltime quota
  1. Every user in the cluster receives a default walltime quota of 10,000 hours per quarter.
  2. Any additional walltime hours should be requested ahead of time (by emailing help@bmi.cchmc.org). They are charged at 1¢/hour, and purchased walltime can be used at any time.
  3. All compute nodes and queues are currently charged at the same rate, though this may change in the future as QoS requirements vary.
  4. The Gold allocation system is integrated with LSF, so a user's jobs will go to the PENDING (queued) state once they have exhausted their walltime allocation for the current quarter on the project they are charging against.
  5. The system sends hourly reports on PENDING jobs to the BMI RT ticket system.
  6. Documentation on how to use Gold to check your current walltime usage (among other things) is available here.
  • We have implemented a set of scripts for some commonly-used interactive applications, which will transparently perform job submission.
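As a sketch of the Gold usage check mentioned above (assuming the standard Gold client is on your PATH; `gbalance` is Gold's balance-query command, but your site's installation and project names may differ):

```shell
# Query your remaining walltime allocation with the Gold client.
# GOLD_BIN is empty when the client is not installed (e.g. off-cluster).
GOLD_BIN=$(command -v gbalance || true)
if [ -n "$GOLD_BIN" ]; then
    "$GOLD_BIN" -u "$USER"      # per-user balance across projects
else
    echo "gbalance not found; run this on a cluster login node"
fi
```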

Current Limits

  • Currently, only users who have a job running on a node can gain SSH access to that node. This access policy prevents users from submitting jobs outside of the batch controller. It necessitates the following change for you as a user:
    • When you submit a job, make sure that within the job, all temporary files are copied back to your home directory or another location you can access after the job completes.
  • All users have a default limit of 125 CPUs. If your currently running jobs account for 125 CPUs, any new job you submit will go into a queued (Q) state and will not be scheduled until one or more of your current jobs complete. For example, if you have five parallel jobs running in the cluster, each with 25 CPUs, all 125 of your CPUs are allocated; a serial job requesting one more CPU will not run until one of the parallel jobs completes.
    • Reasoning: this policy prevents a single user from flooding the entire cluster with jobs. Even though 125 CPUs is the limit, please be considerate of others and break your work into smaller (CPU-time) jobs.
    • If you plan to use more than 125 CPUs at one time, please contact us to make a reservation.
    • If you want to run jobs on more than 125 CPUs (subject to availability) without going through a reservation policy, follow the procedure below.
    • First off, know that the following condition applies to these additional jobs:
    • If you are over your 125 CPU limit AND the cluster is full, AND another user submits jobs within their own quota, the scheduler will stop and requeue your over-quota jobs to make room. Now that you know the policy, please read on to understand how to submit additional jobs over your quota.
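Because SSH access to a node ends when your job does, any staging of results off node-local storage has to happen inside the job script itself. A minimal sketch (the scratch path, output directory, and the stand-in "computation" are all assumptions; adapt them to your job):

```shell
#!/bin/bash
#BSUB -n 1                     # one CPU slot
#BSUB -o stage_demo.%J.out     # capture stdout under the job ID
# Work in node-local scratch, then copy results back to shared storage
# before the job exits -- you cannot SSH in afterwards to retrieve them.
SCRATCH="/tmp/stage_demo_$$"       # hypothetical node-local scratch dir
mkdir -p "$SCRATCH"
cd "$SCRATCH"
echo "demo result" > results.dat   # stand-in for your real computation
mkdir -p "$HOME/results"
cp results.dat "$HOME/results/"    # stage back before the job completes
```

Submit a script like this with `bsub < myjob.sh`; the `#BSUB` lines are read as submission options.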

Queue policies

  • By default, all user job submissions go to the "normal" queue.
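To target a queue explicitly rather than rely on the default, pass `-q` to `bsub`. The command is echoed here rather than executed so the sketch also works off-cluster; drop the `echo` to submit for real (`./my_app` is a placeholder):

```shell
# Explicitly submit to the default "normal" queue (dry run via echo).
echo bsub -q normal -n 1 ./my_app
```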

Node/Resource requests

  • By default, every job is assigned a resource requirement of 384 MB of memory and one CPU hour. If your job exceeds these limits and you did not explicitly request different values, it may be terminated (some exceptions apply).
  • Our nodes have heterogeneous resources (some CPUs are faster than others, and memory ranges from 4 GB to 32 GB per node). This has a direct effect on how resources are scheduled to jobs.
  • If all you want is 4 CPUs anywhere in the cluster, and it does not matter how they are allocated (all in one node, 1 CPU each on 4 nodes, 2 each on 2 nodes, etc.), use

user@bmiclusterp:~> bsub -n 4

  • If you need a specific number of CPUs per node, specify it like this:

user@bmiclusterp:~> bsub -n 8 -R "span[ptile=2]"

In this case, you will get a total of 8 CPUs, 2 per node across 4 nodes.

user@bmiclusterp:~> bsub -n 8 -R "span[hosts=1]"

In this case, you will get a total of 8 CPUs, all 8 on a single node.
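The 384 MB / one-CPU-hour defaults above can be raised at submission time with `-W` and `-R "rusage[mem=...]"`. A dry-run sketch (the `echo` prints the command instead of submitting it; the values and `./my_app` are illustrative, and the memory unit depends on the cluster's LSF configuration, commonly MB):

```shell
# Request 4 CPUs, 2048 MB of memory, and a 4-hour wall-clock limit
# instead of the 384 MB / 1 CPU-hour defaults (dry run via echo).
echo bsub -n 4 -W 4:00 -R "rusage[mem=2048]" ./my_app
```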

Time-shared node

  • Justification

This should be used when all of the following are true.

  • You want to run a quick job for a project.
  • You find that the cluster is fully allocated (by both running jobs and reservations).
  • You know your job would take only about 10 minutes to complete, and it is not feasible to wait a long time to run such a short job.
  • You are prepared to run the job on an overcommitted node (you may be competing for resources with other users doing the same).

Solution

We provide one 8-core node that is overcommitted to the scheduler as having 24 cores.

Usage

  • No more than 2 CPUs can be requested per job on the time-shared node.
  • A job's walltime cannot exceed 2 days.
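A submission against the time-shared node might look like the following, staying within both limits. The host name `timeshare01` is an assumption (ask the help desk for the real name), and the command is echoed rather than executed so the sketch works off-cluster:

```shell
# Request at most 2 slots and at most 48 hours on the time-shared host
# (dry run via echo; "timeshare01" is a hypothetical host name).
echo bsub -n 2 -W 48:00 -m timeshare01 ./my_app
```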