Job Submission Examples (LSF) | High Performance Computing

Introduction

Users submit jobs to the server using the bsub command. The current state of the queue on the server can be viewed using bjobs. A number of other utilities are also available, such as bkill, bmod, bstop, bmig and bresume. bsub can be used for batch as well as interactive submission of jobs. Interactive job submission should be used only when you need to run and debug your code, and for short-duration jobs.

Examples

The basic syntax for bsub is simply

bsub < batchfilename.bat

where batchfilename.bat is a file with shell commands that are to be executed. The first few lines of the batch file should contain BSUB directives (lines starting with #BSUB) that specify the resources that the job requires (e.g., number of nodes, number of processors, memory required, etc.).
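To show how bsub, bjobs and bkill fit together, here is a minimal sketch of a submit-and-check cycle. It assumes bsub prints its usual confirmation line of the form "Job <12345> is submitted to queue <normal>."; extract_job_id is a hypothetical helper name, not an LSF command, and the scheduler is only contacted if bsub and the batch file are actually present.

```shell
#!/bin/sh
# Hypothetical helper: read bsub's confirmation line on stdin and print
# only the numeric job ID found between the first pair of angle brackets.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Only talk to the scheduler when LSF and the batch file are available.
if command -v bsub >/dev/null 2>&1 && [ -f batchfilename.bat ]; then
    job_id=$(bsub < batchfilename.bat | extract_job_id)
    bjobs "$job_id"      # show the state of the job just submitted
    # bkill "$job_id"    # uncomment to cancel the job
fi
```

Keeping the job ID in a variable means you can pass it to bjobs or bkill later without copying it from the screen.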

A simple batch job example

Suppose you have an R program runme.R in your home directory that runs for a long time and that you would like to run on the cluster. It requires a single CPU core for, say, no more than 10 hours. Here's a batch file that would do the trick.

#BSUB -W 10:00
#BSUB -n 1
#BSUB -M 16000
#BSUB -e <some directory>/%J.err
#BSUB -o <some directory>/%J.out
module load R
cd ~
# execute program

R CMD BATCH runme.R

Here's a breakdown of what the lines in this batch file mean:

#BSUB -W 10:00 tells LSF that your job will require no more than 10 hours of walltime to complete. The time format is HH:MM. Some schedulers prioritize short jobs over long jobs, so the less time you ask for, the more likely it is that your job will be scheduled sooner rather than later. Should the actual job length exceed what you requested, your job will be killed. (This feature is currently not used in our implementation, but a default running time will likely be implemented at some point.)
#BSUB -n 1 asks LSF for one CPU core. This means that when your job starts you will have exclusive access to one CPU core. If you want something like 4 nodes, each with exactly 2 CPU cores (8 cores in total), use -n 8 together with -R "span[ptile=2]". If you just want any 8 cores in the cluster, request only -n 8.
#BSUB -M 16000 requests 16GB of memory for the job (LSF here measures memory in megabytes).
#BSUB -e ~/lsf_logs/%J.err (using ~/lsf_logs in place of <some directory>) tells LSF to store everything that would normally go to stderr in a file in your lsf_logs directory. The file name will contain the LSF job number (%J) and have the suffix .err. This lets you check whether there were any errors running your R program.
#BSUB -o ~/lsf_logs/%J.out tells LSF to redirect standard output to a .out file in the same lsf_logs directory, in the same way as the error file in the previous line.
Comment lines: the other lines in the sample script that begin with '#' are comments. The '#' for comments and BSUB directives must be in column one of your script file. The remaining lines in the sample script are executable commands.
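If you submit many similar jobs, it can help to generate the batch file rather than edit it by hand. The following is only a sketch: write_batch_file and its four arguments are hypothetical names, and the body simply reproduces the R example above with the walltime, core count, memory and log directory filled in.

```shell
#!/bin/sh
# Hypothetical helper: print a batch file for the R example to stdout.
# usage: write_batch_file WALLTIME CORES MEM_MB LOGDIR
write_batch_file() {
    cat <<EOF
#BSUB -W $1
#BSUB -n $2
#BSUB -M $3
#BSUB -e $4/%J.err
#BSUB -o $4/%J.out
module load R
cd ~
# execute program
R CMD BATCH runme.R
EOF
}

# Print the batch file for a 10-hour, 1-core, 16000 MB job.
write_batch_file 10:00 1 16000 "$HOME/lsf_logs"
```

Because bsub reads the batch file from standard input, the output can be redirected to a file for later use or piped straight into bsub.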

A parallel batch job example

Suppose now you have a parallel MPI job that needs 4 processors and you would like to have 2 processors on each of 2 nodes. Here's the corresponding batch file you can submit with bsub:

#BSUB -W 10:00
#BSUB -n 4
#BSUB -M 16000
#BSUB -R "span[ptile=2]"
#BSUB -e <some directory>/%J.err
#BSUB -o <some directory>/%J.out

module load mpich1/gnu
cd ~
# execute program

mpiexec -np 4 myprogramname

The line #BSUB -n 4 together with #BSUB -R "span[ptile=2]" requests 2 nodes with 2 processors per node. You could also have requested just -n 4 if you didn't care about their location.
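The relationship between -n and ptile is simple arithmetic: the number of hosts is the core count divided by the cores per host, rounded up. The sketch below works this out; hosts_needed is a hypothetical helper, and it assumes LSF places ptile cores on each host before moving on to the next one.

```shell
#!/bin/sh
# Hypothetical helper: hosts implied by "-n CORES" with "span[ptile=PTILE]",
# computed as ceil(CORES / PTILE) using integer arithmetic.
hosts_needed() {
    cores=$1
    ptile=$2
    echo $(( (cores + ptile - 1) / ptile ))
}

hosts_needed 4 2   # -n 4 with ptile=2: prints 2
hosts_needed 8 2   # -n 8 with ptile=2: prints 4
```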

To place all of the requested processors on a single host, add -R "span[hosts=1]", as in bsub -n 8 -R "span[hosts=1]". The following batch file requests 4 cores on one host:

#BSUB -W 10:00
#BSUB -n 4  # 4 cores
#BSUB -M 8000
#BSUB -R rusage[mem=2048] # ask for 2GB per job/core, or 8GB total
#BSUB -R "span[hosts=1]"
#BSUB -e <some directory>/%J.err
#BSUB -o <some directory>/%J.out
module load mpich1/gnu
cd ~
# execute program
mpiexec -np 4 myprogramname

Large-memory computing

LSF is configured to measure memory in megabytes. So, if you know that your job requires a lot of memory (say, 16GB), you can request it with the #BSUB -M 16000 directive.
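As a small illustration of the unit conversion, the sketch below turns a size in gigabytes into the matching directive. mem_directive is a hypothetical helper, and it assumes 1GB = 1000MB, which matches the 16GB to 16000 convention used on this page.

```shell
#!/bin/sh
# Hypothetical helper: print the -M directive for a memory size given in GB.
# Assumes 1GB = 1000MB, following the 16GB -> 16000 example in the text.
mem_directive() {
    echo "#BSUB -M $(( $1 * 1000 ))"
}

mem_directive 16   # prints: #BSUB -M 16000
mem_directive 32   # prints: #BSUB -M 32000
```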

An interactive job example

There are several types of interactive access when using LSF. If you just need interactive shell access, use bsub -Is <shell>. If you need to run an interactive batch job, use bsub -Is <script>. This is useful for small debugging or test runs. See the discussion of bsub -I in the man pages for additional information. You should use this only for short, interactive runs. If there are no nodes free, the bsub command will wait until they become available. This can be a long wait, even hours, depending on the mix of running and queued jobs. Please check that there are available nodes before issuing bsub -I; you can do this with the bslots command.

bsub -Is -n 2 -W 30 myscript

This requests interactive access to 2 processors for thirty minutes. Change the number of nodes and processors and the time to suit your needs.

An interactive session on a single node can be requested with:

bsub -W 2:00 -n 8 -M 32000 -R "span[hosts=1]" -Is bash

This requests interactive access to 8 processors for 2 hours with 32GB of RAM in a single host.
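The two commands above give the time limit in different forms: -W 30 for thirty minutes and -W 2:00 for two hours. If you prefer to think in minutes, a sketch like the following converts a minute count to the hours:minutes form. minutes_to_walltime is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert a number of minutes to H:MM for use with -W.
minutes_to_walltime() {
    printf '%d:%02d\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

minutes_to_walltime 30    # prints 0:30
minutes_to_walltime 120   # prints 2:00
minutes_to_walltime 600   # prints 10:00
```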

X-Forwarding in interactive jobs

LSF also provides X-Forwarding through a special -XF switch that you can use in an interactive job. Please note that the -XF switch always needs a -I switch. LSF will complain, correctly so, if your DISPLAY environment variable is not set properly for a job request that has the -XF switch. An example job request with X-Forwarding would look like this.

bsub -Is -XF -n 2 <shell>

bsub -Is -XF -n 2 <script>

Once nodes are allocated to you, you will receive a command prompt. Type ^C (control-c) or "exit" to exit the job.
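Since LSF rejects X-Forwarding requests when DISPLAY is not set, a quick check before submitting can save a round trip. This is a minimal sketch: display_is_set is a hypothetical helper, bash stands in for <shell>, and the job is only submitted if bsub is actually available.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DISPLAY is set to a non-empty value.
display_is_set() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set; log in with X forwarding enabled first" >&2
        return 1
    fi
}

# Submit the X-forwarded shell only when the check passes and LSF is present.
if display_is_set && command -v bsub >/dev/null 2>&1; then
    bsub -Is -XF -n 2 bash
fi
```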

GPU Jobs

To run a job on a GPU node you need to make sure you select a GPU queue and tell LSF how many GPUs you would like to use. For instance, your batch file will need the following two lines (use gpu-a100 instead of gpu-v100 for the A100 GPUs):

#BSUB -q gpu-v100

#BSUB -gpu "num=1"

A complete batch file would look like this:

#BSUB -W 10:00
#BSUB -n 4
#BSUB -M 64000
#BSUB -R "span[hosts=1]"
#BSUB -e %J.err
#BSUB -o %J.out
#BSUB -q gpu-v100
#BSUB -gpu "num=1"

# load modules
module load <module>

# execute program
my_program

An interactive session on a GPU node can be requested with:

bsub -q gpu-v100 -W 2:00 -n 8 -M 32000 -gpu "num=1" -R "span[hosts=1]" -Is bash

or, for the A100 GPUs:

bsub -q gpu-a100 -W 2:00 -n 8 -M 32000 -gpu "num=1" -R "span[hosts=1]" -Is bash

To request two GPUs on a single host instead:

bsub -q gpu-v100 -gpu "num=2" -R "span[hosts=1]" -n 8 -W 2:00 -M 16000 -Is bash
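Once a GPU job starts, it can be useful to confirm how many GPUs were actually handed to it. The sketch below counts the entries in CUDA_VISIBLE_DEVICES. It assumes the cluster's LSF configuration exports that variable into GPU jobs, which is common but not stated on this page; count_visible_gpus is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: count the comma-separated entries in
# CUDA_VISIBLE_DEVICES, printing 0 when the variable is unset or empty.
# Assumption: the scheduler sets CUDA_VISIBLE_DEVICES inside GPU jobs.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
    fi
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Placing this near the top of a batch file puts the count in the job's .out file, so a wrong -gpu request is easy to spot.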

Other Notes:

The default wallclock time for jobs is 1 hour, and this includes jobs submitted using bsub -I. When you use bsub -I you hold your processors whether you compute or not. Thus, as soon as you are done with your job commands you should type ^D to end your interactive job. If you submit an interactive job and do not specify a wallclock time, you will hold your processors for 1 hour or until you type ^D.