
Batch Jobs - bwForCluster Chemistry Features


This page describes the batch job options and properties applicable to the bwForCluster JUSTUS (Computational and Theoretical Chemistry).

A general description of options that should work on all bwHPC clusters can be found on the Batch Jobs page.

1 Job submission on bwForCluster for Chemistry

Processes Per Node: The number of physical cores is 16 on all nodes. If the requested ppn count exceeds this limit, the job will not start.

Jobs run node-exclusive per user: Several jobs from one user (with sum(ppn) <= 16) can run on one node at the same time, but only jobs from that one user.

ssh Access: Users have ssh-access to the nodes on which their jobs run.

1.1 Disk Space and Resources

Disk space is only available on some of the nodes. It has to be requested in the Moab options or the job may run on a diskless node.

  • The disk space content is erased when the job finishes.
  • scratch - disk space allocated per process (ppn), must be given in gigabytes (GB)

    $ msub -l gres=scratch:8 myjobscript.sh

  • "gres" is a Moab term for "generic resources";
  • "scratch" - name of the resource for disk space
  • "8" - size of disk space in gigabytes (GB). This size is per process

Scratch and available resources:

Nodes count   ppn   MAX disk space (scratch)    RAM-disk space             RAM useable by job
202           16    no scratch, only RAM-disk   64GB (half of total RAM)   up to 125GB
204           16    960GB (~1TB)                half of total RAM          125GB
22            16    1920GB (~2TB)               half of total RAM          251GB
16            16    1920GB (~2TB)               half of total RAM          503GB


  • RAM: taking the memory reserved by the operating system into account, the memory available for jobs is 125GB, 251GB or 503GB, depending on the node type.

"RAM-disk" means, that part of virtual memory (RAM) can be used for some temporary jobs files. Size of RAM-disk grows up automatically up to 50% of the RAM size. The rest RAM can be used as a traditional virtual memory.

  • scratch: 960 GB and 1920 GB are the maximum amounts you can request from Moab.

"Scratch" - is the disk space per process (ppn). Example: 100GB of disk space and uses 4 processes. You will have to describe "scratch" as 100/4=25 (GB).
This example requests 100GB (4x25GB) of disk space:


$ msub -l nodes=1:ppn=4,gres=scratch:25 <jobscript>

1.2 Diskspace Environment Variables

  • $TMPDIR and $SCRATCH point to
    • /scratch/<username>_job_<jobid> on nodes with disk space
    • /ramdisk/<username>_job_<jobid> on diskless nodes
  • $RAMDISK points to
    • /ramdisk/<username>_job_<jobid> on any node
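For example, a job script can use these variables to stage data on the fast node-local storage. This is a minimal sketch, assuming the gres request can also be given as an #MSUB directive; the program and file names are placeholders:

#!/bin/bash
#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=01:00:00
#MSUB -l gres=scratch:10

# Hypothetical program and file names, for illustration only
cd $MOAB_SUBMITDIR

# Stage the input data to the node-local scratch (RAM-disk on diskless nodes)
cp input.dat "$SCRATCH/"

# Run the program with its temporary files on the local scratch space
cd "$SCRATCH"
my_program input.dat > result.out

# Copy the results back before the job ends; the scratch content is erased afterwards
cp result.out "$MOAB_SUBMITDIR/"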

1.3 Default Values

The default values for a job are:

  • walltime=48:00:00 - MAX run-time of job
  • nodes=1:ppn=1 - one node with one process

If you do not specify a walltime or a number of nodes, these defaults are used for the job.
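Both values can also be set explicitly at submission time, for example (the numbers here are arbitrary):

$ msub -l nodes=1:ppn=16,walltime=24:00:00 <jobscript>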

1.4 Queues

There is no need to explicitly specify a queue. Jobs will automatically be assigned to a queue depending on the resources they request.
Compute resources such as walltime and nodes are restricted and must fit into the allowed resources of at least one of the queues for the job to start. The available queues are:

Queue name   Walltime MIN   Walltime MAX   -l walltime=*   MAX nodes (total per user)   MAX run/idle jobs (total per user)
quick        -              5 min          00:00:05:00     2                            1/1
short        > 5 min        2 days         00:48:00:00     64
normal       > 2 days       7 days         07:00:00:00     32
long         > 7 days       14 days        14:00:00:00     4
verylong**   > 14 days      28 days        28:00:00:00     2
* syntax for the maximum allowed time of the queue; try to estimate the needed time more accurately, e.g. 04:00:00:00 for a job running ~3.2 days
** to use the "verylong" queue, please contact the administrators

Example of how to submit a job to the 'normal' queue:

$ msub -l walltime=72:00:00 <jobscript>

The job may run for up to three days (walltime of 72 hours) and will therefore be placed in the "normal" queue.
By default, a job starts in the "short" queue.

1.5 Other job limitations and features

  • MAX 32 nodes per job
  • Only 1 user per node - each node runs jobs from only one user at a time
  • ssh access to the compute nodes where the job is running; the connection is closed automatically when the job finishes or is cancelled
  • The job's output files can be checked in real time while the job runs (e.g. for a default job: STDIN.o<JOB_ID>, STDIN.e<JOB_ID>), as shown below
  • A job is cancelled automatically when it cannot be started because the requested resources do not exist in the cluster
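For example, the output file of a running job can be followed with standard tools (a sketch; <jobname> and <JOB_ID> are placeholders for your actual job name and job ID):

$ tail -f <jobname>.o<JOB_ID>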

1.6 Job Feedback

As of 2016-04-07, in addition to the regular output file, a file with feedback on resource usage and job efficiency is created when the job finishes:

  • Name of the file: <job-output-file>_feedback
  • Location: same directory as the job output file

E.g., by default the output file is "<jobname>.o<jobid>" and the feedback file is "<jobname>.o<jobid>_feedback".

Information presented in the feedback file:

  • Main job parameters (job name and state, times, requested and used resources, host list)
  • Job analysis - possible problems and solutions, advice on how to increase efficiency
  • Error messages associated with the job (if present)
  • Link to a web page with graphical information on resource usage during the job (only for jobs with a runtime > 5 minutes)
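For example, once the job has finished you can view the feedback in the directory of the job output file (sketch; <jobname> and <jobid> are placeholders):

$ cat <jobname>.o<jobid>_feedback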

2 Environment Variables for Batch Jobs

The bwForCluster for Computational and Theoretical Chemistry provides the following environment variables in addition to the ones described on the general Batch Jobs page:

Specific Moab environment variables
Environment variable   Description
MOAB_NODELIST          List of nodes separated by ampersands (&), e.g. node1&node2
MOAB_TASKMAP           Node list with processes per node, separated by ampersands, e.g. node1:16&node2:16
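For example, inside a job script these ampersand-separated lists can be converted to a conventional one-host-per-line format (a minimal sketch; the file name hosts.txt and the way an MPI launcher consumes it are assumptions and depend on the loaded MPI module):

# Turn "node1&node2" into one host name per line
echo "$MOAB_NODELIST" | tr '&' '\n' > hosts.txt

# MOAB_TASKMAP ("node1:16&node2:16") can be split the same way
echo "$MOAB_TASKMAP" | tr '&' '\n'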

3 Interactive jobs

By starting an interactive session, a user gets access to compute nodes and can run programs interactively.
To submit an interactive job with default parameters, execute the following:


$ msub -I -V

or

$ msub -I -V -X

where

  • -I: interactive job
  • -V: export all environment variables to the compute node
  • -X: X11 forwarding; use this if you need a program with a graphical interface. Also see VNC for remote visualization.

See the manual page with "man msub" for more details on options.

To use one node interactively for 5 hours run:


$ msub -I -V -X -l nodes=1:ppn=16,walltime=05:00:00


After running the command, msub will not return you to the shell but wait for a node to become free and the job to start. Do not close the terminal; msub will connect you to a shell session on the node. Once your job starts, you are automatically logged on to the dedicated resource. Now you can run any application interactively (nearly) as in a normal shell.

There are some differences compared to a normal ssh connection (e.g. column handling of the shell/terminal). If these are a problem, connect to the node via ssh or ssh -X.

Once the walltime limit has been reached, you are automatically logged out of the compute node (including any ssh connections), and all local data on the scratch disks and RAM-disks is erased.

4 Chain jobs

It is possible to submit a chain of jobs, i.e. each job runs after the previous job has completed. You can choose between several conditions that determine when the next job in the chain may run. Here is an example script:

#!/bin/bash
##################################################
#
# Script to submit a chain of jobs with dependencies
#
##################################################

# count of jobs to submit (e.g. "5")
MAX_JOBS_COUNT=5

# define your jobscript (e.g. "~/chain_job")
JOB_SCRIPT=~/chain_job

# type of dependency
DEPENDENCY="afterok"
# possible dependencies for this script:
#
# after            after:<job>[:<job>]...           Job may start at any time after specified jobs have started execution.
# afterany      afterany:<job>[:<job>]...     Job may start at any time after all specified jobs have completed regardless of completion status.
# afterok        afterok:<job>[:<job>]...       Job may start at any time after all specified jobs have successfully completed.
# afternotok   afternotok:<job>[:<job>]...  Job may start at any time after all specified jobs have completed unsuccessfully.
#
# list of all dependencies:
# http://docs.adaptivecomputing.com/suite/8-0/enterprise/help.htm#topics/moabWorkloadManager/topics/jobAdministration/jobdependencies.html

count=1
echo "msub $JOB_SCRIPT"
JOBID=$(msub $JOB_SCRIPT 2>&1 | grep -v -e '^$')
echo "$JOBID"
# submit the remaining dependent jobs (MAX_JOBS_COUNT jobs in total, including the first one)
while [ $count -lt $MAX_JOBS_COUNT ]; do
    echo "msub -W depend=$DEPENDENCY:$JOBID $JOB_SCRIPT"
    JOBID=$(msub -W depend=$DEPENDENCY:$JOBID $JOB_SCRIPT 2>&1 | grep -v -e '^$')
    echo "$JOBID"
    let count=$count+1
done

where the user can change the dependency that determines when the next job may run (the script can also be modified to make a job depend on more than one job; see the example after this list):

  • after - job may start at any time after specified jobs have started execution
  • afterany - job may start at any time after all specified jobs have completed regardless of completion status
  • afterok - job may start at any time after all specified jobs have successfully completed
  • afternotok - job may start at any time after all specified jobs have completed unsuccessfully
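To make a job depend on more than one predecessor, several job IDs can be listed in a single dependency expression, following the afterok:<job>[:<job>]... syntax quoted in the script above (sketch; <jobid1> and <jobid2> are placeholders):

$ msub -W depend=afterok:<jobid1>:<jobid2> <jobscript>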


5 Job arrays

A user may have to run the same script many times, each time with different data (e.g. modelling of some process with different initial values). Moab has a feature called "job arrays" to help with tasks of that type. To submit a job array, you can use the following syntax:

msub -t [<jobname>]<indexlist>[%<limit>] jobarray.sh

It is possible to pass additional options to "msub" to describe the parameters of each job in the array (e.g. each sub-job has a walltime of 30 minutes and uses 2 nodes with 1 process per node):

msub -l walltime=00:30:00,nodes=2:ppn=1 -t [<jobname>]<indexlist>[%<limit>] jobarray.sh

The parameter <indexlist> specifies the number and order of the submitted sub-jobs. For example, to submit 10 jobs using 2 msub commands, one for the five odd-numbered jobs (job1) and one for the five even-numbered jobs (job2), the commands are:

msub -t job1.[1-10:2] jobarray.sh


msub -t job2.[2-10:2] jobarray.sh

To specify that only a certain number of sub-jobs in the array can run at a time, use the percent sign (%) delimiter (e.g. %2):

msub -t job.[1-10]%2 jobarray.sh

Each sub-job has 2 specific environment variables:

  • MOAB_JOBARRAYINDEX - index of the job within the array (e.g. 1, 3, 5, 7, 9 for the five odd-numbered jobs; 2, 4, 6, 8, 10 for the five even-numbered jobs)
  • MOAB_JOBARRAYRANGE - number of jobs in the array (e.g. 10 for all of the jobs above)


The user can use these variables inside the job-array script, e.g. to select different input/output files for each sub-job. Here is an example script "jobarray.sh"; instructions on how to test it are in the comments at the end:

#!/bin/bash
##################################################
#
# Simple job-array script
# Read some data from input-file and write it to output-file
#
##################################################

#MSUB -l walltime=00:01:00    # walltime
#MSUB -N "array"              # name of sub-job

cd ${MOAB_SUBMITDIR}

# Input file
INFILE=job.${MOAB_JOBARRAYINDEX}.in

# Output file
OUTFILE=job.${MOAB_JOBARRAYINDEX}.out

echo "Count of jobs in array: ${MOAB_JOBARRAYRANGE}">${OUTFILE}
echo "Index of this subjob: ${MOAB_JOBARRAYINDEX}" >>${OUTFILE}

# Read input and append to output file
cat $INFILE >>$OUTFILE

##################################################
#
# Check how it works:
#
# 1. Create different input-files (e.g. 4)
# 
#     $ for i in `seq 4`; do echo $i >job.$i.in ; done
#
# 2. Submit a job-array (e.g. with 4 jobs)
#
#     $ msub -t array[1-4] jobarray.sh 
#
# After submitting, the user sees only one JOBID number. As output files, the user will find:
#
# * 4 files job.[1-4].out
# * 4 traditional job output files - array.o<JOBID>-[1-4]
# * 4 traditional job error files - array.e<JOBID>-[1-4]
#
##################################################

After submitting the job, the user sees only one JOBID number.
You can get information about the whole job array by typing:

checkjob <JOBID>

It is possible to get full information about each sub-job:

checkjob <JOBID>[<index>]

E.g., to get information about sub-job 5 of job 1234, type "checkjob 1234[5]".
Each sub-job has its own output files:

  • sub-job output files - <jobname>.o<JOBID>-<index>
  • sub-job error files - <jobname>.e<JOBID>-<index>



6 Multiple MPI Jobs Sharing the Same Node

There is a problem on JUSTUS when different jobs share the same node and the intra-node communication does not use shared memory. In this case the following error can appear in your log file:

can't open /dev/ipath, network down (err=26)

This typically happens in multi-node setups, but some programs (e.g. Turbomole) can also show this behaviour on a single node. The problem can be solved by setting the environment variable PSM_SHAREDCONTEXTS_MAX, which should satisfy:

$PSM_SHAREDCONTEXTS_MAX * (#jobs_on_node) <= 13

For example, you could run two of the following jobs on the same node (2 jobs x PSM_SHAREDCONTEXTS_MAX=6 gives 12 <= 13):

#!/bin/bash
#MSUB -l nodes=1:ppn=8
#MSUB -l walltime=00:10:00
# ...
export PSM_SHAREDCONTEXTS_MAX=6
export PARA_ARCH="MPI"

module load chem/turbomole
# ...
dscf > dscf.out
# ...