NEMO/Moab

From bwHPC Wiki

This article describes features of the batch job system that apply only to the "bwForCluster NEMO" in Freiburg.


Submitting Jobs on the bwForCluster NEMO

This page describes the details of the queuing system specific to the bwForCluster NEMO.

A general description on options that should work on all bwHPC clusters can be found on the Batch Jobs page.

Currently all worker nodes have 20 physical cores. Do not request more than 20 processes per node with the ppn option, i.e. at most ppn=20.

If the requested ppn value exceeds this limit, the job will remain in the idle state and never start running.

Jobs run in shared mode. This means that several users can run jobs on the same worker node if spare resources are available.
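A minimal job script respecting this limit might look as follows. This is a sketch assuming the #MSUB directive syntax described on the general Batch Jobs page; the job name and walltime are illustrative values, not site requirements:

```shell
#!/bin/bash
#MSUB -l nodes=1:ppn=20        # at most 20 processes per node, or the job stays idle
#MSUB -l walltime=02:00:00     # illustrative walltime
#MSUB -N example_job           # hypothetical job name

# Job body: report where the job landed
echo "Job running on $(hostname)"
```

When submitted with msub, the #MSUB directives take effect; when run directly with bash, they are ordinary comments, so the script body can be tested locally before submission.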


Monitor Running Jobs

Once your jobs are running, you can log in to the nodes on which they run. The allocated nodes are listed in the output of checkjob:

$ checkjob 12345
...
Allocated Nodes:
[n3101.nemo.privat:20][n3102.nemo.privat:20]
...

Then you can ssh into these nodes; the short host name is sufficient. Please log out once you are finished. You can also use the program pdsh to monitor your jobs non-interactively: pdsh determines where your job is running and executes a command on all of those nodes.

Interactive:

$ ssh n3101

Non-interactive with ssh:

# run 'ps aux | grep <myjob>' on node n3101
$ ssh n3101 'ps aux | grep <myjob>'

Non-interactive with pdsh:

# run 'ps aux | grep <myjob>' on all nodes corresponding to jobid '12345'
$ pdsh -j 12345 'ps aux | grep <myjob>'
n3101: fr_uid   125068  101  0.0  39040  1684 ?        Sl   12:15   0:25 <myjob>
n3102: fr_uid   125068  101  0.0  39040  1684 ?        Sl   12:15   0:25 <myjob>

# run kill '<myjob>' on all nodes corresponding to jobid '12345'
$ pdsh -j 12345 killall <myjob>

# works with array jobs as well
$ pdsh -j 12346[1] 'ps aux | grep <myjob>'
n3103: fr_uid   125068  101  0.0  39040  1684 ?        Sl   12:15   0:25 <myjob>
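If pdsh is not available, the short host names can also be recovered from the checkjob output with standard text tools. A minimal sketch, assuming the [host:cores] format shown above; the sample line stands in for real checkjob output:

```shell
#!/bin/sh
# Sample "Allocated Nodes" line; in practice obtain it from 'checkjob <jobid>'
nodes_line='[n3101.nemo.privat:20][n3102.nemo.privat:20]'

# Extract each [host... entry, drop the bracket, keep the short host name
echo "$nodes_line" | grep -o '\[[^]:]*' | tr -d '[' | cut -d. -f1
```

The resulting names can then be fed to ssh in a loop, e.g. `for n in $(...); do ssh "$n" 'ps aux | grep <myjob>'; done`.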

Interactive Jobs

Interactive jobs must NOT run on the login nodes. However, resources for interactive jobs can be requested with msub. The following example starts an interactive session on one compute node with one core for one hour:

$ msub  -I -V -l nodes=1:ppn=1 -l walltime=1:00:00

The option "-I" requests an interactive job, and the option "-V" exports all environment variables to the compute node of the interactive session. After executing this command, wait until the queuing system has granted you the requested resources. Once they are granted, you will automatically be logged in to the allocated compute node. The full node with all cores will be available to you.

If you use applications or tools which provide a GUI, enable X-forwarding for your interactive session with:

# use -Y for ssh X-forwarding
$ ssh -l <uid> -Y login.nemo.uni-freiburg.de
# use -X for X-forwarding
$ msub -I -V -X -l nodes=1:ppn=1,walltime=1:00:00

Once the walltime limit has been reached you will be automatically logged out from the compute node.

msub -q queues

On the bwForCluster NEMO, the queue does not need to be specified explicitly.

Since all compute nodes are identical, the bwForCluster NEMO currently provides only the single queue "compute", which is also the default.