DACHS/Queues: Difference between revisions
Revision as of 13:46, 18 December 2024
Partitions
DACHS offers three partitions in Slurm, which map directly to the node types: nodes with one NVIDIA L40S GPU, a node with 4 AMD MI300A APUs and the node with 8 NVIDIA H100 GPUs.
sinfo_t_idle
To see the available nodes, DACHS offers the tool sinfo_t_idle, which any user may call.
sbatch -p partition
Batch jobs specify their compute requirements, which must fit within the limits of a partition: maximum (wall-)time, memory and GPU resources.
If you require a GPU, you must request it explicitly.
Since requested compute resources are NOT always automatically mapped to the correct queue class, you must add the correct queue class to your sbatch command.
As with BwUniCluster 2.0, the specification of a partition is required.
Details are:
DACHS sbatch -p partition
queue | node | default resources | maximum resources
---|---|---|---
gpu1 | gpu1[01-45] | time=30, mem-per-node=5000mb | time=72:00:00, nodes=16, mem-per-node=300000mb, gres=gpu:1
gpu4 | gpu401 | time=30, mem-per-cpu=5000mb | time=72:00:00, nodes=1, mem=500000mb, ntasks-per-node=96
gpu8 | gpu801 | time=30, mem-per-cpu=5000mb, cpus-per-gpu=8 | time=48:00:00, mem=752000mb, ntasks-per-node=96
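Because the partition is mandatory, it can help to check the name before submitting. The following is a minimal sketch, assuming only the three partition names listed in the table above; the function names is_dachs_partition and submit_job are hypothetical helpers and not part of DACHS or Slurm.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the three partitions named on this page.
is_dachs_partition() {
    case "$1" in
        gpu1|gpu4|gpu8) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: refuse to submit without a known partition.
# Usage: submit_job <partition> <script> [further sbatch options...]
submit_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: submit_job <partition> <script> [sbatch options...]" >&2
        return 2
    fi
    partition=$1
    script=$2
    shift 2
    if ! is_dachs_partition "$partition"; then
        echo "unknown partition: $partition (expected gpu1, gpu4 or gpu8)" >&2
        return 1
    fi
    # Options must precede the script name; anything after it goes to the script.
    sbatch -p "$partition" "$@" "$script"
}
```

A call such as submit_job gpu1 python_run.slurm --time=1:00:00 would then run sbatch -p gpu1 --time=1:00:00 python_run.slurm, while a misspelled partition is rejected before anything reaches the scheduler.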
The default resources of a queue class define the time, number of tasks and memory that apply if they are not given explicitly with the sbatch command. The corresponding resource options are --time, --ntasks, --nodes, --mem and --mem-per-cpu.
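To override the default wall time, a value for --time has to stay within the maximum of the chosen partition. The sketch below builds such an option from a number of minutes. It assumes the maximum times from the table on this page (72 hours for gpu1 and gpu4, 48 hours for gpu8) and relies on Slurm reading a bare number such as time=30 as minutes; the function names are hypothetical.

```shell
#!/bin/sh
# Maximum wall time in minutes per partition, taken from the table on this page.
max_minutes() {
    case "$1" in
        gpu1|gpu4) echo $((72 * 60)) ;;
        gpu8)      echo $((48 * 60)) ;;
        *)         return 1 ;;
    esac
}

# Format a number of minutes as HH:MM:SS, a format accepted by --time.
format_walltime() {
    printf '%02d:%02d:00\n' $(($1 / 60)) $(($1 % 60))
}

# Print a --time option for <partition> <minutes>,
# or fail if the request exceeds the partition maximum.
time_option() {
    limit=$(max_minutes "$1") || return 1
    [ "$2" -le "$limit" ] || return 1
    printf '%s\n' "--time=$(format_walltime "$2")"
}
```

For example, time_option gpu1 90 prints --time=01:30:00, which can be passed to sbatch together with -p gpu1.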
A typical Slurm batch script (called python_run.slurm for brevity) for one node requiring one NVIDIA L40S GPU:
    #!/bin/bash
    #SBATCH --partition=gpu1
    #SBATCH --ntasks-per-gpu=48
    #SBATCH --gres=gpu:1
    #SBATCH --time=1:00:00
    #SBATCH --mail-type=all
    #SBATCH --mail-user=my_email@hs-esslingen.de
    module load devel/cuda/12.4
    cd $TMPDIR
    python3 -m venv my_environment
    . my_environment/bin/activate
    python3 -m pip install -r $HOME/my_requirements.txt
    rsync -avz $HOME/my_data_dir/ .
    time python3 $HOME/python_script.py
Submitting sbatch python_run.slurm will allocate one compute node together with its one available GPU for 1 hour. The script then loads the CUDA module version 12.4 and changes to the fast scratch directory specified in the environment variable TMPDIR.
You have to allocate the GPU; otherwise you cannot use it.
Following Python's best practices, the script then creates a new virtual environment in that directory and installs the project's dependencies listed in my_requirements.txt.
It then copies the data directory my_data_dir to this directory using rsync.
Finally, it executes your main Python script under the time command to measure how much time was actually used.
Alternatively, you may time all the commands to get an estimate for your next batch job.
Here, Slurm will send an email with a summary to the specified address upon start and completion of the job.
The better your approximation, the better the Slurm scheduler can allocate resources to all users.
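Timing every command of the batch script can be sketched with a small wrapper. The function name timed is hypothetical; it uses date +%s, which reports whole seconds and is available with GNU and BSD date.

```shell
#!/bin/sh
# Hypothetical helper: run a command, report its wall time in seconds on stderr,
# and pass the exit status of the command through unchanged.
timed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "step '$*' took $((end - start)) s" >&2
    return "$status"
}

# Possible use inside python_run.slurm (left commented out here):
# timed python3 -m venv my_environment
# timed python3 -m pip install -r "$HOME/my_requirements.txt"
# timed rsync -avz "$HOME/my_data_dir/" .
# timed python3 "$HOME/python_script.py"
```

The reported durations end up in the job's error output, where they can be added up to choose a tighter --time value for the next submission.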
Interactive usage
To get a good estimate of the runtime, you may first want to try the resource interactively:
srun --partition=gpu1 --ntasks-per-gpu=48 --gres=gpu:1 --pty /bin/bash
Then you may execute the steps of the python_run.slurm script interactively, note any differences and amend your Slurm batch script accordingly.
Please note the --pty option, which forwards standard output and accepts standard input so that you can work with the shell.
Multiple nodes
Of course, you may allocate multiple GPUs across nodes by running:
sbatch --nodes 4 ./python_run.slurm
Please be aware that TMPDIR is still local to each node. For the time being, run from your $HOME.
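One way to handle this inside a batch script is to choose the working directory depending on the number of allocated nodes. The following is only a sketch: SLURM_JOB_NUM_NODES and SLURM_JOB_ID are environment variables that Slurm sets inside a job, while the function name pick_workdir and the directory layout under $HOME are assumptions made for illustration.

```shell
#!/bin/sh
# Pick a working directory: node-local scratch for single-node jobs,
# a directory under $HOME when the job spans several nodes.
# Outside of a Slurm job the node count defaults to 1.
pick_workdir() {
    nodes=${SLURM_JOB_NUM_NODES:-1}
    if [ "$nodes" -gt 1 ] || [ -z "${TMPDIR:-}" ]; then
        # Hypothetical layout: one directory per job below $HOME/jobs.
        echo "$HOME/jobs/${SLURM_JOB_ID:-manual}"
    else
        echo "$TMPDIR"
    fi
}

# Possible use inside a batch script (left commented out here):
# workdir=$(pick_workdir)
# mkdir -p "$workdir" && cd "$workdir"
```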
Nodes with multiple GPUs
The partitions gpu4 and gpu8 feature multiple GPUs.
The gpu4 partition contains the node gpu401 featuring 4 AMD MI300A APUs with 128GB of memory each using ROCm.
Please refer to the documentation on this node.
The gpu8 partition contains the node gpu801 featuring 8 NVIDIA H100 GPUs.
Please refer to the documentation on this node.