HPC Glossary/Batch system: Difference between revisions
Edited by K Siegmund (edit summary: transferred from course)
[[image:distributing_jobs1.svg]]
Revision as of 10:58, 1 July 2025
When we speak of a batch system on compute clusters, we mean the system that knows which compute nodes are used by whom and when they will become available. It also knows about all waiting jobs and, whenever a node becomes available, determines which job will start next on which node.
Why do we need a Resource Management System?
An HPC cluster is a multi-user system. Users have compute jobs with different demands on the number of processor cores, memory, disk space and run-time. Some users run a program only occasionally for a big task, while others must run many simulations to finish their projects.
The cluster provides only a limited number of compute resources with certain features. Giving all users unrestricted access to all compute nodes with no time limit would not work. Therefore, we need a resource management system (batch system) to schedule compute jobs and distribute them onto suitable compute resources. A resource management system pursues several objectives:
- Fair distribution of resources among users
- Compute jobs should start as soon as possible
- Full load and efficient usage of all resources
How does a Resource Management System work?
A resource management system or batch system manages the compute nodes, jobs and queues and basically consists of two components:
- A resource manager, which is responsible for the node status and for the distribution of jobs over the compute nodes.
- A workload manager (scheduler), which is in charge of job scheduling, job management, job monitoring and job reporting.
A resource management system works as follows:
- The user creates a job script containing requests for compute resources and submits the script to the resource management system.
- The scheduler parses the job script for resource requests and determines where to run the job and how to schedule it.
- The scheduler delegates the job to the resource manager.
- The resource manager executes the job and communicates the status information to the scheduler.
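The interplay between the two components can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not how any real batch system is implemented: the class names `Job`, `ResourceManager` and `Scheduler` are hypothetical, a job requests nothing but a number of cores, and the scheduler uses a plain first-fit pass in submission order instead of a real scheduling policy.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class Job:
    """A submitted job; 'cores' stands in for the resource requests in a job script."""
    name: str
    user: str
    cores: int
    state: str = "PENDING"
    node: Optional[str] = None


@dataclass
class ResourceManager:
    """Hypothetical resource manager: tracks node status and runs jobs on nodes."""
    free_cores: Dict[str, int]  # node name -> currently free cores

    def find_node(self, cores: int) -> Optional[str]:
        # Return the first node with enough free cores, or None if no node fits.
        for node, free in self.free_cores.items():
            if free >= cores:
                return node
        return None

    def execute(self, job: Job, node: str) -> str:
        # Reserve the cores, mark the job as running and report its status.
        self.free_cores[node] -= job.cores
        job.node = node
        job.state = "RUNNING"
        return job.state

    def release(self, job: Job) -> str:
        # Give the cores back once the job has finished and report its status.
        self.free_cores[job.node] += job.cores
        job.state = "COMPLETED"
        return job.state


@dataclass
class Scheduler:
    """Hypothetical scheduler: keeps the queue and decides where jobs run."""
    rm: ResourceManager
    queue: Deque[Job] = field(default_factory=deque)

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def schedule(self) -> List[Job]:
        # One scheduling pass: start every queued job that currently fits,
        # in submission order; jobs that do not fit stay in the queue.
        started: List[Job] = []
        waiting: Deque[Job] = deque()
        while self.queue:
            job = self.queue.popleft()
            node = self.rm.find_node(job.cores)
            if node is None:
                waiting.append(job)
            else:
                self.rm.execute(job, node)  # delegate to the resource manager
                started.append(job)
        self.queue = waiting
        return started


# Example run with made-up nodes, users and jobs.
rm = ResourceManager(free_cores={"node01": 4, "node02": 8})
sched = Scheduler(rm)
sched.submit(Job("sim_a", "alice", cores=4))
sched.submit(Job("sim_b", "bob", cores=8))
sched.submit(Job("sim_c", "alice", cores=2))

first_pass = sched.schedule()   # sim_a and sim_b start, sim_c has to wait
rm.release(first_pass[0])       # sim_a finishes and frees its cores
second_pass = sched.schedule()  # now sim_c can start

for job in first_pass + second_pass:
    print(job.name, job.user, job.node, job.state)
```

In this model the third job waits until the first one finishes and frees its cores, which mirrors the role of the queue described above. Real batch systems additionally take memory, run-time limits and fairness between users into account when choosing the next job.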
