Slurm
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Full documentation is available at slurm.schedmd.com.
1. User manual
Executable | Description |
---|---|
sinfo | View information about nodes and partitions |
squeue | View information about jobs |
salloc | Obtain a job allocation |
sbatch | Submit a batch script for later execution |
srun | Obtain a job allocation (as needed) and execute an application |
scancel | Signal jobs, job arrays, and/or job steps |
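
As a rough sketch of how these commands fit together in a session: sbatch normally confirms a submission with a message of the form "Submitted batch job <ID>", and that ID can then be passed to squeue or scancel. The helper name parse_job_id and the script name job.slurm are hypothetical, and the parsing assumes the plain single-cluster message format.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation
# message, assumed to read "Submitted batch job <ID>".
parse_job_id() {
    local message="$1"
    # Strip everything up to the last space, keeping the final field.
    echo "${message##* }"
}

# Typical session, left commented out so the snippet can be sourced on a
# machine without Slurm (job.slurm is a hypothetical script name):
# sinfo                                          # list partitions and node states
# job_id=$(parse_job_id "$(sbatch job.slurm)")   # submit and keep the job ID
# squeue -j "$job_id"                            # check the state of that job
# scancel "$job_id"                              # cancel it if needed
```

If only the ID is wanted, sbatch also has a --parsable option that prints it without the surrounding text.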
1.1. Run a job using sbatch

    #!/bin/bash
    #SBATCH --job-name=job_test    # Job name
    #SBATCH --ntasks=4             # Run on 4 CPU cores
    #SBATCH --time=00:05:00        # Time limit hrs:min:sec
    #SBATCH --output=test_%j.log   # Standard and error output
    pwd; hostname; date
    echo "Running on 4 CPU cores"
    mpiexec feelpp_toolbox_fluid --license
    date

If hyperthreading is enabled and you do not want to use it, add `#SBATCH --ntasks-per-core=1` to the script.
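
The --time value in the script above uses the hrs:min:sec form. As a small convenience, the sketch below converts a duration in seconds into that form. The function name to_slurm_time is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: convert a duration in seconds to the
# hrs:min:sec form accepted by --time.
to_slurm_time() {
    local seconds="$1"
    printf '%02d:%02d:%02d\n' \
        $((seconds / 3600)) \
        $(((seconds % 3600) / 60)) \
        $((seconds % 60))
}
```

For instance, `to_slurm_time 300` yields the 00:05:00 limit used in the script above.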
The previous script writes both standard output and standard error to the same log file. To send standard error to a separate file, add the `--error=<FILE>` option.
You can also be notified by email when the job finishes or fails by using the `--mail-type=<EVENTS>` and `--mail-user=<EMAIL>` options.

    #!/bin/bash
    #SBATCH --job-name=job_test          # Job name
    #SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
    #SBATCH --mail-user=toto@mail.com    # Where to send mail
    #SBATCH --ntasks=4                   # Run on 4 CPU cores
    #SBATCH --time=00:05:00              # Time limit hrs:min:sec
    #SBATCH --output=test_%j.log         # Standard output
    #SBATCH --error=test_%j.err          # Error output
    pwd; hostname; date
    echo "Running on 4 CPU cores"
    mpiexec feelpp_toolbox_fluid --license
    date

Inside the Slurm script, some environment variables are defined automatically by Slurm and can be used in the script:
Variable | Description |
---|---|
SLURM_NTASKS, SLURM_NPROCS | Number of tasks |
SLURM_JOB_ID | The ID of the job allocation |
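
A minimal sketch of how these variables might be used inside a batch script is shown below. The helper names describe_job and job_results_dir are hypothetical, and the fallback values (none, 1, local) are assumptions that only apply when the code runs outside a Slurm allocation.

```shell
#!/bin/bash
# Hypothetical helper: report the job ID and the number of tasks.
# Falls back to placeholder values outside a Slurm allocation.
describe_job() {
    echo "job=${SLURM_JOB_ID:-none} ntasks=${SLURM_NTASKS:-1}"
}

# Hypothetical helper: name of a per-job results directory, so that
# two runs of the same script do not overwrite each other's files.
job_results_dir() {
    echo "results_${SLURM_JOB_ID:-local}"
}
```

In a batch script, one could call `describe_job` at the top to record the job in the log and `mkdir -p "$(job_results_dir)"` before launching the application.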