Job management
CS 300 (PDC)
A job is a request for parallel computation that is treated as a unit. This term is ordinarily associated with batch computing.
For example, a single BLAST request by a bioinformatics researcher may be considered a single job. That request may be carried out on a distributed system running a parallel implementation of BLAST, or on a standalone system with only computer-level or processor-level parallelism, but from the user's viewpoint, it is a single request for computation. Likewise, most systems would treat a map-reduce operation as one job, even though it involves three stages running on potentially thousands of computers.
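The map-reduce case can be made concrete with a toy example. The shell pipeline below is only an illustration, not an actual map-reduce framework: it performs a word count in three stages analogous to map, shuffle, and reduce.

```shell
#!/bin/sh
# Toy word count in three stages, analogous to map-reduce:
#   "map"     - tr splits the input into one word per line
#   "shuffle" - sort brings identical words together
#   "reduce"  - uniq -c counts each run of identical words
printf 'to be or not to be\n' | tr ' ' '\n' | sort | uniq -c
```

From the user's viewpoint this is still a single request for computation, even though it passes through three distinct stages.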
If a cluster is only used by one person or group, there is presumably no problem managing cluster jobs. But if more than one independent entity wants to use the same cluster, some software for coordinating job submissions and observing their progress becomes a priority.
A cluster management system provides this coordination. Two widely used examples are described below.
SLURM (Simple Linux Utility for Resource Management) is an open-source batch queueing system for managing cluster or grid computations, developed at Lawrence Livermore National Laboratory. Designed for Linux clusters of all sizes, SLURM is what we use to manage jobs on the St. Olaf Beowulf clusters (2013).
SLURM is capable of managing standalone or distributed batch jobs, and also interactive jobs.
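For an interactive job, SLURM provides the srun command, which runs a program under SLURM's control and streams its output back to the terminal. A minimal sketch (assuming a standard SLURM installation; this command is not taken from the notes above):

```shell
# Run a command interactively as 4 tasks; output appears in the terminal.
# (Sketch only: requires access to a SLURM cluster.)
srun -n 4 hostname
```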
Condor is an open-source framework designed for high-throughput computing (HTC). It focuses on heterogeneous grids, supporting diverse computing scenarios that include dedicated clusters and idle user workstations.
In the case of idle user workstations, Condor performs cycle scavenging: it detects when a workstation is "not in use," according to policies that the workstation's primary user agrees to; then it runs jobs (that may have been requested by other users) on that idle workstation. In order not to inconvenience the primary user, Condor stops jobs quickly as soon as the primary user begins using that workstation.
Since Condor is designed to support jobs that may take months to complete, and since it stops jobs as soon as a workstation is no longer idle, Condor supports a mechanism called checkpointing, in which software markers are recorded that enable a job's computation to be resumed later (possibly on a different computer). Condor records checkpoints periodically throughout a job's computation, rather than waiting for the primary user to return. This reduces the inconvenience for the primary user, because infrequent checkpoints would mean a longer wait for a checkpoint to take place when the user returns. It also increases the reliability of the system, because less of a job's computation is lost if the workstation crashes for any reason.
Some tradeoffs of frequent checkpointing are that (1) the computational overhead of initiating and completing a checkpoint is incurred with every checkpoint, and (2) new checkpoints largely supersede old ones, so the computation spent producing an old checkpoint is wasted whenever a new one is produced. However, these tradeoffs are acceptable for Condor, since it is focused on long-term computations rather than operations per second, and because policies that are friendly to the primary users of workstations encourage more of those users to participate in Condor, thus providing more resources for HTC.
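The idea of checkpointing can be sketched in a few lines of shell. The script below is only an analogy to Condor's mechanism (which snapshots process state transparently, without changes to the program): a loop records its progress in a file after every step, so a killed run resumes where it left off instead of starting over.

```shell
#!/bin/sh
# Checkpointing sketch: count to 10, saving progress after each step.
# If the script is killed and rerun, it resumes from the saved value.
CKPT=counter.ckpt

i=0
if [ -f "$CKPT" ]; then
    i=$(cat "$CKPT")          # resume from the last checkpoint
fi

while [ "$i" -lt 10 ]; do
    i=$((i + 1))              # one unit of "computation"
    echo "$i" > "$CKPT"       # checkpoint: record progress so far
done

echo "finished at $i"
rm -f "$CKPT"                 # done; discard the checkpoint
```

Writing the file on every iteration mirrors the tradeoff above: each checkpoint costs a write, and each new checkpoint supersedes the previous one.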
Some features of a cluster job management system
Accepting job requests
Scheduling job requests, i.e., choosing start times and determining which resources each job will use, e.g., which processors each job will run on.
Dispatching jobs, i.e., starting them running
Managing jobs while they are running on their nodes. This may include reporting on a job's status (collectively and/or per process), observing when those jobs have completed, stopping jobs before completion if necessary, etc.
Allocating computational resources
Implementation
Daemons: e.g., sshd, httpd, lpd.
SLURM sbatch
batch scripts

    sbatch -n count -o outfile script

For example, if myscript.sh contains

    #!/bin/sh
    mpirun trap.c

then the command

    % sbatch -n 4 -o myscript.out myscript.sh

submits a job to execute the MPI program trap.c with four processors, and to place the output in a file myscript.out. Any error output will go to the standard error; the -e flag can be used to place error output into a file.
Another command-line option:

    -e errfile    place standard error from the job in a file

sbatch options may also be given within the script itself, in shell-script comments that begin with #SBATCH. For example, if myscript2.sh contains

    #!/bin/sh
    #SBATCH -n 4 -o myscript.out
    mpirun trap.c

then the command

    % sbatch myscript2.sh

performs a computation equivalent to the sbatch command above. More than one #SBATCH line may be used, e.g.,

    #!/bin/sh
    #SBATCH -n 4 -o myscript.out
    #SBATCH -e myscript.err
    mpirun trap.c
squeue
View the queue of submitted jobs. A command-line option:

    -u username    view only username's jobs

sinfo
View information about SLURM nodes and partitions.

scancel job-id
Cancel a job; use squeue to find job-id values. scancel -u username will cancel all your jobs.
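Putting the commands together, a typical session might look like the following sketch (the job id 12345 is illustrative; use the id that squeue reports):

```shell
# Sketch of a typical SLURM session; requires access to a SLURM cluster.
sbatch -n 4 -o myscript.out myscript.sh   # submit the batch job
squeue -u "$USER"                         # watch only my jobs
sinfo                                     # check node/partition status
scancel 12345                             # cancel one job by id (illustrative id)
scancel -u "$USER"                        # or cancel all of my jobs
```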