
CS 300 (PDC)

Cluster job management

  • A job is a request for parallel computation that is treated as a unit. This term is ordinarily associated with batch computing.

    For example, a single BLAST request by a bioinformatics researcher may be considered a single job. That request may be carried out on a distributed system running a parallel implementation of BLAST, or on a standalone system with only computer-level or processor-level parallelism, but from the user's viewpoint it is a single request for computation. Likewise, most systems would treat a map-reduce operation as one job, even though it involves three stages running on potentially thousands of computers.

  • If a cluster is used by only one person or group, there is presumably no problem managing cluster jobs. But if more than one independent entity wants to use the same cluster, software for coordinating job submissions and observing their progress becomes a priority.

  • A cluster management system provides both of the following:

    • mechanisms for implementing management features in software and hardware, and

    • policy rules for how and when to apply those mechanisms.

  • SLURM (Simple Linux Utility for Resource Management) is an open-source batch queueing system for managing cluster or grid computations, developed at Lawrence Livermore National Laboratory, a research facility. SLURM is designed for Linux clusters of all sizes; we use it to manage jobs on the St. Olaf Beowulf clusters (2013).

    SLURM is capable of managing standalone or distributed batch jobs, and also interactive jobs.
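
    For instance, here is a minimal sketch of an interactive job, assuming the cluster provides SLURM's srun command (the program name ./hello is purely illustrative):

      % srun -n 4 ./hello

    This runs ./hello as four parallel tasks under SLURM's control, with the output appearing directly in the terminal rather than in a file as with a batch job.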

  • Condor is an open-source framework designed for high-throughput computing (HTC). It focuses on heterogeneous grids, combining support for diverse computing scenarios, including dedicated clusters and idle user workstations.

    In the case of idle user workstations, Condor performs cycle scavenging: it detects when a workstation is "not in use," according to policies that the workstation's primary user agrees to; then it runs jobs (that may have been requested by other users) on that idle workstation. In order not to inconvenience the primary user, Condor quickly stops those jobs as soon as the primary user begins using that workstation again.

    Since Condor is designed to support jobs that may take months to complete, and since it stops jobs as soon as a workstation is no longer idle, Condor supports a mechanism called checkpointing, in which software markers are recorded that enable a job's computation to be resumed (probably on a different computer). Condor actually records checkpoints throughout a job's computation, rather than waiting for the primary user to return. This reduces the inconvenience for the primary user, because a long run between checkpoints would mean a longer wait while a checkpoint takes place. It also increases the reliability of the system, because less of a job's computation will be lost if the user's workstation crashes for any reason.

    Some tradeoffs with frequent checkpointing are that (1) the computational overhead of initiating and completing a checkpoint must be incurred with every checkpoint, and (2) new checkpoints largely supersede old ones, so that the work of producing old checkpoints is wasted whenever a new checkpoint is produced. However, these tradeoffs are acceptable for Condor, since it is focused on long-term computations rather than operations per second, and because policies that are friendly to the primary users of workstations encourage more of those users to participate in Condor, thus providing more resources for HTC.
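
    For concreteness, here is a minimal sketch of a Condor submit description file (the file and program names are hypothetical; the vanilla universe shown here does not provide checkpointing, which in classic Condor requires the standard universe and a program relinked with condor_compile):

      # hello.sub -- describes one Condor job to be queued
      universe   = vanilla
      executable = hello
      output     = hello.out
      error      = hello.err
      log        = hello.log
      queue

    The job is submitted with

      % condor_submit hello.sub

    and its progress can be observed with condor_q.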

  • Some features of a cluster job management system

    • Accepting job requests

    • Scheduling job requests, i.e., choosing start times and determining which resources each job will use, e.g., which processors each job will run on.

    • Dispatching jobs, i.e., starting them running

    • Managing jobs while they are running on their nodes. This may include reporting on a job's status (collectively and/or per process), observing when those jobs have completed, stopping jobs before completion if necessary, etc.

    • Allocating computational resources

  • Implementation

Example: some SLURM commands

sbatch -n count -o outfile script
Submit a script for batch computation. For example, if myscript.sh contains
#!/bin/sh
mpirun trap.c
then the command
  % sbatch -n 4 -o myscript.out myscript.sh
submits a job to execute the MPI program trap.c with four processes, and to place the output in the file myscript.out. Any error output will go to standard error; the -e flag can be used to place error output in a file instead.

Another command-line option:

  • -e errfile    place standard error from the job in a file
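
For example, the command

  % sbatch -n 4 -o myscript.out -e myscript.err myscript.sh

submits the same job as the earlier example, but places any error output in a file myscript.err (a name chosen here just for illustration) instead of sending it to standard error.
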
You can specify command-line options within the batch script itself, in shell-script comments that begin with #SBATCH. For example, if myscript2.sh contains
#!/bin/sh
#SBATCH -n 4 -o myscript.out

mpirun trap.c
then the command
  % sbatch myscript2.sh
performs a computation equivalent to the sbatch command above. More than one #SBATCH line may be used, e.g.,
#!/bin/sh
#SBATCH -n 4 -o myscript.out
#SBATCH -e myscript.err

mpirun trap.c
squeue
View all pending and running jobs being managed by SLURM

A command-line option:

  • -u username    View only username's jobs
sinfo
Display which nodes are idle, down, or allocated
scancel job-id
Cancel a job. Use squeue to find job-id values
scancel -u username will cancel all of username's jobs
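
A typical session combining these commands might look like the following sketch, which reuses the batch script from the sbatch example above (the actual job-id is reported by sbatch and displayed by squeue):

  % sbatch -n 4 -o myscript.out myscript.sh
  % squeue -u username
  % scancel job-id

This submits the batch job, checks on its progress in the queue, and cancels it before completion if necessary.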