Many labs have converged on Slurm for managing their shared compute resources. Getting started with Slurm is fairly easy, but it quickly becomes unintuitive when you want to run a hyper-parameter search. This repo provides scripts that make launching many jobs painless and easy to control.
For this use case, Slurm offers Job Arrays. Each job in an array is assigned a plain integer task ID (exposed to the job as `SLURM_ARRAY_TASK_ID`). This does not map well onto typical machine learning jobs, which require running over a grid of hyperparameters.
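For reference, a bare job array looks something like this (a minimal sketch; the resource requests and array size are placeholders, not taken from this repo's scripts):

```bash
#!/bin/bash
# Submit with: sbatch this_script.sh
# Runs 32 copies of this script, with task IDs 1..32.
#SBATCH --array=1-32
#SBATCH --gres=gpu:1
#SBATCH --output=logs/%A_%a.out   # %A = array job ID, %a = task ID

# The only thing distinguishing the tasks is this integer.
echo "I am array task ${SLURM_ARRAY_TASK_ID}"
```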
To bridge this gap, I present a simple flow that:

- Defines a grid and runs through all resulting jobs (and skips already-finished jobs if you later extend the grid and rerun)
- Is robust against failures (e.g. a server crashing or jobs being killed mid-run)
- Easily limits parallelism: simply set the maximum number of GPUs to use (see the snippet after this list)
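On the parallelism point: Slurm job arrays support throttling natively via a `%` suffix on the array range, which caps how many tasks run simultaneously. If each task requests a single GPU, this directly caps GPU usage (the numbers here are placeholders):

```bash
# At most 8 of the 100 array tasks run at once; with --gres=gpu:1
# per task, at most 8 GPUs are in use at any time.
#SBATCH --array=1-100%8
#SBATCH --gres=gpu:1
```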
The solution involves creating a file listing every job you want to run (this file can itself be generated by a Python or Bash script). We then iterate through the file, using the Slurm array task ID to index a line. Before launching each job, we check whether it already finished (a results.json exists in its output folder) and skip it if so.
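A minimal sketch of this flow (the grid values, file names like `jobs.txt`, and the `--out` flag of `train.py` are illustrative assumptions, not the repo's exact interface):

```bash
#!/bin/bash
# generate_jobs.sh -- write one full command per line into jobs.txt.
> jobs.txt
for lr in 0.1 0.01 0.001; do
  for seed in 0 1 2; do
    echo "python train.py --lr ${lr} --seed ${seed} --out runs/lr${lr}_seed${seed}" >> jobs.txt
  done
done
```

```bash
#!/bin/bash
# run_array.sh -- submit with:
#   sbatch --array=1-$(wc -l < jobs.txt)%8 run_array.sh
# (make sure the logs/ directory exists before submitting)
#SBATCH --gres=gpu:1
#SBATCH --output=logs/%A_%a.out

# Use the array task ID to pick one line of jobs.txt; sed's line
# numbering is 1-based, matching an array that starts at 1.
job=$(sed -n "${SLURM_ARRAY_TASK_ID}p" jobs.txt)

# Extract the output folder (the value after --out) and skip the job if
# results.json is already there -- train.py is assumed to write it on
# successful completion.
out=$(echo "${job}" | sed -n 's/.*--out \([^ ]*\).*/\1/p')
if [ -f "${out}/results.json" ]; then
  echo "Skipping finished job: ${job}"
  exit 0
fi

eval "${job}"
```

Because finished jobs are detected and skipped, you can extend the grid in `generate_jobs.sh`, regenerate `jobs.txt`, and resubmit: only the new (or previously crashed) combinations actually run.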