Published 9/8/2023
Quick & Helpful Slurm Commands
A quick guide to using Slurm for distributed machine learning.
In the lab I work in, we have access to a High Performance Computing (HPC) environment that uses the Slurm Workload Manager.
I've been using it for a while now, and I've found a few commands that I use all the time. I thought I'd share them here in case they're useful to anyone else.
Checking Job Status
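To check on jobs, use `squeue`. Here are a few forms I find handy (these are standard Slurm flags, though the output columns you see may vary by cluster):

```bash
# List every job currently in the queue
squeue

# List only your own jobs
squeue -u $USER

# Show estimated start times for your pending jobs
squeue -u $USER --start
```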
Cancelling Jobs
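To kill a job, use `scancel` with a job ID (which you can grab from `squeue`). The ID below is just a placeholder:

```bash
# Cancel a specific job by ID (123456 is a placeholder)
scancel 123456

# Cancel all of your jobs at once
scancel -u $USER
```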
Requesting a Node Interactively
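When you want a shell on a compute node (to debug, or to poke at GPUs directly), `salloc` requests an allocation and drops you into a shell holding it; from there, `srun` runs commands on the allocated node. The resource flags below are examples only, so adjust them (and any partition or account options) for your cluster:

```bash
# Ask for 1 node with 4 CPUs and 1 GPU for 2 hours
# (example values -- GPU syntax and partition names vary by cluster)
salloc --nodes=1 --cpus-per-task=4 --gres=gpu:1 --time=02:00:00

# Once the allocation is granted, open an interactive shell on the node
srun --pty bash
```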
Submitting a Job
What if all of your compute nodes are allocated, or you don't want your job to exit as soon as your terminal connection closes? In that case, you can use `sbatch` to submit a job to the queue; it will run automatically as soon as Slurm can allocate the resources.
This will take slightly more setup. Assume that the job we actually want to run is contained in `myjob.sh`. In order to submit that script as a job, we'll first create a Bash script that will be run by Slurm. Let's call it `run.sh`:
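```bash
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1              # example resource requests --
#SBATCH --gres=gpu:1           # adjust these for your cluster
#SBATCH --time=02:00:00
#SBATCH --output=slurm-%j.out  # %j is replaced with the job ID

srun bash myjob.sh
```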
Note that we're using `#SBATCH` directives to pass in the parameters that we would have passed to `salloc` before. We're also using `srun` to run our actual job; it will handle running the script across multiple nodes, if we so desire.
Finally, to launch our script, we'll run:
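```bash
sbatch run.sh
```

Slurm will print the ID of the submitted job. By default, the job's output is written to `slurm-<jobid>.out` (or wherever `--output` points), and you can keep an eye on its progress with `squeue`.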
Conclusion
That's it! I hope this was helpful. If you have any questions, you can ask ChatGPT or Bard (they'll give either incredibly helpful or completely incorrect answers, but it's worth a shot!).
You can also look through the Slurm documentation or Leo's Notes page on Slurm for more information.
If you liked this article, don't forget to share it and follow me at @nebrelbug on X!