Published 9/11/2023
Enroot on Slurm for Distributed ML: Part 2
How to use Enroot on Slurm for containerized multi-node training.
_UPDATE 2024: I no longer recommend this method and have experienced several issues with it. Instead, I recommend using Pyxis, a tool developed by NVIDIA that simplifies the process of running containers on HPC systems._
This is part 2 of a 2-part series. Part 1 is available here.
In part 1, we covered how to use Enroot on Slurm for containerized single-node training using `salloc`. In this post, we'll cover how to use Enroot on Slurm for containerized multi-node training, and transition to using `sbatch`.
Step 1. Slurm Launch Script
We'll end up creating several Bash files, all of which should be in the same directory as your training script. The first will be a Slurm launch file that we'll run with `sbatch`. This file will contain the same commands we ran with `salloc` in part 1, but declared using `#SBATCH` directives.
launch.sh
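The exact contents depend on your cluster, but a minimal sketch looks something like this (the job name, node count, and GPU count are placeholders; adjust them for your setup):

```bash
#!/bin/bash
#SBATCH --job-name=multinode-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Remember where sbatch was run so the other scripts can find the training directory.
export CUR_DIR=$(pwd)

# Run stage1.sh once per node; Slurm forwards the environment to srun.
srun "$CUR_DIR/stage1.sh"
```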
Note that we create a variable `CUR_DIR` to store the current working directory (the directory where the `sbatch` command was run). I use this variable to share the location of my training directory between scripts so I don't have to hard-code paths, but it's not required.
Slurm will automatically pass local environment variables through to the `srun` command, which will run the `stage1.sh` script on each node.
Step 2. Enroot Launch Script
Next, we'll create a script that will be run on each node. This script will be responsible for launching the container and running the training script. We'll call this script `stage1.sh`.
stage1.sh
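Here's a minimal sketch, assuming you've already created an Enroot container (the name `pytorch` and the port number below are placeholders):

```bash
#!/bin/bash

# Use the first node in the allocation as the rendezvous host for PyTorch.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500  # any free port

# Start the container, mount the training directory, and forward the
# coordination variables. "pytorch" is a placeholder container name.
enroot start \
  --rw \
  --mount "$CUR_DIR:$CUR_DIR" \
  --env CUR_DIR \
  --env MASTER_ADDR \
  --env MASTER_PORT \
  --env SLURM_NNODES \
  --env SLURM_NODEID \
  pytorch \
  bash "$CUR_DIR/stage2.sh"
```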
Note that we pass several important environment variables provided by Slurm, along with `CUR_DIR`, into the container. The `MASTER_ADDR` and `MASTER_PORT` variables are used by PyTorch's distributed training backend to coordinate communication between nodes.
We also mount a local file path into the container (make sure it contains your training script!).
Step 3. Training Script
Finally, we'll create a training script that will be run inside the container. We'll call this script `stage2.sh`.
stage2.sh
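A sketch of what this can look like with accelerate (`train.py` and the per-node GPU count of 8 are placeholders for your own script and hardware):

```bash
#!/bin/bash
cd "$CUR_DIR"

# Launch one process per GPU across all nodes. train.py stands in for
# your actual training script; 8 is the number of GPUs per node.
accelerate launch \
  --config_file accelerate_config.yaml \
  --num_machines "$SLURM_NNODES" \
  --machine_rank "$SLURM_NODEID" \
  --main_process_ip "$MASTER_ADDR" \
  --main_process_port "$MASTER_PORT" \
  --num_processes $((SLURM_NNODES * 8)) \
  train.py
```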
Here I've used accelerate as a launcher for my distributed training script, but you can use whatever launcher you want. Just make sure you pass relevant environment variables through!
For the sake of completeness, here's my `accelerate_config.yaml` file. It utilizes FSDP (Fully Sharded Data Parallel) to split model parameters and gradients across processes. This is a great way to train large models that won't fit on just one GPU.
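Something along these lines (exact key names vary between accelerate versions, and the machine/process counts are overridden by the launch flags above):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1 # FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0 # overridden by --machine_rank
main_training_function: main
num_machines: 2 # overridden by --num_machines
num_processes: 16 # overridden by --num_processes
use_cpu: false
```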
Step 4. Submit the Job
Now that we've created all the necessary scripts, we can submit the job to Slurm using `sbatch`! From the directory containing the scripts, run:
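```bash
sbatch launch.sh
```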
Your job will be submitted to Slurm and run as soon as resources are available. Output logs will be stored at `slurm-<jobid>.out` in the current directory.
Conclusion
I hope this was helpful! There are many parts involved in getting distributed training working, but it's not too difficult once you get over the initial learning curve.
If you liked this article, don't forget to share it and follow me at @nebrelbug on X!