- containerizing the job environment: apptainer is recommended by the Hyak team, both for speeding up python startup time and for reproducibility
- copying frequently used data to the /tmp dir on the node: /tmp, as described by the Hyak team, has around 400GB of isolated fast SSD storage, and loading/saving data there won't affect others' jobs or slow down Hyak (the job.sh example below copies training data to /tmp before launching)
use salloc to create an interactive session, e.g. salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1
then module load apptainer
put the following content into a file called app.def (see the official docs for more info):
Bootstrap: docker
From: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04
%setup
echo "SETUP"
echo "$HOME"
echo `pwd`
%files
/tmp/requirements.txt
%post
# Downloads the latest package lists (important).
apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-setuptools
# set python3 to be default python
update-alternatives --install /usr/bin/python python /usr/bin/python3 1
# Install Python modules.
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install -r /tmp/requirements.txt
# Reduce the size of the image by deleting the package lists we downloaded,
# which are useless now.
apt-get autoremove -y && \
apt-get clean && \
rm -rf /root/.cache && \
rm -rf /var/lib/apt/lists/*
# see if torch works
python -m torch.utils.collect_env
python -c 'import site; print(site.getsitepackages())'
%environment
export PATH="$HOME/.local/bin:$PATH"
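the %files section above copies /tmp/requirements.txt from the host into the same path inside the image, so create that file on the host first; the packages below are just placeholders for whatever your project actually needs:
numpy
tqdm
transformers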
then apptainer build --nv --fakeroot /tmp/app.sif app.def (--nv exposes the GPU environment, --fakeroot avoids needing root permission)
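note that the image was written to /tmp/app.sif, which is node-local scratch; the commands below assume an app.sif in your working directory, so copy it somewhere persistent, e.g.:
cp /tmp/app.sif ./app.sif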
now you can shell into the container to finish your setup, like installing python packages and running setup commands:
apptainer shell -B $(pwd) --nv app.sif
(-B bind-mounts the given folder into the container so you can read/write files inside it.) once the environment setup is done, just exit from the container.
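for example, a quick session inside the container might look like this (the extra package is a hypothetical placeholder; note that --user installs go to $HOME/.local on regular storage, not into the image):
Apptainer> python -c "import torch; print(torch.cuda.is_available())"
Apptainer> pip install --user einops
Apptainer> exit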
next time you can just exec the container to run your jobs, e.g. to install a pkg from source:
apptainer exec -B $(pwd) --nv app.sif pip install -e .
or to run python jobs:
apptainer exec -B $(pwd) --nv app.sif python xxx.py --args xxx
putting these pieces together, an example run script (job.sh) for a distributed training job:
#!/bin/bash
module load apptainer
# copy frequently accessed data to /tmp dir, it's much faster!
mkdir -p /tmp/data
cp -r data/training* /tmp/data
n_gpus=$(nvidia-smi --list-gpus | wc -l)
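# checkpoint name; '/' is replaced with '_' below when building the log file name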
ckpt=ckpt/fancy_model
# fire up a distributed training job
apptainer exec -B $(pwd) --nv app.sif torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc_per_node ${n_gpus} cli/run.py \
task=xxx \
task.image_db_dir=/tmp/data \
num_workers=$(nproc) \
train.ckpt=${ckpt} \
train.epochs=xx \
train.scheduler.warmup_ratio=0.01 \
train.optimizer.learning_rate=1e-4 \
2>&1 | tee data/${ckpt//\//_}.log
job.sh above is launched via srun from the following sbatch submission script:
#!/bin/bash
#SBATCH --job-name=job
#SBATCH --output=data/slurm/job.out
#SBATCH --error=data/slurm/job.err
#SBATCH --time=1-00:00
#SBATCH --account=cse
#SBATCH --partition=gpu-rtx6k
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --signal=B:TERM@120
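# exclude a specific node (e.g. one that has been giving trouble)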
#SBATCH --exclude=g3007
# log the path of this sbatch script ($0)
echo "sbatch script: $0"
echo "======>start job...."
echo
echo "$(date): job $SLURM_JOBID starting on $SLURM_NODELIST"
srun job.sh
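assuming the sbatch script above is saved as, say, submit.sh (the name is arbitrary), submit and monitor it with something like:
mkdir -p data/slurm           # slurm won't create the output dir for you
sbatch submit.sh
squeue -u $USER               # check that the job is queued/running
tail -f data/slurm/job.out    # follow the job's stdout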
- possible home path permission issues (e.g. you modified the default HOME path): you can prepend the HOME var to the apptainer build command, e.g. HOME=/mmfs1/home/$USER/ apptainer build --nv --fakeroot /tmp/app.sif app.def
- docker source (e.g. you need to compile CUDA src): try to find a suitable base image at https://hub.docker.com/r/nvidia/cuda/tags or https://catalog.ngc.nvidia.com/containers (see the sketch after this list)
- python package location: /usr/local/lib/python3.8/ (in fact, installing packages there at build time is the key to avoiding loading/importing python modules from regular storage such as /mmfs1/home/$USER/.local)
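if you do need to compile CUDA sources inside the container (the docker source note above), one option is to swap the runtime base image in app.def for the matching devel tag, which ships nvcc; the exact tag is an assumption, so check that it exists on the registries listed above:
From: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
and to confirm that python packages really resolve to /usr/local/lib/python3.8/ inside the image rather than to /mmfs1/home/$USER/.local, a quick check:
apptainer exec --nv app.sif python -c "import torch; print(torch.__file__)"
# expect a path under /usr/local/lib/python3.8/, not under $HOME/.local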