- containerizing the job environment: apptainer is recommended by the Hyak team, both for speeding up python startup time and for reproducibility
- copying frequently used data to the /tmp dir on the node: /tmp, as described by the Hyak team, has around 400GB of isolated fast SSD storage, and loading/saving data there won't affect others' jobs or slow down Hyak (the job.sh example below copies training data to /tmp before launching)
use salloc to create an interactive session, e.g. salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1
then module load apptainer
put the following content into a file called app.def (see the official docs for more info):
Bootstrap: docker
From: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04
%setup
echo "SETUP"
echo "$HOME"
echo `pwd`
%files
/tmp/requirements.txt
%post
# Downloads the latest package lists (important).
apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-setuptools
# set python3 to be default python
update-alternatives --install /usr/bin/python python /usr/bin/python3 1
# Install Python modules.
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install -r /tmp/requirements.txt
# Reduce the size of the image by deleting the package lists we downloaded,
# which are useless now.
apt-get autoremove -y && \
apt-get clean && \
rm -rf /root/.cache && \
rm -rf /var/lib/apt/lists/*
# see if torch works
python -m torch.utils.collect_env
python -c 'import site; print(site.getsitepackages())'
%environment
export PATH="$HOME/.local/bin:$PATH"
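the %files section above copies /tmp/requirements.txt from the host into the same path inside the image, so create that file on the host first; the packages below are just placeholders for whatever your project actually needs:
numpy
tqdm
transformers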
then apptainer build --nv --fakeroot /tmp/app.sif app.def (--nv exposes the GPU environment, --fakeroot avoids needing root permission)
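note that the image was written to /tmp/app.sif, which is node-local scratch; the commands below assume an app.sif in your working directory, so copy it somewhere persistent, e.g.:
cp /tmp/app.sif ./app.sif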
now you can shell into the container to finish your setup, like installing python packages and running setup commands:
apptainer shell -B $(pwd) --nv app.sif
(-B bind-mounts the given folder into the container so you can read/write files inside it.) once the environment setup is done, just exit from the container.
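for example, a quick session inside the container might look like this (the extra package is a hypothetical placeholder; note that --user installs go to $HOME/.local on regular storage, not into the image):
Apptainer> python -c "import torch; print(torch.cuda.is_available())"
Apptainer> pip install --user einops
Apptainer> exit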
next time you can just exec the container to run your jobs, e.g. to install a pkg from source:
apptainer exec -B $(pwd) --nv app.sif pip install -e .
or to run python jobs:
apptainer exec -B $(pwd) --nv app.sif python xxx.py --args xxx
putting these pieces together, an example run script (job.sh) for a distributed training job:
#!/bin/bash
module load apptainer
# copy frequently accessed data to /tmp dir, it's much faster!
mkdir -p /tmp/data
cp -r data/training* /tmp/data
n_gpus=$(nvidia-smi --list-gpus | wc -l)
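# checkpoint name; '/' is replaced with '_' below when building the log file name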
ckpt=ckpt/fancy_model
# fire up a distributed training job
apptainer exec -B $(pwd) --nv app.sif torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc_per_node ${n_gpus} cli/run.py \
task=xxx \
task.image_db_dir=/tmp/data \
num_workers=$(nproc) \
train.ckpt=${ckpt} \
train.epochs=xx \
train.scheduler.warmup_ratio=0.01 \
train.optimizer.learning_rate=1e-4 \
2>&1 | tee data/${ckpt//\//_}.log
job.sh above is launched via srun from the following sbatch submission script:
#!/bin/bash
#SBATCH --job-name=job
#SBATCH --output=data/slurm/job.out
#SBATCH --error=data/slurm/job.err
#SBATCH --time=1-00:00
#SBATCH --account=cse
#SBATCH --partition=gpu-rtx6k
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --signal=B:TERM@120
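# exclude a specific node (e.g. one that has been giving trouble)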
#SBATCH --exclude=g3007
# log the path of this sbatch script ($0)
echo "sbatch script: $0"
echo "======>start job...."
echo
echo "$(date): job $SLURM_JOBID starting on $SLURM_NODELIST"
srun job.sh
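assuming the sbatch script above is saved as, say, submit.sh (the name is arbitrary), submit and monitor it with something like:
mkdir -p data/slurm           # slurm won't create the output dir for you
sbatch submit.sh
squeue -u $USER               # check that the job is queued/running
tail -f data/slurm/job.out    # follow the job's stdout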
- possible home path permission issues (e.g. you modified the default HOME path): you can prepend the HOME var to the apptainer build command, e.g. HOME=/mmfs1/home/$USER/ apptainer build --nv --fakeroot /tmp/app.sif app.def
- docker source (e.g. you need to compile CUDA src): try to find a suitable base image at https://hub.docker.com/r/nvidia/cuda/tags or https://catalog.ngc.nvidia.com/containers (see the sketch after this list)
- python package location: /usr/local/lib/python3.8/ (in fact, installing packages there at build time is the key to avoiding loading/importing python modules from regular storage such as /mmfs1/home/$USER/.local)
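if you do need to compile CUDA sources inside the container (the docker source note above), one option is to swap the runtime base image in app.def for the matching devel tag, which ships nvcc; the exact tag is an assumption, so check that it exists on the registries listed above:
From: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
and to confirm that python packages really resolve to /usr/local/lib/python3.8/ inside the image rather than to /mmfs1/home/$USER/.local, a quick check:
apptainer exec --nv app.sif python -c "import torch; print(torch.__file__)"
# expect a path under /usr/local/lib/python3.8/, not under $HOME/.local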