Hsiao_Lab COMET onboarding
TECHNICAL DETAILS
The most important specification at this time is that the COMET GPUs have the nvidia 367.48 driver installed. As a result, we are restricted to versions of CUDA <= 8.0. In turn, we are restricted to using tensorflow <= 1.4.1 due to using CUDA 8.0. These drivers may change over time, but thankfully, adapting is as trivial as updating the Dockerfile and rebuilding the container.
It should be noted that since the cluster does not have Docker, we need to generate and push Docker images locally, then pull them on the cluster. A bit convoluted, but not too bad.
Also, tensorflow 1.4.1 requires cuDNN >= 6.
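If you want to double-check what a GPU node is actually running, the standard NVIDIA tools report both numbers; this is just a sanity check against the constraints above (on COMET you may need a GPU allocation, and possibly a cuda module loaded, for nvcc to be on your PATH outside the container):
>nvidia-smi   # reports the installed driver version (expected 367.48 at the time of writing)
>nvcc -V      # reports the CUDA toolkit version (expected <= 8.0)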
HOW-TO
Refer to https://docs.docker.com/engine/reference/builder/ for syntax.
Based heavily on https://github.com/floydhub/dl-docker
The first file we need to define is our Dockerfile. Make a file named Dockerfile (no extension) in atom/gedit/etc and fill it with something like this (lines starting with # are comments):
###
# Use a baseline to build from, in this case a Linux OS with CUDA 8.0 and cuDNN 6 from the nvidia repo
FROM nvidia/cuda:8.0-cudnn6-devel
# Run the following commands as you would from Linux; install major programs/dependencies
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    git
# Check to make sure we have the right version of CUDA pulled
RUN nvcc -V
# Install tensorflow_gpu-1.4.1
ENV TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl
RUN pip3 install --ignore-installed --upgrade $TF_BINARY_URL
# Install Keras
RUN pip3 install keras
# Expose the TensorBoard port (6006)
EXPOSE 6006
# Set LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs:/usr/local/nvidia/lib64"
# Define the working directory; you will probably have to change this
WORKDIR "/home/evan/testing"
# Define the command to run on startup; generally leave this alone
CMD ["/bin/bash"]
###
Once we have our Dockerfile written, we need to build the image. First make sure we are in the same directory as the Dockerfile.
(Install nvidia-docker with apt-get if you haven't already; it might work with vanilla docker, but why risk it? nvidia-docker is docker modified by nvidia to use GPUs.)
Now, run the following command (explanation below):
>nvidia-docker build -t evmasuta/kerasflow:03 -f Dockerfile .
The -t flag is what the image will be named. Since we need to push this to the Docker cloud, we need to specify the user account to which we will push (evmasuta in my case), followed by a name for our image (kerasflow here), followed by a version tag after the colon (here I used 03 to denote version 3, but you can use numbers or strings, so it's up to you). The -f option gives the name of the Dockerfile to use, and the trailing period says to look/work in the current directory.
It will likely take a few minutes to run this, so grab coffee or trigger Kevin or something in the meantime.
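Before pushing, it can be worth a quick smoke test of the freshly built image on your local machine (assuming it has an NVIDIA GPU and driver installed); printing the TensorFlow version is usually enough to catch a broken install:
>nvidia-docker run --rm evmasuta/kerasflow:03 python3 -c "import tensorflow as tf; print(tf.__version__)"
This should print 1.4.1; if the import fails or complains about CUDA libraries, fix the Dockerfile before uploading.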
Once finished, run the following command to upload our docker image to the cloud:
>docker image push evmasuta/kerasflow:03
Make sure the name of the image matches the previous command's tag exactly! Also, you may need to log in to docker to do this; if so, simply run:
>docker login
Uploading will take quite a bit of time.
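If you lose track of which tags you have built locally, listing them is harmless, and an already-built image can be re-tagged under the account/name you intend to push (both are standard docker commands; the old name below is just an illustrative placeholder):
>docker images evmasuta/kerasflow
>docker tag some_old_name:latest evmasuta/kerasflow:03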
Once everything is uploaded, log in to SDSC COMET via terminal. For the purposes of this example case, we will run an interactive job on the shared GPU resource; this is easily modified to run as an automated slurm script (a batch-script sketch is given after the summary code below). For now, however, request a GPU with the following on the COMET terminal:
>srun --pty --gres=gpu:k80:1 --partition=gpu-shared --time=01:00:00 -u bash -i
The above requests an interactive session on a shared K80 for 1 hour of time, which is plenty for our purposes.
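If the shell does not come back right away, the request is probably waiting in the queue; from another COMET login session you can check on it with the usual Slurm tools:
>squeue -u $USER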
Once the resource has been allocated, load the Singularity module:
>module load singularity
Now, we need to construct a Singularity container from our uploaded Docker image, making sure we pull the correct image!
>singularity pull docker://evmasuta/kerasflow:03
The output will be a rather large .img file in the current directory; in this case we will find a kerasflow-03.img file.
To download the sample MNIST script, run the following:
>wget https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py
Now comes the fun part. We boot up our kerasflow container in Singularity:
>singularity shell --nv kerasflow-03.img
The --nv option enables GPU support and, more specifically, tells Singularity to rely on the cluster's version of the nvidia driver (e.g. 367.48).
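Before kicking off a long job, a quick check from inside the Singularity shell that TensorFlow can actually see the allocated GPU can save some head-scratching; the device_lib call below is part of TensorFlow 1.x, and if the --nv plumbing worked, the printed device list should include a GPU entry:
>python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"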
We then run our MNIST sample case as we normally would:
>python3 convolutional.py
With any luck, python3 will detect the allocated GPU(s) and proceed to flood you with glorious MNIST results.
If we do not wish to run an interactive Singularity session, we can simply do:
>singularity exec --nv kerasflow-03.img python3 convolutional.py
Note we swapped out shell for exec here since we are not running an interactive case.
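If your data or scripts live somewhere the container cannot see by default, Singularity's bind option maps a host directory into the container; the paths below are purely illustrative placeholders, so substitute your own scratch/project locations:
>singularity exec --nv -B /path/to/host/data:/data kerasflow-03.img python3 convolutional.py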
We can exit Singularity with:
>exit
Note that if you exit one too many times, it will log you out of your GPU allocation, so be careful!
SUMMARY CODE
###
localdocker.sh (following edits to Dockerfile and login to docker):
nvidia-docker build -t evmasuta/kerasflow:03 -f Dockerfile .
docker image push evmasuta/kerasflow:03
###
serversingularity.sh (following login to interactive GPU session):
module load singularity
singularity pull docker://evmasuta/kerasflow:03
singularity exec --nv kerasflow-03.img python3 convolutional.py
exit
###
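As mentioned earlier, the interactive srun session is easy to convert into a non-interactive batch job. A minimal sketch is below (submit with sbatch); the job name, time limit, and file locations are placeholders to adjust, and it assumes kerasflow-03.img and convolutional.py already sit in the submission directory:
###
comet_gpu.sb (submit with >sbatch comet_gpu.sb):
#!/bin/bash
#SBATCH --job-name=mnist_test
#SBATCH --partition=gpu-shared
#SBATCH --gres=gpu:k80:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00

# Load Singularity and run the MNIST example inside the container (non-interactive, as above)
module load singularity
singularity exec --nv kerasflow-03.img python3 convolutional.py
###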