Hsiao_Lab COMET onboarding
By: Evan Masutani
As of 11 May 2018, the Hsiao lab has been granted computational time on XSEDE/SDSC (San Diego Supercomputer Center), specifically
on the COMET gpu-shared resource, which currently boasts a number of K80s and P100s (see https://portal.xsede.org/sdsc-comet).
Many projects in the Hsiao lab apply machine learning to medical problems, specifically in the realm of imaging. At the time of
writing, the preferred framework developed in-house uses Keras with a tensorflow-gpu backend. That many dependencies are required
is an understatement. Furthermore, for reproducibility's sake and the ability to run lab software across multiple platforms
(notably transferring load from lab workstations to the supercomputer cluster), it became necessary to seamlessly and intuitively
port both custom software and associated dependencies between servers.
Experienced users can skip the following section (BACKGROUND) and proceed to TECHNICAL DETAILS.
BACKGROUND
One solution for porting software across multiple systems with varying architecture is Docker. The general concept of Docker is
to take a snapshot of a workstation's environment. Generally, the snapshot will be a pared-down version; it is common for
Docker snapshots (images) to contain the OS (usually Linux) and the minimum number of files required to execute a user-specified
application. For example, if one wished to run a python script to label hand-drawn images of numbers with the matching
numerical value (the MNIST example), the corresponding Docker image would require a Linux OS, CUDA, cuDNN, python3, python3-pip,
git, and tensorflow, along with the corresponding dependencies managed by apt-get and pip3. We specify these requirements in a
simple Dockerfile and have Docker build a container (a detailed explanation is given here: https://www.docker.com/what-container,
but the practical idea is that containers are somewhat similar to virtual machines but exist as a discrete application rather
than an entirely separate server and are therefore 'skinnier' and more efficient).
One issue with Docker is that users generally require superuser privileges, which makes usage on shared resources (e.g. SDSC)
rather difficult, to say the least. To get around this, SDSC systems use Singularity, which is sort of like a Docker that non-
privileged users can use to deploy their applications. In fact, Singularity can use Docker images, which is what we do in our
use case. A detailed comparison of Singularity and Docker can be found here: https://www.melbournebioinformatics.org.au/documentation/running_jobs/singularity/#docker-vs-singularity
If there are other questions regarding background, ask Kevin. Not Evan.
TECHNICAL DETAILS
The most important specification at this time is that COMET GPUs have the nvidia 367.48 driver installed. As a result, we are
restricted to CUDA <= 8.0, and in turn to tensorflow <= 1.4.1. These drivers may change over time, but thankfully, adapting is as
trivial as updating the Dockerfile and rebuilding the container. It should be noted that since the cluster does not have Docker,
we need to build and push Docker images locally, then pull them on the cluster. A bit convoluted, but not too bad.
Also, tensorflow 1.4.1 requires cuDNN >= 6.
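If you want to confirm which driver is actually installed before building anything, nvidia-smi (assuming it is available on an
allocated GPU node) reports the driver version in its header:
>nvidia-smi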
HOW-TO
Refer to https://docs.docker.com/engine/reference/builder/ for syntax.
Based heavily on https://github.com/floydhub/dl-docker
The first file we need to define is our Dockerfile. Make a file named Dockerfile (no extension) in atom/gedit/etc. and fill it
with something like this (lines beginning with # are comments):
###
# Use a baseline to build from, in this case a linux OS with CUDA 8.0 and cuDNN 6 from the nvidia repo
FROM nvidia/cuda:8.0-cudnn6-devel
# Run the following commands as you would from Linux; install major programs/dependencies
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    git
# Check to make sure we have the right version of cuda pulled:
RUN nvcc -V
# Install tensorflow_gpu-1.4.1
ENV TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl
RUN pip3 install --ignore-installed --upgrade $TF_BINARY_URL
# Install Keras
RUN pip3 install keras
# Expose the TensorBoard port (6006)
EXPOSE 6006
# Set LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs:/usr/local/nvidia/lib64"
# Define the working directory; you will probably have to change this
WORKDIR "/home/evan/testing"
# Define the command to run on startup; generally leave this alone
CMD ["/bin/bash"]
###
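If the lab code needs additional Python packages beyond Keras and tensorflow, one way to handle them (a sketch only, not part of
the current lab Dockerfile; the file name requirements.txt is hypothetical) is to copy a requirements file into the image and
install from it:
###
# Hypothetical extension: install extra Python dependencies listed in a requirements file
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
###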
Once we have written our Dockerfile, we need to build an image from it. First make sure we are in the same directory as the
Dockerfile. (Install nvidia-docker with apt-get if you haven't already; it might work with vanilla docker, but why risk it?
nvidia-docker is docker modified by nvidia to use GPUs.)
Now, run the following command (explanation below):
>nvidia-docker build -t evmasuta/kerasflow:03 -f Dockerfile .
The -t flag is the name the image will be tagged with. Since we need to push this to the Docker cloud, we specify the user
account to which we will push (evmasuta in my case), followed by a name for our image (kerasflow here), followed by a version
tag after the colon (here I used 03 to denote version 3, but you can use numbers or strings, so it's up to you). The -f option
gives the name of the Dockerfile to use, and the period says to look/work in the current directory.
It will likely take a few minutes to run this, so grab coffee or trigger Kevin or something in the meantime.
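Once the build finishes, you can confirm the image exists under the expected tag by listing local images (a standard Docker
command):
>docker images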
To upload our Docker image to the cloud, run:
>docker image push evmasuta/kerasflow:03
Make sure the name of the image matches the previous command's tag exactly! Also, you may need to log in to Docker to do this;
if so, simply run:
>docker login
Uploading will take quite a bit of time.
Once everything is uploaded, log in to SDSC COMET via terminal. For the purposes of this example, we will run an interactive job
on the gpu-shared resource; this can easily be modified to run as an automated SLURM script (a minimal batch sketch is given
after SUMMARY CODE). For now, however, request a GPU with the following on the COMET terminal:
>srun --pty --gres=gpu:k80:1 --partition=gpu-shared --time=01:00:00 -u bash -i
The above requests an interactive session on a shared K80 for 1 hour, which is plenty for our purposes.
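If the allocation does not come through immediately, the job's place in the queue can be checked with a standard SLURM command:
>squeue -u $USER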
Once the resource has been allocated, load the Singularity module:
>module load singularity
Now, we need to construct a Singularity container from our uploaded Docker image, making sure we pull the correct image:
>singularity pull docker://evmasuta/kerasflow:03
The output will be a rather large .img file in the current directory; in this case we will find a kerasflow-03.img file.
To download the sample MNIST script, run the following:
>wget https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py
Now comes the fun part. We boot up our kerasflow container in Singularity:
>singularity shell --nv kerasflow-03.img
The --nv option specifies that we are enabling GPU support and, more specifically, that we are relying on the cluster's
version of the nvidia driver (e.g. 367.48).
We then run our MNIST sample case as we normally would:
>python3 convolutional.py
With any luck, python3 will detect the allocated GPU(s) and proceed to flood you with glorious MNIST results.
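If you want to check explicitly that tensorflow sees the GPU from inside the container, a one-liner like the following (using
tensorflow's device_lib, which should list a GPU device alongside the CPU) can help:
>python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"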
If we do not wish to run an interactive Singularity session, we can simply do:
>singularity exec --nv kerasflow-03.img python3 convolutional.py
Note we swapped out shell for exec here since we are not running an interactive case.
We can exit Singularity with:
>exit
Note that if you exit one too many times, it will log you out of your GPU allocation, so be careful!
SUMMARY CODE
###
localdocker.sh (following edits to Dockerfile and login to Docker):
nvidia-docker build -t evmasuta/kerasflow:03 -f Dockerfile .
docker image push evmasuta/kerasflow:03
###
###
serversingularity.sh (following login to an interactive GPU session):
module load singularity
singularity pull docker://evmasuta/kerasflow:03
singularity exec --nv kerasflow-03.img python3 convolutional.py
exit
###
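As mentioned above, the interactive session can be replaced with an automated SLURM batch job. The following is only a minimal
sketch (the job name is hypothetical, and exact SBATCH options such as ntasks, memory, or an allocation/account flag may need
adjustment for COMET's queue policies):
###
#!/bin/bash
#SBATCH --job-name=mnist-comet        # hypothetical job name
#SBATCH --partition=gpu-shared
#SBATCH --gres=gpu:k80:1
#SBATCH --time=01:00:00

module load singularity
singularity exec --nv kerasflow-03.img python3 convolutional.py
###
Submit the script with sbatch (e.g. >sbatch batchjob.sh, where batchjob.sh is whatever you name the file).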