Linux bvhp 5.14.0-1042-oem #47-Ubuntu SMP Fri Jun 3 18:17:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
Base env: conda create -n jax python=3.10
Docs: https://github.com/google/jax#installation
- (CPU)
  pip install --upgrade "jax[cpu]"
- (GPU)
  pip install --upgrade "jax[cuda]"
Both CPU and GPU pip installs were very quick and worked first try without any issues.
JAX has a quickstart page here:
https://jax.readthedocs.io/en/latest/notebooks/quickstart.html
All examples from the quickstart (covering jit, grad, and vmap) executed properly and showed expected results on the first try using the GPU package.
A more advanced tutorial covers convolutions:
https://jax.readthedocs.io/en/latest/notebooks/convolutions.html
In this chapter I did run into some issues running examples out of the box. Running the first example resulted in the error "Unimplemented: DNN library is not found". Some searching led to this issue. The solution (conda install cudnn) was simple, but the cuDNN requirement was not mentioned on the page.
After this hiccup, all the examples ran without issue and produced expected output (including plots) on both Linux (GPU) and MacOS (CPU-only).
Another more advanced tutorial, for jax.Array, is located here:
https://jax.readthedocs.io/en/latest/notebooks/jax_Array.html
These examples assume 8 devices, but my local machine only has 2. However, with minimal updates to function parameters to account for this, all examples worked out of the box on both Linux (GPU) and MacOS (CPU-only).
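The kind of parameter change involved can be sketched as follows. This is a minimal illustration, not the tutorial's own code: mesh_shape is a hypothetical helper that picks the most square two-dimensional grid for however many devices are present, so an 8-device layout and a 2-device workstation can share the same script.

```python
import math


def mesh_shape(n_devices: int) -> tuple:
    """Return the most square (rows, cols) grid with rows * cols == n_devices."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    cols = math.isqrt(n_devices)
    # Walk down from the integer square root to the nearest divisor.
    while n_devices % cols:
        cols -= 1
    return n_devices // cols, cols


# Device counts mentioned in these notes: 8 (tutorial) and 2 (local machine).
for n in (8, 2, 1):
    print(n, mesh_shape(n))
```

In a JAX session the argument would presumably be jax.device_count(), with the result passed wherever the tutorial hard-codes its 8-device mesh shape.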
https://jax.readthedocs.io/en/latest/multi_process.html
Furthermore, you must manually run your JAX program on each host! JAX doesn’t automatically start multiple processes from a single program invocation.
I was not able to successfully run jax.distributed.initialize; it seemed to hang no matter what parameters were given, and eventually errored with:
XlaRuntimeError: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: PjRT_Client_Connect
After some consultation, it was suggested that jax.distributed.initialize does not work in the standard REPL, despite that being what is demonstrated on the docs page above. The small example below was successful:
# mpirun -np 2 python script.py
import jax
import platform
import jax.numpy as jnp
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
nproc = MPI.COMM_WORLD.Get_size()
if rank == 0:
    hostname = platform.node()
    MPI.COMM_WORLD.bcast(hostname, root=0)
else:
    hostname = MPI.COMM_WORLD.bcast(None, root=0)
addr = f"{hostname}:1234"
jax.distributed.initialize(addr, nproc, rank, local_device_ids=rank)
print(jax.device_count())
print(jax.local_device_count())
xs = jax.numpy.ones(jax.local_device_count())
print(jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs))
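When initialization hangs and then times out as described above, one cheap sanity check is whether the coordinator address is reachable over plain TCP at all. The sketch below uses only the standard library; coordinator_reachable is a hypothetical helper name, and the local listener is a stand-in for a real coordinator so the snippet runs anywhere. A successful connection does not prove JAX will initialize, it only rules out basic network and firewall problems.

```python
import socket


def coordinator_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Stand-in listener on an ephemeral local port (not a real JAX coordinator).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(coordinator_reachable("127.0.0.1", port))
```

On a real run, every non-zero rank would call this with the hostname and port that rank 0 broadcast (port 1234 in the example above) before calling jax.distributed.initialize.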
Docs: https://jax.readthedocs.io/en/latest/profiling.html
with jax.profiler.trace("/tmp/jax-trace", create_perfetto_link=True):
    # Run the operations to be profiled
    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (5000, 5000))
    y = x @ x
    y.block_until_ready()
This method pops up a UI to display the trace, but it blocks execution until the UI is used to resume it.
NOTE: I was not able to get this method working due to this issue jax-ml/jax#13009
More explicit start/stop API. Some issues w/ warnings about missing dependencies, and the Tensorboard UI also seemed a bit flaky.
Docs mention Nsight is also an option, but only link to nsight docs (no examples, etc.)
For single process, more or less instant/immediate after the install was completed. This was true on OSX (cpu-only) as well as Linux (GPU).
The one hiccup was for multi-node execution where some research was required to find a complete working example.
The JAX docs are very nice. While there is a plain API reference, the bulk of the docs are narrative and example-based topical user guide chapters. The prevalence of easily copy-pastable code blocks with expected outputs following helps users develop confidence that things are working as they progress through a chapter.
The docs also address valuable practical questions directly:
How can we be sure it’s actually running in parallel? We can do a simple timing experiment:
and, in general, provide discussions of applying JAX to practical use-cases.
The docs also make experimentation accessible by providing Google Colab links to notebooks for each chapter (sign-in required).
Where applicable, the docs also include nice plots.
It's also worth noting that JAX uses rich to afford very nice TUI output for visualizing things like device shardings:
z = jnp.sin(y)
jax.debug.visualize_array_sharding(z)
┌──────────┬──────────┐
│ TPU 0 │ TPU 1 │
├──────────┼──────────┤
│ TPU 2 │ TPU 3 │
├──────────┼──────────┤
│ TPU 6 │ TPU 7 │
├──────────┼──────────┤
│ TPU 4 │ TPU 5 │
└──────────┴──────────┘
Base env: conda create -n cn python=3.10
Docs: https://nv-legate.github.io/cunumeric/22.03/README.html#installation
conda install -c nvidia -c conda-forge -c legate cunumeric
This worked first try, but was a slightly long/large install.
"Read on for general instructions on building cuNumeric from source" is small and easy to miss; it's not completely obvious that the listed dependencies are only for source installs.
Not clear from docs where to go to see a working example.
There are no docs or examples demonstrating how to profile that I could find at https://nv-legate.github.io/cunumeric/22.03/
There could be a demonstration of running a script with --profile enabled, the need to run python -m http.server on the output directory, and then a screenshot showing the expected output in a browser.
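As a rough idea of what the serving step in such a demonstration could look like, here is a standard-library sketch that does the same job as python -m http.server from inside Python. serve_directory is a hypothetical helper, and the temporary directory with a single index.html is a stand-in for whatever directory the profiler actually writes; the real output layout is not something I verified.

```python
import functools
import http.client
import http.server
import pathlib
import tempfile
import threading


def serve_directory(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP from a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Demo with a stand-in output directory (not real profiler output).
with tempfile.TemporaryDirectory() as out_dir:
    pathlib.Path(out_dir, "index.html").write_text("<h1>profile</h1>")
    server = serve_directory(out_dir)
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    conn.request("GET", "/index.html")
    print(conn.getresponse().status)
    conn.close()
    server.shutdown()
    server.server_close()
```

Port 0 asks the OS for any free port; a fixed port would be more convenient when the page is opened by hand in a browser.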
There is no quickstart [1] directly in the docs, nor any pointer to example scripts in the repo. There is one link to a CFD notebook in the Overview.
[1] There is a "quickstart repo", but it seems to be poorly named, as "Scripts for building Docker images and libraries from source" does not really seem like quickstart material.
There is no explicit discussion of how to run multi-node or multi-GPU, and almost no discussion of how to run legate. There is one note:
We encourage all users to familiarize themselves with these resource flags as described in the Legate Core documentation
but that note does not actually link to the Legate docs.
Base env: conda create -n rapids python=3.10
Docs: https://rapids.ai/start.html
Tried pip install first:
pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
Did not work:
rapids ❯ pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement cudf-cu11 (from versions: 0.0.1)
ERROR: No matching distribution found for cudf-cu11
Went back to conda (new env to replace base env):
conda create -n rapids-22.10 -c rapidsai -c conda-forge -c nvidia \
    rapids=22.10 python=3.9 cudatoolkit=11.5
This worked but was an enormous and long (several minutes) install.
Docs: https://docs.rapids.ai/start and https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html
"10 Minutes to cuDF and Dask-cuDF" had many code examples that could be copy-pasted directly into the REPL. All of the examples worked without issue, and they cover many Pandas features by way of direct comparison. A RHS navigation bar made it easy to jump to particular topics of interest.
There was a specific section covering Dask local CUDA cluster and observing GPU usage:
https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html#dask-performance-tips
Recommended to use Nsight. No information in main docs, but there is a blog post with detailed information:
Usage seems explicit via decorators that users can add:
import time
import nvtx

@nvtx.annotate("f()", color="purple")
def f():
    for i in range(5):
        with nvtx.annotate("loop", color="red"):
            time.sleep(i)
After some initial install hiccups, cudf and dask-dataframe were immediately runnable in the IPython REPL.
Base env: conda create -n cupy python=3.10
Docs: https://docs.cupy.dev/en/stable/install.html
pip install cupy-cuda11x
Quick/simple install worked first try.
Lots of inline code samples in several topical sections:
- Basics of CuPy
- User-Defined Kernels
- Accessing CUDA Functionalities
- Fast Fourier Transform with CuPy
- Memory Management
- Performance Best Practices
- Interoperability
- Differences between CuPy and NumPy
Code from these sections could generally be immediately run without issue after install. One exception was "Multi-GPU FFT" which is labeled experimental and seems to have a compat issue with latest scipy.
Context manager-based:
import cupyx.profiler

with cupyx.profiler.profile():
    # do something you want to measure
    pass
I assume this outputs traces that can be used by nvprof / nsys (which are both mentioned in passing), but I could not find any detailed or step-by-step demonstration of profiling.
Pretty much immediate, examples from the docs page could be run out of the box.
- went to create scratch space, learned I did not yet have a unix account, submitted to create one
Didn't get very far trying to use the quickstart before running into various issues.
-
trying https://gitlab-master.nvidia.com/legate/quickstart.internal#prometheus-nvidia
-
error with tcsh (default)
computelab-frontend-1:~/quickstart.internal> module load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi
CORRECT>modules load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi (y|n|e|a)? yes
/etc/modules: Permission denied.
-
error with bash
bvandeven@computelab-frontend-1:~/quickstart.internal$ module load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi
module: command not found
-
-
trying https://gitlab-master.nvidia.com/legate/quickstart.internal#building-and-using-docker-images
-
error with make_image.sh:
failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: permission denied
-
-
discovered need to follow "bare metal" instructions further down, not container instructions that are at the top
- issue here is that several steps are split over several different pages
- need one top to bottom set of steps for users to follow in order
- e.g. link to full docs for "creating conda envs" for context and further info, but give a reasonable specific conda command to execute right there inline to avoid disrupting flow.
-
was not running on scratch space, had to restart
-
Would be nice to advise how to check these versions
Make sure you use an environment file with a --ctk version matching the system-wide CUDA version (i.e. the version provided by the CUDA module you load). Most commonly on clusters you will want to use the system-provided compilers and MPI implementation, therefore you will likely want to use an environment generated with --no-compilers and --no-openmpi.
-
Had originally used the conda env from the separate link above, but that specifies --compilers and --openmpi. Based on later instructions, backed that out and used:
./scripts/generate-conda-envs.py --python 3.10 --ctk 11.7 --os linux --no-compilers --no-openmpi
-
LEGION_DIR=../legion/ ../quickstart.internal/build.sh fails:
CMake Error at /home/scratch.bvandeven_research_1/miniconda3/envs/legate/share/cmake-3.25/Modules/CMakeDetermineCUDACompiler.cmake:180 (message): Failed to find nvcc. Compiler requires the CUDA toolkit. Please set the CUDAToolkit_ROOT
-
was missing https://gitlab-master.nvidia.com/legate/quickstart.internal#computelab-nvidia
- at least for my "new user trying things" state of mind, having this information out of band after the main "getting started" steps was counterproductive. I got stuck before I ever saw this and therefore did not know it was the reason.
-
using run.sh on cholesky example:
ORTE does not know how to route a message to the specified daemon located on the indicated node:
- manual ssh key fingerprint issue, this needs docs support if not fixed on admin site
-
finally running
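On the earlier point about checking versions: the docs could show something as small as the sketch below, which reads the CUDA release from nvcc --version so it can be compared against the --ctk value. The helper names are hypothetical, and the regular expression assumes the "release X.Y" wording that nvcc printed on my machine. It also covers the "Failed to find nvcc" case, since a missing nvcc on PATH is reported as None rather than as a CMake error much later.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(version_text: str) -> Optional[str]:
    """Extract the 'X.Y' release number from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def system_cuda_release() -> Optional[str]:
    """Return the release reported by nvcc on PATH, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


# Line copied from the nvcc output at the top of these notes.
sample = "Cuda compilation tools, release 11.7, V11.7.99"
print(parse_nvcc_release(sample))
print(system_cuda_release())
```

If the second value is None after loading the CUDA module, the module setup is the first thing to look at before generating a conda environment.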
Tried very simple example:
import jax
jax.distributed.initialize()
xs = jax.numpy.ones(jax.local_device_count())
result = jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs)
According to the JAX docs, no parameters need to be supplied in SLURM environments. Invoked as:
srun -N 2 --exclusive -p dgx-1v-multinode --pty python jax-test.py
However, this failed in obscure ways:
Traceback (most recent call last):
File "/home/scratch.bvandeven_research_1/foo.py", line 3, in <module>
jax.distributed.initialize()
File "/home/scratch.bvandeven_research_1/miniconda3/envs/legate/lib/python3.10/site-packages/jax/_src/distributed.py", line 160, in initialize
global_state.initialize(coordinator_address, num_processes, process_id, local_device_ids)
File "/home/scratch.bvandeven_research_1/miniconda3/envs/legate/lib/python3.10/site-packages/jax/_src/distributed.py", line 80, in initialize
self.client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: PjRT_Client_Connect
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1672961505.985774662","description":"Error received from peer ipv4:127.0.1.1:62724","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Barrier timed out. Barrier_id: PjRT_Client_Connect","grpc_status":4} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-01-05 15:31:46.066509: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:479] Failed to disconnect from coordination service with status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1672961506.066464202","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3940,"referenced_errors":[{"created":"@1672961506.066461968","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":392,"grpc_status":14}]}. Proceeding with agent shutdown anyway.
srun: error: prm-dgx-02: task 0: Exited with exit code 1
srun: error: prm-dgx-03: task 1: Aborted (core dumped)
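One detail in the log above is the peer address 127.0.1.1, which is a loopback address. A possible reading (an assumption on my part, not a confirmed diagnosis) is that the node's hostname resolves to loopback, so other nodes could never reach the coordinator. The sketch below is a quick standard-library check to run on each node; both function names are hypothetical.

```python
import ipaddress
import socket
from typing import Optional


def is_loopback(address: str) -> bool:
    """Return True if `address` is in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def resolved_host_address() -> Optional[str]:
    """Resolve this machine's hostname to an IPv4 address, or None on failure."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return None


print(is_loopback("127.0.1.1"))  # address taken from the log above
addr = resolved_host_address()
if addr is None:
    print("hostname did not resolve")
else:
    print(addr, "loopback" if is_loopback(addr) else "routable candidate")
```

If the hostname does resolve to loopback, passing an explicit coordinator address to jax.distributed.initialize, as in the MPI example earlier, would be the next thing I would try.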
ref: https://docs.dask.org/en/stable/deploying.html
-
tried
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=8, processes=4, memory="16GB", account="bvandeven", walltime="01:00:00", queue="all")
cluster.scale(2)
got
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
ref: https://rapids.ai/hpc.html
More successful here, got the example on that page running after only ~10 minutes (there was an error in the code, reported to the RAPIDS folks). Details:
(rapids-22.10) bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat scheduler.sh
#!/usr/bin/env bash
#srun -N 1 --exclusive -p dgx-1v-multinode --pty -J scheduler.sh -t 00:10:00 bash scheduler.sh
#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10
LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp
mkdir $LOCAL_DIRECTORY
CUDA_VISIBLE_DEVICES=0 dask-scheduler \
--protocol tcp \
--scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" &
dask-cuda-worker \
--rmm-pool-size 14GB \
--scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
(base) bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat worker.sh
#!/usr/bin/env bash
#srun -N 2 --exclusive -p dgx-1v-multinode --pty -J worker.sh -t 00:10:00 bash worker.sh ]
#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10
LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp
mkdir $LOCAL_DIRECTORY
dask-cuda-worker \
--rmm-pool-size 14GB \
--scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat example.sh
#!/usr/bin/env bash
#srun -N 1 --exclusive -p dgx-1v-multinode --pty -J example.sh -t 00:10:00 bash example.sh
#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10
LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp
cat <<EOF >>/tmp/dask-cudf-example2.py
import cudf
import dask.dataframe as dd
from dask.distributed import Client
client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json")
cdf = cudf.datasets.timeseries()
ddf = dd.from_pandas(cdf, npartitions=10)
res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute()
print(res)
EOF
python /tmp/dask-cudf-example2.py
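Because the three scripts are launched as separate srun jobs, it can help to confirm which address the scheduler actually advertised before debugging workers or the client. The sketch below polls for the scheduler file and prints that address. It assumes the file is JSON with an "address" field, which matches what I would expect dask-scheduler to write but should be checked against a real file; wait_for_scheduler_address is a hypothetical helper, and the demo writes a stand-in file with a made-up address.

```python
import json
import tempfile
import time
from pathlib import Path


def wait_for_scheduler_address(path, timeout: float = 60.0, poll: float = 0.5) -> str:
    """Poll until the scheduler file exists and parses, then return its address."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return json.loads(Path(path).read_text())["address"]
        except (FileNotFoundError, json.JSONDecodeError, KeyError):
            # File not written yet, or only partially written: retry until timeout.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no usable scheduler file at {path}")
            time.sleep(poll)


# Demo with a stand-in scheduler file (the address is invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    scheduler_file = Path(tmp, "dask-scheduler.json")
    scheduler_file.write_text(json.dumps({"address": "tcp://10.0.0.1:8786"}))
    print(wait_for_scheduler_address(scheduler_file, timeout=5.0))
```

An address on a loopback or otherwise unreachable interface here would explain workers on other nodes failing to connect.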