@yejingxin
yejingxin / goodput.md
Last active November 9, 2024 00:22
Goodput definition.md

GPU Training Goodput Explained

A typical journey for a cloud GPU customer involves training a model with a specific goal: processing a set number of tokens with a model of a certain size.

Setup

  • Model size: $P$ billion parameters
  • Training data: $D$ tokens
  • Availability goal (SLO): $A$
  • Effective training speed: $F$ FLOPs/second across $C$ chips
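
A back-of-the-envelope sketch of how these quantities combine (not part of the original gist; it assumes the common ~6·P·D FLOPs rule of thumb for dense transformer training, and the numbers are purely illustrative):

```python
# Hypothetical numbers; only the structure of the estimate matters here.
P = 70e9     # model parameters
D = 2e12     # training tokens
F = 2.5e15   # effective aggregate FLOPs/s delivered across all C chips
A = 0.99     # availability SLO (fraction of wall-clock time that is productive)

total_flops = 6 * P * D                # ~6*P*D FLOPs rule of thumb
ideal_seconds = total_flops / F        # time if every second were productive
expected_seconds = ideal_seconds / A   # wall-clock time after accounting for the SLO

print(f"ideal:    {ideal_seconds / 86400:.1f} days")
print(f"expected: {expected_seconds / 86400:.1f} days")
```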
@yejingxin
yejingxin / autocheckpoint.py
Created October 25, 2024 22:59
PyTorch auto-checkpoint with SIGTERM
"""
$ python3 autocheckpoint.py
train step=1
train step=2
train step=3
train step=4
train step=5
train step=6
train step=7
train step=8
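
The preview above only shows the sample output. A minimal sketch of the pattern the gist title describes (catch SIGTERM, finish the current step, save a checkpoint, then exit); all names below are illustrative, not the gist's actual code:

```python
import signal

import torch

stop_requested = False

def _handle_sigterm(signum, frame):
    # Only set a flag here: doing real work inside a signal handler is unsafe.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1, 1_000_000):
    print(f"train step={step}")
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if stop_requested:
        # Save a checkpoint before the process is killed, then stop cleanly.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict()}, "ckpt.pt")
        break
```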
@yejingxin
yejingxin / nsight.sh
Created September 24, 2024 18:34 — forked from mcarilli/nsight.sh
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
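
A small sketch of the NVTX annotation pattern those comments refer to (the `torch.cuda.nvtx` calls are standard PyTorch; the model and loop are illustrative):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 1024, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step {step}")

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).sum()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close "step"
```

These named ranges then show up as labeled regions on the nsys timeline.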
@yejingxin
yejingxin / ckpt_unittest.py
Created November 7, 2023 17:19
Debug ckpt loading time
import unittest
from absl import logging
import time
import checkpoint_utils
import orbax.checkpoint as ocp


class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        import ray
        from ray.cluster_utils import Cluster
        from collections import deque

        # Starts a head-node for the cluster.
        cluster = Cluster(
            initialize_head=True,
            head_node_args={
                "resources": {"host": 1},
            })
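
The preview cuts off here. A minimal sketch of how checkpoint loading time can be timed with Orbax (the state tree and path below are assumptions for illustration, not the gist's actual test):

```python
import tempfile
import time

import numpy as np
import orbax.checkpoint as ocp

# A small stand-in state tree; a real test would restore a production checkpoint.
state = {"params": {"w": np.zeros((1024, 1024), dtype=np.float32)}}
path = tempfile.mkdtemp() + "/ckpt"   # fresh directory so the save does not collide

ckptr = ocp.PyTreeCheckpointer()
ckptr.save(path, state)

start = time.perf_counter()
restored = ckptr.restore(path)
print(f"restore took {time.perf_counter() - start:.3f}s")
```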
@yejingxin
yejingxin / soft_slicing.md
Last active May 31, 2023 23:26
Soft Slicing
  • create tpu
gcloud alpha compute tpus tpu-vm create yejingxin-tpu-v2 \
    --zone us-central2-b \
    --project tpu-prod-env-one-vm \
    --accelerator-type v4-32 \
    --version tpu-vm-v4-base
  • install jax
gcloud compute tpus tpu-vm ssh yejingxin-tpu-v2 \
 --zone=us-central2-b --worker=all --command="pip install --upgrade 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html" 
  • find ip and topology correspondence
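
The preview is truncated here. A small sketch of one way to line hosts up with device coordinates, assuming it is run on every worker (for example via the same `gcloud ... --worker=all --command=...` pattern used for the install step above):

```python
import socket

import jax

# Each worker prints its hostname and process index plus the coordinates of its
# local TPU chips, which can then be matched against the VM's IP address.
print(socket.gethostname(), "process", jax.process_index(), "of", jax.process_count())
for d in jax.local_devices():
    print(f"  device {d.id}: coords={d.coords}")
```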
@yejingxin
yejingxin / maxtext_ray_runner.py
Created May 8, 2023 17:30
MaxText Ray Example
"""
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
@yejingxin
yejingxin / run_multislice_device_count.py
Last active April 18, 2024 07:23
Multislice TPU device count example
# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
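
The preview above only shows the license header. A minimal sketch of what a multislice device-count check typically looks like in JAX (illustrative, not necessarily the gist's code):

```python
import jax

# On a multislice TPU configuration, device_count() sums devices across all
# slices, while local_device_count() only counts the chips on this worker.
print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices:  {jax.local_device_count()}")
print(f"global devices: {jax.device_count()}")
```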
@yejingxin
yejingxin / ray_tf_tpu.md
Created April 3, 2023 17:47
how_to_queue_tf_jobs_on_tpu

How to queue TensorFlow jobs on TPU

Overview

Because TensorFlow on TPU runs as a single-controller job, it is easy to queue TensorFlow jobs on a TPU using the Ray job CLI.

Setup

  • Install Ray
pip install "ray[default]"
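
A short sketch of queuing a job through Ray's Python job submission API (an alternative to the `ray job submit` CLI the gist mentions; the address, entrypoint, and working directory are illustrative assumptions):

```python
from ray.job_submission import JobSubmissionClient

# Point the client at the Ray dashboard / job server running on the head node.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python train.py",                # the TensorFlow training script
    runtime_env={"working_dir": "./my_tf_job"},  # shipped to the cluster with the job
)
print("submitted", job_id)
```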

Notes to better understand the non-deterministic output that different reduce orders can produce:

A simple, concrete example of a reduce is summation: in floating-point arithmetic, summing the same numbers in a different order can produce different results, i.e. (n_1 + ((n_2 + n_3) + n_4)) != (((n_1 + n_2) + n_3) + n_4). The following code snippet simulates a matrix multiplication implemented as a reduce-sum over shards. Assume X is the input batch with all rows identical, W is the weight matrix, and Z = XW is the model output; X is partitioned by column and W by row, each into four shards.

# This cell runs on CPU; it takes a few minutes to finish.
import numpy as np

x = np.random.normal(size=[8192])
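
The code preview cuts off here. A minimal sketch of the experiment described above (shard counts and shapes are assumptions):

```python
import numpy as np

# X is partitioned by column into 4 shards and W by row, so Z = XW equals the
# sum of the per-shard partial products X_i @ W_i. In float32, the order in
# which those partial products are reduced changes the result slightly.
np.random.seed(0)
x = np.random.normal(size=[8192]).astype(np.float32)
X = np.tile(x, (8, 1))                         # input batch with all rows the same
W = np.random.normal(size=[8192, 8]).astype(np.float32)

x_shards = np.split(X, 4, axis=1)              # column shards of the input
w_shards = np.split(W, 4, axis=0)              # row shards of the weights
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

z_fwd = ((partials[0] + partials[1]) + partials[2]) + partials[3]
z_rev = ((partials[3] + partials[2]) + partials[1]) + partials[0]

print(np.abs(z_fwd - z_rev).max())             # typically nonzero: reduce order matters
```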