@yejingxin
yejingxin / goodput.md
Last active November 9, 2024 00:22
Goodput definition.md

GPU Training Goodput Explained

A typical journey for a cloud GPU customer involves training a model with a specific goal: processing a set number of tokens with a model of a certain size.

Setup

  • Model size: $P$ billion parameters
  • Training data: $D$ tokens
  • Availability goal (SLO): $A$
  • Effective training speed: $F$ FLOPs/second across $C$ chips
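
A back-of-the-envelope sketch of how these quantities combine (not part of the original gist; it assumes the common ~6·P·D FLOPs rule of thumb for dense transformer training, and the numbers are purely illustrative):

```python
# Hypothetical numbers; only the structure of the estimate matters here.
P = 70e9     # model parameters
D = 2e12     # training tokens
F = 2.5e15   # effective aggregate FLOPs/s delivered across all C chips
A = 0.99     # availability SLO (fraction of wall-clock time that is productive)

total_flops = 6 * P * D                # ~6*P*D FLOPs rule of thumb
ideal_seconds = total_flops / F        # time if every second were productive
expected_seconds = ideal_seconds / A   # wall-clock time after accounting for the SLO

print(f"ideal:    {ideal_seconds / 86400:.1f} days")
print(f"expected: {expected_seconds / 86400:.1f} days")
```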
@yejingxin
yejingxin / autocheckpoint.py
Created October 25, 2024 22:59
PyTorch auto-checkpoint with SIGTERM
"""
$ python3 autocheckpoint.py
train step=1
train step=2
train step=3
train step=4
train step=5
train step=6
train step=7
train step=8
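
The preview above only shows the sample output. A minimal sketch of the pattern the gist title describes (catch SIGTERM, finish the current step, save a checkpoint, then exit); all names below are illustrative, not the gist's actual code:

```python
import signal

import torch

stop_requested = False

def _handle_sigterm(signum, frame):
    # Only set a flag here: doing real work inside a signal handler is unsafe.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1, 1_000_000):
    print(f"train step={step}")
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if stop_requested:
        # Save a checkpoint before the process is killed, then stop cleanly.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict()}, "ckpt.pt")
        break
```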
@yejingxin
yejingxin / nsight.sh
Created September 24, 2024 18:34 — forked from mcarilli/nsight.sh
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
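
A small sketch of the NVTX annotation pattern those comments refer to (the `torch.cuda.nvtx` calls are standard PyTorch; the model and loop are illustrative):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 1024, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step {step}")

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).sum()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close "step"
```

These named ranges then show up as labeled regions on the nsys timeline.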
@yejingxin
yejingxin / ckpt_unittest.py
Created November 7, 2023 17:19
Debug ckpt loading time
import unittest
from absl import logging
import time
import checkpoint_utils
import orbax.checkpoint as ocp


class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        import ray
        from ray.cluster_utils import Cluster
        from collections import deque

        # Starts a head-node for the cluster.
        cluster = Cluster(
            initialize_head=True,
            head_node_args={
                "resources": {"host": 1},
            })
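
The preview cuts off here. A minimal sketch of how checkpoint loading time can be timed with Orbax (the state tree and path below are assumptions for illustration, not the gist's actual test):

```python
import tempfile
import time

import numpy as np
import orbax.checkpoint as ocp

# A small stand-in state tree; a real test would restore a production checkpoint.
state = {"params": {"w": np.zeros((1024, 1024), dtype=np.float32)}}
path = tempfile.mkdtemp() + "/ckpt"   # fresh directory so the save does not collide

ckptr = ocp.PyTreeCheckpointer()
ckptr.save(path, state)

start = time.perf_counter()
restored = ckptr.restore(path)
print(f"restore took {time.perf_counter() - start:.3f}s")
```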
@yejingxin
yejingxin / soft_slicing.md
Last active May 31, 2023 23:26
Soft Slicing
  • create tpu
gcloud alpha compute tpus tpu-vm create yejingxin-tpu-v2 \
    --zone us-central2-b \
    --project tpu-prod-env-one-vm \
    --accelerator-type v4-32 \
    --version tpu-vm-v4-base
  • install jax
gcloud compute tpus tpu-vm ssh yejingxin-tpu-v2 \
 --zone=us-central2-b --worker=all --command="pip install --upgrade 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html" 
  • find ip and topology correspondence
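
The preview is truncated here. A small sketch of one way to line hosts up with device coordinates, assuming it is run on every worker (for example via the same `gcloud ... --worker=all --command=...` pattern used for the install step above):

```python
import socket

import jax

# Each worker prints its hostname and process index plus the coordinates of its
# local TPU chips, which can then be matched against the VM's IP address.
print(socket.gethostname(), "process", jax.process_index(), "of", jax.process_count())
for d in jax.local_devices():
    print(f"  device {d.id}: coords={d.coords}")
```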
@yejingxin
yejingxin / maxtext_ray_runner.py
Created May 8, 2023 17:30
MaxText Ray Example
"""
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
@yejingxin
yejingxin / run_multislice_device_count.py
Last active April 18, 2024 07:23
Multislice TPU device count example
# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
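
The preview above only shows the license header. A minimal sketch of what a multislice device-count check typically looks like in JAX (illustrative, not necessarily the gist's code):

```python
import jax

# On a multislice TPU configuration, device_count() sums devices across all
# slices, while local_device_count() only counts the chips on this worker.
print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices:  {jax.local_device_count()}")
print(f"global devices: {jax.device_count()}")
```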
@yejingxin
yejingxin / ray_tf_tpu.md
Created April 3, 2023 17:47
how_to_queue_tf_jobs_on_tpu

How to queue TensorFlow jobs on TPU

Overview

Because TensorFlow on TPU runs as a single-controller job, it is easy to queue TensorFlow jobs on a TPU using the Ray job CLI.

Setup

  • Install Ray
pip install "ray[default]"
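
A short sketch of queuing a job through Ray's Python job submission API (an alternative to the `ray job submit` CLI the gist mentions; the address, entrypoint, and working directory are illustrative assumptions):

```python
from ray.job_submission import JobSubmissionClient

# Point the client at the Ray dashboard / job server running on the head node.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python train.py",                # the TensorFlow training script
    runtime_env={"working_dir": "./my_tf_job"},  # shipped to the cluster with the job
)
print("submitted", job_id)
```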

Notes to better understand the non-deterministic output that different reduce orders can produce:

A simple, concrete example of a reduce is summation: in floating-point arithmetic, summing the same numbers in a different order can produce different results, i.e. (n_1 + ((n_2 + n_3) + n_4)) != (((n_1 + n_2) + n_3) + n_4). The following code snippet simulates a matrix multiplication implemented as a reduce-sum over shards. Assume X is the input batch with all rows identical, W is the weight matrix, and Z = XW is the model output; X is partitioned by column and W by row, each into four shards.

# This cell runs on CPU; it takes a few minutes to finish.
import numpy as np

x = np.random.normal(size=[8192])
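
The code preview cuts off here. A minimal sketch of the experiment described above (shard counts and shapes are assumptions):

```python
import numpy as np

# X is partitioned by column into 4 shards and W by row, so Z = XW equals the
# sum of the per-shard partial products X_i @ W_i. In float32, the order in
# which those partial products are reduced changes the result slightly.
np.random.seed(0)
x = np.random.normal(size=[8192]).astype(np.float32)
X = np.tile(x, (8, 1))                         # input batch with all rows the same
W = np.random.normal(size=[8192, 8]).astype(np.float32)

x_shards = np.split(X, 4, axis=1)              # column shards of the input
w_shards = np.split(W, 4, axis=0)              # row shards of the weights
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

z_fwd = ((partials[0] + partials[1]) + partials[2]) + partials[3]
z_rev = ((partials[3] + partials[2]) + partials[1]) + partials[0]

print(np.abs(z_fwd - z_rev).max())             # typically nonzero: reduce order matters
```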