Since TensorFlow on TPU uses a single-controller job model, it is easy to queue TensorFlow jobs on a TPU pod slice using Ray job submission:
- Install Ray:
```
pip install "ray[default]"
```
- Start the Ray cluster head, advertising a custom `{"tpu": 1}` resource:
```
ray start --head --resources='{"tpu": 1}'
```
- Check that it is configured correctly by running `ray status`; you should see `0.0/1.0 tpu` in the Resources section of the output:
```
======== Autoscaler status: 2023-04-03 17:18:53.291394 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_3fcb2d9a878fc1ed698b05ccd087e3ee1dcc5d587248d5b7cffdeb69
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU
 0.00/223.724 GiB memory
 0.00/99.873 GiB object_store_memory
 0.0/1.0 tpu
```
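You can also verify this from Python. The snippet below is a minimal sketch, assuming it runs on the head node and attaches to the cluster started above; `ray.cluster_resources()` should report the custom `tpu` resource:
```python
# check_resources.py - confirm the custom "tpu" resource is registered
import ray

# Attach to the running cluster started with `ray start --head`.
ray.init(address="auto")

resources = ray.cluster_resources()
print(resources)  # e.g. {'CPU': 48.0, ..., 'tpu': 1.0}
assert resources.get("tpu", 0) >= 1, "custom 'tpu' resource not found"
```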
- Submit your jobs by running the following example script:
```
python3 job_submit.py
```
```python
# job_submit.py
from datetime import datetime

from ray.job_submission import JobSubmissionClient

now = datetime.now()
# Connect to the Ray dashboard running on the head node.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Job id
    submission_id=f"job_name_test_{now.strftime('%H:%M:%S')}",
    # Entrypoint shell command to execute
    entrypoint="python3 tpu-test.py",
    # Working directory and environment variables for the job
    runtime_env={
        "working_dir": "./tpu_test",
        "env_vars": {"TPU_NAME": "yejingxin-test", "TPU_LOAD_LIBRARY": "0"},
    },
    # Reserve the whole TPU pod slice for this job
    entrypoint_resources={"tpu": 1},
)
print(job_id)
```
Note: please configure your job with `entrypoint_resources={"tpu": 1}`, so that each job occupies the whole TPU pod slice.
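The `tpu-test.py` entrypoint referenced above is not shown here; the sketch below is one possible minimal version, assuming TensorFlow with TPU support is installed on the TPU VM, that simply checks the TPU is reachable:
```python
# tpu-test.py - minimal sketch of an entrypoint that checks TPU visibility
import tensorflow as tf

# Standard TPU VM initialization; "local" refers to the TPU attached to this host.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))
```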
- You can submit multiple jobs; they will all be queued and executed in submission order. Forward port 8265 (e.g. with ssh `-L 8265:localhost:8265`) to view the job list in the Ray dashboard.
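If you prefer not to use the dashboard, the queue can also be inspected programmatically with the same `JobSubmissionClient`. A short sketch (the exact fields returned by `list_jobs()` may vary across Ray versions):
```python
# list_jobs.py - inspect queued/running jobs without the dashboard
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Each entry carries the submission id, status (PENDING/RUNNING/SUCCEEDED/...),
# and the entrypoint command, so you can see the queue order.
for job in client.list_jobs():
    print(job.submission_id, job.status, job.entrypoint)

# To tail the output of a particular job, pass its submission id:
# print(client.get_job_logs("<submission_id>"))
```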