@yejingxin
Created April 3, 2023 17:47
how_to_queue_tf_jobs_on_tpu

How to queue TensorFlow jobs on TPU

Overview

Because TensorFlow on TPU runs as a single controller job, it is easy to queue TensorFlow jobs on a TPU with Ray job submission.

Setup

  • Install Ray
pip install "ray[default]"
  • Start a Ray cluster with the custom resource {"tpu": 1}
ray start --head --resources='{"tpu": 1}'
  • Check that it is configured correctly by running ray status; the Resources section should list 0.0/1.0 tpu
======== Autoscaler status: 2023-04-03 17:18:53.291394 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_3fcb2d9a878fc1ed698b05ccd087e3ee1dcc5d587248d5b7cffdeb69
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU
 0.00/223.724 GiB memory
 0.00/99.873 GiB object_store_memory
 0.0/1.0 tpu
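
The same check can be done from Python instead of reading the ray status output. The following is a minimal sketch, assuming it runs on the head node where ray start was executed; tpu_summary and check_cluster are hypothetical helper names, not part of Ray.

```python
def tpu_summary(total, available, name="tpu"):
    """Format used/total for one resource, in the style of the `ray status` line."""
    total_count = total.get(name, 0.0)
    # Resources with nothing free may be missing from `available`, hence the default.
    used = total_count - available.get(name, 0.0)
    return f"{used:.1f}/{total_count:.1f} {name}"


def check_cluster():
    """Attach to the running cluster and print the tpu usage line."""
    import ray  # imported here so tpu_summary has no Ray dependency

    ray.init(address="auto")  # connect to the cluster started by `ray start --head`
    print(tpu_summary(ray.cluster_resources(), ray.available_resources()))
```

Calling check_cluster() on an idle cluster configured as above should print the same 0.0/1.0 tpu figure that ray status reports.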

Submit and Monitor Jobs

  • Submit your jobs by running the following example script with python3 job_submit.py
# job_submit.py
from datetime import datetime

from ray.job_submission import JobSubmissionClient

now = datetime.now()

client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
        # job id
        submission_id=f"job_name_test_{now.strftime('%H:%M:%S')}",
        # Entrypoint shell command to execute
        entrypoint="python3 tpu-test.py",
        # Working dir
        runtime_env={
          "working_dir": "./tpu_test",
          "env_vars": {"TPU_NAME": "yejingxin-test", "TPU_LOAD_LIBRARY": "0"}
        },
        entrypoint_resources={"tpu": 1},
)
print(job_id)

Note: configure each job with entrypoint_resources={"tpu": 1} so that it occupies the whole TPU pod slice. Because the cluster exposes only one tpu resource, only one job runs at a time and the others wait in the queue.
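
To monitor a submitted job from a script rather than the dashboard, one option is to poll its status until it reaches a terminal state. This is a minimal sketch, not part of the original script: wait_for_job and follow_job are hypothetical helpers, and the polling interval and timeout are arbitrary defaults. It relies on get_job_status and get_job_logs from JobSubmissionClient.

```python
import time

# Status names after which a job will not change state again.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")


def wait_for_job(client, job_id, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll client.get_job_status(job_id) until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(job_id)
        # Ray's JobStatus is a string enum, so it compares equal to plain strings.
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_id} is still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


def follow_job(job_id, address="http://127.0.0.1:8265"):
    """Wait for one job, then print its final status and logs."""
    from ray.job_submission import JobSubmissionClient  # needs ray[default]

    client = JobSubmissionClient(address)
    print(wait_for_job(client, job_id))
    print(client.get_job_logs(job_id))
```

For example, follow_job(job_id) can be called with the ID printed by job_submit.py. A queued job stays in PENDING until the tpu resource is free, so the timeout should cover the time spent waiting behind earlier jobs.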

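  • You can submit multiple jobs; they are queued and executed in submission order. To view the job list in the Ray dashboard, forward port 8265 from the machine running Ray, for example by adding -L8265:localhost:8265 to your ssh command, then open http://localhost:8265 in a browser.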
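
Queuing several jobs at once could look like the following sketch. The helper names queue_jobs and show_queue, the ID prefix, and the list of entrypoints are illustrative assumptions; the submit_job arguments mirror those in job_submit.py, and list_jobs is assumed to return job records with submission_id and status fields, as in recent Ray releases.

```python
from datetime import datetime


def queue_jobs(client, entrypoints, working_dir="./tpu_test", prefix="job_name_test"):
    """Submit one job per entrypoint; each requests the single "tpu" resource."""
    stamp = datetime.now().strftime("%H%M%S")
    job_ids = []
    for index, entrypoint in enumerate(entrypoints):
        job_ids.append(
            client.submit_job(
                # The index keeps IDs distinct within one batch.
                submission_id=f"{prefix}_{stamp}_{index}",
                entrypoint=entrypoint,
                runtime_env={"working_dir": working_dir},
                entrypoint_resources={"tpu": 1},
            )
        )
    return job_ids


def show_queue(address="http://127.0.0.1:8265"):
    """Submit three copies of the test job and print every job the cluster knows about."""
    from ray.job_submission import JobSubmissionClient  # needs ray[default]

    client = JobSubmissionClient(address)
    queue_jobs(client, ["python3 tpu-test.py"] * 3)
    for job in client.list_jobs():
        print(job.submission_id, job.status)
```

With a single tpu resource in the cluster, at most one of these jobs should be running at any moment, with the others shown as pending.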

