Since TensorFlow on TPU uses a single-controller job model, it is easy to queue TensorFlow jobs on a TPU pod slice using Ray job submission:
- Install Ray:
```
pip install "ray[default]"
```
- Start the Ray cluster head, advertising a custom `{"tpu": 1}` resource:
```
ray start --head --resources='{"tpu": 1}'
```
- Check that it is configured correctly by running `ray status`; you should see `0.0/1.0 tpu` in the Resources section of the output:
```
======== Autoscaler status: 2023-04-03 17:18:53.291394 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_3fcb2d9a878fc1ed698b05ccd087e3ee1dcc5d587248d5b7cffdeb69
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU
 0.00/223.724 GiB memory
 0.00/99.873 GiB object_store_memory
 0.0/1.0 tpu
```
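You can also verify this from Python. The snippet below is a minimal sketch, assuming it runs on the head node and attaches to the cluster started above; `ray.cluster_resources()` should report the custom `tpu` resource:
```python
# check_resources.py - confirm the custom "tpu" resource is registered
import ray

# Attach to the running cluster started with `ray start --head`.
ray.init(address="auto")

resources = ray.cluster_resources()
print(resources)  # e.g. {'CPU': 48.0, ..., 'tpu': 1.0}
assert resources.get("tpu", 0) >= 1, "custom 'tpu' resource not found"
```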
- Submit your jobs by running the following example script:
```
python3 job_submit.py
```
```python
# job_submit.py
from datetime import datetime

from ray.job_submission import JobSubmissionClient

now = datetime.now()
# Connect to the Ray dashboard running on the head node.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Job id
    submission_id=f"job_name_test_{now.strftime('%H:%M:%S')}",
    # Entrypoint shell command to execute
    entrypoint="python3 tpu-test.py",
    # Working directory and environment variables for the job
    runtime_env={
        "working_dir": "./tpu_test",
        "env_vars": {"TPU_NAME": "yejingxin-test", "TPU_LOAD_LIBRARY": "0"},
    },
    # Reserve the whole TPU pod slice for this job
    entrypoint_resources={"tpu": 1},
)
print(job_id)
```
Note: please configure your job with `entrypoint_resources={"tpu": 1}`, so that each job occupies the whole TPU pod slice.
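The `tpu-test.py` entrypoint referenced above is not shown here; the sketch below is one possible minimal version, assuming TensorFlow with TPU support is installed on the TPU VM, that simply checks the TPU is reachable:
```python
# tpu-test.py - minimal sketch of an entrypoint that checks TPU visibility
import tensorflow as tf

# Standard TPU VM initialization; "local" refers to the TPU attached to this host.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))
```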
- You can submit multiple jobs; they will all be queued and executed in submission order. Forward port 8265 (e.g. with ssh `-L 8265:localhost:8265`) to view the job list in the Ray dashboard.
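If you prefer not to use the dashboard, the queue can also be inspected programmatically with the same `JobSubmissionClient`. A short sketch (the exact fields returned by `list_jobs()` may vary across Ray versions):
```python
# list_jobs.py - inspect queued/running jobs without the dashboard
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Each entry carries the submission id, status (PENDING/RUNNING/SUCCEEDED/...),
# and the entrypoint command, so you can see the queue order.
for job in client.list_jobs():
    print(job.submission_id, job.status, job.entrypoint)

# To tail the output of a particular job, pass its submission id:
# print(client.get_job_logs("<submission_id>"))
```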