Setting up Ray Cluster with TPU VMs#
Prerequisites#
Clone the Kithara repo.
git clone https://github.com/AI-Hypercomputer/kithara.git
Modify ray/TPU/cluster.yaml with your GCP project, zone, and TPU resource types.
Tip
Search for “MODIFY” in the YAML file to find the required changes.
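The search described in the tip can also be scripted. The following is a minimal sketch, assuming the marker is the literal string MODIFY and that it sits on or next to the lines to edit. `find_markers` is a hypothetical helper for illustration, not part of Kithara.

```python
from pathlib import Path


def find_markers(text: str, marker: str = "MODIFY") -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line containing the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


if __name__ == "__main__":
    # Path taken from the step above; assumes the repository root as working directory.
    config = Path("ray/TPU/cluster.yaml")
    if config.exists():
        for number, line in find_markers(config.read_text()):
            print(f"{config}:{number}: {line}")
    else:
        print(f"{config} not found; run this from the repository root.")
```

Each printed line gives a file position to review before launching the cluster.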
Setting up the Ray Cluster#
Launch the cluster:
ray up -y ray/TPU/cluster.yaml
Monitor the setup process:
ray monitor ray/TPU/cluster.yaml
Launch the Ray dashboard:
ray dashboard ray/TPU/cluster.yaml
The dashboard will be available at localhost:8265.
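To confirm that the dashboard port is actually being forwarded, a quick TCP check can help. This is a minimal sketch using only the Python standard library. `port_is_open` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that it is the Ray dashboard.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8265, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if port_is_open() else "not reachable"
    print(f"localhost:8265 is {state}")
```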
Troubleshooting#
update-failed errors typically don't affect proper node setup.
Check node status by executing:
ray attach ray/TPU/cluster.yaml
ray status
Running Multihost Jobs#
Submit job:
python ray/submit_job.py "python3.11 examples/multihost/ray/TPU/sft_lora_example.py" --hf-token your_token
To stop a job early:
export RAY_ADDRESS="http://127.0.0.1:8265"
ray job stop ray_job_id
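The stop step can be wrapped in a small script when jobs are managed programmatically. This is a minimal sketch, assuming the `ray` CLI is on the PATH and the dashboard address is the one shown above. `stop_job_invocation` and `stop_job` are hypothetical helpers, and `ray_job_id` remains a placeholder for the actual job ID.

```python
import os
import subprocess


def stop_job_invocation(job_id: str, address: str = "http://127.0.0.1:8265"):
    """Build the argv and environment for `ray job stop` with RAY_ADDRESS set."""
    env = dict(os.environ, RAY_ADDRESS=address)
    argv = ["ray", "job", "stop", job_id]
    return argv, env


def stop_job(job_id: str, address: str = "http://127.0.0.1:8265") -> int:
    """Run the stop command and return its exit code."""
    argv, env = stop_job_invocation(job_id, address)
    return subprocess.run(argv, env=env, check=False).returncode
```

Calling `stop_job("ray_job_id")` with a real job ID performs the same action as the two shell lines, without changing the environment of the calling shell.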
Cleanup#
When finished, tear down the cluster:
ray down ray/TPU/cluster.yaml