Setting up Ray Cluster with TPU GKE#
Prerequisites#
Clone the Kithara repo.
git clone https://github.com/AI-Hypercomputer/kithara.git
Create a GKE cluster with the Ray add-on by following https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gcp-gke-tpu-cluster.html
Preparing Your GKE Cluster#
Enable GCSFuse.
This step allows GCS buckets to be mounted on GKE containers as drives. This makes it easier for Kithara to save checkpoints to GCS.
You can follow the instructions here: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gke-gcs-bucket.html.
Authenticate to Your GKE Cluster:
gcloud container clusters get-credentials $CLUSTER --zone $ZONE --project $YOUR_PROJECT
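The commands in this guide rely on shell variables such as $CLUSTER, $ZONE, $YOUR_PROJECT, and later $MANIFEST_FILE. As a minimal sketch, a small guard can fail early when one of them is unset. The helper name `require_vars` is hypothetical and not part of Kithara, gcloud, or kubectl:

```shell
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (the values are placeholders, not real resources):
#   export CLUSTER=my-cluster ZONE=my-zone YOUR_PROJECT=my-project
#   require_vars CLUSTER ZONE YOUR_PROJECT && \
#     gcloud container clusters get-credentials "$CLUSTER" --zone "$ZONE" --project "$YOUR_PROJECT"
```

Quoting the variables in the real command, as in the usage comment, avoids surprises if a value ever contains spaces.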
Create a Hugging Face token by following the instructions at https://huggingface.co/docs/hub/en/security-tokens
Save the Hugging Face token to the Kubernetes cluster as a secret, replacing HUGGING_FACE_TOKEN with your token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=HUGGING_FACE_TOKEN
Setting Up a Ray Cluster#
Edit one of the following manifest files:
Single-host: https://github.com/AI-Hypercomputer/kithara/blob/main/ray/TPU/GKE/single-host.yaml
Multi-host: https://github.com/AI-Hypercomputer/kithara/blob/main/ray/TPU/GKE/multi-host.yaml
Make sure to replace YOUR_GCS_BUCKET with the name of the GCS bucket created in the previous steps.
Deploy the Ray cluster, where $MANIFEST_FILE is the path to the manifest you edited:
kubectl apply -f $MANIFEST_FILE
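The YOUR_GCS_BUCKET replacement can also be scripted, so that the manifest from the repository stays unmodified and the filled-in copy is what gets applied. A minimal sketch, assuming the placeholder appears literally as YOUR_GCS_BUCKET in the manifest; the helper name `render_manifest` and the output file name are hypothetical:

```shell
# Hypothetical helper: write a copy of a manifest with the bucket placeholder filled in.
#   $1 = path to the template manifest (for example ray/TPU/GKE/single-host.yaml)
#   $2 = bucket name, in whatever form the manifest expects
#   $3 = path of the rendered manifest to write
render_manifest() {
  template=$1
  bucket=$2
  output=$3
  # '|' is used as the sed delimiter so that '/' in a value does not break the expression.
  sed "s|YOUR_GCS_BUCKET|${bucket}|g" "$template" > "$output"
}

# Example usage (bucket and file names are placeholders):
#   render_manifest ray/TPU/GKE/single-host.yaml my-bucket my-cluster.yaml
#   export MANIFEST_FILE=my-cluster.yaml
#   kubectl apply -f "$MANIFEST_FILE"
```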
Check that the cluster is running with:
kubectl get pods
If everything works as expected, you should see pods running:
NAME                                               READY   STATUS    RESTARTS   AGE
example-cluster-kuberay-head-kgxkp                 2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-bzrz2   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-g7k4t   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-h6zsx   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-pdf8x   2/2     Running   0          1m
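In a script, the check above can be automated instead of read by eye. The sketch below parses the default `kubectl get pods` table and assumes STATUS is the third column, as in the output shown; the helper name `all_pods_running` is hypothetical:

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and succeed
# only if at least one pod is listed and every pod has STATUS "Running".
all_pods_running() {
  awk 'NR > 1 { total++; if ($3 != "Running") bad++ }
       END { exit ((total > 0 && bad == 0) ? 0 : 1) }'
}

# Example usage:
#   kubectl get pods | all_pods_running && echo "Ray cluster is up"
#
# Alternatively, kubectl can block until the pods report Ready:
#   kubectl wait --for=condition=Ready pod --all --timeout=600s
```

Note that both variants look at every pod in the current namespace, so unrelated pods there will affect the result.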
Running a Ray Workload#
Set the following environment variable:
export RAY_ADDRESS=http://localhost:8265
Port-forward to the Ray cluster:
kubectl port-forward svc/example-cluster-kuberay-head-svc 8265:8265 &
Submit a Ray job, for example:
ray job submit --working-dir . \
--runtime-env-json='{"excludes": [".git", "kithara/model/maxtext/maxtext/MaxText/test_assets"]}' \
-- python examples/multihost/ray/TPU/full_finetuning_example.py
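The `excludes` entry in the runtime environment keeps the listed paths out of the uploaded working directory. If you need to exclude more paths, writing the JSON by hand gets error-prone; the sketch below builds it from a list of arguments. The helper name `excludes_json` is hypothetical, and it assumes the paths contain no double quotes or backslashes, since it does no JSON escaping:

```shell
# Hypothetical helper: print a runtime-env JSON object whose "excludes" list
# contains each argument. Assumes paths need no JSON escaping.
excludes_json() {
  list=""
  for path in "$@"; do
    # Append the quoted path, adding ", " between entries but not before the first.
    list="${list:+$list, }\"$path\""
  done
  printf '{"excludes": [%s]}\n' "$list"
}

# Example usage, equivalent to the command above:
#   ray job submit --working-dir . \
#     --runtime-env-json="$(excludes_json .git kithara/model/maxtext/maxtext/MaxText/test_assets)" \
#     -- python examples/multihost/ray/TPU/full_finetuning_example.py
```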
You can visit http://localhost:8265 in your browser to see the Ray dashboard and monitor job status.
Clean Up#
When your job is done, you can delete the Ray cluster by running:
kubectl delete -f $MANIFEST_FILE
The GKE cluster can be deleted with:
gcloud container clusters delete $CLUSTER --zone $ZONE