Setting up Ray Cluster with TPU GKE

Prerequisites

  • Clone the Kithara repo.

git clone https://github.com/AI-Hypercomputer/kithara.git

Preparing Your GKE Cluster

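Fetch credentials for your cluster, then store your Hugging Face token as a Kubernetes secret. Replace HUGGING_FACE_TOKEN with your own token.

gcloud container clusters get-credentials $CLUSTER --zone $ZONE --project $YOUR_PROJECT
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HUGGING_FACE_TOKEN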
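To avoid pasting the token directly into the command line, the secret can be created from an environment variable instead. The following is a minimal sketch, not part of the Kithara tooling: the function name create_hf_secret and the variable HF_TOKEN are hypothetical names chosen for illustration.

```shell
# Sketch: create the hf-secret from an environment variable.
# HF_TOKEN and create_hf_secret are hypothetical names, not part of Kithara.
create_hf_secret() {
  # Refuse to create a secret with an empty token.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  kubectl create secret generic hf-secret \
    --from-literal=hf_api_token="$HF_TOKEN"
}
```

After exporting HF_TOKEN, call create_hf_secret, then run kubectl get secret hf-secret to confirm the secret exists.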

Setting Up a Ray Cluster

  1. Edit one of the following manifest files:

    Make sure to replace YOUR_GCS_BUCKET with the name of the GCS bucket created in previous steps.

  2. Deploy the Ray cluster:

kubectl apply -f $MANIFEST_FILE

  3. Check that the cluster is running with:

kubectl get pods

If everything works as expected, you should see pods running:

NAME                                               READY   STATUS    RESTARTS   AGE
example-cluster-kuberay-head-kgxkp                 2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-bzrz2   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-g7k4t   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-h6zsx   2/2     Running   0          1m
example-cluster-kuberay-worker-workergroup-pdf8x   2/2     Running   0          1m
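The edit, deploy, and check steps above can be combined into one helper. This is a rough sketch under stated assumptions: the function name deploy_ray_cluster and the ".rendered.yaml" output name are hypothetical, the bucket is passed as a bare name without the gs:// prefix, and kubectl wait is applied to every pod in the current namespace, which may be broader than you want on a shared cluster.

```shell
# Sketch: substitute the bucket placeholder, apply the manifest, wait for pods.
# deploy_ray_cluster and the ".rendered.yaml" suffix are hypothetical names.
deploy_ray_cluster() {
  bucket="$1"     # bare bucket name, without the gs:// prefix
  manifest="$2"   # path to the manifest you chose to edit
  rendered="${manifest%.yaml}.rendered.yaml"

  # Write a copy with the placeholder replaced, leaving the original untouched.
  sed "s|YOUR_GCS_BUCKET|${bucket}|g" "$manifest" > "$rendered" || return 1

  kubectl apply -f "$rendered" || return 1

  # Block until all pods in the current namespace report Ready (10 minute cap).
  kubectl wait --for=condition=Ready pods --all --timeout=600s
}
```

For example, deploy_ray_cluster my-bucket "$MANIFEST_FILE" writes the substituted copy next to the original manifest and returns once the pods are Ready or the timeout expires.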

Running a Ray Workload

  1. Set the following environment variable:

export RAY_ADDRESS=http://localhost:8265

  2. Port-forward to the Ray cluster:

kubectl port-forward svc/example-cluster-kuberay-head-svc 8265:8265 &

  3. Submit a Ray job from the root of the cloned Kithara repo, for example:

ray job submit --working-dir . \
    --runtime-env-json='{"excludes": [".git", "kithara/model/maxtext/maxtext/MaxText/test_assets"]}' \
    -- python examples/multihost/ray/TPU/full_finetuning_example.py

  4. Visit http://localhost:8265 in your browser to see the Ray dashboard and monitor job status.
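Because the port-forward runs in the background, a job submitted immediately afterwards may fail if the tunnel is not up yet. One way to guard against this is to poll the dashboard address before submitting. The helper below is an illustrative sketch: wait_for_dashboard is a hypothetical name, and it only checks that the address accepts connections, not that the cluster is healthy.

```shell
# Sketch: poll the Ray dashboard address until it accepts connections.
# wait_for_dashboard is a hypothetical helper, not part of Ray or Kithara.
wait_for_dashboard() {
  addr="${RAY_ADDRESS:-http://localhost:8265}"
  attempts="${1:-30}"   # number of one-second retries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl exits 0 as soon as a connection to the address succeeds.
    if curl -s -o /dev/null "$addr"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Ray dashboard not reachable at $addr" >&2
  return 1
}
```

Running wait_for_dashboard && ray job submit ... submits only once the tunnel responds. After submission, ray job list, ray job status, and ray job logs can be used from the same shell to follow progress without opening the browser.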

Clean Up

  1. When your job is done, you can delete the Ray cluster by running:

kubectl delete -f $MANIFEST_FILE

  2. The GKE cluster can be deleted with:

gcloud container clusters delete $CLUSTER --zone $ZONE
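If you plan to keep the GKE cluster but want to remove everything this guide created inside it, the Ray resources and the Hugging Face secret can be deleted together. This is a small sketch, assuming the secret is still named hf-secret as above; cleanup_ray_resources is a hypothetical name.

```shell
# Sketch: remove the Ray cluster resources and the Hugging Face secret.
# cleanup_ray_resources is a hypothetical helper, not part of Kithara.
cleanup_ray_resources() {
  manifest="$1"
  if [ -z "$manifest" ]; then
    echo "usage: cleanup_ray_resources MANIFEST_FILE" >&2
    return 1
  fi
  # --ignore-not-found keeps the command from failing on a repeated run.
  kubectl delete -f "$manifest" --ignore-not-found || return 1
  kubectl delete secret hf-secret --ignore-not-found
}
```

Call it as cleanup_ray_resources "$MANIFEST_FILE". If the port-forward is still running in the background of the same shell, stop it as well, for example with kill %1 when it is the only background job.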