Getting TPUs#

Prerequisites#

A Google Cloud Platform (GCP) account with appropriate billing enabled.
Basic understanding of machine learning and command-line tools.
To set up your GCP project for TPUs, see Set up a GCP Account.

Supported TPUs#

Kithara supports all generations and sizes of Cloud TPUs available on GCP. For the list of available TPUs, see Cloud TPU Pricing.

TPU VM Image#

We recommend using the tpu-ubuntu2204-base TPU VM image.

Requesting Capacity#

Please review the following steps before provisioning capacity (creating VMs):

Familiarize yourself with key concepts
Choose a TPU type
Choose a usage mode
Choose a type of capacity
Request quota

After completing these steps, you can provision capacity. Creating VMs without completing these steps first may lead to errors.

Tip

If you want to get started quickly, you can skip this section: 64 chips of Trillium (TPU v6e) with DWS (Flex Start) mode in any supported zone is a good starting point. To see the size of models you can tune with this capacity, use this calculator. Instructions to create VMs are here.

Later, you can tune these parameters using the rest of the instructions in this section for production workloads.

1. Familiarize yourself with key concepts#

TPU Chip and Cores

A TPU chip typically (but not always) consists of two TPU cores which share memory and can be thought of as one large accelerator with twice the compute capacity (FLOPs). Inference-optimized chips like the TPU v5e and Trillium (TPU v6e) only have one TPU core per chip.

https://jax-ml.github.io/scaling-book/assets/img/cores.png — A TPU Chip (orange box) with two cores (white boxes)#

Where are these terms typically used?

Chips: To define pricing, e.g., price-per-chip-hour. Chips are also the equivalent of “devices” in JAX terminology.
Cores: To create VMs. For example, you may request a v5e-8, which requests 8 cores (for v5e, this is also equal to 8 chips), or a v5p-8, which creates a VM with 8 cores (4 chips).

TPU Type	Number of cores
TPU v4	2x per chip
TPU v5p	2x per chip
TPU v5e	1x per chip
Trillium	1x per chip
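As a quick illustration of the chip/core distinction, the sketch below converts an accelerator type such as v5p-8 into a chip count using the cores-per-chip values from the table above. The function name and the dictionary keys (`v4`, `v6e` for Trillium) are choices made for this example, and it assumes the numeric suffix counts cores, as described above.

```python
# Cores per chip, copied from the table above (Trillium is TPU v6e).
CORES_PER_CHIP = {"v4": 2, "v5p": 2, "v5e": 1, "v6e": 1}


def chips_in(accelerator_type: str) -> int:
    """Return the number of chips behind an accelerator type like 'v5p-8'.

    Assumes the suffix is a core count, as described in this section.
    """
    generation, _, cores = accelerator_type.partition("-")
    if generation not in CORES_PER_CHIP or not cores.isdigit():
        raise ValueError(f"unrecognized accelerator type: {accelerator_type!r}")
    return int(cores) // CORES_PER_CHIP[generation]


print(chips_in("v5e-8"))  # 8 cores at 1 core per chip
print(chips_in("v5p-8"))  # 8 cores at 2 cores per chip
```

Since pricing is per chip-hour, the chip count is the number to multiply by when estimating cost.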

Slices and Topology

You may see the terms topology and slice in Google Cloud TPU documentation. Topology refers to TPU networking, i.e., how the chips are connected to each other with high-speed inter-chip interconnects (ICI). A collection of chips connected to each other directly with ICI is called a slice.

The performance-optimized “p-series” (v4, v5p) are interconnected in a 3D topology. The efficient “e-series” (v5e, v6e) are interconnected in a 2D topology.

https://jax-ml.github.io/scaling-book/assets/img/subslices.png — A slice with 3D topology. Left: 2x2x1 topology; Right-top: 2x2x2; Right-bottom: 2x2x4#

https://jax-ml.github.io/scaling-book/assets/img/more-subslices.png — A slice with 2D topology. Right-top: 4x8; Right-bottom: 4x4#

Tip

For most users, the simplest way to get started is to create a VM by specifying the number of cores you want (e.g., v5e-8) without specifying a slice shape. This will automatically create the largest slice possible.

Recap: A v5p-128 slice = 128 cores = 64 chips = 4x4x4 topology = 4x4x4 slice

If you are interested, you can read more about TPU networking here.
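The recap arithmetic can be checked with a few lines of code. This is a minimal sketch: `slice_summary` is a hypothetical helper name, and the cores-per-chip value is passed in by the caller rather than looked up.

```python
import math


def slice_summary(topology: str, cores_per_chip: int) -> dict:
    """Summarize a slice from a topology string such as '4x4x4' or '4x8'."""
    dims = [int(d) for d in topology.lower().split("x")]
    chips = math.prod(dims)  # chips in the slice = product of the dimensions
    return {"dims": len(dims), "chips": chips, "cores": chips * cores_per_chip}


# v5p has 2 cores per chip, so a 4x4x4 slice corresponds to v5p-128.
print(slice_summary("4x4x4", cores_per_chip=2))
# v5e has 1 core per chip; 4x8 is one of the 2D shapes pictured above.
print(slice_summary("4x8", cores_per_chip=1))
```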

2. Choose a TPU Type#

Tip

For most Kithara workloads, we recommend choosing TPU v5e or Trillium since they have the most availability.

Additionally, here are some recipes to map your model and workload to an appropriate TPU generation and chips:

Note

You can use this calculator if you do not want to identify the right TPU type manually using the formulas below.

Identify the total HBM required by your model using the Model Size Requirements table below.
Identify the per-chip HBM of the TPU generation you are interested in:
- TPU v4: 32GB per chip
- TPU v5e: 16GB per chip
- TPU v5p: 96GB per chip
- TPU v6e (Trillium): 32GB per chip
Calculate how many chips you will need:
- Divide total required HBM by per-chip HBM and round up to a whole number of chips
- Example: 35GB HBM required / 32GB per TPU v4 chip = 1.09 → Use 2 chips minimum
Consider topology constraints:
- TPUs are arranged in pods with specific slice configurations
- Common topologies: 2×2×1 (4 chips), 2×2×2 (8 chips), 4×4×4 (64 chips)
- Choose the next largest supported topology that meets your memory needs
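The steps above can be sketched as a small calculation. The topology list below contains only the three examples named in this section, so treat it as an illustrative subset rather than the full set of supported shapes. The function names are made up for this example.

```python
import math

# Slice sizes in chips, taken from the "common topologies" listed above.
# This is only an illustrative subset, not the full list of supported shapes.
COMMON_TOPOLOGIES = {"2x2x1": 4, "2x2x2": 8, "4x4x4": 64}


def min_chips(required_hbm_gb: float, per_chip_hbm_gb: float) -> int:
    """Round the required/per-chip ratio up to a whole number of chips."""
    return math.ceil(required_hbm_gb / per_chip_hbm_gb)


def smallest_fitting_topology(chips_needed: int) -> str:
    """Pick the smallest listed topology with at least `chips_needed` chips."""
    fitting = [(n, name) for name, n in COMMON_TOPOLOGIES.items() if n >= chips_needed]
    if not fitting:
        raise ValueError("no listed topology is large enough")
    return min(fitting)[1]


chips = min_chips(35, 32)  # the TPU v4 example above: 35 / 32 = 1.09
print(chips, smallest_fitting_topology(chips))
```

Rounding up matters because chips cannot be split: a ratio of 1.09 still needs two chips, and the slice shape may then push the count higher.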

How much total TPU HBM do I need for fine tuning my model?

Model Size Requirements#
Model size	Full Parameter	LoRA
2b	32 GB	10 GB
9b	144 GB	40 GB
27b	432 GB	124 GB
70b	1,120 GB	322 GB
405b	6,480 GB	1,863 GB

These approximations assume you are training with the default mixed precision strategy (i.e., model weights loaded in full precision, activations cast to bfloat16).

Tip

Total HBM required = Model Size + Optimizer Size + Buffer for intermediates

Model Parameters

Required Memory = (Model Size in Billions) × 4GB
Example: 2B model requires 8GB

Optimizer State

Full Fine-tuning: 3 × Model Parameter Memory
Partial Fine-tuning: (% trainable parameters) × Full Optimizer Memory
Example: 2B model
- Full fine-tuning: 24GB
- 5% partial fine-tuning: 1.2GB

Buffer for intermediates

Reserve ~10GB extra HBM memory for intermediate tensors
Memory usage scales linearly with batch size and sequence length
If experiencing Out-of-Memory (OOM):
- Reduce batch size
- Reduce sequence length
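The three components above can be combined into a rough estimator. This is a sketch of the stated rules of thumb only. The constant and function names are chosen for this example, and actual usage depends on batch size and sequence length.

```python
GB_PER_BILLION_PARAMS = 4   # full-precision weights, per the formula above
OPTIMIZER_MULTIPLIER = 3    # full fine-tuning optimizer state
BUFFER_GB = 10              # rough reserve for intermediate tensors


def estimate_hbm_gb(params_billions: float, trainable_fraction: float = 1.0) -> dict:
    """Estimate total HBM (GB) from the three components described above."""
    model = params_billions * GB_PER_BILLION_PARAMS
    optimizer = trainable_fraction * OPTIMIZER_MULTIPLIER * model
    return {
        "model": model,
        "optimizer": optimizer,
        "buffer": BUFFER_GB,
        "total": model + optimizer + BUFFER_GB,
    }


full = estimate_hbm_gb(2)            # 2B model, full fine-tuning
partial = estimate_hbm_gb(2, 0.05)   # 2B model, 5% of parameters trainable
print(full)
print(round(partial["optimizer"], 1))
```

For the 2B model, model plus optimizer memory is 8 + 24 = 32 GB, which matches the full-parameter entry in the table. The roughly 10 GB buffer comes on top of that.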

3. Choose a usage mode (consumption type)#

Cloud TPU Usage Modes#
Type	How it works	Supported versions, zones and billing	Best fit for:
Spot/Preemptible	You request TPU resources which could be preempted. Spot VMs are available at a much lower price than on-demand resources. Spot VMs might be easier to obtain than on-demand resources but can be preempted (shut down) at any time. There is no limit on runtime duration.	All versions and zones. Billing: Hourly, based on actual usage.	ML users who want to run batch / fault-tolerant workloads. Read more about Spot here.
On Demand	You request TPU resources to be used as soon as possible, for as long as you want. On-demand resources won’t be preempted, but there’s no guarantee that there will be enough available TPU resources to satisfy your request. On demand is the default when you create TPU resources.	All versions and zones. Billing: Hourly, based on actual usage.	On demand is a good fit for workloads that require a flexible end time, likely longer than 7 days.
Dynamic Workload Scheduler (DWS) - Flex Start	You request TPU resources for a specific amount of time, up to 7 days. DWS resources are delivered from a dedicated pool of capacity, so the availability of these resources is higher than on-demand.	TPU v5e, Trillium (TPU v6e) [Zones TBD]. Billing: Hourly, based on actual usage.	ML users who want short-term capacity for jobs that take less than 7 days. More about DWS is described here. Preview starting March 2025.
Reservation: 3-year	You request TPU resources in advance for a specific amount of time. These resources are reserved for your exclusive use during that period of time. Reservations provide the highest level of assurance for capacity and are cost-effective, with a lower price than on-demand resources. You can only use a reservation for TPUs if you have a committed use discount (CUD). For more information, contact Google Cloud sales.	All versions and zones. Billing: Monthly, based on reserved quota.	Reservations are ideal for long-running training jobs and inference workloads. These are priced lower than on-demand as they include 3-year committed use discounts (CUDs).
Reservation: 1-year	You request TPU resources in advance for a specific amount of time. These resources are reserved for your exclusive use during that period of time. Reservations provide the highest level of assurance for capacity and are cost-effective, with a lower price than on-demand resources. You can only use a reservation for TPUs if you have a committed use discount (CUD). For more information, contact Google Cloud sales.	All versions and zones. Billing: Monthly, based on reserved quota.	Reservations are ideal for long-running training jobs and inference workloads. These are priced lower than on-demand as they include 1-year committed use discounts (CUDs).

Note

We recommend DWS Flex Start, DWS Calendar Mode (coming soon) or Reservations for Kithara.

4. Choose a type of capacity#

Once you have decided on the usage mode, there are three ways you can secure capacity to create VMs:

Types of Capacity#
VM Type	Supported Usage Modes	Recommended For
Queued Resource	DWS, Spot, On demand	Any non-GKE usage
Google Kubernetes Engine	DWS (Coming Soon), Spot, On demand, Reservations	Any GKE Usage
Compute Engine	DWS, Spot, On demand, Reservations	Customers with reservations

Before you can create VMs, you must request quota. Read more in the next section.
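If you already know your usage mode, the table above can be read in reverse to find which capacity types support it. The sketch below simply transcribes the table. The dictionary and function names are made up for this example, and the GKE entry leaves out DWS because it is marked as coming soon.

```python
# Supported usage modes per capacity type, transcribed from the table above.
# "DWS (Coming Soon)" for GKE is left out because it is not yet available.
SUPPORTED_MODES = {
    "Queued Resource": {"DWS", "Spot", "On demand"},
    "Google Kubernetes Engine": {"Spot", "On demand", "Reservations"},
    "Compute Engine": {"DWS", "Spot", "On demand", "Reservations"},
}


def capacity_types_for(usage_mode: str) -> list:
    """List the capacity types whose table row includes `usage_mode`."""
    return sorted(name for name, modes in SUPPORTED_MODES.items() if usage_mode in modes)


print(capacity_types_for("Reservations"))
print(capacity_types_for("DWS"))
```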

5. Quota#

Once quota has been granted, you can create as many Dynamic Workload Scheduler (DWS), spot, on-demand, reservation VMs as the quota allows.

When working with Cloud TPUs, you’ll encounter quotas that govern your usage. These limits manage availability.

For example, you might have a quota on:

The number of TPUs you can create: This prevents over-provisioning and ensures resources are available for all users.
The type of TPUs you can access: Quotas are tied to a particular TPU type (e.g: v5e) and a particular VM type (spot or on-demand).

These quotas help ensure fair access, prevent abuse, and maintain the stability of the cloud platform. If your project requires more resources than your current quotas allow, you can typically request increases.

To use TPUs with GKE, a separate quota is required. GKE quota is allocated in terms of number of chips. Non-GKE quota is allocated in terms of number of cores.

Read more about quotas and how to request them here.
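Because GKE quota is counted in chips and non-GKE quota in cores, the same accelerator type can translate into different quota numbers. The following sketch illustrates the difference. `quota_needed` is a hypothetical helper, and it assumes the accelerator-type suffix is a core count, as described in step 1.

```python
# Cores per chip, from the table in step 1 (Trillium is TPU v6e).
CORES_PER_CHIP = {"v4": 2, "v5p": 2, "v5e": 1, "v6e": 1}


def quota_needed(accelerator_type: str, gke: bool) -> tuple:
    """Return (amount, unit) of quota for an accelerator type like 'v5p-8'.

    GKE quota is counted in chips and non-GKE quota in cores, as stated above.
    """
    generation, _, suffix = accelerator_type.partition("-")
    cores = int(suffix)
    if gke:
        return cores // CORES_PER_CHIP[generation], "chips"
    return cores, "cores"


print(quota_needed("v5p-8", gke=False))
print(quota_needed("v5p-8", gke=True))
```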

6. Provision Capacity#

Once you have determined the type of TPU, type of capacity and usage mode, you are ready to provision capacity.

Steps to create VMs#
Queued Resource →	DWS Flex Start	On demand	Spot	Reservation
GKE VM →	N/A	On demand	Spot (Add `--spot` to the on-demand command)	Reservation (Add `--reservation` and `--reservation-affinity=specific` flags to the on-demand command)
Compute Engine VM →	N/A	On demand	Spot (Add `--spot` to the on-demand command)	Reservation (Add `--reserved` to the on-demand command)

Create DWS (Flex Start) VMs [Coming soon by end of March 2025]#

Flex Start uses the TPU queued resources API to request TPU resources in a queued manner. When the requested resource becomes available, it’s assigned to your Google Cloud project for your immediate, exclusive use. After the requested run duration, the TPU VMs are deleted and the queued resource moves to the SUSPENDED state. For more information about queued resources, see Manage queued resources.

To request TPUs using Flex Start, use the gcloud alpha compute tpus queued-resources create command with the --provisioning-model flag set to FLEX-START and the --max-run-duration flag set to the duration you want your TPUs to run.

gcloud alpha compute tpus queued-resources create \
<your-queued-resource-id> \
--zone=<your-zone> \
--accelerator-type=<your-accelerator-type> \
--runtime-version=<your-runtime-version> \
--node-id=<your-node-id> \
--provisioning-model=FLEX-START \
--max-run-duration=<run-duration>

Replace the following placeholders:

<your-queued-resource-id>: A user-assigned ID for the queued resource request.
<your-zone>: The zone in which to create the TPU VM.
<your-accelerator-type>: Specifies the version and size of the Cloud TPU to create. For more information about supported accelerator types for each TPU version, see TPU versions.
<your-runtime-version>: The Cloud TPU software version.
<your-node-id>: A user-assigned ID for the TPU that is created when the queued resource request is allocated.
<run-duration>: How long the TPUs should run. Format the duration as the number of days, hours, minutes, and seconds followed by d, h, m, and s, respectively. For example, specify 72h for a duration of 72 hours, or specify 1d2h3m4s for a duration of 1 day, 2 hours, 3 minutes, and 4 seconds. The maximum is 7 days.
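If you script your requests, a small helper can assemble the command and format the run duration. This is a sketch, not an official client. It uses only the flags shown above, the helper names are made up for this example, and every argument value in the usage lines is a placeholder.

```python
import shlex
from datetime import timedelta

MAX_RUN_SECONDS = 7 * 86400  # Flex Start maximum of 7 days, as stated above


def format_duration(d: timedelta) -> str:
    """Format a timedelta as e.g. '1d2h3m4s', omitting zero-valued units."""
    total = int(d.total_seconds())
    if not 0 < total <= MAX_RUN_SECONDS:
        raise ValueError("run duration must be between 1 second and 7 days")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (seconds, "s")]
    return "".join(f"{value}{unit}" for value, unit in parts if value)


def flex_start_command(qr_id, zone, accelerator_type, runtime_version, node_id, run_for):
    """Assemble the queued-resources create command shown above as an argv list."""
    return [
        "gcloud", "alpha", "compute", "tpus", "queued-resources", "create", qr_id,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--runtime-version={runtime_version}",
        f"--node-id={node_id}",
        "--provisioning-model=FLEX-START",
        f"--max-run-duration={format_duration(run_for)}",
    ]


# All argument values below are placeholders; substitute your own.
cmd = flex_start_command(
    "my-queued-resource", "my-zone", "v6e-8", "tpu-ubuntu2204-base",
    "my-node", timedelta(hours=72),
)
print(shlex.join(cmd))
```

Building the command as a list makes it safe to pass to `subprocess.run` without shell quoting issues, and validating the duration locally catches requests over the 7-day limit before they are submitted.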

You can further customize your queued resource request to run at specific times with additional flags:

--valid-after-duration: The duration before which the TPU must not be provisioned.
--valid-after-time: The time before which the TPU must not be provisioned.
--valid-until-duration: The duration for which the request is valid. If the request hasn’t been fulfilled by this duration, the request expires and moves to the FAILED state.
--valid-until-time: The time for which the request is valid. If the request hasn’t been fulfilled by this time, the request expires and moves to the FAILED state.

For more information about optional flags, see the gcloud alpha compute tpus queued-resources create documentation.

Get the status of a Flex Start request#

To monitor the status of your Flex Start request, use the gcloud alpha compute tpus queued-resources describe command:

gcloud alpha compute tpus queued-resources describe <your-queued-resource-id> \
--zone <your-zone>

A queued resource can be in one of the following states:

WAITING_FOR_RESOURCES: The request has passed initial validation and has been added to the queue.
PROVISIONING: The request has been selected from the queue and its TPU VMs are being created.
ACTIVE: The request has been fulfilled, and the VMs are ready.
FAILED: The request could not be completed. Use the describe command for more details.
SUSPENDING: The resources associated with the request are being deleted.
SUSPENDED: The resources specified in the request have been deleted.

For more information, see Retrieve state and diagnostic information about a queued resource request.
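A script that waits for capacity can poll the state until the request becomes ACTIVE. The sketch below shows only the control flow. `get_state` is a hypothetical callable that you would implement yourself (for example, by parsing the output of the describe command), and the grouping of states into "ready" and "finished" is an interpretation of the list above.

```python
import time

# States from the list above, grouped by what a caller would typically do next.
READY = {"ACTIVE"}
FINISHED = {"FAILED", "SUSPENDING", "SUSPENDED"}


def wait_until_active(get_state, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll `get_state()` until the request is ACTIVE.

    Returns True when ACTIVE and False if a finished state is reached.
    Any other state (e.g. WAITING_FOR_RESOURCES, PROVISIONING) is treated
    as still pending. Raises TimeoutError after `max_polls` attempts.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in READY:
            return True
        if state in FINISHED:
            return False
        sleep(poll_seconds)
    raise TimeoutError("queued resource did not become ACTIVE in time")


# Simulated sequence of states, standing in for real describe calls.
states = iter(["WAITING_FOR_RESOURCES", "PROVISIONING", "ACTIVE"])
print(wait_until_active(lambda: next(states), sleep=lambda s: None))
```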

Monitor the run time of Flex Start TPUs#

You can monitor the run time of Flex Start TPUs by checking the TPU’s termination timestamp:

Get the details of your queued resource request using the steps in the previous section, Get the status of a Flex Start request.
- If the queued resource is waiting for resources: In the output, see the maxRunDuration field. This field specifies how long the TPUs will run once they’re created.
- If the TPUs associated with the queued resource have been created: In the output, see the terminationTimestamp field listed for each node in the queued resource. This field specifies when the TPU will be terminated.
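Once you have the terminationTimestamp value, the remaining run time is a simple subtraction. The sketch below assumes the timestamp is an RFC 3339 UTC string with whole seconds (for example, 2025-03-20T12:00:00Z). The actual format of the field may differ, and the timestamps used here are made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def time_remaining(termination_timestamp: str, now: Optional[datetime] = None) -> timedelta:
    """Return how long is left before `termination_timestamp`.

    Assumes an RFC 3339 UTC timestamp with whole seconds, such as
    '2025-03-20T12:00:00Z'; the actual field format may differ.
    """
    end = datetime.fromisoformat(termination_timestamp.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    # Never report a negative remainder once the TPU has been terminated.
    return max(end - now, timedelta(0))


# Made-up timestamps for illustration only.
now = datetime(2025, 3, 18, 12, 0, tzinfo=timezone.utc)
print(time_remaining("2025-03-20T12:00:00Z", now))
```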

Delete a queued resource#

Important: Queued resources consume quota regardless of their state. Delete queued resources after use to avoid blocking future requests on quota limits.

You can delete a queued resource request together with its associated TPUs by passing the --force flag to the queued-resources delete command:

gcloud alpha compute tpus queued-resources delete <your-queued-resource-id> \
--zone <your-zone> \
--force

If you delete the TPU directly, you also need to delete the queued resource, as shown in the following example. When you delete the TPU, the queued resource request transitions to the SUSPENDED state, after which you can delete the queued resource request.

To delete a TPU, use the gcloud alpha compute tpus tpu-vm delete command:

gcloud alpha compute tpus tpu-vm delete <your-node-id> \
--zone <your-zone>

Then, to delete the queued resource, use the gcloud alpha compute tpus queued-resources delete command:

gcloud alpha compute tpus queued-resources delete <your-queued-resource-id> \
--zone <your-zone>

For more information, see Delete a queued resource request.