Serve with vLLM
The fine-tuned model can be served on vLLM:GPU or vLLM:TPU.
Exporting your model
To serve your fine-tuned model, first make sure that your output checkpoints are exported to a GCS bucket or persistent volume. You can do this by overriding the model_output_dir parameter in the config YAML:
model_output_dir: "gs://bucket_name/ckpt/"
Starting vLLM with a Fine-Tuned Model
Next, start your vLLM deployment with the output volume from the previous step attached as a mounted drive. Please visit our GCS guide (link TBD) for details.
You can start vLLM with a path to the directory containing the model files. For example, if you mounted the volume at /model and the model checkpoints are located under the checkpoint directory:
vllm serve /model/checkpoint --tensor-parallel-size 8 --max-model-len 4096
Set --tensor-parallel-size to the number of TPU chips available. You can adjust --max-model-len depending on the available HBM.
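If the server is launched from a script rather than by hand, the command line can be assembled from the chip count. The following is a minimal sketch, not part of vLLM: `build_serve_command` is a hypothetical helper name, and it only uses the `--tensor-parallel-size` and `--max-model-len` flags of the `vllm serve` CLI.

```python
import shlex


def build_serve_command(model_dir: str, num_chips: int, max_model_len: int = 4096) -> list[str]:
    """Assemble a `vllm serve` argument list (hypothetical helper, not a vLLM API)."""
    if num_chips < 1:
        raise ValueError("num_chips must be at least 1")
    return [
        "vllm", "serve", model_dir,
        # One tensor-parallel shard per available chip.
        "--tensor-parallel-size", str(num_chips),
        # Lower this value if the model does not fit in the available HBM.
        "--max-model-len", str(max_model_len),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the example paths in this guide.
    print(shlex.join(build_serve_command("/model/checkpoint", num_chips=8)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues if the mount path contains spaces.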
If the model loads successfully, you should see output similar to the following in the console:
INFO 03-05 23:24:13 api_server.py:756] Using supplied chat template:
INFO 03-05 23:24:13 api_server.py:756] None
INFO 03-05 23:24:13 launcher.py:21] Available routes are:
INFO 03-05 23:24:13 launcher.py:29] Route: /openapi.json, Methods: GET, HEAD
INFO 03-05 23:24:13 launcher.py:29] Route: /docs, Methods: GET, HEAD
INFO 03-05 23:24:13 launcher.py:29] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-05 23:24:13 launcher.py:29] Route: /redoc, Methods: GET, HEAD
INFO 03-05 23:24:13 launcher.py:29] Route: /health, Methods: GET
INFO 03-05 23:24:13 launcher.py:29] Route: /ping, Methods: POST, GET
INFO 03-05 23:24:13 launcher.py:29] Route: /tokenize, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /detokenize, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/models, Methods: GET
INFO 03-05 23:24:13 launcher.py:29] Route: /version, Methods: GET
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /pooling, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /score, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/score, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /rerank, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 03-05 23:24:13 launcher.py:29] Route: /invocations, Methods: POST
INFO: Started server process [67252]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
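Loading a large checkpoint can take a while, so automated deployments may prefer to wait for the server rather than watch the log. The sketch below polls the `/health` route listed in the log above until it answers with HTTP 200. The function name, the default base URL, and the timeout values are illustrative assumptions, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url: str = "http://localhost:8000",
                       timeout_s: float = 600.0,
                       interval_s: float = 5.0) -> bool:
    """Poll GET /health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet, or returned an error; retry.
            pass
        time.sleep(interval_s)
    return False
```

A caller could run `wait_until_healthy()` after starting `vllm serve` and only begin sending requests once it returns `True`.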
Send a request to the vLLM server to verify that it's working:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/checkpoint",
    "prompt": "your prompt",
    "max_tokens": 256,
    "temperature": 0.7
  }'
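The same check can be made from Python using only the standard library. This is a sketch under a few assumptions: the helper names are hypothetical, the server listens on `localhost:8000` as in the log above, and the model name is the same path that was passed to `vllm serve`.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 256,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, **kwargs) -> str:
    """Send the request and return the text of the first completion choice."""
    req = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        result = json.load(resp)
    # Completion responses carry the generated text in choices[0].text.
    return result["choices"][0]["text"]
```

With the server running, `complete("/model/checkpoint", "your prompt")` should return the generated text. If the server reports that the model does not exist, a GET on `/v1/models` lists the model names it accepts.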