How I Actually Deployed vLLM on Kubernetes (And the Reality of a GTX 1080)
vLLM is not hard to start. It is hard to run well on limited hardware. I learned this when I tried to deploy it on my GTX 1080 and realized most models would not even fit in memory.
I have watched a lot of vLLM discussions on r/LocalLLaMA and r/MachineLearning. The same failure patterns come up: tensor parallelism mismatch, CUDA out of memory, model downloads on pod startup, NCCL errors across nodes. I used to think these were edge cases. Then I tried to deploy vLLM myself and hit the memory wall immediately.
This post is how I actually deployed vLLM on Kubernetes. It is opinionated and assumes you have already proven that self-hosted inference makes sense for your workload. I had not proven that. I just wanted to see if I could do it.
What vLLM Is Actually For
vLLM is for serving many concurrent users from a small number of GPUs. Its advantages (continuous batching, PagedAttention, tensor parallelism) only matter when you have concurrency and GPU pressure.
If you have one user, Ollama is probably better. If you have ten users, maybe still Ollama. If you have a hundred concurrent requests or strict latency requirements, vLLM starts to pull ahead. I have one user. Myself. I was optimizing something I did not need yet. I do that sometimes.
The Hardware Reality
My homelab has a GTX 1080 with 8GB of VRAM. This is not a lot for modern LLMs. Here is what actually fits:
| Model | Size | Fits in 8GB? | Notes |
|---|---|---|---|
| Llama 3.1 8B | ~4GB (Q4) | Yes | The largest model I can run comfortably |
| Llama 3.1 70B | ~35GB (Q4) | No | Would need multiple GPUs or cloud |
| Mistral 7B | ~4GB (Q4) | Yes | Good alternative to Llama |
| CodeLlama 13B | ~7GB (Q4) | Maybe | Borderline, might OOM during long contexts |
I cannot run the models that vLLM is designed for. The whole point of vLLM is to serve large models efficiently. My GPU cannot hold large models. This was the first sign that I was using the wrong tool.
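The sizes in the table follow from simple arithmetic: parameter count times bytes per weight. Here is a minimal sketch of that estimate, assuming nominal parameter counts and exactly 4 bits per weight. It covers the weights only, so the KV cache and runtime overhead come on top. The function name `weight_size_gb` is my own, not part of any library.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight


VRAM_GB = 8.0  # GTX 1080

# Nominal parameter counts (assumption: real checkpoints differ slightly).
models = [
    ("Llama 3.1 8B", 8),
    ("Mistral 7B", 7),
    ("CodeLlama 13B", 13),
    ("Llama 3.1 70B", 70),
]

for name, params in models:
    size = weight_size_gb(params, bits_per_weight=4)
    print(f"{name}: ~{size:.1f} GB of weights, {VRAM_GB - size:+.1f} GB to spare on an 8 GB card")
```

A negative number in the last column means the weights alone do not fit. A small positive number, as with the 13B model, means they fit but leave little room for anything else.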
The Deployment That Actually Worked
My baseline Kubernetes deployment looks like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      runtimeClassName: nvidia
      nodeSelector:
        gpu-type: nvidia
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - meta-llama/Meta-Llama-3-8B-Instruct
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.75"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: vllm-models
```

Notice `--gpu-memory-utilization` is 0.75, not 0.90. On an 8GB card, I need more headroom. I started at 0.90 and got OOMKills during the first request. I dropped to 0.75 and it worked. Barely.
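To see what the jump from 0.90 to 0.75 means in gigabytes, here is a rough sketch of the split. It assumes the ~4GB weight size from the hardware table and a guessed 0.5GB allowance for activations and runtime buffers. vLLM measures the real numbers itself at startup, so treat `memory_split_gb` and its `overhead_gb` default as illustrative only.

```python
def memory_split_gb(total_vram_gb, gpu_memory_utilization, weights_gb, overhead_gb=0.5):
    """Rough split of GPU memory for a single vLLM process.

    overhead_gb is a guessed allowance for activations and runtime buffers,
    not a value reported by vLLM.
    """
    vllm_budget = total_vram_gb * gpu_memory_utilization  # what vLLM may claim
    kv_cache = vllm_budget - weights_gb - overhead_gb     # what is left for the KV cache
    outside_vllm = total_vram_gb - vllm_budget            # what everything else gets
    return {
        "vllm_budget": vllm_budget,
        "kv_cache": kv_cache,
        "outside_vllm": outside_vllm,
    }


for util in (0.90, 0.75):
    split = memory_split_gb(total_vram_gb=8.0, gpu_memory_utilization=util, weights_gb=4.0)
    print(
        f"utilization {util:.2f}: "
        f"vLLM budget {split['vllm_budget']:.1f} GB, "
        f"KV cache ~{split['kv_cache']:.1f} GB, "
        f"left outside vLLM {split['outside_vllm']:.1f} GB"
    )
```

At 0.90 the card keeps only about 0.8GB for anything that is not vLLM. At 0.75 it keeps about 2GB. That is consistent with the crash I saw, although this arithmetic alone cannot confirm the cause.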
The Most Common Mistakes (That I Actually Made)
- Assuming vLLM would magically make my GPU bigger. It does not. vLLM optimizes throughput. It does not create VRAM. I spent an hour trying to load a 13B model before I checked the memory requirements.
- Model download on startup. I did not pre-stage the model weights. The pod started, saw it needed Meta-Llama-3-8B-Instruct, and began downloading from HuggingFace. My internet is not fast. The pod took 15 minutes to become ready. During that time, Kubernetes killed it twice for failing its health checks. I fixed this by downloading the model to the PVC first, then pointing vLLM at the local path.
- No GPU memory headroom. I mentioned this already. Pushing `--gpu-memory-utilization` to 0.90 works on a 3090. On a 1080, it crashes immediately.
- Wrong quantization format. vLLM expects HuggingFace safetensors, not GGUF. I had a GGUF version of the model from my Ollama setup. vLLM refused to load it. I had to re-download the safetensors version. That was another 20 minutes.
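Two of these mistakes (weights not pre-staged, wrong file format) can be caught before the pod ever starts. Below is a small preflight sketch that inspects a local model directory. It assumes a HuggingFace-style layout with a `config.json` next to `*.safetensors` files. The function name and the example path are hypothetical, not something vLLM provides.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} does not exist or is not a directory"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")

    has_safetensors = any(root.glob("*.safetensors"))
    has_gguf = any(root.glob("*.gguf"))
    if not has_safetensors:
        if has_gguf:
            problems.append("only GGUF weights found; expected .safetensors files")
        else:
            problems.append("no .safetensors weight files found")
    return problems


# Example usage (hypothetical path on the mounted PVC):
#   problems = check_model_dir("/models/Meta-Llama-3-8B-Instruct")
#   if problems:
#       raise SystemExit("model dir not ready: " + "; ".join(problems))
```

Run as an init step, a check like this fails fast with a readable message instead of leaving the main container to discover the problem minutes into startup.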
What I Actually Watch
Four metrics cover most of what matters:
- TTFT: time to first token. Tells you how fast new requests start. On my 1080 with an 8B model, this is about 200ms. Not great, but usable.
- TPOT: time per output token. Tells you generation speed. About 50ms per token for short prompts. Slower than a 3090, but expected.
- Queue depth: how many requests are waiting. I rarely see this above 1 because I am the only user.
- GPU utilization: whether the GPU is actually busy. It spikes to 95% during generation and drops to 0% between requests. This is normal for a single user.
If queue depth grows while GPU utilization is low, your batching is wrong or your concurrency is too low. If GPU utilization is high and latency is bad, you need more GPUs or a smaller model. In my case, queue depth never grows because I am the only user. I was measuring metrics for a workload I did not have.
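TTFT and TPOT are easy to compute yourself if you record when each streamed token arrives. Here is a minimal sketch, assuming you already have the send time and a list of token arrival timestamps in seconds. The function name and the sample timestamps are made up for illustration.

```python
def ttft_and_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Compute time to first token and mean time per output token, in seconds.

    request_sent: timestamp when the request was sent.
    token_times:  arrival timestamp of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    ttft = token_times[0] - request_sent

    # TPOT averages the gaps between tokens after the first one.
    if len(token_times) == 1:
        tpot = 0.0
    else:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps: first token after 200 ms, then one token every 50 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.20, 0.25, 0.30, 0.35])
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms")
```

Keeping the first token out of the TPOT average matters: it includes prompt processing and queueing, which is exactly what TTFT is meant to isolate.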
What I Actually Use vLLM For
After all this setup, I use vLLM for exactly one thing: benchmarking. I wanted to compare vLLM throughput to Ollama under load. I used a simple Python script to send concurrent requests and measured the results.
| Concurrent Requests | Ollama (tokens/sec) | vLLM (tokens/sec) |
|---|---|---|
| 1 | 12 | 15 |
| 2 | 10 | 14 |
| 4 | 7 | 13 |
| 8 | 4 | 12 |
For a single user, the difference is negligible. For concurrent users, vLLM is clearly better. But I do not have concurrent users. I have me.
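For anyone who wants to run a similar comparison, here is a sketch of a concurrent load generator. It is not the exact script I used. It assumes the server exposes an OpenAI-compatible `/v1/completions` endpoint on `localhost:8000` and returns a `usage.completion_tokens` field. The URL, model name, and `max_tokens` value are assumptions to adjust for your setup. It reports aggregate tokens per second across all requests.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def completion_tokens(prompt: str,
                      url: str = "http://localhost:8000/v1/completions",  # assumed endpoint
                      model: str = "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model name
                      max_tokens: int = 128) -> int:
    """Send one completion request and return how many tokens were generated."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def measure_throughput(send, prompts, concurrency: int) -> float:
    """Run send(prompt) for every prompt with a fixed number of worker threads.

    Returns aggregate generated tokens per second of wall-clock time.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


# Example usage against a running server:
#   prompts = ["Explain PagedAttention in one paragraph."] * 16
#   for n in (1, 2, 4, 8):
#       print(n, round(measure_throughput(completion_tokens, prompts, n), 1))
```

Passing the sender in as a function keeps the timing logic independent of the backend, so the same harness can point at Ollama by swapping in a different `send`.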
I keep vLLM running because I might need it someday. But honestly, Ollama is fine for my actual workload. vLLM was an optimization for a problem I do not have yet, on hardware that cannot support it.
Conclusion
vLLM is the right inference engine for high-throughput, multi-user self-hosted AI. The deployment is not conceptually hard, but the details matter. On a GTX 1080, the details are mostly about memory constraints.
Run Ollama until you have a reason not to. When that reason is real, and you have the hardware to support it, vLLM is waiting. Do not deploy vLLM just because it is “production-grade.” Deploy it when you have production-grade load and production-grade hardware.
The GTX 1080 taught me that not every GPU is meant for LLM serving. It is a great card for learning, experimenting, and small models. It is not a production inference server. I spent a day learning that. You do not have to.