Running quantized LLMs behind a FastAPI proxy with Ollama and vLLM backends. Covers model quantization tradeoffs, GGUF vs GPTQ vs AWQ, streaming responses, request queuing, Docker Compose deployment, and production monitoring.
Tyler McDaniel
AI Engineer & IBM Business Partner
Calling OpenAI's API is fine until it isn't. Maybe your data can't leave your network. Maybe you're burning $2,000/month on tokens that a quantized local model handles for 80% of your use cases. Maybe you just want to stop building on someone else's rate limits. Whatever the reason, self-hosting LLMs with FastAPI is the production path I keep coming back to — and after running local inference for eight clients across three GPU configurations, this is the guide that covers what the tutorials skip.
Self-hosting LLMs with FastAPI gives you a clean, async HTTP interface in front of whatever serving backend you choose. The model runs behind a backend like [Ollama](https://ollama.com/), [vLLM](https://docs.vllm.ai/), or [llama.cpp server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server). FastAPI handles routing, authentication, request queuing, streaming, and health checks. You ship a Docker container, not a notebook.
The model choice isn't about "which chatbot is best." It's about fitting your quality requirements into your VRAM budget.
| Model | Parameters | VRAM (FP16) | VRAM (Q4_K_M) | Quality Tier | Best For |
|-------|-----------|-------------|---------------|-------------|----------|
| Llama 3.1 8B | 8B | 16 GB | 5 GB | Good for structured tasks | Classification, extraction, simple RAG |
| Mistral 7B v0.3 | 7B | 14 GB | 5 GB | Good general purpose | Chat, summarization, code assist |
| Llama 3.1 70B | 70B | 140 GB | 40 GB | Excellent | Complex reasoning, long-form generation |
| Qwen 2.5 72B | 72B | 144 GB | 42 GB | Excellent, strong multilingual | Same as 70B + CJK languages |
| Phi-3.5 Mini | 3.8B | 8 GB | 3 GB | Decent for size | Edge deployment, low-resource servers |
| DeepSeek-R1 | 671B (MoE) | Massive (multi-node) | ~160 GB active | State-of-the-art reasoning | Complex chain-of-thought, math |
The quantization sweet spot for production is Q4_K_M or Q5_K_M in GGUF format. Q4_K_M gives you roughly 80-85% of FP16 quality at 25-30% of the VRAM. Below Q4, quality degrades noticeably on nuanced tasks. Above Q5, the VRAM increase isn't justified by the quality gain for most applications.
My rule of thumb: if you have a single consumer GPU (RTX 4090 with 24 GB VRAM), you're running 7-8B models at Q4_K_M or 70B models with heavy quantization. If you have an A100 (80 GB), you can run 70B at Q5_K_M comfortably. Two A100s open up everything.
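To sanity-check the table and the rule of thumb yourself, the weight footprint is just parameters times bits per weight. A rough sketch (the `estimate_weight_vram_gb` helper is my own; it ignores runtime overhead and KV-cache, which come on top):

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw VRAM for model weights: params * bits / 8.

    bits_per_weight: 16 for FP16, ~4.5 for Q4_K_M, ~5.5 for Q5_K_M.
    Real usage adds 10-20% runtime overhead, plus KV-cache.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# An 8B model: 16 GB at FP16, ~4.5 GB at Q4_K_M -- in line with the table above
print(estimate_weight_vram_gb(8, 16))   # 16.0
print(estimate_weight_vram_gb(8, 4.5))  # 4.5
```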
A note on GGUF vs GPTQ vs AWQ. These are different quantization formats, and which one you use depends on your serving backend:

- GGUF is llama.cpp's file format. It packs weights and metadata into a single file, supports the full K-quant ladder (Q2 through Q8), and runs on CPU and Apple Silicon as well as GPUs. Ollama and llama.cpp server consume it.
- GPTQ is a GPU-oriented post-training quantization method. vLLM and TGI load it, but it requires a CUDA-capable GPU.
- AWQ (activation-aware weight quantization) preserves the weights that matter most to activations, which tends to hold quality better than GPTQ at 4-bit. It's also GPU-only, served by vLLM and TGI.

For practical purposes: if you're using Ollama, you use GGUF. If you're using vLLM, you use AWQ or GPTQ. Don't fight the ecosystem.
Your FastAPI app doesn't run inference directly. It proxies to a serving backend that manages the model, handles batching, and talks to the GPU.
| Backend | Quantization Support | Continuous Batching | GPU Utilization | Production Ready | API Style |
|---------|---------------------|-------------------|----------------|-----------------|-----------|
| [Ollama](https://ollama.com/) | GGUF (Q2–Q8) | Limited | Moderate | Yes | REST (OpenAI-compatible) |
| [vLLM](https://docs.vllm.ai/) | GPTQ, AWQ, FP16 | Yes (PagedAttention) | Excellent | Yes | OpenAI-compatible |
| [llama.cpp server](https://github.com/ggerganov/llama.cpp) | GGUF (all quants) | Yes (recent versions) | Good | Yes (mature) | REST |
| [TGI](https://huggingface.co/docs/text-generation-inference/) | GPTQ, AWQ, FP16 | Yes | Excellent | Yes | REST + gRPC |
Ollama is the easiest to get running. `ollama pull llama3.1:8b` and you have a model serving on port 11434. The API is OpenAI-compatible, so your FastAPI proxy barely needs to transform payloads. Downside: batching is limited, and throughput under concurrent load lags significantly behind vLLM.
vLLM is the throughput king. [PagedAttention](https://arxiv.org/abs/2309.06180) manages KV-cache like virtual memory, which means it serves 3-5x more concurrent requests per GPU than naive implementations. If you're building anything that handles more than 10 concurrent users, vLLM is the answer.
llama.cpp server strikes the middle ground. It runs GGUF models on anything — CPU-only, Apple Silicon, NVIDIA, AMD. Continuous batching support landed in late 2024 and has gotten solid. I use it when I need to support heterogeneous hardware and don't want to maintain separate deployments.

For this guide I'll show Ollama first for simplicity, then a vLLM alternative; the FastAPI layer I'm building is backend-agnostic.
Here's the full FastAPI application. This is production code, not a tutorial fragment.
import asyncio
import hashlib
import hmac
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from typing import Optional
import httpx
from fastapi import FastAPI, HTTPException, Request, Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
OLLAMA_BASE = "http://localhost:11434"
API_KEY_HASH = hashlib.sha256(b"your-secret-key-change-this").hexdigest()
MAX_CONCURRENT = 10
semaphore = asyncio.Semaphore(MAX_CONCURRENT)
http_client: Optional[httpx.AsyncClient] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global http_client
http_client = httpx.AsyncClient(
base_url=OLLAMA_BASE,
timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
)
# Verify backend is reachable
try:
resp = await http_client.get("/api/tags")
resp.raise_for_status()
except httpx.ConnectError:
raise RuntimeError(f"Cannot reach Ollama at {OLLAMA_BASE}")
yield
await http_client.aclose()
app = FastAPI(title="LLM Proxy", version="1.0.0", lifespan=lifespan)
class ChatMessage(BaseModel):
role: str = Field(pattern=r"^(system|user|assistant)$")
content: str = Field(min_length=1, max_length=32_000)
class ChatRequest(BaseModel):
model: str = Field(default="llama3.1:8b", max_length=100)
messages: list[ChatMessage] = Field(min_length=1, max_length=50)
stream: bool = False
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
max_tokens: int = Field(default=2048, ge=1, le=8192)
class ChatChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: str
class ChatResponse(BaseModel):
id: str
model: str
choices: list[ChatChoice]
usage: dict
def verify_api_key(request: Request) -> None:
auth = request.headers.get("Authorization", "")
if not auth.startswith("Bearer "):
raise HTTPException(status_code=401, detail="Missing Bearer token")
token = auth[7:]
token_hash = hashlib.sha256(token.encode()).hexdigest()
if not hmac.compare_digest(token_hash, API_KEY_HASH):
raise HTTPException(status_code=401, detail="Invalid API key")
@app.get("/health")
async def health():
try:
resp = await http_client.get("/api/tags")
models = resp.json().get("models", [])
return {
"status": "healthy",
"backend": "ollama",
"models_loaded": [m["name"] for m in models],
"timestamp": time.time(),
}
except Exception as e:
raise HTTPException(status_code=503, detail=f"Backend unhealthy: {e}")
@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completions(req: ChatRequest):
async with semaphore:
payload = {
"model": req.model,
"messages": [m.model_dump() for m in req.messages],
"stream": req.stream,
"options": {
"temperature": req.temperature,
"num_predict": req.max_tokens,
},
}
if req.stream:
return StreamingResponse(
stream_response(payload),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)
resp = await http_client.post("/api/chat", json=payload)
if resp.status_code != 200:
raise HTTPException(status_code=502, detail="Backend error")
data = resp.json()
return ChatResponse(
id=f"chatcmpl-{int(time.time())}",
model=req.model,
choices=[
ChatChoice(
index=0,
message=ChatMessage(
role="assistant",
content=data["message"]["content"],
),
finish_reason="stop",
)
],
usage={
"prompt_tokens": data.get("prompt_eval_count", 0),
"completion_tokens": data.get("eval_count", 0),
"total_tokens": data.get("prompt_eval_count", 0)
+ data.get("eval_count", 0),
},
)
async def stream_response(payload: dict) -> AsyncGenerator[str, None]:
async with http_client.stream("POST", "/api/chat", json=payload) as resp:
async for line in resp.aiter_lines():
if not line:
continue
yield f"data: {line}\n\n"
yield "data: [DONE]\n\n"
@app.get("/v1/models", dependencies=[Depends(verify_api_key)])
async def list_models():
resp = await http_client.get("/api/tags")
models = resp.json().get("models", [])
return {
"data": [
{
"id": m["name"],
"object": "model",
"owned_by": "local",
}
for m in models
]
}
Key design decisions:
- Bounded concurrency: asyncio.Semaphore(10) caps concurrent backend calls. Excess requests wait in the asyncio event loop instead of crashing the backend.
- OpenAI-compatible surface: the proxy exposes /v1/chat/completions with the same request/response shape. Any client library that talks to OpenAI can point at your server by changing the base URL: openai.OpenAI(base_url="http://your-server:8000/v1") just works.
- Streaming: the stream_response generator reads from Ollama's streaming endpoint and re-emits as Server-Sent Events. Clients get token-by-token output without the proxy holding the entire response in memory.
- Authentication: hmac.compare_digest prevents timing attacks. The key is hashed at startup so the plaintext never sits in server memory after init. In production, pull the hash from an environment variable or secrets manager, not a hardcoded string.
- Health checks: /health verifies the backend is responsive and returns loaded models. Your container orchestrator (Docker Compose, Kubernetes) should poll this.

Here's a production-ready Docker Compose stack. Save as docker-compose.yml:
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
api:
build: .
ports:
- "8000:8000"
environment:
- OLLAMA_BASE=http://ollama:11434
depends_on:
ollama:
condition: service_healthy
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 15s
timeout: 5s
retries: 3
volumes:
ollama_data:
And the Dockerfile:
FROM python:3.12-slim

WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn httpx pydantic
COPY main.py .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
One worker. Not two, not four. LLM inference is GPU-bound, not CPU-bound. Multiple Uvicorn workers don't help — they just multiply memory usage. If you need to scale, run multiple replicas behind a load balancer, each with its own GPU allocation.
Pull and preload your model after the stack is up:
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b
Test it:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer your-secret-key-change-this" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Explain TCP in one sentence."}]
}'
When concurrent users climb past single digits, Ollama's lack of continuous batching becomes the bottleneck. vLLM's [PagedAttention](https://arxiv.org/abs/2309.06180) mechanism manages KV-cache blocks like OS virtual memory pages — fragmentation drops, throughput climbs, and you serve 3-5x more concurrent requests per GPU.
The swap is mostly infrastructure. Your FastAPI code barely changes because vLLM exposes an OpenAI-compatible endpoint natively. Here's the updated Docker Compose service:
services:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8080:8000"
volumes:
- model_cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--quantization awq
--max-model-len 8192
--gpu-memory-utilization 0.85
--enable-chunked-prefill
--max-num-batched-tokens 16384
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 120s
volumes:
model_cache:
Note the start_period: 120s on the health check. vLLM takes 30-120 seconds to load a model, depending on size and whether it's cached. Without a grace period, Docker restarts the container before it finishes loading.
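The same grace-period logic applies if you script deployments outside Compose. A minimal startup probe (a sketch; `wait_for_backend` is a hypothetical helper, and the URL assumes the port mapping above) that blocks until the model has finished loading:

```python
import time
import urllib.error
import urllib.request

def wait_for_backend(url: str = "http://localhost:8080/health",
                     timeout_s: float = 180.0, poll_s: float = 2.0) -> bool:
    """Poll the health endpoint until it returns 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # model still loading, or container not up yet
        time.sleep(poll_s)
    return False
```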
The FastAPI proxy barely changes: point the base URL at http://vllm:8000 and call the standard /v1/chat/completions path directly:
import os

BACKEND_BASE = os.environ.get("BACKEND_BASE", "http://vllm:8000")
BACKEND_CHAT_PATH = "/v1/chat/completions"
In the endpoint handler, replace the Ollama-specific /api/chat with:
resp = await http_client.post(BACKEND_CHAT_PATH, json={
"model": req.model,
"messages": [m.model_dump() for m in req.messages],
"stream": req.stream,
"temperature": req.temperature,
"max_tokens": req.max_tokens,
})
That's it. vLLM's API matches OpenAI's schema exactly, so the response parsing from the Ollama version works without changes. The usage field comes back with proper token counts, and streaming uses the same SSE format.
The --gpu-memory-utilization 0.85 flag is critical. vLLM pre-allocates a percentage of available VRAM for KV-cache at startup. Setting it to 0.85 leaves headroom for CUDA kernels and prevents OOM kills under peak load. I've found 0.80-0.90 to be the safe range; below 0.80 you're leaving throughput on the table.
--enable-chunked-prefill lets vLLM interleave prefill (processing the prompt) and decode (generating tokens) within the same batch. This reduces time-to-first-token for new requests when the GPU is already generating for other users. It's the single biggest latency improvement for multi-user serving.
The asyncio.Semaphore in the FastAPI app caps concurrent backend calls, but it doesn't give you visibility or control over the queue. In production, you need to know how many requests are waiting, how long they've been waiting, and when to start shedding load instead of letting the queue grow.
Here's a more production-grade queue implementation:
import asyncio
import time
from dataclasses import dataclass, field
from fastapi import HTTPException
@dataclass
class QueueMetrics:
queued: int = 0
processing: int = 0
completed: int = 0
rejected: int = 0
total_wait_ms: float = 0.0
total_inference_ms: float = 0.0
metrics = QueueMetrics()
class InferenceQueue:
def __init__(self, max_concurrent: int = 10, max_waiting: int = 50,
max_wait_seconds: float = 30.0):
self.semaphore = asyncio.Semaphore(max_concurrent)
self.max_waiting = max_waiting
self.max_wait_seconds = max_wait_seconds
self._waiting = 0
async def __aenter__(self):
if self._waiting >= self.max_waiting:
metrics.rejected += 1
raise HTTPException(
status_code=429,
detail=f"Queue full ({self.max_waiting} requests waiting)",
headers={"Retry-After": "5"},
)
self._waiting += 1
metrics.queued = self._waiting
enqueue_time = time.monotonic()
try:
await asyncio.wait_for(
self.semaphore.acquire(),
timeout=self.max_wait_seconds,
)
except asyncio.TimeoutError:
self._waiting -= 1
metrics.queued = self._waiting
metrics.rejected += 1
raise HTTPException(
status_code=504,
detail=f"Queue timeout after {self.max_wait_seconds}s",
)
self._waiting -= 1
wait_ms = (time.monotonic() - enqueue_time) * 1000
metrics.total_wait_ms += wait_ms
metrics.queued = self._waiting
metrics.processing += 1
return self
async def __aexit__(self, *exc):
self.semaphore.release()
metrics.processing -= 1
metrics.completed += 1
queue = InferenceQueue(max_concurrent=10, max_waiting=50, max_wait_seconds=30.0)
Use it in the endpoint:
@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completions(req: ChatRequest):
async with queue:
# ... same inference logic as before
pass
The max_waiting=50 parameter is the backpressure valve. When 50 requests are already queued, new ones get a 429 with a Retry-After header. This is better than letting the queue grow unbounded — an unbounded queue eventually exhausts memory or makes every request time out. Tell clients to retry rather than pretending you can handle infinite load.
The max_wait_seconds=30.0 timeout catches the case where the backend is alive but slow. If a user's request sits in queue for 30 seconds, they've already left the page. Returning a 504 is more honest than making them wait 2 minutes.
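On the client side, those 429 and 504 responses are worth handling with backoff. A minimal retry helper (a sketch using only the standard library; `post_with_retry` is my name, not part of the proxy) that honors the Retry-After header the queue sends:

```python
import time
import urllib.error
import urllib.request

def post_with_retry(url: str, body: bytes, headers: dict, max_attempts: int = 4):
    """POST, retrying on 429/504; honors Retry-After, else exponential backoff."""
    for attempt in range(max_attempts):
        req = urllib.request.Request(url, data=body, headers=headers, method="POST")
        try:
            return urllib.request.urlopen(req, timeout=120)
        except urllib.error.HTTPError as e:
            if e.code not in (429, 504) or attempt == max_attempts - 1:
                raise  # non-retryable status, or out of attempts
            delay = float(e.headers.get("Retry-After") or 2 ** attempt)
            time.sleep(delay)
```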
You can't operate what you can't observe. Add a /metrics endpoint that your monitoring stack (Prometheus, Datadog, whatever) can scrape:
@app.get("/metrics")
async def get_metrics():
avg_wait = (
metrics.total_wait_ms / metrics.completed
if metrics.completed > 0
else 0.0
)
avg_inference = (
metrics.total_inference_ms / metrics.completed
if metrics.completed > 0
else 0.0
)
return {
"requests_queued": metrics.queued,
"requests_processing": metrics.processing,
"requests_completed": metrics.completed,
"requests_rejected": metrics.rejected,
"avg_wait_ms": round(avg_wait, 2),
"avg_inference_ms": round(avg_inference, 2),
}
For GPU-level monitoring, I run [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) as a sidecar container. It exposes GPU utilization, VRAM usage, temperature, and power draw as Prometheus metrics. The three numbers I watch:
- GPU utilization: if it's consistently low while requests queue, raise max_concurrent or switch to vLLM.
- VRAM usage: if it's pressing against the card's limit, reduce context length or (on vLLM) lower --gpu-memory-utilization.
- Queue depth: if requests_queued regularly exceeds max_concurrent * 2, you need more GPU capacity or a smaller model.

Alert on requests_rejected > 0 — any rejected request means your capacity is insufficient for your traffic.
GPU memory management is the single biggest operational headache when self-hosting LLMs with FastAPI — or any serving setup. Here's what I've learned the hard way.

Monitor VRAM constantly. The model itself takes a fixed chunk (proportional to parameter count and quantization). The KV-cache grows with context length and batch size. If KV-cache + model weights > VRAM, you get an OOM kill with no graceful degradation.
# Watch GPU memory in real-time
watch -n 1 nvidia-smi
Set context length limits. Ollama defaults to 2048 context tokens. For my deployments I set it per-model with a Modelfile:

FROM llama3.1:8b
PARAMETER num_ctx 4096

Build the variant with ollama create llama3.1:8b-4k -f Modelfile and point clients at that model name.
Every 1K of additional context costs roughly 50-100 MB of VRAM depending on the model architecture. An 8B model at 32K context can eat 16 GB just in KV-cache.

Use swap as a last resort, not a strategy. NVIDIA Unified Memory will spill to system RAM when VRAM is full. This "works" but inference speed drops 10-50x. If you're hitting swap regularly, you need a smaller model or more VRAM — there's no trick to make it fast.

NVIDIA MIG (Multi-Instance GPU) is worth investigating if you're on A100 or H100. It partitions a single GPU into isolated instances, each with its own VRAM and compute. You can run multiple small models on one GPU without them fighting for memory. See the [NVIDIA MIG documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for setup.
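You can estimate KV-cache growth from the architecture directly. A rough sketch (assumes an FP16 cache; `kv_cache_gb` and the Llama 3.1 8B figures of 32 layers, 8 KV heads, head dim 128 are my inputs):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_tokens: int, batch: int = 1, bytes_per: int = 2) -> float:
    """KV-cache size: keys + values for every layer, KV head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # the 2 is K and V
    return per_token * ctx_tokens * batch / 1e9

# Llama 3.1 8B (grouped-query attention): ~4.3 GB per request at 32K context.
# Four concurrent 32K requests already exceed 16 GB of cache.
print(round(kv_cache_gb(32, 8, 128, 32_768), 1))
print(round(kv_cache_gb(32, 8, 128, 32_768, batch=4), 1))
```

Models without grouped-query attention pay for all attention heads in the cache, which is why older 7-8B architectures hit the 16 GB figure at batch size 1.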
Self-hosting is not always cheaper than API calls. Here's the honest math:
An RTX 4090 costs ~$1,600. Electricity, cooling, and your time to maintain it add ~$50-100/month. That's roughly $200/month amortized over a year.
For $200/month in OpenAI API costs, you get roughly 20 million GPT-4o-mini tokens or 1 million GPT-4o tokens. If your workload is below that, the API is cheaper. Period.
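The break-even arithmetic is easy to rerun with your own numbers. A sketch (`breakeven_months` is a hypothetical helper; the defaults use the estimates above):

```python
def breakeven_months(gpu_cost: float = 1600.0, monthly_ops: float = 75.0,
                     api_monthly: float = 200.0) -> float:
    """Months until the GPU purchase pays for itself versus the API bill."""
    monthly_saving = api_monthly - monthly_ops
    if monthly_saving <= 0:
        return float("inf")  # the API stays cheaper at this volume
    return gpu_cost / monthly_saving

# ~$1,600 card, ~$75/month power and upkeep, against a $200/month API bill
print(round(breakeven_months(), 1))  # roughly a year
```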
Self-hosting wins when:

- Your data can't leave your network (compliance, privacy, IP protection).
- Your token volume sits well past the break-even point and a 7-70B model handles most of your use cases.
- You need capacity and latency that aren't subject to someone else's rate limits.
It doesn't win when you need frontier model quality. GPT-4o and Claude Opus are still better than anything you can self-host. The gap is closing, but it's real.
One of the underrated benefits of an OpenAI-compatible self-hosted endpoint is that it's a drop-in replacement everywhere. Here's how you swap in your local model with the official OpenAI Python client:
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key-change-this",
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a code review assistant."},
{"role": "user", "content": "Review this Python function for bugs..."},
],
temperature=0.3,
max_tokens=1024,
)
print(response.choices[0].message.content)
This also works with [LangChain](https://python.langchain.com/docs/integrations/chat/openai/), [LlamaIndex](https://docs.llamaindex.ai/), or any framework that accepts an OpenAI-compatible endpoint. No SDK changes, no custom adapters. That's the whole point of matching the API surface.
For [MCP Protocol in LLM Applications](https://tostupidtooquit.com/blog/mcp-protocol-llm-applications), you can point MCP clients at your self-hosted endpoint the same way you'd point them at OpenAI. The tool-calling schema is identical.
For hardening the Linux servers running these workloads, see [Linux Server Hardening for AI Workloads](https://tostupidtooquit.com/blog/linux-server-hardening-ai-workloads) — GPU servers are high-value targets and most aren't secured properly. If you're feeding these models with vector search, [Vector Databases: A Practitioner's Comparison](https://tostupidtooquit.com/blog/vector-databases-practitioner-comparison) covers which embedding stores are worth running locally.