Running quantized LLMs behind a FastAPI proxy with Ollama and vLLM backends. Covers model quantization tradeoffs, GGUF vs GPTQ vs AWQ, streaming responses, request queuing, Docker Compose deployment, and production monitoring.
Tyler McDaniel
AI Engineer & IBM Business Partner
Calling OpenAI's API is fine until it isn't. Maybe your data can't leave your network. Maybe you're burning $2,000/month on tokens that a quantized local model handles for 80% of your use cases. Maybe you just want to stop building on someone else's rate limits. Whatever the reason, self-hosting LLMs with FastAPI is the production path I keep coming back to — and after running local inference for eight clients across three GPU configurations, this is the guide that covers what the tutorials skip.
Self-hosting LLMs with FastAPI gives you a clean, async HTTP interface in front of whatever serving backend you choose. The model runs behind a backend like [Ollama](https://ollama.com/), [vLLM](https://docs.vllm.ai/), or [llama.cpp server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server). FastAPI handles routing, authentication, request queuing, streaming, and health checks. You ship a Docker container, not a notebook.
The model choice isn't about "which chatbot is best." It's about fitting your quality requirements into your VRAM budget.
| Model | Parameters | VRAM (FP16) | VRAM (Q4_K_M) | Quality Tier | Best For |
|-------|-----------|-------------|---------------|-------------|----------|
| Llama 3.1 8B | 8B | 16 GB | 5 GB | Good for structured tasks | Classification, extraction, simple RAG |
| Mistral 7B v0.3 | 7B | 14 GB | 5 GB | Good general purpose | Chat, summarization, code assist |
| Llama 3.1 70B | 70B | 140 GB | 40 GB | Excellent | Complex reasoning, long-form generation |
| Qwen 2.5 72B | 72B | 144 GB | 42 GB | Excellent, strong multilingual | Same as 70B + CJK languages |
| Phi-3.5 Mini | 3.8B | 8 GB | 3 GB | Decent for size | Edge deployment, low-resource servers |
| DeepSeek-R1 | 671B (MoE) | Massive (multi-node) | ~160 GB active | State-of-the-art reasoning | Complex chain-of-thought, math |
The quantization sweet spot for production is Q4_K_M or Q5_K_M in GGUF format. Q4_K_M gives you roughly 80-85% of FP16 quality at 25-30% of the VRAM. Below Q4, quality degrades noticeably on nuanced tasks. Above Q5, the VRAM increase isn't justified by the quality gain for most applications.
My rule of thumb: if you have a single consumer GPU (RTX 4090 with 24 GB VRAM), you're running 7-8B models at Q4_K_M or 70B models with heavy quantization. If you have an A100 (80 GB), you can run 70B at Q5_K_M comfortably. Two A100s open up everything.
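To sanity-check the table and the rule of thumb yourself, the weight footprint is just parameters times bits per weight. A rough sketch (the `estimate_weight_vram_gb` helper is my own; it ignores runtime overhead and KV-cache, which come on top):

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw VRAM for model weights: params * bits / 8.

    bits_per_weight: 16 for FP16, ~4.5 for Q4_K_M, ~5.5 for Q5_K_M.
    Real usage adds 10-20% runtime overhead, plus KV-cache.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# An 8B model: 16 GB at FP16, ~4.5 GB at Q4_K_M -- in line with the table above
print(estimate_weight_vram_gb(8, 16))   # 16.0
print(estimate_weight_vram_gb(8, 4.5))  # 4.5
```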
A note on GGUF vs GPTQ vs AWQ. These are different quantization formats, and which one you use depends on your serving backend:

- GGUF is llama.cpp's file format. It packs weights and metadata into a single file, supports the full K-quant ladder (Q2 through Q8), and runs on CPU and Apple Silicon as well as GPUs. Ollama and llama.cpp server consume it.
- GPTQ is a GPU-oriented post-training quantization method. vLLM and TGI load it, but it requires a CUDA-capable GPU.
- AWQ (activation-aware weight quantization) preserves the weights that matter most to activations, which tends to hold quality better than GPTQ at 4-bit. It's also GPU-only, served by vLLM and TGI.

For practical purposes: if you're using Ollama, you use GGUF. If you're using vLLM, you use AWQ or GPTQ. Don't fight the ecosystem.
Your FastAPI app doesn't run inference directly. It proxies to a serving backend that manages the model, handles batching, and talks to the GPU.
| Backend | Quantization Support | Continuous Batching | GPU Utilization | Production Ready | API Style |
|---------|---------------------|-------------------|----------------|-----------------|-----------|
| [Ollama](https://ollama.com/) | GGUF (Q2–Q8) | Limited | Moderate | Yes | REST (OpenAI-compatible) |
| [vLLM](https://docs.vllm.ai/) | GPTQ, AWQ, FP16 | Yes (PagedAttention) | Excellent | Yes | OpenAI-compatible |
| [llama.cpp server](https://github.com/ggerganov/llama.cpp) | GGUF (all quants) | Yes (recent versions) | Good | Yes (mature) | REST |
| [TGI](https://huggingface.co/docs/text-generation-inference/) | GPTQ, AWQ, FP16 | Yes | Excellent | Yes | REST + gRPC |
Ollama is the easiest to get running. `ollama pull llama3.1:8b` and you have a model serving on port 11434. The API is OpenAI-compatible, so your FastAPI proxy barely needs to transform payloads. Downside: batching is limited, and throughput under concurrent load lags significantly behind vLLM.
vLLM is the throughput king. [PagedAttention](https://arxiv.org/abs/2309.06180) manages KV-cache like virtual memory, which means it serves 3-5x more concurrent requests per GPU than naive implementations. If you're building anything that handles more than 10 concurrent users, vLLM is the answer.
llama.cpp server strikes the middle ground. It runs GGUF models on anything — CPU-only, Apple Silicon, NVIDIA, AMD. Continuous batching support landed in late 2024 and has gotten solid. I use it when I need to support heterogeneous hardware and don't want to maintain separate deployments.

For this guide I'll show Ollama first for simplicity, then a vLLM alternative; the FastAPI layer I'm building is backend-agnostic.
Here's the full FastAPI application. This is production code, not a tutorial fragment.
import asyncio
import hashlib
import hmac
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from typing import Optional
import httpx
from fastapi import FastAPI, HTTPException, Request, Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
OLLAMA_BASE = "http://localhost:11434"
API_KEY_HASH = hashlib.sha256(b"your-secret-key-change-this").hexdigest()
MAX_CONCURRENT = 10
semaphore = asyncio.Semaphore(MAX_CONCURRENT)
http_client: Optional[httpx.AsyncClient] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global http_client
http_client = httpx.AsyncClient(
base_url=OLLAMA_BASE,
timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
)
# Verify backend is reachable
try:
resp = await http_client.get("/api/tags")
resp.raise_for_status()
except httpx.ConnectError:
raise RuntimeError(f"Cannot reach Ollama at {OLLAMA_BASE}")
yield
await http_client.aclose()
app = FastAPI(title="LLM Proxy", version="1.0.0", lifespan=lifespan)
class ChatMessage(BaseModel):
role: str = Field(pattern=r"^(system|user|assistant)$")
content: str = Field(min_length=1, max_length=32_000)
class ChatRequest(BaseModel):
model: str = Field(default="llama3.1:8b", max_length=100)
messages: list[ChatMessage] = Field(min_length=1, max_length=50)
stream: bool = False
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
max_tokens: int = Field(default=2048, ge=1, le=8192)
class ChatChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: str
class ChatResponse(BaseModel):
id: str
model: str
choices: list[ChatChoice]
usage: dict
def verify_api_key(request: Request) -> None:
auth = request.headers.get("Authorization", "")
if not auth.startswith("Bearer "):
raise HTTPException(status_code=401, detail="Missing Bearer token")
token = auth[7:]
token_hash = hashlib.sha256(token.encode()).hexdigest()
if not hmac.compare_digest(token_hash, API_KEY_HASH):
raise HTTPException(status_code=401, detail="Invalid API key")
@app.get("/health")
async def health():
try:
resp = await http_client.get("/api/tags")
models = resp.json().get("models", [])
return {
"status": "healthy",
"backend": "ollama",
"models_loaded": [m["name"] for m in models],
"timestamp": time.time(),
}
except Exception as e:
raise HTTPException(status_code=503, detail=f"Backend unhealthy: {e}")
@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completions(req: ChatRequest):
async with semaphore:
payload = {
"model": req.model,
"messages": [m.model_dump() for m in req.messages],
"stream": req.stream,
"options": {
"temperature": req.temperature,
"num_predict": req.max_tokens,
},
}
if req.stream:
return StreamingResponse(
stream_response(payload),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)
resp = await http_client.post("/api/chat", json=payload)
if resp.status_code != 200:
raise HTTPException(status_code=502, detail="Backend error")
data = resp.json()
return ChatResponse(
id=f"chatcmpl-{int(time.time())}",
model=req.model,
choices=[
ChatChoice(
index=0,
message=ChatMessage(
role="assistant",
content=data["message"]["content"],
),
finish_reason="stop",
)
],
usage={
"prompt_tokens": data.get("prompt_eval_count", 0),
"completion_tokens": data.get("eval_count", 0),
"total_tokens": data.get("prompt_eval_count", 0)
+ data.get("eval_count", 0),
},
)
async def stream_response(payload: dict) -> AsyncGenerator[str, None]:
async with http_client.stream("POST", "/api/chat", json=payload) as resp:
async for line in resp.aiter_lines():
if not line:
continue
yield f"data: {line}\n\n"
yield "data: [DONE]\n\n"
@app.get("/v1/models", dependencies=[Depends(verify_api_key)])
async def list_models():
resp = await http_client.get("/api/tags")
models = resp.json().get("models", [])
return {
"data": [
{
"id": m["name"],
"object": "model",
"owned_by": "local",
}
for m in models
]
}
Key design decisions:
- Bounded concurrency: asyncio.Semaphore(10) caps concurrent backend calls. Excess requests wait in the asyncio event loop instead of crashing the backend.
- OpenAI-compatible surface: the proxy exposes /v1/chat/completions with the same request/response shape. Any client library that talks to OpenAI can point at your server by changing the base URL: openai.OpenAI(base_url="http://your-server:8000/v1") just works.
- Streaming: the stream_response generator reads from Ollama's streaming endpoint and re-emits as Server-Sent Events. Clients get token-by-token output without the proxy holding the entire response in memory.
- Authentication: hmac.compare_digest prevents timing attacks. The key is hashed at startup so the plaintext never sits in server memory after init. In production, pull the hash from an environment variable or secrets manager, not a hardcoded string.
- Health checks: /health verifies the backend is responsive and returns loaded models. Your container orchestrator (Docker Compose, Kubernetes) should poll this.

Here's a production-ready Docker Compose stack. Save as docker-compose.yml:
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
api:
build: .
ports:
- "8000:8000"
environment:
- OLLAMA_BASE=http://ollama:11434
depends_on:
ollama:
condition: service_healthy
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 15s
timeout: 5s
retries: 3
volumes:
ollama_data:
And the Dockerfile:
FROM python:3.12-slim

WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn httpx pydantic
COPY main.py .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
One worker. Not two, not four. LLM inference is GPU-bound, not CPU-bound. Multiple Uvicorn workers don't help — they just multiply memory usage. If you need to scale, run multiple replicas behind a load balancer, each with its own GPU allocation.
Pull and preload your model after the stack is up:
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b
Test it:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer your-secret-key-change-this" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Explain TCP in one sentence."}]
}'
When concurrent users climb past single digits, Ollama's lack of continuous batching becomes the bottleneck. vLLM's [PagedAttention](https://arxiv.org/abs/2309.06180) mechanism manages KV-cache blocks like OS virtual memory pages — fragmentation drops, throughput climbs, and you serve 3-5x more concurrent requests per GPU.
The swap is mostly infrastructure. Your FastAPI code barely changes because vLLM exposes an OpenAI-compatible endpoint natively. Here's the updated Docker Compose service:
services:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8080:8000"
volumes:
- model_cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--quantization awq
--max-model-len 8192
--gpu-memory-utilization 0.85
--enable-chunked-prefill
--max-num-batched-tokens 16384
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 120s
volumes:
model_cache:
Note the start_period: 120s on the health check. vLLM takes 30-120 seconds to load a model, depending on size and whether it's cached. Without a grace period, Docker restarts the container before it finishes loading.
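The same grace-period logic applies if you script deployments outside Compose. A minimal startup probe (a sketch; `wait_for_backend` is a hypothetical helper, and the URL assumes the port mapping above) that blocks until the model has finished loading:

```python
import time
import urllib.error
import urllib.request

def wait_for_backend(url: str = "http://localhost:8080/health",
                     timeout_s: float = 180.0, poll_s: float = 2.0) -> bool:
    """Poll the health endpoint until it returns 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # model still loading, or container not up yet
        time.sleep(poll_s)
    return False
```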
The FastAPI proxy barely changes: point the base URL at http://vllm:8000 and call the standard /v1/chat/completions path directly:
import os

BACKEND_BASE = os.environ.get("BACKEND_BASE", "http://vllm:8000")
BACKEND_CHAT_PATH = "/v1/chat/completions"
In the endpoint handler, replace the Ollama-specific /api/chat with:
resp = await http_client.post(BACKEND_CHAT_PATH, json={
"model": req.model,
"messages": [m.model_dump() for m in req.messages],
"stream": req.stream,
"temperature": req.temperature,
"max_tokens": req.max_tokens,
})
That's it. vLLM's API matches OpenAI's schema exactly, so the response parsing from the Ollama version works without changes. The usage field comes back with proper token counts, and streaming uses the same SSE format.
The --gpu-memory-utilization 0.85 flag is critical. vLLM pre-allocates a percentage of available VRAM for KV-cache at startup. Setting it to 0.85 leaves headroom for CUDA kernels and prevents OOM kills under peak load. I've found 0.80-0.90 to be the safe range; below 0.80 you're leaving throughput on the table.
--enable-chunked-prefill lets vLLM interleave prefill (processing the prompt) and decode (generating tokens) within the same batch. This reduces time-to-first-token for new requests when the GPU is already generating for other users. It's the single biggest latency improvement for multi-user serving.
The asyncio.Semaphore in the FastAPI app caps concurrent backend calls, but it doesn't give you visibility or control over the queue. In production, you need to know how many requests are waiting, how long they've been waiting, and when to start shedding load instead of letting the queue grow.
Here's a more production-grade queue implementation:
import asyncio
import time
from dataclasses import dataclass, field
from fastapi import HTTPException
@dataclass
class QueueMetrics:
queued: int = 0
processing: int = 0
completed: int = 0
rejected: int = 0
total_wait_ms: float = 0.0
total_inference_ms: float = 0.0
metrics = QueueMetrics()
class InferenceQueue:
def __init__(self, max_concurrent: int = 10, max_waiting: int = 50,
max_wait_seconds: float = 30.0):
self.semaphore = asyncio.Semaphore(max_concurrent)
self.max_waiting = max_waiting
self.max_wait_seconds = max_wait_seconds
self._waiting = 0
async def __aenter__(self):
if self._waiting >= self.max_waiting:
metrics.rejected += 1
raise HTTPException(
status_code=429,
detail=f"Queue full ({self.max_waiting} requests waiting)",
headers={"Retry-After": "5"},
)
self._waiting += 1
metrics.queued = self._waiting
enqueue_time = time.monotonic()
try:
await asyncio.wait_for(
self.semaphore.acquire(),
timeout=self.max_wait_seconds,
)
except asyncio.TimeoutError:
self._waiting -= 1
metrics.queued = self._waiting
metrics.rejected += 1
raise HTTPException(
status_code=504,
detail=f"Queue timeout after {self.max_wait_seconds}s",
)
self._waiting -= 1
wait_ms = (time.monotonic() - enqueue_time) * 1000
metrics.total_wait_ms += wait_ms
metrics.queued = self._waiting
metrics.processing += 1
return self
async def __aexit__(self, *exc):
self.semaphore.release()
metrics.processing -= 1
metrics.completed += 1
queue = InferenceQueue(max_concurrent=10, max_waiting=50, max_wait_seconds=30.0)
Use it in the endpoint:
@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completions(req: ChatRequest):
async with queue:
# ... same inference logic as before
pass
The max_waiting=50 parameter is the backpressure valve. When 50 requests are already queued, new ones get a 429 with a Retry-After header. This is better than letting the queue grow unbounded — an unbounded queue eventually exhausts memory or makes every request time out. Tell clients to retry rather than pretending you can handle infinite load.
The max_wait_seconds=30.0 timeout catches the case where the backend is alive but slow. If a user's request sits in queue for 30 seconds, they've already left the page. Returning a 504 is more honest than making them wait 2 minutes.
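On the client side, those 429 and 504 responses are worth handling with backoff. A minimal retry helper (a sketch using only the standard library; `post_with_retry` is my name, not part of the proxy) that honors the Retry-After header the queue sends:

```python
import time
import urllib.error
import urllib.request

def post_with_retry(url: str, body: bytes, headers: dict, max_attempts: int = 4):
    """POST, retrying on 429/504; honors Retry-After, else exponential backoff."""
    for attempt in range(max_attempts):
        req = urllib.request.Request(url, data=body, headers=headers, method="POST")
        try:
            return urllib.request.urlopen(req, timeout=120)
        except urllib.error.HTTPError as e:
            if e.code not in (429, 504) or attempt == max_attempts - 1:
                raise  # non-retryable status, or out of attempts
            delay = float(e.headers.get("Retry-After") or 2 ** attempt)
            time.sleep(delay)
```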
You can't operate what you can't observe. Add a /metrics endpoint that your monitoring stack (Prometheus, Datadog, whatever) can scrape:
@app.get("/metrics")
async def get_metrics():
avg_wait = (
metrics.total_wait_ms / metrics.completed
if metrics.completed > 0
else 0.0
)
avg_inference = (
metrics.total_inference_ms / metrics.completed
if metrics.completed > 0
else 0.0
)
return {
"requests_queued": metrics.queued,
"requests_processing": metrics.processing,
"requests_completed": metrics.completed,
"requests_rejected": metrics.rejected,
"avg_wait_ms": round(avg_wait, 2),
"avg_inference_ms": round(avg_inference, 2),
}
For GPU-level monitoring, I run [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) as a sidecar container. It exposes GPU utilization, VRAM usage, temperature, and power draw as Prometheus metrics. The three numbers I watch:
- GPU utilization: if it's consistently low while requests queue, raise max_concurrent or switch to vLLM.
- VRAM usage: if it's pressing against the card's limit, reduce context length or (on vLLM) lower --gpu-memory-utilization.
- Queue depth: if requests_queued regularly exceeds max_concurrent * 2, you need more GPU capacity or a smaller model.

Alert on requests_rejected > 0 — any rejected request means your capacity is insufficient for your traffic.
GPU memory management is the single biggest operational headache when self-hosting LLMs with FastAPI — or any serving setup. Here's what I've learned the hard way.

Monitor VRAM constantly. The model itself takes a fixed chunk (proportional to parameter count and quantization). The KV-cache grows with context length and batch size. If KV-cache + model weights > VRAM, you get an OOM kill with no graceful degradation.
# Watch GPU memory in real-time
watch -n 1 nvidia-smi
Set context length limits. Ollama defaults to 2048 context tokens. For my deployments I set it per-model with a Modelfile:

FROM llama3.1:8b
PARAMETER num_ctx 4096

Build the variant with ollama create llama3.1:8b-4k -f Modelfile and point clients at that model name.
Every 1K of additional context costs roughly 50-100 MB of VRAM depending on the model architecture. An 8B model at 32K context can eat 16 GB just in KV-cache.

Use swap as a last resort, not a strategy. NVIDIA Unified Memory will spill to system RAM when VRAM is full. This "works" but inference speed drops 10-50x. If you're hitting swap regularly, you need a smaller model or more VRAM — there's no trick to make it fast.

NVIDIA MIG (Multi-Instance GPU) is worth investigating if you're on A100 or H100. It partitions a single GPU into isolated instances, each with its own VRAM and compute. You can run multiple small models on one GPU without them fighting for memory. See the [NVIDIA MIG documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for setup.
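You can estimate KV-cache growth from the architecture directly. A rough sketch (assumes an FP16 cache; `kv_cache_gb` and the Llama 3.1 8B figures of 32 layers, 8 KV heads, head dim 128 are my inputs):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_tokens: int, batch: int = 1, bytes_per: int = 2) -> float:
    """KV-cache size: keys + values for every layer, KV head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # the 2 is K and V
    return per_token * ctx_tokens * batch / 1e9

# Llama 3.1 8B (grouped-query attention): ~4.3 GB per request at 32K context.
# Four concurrent 32K requests already exceed 16 GB of cache.
print(round(kv_cache_gb(32, 8, 128, 32_768), 1))
print(round(kv_cache_gb(32, 8, 128, 32_768, batch=4), 1))
```

Models without grouped-query attention pay for all attention heads in the cache, which is why older 7-8B architectures hit the 16 GB figure at batch size 1.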
Self-hosting is not always cheaper than API calls. Here's the honest math:
An RTX 4090 costs ~$1,600. Electricity, cooling, and your time to maintain it add ~$50-100/month. That's roughly $200/month amortized over a year.
For $200/month in OpenAI API costs, you get roughly 20 million GPT-4o-mini tokens or 1 million GPT-4o tokens. If your workload is below that, the API is cheaper. Period.
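The break-even arithmetic is easy to rerun with your own numbers. A sketch (`breakeven_months` is a hypothetical helper; the defaults use the estimates above):

```python
def breakeven_months(gpu_cost: float = 1600.0, monthly_ops: float = 75.0,
                     api_monthly: float = 200.0) -> float:
    """Months until the GPU purchase pays for itself versus the API bill."""
    monthly_saving = api_monthly - monthly_ops
    if monthly_saving <= 0:
        return float("inf")  # the API stays cheaper at this volume
    return gpu_cost / monthly_saving

# ~$1,600 card, ~$75/month power and upkeep, against a $200/month API bill
print(round(breakeven_months(), 1))  # roughly a year
```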
Self-hosting wins when:

- Your data can't leave your network (compliance, privacy, IP protection).
- Your token volume sits well past the break-even point and a 7-70B model handles most of your use cases.
- You need capacity and latency that aren't subject to someone else's rate limits.
It doesn't win when you need frontier model quality. GPT-4o and Claude Opus are still better than anything you can self-host. The gap is closing, but it's real.
One of the underrated benefits of an OpenAI-compatible self-hosted endpoint is that it's a drop-in replacement everywhere. Here's how you swap in your local model with the official OpenAI Python client:
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key-change-this",
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a code review assistant."},
{"role": "user", "content": "Review this Python function for bugs..."},
],
temperature=0.3,
max_tokens=1024,
)
print(response.choices[0].message.content)
This also works with [LangChain](https://python.langchain.com/docs/integrations/chat/openai/), [LlamaIndex](https://docs.llamaindex.ai/), or any framework that accepts an OpenAI-compatible endpoint. No SDK changes, no custom adapters. That's the whole point of matching the API surface.
For [MCP Protocol in LLM Applications](https://tostupidtooquit.com/blog/mcp-protocol-llm-applications), you can point MCP clients at your self-hosted endpoint the same way you'd point them at OpenAI. The tool-calling schema is identical.
For hardening the Linux servers running these workloads, see [Linux Server Hardening for AI Workloads](https://tostupidtooquit.com/blog/linux-server-hardening-ai-workloads) — GPU servers are high-value targets and most aren't secured properly. If you're feeding these models with vector search, [Vector Databases: A Practitioner's Comparison](https://tostupidtooquit.com/blog/vector-databases-practitioner-comparison) covers which embedding stores are worth running locally.