Hands-on comparison of Pinecone, Qdrant, Weaviate, pgvector, and Chroma for production AI. Covers embedding fundamentals, indexing algorithms (HNSW, IVF, PQ), chunking strategies, reranking, and when each database fits.
Tyler McDaniel
AI Engineer & IBM Business Partner
Vector databases are the infrastructure behind every retrieval-augmented generation (RAG) pipeline, semantic search engine, and recommendation system built on embeddings. The concept is simple: store high-dimensional vectors, find the nearest neighbors. The implementation complexity determines whether your system returns results in 5ms or 500ms, costs $50/month or $5,000/month, and scales to a million vectors or a billion.
I've used five vector databases in production — Pinecone, Weaviate, Qdrant, Chroma, and pgvector. This is a practitioner's comparison based on actual deployment experience, not feature-list marketing.
Before comparing databases, you need to understand what you're storing. An embedding is a fixed-size numerical representation of unstructured data — text, images, audio — produced by a neural network. Two pieces of content that are semantically similar produce vectors that are close together in the embedding space.
from openai import OpenAI

client = OpenAI()
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# These two will be close in vector space
v1 = get_embedding("How do I reset my password?")
v2 = get_embedding("I forgot my login credentials")
# This will be far from both
v3 = get_embedding("The weather in Tokyo is sunny")
OpenAI's text-embedding-3-small produces 1,536-dimensional vectors. Each vector is 1,536 float32 values — about 6KB per vector. At 10 million documents, that's ~60GB of raw vector data before indexes.
Dimension matters. Higher dimensions capture more semantic nuance but cost more storage and compute. OpenAI's text-embedding-3-large outputs 3,072 dimensions. For most RAG use cases, 1,536 is sufficient. Some models like [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) produce 768 dimensions with competitive quality at lower cost.
Distance metrics matter. Cosine similarity is the default for normalized text embeddings. Euclidean (L2) distance works for image embeddings. Dot product is fastest but requires normalized vectors. Pick the metric that matches your embedding model's training objective — most text models use cosine.
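To make the metric choice concrete, here's a minimal cosine similarity in NumPy. This is illustrative only — every database below computes it natively:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-normalized vectors, cosine similarity reduces to a plain dot
# product — which is why dot product is the fastest metric when you can
# pre-normalize at index time.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```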
The naive approach — comparing your query vector against every stored vector — is O(n). With a million vectors at 1,536 dimensions, that's ~1.5 billion floating-point multiplications per query. Vector databases use approximate nearest neighbor (ANN) algorithms to make this tractable.
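The naive scan itself is a few lines of NumPy — useful as an exact-recall baseline when evaluating an ANN index, not as a production query path:

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest neighbors by full scan: O(n * d) per query."""
    scores = vectors @ query                # one dot product per stored vector
    top = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]    # sort only the k winners

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize so dot = cosine
ids = brute_force_top_k(db[42], db, k=3)         # the query's own vector ranks first
```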
[HNSW](https://arxiv.org/abs/1603.09320) builds a multi-layer graph where each node is a vector and edges connect nearby vectors. The search starts at the top layer (sparse, long edges for coarse navigation) and descends to the bottom layer (dense, short edges for fine-grained search).
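The core move at every layer is a greedy walk: from the current node, hop to whichever neighbor is closer to the query, and stop when no neighbor improves. A toy single-layer sketch — real HNSW adds the layer hierarchy and keeps a beam of candidates (controlled by ef_search) instead of a single node:

```python
import numpy as np

def greedy_walk(entry: int, query: np.ndarray, vectors: np.ndarray,
                neighbors: dict[int, list[int]]) -> int:
    """Walk one proximity-graph layer toward the node nearest the query."""
    current = entry
    best = float(np.sum((vectors[current] - query) ** 2))
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = float(np.sum((vectors[n] - query) ** 2))
            if d < best:
                current, best, improved = n, d, True
    return current

# Toy graph: four points on a line, each linked to its immediate neighbors
vecs = np.array([[0.0], [1.0], [2.0], [3.0]])
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_walk(0, np.array([2.9]), vecs, links))  # walks 0 -> 1 -> 2 -> 3
```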
HNSW dominates production vector search. Qdrant, Weaviate, and pgvector's hnsw index all use it.
IVF partitions the vector space into clusters using k-means. At query time, it searches only the nearest clusters instead of the full dataset; how many clusters are probed is controlled by a parameter (commonly called nprobe) — more probes mean higher recall at higher latency.
IVF is the go-to when you have more vectors than RAM. Pinecone uses a proprietary variation of IVF internally.
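A toy IVF search in NumPy, assuming the centroids have already been learned by k-means (here they're simply given):

```python
import numpy as np

def ivf_search(query, centroids, assignments, vectors, nprobe=1, k=3):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assignments, probe))
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # pretend k-means output
assignments = np.repeat([0, 1], 100)               # cluster id for each vector
vectors = rng.standard_normal((200, 2)) + centroids[assignments]
hits = ivf_search(np.array([9.5, 10.2]), centroids, assignments, vectors, nprobe=1)
# Only the 100 vectors in cluster 1 were scanned
```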
PQ compresses vectors by splitting them into sub-vectors and quantizing each sub-vector to its nearest centroid in a codebook. This reduces memory by 4-64x at the cost of some recall accuracy.
Original: [0.12, 0.34, 0.56, 0.78, 0.23, 0.45, 0.67, 0.89] → 32 bytes
PQ (4 sub-vectors, 256 centroids): [42, 187, 3, 211] → 4 bytes
Every production database supports PQ or a variant (scalar quantization, binary quantization). For datasets above 10M vectors, quantization is not optional — it's the difference between fitting in RAM or not.
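The encode/decode round trip looks like this in NumPy. The codebooks here are random for illustration — a real index learns them with k-means per sub-vector:

```python
import numpy as np

d, m, k = 32, 4, 256          # dims, sub-vectors, centroids per codebook
sub = d // m                  # each sub-vector covers d/m dimensions
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((m, k, sub)).astype(np.float32)  # normally k-means output

def pq_encode(v: np.ndarray) -> np.ndarray:
    """Map each sub-vector to the index of its nearest codebook centroid."""
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        s = v[i * sub:(i + 1) * sub]
        codes[i] = np.argmin(((codebooks[i] - s) ** 2).sum(axis=1))
    return codes  # 4 bytes instead of 128 bytes of float32

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Lossy reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

v = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(v)
approx = pq_decode(codes)     # close to v only if the codebooks were trained
```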
The vector database is half the equation. The embedding model determines the quality of your vectors and thus the quality of your retrieval.
| Model | Dimensions | MTEB Score | Cost | Latency | Notes |
|-------|-----------|------------|------|---------|-------|
| OpenAI text-embedding-3-small | 1,536 | 62.3 | $0.02/1M tokens | ~50ms | Best price/performance ratio |
| OpenAI text-embedding-3-large | 3,072 | 64.6 | $0.13/1M tokens | ~80ms | Highest quality from OpenAI |
| Cohere embed-english-v3.0 | 1,024 | 64.5 | $0.10/1M tokens | ~60ms | Excellent for search and RAG |
| [Nomic nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 768 | 62.3 | Free (self-host) | ~10ms* | Open-source, self-hostable |
| [BGE bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1,024 | 63.6 | Free (self-host) | ~15ms* | Open-source, top MTEB performer |
| Voyage voyage-large-2 | 1,536 | 64.8 | $0.12/1M tokens | ~70ms | Strong for code and technical text |
*Self-hosted latency depends on your GPU. These are estimates for an RTX 4090.
My recommendation: Start with text-embedding-3-small. It's cheap, fast, and good enough for most use cases. If retrieval quality isn't meeting your bar, switch to text-embedding-3-large or Cohere v3 — don't tune the vector database first. Embedding quality dominates retrieval quality more than index configuration.

If you're cost-sensitive or have privacy constraints, self-host Nomic or BGE. Both run on consumer GPUs and produce competitive embeddings. See [Self-Hosting LLMs with FastAPI](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi) for the deployment infrastructure — the same FastAPI + Docker setup works for embedding models.

Critical rule: never mix embedding models in the same collection. Vectors from different models are incompatible — their embedding spaces are different. If you switch models, re-embed your entire corpus. This is expensive, so choose carefully and commit.
| Feature | [Pinecone](https://www.pinecone.io/) | [Weaviate](https://weaviate.io/) | [Qdrant](https://qdrant.tech/) | [Chroma](https://www.trychroma.com/) | [pgvector](https://github.com/pgvector/pgvector) |
|---------|---------|----------|--------|--------|----------|
| Deployment | Managed only | Self-hosted + Cloud | Self-hosted + Cloud | Self-hosted + Cloud | PostgreSQL extension |
| Index type | Proprietary | HNSW | HNSW + quantization | HNSW (via hnswlib) | HNSW + IVFFlat |
| Max dimensions | 20,000 | Unlimited | Unlimited | Unlimited | 2,000 |
| Metadata filtering | Yes (server-side) | Yes (inverted index) | Yes (payload index) | Yes (basic) | Yes (SQL WHERE) |
| Hybrid search | Sparse + dense | BM25 + vector | Sparse + dense | No | Full-text + vector |
| Multi-tenancy | Namespaces | Built-in | Collection-level | Collections | Schema/row-level |
| Disk-based index | No (all in RAM) | Yes (mmap) | Yes (mmap + on-disk) | No | Yes (PostgreSQL storage) |
| Operational overhead | Zero (managed) | Moderate | Low-Moderate | Low | Low (if you run Postgres) |
| Cost (1M vectors) | ~$70/mo (s1) | ~$25/mo (self-hosted) | ~$20/mo (self-hosted) | Free (self-hosted) | Free (existing Postgres) |
| Maturity | Production-proven | Production-proven | Production-proven | Early but growing | Production-proven |
Pinecone is fully managed. No servers to provision, no indexes to tune, no replication to configure. You get an API endpoint and it works. This is its entire value proposition — and it's a legitimate one. If your team is 3 engineers building a RAG feature and none of you want to become vector database operators, Pinecone is the right choice.
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")
# Upsert vectors with metadata
index.upsert(
vectors=[
{
"id": "doc-1",
"values": get_embedding("LTI 1.3 uses OIDC for authentication"),
"metadata": {"source": "docs", "category": "auth", "date": "2024-01-15"},
},
],
namespace="knowledge-base",
)
# Query with metadata filter
results = index.query(
vector=get_embedding("How does LTI authentication work?"),
top_k=5,
namespace="knowledge-base",
filter={"category": {"$eq": "auth"}},
include_metadata=True,
)
The downside: cost scales linearly, no on-disk storage option, and you're fully vendor-locked. At 100M+ vectors, the bill gets serious.
Qdrant is my default recommendation for teams that want to self-host. The API is clean, the documentation is excellent, and it handles the hard problems (quantization, on-disk storage, distributed deployment) without requiring a PhD in distributed systems.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    VectorParams,
    PointStruct,
    Filter,
    FieldCondition,
    MatchValue,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
)

client = QdrantClient(url="http://localhost:6333")

# Create collection with scalar quantization enabled
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        on_disk=True,
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)
# Upsert with rich payload
client.upsert(
collection_name="knowledge_base",
points=[
PointStruct(
id=1,
vector=get_embedding("LTI 1.3 uses OIDC for authentication"),
payload={
"source": "docs",
"category": "auth",
"date": "2024-01-15",
"word_count": 450,
},
),
],
)
# Query with payload filtering
results = client.query_points(
collection_name="knowledge_base",
query=get_embedding("How does LTI authentication work?"),
query_filter=Filter(
must=[FieldCondition(key="category", match=MatchValue(value="auth"))]
),
limit=5,
)
Qdrant's scalar quantization keeps the quantized vectors in RAM for fast distance computation while storing the full-precision vectors on disk for re-ranking. This gives HNSW-speed queries at IVF-level memory usage. For my [agentic AI systems](https://tostupidtooquit.com/blog/agentic-ai-multi-agent-systems), Qdrant handles the long-term memory store with sub-10ms recall latency.
If you already have a Postgres database, pgvector is the zero-infrastructure option. No new service to deploy, no new backup strategy, no new monitoring. Add the extension, create an index, query with SQL.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536),
category TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Create HNSW index for cosine distance
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Query: nearest neighbors with metadata filter
SELECT id, content, category,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'auth'
ORDER BY embedding <=> $1::vector
LIMIT 5;
The <=> operator computes cosine distance. The HNSW index makes it fast. WHERE clause filtering happens at the SQL level, which means you get the full power of PostgreSQL's query planner for metadata filtering — something purpose-built vector databases are still improving.
pgvector's limitations: The HNSW index lives in shared memory, so large indexes compete with your regular Postgres workload. Build times are slow for 10M+ vectors. And the 2,000-dimension limit means you can't use OpenAI's text-embedding-3-large (3,072 dimensions) without truncation. For most RAG use cases with 1,536-dimension embeddings and under 5M vectors, pgvector is performant and operationally free if Postgres is already in your stack.
Weaviate combines vector search with BM25 keyword search in a single query. This is important because pure vector search misses exact-match queries ("error code E4021") and pure keyword search misses semantic queries ("authentication failures"). Hybrid search does both:
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Document")
# Hybrid query: combines BM25 + vector search
response = collection.query.hybrid(
query="LTI authentication OIDC flow",
alpha=0.7, # 0 = pure BM25, 1 = pure vector
limit=5,
filters=weaviate.classes.query.Filter.by_property("category").equal("auth"),
return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)
for obj in response.objects:
print(f"{obj.properties['title']} (score: {obj.metadata.score:.3f})")
The alpha parameter controls the balance. In my experience, alpha=0.7 (70% vector, 30% keyword) works well for general-purpose RAG. For code search, drop it to alpha=0.5 because function names and variable names are exact-match patterns that BM25 handles better.
Chroma is the SQLite of vector databases. Zero configuration, runs in-process, perfect for prototyping. I use it for local development of RAG pipelines and swap in Qdrant or pgvector for production.
import chromadb

client = chromadb.Client()
collection = client.create_collection("test")
collection.add(
documents=["LTI 1.3 uses OIDC", "OAuth 2.0 token exchange"],
ids=["doc1", "doc2"],
)
results = collection.query(
query_texts=["authentication protocol"],
n_results=2,
)
Chroma handles embedding automatically (uses a local model by default). For production, use a dedicated embedding endpoint and manage vectors yourself. Don't deploy Chroma as your production vector store — it's not designed for multi-tenant, high-availability workloads.
Before vectors reach the database, documents must be chunked. How you chunk determines retrieval quality more than which database you use. Three strategies:

Fixed-size chunking. Split every N tokens with M token overlap. Simple, predictable, works for homogeneous content. I use 512 tokens with 50 token overlap as a starting point:
from tiktoken import encoding_for_model

enc = encoding_for_model("text-embedding-3-small")
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + chunk_size
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
start += chunk_size - overlap
return chunks
Semantic chunking. Split at natural boundaries — paragraph breaks, section headers, topic shifts. Better retrieval quality because chunks are coherent units of meaning:
import re
def chunk_semantic(text: str, max_tokens: int = 512) -> list[str]:
# Split on double newlines (paragraph boundaries)
paragraphs = re.split(r"\n\n+", text)
chunks = []
current_chunk = []
current_size = 0
for para in paragraphs:
para_tokens = len(enc.encode(para))
if current_size + para_tokens > max_tokens and current_chunk:
chunks.append("\n\n".join(current_chunk))
current_chunk = [para]
current_size = para_tokens
else:
current_chunk.append(para)
current_size += para_tokens
if current_chunk:
chunks.append("\n\n".join(current_chunk))
return chunks
Parent-child chunking. Embed small chunks (256 tokens) for precise retrieval, but return the parent chunk (1024 tokens) for context. The small chunk matches the query accurately; the parent chunk gives the LLM enough context to generate a good response. This is the highest-quality strategy and what I use for production RAG systems:
def chunk_parent_child(
text: str, parent_size: int = 1024, child_size: int = 256, child_overlap: int = 50
) -> list[dict]:
"""Create parent-child chunk pairs for retrieval."""
parent_chunks = chunk_fixed(text, parent_size, overlap=100)
results = []
for parent_idx, parent in enumerate(parent_chunks):
children = chunk_fixed(parent, child_size, child_overlap)
for child_idx, child in enumerate(children):
results.append({
"parent_id": f"parent_{parent_idx}",
"child_id": f"parent_{parent_idx}_child_{child_idx}",
"parent_text": parent,
"child_text": child,
"embedding": get_embedding(child), # Embed the child
})
return results
Store child embeddings in the vector database with parent_id as metadata. Search returns child matches; look up the parent for context:
results = client.query_points(
collection_name="knowledge_base",
query=get_embedding(user_query),
limit=5,
)
# Retrieve unique parent chunks for full context
parent_ids = list(set(r.payload["parent_id"] for r in results.points))
parent_chunks = fetch_parents_from_store(parent_ids)
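fetch_parents_from_store is whatever lookup you have. A minimal in-memory sketch — hypothetical, since in production this is a key-value lookup against Redis, Postgres, or the vector database's own payload store:

```python
# Populated at indexing time from chunk_parent_child output
parent_store: dict[str, str] = {}

def index_parents(chunks: list[dict]) -> None:
    """Register each parent chunk's text under its parent_id."""
    for chunk in chunks:
        parent_store[chunk["parent_id"]] = chunk["parent_text"]

def fetch_parents_from_store(parent_ids: list[str]) -> dict[str, str]:
    """Return parent text for each id; unknown ids are silently skipped."""
    return {pid: parent_store[pid] for pid in parent_ids if pid in parent_store}
```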
Don't embed documents synchronously in your API request path. Build an async pipeline:
Document Created/Updated
│
▼
┌─────────────────────┐
│ Message Queue │ (Redis, SQS, or Kafka)
│ "embed:doc-123" │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Embedding Worker │ (Batches documents, calls embedding API)
│ - Chunk document │
│ - Generate vectors │
│ - Upsert to vecDB │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Vector Database │ (Qdrant / pgvector / Pinecone)
└─────────────────────┘
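Inside the worker, a simple batching helper keeps API calls large. The batch size of 100 is an assumption — tune it to your provider's input limits:

```python
def batched(items: list[str], size: int = 100) -> list[list[str]]:
    """Split texts into API-sized batches; OpenAI accepts arrays of inputs."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Worker loop sketch: one embeddings call per batch instead of per text
# for batch in batched(pending_chunks):
#     response = client.embeddings.create(input=batch, model="text-embedding-3-small")
#     upsert_vectors(batch, [item.embedding for item in response.data])
```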
Batch embedding calls whenever possible. OpenAI's API accepts arrays of inputs — embedding 100 texts in one call is ~95% cheaper than 100 single calls (due to per-request overhead, not pricing). Most self-hosted models also benefit from batching for GPU utilization.

Reranking: Vector search retrieves candidates. A reranker (like Cohere Rerank or a cross-encoder model) scores them more accurately. The two-stage pattern — retrieve 50 candidates with ANN, rerank to top 5 — consistently outperforms single-stage retrieval with more candidates.
import cohere

co = cohere.Client("your-api-key")
def retrieve_and_rerank(query: str, collection_name: str, top_k: int = 5) -> list[dict]:
# Stage 1: Fast vector search — get 50 candidates
candidates = client.query_points(
collection_name=collection_name,
query=get_embedding(query),
limit=50,
)
# Stage 2: Rerank with cross-encoder for accurate scoring
docs = [p.payload["content"] for p in candidates.points]
reranked = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=docs,
top_n=top_k,
)
return [
{
"content": docs[r.index],
"relevance_score": r.relevance_score,
"vector_id": candidates.points[r.index].id,
}
for r in reranked.results
]
In production, track query latency and recall, and tune HNSW accordingly. The two build-time parameters are m (number of connections per node, typically 16-64) and ef_construction (build-time search width, typically 100-400). Higher values improve recall but increase memory and build time. Set ef_construction high at index build time (you do this once) and tune the query-time parameter ef_search for the latency/recall tradeoff you need. A typical starting point: m=32, ef_construction=200, ef_search=128.
Pitfall 1: Skipping model-specific prefixes. Some embedding models (BGE, E5, Nomic) are trained with instruction prefixes — "Represent this document for retrieval: ..." vs "Represent this query for searching: ...". If you forget the prefix, cosine similarity drops 10-15%. Read your model's documentation.
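A small guard makes the prefix impossible to forget. The prefix strings below are Nomic Embed's documented ones — BGE and E5 use their own, so check your model card:

```python
# Nomic Embed's documented task prefixes; other models differ
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def prepare_text(text: str, *, is_query: bool) -> str:
    """Attach the correct task prefix exactly once before embedding."""
    prefix = QUERY_PREFIX if is_query else DOC_PREFIX
    return text if text.startswith(prefix) else prefix + text
```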
Pitfall 2: Not storing original text alongside vectors. A vector database stores vectors and metadata, but you need the original text to pass to the LLM. Store the text chunk in metadata/payload — don't make a separate database call to retrieve it. The latency adds up.
Pitfall 3: Over-indexing metadata. Every indexed metadata field adds memory and CPU overhead. Index the fields you actually filter on (category, date range, tenant ID). Don't index every field "just in case" — the where clause you never use still costs you at upsert time.
Pitfall 4: No deletion strategy. When source documents are updated or deleted, their old vectors persist in the database. Over time, this causes stale results and wasted storage. Implement a garbage collection process: track document versions, and when a new version is embedded, delete the old vectors.
Pitfall 5: Testing with toy data, deploying with production data. A vector database that's fast with 10,000 vectors may be unacceptable at 10,000,000. Test at production scale — or at minimum, 10x your expected initial scale. Load test tools like [Locust](https://locust.io/) can simulate concurrent vector queries with realistic payloads.
Pitfall 6: Ignoring the hybrid search gap. Pure vector search fails on exact-match queries. A user searching for "error ERR_4021" will get semantically similar results about errors, not the specific error code. If your corpus contains identifiers, codes, or proper nouns that users search for exactly, you need hybrid search (Weaviate) or a BM25 fallback alongside your vector store.

The bottom line: your embedding model and chunking strategy matter more than which database you pick. Get those right first, then choose the database that matches your operational reality — managed if you don't want to run infrastructure, pgvector if Postgres is already in your stack, Qdrant if you want control without complexity, Weaviate if hybrid search is critical.