Hands-on comparison of Pinecone, Qdrant, Weaviate, pgvector, and Chroma for production AI. Covers embedding fundamentals, indexing algorithms (HNSW, IVF, PQ), chunking strategies, reranking, and when each database fits.
Tyler McDaniel
AI Engineer & IBM Business Partner
Vector databases are the infrastructure behind every retrieval-augmented generation (RAG) pipeline, semantic search engine, and recommendation system built on embeddings. The concept is simple: store high-dimensional vectors, find the nearest neighbors. The implementation complexity determines whether your system returns results in 5ms or 500ms, costs $50/month or $5,000/month, and scales to a million vectors or a billion.
I've used five vector databases in production — Pinecone, Weaviate, Qdrant, Chroma, and pgvector. This is a practitioner's comparison based on actual deployment experience, not feature-list marketing.
Before comparing databases, you need to understand what you're storing. An embedding is a fixed-size numerical representation of unstructured data — text, images, audio — produced by a neural network. Two pieces of content that are semantically similar produce vectors that are close together in the embedding space.
from openai import OpenAI

client = OpenAI()
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# These two will be close in vector space
v1 = get_embedding("How do I reset my password?")
v2 = get_embedding("I forgot my login credentials")
# This will be far from both
v3 = get_embedding("The weather in Tokyo is sunny")
OpenAI's text-embedding-3-small produces 1,536-dimensional vectors. Each vector is 1,536 float32 values — about 6KB per vector. At 10 million documents, that's ~60GB of raw vector data before indexes.
Dimension matters. Higher dimensions capture more semantic nuance but cost more storage and compute. OpenAI's text-embedding-3-large outputs 3,072 dimensions. For most RAG use cases, 1,536 is sufficient. Some models like [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) produce 768 dimensions with competitive quality at lower cost.
Distance metrics matter. Cosine similarity is the default for normalized text embeddings. Euclidean (L2) distance works for image embeddings. Dot product is fastest but requires normalized vectors. Pick the metric that matches your embedding model's training objective — most text models use cosine.
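To make the metric choice concrete, here's a minimal cosine similarity in NumPy. This is illustrative only — every database below computes it natively:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-normalized vectors, cosine similarity reduces to a plain dot
# product — which is why dot product is the fastest metric when you can
# pre-normalize at index time.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```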
The naive approach — comparing your query vector against every stored vector — is O(n). With a million vectors at 1,536 dimensions, that's ~1.5 billion floating-point multiplications per query. Vector databases use approximate nearest neighbor (ANN) algorithms to make this tractable.
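The naive scan itself is a few lines of NumPy — useful as an exact-recall baseline when evaluating an ANN index, not as a production query path:

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest neighbors by full scan: O(n * d) per query."""
    scores = vectors @ query                # one dot product per stored vector
    top = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]    # sort only the k winners

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize so dot = cosine
ids = brute_force_top_k(db[42], db, k=3)         # the query's own vector ranks first
```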
[HNSW](https://arxiv.org/abs/1603.09320) builds a multi-layer graph where each node is a vector and edges connect nearby vectors. The search starts at the top layer (sparse, long edges for coarse navigation) and descends to the bottom layer (dense, short edges for fine-grained search).
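The core move at every layer is a greedy walk: from the current node, hop to whichever neighbor is closer to the query, and stop when no neighbor improves. A toy single-layer sketch — real HNSW adds the layer hierarchy and keeps a beam of candidates (controlled by ef_search) instead of a single node:

```python
import numpy as np

def greedy_walk(entry: int, query: np.ndarray, vectors: np.ndarray,
                neighbors: dict[int, list[int]]) -> int:
    """Walk one proximity-graph layer toward the node nearest the query."""
    current = entry
    best = float(np.sum((vectors[current] - query) ** 2))
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = float(np.sum((vectors[n] - query) ** 2))
            if d < best:
                current, best, improved = n, d, True
    return current

# Toy graph: four points on a line, each linked to its immediate neighbors
vecs = np.array([[0.0], [1.0], [2.0], [3.0]])
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_walk(0, np.array([2.9]), vecs, links))  # walks 0 -> 1 -> 2 -> 3
```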
HNSW dominates production vector search. Qdrant, Weaviate, and pgvector's hnsw index all use it.
IVF partitions the vector space into clusters using k-means. At query time, it searches only the nearest clusters instead of the full dataset; how many clusters are probed is controlled by a parameter (commonly called nprobe) — more probes mean higher recall at higher latency.
IVF is the go-to when you have more vectors than RAM. Pinecone uses a proprietary variation of IVF internally.
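A toy IVF search in NumPy, assuming the centroids have already been learned by k-means (here they're simply given):

```python
import numpy as np

def ivf_search(query, centroids, assignments, vectors, nprobe=1, k=3):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assignments, probe))
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # pretend k-means output
assignments = np.repeat([0, 1], 100)               # cluster id for each vector
vectors = rng.standard_normal((200, 2)) + centroids[assignments]
hits = ivf_search(np.array([9.5, 10.2]), centroids, assignments, vectors, nprobe=1)
# Only the 100 vectors in cluster 1 were scanned
```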
PQ compresses vectors by splitting them into sub-vectors and quantizing each sub-vector to its nearest centroid in a codebook. This reduces memory by 4-64x at the cost of some recall accuracy.
Original: [0.12, 0.34, 0.56, 0.78, 0.23, 0.45, 0.67, 0.89] → 32 bytes
PQ (4 sub-vectors, 256 centroids): [42, 187, 3, 211] → 4 bytes
Every production database supports PQ or a variant (scalar quantization, binary quantization). For datasets above 10M vectors, quantization is not optional — it's the difference between fitting in RAM or not.
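The encode/decode round trip looks like this in NumPy. The codebooks here are random for illustration — a real index learns them with k-means per sub-vector:

```python
import numpy as np

d, m, k = 32, 4, 256          # dims, sub-vectors, centroids per codebook
sub = d // m                  # each sub-vector covers d/m dimensions
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((m, k, sub)).astype(np.float32)  # normally k-means output

def pq_encode(v: np.ndarray) -> np.ndarray:
    """Map each sub-vector to the index of its nearest codebook centroid."""
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        s = v[i * sub:(i + 1) * sub]
        codes[i] = np.argmin(((codebooks[i] - s) ** 2).sum(axis=1))
    return codes  # 4 bytes instead of 128 bytes of float32

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Lossy reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

v = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(v)
approx = pq_decode(codes)     # close to v only if the codebooks were trained
```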
The vector database is half the equation. The embedding model determines the quality of your vectors and thus the quality of your retrieval.
| Model | Dimensions | MTEB Score | Cost | Latency | Notes |
|-------|-----------|------------|------|---------|-------|
| OpenAI text-embedding-3-small | 1,536 | 62.3 | $0.02/1M tokens | ~50ms | Best price/performance ratio |
| OpenAI text-embedding-3-large | 3,072 | 64.6 | $0.13/1M tokens | ~80ms | Highest quality from OpenAI |
| Cohere embed-english-v3.0 | 1,024 | 64.5 | $0.10/1M tokens | ~60ms | Excellent for search and RAG |
| [Nomic nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 768 | 62.3 | Free (self-host) | ~10ms* | Open-source, self-hostable |
| [BGE bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1,024 | 63.6 | Free (self-host) | ~15ms* | Open-source, top MTEB performer |
| Voyage voyage-large-2 | 1,536 | 64.8 | $0.12/1M tokens | ~70ms | Strong for code and technical text |
*Self-hosted latency depends on your GPU. These are estimates for an RTX 4090.
My recommendation: Start with text-embedding-3-small. It's cheap, fast, and good enough for most use cases. If retrieval quality isn't meeting your bar, switch to text-embedding-3-large or Cohere v3 — don't tune the vector database first. Embedding quality dominates retrieval quality more than index configuration.

If you're cost-sensitive or have privacy constraints, self-host Nomic or BGE. Both run on consumer GPUs and produce competitive embeddings. See [Self-Hosting LLMs with FastAPI](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi) for the deployment infrastructure — the same FastAPI + Docker setup works for embedding models.

Critical rule: never mix embedding models in the same collection. Vectors from different models are incompatible — their embedding spaces are different. If you switch models, re-embed your entire corpus. This is expensive, so choose carefully and commit.
| Feature | [Pinecone](https://www.pinecone.io/) | [Weaviate](https://weaviate.io/) | [Qdrant](https://qdrant.tech/) | [Chroma](https://www.trychroma.com/) | [pgvector](https://github.com/pgvector/pgvector) |
|---------|---------|----------|--------|--------|----------|
| Deployment | Managed only | Self-hosted + Cloud | Self-hosted + Cloud | Self-hosted + Cloud | PostgreSQL extension |
| Index type | Proprietary | HNSW | HNSW + quantization | HNSW (via hnswlib) | HNSW + IVFFlat |
| Max dimensions | 20,000 | Unlimited | Unlimited | Unlimited | 2,000 |
| Metadata filtering | Yes (server-side) | Yes (inverted index) | Yes (payload index) | Yes (basic) | Yes (SQL WHERE) |
| Hybrid search | Sparse + dense | BM25 + vector | Sparse + dense | No | Full-text + vector |
| Multi-tenancy | Namespaces | Built-in | Collection-level | Collections | Schema/row-level |
| Disk-based index | No (all in RAM) | Yes (mmap) | Yes (mmap + on-disk) | No | Yes (PostgreSQL storage) |
| Operational overhead | Zero (managed) | Moderate | Low-Moderate | Low | Low (if you run Postgres) |
| Cost (1M vectors) | ~$70/mo (s1) | ~$25/mo (self-hosted) | ~$20/mo (self-hosted) | Free (self-hosted) | Free (existing Postgres) |
| Maturity | Production-proven | Production-proven | Production-proven | Early but growing | Production-proven |
Pinecone is fully managed. No servers to provision, no indexes to tune, no replication to configure. You get an API endpoint and it works. This is its entire value proposition — and it's a legitimate one. If your team is 3 engineers building a RAG feature and none of you want to become vector database operators, Pinecone is the right choice.
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")
# Upsert vectors with metadata
index.upsert(
vectors=[
{
"id": "doc-1",
"values": get_embedding("LTI 1.3 uses OIDC for authentication"),
"metadata": {"source": "docs", "category": "auth", "date": "2024-01-15"},
},
],
namespace="knowledge-base",
)
# Query with metadata filter
results = index.query(
vector=get_embedding("How does LTI authentication work?"),
top_k=5,
namespace="knowledge-base",
filter={"category": {"$eq": "auth"}},
include_metadata=True,
)
The downside: cost scales linearly, no on-disk storage option, and you're fully vendor-locked. At 100M+ vectors, the bill gets serious.
Qdrant is my default recommendation for teams that want to self-host. The API is clean, the documentation is excellent, and it handles the hard problems (quantization, on-disk storage, distributed deployment) without requiring a PhD in distributed systems.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    VectorParams,
    PointStruct,
    Filter,
    FieldCondition,
    MatchValue,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
)

client = QdrantClient(url="http://localhost:6333")

# Create collection with scalar quantization enabled
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        on_disk=True,
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)
# Upsert with rich payload
client.upsert(
collection_name="knowledge_base",
points=[
PointStruct(
id=1,
vector=get_embedding("LTI 1.3 uses OIDC for authentication"),
payload={
"source": "docs",
"category": "auth",
"date": "2024-01-15",
"word_count": 450,
},
),
],
)
# Query with payload filtering
results = client.query_points(
collection_name="knowledge_base",
query=get_embedding("How does LTI authentication work?"),
query_filter=Filter(
must=[FieldCondition(key="category", match=MatchValue(value="auth"))]
),
limit=5,
)
Qdrant's scalar quantization keeps the quantized vectors in RAM for fast distance computation while storing the full-precision vectors on disk for re-ranking. This gives HNSW-speed queries at IVF-level memory usage. For my [agentic AI systems](https://tostupidtooquit.com/blog/agentic-ai-multi-agent-systems), Qdrant handles the long-term memory store with sub-10ms recall latency.
If you already have a Postgres database, pgvector is the zero-infrastructure option. No new service to deploy, no new backup strategy, no new monitoring. Add the extension, create an index, query with SQL.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536),
category TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Create HNSW index for cosine distance
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Query: nearest neighbors with metadata filter
SELECT id, content, category,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'auth'
ORDER BY embedding <=> $1::vector
LIMIT 5;
The <=> operator computes cosine distance. The HNSW index makes it fast. WHERE clause filtering happens at the SQL level, which means you get the full power of PostgreSQL's query planner for metadata filtering — something purpose-built vector databases are still improving.
pgvector's limitations: The HNSW index lives in shared memory, so large indexes compete with your regular Postgres workload. Build times are slow for 10M+ vectors. And the 2,000-dimension limit means you can't use OpenAI's text-embedding-3-large (3,072 dimensions) without truncation. For most RAG use cases with 1,536-dimension embeddings and under 5M vectors, pgvector is performant and operationally free if Postgres is already in your stack.
Weaviate combines vector search with BM25 keyword search in a single query. This is important because pure vector search misses exact-match queries ("error code E4021") and pure keyword search misses semantic queries ("authentication failures"). Hybrid search does both:
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Document")
# Hybrid query: combines BM25 + vector search
response = collection.query.hybrid(
query="LTI authentication OIDC flow",
alpha=0.7, # 0 = pure BM25, 1 = pure vector
limit=5,
filters=weaviate.classes.query.Filter.by_property("category").equal("auth"),
return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)
for obj in response.objects:
print(f"{obj.properties['title']} (score: {obj.metadata.score:.3f})")
The alpha parameter controls the balance. In my experience, alpha=0.7 (70% vector, 30% keyword) works well for general-purpose RAG. For code search, drop it to alpha=0.5 because function names and variable names are exact-match patterns that BM25 handles better.
Chroma is the SQLite of vector databases. Zero configuration, runs in-process, perfect for prototyping. I use it for local development of RAG pipelines and swap in Qdrant or pgvector for production.
import chromadb

client = chromadb.Client()
collection = client.create_collection("test")
collection.add(
documents=["LTI 1.3 uses OIDC", "OAuth 2.0 token exchange"],
ids=["doc1", "doc2"],
)
results = collection.query(
query_texts=["authentication protocol"],
n_results=2,
)
Chroma handles embedding automatically (uses a local model by default). For production, use a dedicated embedding endpoint and manage vectors yourself. Don't deploy Chroma as your production vector store — it's not designed for multi-tenant, high-availability workloads.
Before vectors reach the database, documents must be chunked. How you chunk determines retrieval quality more than which database you use. Three strategies:

Fixed-size chunking. Split every N tokens with M token overlap. Simple, predictable, works for homogeneous content. I use 512 tokens with 50 token overlap as a starting point:
from tiktoken import encoding_for_model

enc = encoding_for_model("text-embedding-3-small")
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + chunk_size
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
start += chunk_size - overlap
return chunks
Semantic chunking. Split at natural boundaries — paragraph breaks, section headers, topic shifts. Better retrieval quality because chunks are coherent units of meaning:
import re
def chunk_semantic(text: str, max_tokens: int = 512) -> list[str]:
# Split on double newlines (paragraph boundaries)
paragraphs = re.split(r"\n\n+", text)
chunks = []
current_chunk = []
current_size = 0
for para in paragraphs:
para_tokens = len(enc.encode(para))
if current_size + para_tokens > max_tokens and current_chunk:
chunks.append("\n\n".join(current_chunk))
current_chunk = [para]
current_size = para_tokens
else:
current_chunk.append(para)
current_size += para_tokens
if current_chunk:
chunks.append("\n\n".join(current_chunk))
return chunks
Parent-child chunking. Embed small chunks (256 tokens) for precise retrieval, but return the parent chunk (1024 tokens) for context. The small chunk matches the query accurately; the parent chunk gives the LLM enough context to generate a good response. This is the highest-quality strategy and what I use for production RAG systems:
def chunk_parent_child(
text: str, parent_size: int = 1024, child_size: int = 256, child_overlap: int = 50
) -> list[dict]:
"""Create parent-child chunk pairs for retrieval."""
parent_chunks = chunk_fixed(text, parent_size, overlap=100)
results = []
for parent_idx, parent in enumerate(parent_chunks):
children = chunk_fixed(parent, child_size, child_overlap)
for child_idx, child in enumerate(children):
results.append({
"parent_id": f"parent_{parent_idx}",
"child_id": f"parent_{parent_idx}_child_{child_idx}",
"parent_text": parent,
"child_text": child,
"embedding": get_embedding(child), # Embed the child
})
return results
Store child embeddings in the vector database with parent_id as metadata. Search returns child matches; look up the parent for context:
results = client.query_points(
collection_name="knowledge_base",
query=get_embedding(user_query),
limit=5,
)
# Retrieve unique parent chunks for full context
parent_ids = list(set(r.payload["parent_id"] for r in results.points))
parent_chunks = fetch_parents_from_store(parent_ids)
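fetch_parents_from_store is whatever lookup you have. A minimal in-memory sketch — hypothetical, since in production this is a key-value lookup against Redis, Postgres, or the vector database's own payload store:

```python
# Populated at indexing time from chunk_parent_child output
parent_store: dict[str, str] = {}

def index_parents(chunks: list[dict]) -> None:
    """Register each parent chunk's text under its parent_id."""
    for chunk in chunks:
        parent_store[chunk["parent_id"]] = chunk["parent_text"]

def fetch_parents_from_store(parent_ids: list[str]) -> dict[str, str]:
    """Return parent text for each id; unknown ids are silently skipped."""
    return {pid: parent_store[pid] for pid in parent_ids if pid in parent_store}
```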
Don't embed documents synchronously in your API request path. Build an async pipeline:
Document Created/Updated
│
▼
┌─────────────────────┐
│ Message Queue │ (Redis, SQS, or Kafka)
│ "embed:doc-123" │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Embedding Worker │ (Batches documents, calls embedding API)
│ - Chunk document │
│ - Generate vectors │
│ - Upsert to vecDB │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Vector Database │ (Qdrant / pgvector / Pinecone)
└─────────────────────┘
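Inside the worker, a simple batching helper keeps API calls large. The batch size of 100 is an assumption — tune it to your provider's input limits:

```python
def batched(items: list[str], size: int = 100) -> list[list[str]]:
    """Split texts into API-sized batches; OpenAI accepts arrays of inputs."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Worker loop sketch: one embeddings call per batch instead of per text
# for batch in batched(pending_chunks):
#     response = client.embeddings.create(input=batch, model="text-embedding-3-small")
#     upsert_vectors(batch, [item.embedding for item in response.data])
```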
Batch embedding calls whenever possible. OpenAI's API accepts arrays of inputs — embedding 100 texts in one call is ~95% cheaper than 100 single calls (due to per-request overhead, not pricing). Most self-hosted models also benefit from batching for GPU utilization.

Reranking: Vector search retrieves candidates. A reranker (like Cohere Rerank or a cross-encoder model) scores them more accurately. The two-stage pattern — retrieve 50 candidates with ANN, rerank to top 5 — consistently outperforms single-stage retrieval with more candidates.
import cohere

co = cohere.Client("your-api-key")
def retrieve_and_rerank(query: str, collection_name: str, top_k: int = 5) -> list[dict]:
# Stage 1: Fast vector search — get 50 candidates
candidates = client.query_points(
collection_name=collection_name,
query=get_embedding(query),
limit=50,
)
# Stage 2: Rerank with cross-encoder for accurate scoring
docs = [p.payload["content"] for p in candidates.points]
reranked = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=docs,
top_n=top_k,
)
return [
{
"content": docs[r.index],
"relevance_score": r.relevance_score,
"vector_id": candidates.points[r.index].id,
}
for r in reranked.results
]
In production, track query latency and recall, and tune HNSW accordingly. The two build-time parameters are m (number of connections per node, typically 16-64) and ef_construction (build-time search width, typically 100-400). Higher values improve recall but increase memory and build time. Set ef_construction high at index build time (you do this once) and tune the query-time parameter ef_search for the latency/recall tradeoff you need. A typical starting point: m=32, ef_construction=200, ef_search=128.
Pitfall 1: Skipping model-specific prefixes. Some embedding models (BGE, E5, Nomic) are trained with instruction prefixes — "Represent this document for retrieval: ..." vs "Represent this query for searching: ...". If you forget the prefix, cosine similarity drops 10-15%. Read your model's documentation.
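A small guard makes the prefix impossible to forget. The prefix strings below are Nomic Embed's documented ones — BGE and E5 use their own, so check your model card:

```python
# Nomic Embed's documented task prefixes; other models differ
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def prepare_text(text: str, *, is_query: bool) -> str:
    """Attach the correct task prefix exactly once before embedding."""
    prefix = QUERY_PREFIX if is_query else DOC_PREFIX
    return text if text.startswith(prefix) else prefix + text
```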
Pitfall 2: Not storing original text alongside vectors. A vector database stores vectors and metadata, but you need the original text to pass to the LLM. Store the text chunk in metadata/payload — don't make a separate database call to retrieve it. The latency adds up.
Pitfall 3: Over-indexing metadata. Every indexed metadata field adds memory and CPU overhead. Index the fields you actually filter on (category, date range, tenant ID). Don't index every field "just in case" — the where clause you never use still costs you at upsert time.
Pitfall 4: No deletion strategy. When source documents are updated or deleted, their old vectors persist in the database. Over time, this causes stale results and wasted storage. Implement a garbage collection process: track document versions, and when a new version is embedded, delete the old vectors.
Pitfall 5: Testing with toy data, deploying with production data. A vector database that's fast with 10,000 vectors may be unacceptable at 10,000,000. Test at production scale — or at minimum, 10x your expected initial scale. Load test tools like [Locust](https://locust.io/) can simulate concurrent vector queries with realistic payloads.
Pitfall 6: Ignoring the hybrid search gap. Pure vector search fails on exact-match queries. A user searching for "error ERR_4021" will get semantically similar results about errors, not the specific error code. If your corpus contains identifiers, codes, or proper nouns that users search for exactly, you need hybrid search (Weaviate) or a BM25 fallback alongside your vector store.

The bottom line: your embedding model and chunking strategy matter more than which database you pick. Get those right first, then choose the database that matches your operational reality — managed if you don't want to run infrastructure, pgvector if Postgres is already in your stack, Qdrant if you want control without complexity, Weaviate if hybrid search is critical.