Most vector database comparisons test on 10,000 vectors with synthetic data and a single query type. Then they publish a bar chart showing latency and call it a benchmark.
Here's what actually matters in production: how does filtering interact with ANN search? What's p99 latency at 10M vectors under concurrent load? What happens when the index needs to be rebuilt? Which databases support hybrid search natively vs as a bolt-on? And what does it cost when you scale from 1M to 100M vectors?
This is a production-focused comparison of the five databases every enterprise RAG team evaluates — Pinecone, pgvector, Azure AI Search, Weaviate, and Qdrant — across the dimensions that determine whether your system works at 3am on a Tuesday when query load spikes.
The Contenders
Architecture — How Each Database Works
Understanding the index architecture explains every performance characteristic downstream.
Pinecone
Pinecone is a purpose-built managed vector database. Vectors are stored in proprietary index structures on Pinecone's infrastructure. Two deployment modes:
- Serverless — pay per query and storage, no pod sizing. Cold start latency on first query after idle. Best for variable workloads.
- Pod-based — dedicated compute, predictable latency, no cold start. Required for p99 SLA guarantees.
Index type: proprietary ANN (based on HNSW internals, not publicly documented). Namespaces allow logical partitioning within a single index — useful for multi-tenant isolation without separate indexes.
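A minimal sketch of namespace-based tenant partitioning with the Pinecone Python SDK (v3+); the index name, namespace, and metadata fields are illustrative, and `embedding` / `query_embedding` are assumed to come from your embedding model:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")       # hypothetical key
index = pc.Index("rag-chunks")     # illustrative index name

# Upsert into a tenant-specific namespace — no separate index needed
index.upsert(
    vectors=[{"id": "chunk-1", "values": embedding,
              "metadata": {"doc_type": "guideline"}}],
    namespace="tenant-a",
)

# Queries are scoped to the namespace; metadata filters narrow further
res = index.query(
    vector=query_embedding,
    top_k=10,
    namespace="tenant-a",
    filter={"doc_type": {"$eq": "guideline"}},
    include_metadata=True,
)
```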
pgvector
pgvector is a PostgreSQL extension. Vectors are stored as columns in standard PostgreSQL tables alongside all your relational data. Two index types:
- IVFFlat — Inverted File index, divides the vector space into clusters (lists). Fast to build, moderate recall. Requires `ANALYZE` after bulk inserts.
- HNSW (added in pgvector 0.5.0) — the same graph-based ANN as dedicated vector DBs. Better recall, higher memory, slower to build.
The critical difference from dedicated vector DBs: pgvector runs inside PostgreSQL. This means joins, transactions, and SQL predicates — but also means you're sharing resources with your OLTP workload and fighting for the PostgreSQL buffer pool.
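For example, vector similarity and live relational state can be combined in a single round trip — a sketch, assuming the psycopg2 cursor and `chunks` schema from the full example later in this post, plus a hypothetical `loans` table:

```python
# One query: relational JOIN + pre-filter + ANN ordering.
# Assumes chunks has a loan_id column referencing a hypothetical loans table.
cur.execute("""
    SELECT c.id, c.content, l.loan_number
    FROM chunks c
    JOIN loans l ON l.id = c.loan_id
    WHERE l.status = 'active'                -- relational pre-filter
    ORDER BY c.embedding <=> %s::vector      -- ANN ordering
    LIMIT 10
""", (query_vector,))
```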
Azure AI Search
Azure AI Search is not a pure vector database — it's a hybrid search platform. It combines:
- HNSW for approximate vector search
- Inverted index (BM25) for full-text keyword search
- Semantic ranker (cross-encoder) for precision reranking
- Filtering via OData expressions on any indexed field
All of this happens in one service, one API call, one result set merged via RRF. For RAG, this is the key advantage — you don't assemble a hybrid pipeline from parts; it's native.
Weaviate
Weaviate is an open source vector database written in Go. Native support for:
- HNSW index (configurable m, ef)
- BM25 (built-in, via the BM25 operator)
- Hybrid search (BM25 + vector, configurable alpha weighting)
- Multi-tenancy (tenant-per-class or tenant-per-shard)
- Multimodal vectors (text, image, audio via module system)
- GraphQL API (plus REST and gRPC)
Weaviate's module system is the differentiator — you can plug in vectorizers (OpenAI, Cohere, HuggingFace) directly into the database, which handles embedding at insert and query time.
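A sketch of wiring a vectorizer module into a collection with the v4 Python client — the collection name, properties, and model choice here are illustrative:

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": "..."}  # the module calls OpenAI server-side
)

# The vectorizer lives in the database: inserts and queries embed server-side
client.collections.create(
    "MortgageChunks",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),  # illustrative choice
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="doc_type", data_type=DataType.TEXT),
    ],
)

chunks = client.collections.get("MortgageChunks")
chunks.data.insert({"content": "FHA DTI limits...", "doc_type": "guideline"})  # embedded on insert
results = chunks.query.near_text(query="DTI limit", limit=5)                   # embedded at query time
client.close()
```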
Qdrant
Qdrant is an open source vector database written in Rust. Purpose-built for high-throughput vector search:
- HNSW with advanced quantization (scalar, product, binary)
- Sparse vector support (for BM25-style retrieval)
- Native hybrid search via RRF (dense + sparse)
- Named vectors — multiple vector representations per point
- Payload filtering with indexed payload fields
- On-disk HNSW for large indexes that exceed RAM
Qdrant's Rust implementation gives it the lowest memory footprint and highest single-node throughput of any open source vector database.
The Production Dimensions
1. ANN Algorithm and Recall
Notes:
- Pinecone serverless trades recall for cost — the index is optimized for storage efficiency
- pgvector HNSW recall matches dedicated databases but requires careful `ef_search` tuning per query
- Qdrant achieves the highest recall via configurable `hnsw_ef` and quantization that preserves precision
- Azure AI Search recall is not configurable — Microsoft manages index parameters; typically 0.95–0.97
Recall is table stakes. What separates production systems is how recall interacts with filtering.
2. Filtering — Pre-filter vs Post-filter (The Most Important Dimension)
This is the dimension most comparisons get wrong. Filtering strategy determines retrieval correctness under metadata constraints — not raw ANN recall.
Post-filtering runs ANN over all vectors, then discards results that don't match the filter. If only 1% of your corpus matches the filter, you need to retrieve 1,000 candidates to get 10 valid results — or you miss relevant documents entirely.
Pre-filtering applies the metadata filter first, reducing the search space, then runs ANN within that filtered set. Correct results, but requires indexed payload fields and index structures that support filtered ANN.
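The arithmetic behind the post-filtering failure mode is worth making explicit — a back-of-envelope sketch:

```python
def post_filter_candidates(k: int, selectivity: float) -> int:
    """Expected ANN candidates needed so k results survive a post-filter,
    assuming matching documents are spread uniformly through the ranking."""
    return int(k / selectivity)

print(post_filter_candidates(10, 0.01))   # 1% filter match -> ~1,000 candidates for top-10
print(post_filter_candidates(10, 0.001))  # 0.1% match -> ~10,000 candidates
```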
| Database | Filtering Strategy | Production Impact |
|---|---|---|
| Pinecone | Metadata filter applied post-ANN (serverless) / pre-filter on pods with metadata index | Serverless struggles on high-selectivity filters; pod-based is correct |
| pgvector | SQL WHERE clause — true pre-filter via PostgreSQL query planner | Correct pre-filtering via SQL; query planner may choose seq scan over index for small filtered sets |
| Azure AI Search | OData $filter applied as pre-filter before HNSW | True pre-filter, fast, indexed fields mandatory |
| Weaviate | where filter — pre-filter via allow-list; ACORN filter strategy (v1.27+) | True pre-filter; enable ACORN for highly selective filters |
| Qdrant | Payload filter — pre-filter with indexed payload fields | True pre-filter when payload fields are indexed; use create_payload_index |
Production rule: always index your filter fields and verify pre-filtering behavior. A vector database with post-filtering is silently wrong for any query that uses metadata filters — the most common RAG query pattern.
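One way to verify before committing: compare the database's filtered top-k against an exact brute-force scan of the filtered subset. A database-agnostic sketch with NumPy, where `db_result_ids` are whatever IDs your candidate database returned for the filtered query:

```python
import numpy as np

def filtered_recall_at_k(vectors, metadata_mask, query, db_result_ids, k=10):
    """Exact top-k over the filtered subset vs. the database's filtered ANN results."""
    # Ground truth: cosine similarity restricted to rows matching the filter
    subset_ids = np.flatnonzero(metadata_mask)
    subset = vectors[subset_ids]
    sims = subset @ query / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    truth = set(subset_ids[np.argsort(-sims)[:k]])
    return len(truth & set(db_result_ids)) / k

# Recall well below 1.0 on selective filters is the signature of post-filtering.
```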
3. Hybrid Search Support
| Database | BM25 | Vector | Hybrid Native | Fusion |
|---|---|---|---|---|
| Pinecone | ✗ No | ✓ | ✗ — external BM25 required | Manual |
| pgvector | Via pg_search (ParadeDB) | ✓ | Partial — separate queries + manual merge | Manual RRF |
| Azure AI Search | ✓ Native | ✓ Native | ✓ Native — one API call | RRF automatic |
| Weaviate | ✓ Native BM25 | ✓ Native | ✓ Native — alpha weight control | Weighted fusion |
| Qdrant | Via sparse vectors (SPLADE/BM25) | ✓ Native | ✓ Native — RRF (v1.7+) | RRF automatic |
Pinecone's hybrid gap is significant. For production RAG, you need BM25 alongside vector search (loan codes, product IDs, named entities require exact term matching). With Pinecone, you run Elasticsearch or OpenSearch separately, merge results manually. That's two systems to operate, two failure domains, and latency from two network hops.
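If you do run Pinecone next to an external BM25 engine, the merge step is simple but yours to own — a minimal reciprocal rank fusion sketch, where `vector_ids` and `bm25_ids` are ranked ID lists from the two systems:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Merge ranked ID lists with reciprocal rank fusion:
    score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

merged = rrf_merge([vector_ids, bm25_ids])  # two ranked lists, two systems
```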
4. Latency at Scale
Test configuration: 10M vectors, 1536-dim, single metadata filter, top-10 results, p50/p99 under 100 concurrent queries.
Key observations:
- Qdrant — lowest latency across both p50 and p99. Rust + SIMD vectorization + zero-copy memory design. p99 stays under 50ms even under concurrency.
- pgvector — p99 degrades significantly under concurrent load. PostgreSQL's connection model and shared buffer pool become bottlenecks at high concurrency. Use PgBouncer connection pooling and dedicated read replicas for production.
- Pinecone pod — consistent p50/p99 gap (good). Cold start on serverless adds 200–800ms to first query after idle period.
- Azure AI Search — p99 includes BM25 + HNSW + semantic reranker in one call. Comparable p99 to dedicated vector DBs for the full hybrid pipeline.
- Weaviate — Go GC pauses can spike p99. Tune `GOGC` and give the HNSW index dedicated memory to minimize GC impact.
5. Indexing Speed and Index Rebuild
Production RAG systems re-index continuously — new documents, updated guidelines, CDC from SQL. Indexing throughput and the behavior during re-indexing matter as much as query performance.
| Database | Indexing Throughput | Rebuild Behavior | Live Re-index? |
|---|---|---|---|
| Pinecone | ~500 vectors/sec (serverless) / ~2K/sec (pod) | No rebuild needed — upsert by ID | ✓ Live upsert, no downtime |
| pgvector IVFFlat | Fast bulk insert, but needs ANALYZE + VACUUM | Requires full rebuild for lists change | ✓ Live insert, index degrades without maintenance |
| pgvector HNSW | Slow — O(n log n) build | Full rebuild required for param changes | ✓ Live insert, no rebuild for new rows |
| Azure AI Search | ~1K docs/sec (indexer), batch API faster | Index updates are incremental | ✓ Live — indexer merges changes |
| Weaviate | ~3K vectors/sec (batch import) | HNSW rebuild on schema change | ✓ Live batch import |
| Qdrant | ~5K vectors/sec (batch), async indexing | Background HNSW optimization | ✓ Live — segments merge in background |
Qdrant's async indexing is critical for production: vectors are inserted immediately into a flat index (instant, 100% recall), then background HNSW optimization runs on segments. Queries always return results — even during heavy insert load — because the flat index is always current. Other databases can return stale or incomplete results if queried during index build.
pgvector IVFFlat degradation is the most common production issue: the IVF centroids are fixed at index build time, so heavy inserts leave them unrepresentative of the data and recall drops silently, while stale statistics mislead the query planner. Automate nightly VACUUM ANALYZE, reindex after large ingests, and monitor recall with a synthetic test set.
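A sketch of that nightly job, assuming the `chunks` table from the pgvector example later in this post (`test_vector` is a held-out probe embedding); note VACUUM cannot run inside a transaction block:

```python
import psycopg2

conn = psycopg2.connect(DATABASE_URL)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Nightly: refresh planner statistics and reclaim dead tuples
cur.execute("VACUUM ANALYZE chunks")

# Recall probe: exact scan (index disabled) vs. indexed results for a test query
cur.execute("SET enable_indexscan = off")  # force exact sequential scan
cur.execute("SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT 10",
            (test_vector,))
exact_ids = {row[0] for row in cur.fetchall()}

cur.execute("SET enable_indexscan = on")
cur.execute("SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT 10",
            (test_vector,))
approx_ids = {row[0] for row in cur.fetchall()}

recall = len(exact_ids & approx_ids) / 10  # alert if this drifts below target
```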
6. Multi-Tenancy
Enterprise RAG almost always requires multi-tenancy — different customers, business units, or user groups with isolated data.
- Pinecone namespaces — logical partitioning within one index. Fast namespace switch, shared compute. Data is not cryptographically isolated — metadata filter bypass risk exists.
- Azure AI Search + Weaviate — index-per-tenant is the enterprise pattern. Separate indexes, separate access keys, true isolation. Higher operational overhead but required for regulated industries (HIPAA, PCI).
- Qdrant collections — collection-per-tenant with payload filtering. gRPC API supports efficient tenant switching.
- pgvector — row-level security (RLS) in PostgreSQL is the isolation mechanism. Correct when implemented, but RLS misconfiguration is a common vulnerability. Requires security audit.
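A minimal sketch of the RLS pattern, assuming a `tenant_id` column on the chunks table and a per-request session variable (names are illustrative):

```python
# Enable RLS and scope every query to the tenant in the session variable
cur.execute("ALTER TABLE chunks ENABLE ROW LEVEL SECURITY")
cur.execute("ALTER TABLE chunks FORCE ROW LEVEL SECURITY")  # apply even to the table owner
cur.execute("""
    CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id'))
""")

# Per request: set the tenant before any vector query runs
# (SET cannot take bind parameters, so use set_config)
cur.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant_id,))
```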
7. Cost at Scale
Monthly cost estimate — 10M vectors, 1536-dim, 1M queries/month:
| Database | Infrastructure | Estimated Monthly Cost | Notes |
|---|---|---|---|
| Pinecone Serverless | Managed | ~$120–180 | Storage + query units |
| Pinecone Pod (s1.x1) | Managed | ~$700 | Dedicated pod, predictable |
| pgvector | AWS r6g.2xlarge RDS | ~$350 | Multi-AZ, includes storage |
| Azure AI Search S2 | Managed | ~$500 | Includes BM25 + semantic ranker |
| Weaviate Cloud | Managed | ~$400–600 | Depends on node size |
| Qdrant Cloud | Managed | ~$200–350 | 1x4GB node |
| Qdrant Self-hosted | 8-core 32GB VM | ~$120–180 | Operational overhead |
| Weaviate Self-hosted | 8-core 32GB VM | ~$120–180 | Operational overhead |
Hidden costs to factor in:
- Pinecone: re-embedding on index migration (no index export), cold start mitigation (keep-alive pings)
- pgvector: DBA time for vacuum/analyze automation, connection pooler setup, replica lag monitoring
- Azure AI Search: semantic ranker is an add-on tier — S1 doesn't include it; S2+ required
- Self-hosted (Qdrant/Weaviate): operational overhead — backup automation, monitoring, on-call, upgrades
8. Enterprise Features
| Feature | Pinecone | pgvector | Azure AI Search | Weaviate | Qdrant |
|---|---|---|---|---|---|
| SOC 2 Type II | ✓ | Via RDS/Cloud SQL | ✓ | ✓ Cloud | ✓ Cloud |
| HIPAA | ✓ Enterprise | Via compliant PG host | ✓ | ✓ Enterprise | Self-hosted |
| Private networking | ✓ VPC | ✓ VPC/Private Link | ✓ Private Link | ✓ | ✓ |
| RBAC | ✓ API key scopes | PostgreSQL roles | ✓ Azure RBAC + AAD | ✓ | ✓ |
| Backup / PITR | ✓ Managed | ✓ RDS automated | ✓ Geo-redundant | ✓ Cloud | Manual / Cloud |
| Monitoring | ✓ Dashboard | PostgreSQL metrics | ✓ Azure Monitor | ✓ Prometheus | ✓ Prometheus |
| SLA uptime | 99.95% | Per cloud provider | 99.9% | 99.9% Cloud | Per cloud |
Full Comparison Matrix
| Dimension | Pinecone | pgvector | Azure AI Search | Weaviate | Qdrant |
|---|---|---|---|---|---|
| Index type | Proprietary ANN | IVFFlat / HNSW | HNSW | HNSW | HNSW + quantization |
| Hybrid search | ✗ External BM25 | Partial (ParadeDB) | ✓ Native | ✓ Native | ✓ Native |
| Pre-filtering | Pod only | ✓ SQL WHERE | ✓ OData | ✓ ACORN (v1.27+) | ✓ Payload index |
| Multimodal | ✗ | ✗ | ✓ Vision integration | ✓ Module system | Partial |
| Multi-tenancy | Namespaces | RLS | Index-per-tenant | Index-per-tenant | Collections |
| Managed option | ✓ | ✓ (RDS/Cloud SQL) | ✓ | ✓ | ✓ |
| Self-hosted | ✗ | ✓ | ✗ | ✓ | ✓ |
| p50 latency | 15–25ms | 30–60ms | 30–50ms | 20–35ms | 8–15ms |
| p99 under load | 60–80ms | 150–300ms | 100–140ms | 80–110ms | 35–55ms |
| Index throughput | ~500–2K/sec | ~1–2K/sec | ~1K/sec | ~3K/sec | ~5K/sec |
| Live re-index | ✓ Upsert | ✓ Insert | ✓ Incremental | ✓ Batch | ✓ Async segments |
| Cost at 10M vectors | $120–700/mo | $350/mo | $500/mo | $180–600/mo | $120–350/mo |
| Best for | Quick start, serverless | Existing PG stack | Azure enterprise | Multimodal, OSS | Max throughput, OSS |
When Each Database Wins
Choose Pinecone when:
- Your team has no infrastructure engineers and needs zero operational overhead
- You're prototyping or in early production and need to move fast
- Your workload is variable — serverless scales to zero and you pay per query
- You don't need hybrid search (or you already run Elasticsearch separately)
Watch out for: serverless cold starts in latency-sensitive paths, no hybrid search natively, no self-hosted option (vendor lock-in), cost unpredictability at query volume.
Choose pgvector when:
- You already run PostgreSQL and vectors belong with the relational data
- Your queries frequently join vector search with relational conditions (`WHERE loan_status = 'active' AND vector_similarity > 0.8`)
- Your team knows SQL and PostgreSQL operations — no new system to learn
- Data volume is under 5M vectors (pgvector starts straining above this on single-node)
Watch out for: p99 degradation under concurrent load, HNSW build time for large datasets, sharing buffer pool with OLTP workload, manual VACUUM ANALYZE discipline required.
```python
# pgvector — create table and HNSW index
import psycopg2

conn = psycopg2.connect(DATABASE_URL)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        content TEXT,
        doc_id TEXT,
        doc_type TEXT,
        embedding vector(512),
        created_at TIMESTAMPTZ DEFAULT now()
    )
""")

# HNSW index — better recall than IVFFlat, slower to build
cur.execute("""
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200)
""")

# Index metadata for pre-filtering
cur.execute("CREATE INDEX IF NOT EXISTS chunks_doc_type ON chunks (doc_type)")
conn.commit()

# Raise ef_search for this session BEFORE querying — higher recall, slower queries
cur.execute("SET hnsw.ef_search = 100")

# Query — pre-filter via SQL WHERE, then ANN
cur.execute("""
    SELECT id, content, doc_id,
           1 - (embedding <=> %s::vector) AS similarity
    FROM chunks
    WHERE doc_type = 'guideline'          -- pre-filter
    ORDER BY embedding <=> %s::vector     -- ANN
    LIMIT 10
""", (query_vector, query_vector))
```
Choose Azure AI Search when:
- You're building on Azure and need hybrid search + semantic reranking in one service
- Compliance requires data residency within Azure regions (HIPAA, SOC 2, FedRAMP)
- Your stack is .NET / C# — Semantic Kernel integration is native
- You're indexing SharePoint, Azure SQL, Cosmos DB — native connectors exist
- You need enterprise RBAC via Azure Active Directory
Watch out for: semantic ranker requires S2+ tier (significant cost jump), no self-hosted option, index schema changes require index rebuild.
```python
# Azure AI Search — full hybrid pipeline in one call
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery, QueryType, QueryCaptionType

client = SearchClient(
    endpoint="https://<service>.search.windows.net",  # your service endpoint
    index_name="mortgage-chunks",
    credential=AzureKeyCredential("<api-key>"),
)

results = client.search(
    search_text=query,                        # BM25
    vector_queries=[VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=50,
        fields="content_vector"
    )],                                       # HNSW
    query_type=QueryType.SEMANTIC,            # semantic reranker
    semantic_configuration_name="default",
    query_caption=QueryCaptionType.EXTRACTIVE,
    filter="doc_type eq 'guideline' and doc_version eq '2026-Q1'",  # pre-filter
    top=5
)
```
Choose Weaviate when:
- You need multimodal search — text + images in the same index
- You want the vectorizer to live inside the database (auto-embedding at insert)
- GraphQL API fits your frontend/API layer
- You need hybrid search with tunable BM25/vector weighting (alpha parameter)
Watch out for: Go GC pauses at high throughput (tune GOGC), GraphQL overhead for simple queries (use gRPC instead), schema migrations require careful planning.
```python
import weaviate
from weaviate.classes.query import Filter, HybridFusion, MetadataQuery

client = weaviate.connect_to_local()
collection = client.collections.get("MortgageChunks")

# Hybrid search with tunable alpha (0 = pure BM25, 1 = pure vector, 0.5 = balanced)
results = collection.query.hybrid(
    query="FHA loan DTI limit compensating factors",
    alpha=0.7,                                   # 70% vector, 30% BM25
    fusion_type=HybridFusion.RELATIVE_SCORE,
    filters=Filter.by_property("doc_type").equal("guideline"),
    limit=10,
    return_metadata=MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f} | {obj.properties['doc_title']}")

client.close()
```
Choose Qdrant when:
- Throughput and latency are the primary constraints
- You need sparse + dense hybrid search with RRF in a single database
- You need quantization to reduce memory footprint at scale (binary quantization = 32x memory reduction)
- You're building on-premise or air-gapped (Rust binary, no JVM/GC)
- Cost efficiency matters — highest performance per dollar of any open source option
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, SparseVectorParams, SparseVector,
    Prefetch, FusionQuery, Fusion, Filter, FieldCondition, MatchValue,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType, HnswConfigDiff,
)

client = QdrantClient("localhost", port=6333)

# Create collection with scalar quantization (4x memory reduction, ~2% recall drop)
client.create_collection(
    collection_name="mortgage-chunks",
    vectors_config={
        "dense": VectorParams(size=512, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()  # for BM25-style retrieval
    },
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,   # 4x memory reduction
            quantile=0.99,
            always_ram=True         # keep the quantized index in RAM
        )
    ),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200, on_disk=False)
)

# Index payload field for pre-filtering
client.create_payload_index(
    collection_name="mortgage-chunks",
    field_name="doc_type",
    field_schema="keyword"
)

# Hybrid search: sparse (BM25-style) + dense (vector) merged with RRF.
# sparse_ids / sparse_weights come from your sparse encoder (e.g., SPLADE).
guideline_filter = Filter(
    must=[FieldCondition(key="doc_type", match=MatchValue(value="guideline"))]
)

results = client.query_points(
    collection_name="mortgage-chunks",
    prefetch=[
        Prefetch(
            query=SparseVector(indices=sparse_ids, values=sparse_weights),
            using="sparse",
            filter=guideline_filter,
            limit=50
        ),
        Prefetch(
            query=query_vector,
            using="dense",
            filter=guideline_filter,
            limit=50
        )
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10
)
```
Production Architecture Patterns
Pattern 1 — Azure Enterprise Stack
Best for: enterprises already on Azure, regulated industries, .NET shops. One managed service replaces vector DB + BM25 + reranker.
Pattern 2 — High-Throughput Open Source Stack
Best for: teams prioritizing throughput, cost efficiency, or on-premise requirements.
Pattern 3 — Relational + Vector (pgvector)
Best for: when the retrieval query requires joining vectors with relational data in the same query.
What We Run at MortgageIQ
Primary: Azure AI Search S2 tier
Why: we're Azure-native, the data sources are SharePoint and Azure SQL (native connectors), compliance requires Azure-region data residency, and the semantic ranker is the decisive precision differentiator for regulatory text retrieval.
Hybrid search in one call was the deciding factor. The alternative was Qdrant + Elasticsearch + a custom RRF merger — three systems to operate, monitor, and keep in sync. Azure AI Search is one system, one SLA, one support contract.
What we'd use if not Azure:
- High throughput / cost-sensitive: Qdrant — nothing else matches it on p99 latency under concurrent load
- Existing PostgreSQL infrastructure: pgvector HNSW with PgBouncer — vectors alongside relational data with zero new operational surface
- Multimodal (text + property images): Weaviate — the module system handles multi-vector per document cleanly
What we'd avoid:
- Pinecone for production RAG without a separate BM25 system — the hybrid gap is too significant for a domain with exact-term queries (loan codes, regulation references)
- pgvector above 5M vectors on shared PostgreSQL — p99 degradation under load is real and hard to fix without dedicated read replicas
Key Takeaways
- Filtering strategy is the most important dimension no one benchmarks — post-filtering silently degrades recall on any query with metadata constraints. Verify pre-filtering behavior before committing to a database.
- Hybrid search is not optional for enterprise RAG — product codes, regulation numbers, and named entities require BM25. Pinecone's lack of native hybrid is its biggest production liability.
- Qdrant wins on raw throughput and p99 latency — Rust + async indexing + quantization produces the lowest latency of any open source or managed option.
- pgvector is the right answer when the query needs a JOIN — vector similarity + relational filter in one SQL query is impossible in any dedicated vector database.
- Azure AI Search is the right answer for Azure-native enterprise teams — hybrid + semantic reranker + native connectors + compliance in one managed service justifies the cost premium.
- p99 matters more than p50 in production — your users experience the slow queries, not the average. pgvector and Pinecone serverless have the worst p99 under concurrent load.