ai-ml · April 19, 2026 · Tags: vector-database, pinecone, pgvector, azure-ai-search, weaviate, qdrant, rag, embeddings, enterprise-ai, production

Vector Database Showdown: Pinecone vs pgvector vs Azure AI Search vs Weaviate vs Qdrant in Production

A production-focused comparison of the five major vector databases — indexing speed, query latency, hybrid search, filtering precision, cost at scale, and enterprise readiness. What we actually run at MortgageIQ and when each database wins.

Most vector database comparisons test on 10,000 vectors with synthetic data and a single query type. Then they publish a bar chart showing latency and call it a benchmark.

Here's what actually matters in production: how does filtering interact with ANN search? What's p99 latency at 10M vectors under concurrent load? What happens when the index needs to be rebuilt? Which databases support hybrid search natively vs as a bolt-on? And what does it cost when you scale from 1M to 100M vectors?

This is a production-focused comparison of the five databases every enterprise RAG team evaluates — Pinecone, pgvector, Azure AI Search, Weaviate, and Qdrant — across the dimensions that determine whether your system works at 3am on a Tuesday when query load spikes.


The Contenders


Architecture — How Each Database Works

Understanding the index architecture explains every performance characteristic downstream.

Pinecone

Pinecone is a purpose-built managed vector database. Vectors are stored in proprietary index structures on Pinecone's infrastructure. Two deployment modes:

  • Serverless — pay per query and storage, no pod sizing. Cold start latency on first query after idle. Best for variable workloads.
  • Pod-based — dedicated compute, predictable latency, no cold start. Required for p99 SLA guarantees.

Index type: proprietary ANN (based on HNSW internals, not publicly documented). Namespaces allow logical partitioning within a single index — useful for multi-tenant isolation without separate indexes.

pgvector

pgvector is a PostgreSQL extension. Vectors are stored as columns in standard PostgreSQL tables alongside all your relational data. Two index types:

  • IVFFlat — Inverted File index, divides vector space into clusters (lists). Fast to build, moderate recall. Requires ANALYZE after bulk inserts.
  • HNSW (added in pgvector 0.5.0) — same graph-based ANN as dedicated vector DBs. Better recall, higher memory, slower to build.

The critical difference from dedicated vector DBs: pgvector runs inside PostgreSQL. This means joins, transactions, and SQL predicates — but also means you're sharing resources with your OLTP workload and fighting for the PostgreSQL buffer pool.

Azure AI Search

Azure AI Search is not a pure vector database — it's a hybrid search platform. It combines:

  • HNSW for approximate vector search
  • Inverted index (BM25) for full-text keyword search
  • Semantic ranker (cross-encoder) for precision reranking
  • Filtering via OData expressions on any indexed field

All of this happens in one service, one API call, one result set merged via RRF (Reciprocal Rank Fusion). For RAG, this is the key advantage — you don't assemble a hybrid pipeline from parts; it's native.

Weaviate

Weaviate is an open source vector database written in Go. Native support for:

  • HNSW index (configurable m, ef)
  • BM25 (built-in, via the BM25 operator)
  • Hybrid search (BM25 + vector, configurable alpha weighting)
  • Multi-tenancy (tenant-per-class or tenant-per-shard)
  • Multimodal vectors (text, image, audio via module system)
  • GraphQL API (plus REST and gRPC)

Weaviate's module system is the differentiator — you can plug in vectorizers (OpenAI, Cohere, HuggingFace) directly into the database, which handles embedding at insert and query time.

Qdrant

Qdrant is an open source vector database written in Rust. Purpose-built for high-throughput vector search:

  • HNSW with advanced quantization (scalar, product, binary)
  • Sparse vector support (for BM25-style retrieval)
  • Native hybrid search via RRF (dense + sparse)
  • Named vectors — multiple vector representations per point
  • Payload filtering with indexed payload fields
  • On-disk HNSW for large indexes that exceed RAM

Qdrant's Rust implementation gives it the lowest memory footprint and highest single-node throughput of any open source vector database.


The Production Dimensions

1. ANN Algorithm and Recall

Notes:

  • Pinecone serverless trades recall for cost — the index is optimized for storage efficiency
  • pgvector HNSW recall matches dedicated databases but requires careful ef_search tuning per query
  • Qdrant achieves highest recall via configurable hnsw_ef and quantization that preserves precision
  • Azure AI Search recall is not configurable — Microsoft manages index parameters; typically 0.95–0.97

Recall is table stakes. What separates production systems is how recall interacts with filtering.


2. Filtering — Pre-filter vs Post-filter (The Most Important Dimension)

This is the dimension most comparisons get wrong. Filtering strategy determines retrieval correctness under metadata constraints — not raw ANN recall.

Post-filtering runs ANN over all vectors, then discards results that don't match the filter. If only 1% of your corpus matches the filter, you need to retrieve 1,000 candidates to get 10 valid results — or you miss relevant documents entirely.

Pre-filtering applies the metadata filter first, reducing the search space, then runs ANN within that filtered set. Correct results, but requires indexed payload fields and index structures that support filtered ANN.
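The over-fetch arithmetic is easy to demonstrate. A minimal simulation (synthetic scores standing in for ANN similarity, a random flag standing in for the metadata filter) shows how deep a post-filtering search must scan at 1% selectivity:

```python
import random

random.seed(0)
N, K, SELECTIVITY = 100_000, 10, 0.01

# Synthetic corpus: a similarity score (stand-in for ANN ranking) plus a
# metadata flag that matches the filter for ~1% of documents.
corpus = [(random.random(), random.random() < SELECTIVITY) for _ in range(N)]
ranked = sorted(corpus, key=lambda doc: doc[0], reverse=True)

def depth_for_k_matches(ranked, k):
    """How deep a post-filtering search must scan to collect k filtered hits."""
    found = 0
    for depth, (_, matches_filter) in enumerate(ranked, start=1):
        found += matches_filter
        if found == k:
            return depth
    return len(ranked)

depth = depth_for_k_matches(ranked, K)
print(f"Scanned {depth} candidates to find {K} matching results")
# Expectation: roughly K / SELECTIVITY = 1,000 candidates for a top-10.
```

Any ANN index configured to retrieve only 50–100 candidates will silently return fewer than 10 valid results — or none — for exactly these queries.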

| Database | Filtering strategy | Production impact |
|---|---|---|
| Pinecone | Metadata filter applied post-ANN (serverless); pre-filter on pods with metadata index | Serverless struggles on high-selectivity filters; pod-based is correct |
| pgvector | SQL WHERE clause — true pre-filter via PostgreSQL query planner | Correct pre-filtering via SQL; query planner may choose seq scan over index for small filtered sets |
| Azure AI Search | OData $filter applied as pre-filter before HNSW | True pre-filter, fast; indexed fields mandatory |
| Weaviate | where filter — pre-filter via ACORN algorithm (v1.18+) | Pre-filter with ACORN; older versions post-filter |
| Qdrant | Payload filter — pre-filter with indexed payload fields | True pre-filter when payload fields are indexed; use create_payload_index |

Production rule: always index your filter fields and verify pre-filtering behavior. A vector database with post-filtering is silently wrong for any query that uses metadata filters — the most common RAG query pattern.


3. Hybrid Search Support

| Database | BM25 | Vector | Hybrid native | Fusion |
|---|---|---|---|---|
| Pinecone | ✗ | ✓ | ✗ — external BM25 required | Manual |
| pgvector | Via pg_search (ParadeDB) | ✓ | Partial — separate queries + manual merge | Manual RRF |
| Azure AI Search | ✓ Native | ✓ Native | ✓ Native — one API call | RRF automatic |
| Weaviate | ✓ Native BM25 | ✓ Native | ✓ Native — alpha weight control | Weighted fusion |
| Qdrant | Via sparse vectors (SPLADE/BM25) | ✓ Native | ✓ Native — RRF (v1.7+) | RRF automatic |

Pinecone's hybrid gap is significant. For production RAG, you need BM25 alongside vector search (loan codes, product IDs, named entities require exact term matching). With Pinecone, you run Elasticsearch or OpenSearch separately, merge results manually. That's two systems to operate, two failure domains, and latency from two network hops.
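The "manual merge" is small but it is still code you own, test, and debug at 3am. A sketch of the standard RRF fusion you would write (doc IDs are hypothetical; k=60 is the conventional RRF constant):

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25_hits   = ["loan-442", "faq-17", "guide-03"]   # exact-term matches
vector_hits = ["guide-03", "loan-442", "guide-11"] # semantic matches
merged = rrf_merge([bm25_hits, vector_hits])
print(merged)  # documents found by both rankers rise to the top
```

The fusion itself is ten lines; the operational cost is everything around it — two clients, two timeouts, two retry policies, and keeping both indexes in sync.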


4. Latency at Scale

Test configuration: 10M vectors, 1536-dim, single metadata filter, top-10 results, p50/p99 under 100 concurrent queries.

Key observations:

  • Qdrant — lowest latency across both p50 and p99. Rust + SIMD vectorization + zero-copy memory design. p99 stays under 50ms even under concurrency.
  • pgvector — p99 degrades significantly under concurrent load. PostgreSQL's connection model and shared buffer pool become bottlenecks at high concurrency. Use PgBouncer connection pooling and dedicated read replicas for production.
  • Pinecone pod — consistent p50/p99 gap (good). Cold start on serverless adds 200–800ms to first query after idle period.
  • Azure AI Search — p99 includes BM25 + HNSW + semantic reranker in one call. Comparable p99 to dedicated vector DBs for the full hybrid pipeline.
  • Weaviate — Go GC pauses can spike p99. Tune GOGC and use dedicated memory for HNSW index to minimize GC impact.
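Before trusting anyone's latency chart (including this one), measure p50/p99 yourself. A minimal harness — simulated latencies stand in for wall-clock timings of real queries against your own corpus:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the ceil(p/100 * n)-th smallest sample."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

random.seed(1)
# Simulated latencies (ms): 985 fast queries plus a 1.5% slow tail.
# In production, replace this with timings of real concurrent queries.
latencies = [random.gauss(20, 5) for _ in range(985)] + \
            [random.uniform(100, 300) for _ in range(15)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.1f}ms  p99={p99:.1f}ms")  # the tail sets p99; the mean hides it
```

Note how a 1.5% slow tail leaves p50 untouched while pushing p99 an order of magnitude higher — this is exactly the pattern concurrent load produces on pgvector and serverless Pinecone.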

5. Indexing Speed and Index Rebuild

Production RAG systems re-index continuously — new documents, updated guidelines, CDC from SQL. Indexing throughput and the behavior during re-indexing matter as much as query performance.

| Database | Indexing throughput | Rebuild behavior | Live re-index? |
|---|---|---|---|
| Pinecone | ~500 vectors/sec (serverless) / ~2K/sec (pod) | No rebuild needed — upsert by ID | ✓ Live upsert, no downtime |
| pgvector IVFFlat | Fast bulk insert, but needs ANALYZE + VACUUM | Requires full rebuild for lists change | ✓ Live insert; index degrades without maintenance |
| pgvector HNSW | Slow — O(n log n) build | Full rebuild required for param changes | ✓ Live insert, no rebuild for new rows |
| Azure AI Search | ~1K docs/sec (indexer), batch API faster | Index updates are incremental | ✓ Live — indexer merges changes |
| Weaviate | ~3K vectors/sec (batch import) | HNSW rebuild on schema change | ✓ Live batch import |
| Qdrant | ~5K vectors/sec (batch), async indexing | Background HNSW optimization | ✓ Live — segments merge in background |

Qdrant's async indexing is critical for production: vectors are inserted immediately into a flat index (instant, 100% recall), then background HNSW optimization runs on segments. Queries always return results — even during heavy insert load — because the flat index is always current. Other databases can return stale or incomplete results if queried during index build.

pgvector IVFFlat degradation is the most common production issue: bulk inserts without VACUUM ANALYZE cause IVF cluster centroids to drift, reducing recall silently. Automate nightly VACUUM ANALYZE and monitor recall with a synthetic test set.
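The synthetic recall check is a few lines: keep a fixed set of probe queries, compute the exact top-k with a brute-force scan, and compare against what the index returns. A sketch in pure Python — `index_search` is a hypothetical placeholder for your real pgvector (or any other) query path:

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k that the index actually returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

def brute_force_top_k(query, corpus, k):
    """Exact top-k by dot product (assumes unit-normalized embeddings)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    ranked = sorted(corpus, key=lambda item: dot(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def check_recall(probe_queries, corpus, index_search, k=10, threshold=0.95):
    """Nightly job body: alert when the live index drifts below threshold.
    `index_search(query, k)` is a placeholder for the real index query."""
    recalls = [
        recall_at_k(brute_force_top_k(q, corpus, k), index_search(q, k), k)
        for q in probe_queries
    ]
    return sum(recalls) / len(recalls) >= threshold

# Toy demo — in production, sample a few hundred real chunks as the corpus.
corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
print(brute_force_top_k([1.0, 0.1], corpus, k=2))  # ['a', 'c']
```

Brute force over a few hundred sampled chunks takes milliseconds, so the check is cheap to run nightly; what matters is that the probe set and corpus sample stay fixed so the metric is comparable across runs.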


6. Multi-Tenancy

Enterprise RAG almost always requires multi-tenancy — different customers, business units, or user groups with isolated data.

  • Pinecone namespaces — logical partitioning within one index. Fast namespace switch, shared compute. Data is not cryptographically isolated — metadata filter bypass risk exists.
  • Azure AI Search + Weaviate — index-per-tenant is the enterprise pattern. Separate indexes, separate access keys, true isolation. Higher operational overhead but required for regulated industries (HIPAA, PCI).
  • Qdrant collections — collection-per-tenant with payload filtering. gRPC API supports efficient tenant switching.
  • pgvector — row-level security (RLS) in PostgreSQL is the isolation mechanism. Correct when implemented, but RLS misconfiguration is a common vulnerability. Requires security audit.

7. Cost at Scale

Monthly cost estimate — 10M vectors, 1536-dim, 1M queries/month:

| Database | Infrastructure | Estimated monthly cost | Notes |
|---|---|---|---|
| Pinecone Serverless | Managed | ~$120–180 | Storage + query units |
| Pinecone Pod (s1.x1) | Managed | ~$700 | Dedicated pod, predictable |
| pgvector | AWS r6g.2xlarge RDS | ~$350 | Multi-AZ, includes storage |
| Azure AI Search S2 | Managed | ~$500 | Includes BM25 + semantic ranker |
| Weaviate Cloud | Managed | ~$400–600 | Depends on node size |
| Qdrant Cloud | Managed | ~$200–350 | 1x4GB node |
| Qdrant Self-hosted | 8-core 32GB VM | ~$120–180 | Operational overhead |
| Weaviate Self-hosted | 8-core 32GB VM | ~$120–180 | Operational overhead |

Hidden costs to factor in:

  • Pinecone: re-embedding on index migration (no index export), cold start mitigation (keep-alive pings)
  • pgvector: DBA time for vacuum/analyze automation, connection pooler setup, replica lag monitoring
  • Azure AI Search: semantic ranker is an add-on tier — S1 doesn't include it; S2+ required
  • Self-hosted (Qdrant/Weaviate): operational overhead — backup automation, monitoring, on-call, upgrades

8. Enterprise Features

| Feature | Pinecone | pgvector | Azure AI Search | Weaviate | Qdrant |
|---|---|---|---|---|---|
| SOC 2 Type II | ✓ | Via RDS/Cloud SQL | ✓ | ✓ Cloud | ✓ Cloud |
| HIPAA | ✓ Enterprise | Via compliant PG host | ✓ | ✓ Enterprise | Self-hosted |
| Private networking | ✓ VPC | ✓ VPC/Private Link | ✓ Private Link | ✓ Cloud | ✓ Cloud |
| RBAC | ✓ API key scopes | PostgreSQL roles | ✓ Azure RBAC + AAD | API keys | API keys |
| Backup / PITR | ✓ Managed | ✓ RDS automated | ✓ Geo-redundant | ✓ Cloud | Manual / Cloud |
| Monitoring | ✓ Dashboard | PostgreSQL metrics | ✓ Azure Monitor | ✓ Prometheus | ✓ Prometheus |
| SLA uptime | 99.95% | Per cloud provider | 99.9% | 99.9% Cloud | Per cloud |

Full Comparison Matrix

| Dimension | Pinecone | pgvector | Azure AI Search | Weaviate | Qdrant |
|---|---|---|---|---|---|
| Index type | Proprietary ANN | IVFFlat / HNSW | HNSW | HNSW | HNSW + quantization |
| Hybrid search | ✗ External BM25 | Partial (ParadeDB) | ✓ Native | ✓ Native | ✓ Native |
| Pre-filtering | Pod only | ✓ SQL WHERE | ✓ OData | ✓ ACORN (v1.18+) | ✓ Payload index |
| Multimodal | ✗ | ✗ | ✓ Vision integration | ✓ Module system | Partial |
| Multi-tenancy | Namespaces | RLS | Index-per-tenant | Index-per-tenant | Collections |
| Managed option | ✓ | ✓ (RDS/Cloud SQL) | ✓ | ✓ Cloud | ✓ Cloud |
| Self-hosted | ✗ | ✓ | ✗ | ✓ | ✓ |
| p50 latency | 15–25ms | 30–60ms | 30–50ms | 20–35ms | 8–15ms |
| p99 under load | 60–80ms | 150–300ms | 100–140ms | 80–110ms | 35–55ms |
| Index throughput | ~500–2K/sec | ~1–2K/sec | ~1K/sec | ~3K/sec | ~5K/sec |
| Live re-index | ✓ Upsert | ✓ Insert | ✓ Incremental | ✓ Batch | ✓ Async segments |
| Cost at 10M vectors | $120–700/mo | $350/mo | $500/mo | $180–600/mo | $120–350/mo |
| Best for | Quick start, serverless | Existing PG stack | Azure enterprise | Multimodal, OSS | Max throughput, OSS |

When Each Database Wins

Choose Pinecone when:

  • Your team has no infrastructure engineers and needs zero operational overhead
  • You're prototyping or in early production and need to move fast
  • Your workload is variable — serverless scales to zero and you pay per query
  • You don't need hybrid search (or you already run Elasticsearch separately)

Watch out for: serverless cold starts in latency-sensitive paths, no hybrid search natively, no self-hosted option (vendor lock-in), cost unpredictability at query volume.


Choose pgvector when:

  • You already run PostgreSQL and vectors belong with the relational data
  • Your queries frequently join vector search with relational conditions (WHERE loan_status = 'active' AND vector_similarity > 0.8)
  • Your team knows SQL and PostgreSQL operations — no new system to learn
  • Data volume is under 5M vectors (pgvector starts straining above this on single-node)

Watch out for: p99 degradation under concurrent load, HNSW build time for large datasets, sharing buffer pool with OLTP workload, manual VACUUM ANALYZE discipline required.

# pgvector — create table and HNSW index
import os

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        content TEXT,
        doc_id TEXT,
        doc_type TEXT,
        embedding vector(512),
        created_at TIMESTAMPTZ DEFAULT now()
    )
""")

# HNSW index — better recall than IVFFlat, slower to build
cur.execute("""
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200)
""")

# Index metadata for pre-filtering
cur.execute("CREATE INDEX IF NOT EXISTS chunks_doc_type ON chunks (doc_type)")
conn.commit()

# Raise ef_search BEFORE querying — session-level; higher = better recall, slower
cur.execute("SET hnsw.ef_search = 100")

# Query — pre-filter via SQL WHERE, then ANN.
# query_vector is the embedding serialized as a pgvector literal, e.g. '[0.1,0.2,...]'
cur.execute("""
    SELECT id, content, doc_id,
           1 - (embedding <=> %s::vector) AS similarity
    FROM chunks
    WHERE doc_type = 'guideline'           -- pre-filter
    ORDER BY embedding <=> %s::vector      -- ANN
    LIMIT 10
""", (query_vector, query_vector))
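In front of this, the concurrency fix is PgBouncer in transaction pooling mode. A hypothetical `pgbouncer.ini` fragment — host, names, and pool sizes are illustrative, not tuned recommendations:

```ini
[databases]
ragdb = host=10.0.0.5 port=5432 dbname=ragdb

[pgbouncer]
pool_mode = transaction      ; reuse server connections between transactions
max_client_conn = 1000       ; app-side connections PgBouncer will accept
default_pool_size = 20       ; actual PostgreSQL connections per database
```

One interaction to watch: under transaction pooling, session state does not survive across transactions, so a session-level `SET hnsw.ef_search` can land on a different backend than your query. Use `SET LOCAL hnsw.ef_search = 100` inside the same transaction as the SELECT instead.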

Choose Azure AI Search when:

  • You're building on Azure and need hybrid search + semantic reranking in one service
  • Compliance requires data residency within Azure regions (HIPAA, SOC 2, FedRAMP)
  • Your stack is .NET / C# — Semantic Kernel integration is native
  • You're indexing SharePoint, Azure SQL, Cosmos DB — native connectors exist
  • You need enterprise RBAC via Azure Active Directory

Watch out for: semantic ranker requires S2+ tier (significant cost jump), no self-hosted option, index schema changes require index rebuild.

# Azure AI Search — full hybrid pipeline
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery, QueryType, QueryCaptionType

client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="mortgage-chunks",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

# query is the user's text; query_vector is its embedding (computed client-side)
results = client.search(
    search_text=query,                              # BM25
    vector_queries=[VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=50,
        fields="content_vector"
    )],                                             # HNSW
    query_type=QueryType.SEMANTIC,                  # semantic reranker
    semantic_configuration_name="default",
    query_caption=QueryCaptionType.EXTRACTIVE,
    filter="doc_type eq 'guideline' and doc_version eq '2026-Q1'",  # pre-filter
    top=5
)

Choose Weaviate when:

  • You need multimodal search — text + images in the same index
  • You want the vectorizer to live inside the database (auto-embedding at insert)
  • GraphQL API fits your frontend/API layer
  • You need hybrid search with tunable BM25/vector weighting (alpha parameter)

Watch out for: Go GC pauses at high throughput (tune GOGC), GraphQL overhead for simple queries (use gRPC instead), schema migrations require careful planning.

# Weaviate — hybrid search with alpha weighting
import weaviate
from weaviate.classes.query import Filter, HybridFusion, MetadataQuery

client = weaviate.connect_to_local()
collection = client.collections.get("MortgageChunks")

# Hybrid search with tunable alpha (0 = pure BM25, 1 = pure vector).
# Assumes the collection has a vectorizer module configured so the query
# text is embedded server-side; otherwise pass `vector=` explicitly.
results = collection.query.hybrid(
    query="FHA loan DTI limit compensating factors",
    alpha=0.7,                          # 70% vector, 30% BM25
    fusion_type=HybridFusion.RELATIVE_SCORE,
    filters=Filter.by_property("doc_type").equal("guideline"),
    limit=10,
    return_metadata=MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f} | {obj.properties['doc_title']}")

client.close()
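The GC caveat above translates to environment configuration on the Weaviate container. A sketch of a docker-compose fragment — the values are illustrative and depend on node size, not tuned recommendations (GOGC and GOMEMLIMIT are standard Go runtime variables; LIMIT_RESOURCES is a Weaviate setting):

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.1
    environment:
      GOGC: "100"           # GC target %; raising it trades RAM for fewer GC cycles
      GOMEMLIMIT: "24GiB"   # soft memory limit so GC backs off before the OOM killer
      LIMIT_RESOURCES: "true"
```

The point of GOMEMLIMIT is to leave enough headroom for the HNSW index to stay resident while the garbage collector works against a known ceiling instead of the container limit.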

Choose Qdrant when:

  • Throughput and latency are the primary constraints
  • You need sparse + dense hybrid search with RRF in a single database
  • You need quantization to reduce memory footprint at scale (binary quantization = 32x memory reduction)
  • You're building on-premise or air-gapped (Rust binary, no JVM/GC)
  • Cost efficiency matters — highest performance per dollar of any open source option
# Qdrant — quantized collection + hybrid search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, SparseVectorParams, SparseVector,
    Prefetch, FusionQuery, Fusion, Filter, FieldCondition, MatchValue,
    HnswConfigDiff, ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

client = QdrantClient("localhost", port=6333)

# Create collection with scalar quantization (4x memory reduction, ~2% recall drop)
client.create_collection(
    collection_name="mortgage-chunks",
    vectors_config={
        "dense": VectorParams(size=512, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()   # for BM25-style retrieval
    },
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,        # 4x memory reduction
            quantile=0.99,
            always_ram=True              # keep quantized index in RAM
        )
    ),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200, on_disk=False)
)

# Index payload field for pre-filtering
client.create_payload_index(
    collection_name="mortgage-chunks",
    field_name="doc_type",
    field_schema="keyword"
)

# Hybrid search: sparse (BM25-style) + dense merged with RRF.
# sparse_ids/sparse_weights come from your sparse encoder (e.g. SPLADE);
# query_vector is the dense embedding.
guideline_filter = Filter(
    must=[FieldCondition(key="doc_type", match=MatchValue(value="guideline"))]
)

results = client.query_points(
    collection_name="mortgage-chunks",
    prefetch=[
        Prefetch(
            query=SparseVector(indices=sparse_ids, values=sparse_weights),
            using="sparse",
            filter=guideline_filter,
            limit=50
        ),
        Prefetch(
            query=query_vector,
            using="dense",
            filter=guideline_filter,
            limit=50
        )
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10
)
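The quantization claims are just arithmetic. Per-vector memory for a 1536-dim embedding (assuming float32 for the unquantized baseline; quantized collections typically keep the originals on disk for rescoring):

```python
DIMS = 1536            # e.g. a typical OpenAI embedding dimension
FLOAT32_BYTES = 4

full = DIMS * FLOAT32_BYTES    # 6144 bytes/vector
int8 = DIMS * 1                # scalar quantization: 1 byte/dim -> 4x smaller
binary = DIMS // 8             # binary quantization: 1 bit/dim -> 32x smaller

for label, size in [("float32", full), ("int8", int8), ("binary", binary)]:
    at_10m = size * 10_000_000 / 2**30
    print(f"{label:>7}: {size:>5} B/vector  ~ {at_10m:5.1f} GiB at 10M vectors")
```

At 10M vectors the difference is a node size class: roughly 57 GiB of RAM for float32 versus under 2 GiB for binary-quantized vectors, which is why quantization shows up directly in the cost table above.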

Production Architecture Patterns

Pattern 1 — Azure Enterprise Stack

Best for: enterprises already on Azure, regulated industries, .NET shops. One managed service replaces vector DB + BM25 + reranker.


Pattern 2 — High-Throughput Open Source Stack

Best for: teams prioritizing throughput, cost efficiency, or on-premise requirements.


Pattern 3 — Relational + Vector (pgvector)

Best for: when the retrieval query requires joining vectors with relational data in the same query.


What We Run at MortgageIQ

Primary: Azure AI Search S2 tier

Why: we're Azure-native, the data sources are SharePoint and Azure SQL (native connectors), compliance requires Azure-region data residency, and the semantic ranker is the decisive precision differentiator for regulatory text retrieval.

The hybrid in one call was the decision maker. The alternative was Qdrant + Elasticsearch + a custom RRF merger — three systems to operate, monitor, and keep in sync. Azure AI Search is one system, one SLA, one support contract.

What we'd use if not Azure:

  • High throughput / cost-sensitive: Qdrant — nothing else matches it on p99 latency under concurrent load
  • Existing PostgreSQL infrastructure: pgvector HNSW with PgBouncer — vectors alongside relational data with zero new operational surface
  • Multimodal (text + property images): Weaviate — the module system handles multi-vector per document cleanly

What we'd avoid:

  • Pinecone for production RAG without a separate BM25 system — the hybrid gap is too significant for a domain with exact-term queries (loan codes, regulation references)
  • pgvector above 5M vectors on shared PostgreSQL — p99 degradation under load is real and hard to fix without dedicated read replicas

Key Takeaways

  • Filtering strategy is the most important dimension no one benchmarks — post-filtering silently degrades recall on any query with metadata constraints. Verify pre-filtering behavior before committing to a database.
  • Hybrid search is not optional for enterprise RAG — product codes, regulation numbers, and named entities require BM25. Pinecone's lack of native hybrid is its biggest production liability.
  • Qdrant wins on raw throughput and p99 latency — Rust + async indexing + quantization produces the lowest latency of any open source or managed option.
  • pgvector is the right answer when the query needs a JOIN — vector similarity + relational filter in one SQL query is impossible in any dedicated vector database.
  • Azure AI Search is the right answer for Azure-native enterprise teams — hybrid + semantic reranker + native connectors + compliance in one managed service justifies the cost premium.
  • p99 matters more than p50 in production — your users experience the slow queries, not the average. pgvector and Pinecone serverless have the worst p99 under concurrent load.