Scaling RAG to 100K+ Documents Without Lag

Aug 25, 2025

Giga Team

Retrieval-Augmented Generation (RAG) is the bridge between large language models and the fresh, domain-specific data they need to deliver accurate answers. It’s the reason an AI can answer a question about your latest compliance policy or sales report, even if those documents were added minutes ago.

The challenge comes when the corpus grows into the hundreds of thousands or millions of documents. At that scale, retrieval latency becomes the difference between a seamless, human-like interaction and an experience that feels sluggish or broken.

This article breaks down how to architect, optimize, and operate a RAG pipeline that scales to massive corpora without lag, using real-world benchmarks, architectural patterns, and techniques applied by enterprise teams in production.

1. The Scaling Challenge

Scaling RAG is not just about adding more data. It’s about keeping latency low across the entire pipeline:

  • Embedding search latency: At scale, brute-force search is impractical. Approximate Nearest Neighbor (ANN) algorithms like HNSW and IVF+PQ make large-scale vector search feasible, but require tuning to balance recall, memory footprint, and speed.

  • Context assembly overhead: Fetching, re-ranking, and merging retrieved chunks into a prompt can be a hidden bottleneck, especially with long contexts or complex ranking logic.

  • LLM processing time: Even perfect retrieval loses its impact if the Time-to-First-Token (TTFT) is high. Token processing costs grow with context size, so inference optimizations matter as much as retrieval speed.

The takeaway: performance tuning must be holistic. Gains in one stage can be erased if another stage lags.
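
As a concrete starting point, the sketch below instruments each stage of a single RAG request so you can see where the milliseconds actually go. The `retriever`, `reranker`, and `llm` objects are hypothetical placeholders for whatever components your pipeline uses.

```python
import time
from contextlib import contextmanager

# Collected per-stage timings for one request, in milliseconds.
timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

def answer(query, retriever, reranker, llm):
    # `retriever`, `reranker`, and `llm` are hypothetical placeholders for
    # your own components; the point is to time every stage, not just search.
    with timed("vector_search"):
        candidates = retriever.search(query, top_k=50)
    with timed("context_assembly"):
        context = reranker.rerank(query, candidates, top_k=8)
    with timed("llm_generation"):
        response = llm.generate(query, context=context)
    print({stage: f"{ms:.1f} ms" for stage, ms in timings.items()})
    return response
```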

2. Architecting for Speed

Choosing the Right Vector Database

Your vector database is the foundation of your RAG latency profile. Performance varies widely between systems, both in P99 latency and queries per second (QPS).

Key insights from the benchmarks (consolidated in the table below):

  • ZillizCloud: 2.5 ms P99 latency and ~9,700 QPS — strong for high-throughput, low-latency needs.

  • Milvus: 2.2 ms latency with ~3,465 QPS — competitive on latency, slightly lower on throughput.

  • OpenSearch (force merge): Higher latency (~7.2 ms) but still competitive QPS for certain workloads.

  • Pinecone, Qdrant, OpenSearch (standard): Trade-offs in raw speed for operational simplicity or ecosystem integration.

Streaming Performance Matters

Static benchmarks only tell half the story. Many real-world RAG applications run on constantly updated corpora, so the database must handle concurrent reads and writes without collapsing under ingestion pressure.

Highlights:

  • ZillizCloud shows strong resilience, maintaining ~1,860 QPS at 1,000 rows/s ingestion.

  • Qdrant degrades minimally under ingestion.

  • Pinecone and OpenSearch experience sharper drops as ingestion rates increase.

Consolidated Benchmark Table


| Database (Config)        | P99 Latency (ms) | QPS (1M, static) | QPS (10M, static) | QPS (500 rows/s) | QPS (1,000 rows/s) |
|--------------------------|------------------|------------------|-------------------|------------------|--------------------|
| ZillizCloud-8cu-perf     | 2.5              | 9,704            | 3,957             | 2,119            | 1,860              |
| Milvus-16c64g-sq8        | 2.2              | 3,465            | 437               | 306              | 156                |
| OpenSearch-16c128g-force | 7.2              | 3,055            | n/a               | n/a              | n/a                |
| QdrantCloud-16c64g       | 6.4              | 1,242            | 447               | 394              | 348                |
| Pinecone-p2.x8-1node     | 13.7             | 1,147            | 1,131             | 367              | 370                |
| OpenSearch-16c128g       | 13.2             | 951              | 506               | 162              | 150                |
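
To see how your own deployment behaves under ingestion pressure like this, a simple load test that streams writes while measuring query latency is usually enough. The sketch below assumes a hypothetical `client` exposing `insert(rows)` and `search(vector, top_k)` methods; adapt it to your database's SDK.

```python
import statistics
import threading
import time

def writer(client, rows_iter, rows_per_sec, stop):
    # Stream rows into the index at a fixed rate.
    interval = 1.0 / rows_per_sec
    for row in rows_iter:
        if stop.is_set():
            break
        client.insert([row])
        time.sleep(interval)

def reader(client, query_vectors, latencies, stop):
    # Issue queries in a loop and record per-query latency in seconds.
    i = 0
    while not stop.is_set():
        q = query_vectors[i % len(query_vectors)]
        start = time.perf_counter()
        client.search(q, top_k=10)
        latencies.append(time.perf_counter() - start)
        i += 1

def run_load_test(client, rows_iter, query_vectors,
                  rows_per_sec=1000, readers=8, duration=60):
    stop = threading.Event()
    latencies = []
    threads = [threading.Thread(target=writer,
                                args=(client, rows_iter, rows_per_sec, stop))]
    threads += [threading.Thread(target=reader,
                                 args=(client, query_vectors, latencies, stop))
                for _ in range(readers)]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    qps = len(latencies) / duration
    p99 = statistics.quantiles(latencies, n=100)[98] * 1000
    print(f"QPS under ingestion: {qps:.0f}, P99: {p99:.1f} ms")
```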

3. ANN Algorithm Choice

  • HNSW: Excellent speed-recall trade-off, versatile across datasets, but memory-heavy.

  • IVF+PQ: Lower memory footprint, fast for certain distributions, but more sensitive to parameter tuning.

The choice depends on embedding dimensionality, recall tolerance, and available hardware.
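
For illustration, here is how both index types can be built with FAISS (used here only as an example library; the same knobs exist under different names in most ANN implementations):

```python
import numpy as np
import faiss

d = 768                                              # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # placeholder corpus embeddings

# HNSW: graph-based index, strong speed/recall trade-off, higher memory use.
hnsw = faiss.IndexHNSWFlat(d, 32)       # 32 = graph connectivity (M)
hnsw.hnsw.efConstruction = 200          # build-time quality
hnsw.hnsw.efSearch = 64                 # query-time recall/speed knob
hnsw.add(xb)

# IVF+PQ: coarse quantizer + product quantization, much smaller memory footprint.
nlist, m, nbits = 1024, 64, 8           # clusters, sub-quantizers, bits per code
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
ivfpq.train(xb)                         # IVF/PQ requires a training pass
ivfpq.add(xb)
ivfpq.nprobe = 16                       # clusters scanned per query

xq = np.random.rand(5, d).astype("float32")
D_hnsw, I_hnsw = hnsw.search(xq, 10)    # top-10 neighbors per query
D_ivf, I_ivf = ivfpq.search(xq, 10)
```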

4. Scaling Patterns

Sharding and Replication

Sharding horizontally partitions the index for parallel queries. Replication provides fault tolerance and extra read throughput.
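
A minimal fan-out/merge sketch, assuming each shard exposes a `search(query_vector, top_k)` call returning `(doc_id, score)` pairs (hypothetical names, not a specific database API):

```python
from concurrent.futures import ThreadPoolExecutor

def sharded_search(shards, query_vector, top_k=10):
    # Fan the query out to every shard in parallel, then merge by score.
    # Each shard is assumed to expose search(query_vector, top_k) returning
    # (doc_id, score) pairs; adapt to your client's API.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(shard.search, query_vector, top_k) for shard in shards]
        hits = [hit for future in futures for hit in future.result()]
    hits.sort(key=lambda hit: hit[1], reverse=True)   # higher score = more similar
    return hits[:top_k]
```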

Caching

Apply caching at several levels (a combined sketch follows this list):

  • Query cache for repeated queries

  • Chunk cache for popular documents

  • Embedding cache if generating embeddings on the fly
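
A minimal sketch combining a query cache and an embedding cache; `embed` and `vector_search` are hypothetical stand-ins for your embedding model and vector database calls:

```python
import hashlib
from functools import lru_cache

# `embed` and `vector_search` are hypothetical stand-ins for your embedding
# model call and your vector-database query.

@lru_cache(maxsize=50_000)
def cached_embedding(text: str):
    # Embedding cache: avoid re-encoding text seen before.
    return tuple(embed(text))

_query_cache = {}  # query cache: maps a normalized query to full retrieval results

def retrieve(query: str, top_k: int = 10):
    key = hashlib.sha256(f"{query.strip().lower()}|{top_k}".encode()).hexdigest()
    if key in _query_cache:
        return _query_cache[key]        # repeated query: no vector search at all
    results = vector_search(list(cached_embedding(query)), top_k=top_k)
    _query_cache[key] = results         # in production, use a TTL/LRU store such as Redis
    return results
```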

Hybrid Retrieval

Blend semantic vector search with BM25 keyword search for precise and conceptually relevant results.
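
One common way to blend the two is Reciprocal Rank Fusion (RRF), which merges ranked lists without having to calibrate scores across systems. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    # Each ranking is an ordered list of document IDs from one retriever
    # (e.g. BM25, vector search). Documents ranked highly anywhere rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

# Example: results from BM25 and from the vector index for the same query.
bm25_hits = ["doc7", "doc2", "doc9", "doc4"]
vector_hits = ["doc2", "doc5", "doc7", "doc1"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```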

Pre-filtering

Filter on metadata before vector search to shrink the search space and improve speed.
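
A sketch of what this looks like with a hypothetical client whose `search()` accepts a metadata filter; most vector databases expose an equivalent parameter, though the exact filter syntax differs:

```python
def filtered_search(client, query_vector, top_k=10):
    # `client` is a hypothetical vector-DB client; the filter is evaluated
    # before or during ANN traversal, shrinking the candidate set.
    return client.search(
        vector=query_vector,
        top_k=top_k,
        filter={
            "department": "compliance",
            "updated_at": {"$gte": "2025-01-01"},
        },
    )
```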


5. Optimizing for the Last Millisecond

  • Parallel search across shards or threads

  • Parallel LLM inference across GPUs/nodes

  • TTFT optimization with streamlined pipelines, early exit strategies, and speculative decoding

These techniques stack to create noticeable responsiveness improvements in interactive applications.
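
Whatever inference optimizations you apply, measure TTFT explicitly rather than only end-to-end latency. A minimal sketch, assuming a hypothetical `stream_completion` generator that yields tokens as they arrive:

```python
import time

def generate_with_ttft(prompt):
    # `stream_completion` is a hypothetical streaming LLM client; adapt to
    # your provider's streaming API.
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()   # first token observed
        tokens.append(token)
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else total
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
    return "".join(tokens)
```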


6. Business Impact

  • User experience: Sub-second responses build trust and keep users engaged.

  • Operational efficiency: High QPS with low latency can mean fewer servers for the same load.

  • Real-world wins:

    • Morningstar + Weaviate: Low-latency search over financial data for faster insights.

    • Rubrik + Pinecone: Secure, low-latency search over billions of vectors for real-time data access.


7. Key Lessons

  1. Latency is a pipeline problem, not just a DB problem.

  2. Hybrid retrieval improves recall and relevance without excessive cost.

  3. Streaming performance is critical for live-data RAG.

  4. Match open-source or proprietary systems to your operational needs and in-house skills.

  5. Always measure and optimize TTFT alongside retrieval latency.

Let's Build Your Next Agent

Copyright © 2025 Giga AI Inc. All rights reserved.