Scaling RAG to 100K+ Documents Without Lag
Aug 25, 2025
Giga Team
Retrieval-Augmented Generation (RAG) is the bridge between large language models and the fresh, domain-specific data they need to deliver accurate answers. It’s the reason an AI can answer a question about your latest compliance policy or sales report, even if those documents were added minutes ago.
The challenge comes when the corpus grows into the hundreds of thousands or millions of documents. At that scale, retrieval latency becomes the difference between a seamless, human-like interaction and an experience that feels sluggish or broken.
This article breaks down how to architect, optimize, and operate a RAG pipeline that scales to massive corpora without lag, using real-world benchmarks, architectural patterns, and techniques applied by enterprise teams in production.
1. The Scaling Challenge
Scaling RAG is not just about adding more data. It’s about keeping latency low across the entire pipeline:
Embedding search latency: At scale, brute-force search is impractical. Approximate Nearest Neighbor (ANN) algorithms like HNSW and IVF+PQ make large-scale vector search feasible, but require tuning to balance recall, memory footprint, and speed.
Context assembly overhead: Fetching, re-ranking, and merging retrieved chunks into a prompt can be a hidden bottleneck, especially with long contexts or complex ranking logic.
LLM processing time: Even perfect retrieval loses its impact if the Time-to-First-Token (TTFT) is high. Token processing costs grow with context size, so inference optimizations matter as much as retrieval speed.
The takeaway: performance tuning must be holistic. Gains in one stage can be erased if another stage lags.
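Before tuning anything, it helps to see where the milliseconds actually go. Below is a minimal sketch of per-stage timing; the `retriever`, `reranker`, and `llm` objects are hypothetical placeholders for whatever components your pipeline uses.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

def answer(query, retriever, reranker, llm):
    """End-to-end RAG call that returns a per-stage latency breakdown."""
    timings = {}
    with timed("retrieval", timings):
        chunks = retriever.search(query, top_k=50)    # ANN search
    with timed("context_assembly", timings):
        context = reranker.rerank(query, chunks)[:8]  # re-rank, trim to context budget
    with timed("generation", timings):
        response = llm.generate(query, context)       # TTFT + decoding
    return response, timings                          # inspect timings per request
```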
2. Architecting for Speed
Choosing the Right Vector Database
Your vector database is the foundation of your RAG latency profile. Performance varies widely between systems, both in P99 latency and queries per second (QPS).

Key insights from the benchmark data (consolidated in the table below):
ZillizCloud: 2.5 ms P99 latency and ~9,700 QPS, strong for high-throughput, low-latency needs.
Milvus: 2.2 ms P99 latency with ~3,465 QPS, competitive on latency but well below ZillizCloud on throughput.
OpenSearch (force merge): Higher latency (~7.2 ms) but still competitive QPS for certain workloads.
Pinecone, Qdrant, OpenSearch (standard): Trade raw speed for operational simplicity or ecosystem integration.
Streaming Performance Matters
Static benchmarks only tell half the story. Many real-world RAG applications run on constantly updated corpora, so the database must handle concurrent reads and writes without collapsing under ingestion pressure.

Highlights:
ZillizCloud shows strong resilience, maintaining ~1,860 QPS at 1,000 rows/s ingestion.
Qdrant degrades minimally under ingestion.
Pinecone and OpenSearch experience sharper drops as ingestion rates increase.
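To reproduce this kind of result on your own stack, run queries while a background writer sustains a fixed ingestion rate. The sketch below assumes a hypothetical client exposing insert(batch) and search(vector, top_k); it illustrates the methodology, not the harness behind the numbers above.

```python
import threading
import time

def qps_under_ingestion(client, query_vectors, row_batches):
    """Measure query throughput while a background writer keeps inserting rows."""
    stop = threading.Event()

    def writer():
        for batch in row_batches:          # size each batch to your target rows/s
            client.insert(batch)
            if stop.wait(1.0):             # pace to roughly one batch per second
                return

    threading.Thread(target=writer, daemon=True).start()
    start = time.perf_counter()
    for q in query_vectors:
        client.search(q, top_k=10)
    stop.set()
    return len(query_vectors) / (time.perf_counter() - start)
```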
Consolidated Benchmark Table
| Database (config) | P99 latency (ms) | QPS (static, 1M) | QPS (static, 10M) | QPS (500 rows/s ingest) | QPS (1,000 rows/s ingest) |
|---|---|---|---|---|---|
| ZillizCloud-8cu-perf | 2.5 | 9,704 | 3,957 | 2,119 | 1,860 |
| Milvus-16c64g-sq8 | 2.2 | 3,465 | 437 | 306 | 156 |
| OpenSearch-16c128g-force | 7.2 | 3,055 | — | — | — |
| QdrantCloud-16c64g | 6.4 | 1,242 | 447 | 394 | 348 |
| Pinecone-p2.x8-1node | 13.7 | 1,147 | 1,131 | 367 | 370 |
| OpenSearch-16c128g | 13.2 | 951 | 506 | 162 | 150 |
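When evaluating candidates, measure P99 latency and QPS under your own data and query mix rather than relying on published numbers alone. The sketch below times a hypothetical search_fn callable single-threaded; production benchmarks (including the figures above) drive many concurrent clients, so treat this as a starting point.

```python
import time
import numpy as np

def benchmark(search_fn, query_vectors, warmup=100):
    """Return (P99 latency in ms, QPS) for a search callable, single-threaded."""
    for q in query_vectors[:warmup]:                 # warm caches and connections
        search_fn(q)
    latencies = []
    start = time.perf_counter()
    for q in query_vectors[warmup:]:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return float(np.percentile(latencies, 99)), len(latencies) / elapsed
```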
3. ANN Algorithm Choice
HNSW: Excellent speed-recall trade-off, versatile across datasets, but memory-heavy.
IVF+PQ: Lower memory footprint, fast for certain distributions, but more sensitive to parameter tuning.
The choice depends on embedding dimensionality, recall tolerance, and available hardware.
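As a concrete illustration, the sketch below builds both index types with the FAISS library on random stand-in vectors. The parameter values (M, efSearch, nlist, nprobe, and so on) are illustrative, not recommendations; tune them against your own recall and latency targets.

```python
import faiss
import numpy as np

d = 768                                            # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")  # stand-in corpus vectors
xq = np.random.rand(10, d).astype("float32")       # stand-in query vectors

# HNSW: graph-based, strong speed/recall trade-off, memory-heavy
hnsw = faiss.IndexHNSWFlat(d, 32)                  # 32 = graph connectivity (M)
hnsw.hnsw.efConstruction = 200                     # build-time quality/cost knob
hnsw.hnsw.efSearch = 64                            # query-time recall/latency knob
hnsw.add(xb)

# IVF+PQ: inverted lists + product quantization, smaller memory footprint
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1,024 lists, 64 sub-quantizers, 8 bits
ivfpq.train(xb)                                    # IVF/PQ require a training pass
ivfpq.add(xb)
ivfpq.nprobe = 16                                  # lists probed per query: recall vs speed

dists_hnsw, ids_hnsw = hnsw.search(xq, 10)         # top-10 neighbours per query
dists_ivf, ids_ivf = ivfpq.search(xq, 10)
```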
4. Scaling Patterns
Sharding and Replication
Sharding horizontally partitions the index for parallel queries. Replication provides fault tolerance and extra read throughput.
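A query against a sharded index is typically scatter-gather: fan the query out to every shard, then merge the partial top-k lists. A minimal sketch follows, assuming each shard client exposes search(vector, k) returning (score, doc_id) pairs where higher scores are better.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scatter_gather_search(shards, query_vec, top_k=10):
    """Query all shards in parallel and merge their partial top-k results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard.search(query_vec, top_k), shards)
    hits = (hit for partial in partials for hit in partial)  # flatten (score, doc_id) pairs
    return heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
```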
Caching
Use:
Query cache for repeated queries
Chunk cache for popular documents
Embedding cache if generating embeddings on the fly
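A minimal sketch of the query and embedding caches follows, assuming hypothetical embedding_model and vector_db clients. In production the query cache would usually live in Redis or a similar store with a TTL rather than an in-process dict.

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text):
    """Embedding cache: skip re-encoding strings we have already seen."""
    return tuple(embedding_model.encode(text))   # tuple keeps the result hashable

query_cache = {}                                  # query cache for repeated questions

def search_cached(query, top_k=10):
    """Serve repeated queries from cache; otherwise embed, search, and store."""
    key = hashlib.sha256(f"{query}|{top_k}".encode()).hexdigest()
    if key not in query_cache:
        query_cache[key] = vector_db.search(embed_cached(query), top_k)
    return query_cache[key]
```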
Hybrid Retrieval
Blend semantic vector search with BM25 keyword search so results are both lexically precise and conceptually relevant.
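One common way to blend the two result sets is Reciprocal Rank Fusion (RRF), which needs only the ranks from each retriever. The sketch below assumes you already have ranked doc-id lists from a BM25 index and a vector index (both hypothetical clients here).

```python
def reciprocal_rank_fusion(result_lists, k=60, top_k=10):
    """Merge ranked doc-id lists (e.g. from BM25 and vector search) by RRF score."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# bm25_ids   = bm25_index.search(query, 50)        # keyword hits (hypothetical client)
# vector_ids = vector_db.search(embed(query), 50)  # semantic hits (hypothetical client)
# fused      = reciprocal_rank_fusion([bm25_ids, vector_ids])
```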
Pre-filtering
Filter on metadata before vector search to shrink the search space and improve speed.
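Most vector databases expose this as a filter or expression argument on the search call. The sketch below uses a hypothetical client and field names; the exact syntax varies by database, but the idea is the same: push metadata predicates down so the ANN search only considers vectors that can actually match.

```python
# Hypothetical client and schema: predicates are applied before/inside the ANN scan,
# shrinking the candidate set and cutting both latency and irrelevant hits.
results = vector_db.search(
    vector=embed("Q3 revenue guidance"),
    filter={
        "department": "finance",
        "doc_type": "report",
        "published_after": "2025-01-01",
    },
    top_k=10,
)
```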
5. Optimizing for the Last Millisecond
Parallel search across shards or threads
Parallel LLM inference across GPUs/nodes
TTFT optimization with streamlined pipelines, early exit strategies, and speculative decoding
These techniques stack to create noticeable responsiveness improvements in interactive applications.
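The cheapest TTFT win is usually plain streaming: the user sees the first tokens while the rest are still being decoded. A minimal sketch of instrumenting that follows; send_request and on_token are hypothetical hooks for whatever inference client and UI channel you use.

```python
import time

def stream_with_ttft(send_request, on_token):
    """Stream an LLM response, logging Time-to-First-Token along the way."""
    start = time.perf_counter()
    first = True
    for token in send_request():       # send_request() yields tokens/chunks as they decode
        if first:
            print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
            first = False
        on_token(token)                # push each chunk to the client immediately
```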
6. Business Impact
User experience: Sub-second responses build trust and keep users engaged.
Operational efficiency: High QPS with low latency can mean fewer servers for the same load.
Real-world wins:
Morningstar + Weaviate: Low-latency search over financial data for faster insights.
Rubrik + Pinecone: Secure, low-latency search over billions of vectors for real-time data access.
7. Key Lessons
Latency is a pipeline problem, not just a DB problem.
Hybrid retrieval improves recall and relevance without excessive cost.
Streaming performance is critical for live-data RAG.
Match the open-source vs. proprietary choice to your operational needs and in-house skills.
Always measure and optimize TTFT alongside retrieval latency.