A Customer Asked Me About RAG Performance. I Ended Up Learning CUDA.
How a curious customer question about RAG sent me down a GPU rabbit hole on the DGX Spark, and what I learned about cuVS, semantic search, and when the GPU actually makes a difference.
How This Started
Since I got my hands on a DGX Spark, I have been keeping a little list of things I want to explore with it. One item on that list has been sitting there for a while: "learn more about CUDA." You know how it is -- the item that sounds important but never feels urgent enough to actually start.
Then today, on a customer call, someone was curious about RAG performance. They have not even started building their agents yet, but they wanted to understand how it all works and what kind of performance they could expect from semantic search as their data grows. It was not an urgent problem to solve -- just a curious mind asking good questions. But that conversation gave me exactly the push I needed. I did not have to go down the CUDA rabbit hole, but I wanted to. What better way to finally learn CUDA than to dive into cuVS -- NVIDIA's library for GPU-accelerated vector search -- and figure out if throwing a GPU at semantic search actually makes a meaningful difference?
So that is what this post is about. I am going to share what I learned about cuVS and Milvus, from the very basics, because a few hours ago I did not know most of this myself.
First Things First: What Even Is Semantic Search?
If you already know this part, feel free to skip ahead. But when I started digging into cuVS, I realized I needed to be more precise about what "semantic search" actually means at the math level.
Semantic search is the idea of finding content by meaning, not by exact keyword matching. You take a piece of text (or an image, or audio), run it through a machine learning model, and get back a list of numbers -- an embedding vector. Texts with similar meanings produce vectors that are close together in this high-dimensional space. "How do I cook pasta?" and "What is the best pasta recipe?" will end up as vectors that are very close to each other. "Quantum physics equations" will be far away from both.
In a RAG pipeline, this is the retrieval step: you take the user's question, turn it into a vector, and search your knowledge base for the closest matching vectors. Those matches become the context you feed to the LLM. The faster and more accurate this search is, the better your RAG pipeline performs.
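To make that concrete, here is a minimal sketch of the idea, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model that shows up later in this post (not the exact code from my notebook):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "How do I cook pasta?",
    "What is the best pasta recipe?",
    "Quantum physics equations",
])  # shape (3, 384): one 384-dimensional vector per sentence

# L2 distance: smaller means closer in meaning
print(np.linalg.norm(vectors[0] - vectors[1]))  # small -- both are about pasta
print(np.linalg.norm(vectors[0] - vectors[2]))  # large -- unrelated topics
```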
Now, this "find the closest vectors" operation is called nearest-neighbor search, and it is where the computational bottleneck lives. At small scale, no problem. But when your corpus has hundreds of thousands or millions of vectors? Each with hundreds of dimensions? Every single query needs to be compared against all of them. That is a lot of math.
The Brute-Force Problem and the Approximate Solution
The simplest approach is brute-force search: compare the query vector against every single vector in the corpus, compute the distance, rank them, return the closest ones. It always gives you the correct answer. The problem is that it is slow. With 1 million vectors of 384 dimensions each, that is 1 million distance calculations per query. It adds up fast.
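In code, brute-force search is nothing more than this -- a NumPy sketch with random data, just to show the shape of the problem:

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.random((1_000_000, 384), dtype=np.float32)  # 1M vectors, 384 dims
query = rng.random(384, dtype=np.float32)

# One distance per corpus vector -- this line is the entire bottleneck
distances = np.linalg.norm(corpus - query, axis=1)
top10 = np.argsort(distances)[:10]  # indices of the 10 nearest vectors
```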
To avoid scanning every vector, people have developed approximate nearest-neighbor (ANN) algorithms. These are clever data structures that let you skip large portions of the corpus during search. The trade-off? They might miss some of the true nearest neighbors. They are faster, but not always perfectly accurate.
The most common ones I came across are:
- IVF (Inverted File Index): Think of it like sorting books into shelves by topic. It partitions all the vectors into clusters. At search time, instead of scanning every vector, you only look at the clusters that are closest to your query. The nlist parameter controls how many clusters to create, and nprobe controls how many clusters to actually scan at search time. More clusters scanned = better accuracy, but slower (see the sketch after this list).
- HNSW (Hierarchical Navigable Small World): This one builds a multi-layered graph where each vector is connected to its nearby neighbors. Searching works like navigating a city: you start on the highway (top layer, long-range connections) to get to the right neighborhood, then switch to local streets (bottom layer, short-range connections) to find the exact address. Very fast, very popular.
- CAGRA (CUDA ANN GRAph-based): Similar idea to HNSW -- it is also a graph -- but designed specifically for GPU parallelism. Instead of traversing the graph one node at a time like on a CPU, it explores many paths simultaneously across thousands of GPU cores.
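To give a feel for those two IVF knobs, this is roughly how nlist and nprobe show up when you create and query an IVF index in Milvus -- a sketch, with illustrative rather than tuned values:

```python
# Index-time parameter: how many clusters to partition the vectors into
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 1024},
}

# Query-time parameter: how many of those clusters to actually scan
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 16},
}
```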
So What Is cuVS?
OK, now we get to the good part.
cuVS (CUDA Vector Search) is an open-source library from NVIDIA's RAPIDS team. It takes the vector search algorithms I just described and runs them on the GPU. Instead of your CPU cores doing all the distance calculations, the thousands of parallel cores in an NVIDIA GPU do it.
cuVS ships three main index types:
| Index | What It Does | Exact? |
|---|---|---|
| brute_force | Exhaustive pairwise distance on GPU | Yes |
| ivf_flat | Inverted file with flat storage on GPU | No (approximate) |
| cagra | GPU-native graph-based traversal | No (approximate) |
And here is what I found really nice: you can use cuVS in two ways. You can call the Python API directly with GPU arrays and get raw performance. Or you can use it through a vector database like Milvus, which uses cuVS under the hood when you create GPU indexes. You do not even need to know cuVS is there -- you just change the index type from FLAT to GPU_BRUTE_FORCE and Milvus handles the rest.
I wanted to see how much faster GPU search really is compared to CPU. So I set up a benchmark.
The Setup
Hardware: NVIDIA DGX Spark -- Blackwell-architecture GB10 GPU with 128.5 GB of unified memory on an aarch64 system. This machine has been my playground for a few weeks now and I am still impressed every time I SSH into it.
Software:
- Milvus as the vector database (serving both CPU and GPU index types)
- cuVS Python library for direct GPU access
- pymilvus for database interaction
- CuPy for GPU array management
Benchmark design:
- Corpus sizes: 10K, 100K, 500K, and 1M vectors
- Embedding dimension: 384 (same dimension as all-MiniLM-L6-v2, a very common embedding model for RAG)
- 100 query vectors per run
- Top-K = 10 neighbors
- L2 (Euclidean) distance metric -- this measures the straight-line distance between two vectors in the embedding space. Smaller distance means more similar. I used L2 because GPU indexes in Milvus do not support cosine similarity directly.
- 3 warmup iterations, then 10 timed iterations per method
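To make the methodology concrete, here is roughly what the timing loop looks like -- a simplified sketch, not the exact notebook code, and benchmark_search is a hypothetical helper name:

```python
import time
import numpy as np

def benchmark_search(search_fn, queries, warmup=3, iters=10):
    """Time a batch search function the way the results below were measured."""
    for _ in range(warmup):
        search_fn(queries)                      # warmup iterations, not timed
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        search_fn(queries)                      # one batch of 100 queries
        timings.append(time.perf_counter() - start)
    avg_latency_ms = float(np.mean(timings)) * 1000
    qps = len(queries) / (avg_latency_ms / 1000)  # individual queries per second
    return avg_latency_ms, qps
```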
CPU vs GPU: Let the Numbers Talk
Time for the actual benchmark. I tested three CPU index types and three GPU index types, all through Milvus. The beautiful thing is that switching between them is just changing one string:
```python
# CPU indexes
cpu_index_types = ["FLAT", "IVF_FLAT", "HNSW"]

# GPU indexes (cuVS-powered)
gpu_index_types = ["GPU_BRUTE_FORCE", "GPU_IVF_FLAT", "GPU_CAGRA"]
```
Same API, same Milvus, same data. Just a different index type string. Milvus takes care of all the cuVS integration behind the scenes.
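Here is roughly what that looks like with pymilvus -- a sketch; the collection and field names are placeholders, not the exact ones from my notebook:

```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")

collection = Collection("rag_benchmark")       # placeholder collection name
collection.create_index(
    field_name="embedding",                    # placeholder vector field
    index_params={
        "index_type": "GPU_BRUTE_FORCE",       # or "FLAT" for the CPU run
        "metric_type": "L2",
        "params": {},
    },
)
collection.load()                              # load the index so it can serve searches
```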
But First: How to Read the Results
Before I show the tables, let me explain the three metrics, because I had to look some of this up myself:
- Avg Latency -- how long a batch of 100 queries takes on average. Lower is better.
- QPS (Queries Per Second) -- throughput. How many individual queries the system can handle per second. Higher is better. If 100 queries take 10 ms, that is 10,000 QPS.
- Recall@10 -- this one took me a minute to fully understand, so let me explain it properly.
We ask each index to return the 10 nearest neighbors for each query (Top-K = 10). But how do we know if those 10 results are actually the right 10 nearest neighbors? We need something to compare against. That something is the ground truth: the results from brute-force search, which checks every single vector and is guaranteed to find the true nearest neighbors.
Recall@10 is simply: out of the 10 true nearest neighbors, how many did the index actually find?
Here is a concrete example. Say the ground truth for a query is vectors [5, 12, 7, 99, 23, 41, 8, 67, 3, 55]. The approximate index returns [5, 12, 7, 99, 23, 80, 14, 67, 3, 55]. If you compare the two lists, 8 out of 10 match. So Recall@10 = 0.80.
- Recall = 1.0 -- found all 10 correct neighbors. Perfect. Brute-force methods always get this.
- Recall = 0.5 -- found 5 out of 10. Missed half.
- Recall = 0.14 -- found about 1 or 2 out of 10. Not great.
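In code, the recall calculation for the example above is just a set overlap:

```python
ground_truth = {5, 12, 7, 99, 23, 41, 8, 67, 3, 55}    # true 10 nearest neighbors
approx_result = {5, 12, 7, 99, 23, 80, 14, 67, 3, 55}  # what the index returned

recall_at_10 = len(ground_truth & approx_result) / 10
print(recall_at_10)  # 0.8 -- found 8 of the 10 true neighbors
```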
This is the fundamental trade-off in vector search: exact methods (brute-force) give perfect recall but are slow. Approximate methods (IVF, HNSW, CAGRA) are fast but may miss some correct results. The art is finding the sweet spot where recall is good enough for your application and latency is low enough for your users.
Results at 100K Vectors
| Method | Avg Latency | QPS | Recall@10 |
|---|---|---|---|
| CPU FLAT | 101.08 ms | 989 | 1.000 |
| CPU IVF_FLAT | 11.21 ms | 8,921 | 0.141 |
| CPU HNSW | 12.53 ms | 7,983 | 0.378 |
| GPU BRUTE_FORCE | 7.16 ms | 13,960 | 1.000 |
| GPU IVF_FLAT | 4.96 ms | 20,172 | 0.068 |
| GPU CAGRA | 5.27 ms | 18,966 | -- |
Look at that first comparison. CPU brute-force: 101 ms. GPU brute-force: 7 ms. That is a 14x speedup with exactly the same perfect recall. You change one config string and search gets 14 times faster. That was the moment I thought: OK, this cuVS thing is real.
Scaling to 1M Vectors
But the real question from my customer was about scale. What happens when the corpus grows? At 1 million vectors:
| Method | Avg Latency | QPS | Recall@10 |
|---|---|---|---|
| CPU FLAT | 1,070.27 ms | 93 | 1.000 |
| CPU IVF_FLAT | 36.48 ms | 2,741 | 0.113 |
| CPU HNSW | 25.49 ms | 3,922 | 0.190 |
| GPU BRUTE_FORCE | 36.87 ms | 2,712 | 1.000 |
| GPU IVF_FLAT | 14.88 ms | 6,722 | 0.079 |
| GPU CAGRA | 7.61 ms | 13,143 | -- |
CPU FLAT takes over a second. Over a second! For 100 queries. That is completely unusable in a real-time RAG pipeline. GPU BRUTE_FORCE does the same job in 37 ms with perfect recall -- a 29x speedup. And GPU CAGRA is sustaining over 13K queries per second even at this scale.
This is the kind of answer I wish I had on that customer call.
Index Build Times
One more thing that matters in practice: how long does it take to build the index? If your knowledge base changes frequently, you do not want to wait forever for re-indexing:
| Method | 100K | 1M |
|---|---|---|
| CPU FLAT | 2.0 s | 13.6 s |
| CPU HNSW | 15.1 s | 158.9 s |
| GPU BRUTE_FORCE | 3.0 s | 10.6 s |
| GPU CAGRA | 3.0 s | 23.1 s |
CPU HNSW takes nearly 3 minutes to build an index over 1M vectors. GPU CAGRA does it in 23 seconds. Almost 7x faster. For a RAG system where documents get added and removed regularly, that difference matters a lot.
Bonus: Calling cuVS Directly (No Database)
I could not resist. Since I had cuVS installed on the DGX Spark, I wanted to see what happens if you bypass Milvus entirely and call cuVS directly with GPU arrays. How much overhead does the database add?
The core of it looks roughly like this:

```python
import cupy as cp
from cuvs.neighbors import brute_force

# 100K random 384-dim vectors and 100 queries, as CuPy arrays on the GPU
dataset = cp.random.random((100_000, 384)).astype(cp.float32)
queries = cp.random.random((100, 384)).astype(cp.float32)

# Build index directly on GPU arrays
index = brute_force.build(dataset)

# Search
distances, neighbors = brute_force.search(index, queries, 10)
```
At 100K vectors, cuVS direct brute_force runs in 3.62 ms -- nearly 2x faster than Milvus GPU_BRUTE_FORCE (7.16 ms). The gap is pure database overhead. But here is the interesting part -- at 1M vectors, the difference almost disappears:
| Method | Avg Latency (1M vectors) | Speedup vs CPU FLAT |
|---|---|---|
| Milvus GPU BRUTE_FORCE | 36.87 ms | 29.0x |
| cuVS brute_force (direct) | 36.39 ms | 29.4x |
At scale, the actual GPU computation dominates and the database overhead becomes noise. That is good news: you get nearly all of cuVS's raw performance through Milvus, without giving up database features like persistence, filtering, and replication.
Wait, Why Is the Recall So Bad?
If you have been looking at the tables carefully (and I hope you have), you probably noticed the recall numbers for approximate indexes are... terrible. IVF_FLAT at 0.14? HNSW at 0.38? I can tell you, when I first saw these numbers I thought something was broken.
It is not broken. The problem is the test data.
This benchmark uses uniformly random vectors -- every vector is generated with np.random.default_rng(42).random(...). Random vectors have no structure at all. They do not form clusters, and the distance between any two of them is roughly the same. That is the absolute worst-case scenario for approximate algorithms. Here is why:
- IVF partitions data into clusters and then only searches a few clusters at query time. But if the data has no natural clusters, the partitioning is meaningless. The true nearest neighbors are scattered randomly across all clusters, and searching only a few of them misses most of the correct results.
- HNSW and CAGRA build graphs by connecting nearby vectors. But in high-dimensional random data, the whole concept of "nearby" breaks down. Distances between any two random vectors converge to very similar values -- this is known as the curse of dimensionality. The graph connections become almost arbitrary, so following the graph does not reliably lead you to the true nearest neighbors.
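You can see this concentration effect with a few lines of NumPy -- the nearest and farthest random vectors end up at almost the same distance from the query:

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((1000, 384), dtype=np.float32)
query = rng.random(384, dtype=np.float32)

distances = np.linalg.norm(vectors - query, axis=1)
print(distances.min(), distances.mean(), distances.max())
# min, mean, and max all land in a narrow band -- "nearby" barely means
# anything here, which is exactly what trips up cluster and graph indexes.
```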
With real-world embeddings from models like all-MiniLM-L6-v2 or OpenAI's models, the story is completely different. Vectors for similar content naturally cluster together, and approximate algorithms exploit that structure. In production systems, IVF and HNSW routinely achieve 0.95+ recall.
So do not panic about these recall numbers. The lesson here is that approximate indexes need structured data to work well -- which real embeddings always provide. The latency and throughput numbers remain valid because they measure raw computational performance: how fast the hardware crunches through distance calculations, regardless of data structure.
What I Learned: When to Reach for the GPU
After running all these benchmarks, here is my mental model for when GPU-accelerated search makes sense:
Use GPU indexes when:
You need exact search at scale. This is the clearest win. CPU brute-force becomes painful above 100K vectors. GPU BRUTE_FORCE is 29x faster at 1M vectors with perfect recall. If your RAG pipeline needs exact results and your corpus is large, this is a no-brainer.
Throughput is what you care about. GPUs love batched computation. If your application fires hundreds of concurrent queries -- think RAG with multiple chunks, batch recommendation pipelines, or multiple users searching at the same time -- GPU indexes will use the hardware much more efficiently than CPU cores.
You re-index frequently. GPU CAGRA builds a 1M-vector index in 23 seconds vs nearly 3 minutes for CPU HNSW. If your knowledge base changes a lot, this makes a real operational difference.
You are already on GPU hardware. If you are generating embeddings on the GPU (running a local model), keeping those vectors on the GPU for search avoids a costly round-trip back to CPU memory. The DGX Spark's unified memory makes this especially smooth.
Show Me the Code
The full benchmark notebook, the docker-compose file for Milvus, and all the setup instructions are available on GitHub: rhossi/cuvs-lab. You can clone it and run it on any machine with an NVIDIA GPU. If you do not have a GPU, the notebook will detect that and skip the GPU benchmarks automatically.
But the core pattern is almost too simple to believe -- switching from CPU to GPU in Milvus is a one-line change:
```python
# CPU: brute-force search
index_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}

# GPU: same search, 29x faster
index_params = {"index_type": "GPU_BRUTE_FORCE", "metric_type": "L2", "params": {}}
```
That is it. All the complexity of CUDA kernels, GPU memory management, and parallel scheduling is hidden behind that one string. You change FLAT to GPU_BRUTE_FORCE and Milvus takes care of the rest.
One More Thing: cuVS Is Coming to PostgreSQL Too
While I was researching cuVS, I found something that got me really curious. There is a PostgreSQL extension called PGPU from EnterpriseDB that uses cuVS to GPU-accelerate vector index builds right inside Postgres. It integrates with the VectorChord indexing extension -- you call a single function and PGPU handles reading the data, computing centroids on the GPU, and feeding them back for index construction.
I have not tried it yet, but now I have to. GPU-accelerated vector search inside PostgreSQL? That is going straight to the top of my DGX Spark adventure list.
So that is my cuVS adventure. What started as "I should learn more about CUDA someday" turned into a pretty productive afternoon. The DGX Spark with its 128 GB of unified memory makes this kind of exploration very accessible -- you can index millions of vectors without worrying about GPU memory limits.
Next time that customer is ready to start building their agents, I will have something concrete to share with them. And if you are building RAG at scale, I hope the numbers in this post help you decide whether GPU-accelerated search is worth it for your workload. Spoiler: if your corpus is above 100K vectors and throughput matters, it probably is.