GPU-Accelerated Vector Search in PostgreSQL: From Hours to Seconds
I benchmarked GPU-accelerated vector index building in PostgreSQL using PGPU and cuVS on the NVIDIA GB10. Clustering went from 21 minutes to 9.7 seconds. Here are the full results, the setup, and when it makes sense for your workload.
📝 Update (Feb 23, 2026): After publishing this post, members of the NVIDIA cuVS team shared some additional context that helped me refine a few sections. I have updated those parts and marked them with 📝 below. Thanks to the cuVS team for taking the time. This is what I love about learning in public.
In my last post, I benchmarked GPU-accelerated vector search using cuVS and Milvus. At the end of that post I mentioned a PostgreSQL extension called PGPU that brings GPU-accelerated vector indexing directly into Postgres. I had not tried it yet. Now I have 🙂
The headline: building a vector index over 1 million embeddings went from 27 minutes to under 6. The k-means clustering step alone went from 21 minutes on CPU to 9.7 seconds on GPU. That is a 130× speedup.
In this post I will go through why that matters, how I tested it, and what the numbers look like at different scales.
Why Vector Search Matters
If you work with databases, you are probably used to exact match queries: find all orders where status = 'shipped'. Vector search works differently. Instead of looking for exact values, it finds the most similar items in a high-dimensional space. You take text, images, or user behavior, turn them into numerical vectors (called embeddings), store them in a database, and then search by proximity instead of equality.
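To make "search by proximity instead of equality" concrete, here is a minimal sketch in Python with NumPy. The item names and tiny 4-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions:

```python
import numpy as np

# Toy stand-in "embeddings". In practice these come from a model
# and have hundreds of dimensions, not 4.
items = {
    "blue sofa":    np.array([0.9, 0.1, 0.0, 0.2]),
    "red armchair": np.array([0.7, 0.3, 0.1, 0.1]),
    "oak table":    np.array([0.1, 0.9, 0.5, 0.0]),
}

def nearest(query, k=2):
    # Rank items by Euclidean (L2) distance to the query vector
    # and return the k closest.
    dists = {name: float(np.linalg.norm(vec - query)) for name, vec in items.items()}
    return sorted(dists, key=dists.get)[:k]

query = np.array([0.85, 0.15, 0.05, 0.15])  # e.g. an embedded photo of a couch
print(nearest(query))  # → ['blue sofa', 'red armchair']
```

No item matches the query exactly; the search returns whatever is closest in the vector space. That is the whole idea, and everything else in this post is about making it fast at scale.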
This is already powering real workloads in many industries. Let me give you a few examples:
E-Commerce and Retail
Think about a product catalog with 10 million items, where each product has a 768-dimensional embedding generated by a vision-language model. When a customer uploads a photo of a couch they like, vector search finds the 20 most visually similar products in milliseconds. The same idea powers "customers also bought" recommendations, just using purchase-history embeddings instead of images.
Now, a retailer that processes 50,000 catalog updates per day needs to rebuild its vector index often. If that rebuild takes 3 hours on CPU, it becomes an overnight batch job. Cut it to 40 minutes with GPU acceleration and suddenly you can do it during a maintenance window.
Financial Services
Fraud detection systems take transaction patterns (amount, merchant category, time of day, geographic location) and encode them into embeddings. When a new transaction comes in, vector search finds the most similar historical patterns across billions of records. If a transaction is far from any known cluster, it gets flagged.
The problem is that every time the model gets retrained or new data comes in, the vector index has to be rebuilt across tens of millions of embeddings. The longer that takes, the longer the system is running on stale data.
Healthcare and Life Sciences
Drug discovery pipelines encode molecular structures as high-dimensional vectors and search for compounds that are structurally similar to a promising candidate, across libraries of millions of molecules. These datasets are big (millions of compounds, each with 1,000+ dimensional fingerprints) and the indexes need to be rebuilt as new compounds get screened. A rebuild that blocks a research pipeline for hours is not just slow, it costs real researcher time.
Media and Content Platforms
Streaming services encode every piece of content (videos, articles, podcasts) into embeddings for semantic search and personalization. Someone searching for "relaxing nature documentaries" does not need to match an exact title; vector search finds content with similar meaning. With catalogs growing by thousands of items daily and personalization models updating every week, index maintenance becomes a constant background task. Shorter rebuild cycles mean fresher results for users.
The pattern across all of these is the same: the bottleneck is not how fast you can search. It is how long it takes to build or rebuild the vector index. That is what GPU acceleration helps with.
The Test Bench
OK so to actually put numbers behind all of this, I built a fully containerized benchmarking environment. PostgreSQL with GPU-accelerated vector indexing, running on a single machine, reproducible with one docker compose build.
Hardware
I ran these benchmarks on an NVIDIA DGX Spark, but PGPU works on any NVIDIA GPU with NVIDIA CUDA support. If you have a workstation with a discrete GPU, a cloud instance with an NVIDIA A100 or NVIDIA H100, or any of the NVIDIA Grace-based systems, you can run the same tests. Your numbers will be different depending on compute capability and memory bandwidth, but the overall pattern (GPU clustering being way faster) should hold.
About production sizing: this post is about benchmarking and understanding performance characteristics, not about production architecture. If you are planning to take this to production, you will definitely want to test with your own data, your own query patterns, and your own hardware. But here are a few things worth keeping in mind:
- GPU memory is the main constraint. Your vector data needs to fit in GPU memory during the clustering phase (or be split into batches that do). On the DGX Spark and other coherent memory systems this is less of a worry since CPU and GPU share the same memory pool. On discrete GPUs, check your GPU memory against your dataset size.
- CPU and storage still matter. The assignment and index construction phase (phase 2) currently runs on CPU. Fast NVMe storage and enough CPU cores for VectorChord's build threads will help keep total build times down.
- Start with a realistic subset. Run the benchmark with a representative sample of your actual embeddings (not random vectors) at a few different sizes. This gives you a scaling curve you can extrapolate from, and you get to see how k-means behaves on your specific data distribution.
Software Stack
Everything runs in Docker containers, fully reproducible from a single docker compose build:
| Component | Version | Purpose |
|---|---|---|
| PostgreSQL | 17.8 | Database engine |
| pgvector | 0.8.1 | Vector data type (vector(768)) and distance operators |
| VectorChord | 1.1.0 | IVF vector index (vchordrq), the index type we are building |
| PGPU | 2.1.0 | GPU-accelerated centroid computation via NVIDIA cuVS |
| NVIDIA cuVS | 25.12 | GPU-accelerated vector search and clustering library (via conda) |
| CUDA Toolkit | 12.6 | GPU compute runtime |
How Vector Index Building Works
📝 Updated to clarify that the assignment step in phase 2 is also a distance computation, not purely I/O. The original version undersold how much compute happens after clustering.
VectorChord uses an IVF (Inverted File) index structure. Building this index has two main phases:
1. Clustering (training): Run k-means to divide all vectors into clusters and produce a set of centroids. This is compute-heavy: it goes over the entire dataset multiple times, computing distances from every vector to every centroid on each iteration.
2. Assignment and index construction: Assign each vector to its nearest centroid (which is itself a brute-force distance computation across all vectors) and then build the on-disk index structure.
Both phases involve heavy distance computations, but in this benchmark PGPU accelerates the clustering step by offloading k-means to the GPU using NVIDIA's cuVS library. It reads vectors from PostgreSQL, sends them to the GPU in batches, runs k-means clustering, writes the computed centroids back to a PostgreSQL table, and then hands those centroids to VectorChord's index builder for phase 2.
Why does the GPU do so well at k-means? Because each iteration involves millions of distance computations across hundreds of dimensions. That is a massively parallel workload, and GPUs are literally built for that kind of thing. Worth noting: the assignment step in phase 2 is the same kind of distance computation and could also benefit from GPU acceleration in the future.
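The two phases, including the batched clustering-then-consolidation flow described above, can be sketched in plain NumPy. This is my own toy reconstruction at a tiny scale, not PGPU's or VectorChord's actual code, and the batch/cluster counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((2000, 8), dtype=np.float32)  # toy scale: 2K vectors, dim 8
k = 16  # final cluster count ("lists" in VectorChord terms)

def kmeans(data, k, iters=10):
    # Lloyd's k-means: each iteration is an O(n * k * dim) distance sweep.
    # This inner loop is the part PGPU offloads to the GPU via cuVS.
    centroids = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

# Phase 1 — clustering (training), batched like the GPU path:
# cluster each batch, then consolidate the intermediate centroids
# with one final clustering pass.
batches = np.array_split(vectors, 4)
intermediate = np.vstack([kmeans(b, k) for b in batches])  # 4 * 16 = 64 centroids
centroids = kmeans(intermediate, k)                        # consolidate down to 16

# Phase 2 — assignment and index construction: one more brute-force
# distance pass buckets every vector under its nearest centroid.
# (This step runs on CPU in the benchmarked PGPU version.)
d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
labels = d.argmin(axis=1)
inverted_file = {c: np.flatnonzero(labels == c) for c in range(k)}
assert sum(len(ids) for ids in inverted_file.values()) == len(vectors)
```

At query time, an IVF index only searches the buckets whose centroids are closest to the query vector, which is why the quality of the centroids matters so much.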
Dataset
I generated synthetic random vectors directly in PostgreSQL:
| Parameter | Value |
|---|---|
| Row count | 1,000,000 |
| Dimensions | 768 (matches BERT / sentence-transformer embeddings) |
| Data type | vector(768) (pgvector) |
| Raw data size | ~2.9 GB |
| Distribution | Uniform random |
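The post generated this data directly in PostgreSQL; as a cross-check, here is the equivalent size math and data shape in Python. The sample size below is my own choice for illustration:

```python
import numpy as np

rows, dims = 1_000_000, 768

# Raw size check: each vector component is a 4-byte float32.
raw_bytes = rows * dims * 4
print(f"{raw_bytes / 1024**3:.2f} GiB raw")  # 2.86 GiB — the "~2.9 GB" in the table

# Uniform random vectors with no cluster structure, like the benchmark data.
# (A 10K-row sample here; the full 1M rows work the same way.)
sample = np.random.default_rng(42).random((10_000, dims), dtype=np.float32)
assert sample.shape == (10_000, 768)
```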
📝 Updated to better explain the limitations of synthetic data. The original version assumed build time alone tells the full story. It does not. Data distribution affects index quality, so build speed and recall/latency should be evaluated together.
An important note on synthetic data. I used uniform random vectors for this benchmark, and I want to be upfront about what that means and what it does not mean.
This benchmark measures index build speed only. It does not measure the quality of the resulting index (recall, query latency, or the recall/latency tradeoff). With uniform random data, the vectors have no natural cluster structure, which means the IVF partitions produced by k-means are not meaningful in the same way they would be with real embeddings. You cannot look at these build times and assume the resulting indexes would perform identically at query time, because the data distribution has a big impact on index quality.
Real-world embeddings from text, images, or user behavior have inherent cluster structure. This affects both how fast k-means converges and how well the resulting index partitions work for search. A proper apples-to-apples comparison of vector index builds should also measure the recall and latency of the resulting index, not just build time. The cuVS team has a good guide on how to do this correctly: Comparing Vector Indexes.
So take the build time numbers in this post for what they are: a measurement of how fast each path can run the index build pipeline, on this specific dataset, at this specific scale. For a complete evaluation you would want to repeat this with real embeddings and measure index quality too.
Here are the index parameters I used:
| Parameter | Value | Rationale |
|---|---|---|
| lists | 4,000 | Partitions (within recommended range of 4*sqrt(1M) to 16*sqrt(1M)) |
| sampling_factor | 250 | Samples per cluster for k-means (250 × 4,000 = 1M) |
| kmeans_iterations | 10 | Standard convergence iterations |
| build_threads (CPU) | 4 | CPU parallelism for the CPU baseline |
| batch_size (GPU) | 100,000 | Vectors processed per GPU batch |
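The lists and sampling_factor choices can be sanity-checked with a bit of arithmetic (my own back-of-the-envelope, mirroring the rationale column):

```python
import math

n = 1_000_000

# Recommended range for the number of IVF partitions ("lists").
lo, hi = 4 * math.sqrt(n), 16 * math.sqrt(n)
print(f"recommended lists: {lo:.0f} to {hi:.0f}")  # 4000 to 16000

lists = 4_000
assert lo <= lists <= hi  # 4,000 sits at the low end of the range

# sampling_factor controls k-means training samples per cluster;
# 250 samples x 4,000 clusters covers the full 1M-row table.
sampling_factor = 250
assert sampling_factor * lists == n
```

Picking lists at the low end of the range means fewer, larger partitions; pushing it toward 16,000 would shrink each bucket but make the clustering phase more expensive.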
Results
GPU vs CPU: Index Build Time
Here is the main comparison at 1 million vectors:
| Method | Clustering Phase | Index Build Phase | Total Time |
|---|---|---|---|
| PGPU (GPU) | 9.7 seconds | ~342 seconds | 5 min 52 s |
| VectorChord (CPU, 4 threads) | ~1,260 seconds | ~342 seconds | 26 min 42 s |
Overall speedup: 4.6×. Total time went from nearly 27 minutes to under 6. But look at where the time actually went. The interesting part is in the clustering phase.
Where the GPU Wins: Clustering
The GPU finished k-means clustering of 1 million 768-dimensional vectors into 4,000 clusters in 9.7 seconds. The CPU, with 4 parallel threads, needed about 21 minutes for the same thing.
Clustering speedup: ~130×. 🚀
Let me break down what each path actually did:
- GPU (PGPU): Processed the data in 10 batches of 100K vectors, ran GPU k-means on each batch, then consolidated the 16,000 intermediate centroids down to 4,000 with a final GPU clustering pass. The whole pipeline (read data, compute on GPU, write centroids back to PostgreSQL) finished in under 10 seconds.
- CPU (VectorChord): Ran Lloyd's k-means algorithm with 4 threads for 10 full iterations over the entire dataset. Each iteration computes the distance from every vector to every centroid (1M × 4,000 × 768 = ~3 trillion floating-point operations per iteration), then reassigns clusters and recomputes centroids.
That is 3 trillion operations per iteration, times 10 iterations. On CPU, yeah, it takes a while 😅
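That per-iteration count is easy to verify, counting (as the post does) roughly one floating-point operation per vector-centroid-dimension triple:

```python
vectors, clusters, dims, iterations = 1_000_000, 4_000, 768, 10

# One distance evaluation touches every dimension once, so each k-means
# iteration costs on the order of vectors * clusters * dims operations.
ops_per_iteration = vectors * clusters * dims
total_ops = ops_per_iteration * iterations

print(f"{ops_per_iteration / 1e12:.2f} trillion ops per iteration")  # 3.07
print(f"{total_ops / 1e12:.1f} trillion ops over 10 iterations")     # 30.7
```

Roughly 30 trillion operations for the full training run, which is exactly the kind of embarrassingly parallel arithmetic a GPU chews through.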
The Rest of the Pipeline: Assignment and Index Construction
Both paths spend roughly the same time (~342 seconds) on the assignment and index construction step. This includes assigning each vector to its nearest centroid (a brute-force distance computation) and writing the on-disk index structure. In this version of PGPU, this step runs on CPU regardless of which path you took for clustering.
At 1M vectors, this shared step dominates the GPU path (342 out of 352 seconds). The GPU finishes clustering so fast that it spends most of its time waiting for the rest of the pipeline. So at this scale, the GPU advantage in total time is limited by this shared work. At larger scales, the clustering phase grows faster, so the GPU advantage in total time would increase. It is also worth noting that the assignment step itself is a distance computation that could benefit from GPU acceleration in future versions.
Scaling Behavior
I ran benchmarks at three dataset sizes to see how the advantage changes with scale:
| Dataset | GPU Total | CPU Total | Overall Speedup | GPU Clustering | CPU Clustering |
|---|---|---|---|---|---|
| 100K vectors (dim=768) | 7.1 s | 15.7 s | 2.2× | 0.7 s | ~9 s |
| 500K vectors (dim=768) | 54.9 s | 226.2 s | 4.1× | 2.7 s | ~180 s |
| 1M vectors (dim=768) | 351.6 s | 1,602.5 s | 4.6× | 9.7 s | ~1,260 s |
Three things I noticed:
GPU clustering stays cheap as data grows. 10× more data (100K to 1M) increased GPU clustering time from 0.7s to 9.7s (about 14×), a gentle curve given that the total k-means work grows with both vector count and cluster count. Batching keeps the GPU busy and efficient.
CPU clustering scales super-linearly. The same 10× data increase pushed CPU clustering from ~9s to ~1,260s (about 140×). K-means on CPU gets way more expensive as both vector count and cluster count grow.
The speedup gets better with scale. 2.2× at 100K, 4.1× at 500K, 4.6× at 1M. Extrapolating to 10M or 100M vectors, where CPU clustering alone could take hours, the gap in total build time should widen considerably.
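As a quick check on those observations, here are the growth ratios and per-scale clustering speedups computed straight from the table above:

```python
# Clustering times from the benchmark table (seconds).
gpu = {100_000: 0.7, 500_000: 2.7, 1_000_000: 9.7}
cpu = {100_000: 9.0, 500_000: 180.0, 1_000_000: 1_260.0}

# Going from 100K to 1M vectors (10x the data):
gpu_growth = gpu[1_000_000] / gpu[100_000]   # ~14x
cpu_growth = cpu[1_000_000] / cpu[100_000]   # ~140x
print(f"for 10x data: GPU clustering grew {gpu_growth:.0f}x, CPU grew {cpu_growth:.0f}x")

# Clustering speedup (CPU time / GPU time) at each scale:
for n in gpu:
    print(f"{n:>9,} vectors: clustering speedup ~{cpu[n] / gpu[n]:.0f}x")
```

The clustering speedup itself also widens with scale (from roughly 13× at 100K to the 130× at 1M), which is why the total-time advantage keeps improving.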
Key Takeaways
📝 Updated after feedback from the cuVS team to add more context around what these benchmarks show and where they have limits.
At this scale, GPU clustering made a clear difference. The 130× clustering speedup is what drove the 4.6× total speedup in our tests. At 1M vectors, k-means clustering was the most expensive part of the pipeline in the CPU path, and offloading it to the GPU cut that cost dramatically. That said, this is one specific scale (1M vectors), one specific version of VectorChord (1.1.0), and one specific configuration. The VectorChord project is actively evolving, and newer versions may shift where the bottlenecks are. The picture could look different at other scales or with other configurations.
Build speed is only part of the story. As I mentioned in the dataset section, these benchmarks measure build time only. A complete evaluation of GPU-accelerated indexing would also need to measure the recall and query latency of the resulting index. The data distribution matters a lot here. If you are seriously evaluating this for production, test with your own embeddings and measure the full picture.
Coherent memory matters more than I expected. This one surprised me. On a typical discrete GPU setup, moving data between CPU memory and GPU memory over PCIe is a real cost. Every batch of vectors needs to be copied from PostgreSQL's buffer pool in host RAM to GPU memory before the GPU can do anything, and then the results need to be copied back. With large datasets, this back-and-forth can eat into the clustering speedup quite a bit.
NVIDIA's coherent memory architecture changes this completely. On systems like the NVIDIA DGX Spark (powered by the NVIDIA GB10 Grace Blackwell Superchip), the NVIDIA GH200 Grace Hopper Superchip, and other NVIDIA Grace-based platforms, CPU and GPU share the same physical memory pool. There is no copy. The GPU just reads the vector data directly from the same memory PostgreSQL writes to. This is not just convenient, it removes a whole class of overhead from the pipeline. I think this is a big part of why the GPU path is so fast end-to-end in these benchmarks. Not just during the compute phase, but across the entire read-cluster-write cycle.
If you are evaluating GPU-accelerated database workloads, pay attention to the memory architecture of your platform. Coherent memory gives you the GPU compute advantage without the data movement tax.
The break-even point was low in our tests. Even at 100K vectors, a pretty small dataset, the GPU path was 2.2× faster. If your workload rebuilds vector indexes regularly (new data coming in, model updates, experimenting with different embedding models), GPU acceleration could save meaningful time even at modest scales.
When Does GPU Acceleration Make Sense?
It helps most when:
- Your embedding tables are large (1M+ rows) and clustering dominates the build time
- You rebuild indexes often: new data, retrained models, or experimenting with different embeddings
- Your vectors are high-dimensional (768+), which means more compute per k-means iteration
- You have tight maintenance windows and cannot wait hours for a rebuild
- You are on a coherent memory platform where there is no data transfer overhead
It matters less when:
- Your embedding table is small (<50K rows) and you rarely rebuild
- You use low-dimensional vectors (<128 dims) where CPU k-means is already fast enough
- Your bottleneck is query latency, not index build time (PGPU does not accelerate queries yet)
- You do not have access to an NVIDIA GPU in your database infrastructure
Try It Yourself
The whole benchmarking environment is open source and fully containerized. Clone the repo, follow the instructions in the README, and you should be up and running in a few minutes (well, after the Docker build finishes, which takes a bit 😄).
You will need an NVIDIA GPU with CUDA support and Docker with the NVIDIA Container Toolkit. The README has all the details.
What's Next
The 4.6× speedup at 1M vectors is a starting point, not a ceiling. Here is what I want to try next:
- Larger datasets (10M–100M vectors): The GPU clustering speedup grows with scale. At 100M vectors, CPU k-means could take hours. GPU should stay in the low minutes.
- Higher dimensions (1536, 3072): OpenAI's text-embedding-3-large produces 3072-dimensional vectors. More dimensions means more math per iteration. That is exactly what GPUs are good at.
- Multi-level indexes: VectorChord supports hierarchical IVF with two levels of partitioning (e.g., lists=[200, 8000]). The clustering work grows with the number of levels, which would make the GPU advantage even bigger.
- GPU-accelerated assignment: The centroid assignment step in phase 2 is also a brute-force distance computation. Offloading it to the GPU could reduce the part of the pipeline that currently dominates the GPU path's total time.
- GPU-accelerated queries: PGPU currently only accelerates index building. Future versions could also run the distance computations during query time on the GPU.
- Real embeddings and index quality: Repeating these tests with real-world embeddings and measuring recall/latency alongside build time would give a much more complete picture. That is probably the most important next step.
Vector search workloads are growing fast. GPU-accelerated indexing in PostgreSQL is still early, but the tools are open source and you can test them today. I am looking forward to seeing how this space evolves.