Qdrant TurboQuant Deep Dive: Game-Changer or Just Hype?

It’s a balancing act between memory usage and search accuracy. The default option is Float32, which offers top-notch clarity but demands a lot of memory. A straightforward fix is scalar quantization, where each value is shrunk down to fewer bits (giving about 4× compression) with just a small hit to accuracy. Binary quantization goes the extra mile, hitting a 32× compression ratio, but the search results may become unreliable because of data loss. Meanwhile, while product quantization is potentially more effective, it’s trickier to configure and handle in a live environment.

In May 2026, Qdrant rolled out TurboQuant, a new quantization approach. Their claim was: “TurboQuant slashes memory usage while keeping search quality steady.” TurboQuant seems like exactly the upgrade developers working on vector search have been waiting for.

Still, I was curious whether TurboQuant performs just as well when measured against varying datasets. Does it genuinely outperform standard quantization methods, or does its edge depend on what the data is?

I carried out a set of tests matching TurboQuant against widely recognized quantization methods, including scalar and binary quantization. The objective was to figure out where TurboQuant shines, where its downsides are, and whether it can serve as a reliable go-to approach for vector search.

I expect this will guide engineers, machine learning specialists, and users of vector databases in grasping where TurboQuant slots in relative to existing quantization methods, especially as they take things from testing into a full-scale system.

1. What is Quantization?

A Float32 value in a vector takes up 4 bytes. So, a 1536-dimension embedding consumes 6 KB per vector; scaling to a million vectors, the index alone eats up around 6 GB.
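The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the helper name `float32_index_bytes` is made up for illustration, and it counts raw vector storage only, ignoring any index overhead.

```python
def float32_index_bytes(num_vectors: int, dims: int) -> int:
    """Raw storage for uncompressed Float32 vectors (4 bytes per value)."""
    return num_vectors * dims * 4

per_vector = float32_index_bytes(1, 1536)
one_million = float32_index_bytes(1_000_000, 1536)

print(f"{per_vector / 1024:.1f} KiB per vector")
print(f"{one_million / 1e9:.2f} GB for one million vectors")
```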

That’s where quantization steps in. It stores each vector value in fewer bits. Scalar quantization is the most common route. Under this method, the smallest and largest values in each dimension are identified, then this span is split into 255 evenly sized intervals, giving 256 levels. Every value is rounded to the nearest level, and that level’s index (0 to 255) is stored as one byte instead of the original four.

That Float32 embedding is now a uint8 version, compressed to a quarter of what it was — a 4× reduction in storage.
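To make the procedure concrete, here is a minimal sketch of per-dimension scalar quantization in plain Python. The function names and the three toy 6D vectors are invented for illustration; a production implementation (including Qdrant's) may choose its ranges differently, for example from quantiles rather than the strict minimum and maximum.

```python
def fit_ranges(vectors):
    """Per-dimension (min, max) over the whole dataset."""
    return [(min(col), max(col)) for col in zip(*vectors)]

def quantize(vector, ranges, intervals=255):
    """Map each float to the index (0..255) of the nearest level in its dimension."""
    codes = []
    for x, (lo, hi) in zip(vector, ranges):
        step = (hi - lo) / intervals or 1.0      # guard against constant dimensions
        code = round((x - lo) / step)
        codes.append(max(0, min(intervals, code)))
    return bytes(codes)                          # one byte per dimension

def dequantize(codes, ranges, intervals=255):
    """Recover approximate floats from the stored level indices."""
    return [lo + c * (hi - lo) / intervals for c, (lo, hi) in zip(codes, ranges)]

vectors = [
    [0.12, -0.80, 0.45, 0.03, -0.27, 0.91],
    [-0.40, 0.35, 0.10, -0.66, 0.58, -0.15],
    [0.77, -0.05, -0.92, 0.48, 0.20, 0.33],
]
ranges = fit_ranges(vectors)
codes = quantize(vectors[0], ranges)
restored = dequantize(codes, ranges)
errors = [abs(a - b) for a, b in zip(vectors[0], restored)]

print(list(codes))
print([round(e, 5) for e in errors])
```

Each error is bounded by half a step in its dimension, which is why 8-bit scalar quantization usually costs little accuracy.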

Below, Figure 1 illustrates this procedure on a 6D vector.

Figure 1: Scalar quantization process and breakdown. The slight error, called quantization error, adds up across the dimensions during a dot product calculation, Image by author.

The minor error listed in the bottom row is the quantization error. It builds up across the vector’s six dimensions during a dot product, which is how similarity scores get slightly skewed.

More aggressive compression, such as 8× (4-bit), 16× (2-bit), or 32× (1-bit), shrinks the vector further but at the cost of a larger error. Figure 2 shows how a Float32 number deviates under various levels of quantization.

Figure 2: How different compression rates stack up against the original. *Image by author.*

The connection between compression and accuracy (or memory vs. recall) is clear: more compression results in lower recall.
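The trend is easy to reproduce on a toy example. The sketch below assumes a plain uniform grid over [-1, 1] at each bit depth, which is a simplification and not how any particular database lays out its levels; it only shows that the average rounding error grows as bits are removed.

```python
def quantize_uniform(x, bits, lo=-1.0, hi=1.0):
    """Round x to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    intervals = 2 ** bits - 1
    step = (hi - lo) / intervals
    code = max(0, min(intervals, round((x - lo) / step)))
    return lo + code * step

values = [i / 500 - 1.0 for i in range(1001)]    # evenly spaced samples in [-1, 1]

mean_error = {}
for bits in (8, 4, 2, 1):
    errs = [abs(v - quantize_uniform(v, bits)) for v in values]
    mean_error[bits] = sum(errs) / len(errs)
    print(f"{bits}-bit ({32 // bits}x compression): mean abs error {mean_error[bits]:.4f}")
```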

2. The Real Issue Isn’t Compression Ratio

The actual question is: what structure within the vector survives compression?

Conventional quantizers take a brute-force approach. Scalar quantization uses a fixed grid across all dimensions, even if a dimension is pure noise. Binary quantization clips everything to just a sign bit. Neither technique distinguishes between dimensions carrying real signal and those that don’t.

Qdrant 1.18 reshapes the approach with its built-in TurboQuant. Drawing from a Google Research paper accepted at ICLR 2026, TurboQuant first rotates the vector before compressing it. This random rotation distributes variance evenly, so each bit is able to carry more meaningful information.

TurboQuant isn’t an improvement because it uses fewer bits. It’s an improvement because it makes the vector easier to compress before committing those bits.

Figure 3 below breaks down the differences between TurboQuant and the rest.

Scalar Quant forces the same grid onto every dimension. Think of it as making everyone wear the same shoe size, no matter their feet.
Binary Quant maps each value to 0 or 1 based on a threshold: 0 or higher becomes 1; anything below 0 becomes 0. It’s like reducing shoes to just “left or right” — fast and simple, but you lose all sense of shape and fit.
Product Quant builds custom codebooks for separate subspaces. Tailoring a shoe for each foot — an excellent fit, but far more resource-intensive.

TurboQuant evens out the dimensions before compression, then fits them with a single, well-designed codebook. Think of it as resizing every foot to a standard and offering one perfect shoe design that fits all.

*Figure 3: Four quantization types compared: Scalar, Binary, Product, and TurboQuant. Image by author with help of ChatGPT.*

3. TurboQuant at a Glance: Rotate First, Compress Second

Vectors produced by an embedding model aren’t random; they contain underlying patterns.

A 1536-dimensional embedding often holds its key signal in only a fraction of coordinates. The leftover dimensions tend to bring in noise, making distance measurements weaker and less dependable.

3.1 The TurboQuant Pipeline

The approach is straightforward: before compressing anything, apply a random orthogonal rotation.

Rotations preserve distances — they simply distribute the energy so that each dimension holds about the same amount of information. Once rotated, a pre-built codebook can compress every dimension uniformly, without any per-dimension fine-tuning or training on your own dataset.
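Both properties can be checked numerically. The sketch below uses a randomized Hadamard transform (random sign flips followed by an orthonormal Walsh-Hadamard transform) as a stand-in for a random orthogonal rotation; the vectors, the seed, and the function names are illustrative assumptions, not TurboQuant's actual code.

```python
import math
import random

def hadamard_rotate(vector, signs):
    """Flip signs, then apply an orthonormal fast Walsh-Hadamard transform."""
    x = [v * s for v, s in zip(vector, signs)]
    n = len(x)                                   # must be a power of two
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]

a = [3.0, 0.1, 0.0, -0.1, 0.0, 0.05, 0.0, 0.0]   # energy concentrated in one coordinate
b = [2.0, -0.2, 0.1, 0.0, 0.3, 0.0, 0.0, -0.1]
ra, rb = hadamard_rotate(a, signs), hadamard_rotate(b, signs)

print("dot before/after:", round(dot(a, b), 6), round(dot(ra, rb), 6))
print("largest squared coordinate before/after:",
      round(max(v * v for v in a), 3), round(max(v * v for v in ra), 3))
```

The dot product and the vector length are unchanged, while the single dominant coordinate is smeared across all eight.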

See Figure 4 below for a summary of the pipeline.

*Figure 4: The TurboQuant pipeline: the rotation step makes the coordinates more predictable before any bits are allocated. Image by author with help of ChatGPT.*

3.2 How Does Rotation Affect the Coordinates?

*Figure 5: Before and after applying TurboQuant’s rotation — energy is spread evenly across all dimensions, while distances remain the same.* *Image by author.*

As shown in Figure 5, prior to rotation, a small number of dimensions hold most of the signal energy, while the remaining dimensions contain weaker signal and often more noise.

Once the rotation is applied, every dimension ends up holding roughly the same energy level and the same amount of information.

But does this energy redistribution truly preserve meaningful information and maintain the relationship between vectors compared to their original form?

To verify this, I ran a quick experiment with two 4D vectors. I applied TurboQuant’s transformation to Vector A, then at inference time rotated Vector B using the identical rotation matrix and measured cosine similarity within the rotated space. That similarity was then compared against the cosine similarity of the original Vector A versus the original Vector B.

3.3 The Standard TurboQuant Process

Figure 6: A visual walkthrough of TurboQuant. *Image by author*

As illustrated in Figure 6, after running TurboQuant on the original Vector A, the similarity between the transformed Vector A and the rotated Vector B stays almost the same as the similarity between the originals. This indicates that the essential geometric relationship between vectors is preserved, which is what keeps recall high.
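A simplified version of this kind of experiment can be sketched as follows. The two 4D vectors are made up, the rotation is a fixed 4×4 Hadamard matrix, and the compression step is a plain 4-bit uniform grid rather than TurboQuant's codebook, so the numbers will not match Figure 6; the point is only the shape of the comparison.

```python
import math

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Orthonormal 4x4 Hadamard matrix: rows are mutually orthogonal and have length 1.
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

def rotate(v):
    return [sum(h * x for h, x in zip(row, v)) for row in H4]

def quantize_4bit(v, lo=-1.0, hi=1.0):
    """Snap each coordinate to one of 16 evenly spaced levels (stand-in codebook)."""
    step = (hi - lo) / 15
    return [lo + round((x - lo) / step) * step for x in v]

vector_a = normalize([0.9, 0.1, -0.2, 0.3])      # stored vector
vector_b = normalize([0.7, 0.3, -0.1, 0.5])      # query vector

original = cosine(vector_a, vector_b)
rotated_only = cosine(rotate(vector_a), rotate(vector_b))
compressed = cosine(quantize_4bit(rotate(vector_a)), rotate(vector_b))

print(f"original {original:.4f}  rotated {rotated_only:.4f}  rotated+quantized {compressed:.4f}")
```

Rotation alone leaves the cosine similarity unchanged up to floating-point error; any remaining gap comes from the quantization step.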

3.4 How Qdrant Implements TurboQuant Inside the Database

Qdrant handles this through two distinct processes:

3.4.1 The Indexing Process

*Figure 7: The flow of indexing a vector with TurboQuant in Qdrant. Image by author with help of ChatGPT.*

The indexing pipeline, shown in Figure 7, takes a vector through these steps:

original vector → normalize or prepare based on the metric → pad dimensions if required → apply Hadamard rotation → optional per-coordinate calibration: x → (x + shift) · scale → Lloyd-Max centroid assignment → generate packed TurboQuant codes
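The steps above can be sketched in plain Python. This is a toy under several assumptions, not Qdrant's implementation: it uses the cosine metric, skips the optional calibration step, rescales rotated coordinates by the square root of the padded dimension so they are roughly unit variance, and uses the textbook 2-bit Lloyd-Max codebook for a standard Gaussian. All function names are hypothetical.

```python
import math
import random

# 2-bit Lloyd-Max codebook for a unit-variance Gaussian (an assumption of this sketch).
CENTROIDS = (-1.5104, -0.4528, 0.4528, 1.5104)
BOUNDARIES = (-0.9816, 0.0, 0.9816)

def pad_to_power_of_two(vector):
    size = 1
    while size < len(vector):
        size *= 2
    return list(vector) + [0.0] * (size - len(vector))

def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(values)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return [v / math.sqrt(len(x)) for v in x]

def encode(vector, signs):
    norm = math.sqrt(sum(v * v for v in vector))
    unit = [v / norm for v in vector]                    # prepare for the cosine metric
    padded = pad_to_power_of_two(unit)
    rotated = hadamard([v * s for v, s in zip(padded, signs)])
    spread = math.sqrt(len(rotated))                     # bring coordinates to ~unit variance
    codes = [sum(v * spread > b for b in BOUNDARIES) for v in rotated]
    packed = bytearray()
    for i in range(0, len(codes), 4):                    # four 2-bit codes per byte
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return codes, bytes(packed)

rng = random.Random(7)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]      # shared by all vectors in this sketch
codes, packed = encode([0.9, 0.1, -0.2, 0.3, 0.05, -0.4], signs)

print(codes, packed.hex())
```

A 6-dimensional input is padded to 8 coordinates and ends up as two bytes, which is the 16× reduction expected from 2 bits per dimension (before any per-vector metadata).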

For TurboQuant specifically, Qdrant persists the details summarized in Table 1:

*Table 1: Data Qdrant stores for TurboQuant. Source: author*

One key addition from Qdrant is Length Renormalization, also called the Scaling Factor. This takes place after quantization — Qdrant measures how much shorter the quantized reconstruction is compared to the original vector, saves that ratio as a per-vector scaling factor, and applies it during query-time scoring.

Scaling factor = original_length / centroid_reconstruction_length

Why is Length Renormalization Necessary?

After quantization, there is a consistent pattern:

The quantized vector still points in the correct direction, but its magnitude is too small.

In other words, quantization error systematically reduces the length of every vector. At query time, when computing the dot product between a quantized database vector and a rotated-and-encoded query vector, the result involves a slightly-shorter-than-it-should-be vector, producing a score that is consistently too low. Qdrant refers to this as the “recall-degrading bias”.

The fix is to multiply by a correction factor during scoring instead of modifying the stored vectors. This approach is straightforward and effective.
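The shrinkage and its correction can be illustrated with synthetic data. In this sketch, Gaussian samples stand in for the coordinates of a rotated vector, the codebook is the textbook 2-bit Lloyd-Max quantizer for a standard Gaussian, and the query is a noisy copy of the stored vector; all of these are assumptions made for the demonstration.

```python
import math
import random

CENTROIDS = (-1.5104, -0.4528, 0.4528, 1.5104)
BOUNDARIES = (-0.9816, 0.0, 0.9816)

def reconstruct(values):
    """Replace each coordinate with the centroid of the bin it falls into."""
    return [CENTROIDS[sum(v > b for b in BOUNDARIES)] for v in values]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(42)
stored = [rng.gauss(0.0, 1.0) for _ in range(4096)]      # stands in for a rotated vector
query = [s + rng.gauss(0.0, 0.5) for s in stored]        # a query close to the stored vector

approx = reconstruct(stored)
scaling_factor = length(stored) / length(approx)         # original / reconstruction length

true_score = dot(stored, query)
raw_score = dot(approx, query)
corrected_score = raw_score * scaling_factor

print(f"scaling factor:  {scaling_factor:.4f}")
print(f"true score:      {true_score:.1f}")
print(f"raw score:       {raw_score:.1f}")
print(f"corrected score: {corrected_score:.1f}")
```

The factor comes out slightly above 1, and multiplying by it moves the raw score back toward the true one without touching the stored codes.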

3.4.2 The Query-Time Process

*Figure 8: How queries are matched against TurboQuant-encoded vectors in Qdrant. Image by author with help of ChatGPT.*

Figure 8 outlines the query pipeline. The query vector is rotated and transformed into a SIMD-friendly scoring format. Qdrant then uses asymmetric scoring to compare this encoded query directly against the packed TurboQuant codes stored for each database vector.

After computing the raw score, the stored scaling factor is applied as a multiplier.
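As a rough illustration of asymmetric scoring, the sketch below keeps the (already rotated) query in floating point and compares it against packed 2-bit codes by looking up each code's centroid. The codebook, the packing order, and the function names are assumptions for this example; Qdrant's SIMD-friendly query format is not reproduced here.

```python
# Textbook 2-bit Lloyd-Max centroids for a unit-variance Gaussian (assumed codebook).
CENTROIDS = (-1.5104, -0.4528, 0.4528, 1.5104)

def unpack_codes(packed, count):
    """Read `count` 2-bit codes, four per byte, lowest bits first."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]

def asymmetric_score(rotated_query, packed, scaling_factor):
    """Dot product of a float query with the centroid reconstruction, then rescaled."""
    codes = unpack_codes(packed, len(rotated_query))
    raw = sum(q * CENTROIDS[c] for q, c in zip(rotated_query, codes))
    return raw * scaling_factor

packed = bytes([0b11100100])                 # codes 0, 1, 2, 3 (lowest bits first)
rotated_query = [1.0, 0.5, -0.5, 2.0]
score = asymmetric_score(rotated_query, packed, scaling_factor=1.05)

print(unpack_codes(packed, 4), round(score, 5))
```

Only the stored side is compressed, so the query contributes no quantization error of its own.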

4. Which Method Should You Try First?

Qdrant supports several quantization strategies, and TurboQuant itself offers multiple bit-compression levels including bits4, bits2, bits1.5, and bits1.

As stated in their documentation, lower bit depths achieve greater compression but sacrifice some accuracy.

Figure 9 provides a decision guide for choosing the right compression method if you’re unsure where to start.

*Figure 9: Decision flowchart: begin at the top and follow your constraints. The green box marks the recommended default starting point. Image by author, based on the Qdrant article at https://qdrant.tech/blog/qdrant-1.18.x/*

5. Getting Started: Your First Experiment

To switch on TurboQuant, simply update one setting in your Qdrant setup. There’s no need to touch your existing collections.

Refer to the Python code below for a quick setup.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# New collection: just one config change
client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            bits=models.TurboQuantBitSize.BITS4,
            always_ram=True,
        )
    ),
)

# Existing collection: apply the update without rebuilding vectors
client.update_collection(
    collection_name="existing_collection",
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            bits=models.TurboQuantBitSize.BITS4,
            always_ram=True,
        )
    ),
)
```

For extra configuration options, visit the Qdrant TurboQuant documentation.

6. Benchmark: Putting Theory to the Test

To see how TurboQuant stacks up against other Qdrant quantizers using real-world embeddings, I ran a series of tests across varying scales (10K, 50K, and 100K vectors), comparing multiple quantization strategies.

6.1 Why Use the DBpedia Dataset?

The DBpedia embeddings dataset (License: CC-BY-SA 4.0 and GNU Free Documentation License) was selected because of its highly anisotropic nature, with a coordinate variance ratio of 233.5x. In simple terms, a small number of dimensions hold most of the meaningful data, while the rest contribute mostly noise. This kind of distribution is exactly where TurboQuant’s rotation technique should offer the biggest advantage, and where scalar quantization’s rigid grid tends to waste the most bits.
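One way to compute such a ratio is sketched below. It assumes the "coordinate variance ratio" means the largest per-dimension variance divided by the smallest, which is my reading rather than a definition given by the dataset; the toy vectors are invented and far smaller than the real embeddings.

```python
from statistics import pvariance

def coordinate_variance_ratio(vectors):
    """Largest per-dimension variance divided by the smallest (assumed definition)."""
    variances = [pvariance(column) for column in zip(*vectors)]
    return max(variances) / min(variances)   # fails if a dimension is constant

toy_embeddings = [
    [4.0, 0.1, 0.0],
    [-4.0, -0.1, 0.2],
    [2.0, 0.1, -0.2],
    [-2.0, -0.1, 0.0],
]
ratio = coordinate_variance_ratio(toy_embeddings)

print(f"coordinate variance ratio: {ratio:.1f}x")
```

A ratio near 1 would mean the dimensions are already balanced, in which case rotation has little left to fix.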

For specifics on the test environment, see Appendix section 9.2.

6.2 Recall Performance at Different Scales

Figure 10 below illustrates the recall results.

*Figure 10: Recall@10 at 50K and 100K vectors. Source: author*

Four key takeaways stand out:

- TQ recall stays steady as the dataset scales up. Binary Quantization’s recall drops from 0.916 to 0.78 when the dataset doubles from 50K to 100K vectors, but TurboQuant variants hold their ground far better. The rotation step lets each bit capture more useful information, making TQ more resilient to corpus growth.
- Most TQ variants match Float32 and Scalar Quantization in recall. Apart from TQ 1-bit and TQ 4-bit, TurboQuant results stay broadly on par with the Float32 baseline and Scalar Quantization.
- TQ 4-bit delivers the best balance of accuracy and compression. It achieves recall close to Scalar Quantization while using about half the storage: 8× compression compared to Scalar’s 4×. At 100K vectors, TQ 4-bit hits 0.965 recall, just 1.5 points behind Scalar’s 0.980. With rescoring applied, the difference vanishes: 0.996 for TQ 4-bit versus 0.993 for Scalar.
- Rescoring closes most of the recall gap, even under aggressive compression (TQ 1-bit). TQ 1-bit sees a notable boost from rescoring. Binary Quantization with rescoring can perform well on smaller datasets, but its recall drops off more quickly as the dataset grows.
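For readers reproducing numbers like these, Recall@10 is commonly computed by comparing each approximate result list with the exact top 10 for the same query. The sketch below assumes that convention; the point IDs are invented and the helper names are not taken from the benchmark code.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbours that the approximate search returned."""
    truth = set(exact_ids[:k])
    return len(truth.intersection(approx_ids[:k])) / len(truth)

def mean_recall(approx_results, exact_results, k=10):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

exact_results = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
]
approx_results = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 99],           # missed one true neighbour
    [11, 12, 13, 14, 15, 16, 17, 18, 98, 97],  # missed two
]
average = mean_recall(approx_results, exact_results)

print(f"mean recall@10: {average:.2f}")
```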

6.3 Latency Performance at Different Scales

Figure 11 below shows the latency results.

*Figure 11: Median query latency at 50K and 100K vectors. Source: author*

The latency picture is straightforward: rescoring adds only a small overhead. At 100K vectors, TQ 4-bit with rescore completes in 6.4 ms, compared with 7.6 ms for Float32 and 6.8 ms for Scalar Quantization. Across all TQ variants, rescoring increases latency slightly, but queries still remain faster than the Float32 baseline.

6.4 Storage Footprint

Figure 12 below compares the storage requirements of each quantization method.

*Figure 12: Storage size across methods. Solid bars = quantized index in RAM. Hatched = original float32 on disk (rescore only). Source: author*

TQ 1-bit matches Binary Quantization in storage: both come in at 18 MB, roughly 32× compression.
TQ 2-bit and TQ 4-bit require more space to retain more detail. TQ 2-bit uses about double the storage of TQ 1-bit, and TQ 4-bit uses roughly 4× more. Even so, both remain significantly smaller than Scalar Quantization.
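These sizes follow directly from the bit depth. The sketch below assumes 100K vectors of 1536 dimensions (the size used in the setup example earlier) and counts only the packed codes, ignoring per-vector metadata such as scaling factors, so it is an estimate rather than a measurement.

```python
def quantized_mib(num_vectors, dims, bits_per_dim):
    """Size of the packed codes in MiB, ignoring any per-vector metadata."""
    return num_vectors * dims * bits_per_dim / 8 / 2**20

sizes = {}
for label, bits in [("Float32", 32), ("Scalar (8-bit)", 8), ("TQ 4-bit", 4),
                    ("TQ 2-bit", 2), ("TQ 1-bit / Binary", 1)]:
    sizes[label] = quantized_mib(100_000, 1536, bits)
    print(f"{label:<18} {sizes[label]:7.1f} MiB  ({32 // bits}x)")
```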

6.5 Index Building Time

Figure 13 below presents the index build time results.

*Figure 13: Index build time covers HNSW construction, quantization, and calibration. Source: author*

TQ is the quickest to build at 100K vectors (64s for 50K vectors, 179s for 100K), largely because extracting sign bits is a lightweight operation.
TQ 4-bit takes 57s / 224s, and TQ 1.5-bit takes 75s / 239s. Both are faster than Float32 (110s / 289s), suggesting that the rotation and codebook calibration steps add only a modest indexing cost.
TQ 2-bit is the slowest to build at 100K vectors (73s / 357s). This may stem from an uncommon bit-packing pattern or implementation-specific overhead. Even so, it still finishes indexing 100K vectors in under 6 minutes.

Indexing time is sensitive to the surrounding environment, so treat these figures as indicative rather than definitive. Actual results will depend on CPU, memory bandwidth, disk I/O, parallelism, and overall machine load during the run.

7. Practical Takeaways

All in all, TurboQuant shows real promise when the goal is to balance compression strength with reliable retrieval quality. The findings reveal that compressed formats don’t all perform equally well as the dataset scales. Certain methods see a sharp drop in recall, while others stay remarkably close to the Float32 baseline.

- TQ 2-bit and TQ 4-bit maintain steady recall as the corpus expands. In contrast, Binary Quantization and TQ 1-bit see a more pronounced decline with larger datasets. This indicates that TurboQuant’s rotation step plays a key role in retaining meaningful information per bit, making the 2-bit and 4-bit variants more resilient as data volume grows.
- TQ 4-bit strikes the best balance between recall and compression ratio. It achieves recall nearly on par with Scalar Quantization at roughly twice the compression: Scalar Quantization delivers around 4×, while TQ 4-bit reaches around 8×. In practical terms, TQ 4-bit cuts the memory footprint roughly in half relative to Scalar Quantization.
- TQ 1.5-bit paired with rescoring is the top pick for maximum compression. It delivers around 24× compression while preserving recall near Float32 levels after rescoring is applied. This is ideal when storage is the primary bottleneck but the system must still deliver acceptable retrieval quality. Without rescoring, heavy compression sacrifices too much detail. With rescoring, most of that performance gap is recoverable.
- TurboQuant combined with rescoring is the most reliable approach when balancing latency and accuracy, consistent with established best practices. Rescoring does introduce some extra latency, and its benefits are most noticeable under extreme compression. Still, it represents a sensible tradeoff, giving the system a path to apply heavier compression without a significant drop in retrieval quality.

To sum up, TurboQuant goes beyond just shrinking memory usage. TQ 4-bit works best as a general-purpose choice, while TQ 1.5-bit with rescoring is the smarter pick when compression is the overriding concern. The recommended pattern is to pair TurboQuant with rescoring.

Important: These figures are not meant to serve as production guidelines — they should inform your own evaluation. Always benchmark performance on your own embeddings, queries, hardware setup, and recall targets before moving into production.

8. TurboQuant’s Limitations

*Figure 14: Limitations of TurboQuant implementation on Qdrant. Image by author*

TurboQuant moves the compression tradeoff in a better direction — but it doesn’t eliminate it entirely.

It’s also still early. TurboQuant launched on May 11, 2026, so real-world production track records are still thin. The prudent approach is straightforward: run your own benchmarks first, then decide whether it should become the default.

Below are some important limitations worth considering. A consolidated overview is shown in Figure 14:

The first limitation is maturity. Qdrant’s benchmark numbers are encouraging, but your environment may tell a different story. Your embedding model, query patterns, filtering logic, and data distribution might not align with the benchmark datasets. So TurboQuant should be viewed as a compelling option — not an automatic swap-in.

TurboQuant can also be slower than Binary Quantization at equal storage sizes. This is relevant if your primary concern is throughput or raw speed. When speed matters more than recall, Binary Quantization remains the stronger candidate. TurboQuant shines when the priority is achieving better recall within a tight memory budget.

There’s also a calibration overhead. TurboQuant requires a one-time calibration step for each segment. This typically takes seconds rather than minutes, but it’s still an added cost to account for. If your system generates many segments or rebuilds indexes frequently, this extra step should be factored into planning.

Distance metric is another constraint. TurboQuant works best with L2, dot product, and cosine similarity, because rotation preserves these distance relationships. However, it does not preserve L1 (Manhattan) distance in the same way. L1 distance can still be used, but each comparison requires full vector reconstruction, which can slow queries down. If Manhattan distance is critical for your workloads, Scalar Quantization is the safer bet.
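A small numerical check shows why L1 behaves differently. The sketch uses a fixed orthonormal 4×4 Hadamard matrix as the rotation and two hand-picked vectors; it is only an illustration of the geometry, not of Qdrant's code path.

```python
import math

# Orthonormal 4x4 Hadamard matrix used as a simple rotation.
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

def rotate(v):
    return [sum(h * x for h, x in zip(row, v)) for row in H4]

def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

a = [1.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 0.0]
ra, rb = rotate(a), rotate(b)

print("L2 before/after:", l2(a, b), l2(ra, rb))
print("L1 before/after:", l1(a, b), l1(ra, rb))
```

The L2 distance is identical before and after rotation, while the L1 distance doubles, so scores computed in the rotated space cannot stand in for Manhattan distance.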

As the test results demonstrate, TQ 1-bit is not a reliable choice. TQ 1-bit provides extremely high compression, but recall can fall too far. The rotation step offers some help, yet 1 bit per dimension is frequently insufficient to preserve enough structure at scale. Consider adding rescoring if TQ 1-bit fails to meet expectations. Alternatively, TQ 1.5-bit appears to be a more practical floor — it still delivers strong compression while keeping recall far more stable, making it a safer pick than TQ 1-bit for aggressive compression scenarios.

The core message isn’t “always use TurboQuant.” It’s to measure what matters for your specific data. TurboQuant improves the direction of the tradeoff, helping reduce recall loss before the bit budget is exhausted. But compression is never free — you still need to weigh memory, speed, recall, and distance behavior against each other.

In short, TurboQuant is a powerful new option — especially effective with rescoring and moderate bit-depths. But it shouldn’t be adopted without testing. Benchmark it thoroughly on your own embeddings and evaluate carefully before migrating to production.

9. Appendix

9.1 Quantization Support in Popular Vector Databases

Figure 15 below summarizes the quantization offerings across five leading vector databases for easy comparison.

Qdrant was among the first platforms to bring TurboQuant to market.

*Figure 15: Quantization support matrix across Qdrant, Pinecone, Weaviate, Milvus, and pgvector. Source: author*

9.2 Test Environment

Machine: Apple M3, 16 GB RAM, macOS 15.6.1
Testing database:
- Qdrant v1.18.0, single-node Docker, no resource limits
- HNSW with default parameters (m=16, ef_construct=100)
- Distance: Cosine
Dataset:
