As a long-time maintainer of Jaeger, I’ve seen users ask for ClickHouse support repeatedly over the years. With Jaeger v2.18.0, we’ve finally made it happen. What thrills me most isn’t simply that ClickHouse is now an option. It’s that its architecture feels almost purpose-built for large-scale telemetry: it absorbs massive, append-only write streams and runs complex analytical aggregations quickly, giving teams an efficient, battle-tested storage backend.
For those unfamiliar with the project, Jaeger is a distributed tracing platform and a graduated Cloud Native Computing Foundation (CNCF) project, designed to monitor and troubleshoot complex microservices environments. It follows requests across service boundaries to reveal latency bottlenecks and root causes, helping teams shorten their mean time to repair (MTTR). With native ClickHouse integration, Jaeger can now use columnar storage to deliver fast queries and high compression ratios across billions of spans.
In this article, I’ll cover why ClickHouse is an excellent fit for trace storage, how the underlying schema is structured, and how you can start using it with Jaeger right away.
Why columnar storage comes out ahead
At its heart, the tracing challenge has two sides: storing enormous volumes of semi-structured event data, and then searching that data quickly across many dimensions, such as service, operation, tags, duration, time range, and trace ID. Cassandra and Elasticsearch have both served the Jaeger community reliably, but they carry operational costs: indexing overhead adds latency and expense, scaling becomes complicated, and retention decisions force difficult compromises.
High-throughput ingestion and low-latency queries
ClickHouse is a column-oriented OLAP database engineered for precisely these demands: high-throughput ingestion, aggressive compression, and fast analytical queries. For tracing, this is nearly a perfect match. Trace data is inherently repetitive—the same service names, operation names, status codes, and tags recur constantly. A columnar layout excels with that kind of repetition.
Compression that truly makes a difference
We observed substantial compression gains on trace data. Service names like “auth-service” or “payment-gateway” show up hundreds of thousands of times, and the same goes for operation names, tag keys, and status codes. A row-oriented database interleaves these values with every other field in the row, so the compressor has a hard time exploiting the redundancy. ClickHouse stores each column contiguously and, because the table is sorted by its primary key, places identical values next to each other, which makes them extremely easy to compress. The outcome? An 8.6× compression ratio on the spans table in our benchmarks.
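To make the clustering effect concrete, here is a minimal Python sketch using run-length encoding (RLE). ClickHouse actually compresses column data with general-purpose codecs such as LZ4 or ZSTD, so RLE is only a stand-in chosen because its effect is easy to count; the rows and the `run_length_encode` helper are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive identical values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

# Hypothetical spans as (service_name, operation_name, duration_us) rows.
rows = [
    ("auth-service", "login", 120),
    ("payment-gateway", "charge", 340),
    ("auth-service", "login", 95),
    ("payment-gateway", "refund", 410),
    ("auth-service", "logout", 60),
    ("payment-gateway", "charge", 280),
]

# Row-oriented layout: fields of each row are interleaved,
# so identical service names are never adjacent.
row_layout = [field for row in rows for field in row]

# Column-oriented layout, sorted by service first: the service_name column
# is stored contiguously, so repeated values sit next to each other.
service_column = [row[0] for row in sorted(rows)]

print(len(run_length_encode(row_layout)))  # 18 runs: nothing collapses
print(run_length_encode(service_column))   # [('auth-service', 3), ('payment-gateway', 3)]
```

With real data the same idea scales: a sorted column with a handful of distinct service names collapses into a handful of runs no matter how many spans it holds.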
Real-time analytics
ClickHouse also enables more sophisticated analytical queries on trace data. Because aggregations run efficiently on columnar storage, Jaeger v2.18 adds native ClickHouse support for Service Performance Monitoring (SPM), computing service-level latency, call rates, and error rates directly from your stored spans. Teams can therefore derive core health and performance metrics for their microservices straight from trace data, without relying on an external metrics pipeline.
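The following sketch shows, in plain Python, the general shape of a per-service SPM aggregation: call rate, error rate, and a latency percentile grouped by service. It is not the query Jaeger sends to ClickHouse; the `Span` type, the `service_metrics` function, and the nearest-rank percentile are illustrative assumptions (ClickHouse provides its own quantile functions).

```python
import math
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    # Hypothetical, minimal span record for illustration only.
    service_name: str
    duration_ms: float
    is_error: bool

def percentile(sorted_values, p):
    """Nearest-rank percentile (0 < p <= 100) over an already sorted list."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]

def service_metrics(spans, window_s):
    """Group spans by service; compute calls/sec, error rate, and p95 latency."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span.service_name].append(span)

    metrics = {}
    for service, group in by_service.items():
        durations = sorted(s.duration_ms for s in group)
        errors = sum(1 for s in group if s.is_error)
        metrics[service] = {
            "calls_per_s": len(group) / window_s,
            "error_rate": errors / len(group),
            "p95_ms": percentile(durations, 95),
        }
    return metrics

spans = [
    Span("auth-service", 12.0, False),
    Span("auth-service", 15.0, False),
    Span("auth-service", 90.0, True),
    Span("auth-service", 20.0, False),
    Span("payment-gateway", 40.0, False),
    Span("payment-gateway", 55.0, True),
]
m = service_metrics(spans, window_s=60)
print(m["auth-service"]["p95_ms"], m["auth-service"]["error_rate"])  # 90.0 0.25
```

In a columnar store, each of these aggregates reads only the columns it needs (service, duration, status), which is why this kind of grouping stays cheap.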
Designing the schema
Schema design was where things became challenging. We needed to optimize for Jaeger’s core query patterns: fetching a trace by its trace ID; searching by service and operation; attribute filtering; time-range queries; and the aggregations that power the Service Performance Monitoring (SPM) feature. These requirements don’t all point in the same direction.
There’s an outstanding earlier post by Ha Anh Vu that benchmarked ClickHouse schemas for Jaeger v1, and that research laid important groundwork. However, Jaeger v2 adopts the OpenTelemetry data model, which required us to revisit several earlier decisions.
The full design space is documented in an Architectural Decision Record (ADR). The sections below cover the decisions that matter most.
Trade-offs in primary key selection
In ClickHouse, the primary key doesn’t enforce uniqueness. Instead, it determines the on-disk sort order and drives a sparse index (one index entry per 8,192-row granule). Choosing it is the single most consequential decision in the schema.
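As a rough model of how a sparse primary index narrows a search, assume rows are already sorted by (service_name, name, start_time). Only the first key of each granule is kept as a "mark", and a range query binary-searches the marks to find the granules worth reading. The granule size of 4 and the helper names are hypothetical; ClickHouse uses 8,192-row granules by default and its own range analysis, but the principle is the same.

```python
import bisect

GRANULE_SIZE = 4  # tiny for readability; ClickHouse's default is 8,192 rows

def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep one mark per granule: the first key stored in it."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def granules_for_range(marks, low, high):
    """Indexes of granules that may contain keys in [low, high]."""
    # The last granule starting at or before `low` may still contain `low`.
    first = max(bisect.bisect_right(marks, low) - 1, 0)
    # Granules whose first key is greater than `high` cannot match.
    last = bisect.bisect_right(marks, high) - 1
    return list(range(first, last + 1))

# Hypothetical keys: (service_name, name, start_time), sorted as on disk.
rows = sorted(
    [("auth-service", "login", t) for t in range(6)]
    + [("payment-gateway", "charge", t) for t in range(6)]
)
marks = build_sparse_index(rows)

# Search: service=payment-gateway, operation=charge, start_time in [0, 3].
print(granules_for_range(marks,
                         ("payment-gateway", "charge", 0),
                         ("payment-gateway", "charge", 3)))  # [1, 2]
```

Granule 0 is skipped without being read. A lookup by trace ID gets no help from these marks, because trace_id is not part of the sort key.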
We had two candidates for the primary key:
- Optimize for trace retrieval: sort by trace_id. Every span belonging to a trace lands in one contiguous block, so GetTrace becomes a single seek plus a sequential read. However, search queries pay the price for this optimization, since filters on service and operation (service_name and name) cannot use the primary key index at all.
- Optimize for search (our choice): sort by (service_name, name, start_time). Search queries that filter by service, operation, and a time window become direct primary-key lookups.
The decision boiled down to an asymmetric trade-off. Sorting by trace_id makes search performance poor, but sorting by (service_name, name, start_time) hurts trace retrieval far less, because we can recover most of the lost performance with two inexpensive mechanisms:
- A bloom_filter skip index on trace_id, which lets the engine prove a granule cannot contain a given ID without actually reading it.
- A trace_id_timestamps materialized view that tells the search path the time bounds for each matching trace, so the subsequent GetTraces call can prune partitions and granules.
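The bloom_filter skip index can be approximated as one small Bloom filter per granule of trace IDs. The sketch below, with a hypothetical `BloomFilter` class and toy granules, shows why a lookup never skips a granule that holds the trace (no false negatives) while still ruling out most others. In ClickHouse the filter’s false-positive rate and the number of granules covered by each index entry are configurable; this toy ignores both.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Hypothetical granules: with rows sorted by (service_name, name, start_time),
# spans of one trace ("trace-a") can land in several granules.
granules = [
    ["trace-a", "trace-b", "trace-c"],
    ["trace-d", "trace-a", "trace-e"],
    ["trace-f", "trace-g", "trace-h"],
]

filters = []
for granule in granules:
    bloom = BloomFilter()
    for trace_id in granule:
        bloom.add(trace_id)
    filters.append(bloom)

def granules_to_read(trace_id):
    """Read only granules whose filter cannot rule the trace ID out."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(trace_id)]

print(granules_to_read("trace-a"))  # includes 0 and 1; 2 only on a false positive
```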
An earlier benchmark using the schema sorted by trace_id clearly demonstrated the asymmetry. Trace retrieval ran at about 27 ms, but a search query took roughly 880 ms. Re-sorting by (service_name, name, start_time) pushed trace retrieval to around 100 ms (slower, but still well within interactive thresholds) while bringing multi-filter search down to about 140 ms.
Storing typed attributes
In Jaeger v1, tags were always strings. The v2 reader API accepts a typed map, where attributes can be Bool, Int64, Float64, String, or one of the complex types (Bytes, Slice, Map). We need to query across these types, so the storage layer can’t collapse everything to strings.
The schema uses one ClickHouse Nested column per primitive type, repeated for each level at which attributes can appear.
Think of each Nested column as a small table embedded in every row: each row carries its own list of attribute names and values. This lets attribute filters use the same query semantics as ordinary table queries.
That said, searches that rely only on attributes are inherently more expensive, because they can’t make full use of ClickHouse’s primary index, which is tuned for the top-level structural fields: service, operation, and time. For the best query performance and to avoid heavy column scans, always combine attribute filters with these fields to narrow down the data ClickHouse needs to examine.
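Here is a simplified model of typed attributes, with each Nested column represented as a list of (key, value) pairs and only the string and Int64 variants shown. The field names echo the str_attributes and int_attributes columns of the schema, but the `SpanRow` layout and the `search` helper are assumptions for illustration. The point is the order of operations: narrow by service and time first, then scan attribute pairs only on the surviving rows.

```python
from dataclasses import dataclass, field

@dataclass
class SpanRow:
    # Hypothetical row layout; the real schema has more types and levels.
    service_name: str
    name: str
    start_time: int
    # Each "Nested" column is modelled as a list of (key, value) pairs.
    str_attributes: list[tuple[str, str]] = field(default_factory=list)
    int_attributes: list[tuple[str, int]] = field(default_factory=list)

def has_int_attr(row, key, value):
    """True if the row's Int64 attributes contain key == value."""
    return any(k == key and v == value for k, v in row.int_attributes)

def search(rows, service_name, start, end, key, value):
    # Cheap step: filter on primary-key fields (service, time range).
    candidates = [r for r in rows
                  if r.service_name == service_name and start <= r.start_time < end]
    # Expensive step: scan attribute pairs, but only on the remaining rows.
    return [r for r in candidates if has_int_attr(r, key, value)]

rows = [
    SpanRow("auth-service", "login", 100,
            str_attributes=[("http.method", "POST")],
            int_attributes=[("http.status_code", 200)]),
    SpanRow("auth-service", "login", 200,
            int_attributes=[("http.status_code", 500)]),
    SpanRow("payment-gateway", "charge", 150,
            int_attributes=[("http.status_code", 200)]),
]

matches = search(rows, "auth-service", 0, 300, "http.status_code", 200)
print([r.start_time for r in matches])  # [100]
```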
Materialized views
Certain Jaeger queries don’t align well with the spans table’s sort order. For instance, the Jaeger UI must swiftly load the complete list of known service names and operations, while trace lookups often require efficient access to trace time windows.
Instead of relying on costly full-table scans to serve these queries, we employ materialized views that precompute the results. In ClickHouse, materialized views automatically process incoming inserts from a source table and direct the transformed output into purpose-built target tables.
This technique is applied to accelerate queries for service names, operations, and trace ID timestamp ranges.
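Materialized views in ClickHouse behave like insert triggers: each view receives only the newly inserted block and writes its transformed result to a target table. The sketch below mimics that with Python callbacks that maintain a service-to-operations map and per-trace time bounds. The class, the view functions, and the `end_time` field are hypothetical; real target tables may also hold partial rows that ClickHouse merges in the background, whereas these dictionaries merge eagerly.

```python
from collections import defaultdict

class SpansTable:
    """Toy source table that forwards each inserted batch to its 'views'."""

    def __init__(self):
        self.rows = []
        self.views = []

    def insert(self, batch):
        self.rows.extend(batch)
        for view in self.views:
            view(batch)  # a view sees only the new block, never the whole table

# Hypothetical target tables.
service_operations = defaultdict(set)  # service_name -> {operation names}
trace_time_bounds = {}                 # trace_id -> (earliest start, latest end)

def service_operations_view(batch):
    for span in batch:
        service_operations[span["service_name"]].add(span["name"])

def trace_time_bounds_view(batch):
    for span in batch:
        start, end = span["start_time"], span["end_time"]
        lo, hi = trace_time_bounds.get(span["trace_id"], (start, end))
        trace_time_bounds[span["trace_id"]] = (min(lo, start), max(hi, end))

def fetch_window(trace_ids):
    """Time window a follow-up trace fetch could use to prune partitions."""
    bounds = [trace_time_bounds[t] for t in trace_ids]
    return min(b[0] for b in bounds), max(b[1] for b in bounds)

spans = SpansTable()
spans.views.extend([service_operations_view, trace_time_bounds_view])
spans.insert([
    {"trace_id": "t1", "service_name": "auth-service", "name": "login",
     "start_time": 100, "end_time": 130},
    {"trace_id": "t1", "service_name": "payment-gateway", "name": "charge",
     "start_time": 110, "end_time": 180},
])
spans.insert([
    {"trace_id": "t2", "service_name": "auth-service", "name": "logout",
     "start_time": 500, "end_time": 510},
])

print(sorted(service_operations["auth-service"]))  # ['login', 'logout']
print(fetch_window(["t1"]))                         # (100, 180)
```

Listing services or operations then reads a tiny table instead of scanning spans, and the per-trace bounds let a trace fetch skip partitions outside the window.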
Five levels of attributes
There’s a subtle technical hurdle in the spans schema that may not be immediately apparent: how the storage layer resolves attribute lookups. For example, when filtering for http.status_code=200, the reader can’t tell in advance whether “200” is stored as a string or an integer, or whether it is a span-level or resource-level attribute. Depending on the originating service, the same logical key could be stored under str_attributes or int_attributes, and it might reside at any of five levels: resource, scope, span, event, or link.
To address this, we maintain a dedicated attribute_metadata table, fed by materialized views built on top of the spans table. This lets the reader resolve the filter key at query time and target only the columns corresponding to the types and levels that were actually recorded.
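A minimal sketch of query-time key resolution, assuming attribute_metadata can be read as a set of (key, level, type) triples. The triples, the `resolve_filter` helper, and the coercion table are hypothetical; they illustrate how a raw filter value such as “200” might be mapped onto only the type and level combinations that were actually recorded, skipping any where the value cannot be parsed.

```python
# Hypothetical attribute_metadata contents: (key, level, type) seen at ingest.
ATTRIBUTE_METADATA = {
    ("http.status_code", "span", "int"),
    ("http.status_code", "resource", "str"),
    ("http.method", "span", "str"),
}

# How a raw filter string is coerced per storage type (Bool/Float64 omitted).
COERCE = {"str": str, "int": int}

def resolve_filter(key, raw_value):
    """Return (level, type, typed_value) targets to query for key=raw_value."""
    targets = []
    for meta_key, level, type_name in sorted(ATTRIBUTE_METADATA):
        if meta_key != key:
            continue
        try:
            typed_value = COERCE[type_name](raw_value)
        except ValueError:
            continue  # e.g. "OK" can never match an Int64 column
        targets.append((level, type_name, typed_value))
    return targets

print(resolve_filter("http.status_code", "200"))
# [('resource', 'str', '200'), ('span', 'int', 200)]
```

Keys that never appeared in the metadata resolve to no targets at all, so the reader does not have to probe every type and level column blindly.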
Span throughput at scale
We evaluated the ClickHouse backend using 10 million spans spread across 1 million traces on a single-node setup. The benchmark assessed ingestion throughput, compression efficiency, trace retrieval speed, and filtered search latency.
The backend maintained over 50k spans/sec during ingestion, delivered an 8.6× compression ratio on the spans table, and shrank span data from nearly 6 GiB down to approximately 722 MiB on disk. Trace retrieval averaged around 100 ms, while most search queries completed in under 50 ms. More involved filtered queries finished in roughly 140 ms.
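For a sense of scale, the reported figures can be turned into per-span numbers with some back-of-envelope arithmetic. The inputs are rounded (“over 50k”, “nearly 6 GiB”, “approximately 722 MiB”), so treat the results as approximations derived from the benchmark, not additional measurements.

```python
# Figures reported in the benchmark; back-of-envelope arithmetic only.
span_count = 10_000_000
ingest_rate = 50_000                 # spans/sec ("over 50k", so time is an upper bound)
compressed_bytes = 722 * 2**20       # approximately 722 MiB on disk
uncompressed_bytes_max = 6 * 2**30   # "nearly 6 GiB", used as an upper bound

ingest_seconds = span_count / ingest_rate
bytes_per_span_on_disk = compressed_bytes / span_count
uncompressed_bytes_per_span_max = uncompressed_bytes_max / span_count

print(f"ingest <= {ingest_seconds:.0f} s, "
      f"~{bytes_per_span_on_disk:.1f} B/span on disk, "
      f"< {uncompressed_bytes_per_span_max:.0f} B/span before compression")
# ingest <= 200 s, ~75.7 B/span on disk, < 644 B/span before compression
```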
These results are promising, but they should be interpreted within the context of the benchmark environment and dataset used. The complete methodology, configuration specifics, and query details are documented in the benchmarking report.
Getting started
ClickHouse support is available as an alpha storage backend beginning with Jaeger v2.18.0. You’ll need a running ClickHouse instance and the Jaeger v2 configuration for the ClickHouse backend. The full instructions are outlined in the setup guide.
Serving as a Jaeger maintainer has been one of the most fulfilling parts of my career so far. If you’d like to discuss this work, contribute, or report a problem, please open an issue on GitHub or join us in the CNCF #jaeger Slack channel.



