# Linkerd and OpenTelemetry: Two Layers of Observability, One Pipeline
If you already run an OpenTelemetry pipeline, you have good visibility into what your applications are doing. This blog post is about what you don’t see yet: the east-west traffic between your services, measured at the network layer with zero changes to your application code.
Linkerd’s proxy provides those metrics. Once a workload is meshed, the proxy immediately emits golden metrics for every inbound and outbound request. No instrumentation, SDK calls, or image rebuilds are needed. This blog post shows what those metrics look like, where they overlap with OTel, where they don’t, and how to wire them into an existing OTel Collector pipeline so both layers land in the same backend. If you come from the mesh side and are wondering what OTel adds, you’ll learn that too.
## The Setup
The reference environment is K3s v1.34.6 (single node), Linkerd 2.19+ (tested on edge-26.5.5, June 2026), the OpenTelemetry Demo (Astronomy Shop) as the meshed workload, OTel Collector contrib 0.118.0 as a DaemonSet, and VictoriaMetrics with Grafana. The working Collector config and Grafana dashboard are available as downloads at the end of this post.
## What OTel Covers
The OpenTelemetry specification defines three signal types: traces, metrics, and logs. **Traces** follow a request across service boundaries and give you the full call graph. **Metrics** capture numeric measurements over time: counters, gauges, and histograms. **Logs** are the structured events your application emits.
What auto-instrumentation gives you depends on the language and framework, but typical examples are HTTP request counts, database call durations, and message queue depths. What you write yourself are the business-layer signals: number of orders placed, items added to cart, and payment failures. The OpenTelemetry Demo is a good example: it emits `app_cart_add_item_latency_seconds`, `app_payment_transactions_total`, `app_recommendations_counter_total`, and a handful of other service-specific metrics that no infrastructure layer could infer.
All of this lives at the application layer. OTel instruments your code, but it can’t
# Comparing Mesh Layer and Application Layer Metrics with OpenTelemetry
When operating a service mesh alongside an application instrumented with OpenTelemetry, you effectively have two independent observers recording the same events. Understanding what each one sees — and where they diverge — is critical for building reliable monitoring and alerting.
## The Overlap: Same Operation, Two Measurements
Both the mesh proxy and the application instrumentation record latency for the same operations, but from different vantage points. On the mesh side, `response_latency_ms_bucket` measures the elapsed time between a request’s headers being received and its response stream completing. On the app side, `app_cart_add_item_latency_seconds_bucket` measures the time the cart service’s own instrumentation recorded for the same operation.
When plotted together on a single chart, the difference becomes visible. A Grafana timeseries panel titled “p99 latency: mesh layer vs. app layer” shows multiple colored lines representing mesh p99 latency per pod across the otel-demo namespace, with a distinct line at the bottom representing app p99 latency for `cart.add_item` operations, measured in milliseconds.
The mesh and app measurements will not be identical. The proxy measures network-level timing; the application measures its own internal processing. The gap between them can surface network overhead, queuing, or slow middleware.
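To draw both series on one axis, each histogram has to be reduced to a quantile and converted to a common unit, since the mesh histogram is in milliseconds and the app histogram is in seconds. The following is a minimal Python sketch of the linear interpolation that PromQL’s `histogram_quantile` applies to cumulative buckets. It is simplified, and the bucket values are invented for illustration, not taken from the lab.

```python
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound; the last bucket's count is
    treated as the total number of observations.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                # No finite width or no observations to interpolate over.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: mesh in milliseconds, app in seconds.
mesh_buckets_ms = [(1.0, 0.0), (10.0, 50.0), (100.0, 90.0), (1000.0, 100.0)]
app_buckets_s = [(0.005, 80.0), (0.01, 95.0), (0.025, 100.0)]

mesh_p99_ms = bucket_quantile(0.99, mesh_buckets_ms)
app_p99_ms = bucket_quantile(0.99, app_buckets_s) * 1000.0  # seconds -> ms
print(f"mesh p99: {mesh_p99_ms:.1f} ms, app p99: {app_p99_ms:.1f} ms")
```

Note that the estimate can never be finer than the bucket boundaries, so part of any gap between the two lines may come from the two histograms using different bucket layouts.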
### Which Measurement to Trust for What
- **For mTLS identity and east-west success rate**, trust mesh metrics. The proxy is the authoritative source because it observes the actual connection, not what the app reports about itself.
- **For business semantics and custom dimensions**, trust OTel app metrics. Only your code knows that a request was a “checkout” for a “platinum” customer.
- **For root cause**, trust distributed traces. The mesh sees failure rates; traces show you the call graph and tell you which span failed.
## The Non-Overlap: Where Mesh and App Metrics Complement Each Other
OTel covers what the application knows: custom business metrics, per-request traces, app-layer events, and infrastructure metrics you wire up explicitly. Mesh metrics cover L7 east-west service-to-service traffic, as observed by the proxy.
The cleanest way to see the difference is with a failure. In the OpenTelemetry Demo, the frontend service makes regular calls to the ad service. The mesh sees these (labels trimmed for readability):
```
response_total{direction="outbound", status_code="200", dst_service="ad",
  classification="failure", grpc_status="14", …} 6
```
Notice the label pair: `status_code="200"` and `grpc_status="14"`. The HTTP layer reports success; the gRPC status is UNAVAILABLE. In gRPC, the status code travels in the response trailers, separately from the HTTP status line, so if you only alert on HTTP status codes, this failure is invisible. The proxy reads that status and classifies the response as a failure anyway. The mesh knows the call failed and how many times, but it does not know why.
A Jaeger trace for the same operation tells the rest of the story. The trace timeline for a `user_get_ads` request shows the span tree with load-generator calling frontend-proxy, which calls frontend. The deepest span, `oteldemo.AdService/GetAds` on the frontend service, is highlighted in red with tags showing `error=true` and `grpc.error_message "14 UNAVAILABLE: client 10.42.0.216:52176: server: 10.42.0.217:4143."`
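The classification rule described here can be sketched in a few lines of Python. This is an illustrative simplification, not Linkerd’s actual classifier: it assumes that any non-zero gRPC status counts as a failure and that plain HTTP responses fail on 5xx. The function name and signature are hypothetical.

```python
from typing import Optional


def classify(http_status: int, grpc_status: Optional[int]) -> str:
    """Return "success" or "failure" for a single response.

    Assumption for this sketch: when a gRPC status is present it decides
    the outcome (0 means OK); otherwise HTTP 5xx is a failure.
    """
    if grpc_status is not None:
        return "success" if grpc_status == 0 else "failure"
    return "failure" if http_status >= 500 else "success"


# The response from the example above: HTTP 200 carrying gRPC status 14.
print(classify(200, 14))
```

An alert that only evaluates `http_status` would treat this response as healthy, which is why the `classification` label is the safer thing to alert on.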
The trace shows the exact span that failed, the exact error message, and the client and server addresses involved. The mesh flags the issue while the trace shows you the root cause.
The same separation applies to business metrics. `app_payment_transactions_total` and `app_recommendations_counter_total` show up on the app side because the OTel Demo’s own instrumentation emits them. No proxy can infer that a request was a payment or a recommendation. That domain knowledge lives in the code.
## The Integration Pattern
The goal is to get mesh metrics into the same backend as your OTel metrics, tagged so you can tell them apart. The mechanism is a dedicated pipeline in your OTel Collector.
The `prometheus/mesh` receiver uses Kubernetes pod discovery to find every container named `linkerd-proxy` and scrapes port 4191. The `filter/mesh` processor, written in the OpenTelemetry Transformation Language (OTTL), keeps only the five core metric families (`response_total`, `response_latency_ms.*`, `tcp_open_connections`, `tcp_read_bytes_total`, `tcp_write_bytes_total`). The `resource/mesh` processor inserts `layer=mesh` on every series, and `k8sattributes` enriches each one with pod, namespace, deployment, and node metadata. The full pipeline flows through `memory_limiter`, `filter/mesh`, `resource/mesh`, `resourcedetection`, `k8sattributes`, and `batch` before reaching the `prometheusremotewrite` exporter.
```yaml
receivers:
  prometheus/mesh:
    config:
      scrape_configs:
        - job_name: linkerd-mesh
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only targets whose container is the Linkerd proxy.
            - source_labels: [__meta_kubernetes_pod_container_name]
              action: keep
              regex: linkerd-proxy
            # Point the scrape address at port 4191 on the pod IP.
            - source_labels: [__meta_kubernetes_pod_ip]
              action: replace
              target_label: __address__
              regex: (.+)
              replacement: $1:4191
processors:
  filter/mesh:
    error_mode: ignore
    metrics:
      metric:
        # Metrics matching this condition are dropped, hence the not(...).
        - 'not(name == "response_total" or IsMatch(name, "response_latency_ms.*") or name == "tcp_open_connections" or name == "tcp_read_bytes_total" or name == "tcp_write_bytes_total")'
  resource/mesh:
    attributes:
      - key: layer
        value: mesh
        action: insert
service:
  pipelines:
    metrics/mesh:
      receivers: [prometheus/mesh]
      # Processors and the exporter not defined above come from the full config.
      processors: [memory_limiter, filter/mesh, resource/mesh, resourcedetection, k8sattributes, batch]
      exporters: [prometheusremotewrite]
```
This configuration ensures mesh metrics arrive at your observability backend carrying the same dimensional metadata as your OTel application metrics, with the `layer=mesh` attribute making the two sources trivially distinguishable in dashboards and alerts.
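Two details of that config are easy to misread: the filter processor drops whatever its condition matches, which is why the keep list is wrapped in `not(...)`, and the `insert` action only adds `layer` when the key is not already present. The Python sketch below models both behaviors. It is an illustration of the semantics, not Collector code, and the helper names are hypothetical.

```python
import re
from typing import Dict

KEEP_EXACT = {
    "response_total",
    "tcp_open_connections",
    "tcp_read_bytes_total",
    "tcp_write_bytes_total",
}


def drop_condition(name: str) -> bool:
    """Model of the not(...) condition: True means the metric is dropped."""
    keep = name in KEEP_EXACT or re.search("response_latency_ms.*", name) is not None
    return not keep


def insert_attribute(resource: Dict[str, str], key: str, value: str) -> Dict[str, str]:
    """Model of action: insert, which never overwrites an existing key."""
    tagged = dict(resource)
    tagged.setdefault(key, value)
    return tagged


# "process_cpu_seconds_total" stands in for any family outside the keep list.
for name in ("response_total", "response_latency_ms", "process_cpu_seconds_total"):
    print(name, "dropped" if drop_condition(name) else "kept")
print(insert_attribute({"k8s.pod.name": "cart-0"}, "layer", "mesh"))
```

If a resource could already carry a `layer` attribute from somewhere upstream, `insert` would leave that value in place, so it is worth checking that nothing else in the pipeline sets the same key.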
---
*Original article: [Comparing Mesh Layer and Application Layer Metrics with OpenTelemetry](https://www.cncf.io/blog/2026/06/09/comparing-mesh-layer-and-application-layer-metrics-with-opentelemetry/), Cloud Native Computing Foundation (CNCF).*
# Connecting an OTel Demo to a Service Mesh Reference Pipeline: Lessons Learned
## Overview: Merging Observability Pipelines
Running Linkerd mesh metrics alongside an OpenTelemetry-instrumented demo application can produce clean, unified Grafana dashboards—but the Collector configuration running the mesh pipeline requires careful attention to detail. The following notes are based on a practical lab setup that combined Linkerd mesh metrics (piped through the OpenTelemetry Collector into VictoriaMetrics) with the OTel Demo’s built-in Prometheus, both visualized in a single Grafana dashboard.
---
## The Setup
The mesh pipeline is designed to stay separate from the OTel Demo’s existing metrics infrastructure. There is a dedicated Collector handling mesh metrics, a dedicated VictoriaMetrics instance storing them, and the OTel Demo’s bundled Prometheus continues to scrape application-level metrics untouched. Grafana reads both as separate datasources, which mirrors exactly how the downloadable dashboard’s mixed datasource panel is wired: query A pulls `response_latency_ms_bucket{layer="mesh"}` from VictoriaMetrics, and query B pulls `app_cart_add_item_latency_seconds_bucket` from the Prometheus datasource. When both inputs are prometheus-type in Grafana, it is easy to accidentally point them at the same datasource. The mesh input must be mapped to VictoriaMetrics and the app input to Prometheus.
This reference stack uses VictoriaMetrics and the OTel Demo’s built-in Grafana. The pattern works with any Prometheus-compatible backend. The stack also includes VictoriaLogs for application logs, though that pipeline is outside the scope of this post.
---
## A Costly Configuration Gotcha
One issue cost meaningful debugging time during this setup: a relabel rule that uses `$1:4191` as its replacement value can trip over the Collector’s `$`-based environment variable expansion on certain versions, causing the config to be rejected at startup. This was encountered on contrib 0.104.0, where the `confmap.unifyEnvVarExpansion` feature gate was the culprit (workaround then: `--feature-gates=-confmap.unifyEnvVarExpansion`). The same failure is reported upstream on contrib 0.112.0, which rejects even an escaped `$$1:$$2` replacement pattern. Contrib 0.118.0, the version this lab pins, accepts the config without issue. **Pin your image tag.**
All Collector configuration referenced here is also maintained in the myOTel reference stack, from which this lab’s setup was adapted. The namespace and storage class may differ, but the pipeline structure is the same.
---
## A Note on Cardinality
Proxy metrics come with a large number of labels. A single `response_latency_ms_bucket` series, as stored in VictoriaMetrics after enrichment, carries 35 labels: `direction`, `tls`, `client_id`, `authz_kind`, `route_name`, `srv_name`, `le`, and many more. Some labels are emitted by the proxy itself; others are added by the `k8sattributes` and resource processors. Every combination of those label values constitutes its own series—once per histogram bucket, on every meshed pod.
This lab’s scrape covers 30 meshed pods: the 25 OTel Demo pods, the pipeline’s own Collector and VictoriaMetrics, and Linkerd’s 3 control-plane pods (since the scrape retains every `linkerd-proxy` container in the cluster). `response_latency_ms_bucket` alone produced 5,642 series, and the entire `job="linkerd-mesh"` scrape totaled 9,280 series after filtering. Both numbers are post-filter. **Filtering by metric family shortens the name list, but it does nothing to the label cardinality inside a family you keep. A histogram you keep is a histogram you pay for.**
---
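One way to see where the series come from is to count them per metric name in a proxy’s scrape output, since every line in the Prometheus text format is one series. The sketch below is a rough Python illustration; the sample text is a made-up, heavily trimmed excerpt (a real proxy line carries many more labels), and `series_per_name` is a hypothetical helper.

```python
from collections import Counter

# Invented sample in Prometheus text exposition format.
SAMPLE = """
# TYPE response_latency_ms histogram
response_latency_ms_bucket{direction="inbound",le="1"} 3
response_latency_ms_bucket{direction="inbound",le="+Inf"} 5
response_latency_ms_bucket{direction="outbound",le="1"} 2
response_latency_ms_bucket{direction="outbound",le="+Inf"} 4
response_total{direction="inbound"} 5
"""


def series_per_name(exposition: str) -> Counter:
    """Count series (sample lines) per metric name, skipping comments."""
    counts: Counter = Counter()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or, without labels, at a space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


for name, count in series_per_name(SAMPLE).most_common():
    print(f"{count:4d}  {name}")
```

Even in this toy sample the histogram dominates: two label combinations times two buckets already gives four series, against one for the counter.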
## Where Filtering Happens: Two Approaches
Without filtering, the proxy exposes 163 distinct metric families. After the filter/mesh processor retains only 5 families, 11 metric names arrive in VictoriaMetrics: the 5 families export as 7 names (each histogram splits into `_bucket`, `_count`, and `_sum`), 3 more leak from `control_response_latency_ms_*`, and the final one is the exporter-generated `target_info`.
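The arithmetic behind the 11 names can be checked with a short Python sketch. The `_bucket`, `_count`, and `_sum` expansion follows the description above; the counter and gauge types assigned to the non-histogram families are assumptions made for illustration, and `exported_names` is a hypothetical helper.

```python
from typing import Dict, List

# The five families the filter keeps. Only the histogram type matters here;
# the other types are assumed for this sketch.
KEPT_FAMILIES: Dict[str, str] = {
    "response_total": "counter",
    "response_latency_ms": "histogram",
    "tcp_open_connections": "gauge",
    "tcp_read_bytes_total": "counter",
    "tcp_write_bytes_total": "counter",
}


def exported_names(families: Dict[str, str]) -> List[str]:
    """Expand metric families into the names that reach the backend."""
    names: List[str] = []
    for family, kind in families.items():
        if kind == "histogram":
            names += [family + suffix for suffix in ("_bucket", "_count", "_sum")]
        else:
            names.append(family)
    return names


kept = exported_names(KEPT_FAMILIES)
leaked = exported_names({"control_response_latency_ms": "histogram"})
total = len(kept) + len(leaked) + 1  # +1 for the exporter-generated target_info
print(len(kept), len(leaked), total)
```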
There are two places to perform this filtering, both tested on contrib 0.118.0 in this lab:
### Approach 1: `metric_relabel_configs` in the Prometheus Receiver
```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: "response_total|response_latency_ms.*|tcp_.*"
```
With this rule active and no OTTL filter, **15 metric names** flow into VictoriaMetrics: 9 proxy names matching the regex, plus 6 names the keep rule never sees. Five of those 6 are the scrape’s own synthetic series (`up`, `scrape_duration_seconds`, `scrape_samples_scraped`, `scrape_samples_post_metric_relabeling`, `scrape_series_added`), to which metric relabeling does not apply. The sixth is `target_info`, which the `prometheusremotewrite` exporter generates after the processors run.
### Approach 2: OTTL Filter/Mesh Processor
The OTTL filter/mesh processor from the integration section lands at **11 names**: 163 families in, 11 out. The gap between 11 and 15 matters when choosing between them:
- The relabel regex is fully anchored, so `response_latency_ms.*` does not admit `control_response_latency_ms_*`; OTTL’s `IsMatch` is unanchored, so those 3 control-plane names leak through it.
- In the other direction, `tcp_.*` in the keep rule admits every TCP family the proxy exposes (`tcp_open_total` and `tcp_close_total` included), while the OTTL list names its 3 TCP metrics explicitly.
- Only the OTTL filter drops the synthetic scrape series.
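The anchoring difference is easy to reproduce outside the Collector. In the Python sketch below, `re.fullmatch` stands in for the fully anchored relabel regex and `re.search` for the unanchored `IsMatch`. Python’s `re` module is only a stand-in for the Go regex engine that both components use, but for these simple patterns the behavior is the same.

```python
import re

RELABEL_REGEX = "response_total|response_latency_ms.*|tcp_.*"
OTTL_PATTERN = "response_latency_ms.*"


def relabel_keeps(name: str) -> bool:
    """Anchored match: the whole metric name must match the regex."""
    return re.fullmatch(RELABEL_REGEX, name) is not None


def ottl_is_match(name: str) -> bool:
    """Unanchored match: the pattern may match anywhere in the name."""
    return re.search(OTTL_PATTERN, name) is not None


for name in (
    "response_latency_ms_bucket",
    "control_response_latency_ms_bucket",
    "tcp_open_total",
):
    print(f"{name}: relabel={relabel_keeps(name)} ottl={ottl_is_match(name)}")
```

If the control-plane names are unwanted on the OTTL side, anchoring the pattern explicitly, for example `^response_latency_ms.*`, should close the gap, though that variant was not part of the tested config.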
This reference config ships the OTTL filter: the keep list lives in the pipeline next to the processors that tag and enrich the data, and the synthetic series stay out of the backend. The relabel route is also a valid choice—it filters at scrape time, before samples enter the pipeline. Just write the regex with anchoring in mind and expect `up` and the `scrape_*` series to flow through.
---
## Where Each Layer Earns Its Place
OTel gives you what your application knows about itself: business events, custom dimensions, and distributed traces that follow a request across every service boundary. Linkerd gives you what the network knows: every east-west request between your meshed services, with mTLS identity, success rate, and latency, with zero changes to your code.
The two are complementary. An OTel pipeline without mesh metrics is missing the service topology layer. A mesh without OTel instrumentation can tell you a request failed but not why.
If you are already running OTel, adding the mesh pipeline is one Collector config change and a namespace annotation. The Grafana dashboard referenced in this post shows both layers on the chart the moment you do.
---
*Original article: [Connecting an OTel Demo to a Service Mesh Reference Pipeline](https://www.cncf.io/blog/2026/06/connecting-an-otel-demo-to-a-service-mesh-reference-pipeline/), CNCF Blog*



