As modern system architectures become more and more complex, the cloud-native community is grappling with a quiet yet urgent problem: we’re being overwhelmed by the very telemetry data we generate. Instrumenting applications and gathering signals has never been simpler, but are we truly extracting meaningful insights, or merely accumulating mountains of data?
At the recent Observability Summit North America held in Minneapolis, a panel of industry practitioners came together to tackle this very issue. This article distills the key strategies, mindset shifts, and lessons shared during the discussion, aiming to help engineering teams zero in on the telemetry that genuinely matters.
The root issue: Excessive data collection and “green” observability
For a long time, the default approach to observability was straightforward: instrument everything and sort through it later. Yet real-world experience consistently reveals that roughly half of all collected metrics are never actually queried or used. This unchecked accumulation of data does more than inflate storage costs — it imposes significant engineering overhead, amplifies alert fatigue, and adds to the mental burden engineers carry during live incidents.
An important yet often neglected dimension of this problem is green observability. Every metric that gets stored, indexed, and processed demands real computational power, disk space, and energy. Cutting down on telemetry waste isn’t merely a cost-saving tactic for infrastructure — it actively reduces the carbon and environmental impact of cloud-native ecosystems.
To build systems that are both sustainable and highly reliable, observability must be treated as a foundational design consideration from the very start. Teams should deliberately define what a healthy system looks like and pinpoint exactly which signals are necessary to catch structural drift before deploying to production.
Steering through an incident: From fragmented signals to an observability mesh
When a production incident strikes, the objective isn’t to examine every piece of data — it’s to locate the information needed to rapidly gauge user impact and trace the root cause. Modern open-standards frameworks such as OpenTelemetry categorize these data points into fundamental signals:
- Traces (and Spans): Chart the path of a transaction as it flows through distributed services, highlighting latency bottlenecks, errors, or broken downstream dependencies.
- Metrics: Monitor performance trends over time (like CPU usage or request throughput) to detect anomalies and measure the scope of impact.
- Logs: Capture timestamped text records that reveal precisely what happened during a failure event.
- Profiles: Offer code-level insight into how resources are allocated (such as memory consumption or CPU execution hotspots), clarifying why a given service is running slowly or consuming excessive resources.
Instead of treating these as separate, standalone diagnostic tools, the community is moving toward an observability mesh. In this interconnected model, metrics link directly to traces, traces embed associated logs, and logs connect back to resource profiles. During a live incident, this cross-signal linkage dramatically cuts down the friction of switching between tools. For initial detection, teams can lean on a dependable foundational layer like RED metrics (Rate, Errors, Duration) to quickly pinpoint the failing service before diving deeper into the mesh.
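As a rough illustration of the RED layer described above, the following sketch computes rate, error ratio, and a p95 duration per service from a batch of request records. The `Request` type, its field names, and the nearest-rank percentile are assumptions made for this example, not part of OpenTelemetry or any specific backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; a real pipeline would derive this from spans.
    service: str
    duration_ms: float
    is_error: bool


def red_summary(requests, window_seconds):
    """Return {service: {"rate", "error_ratio", "p95_ms"}} for one time window."""
    by_service = {}
    for req in requests:
        by_service.setdefault(req.service, []).append(req)

    summary = {}
    for service, reqs in by_service.items():
        durations = sorted(r.duration_ms for r in reqs)
        # Nearest-rank p95: the smallest value covering at least 95% of samples.
        idx = max(0, math.ceil(0.95 * len(durations)) - 1)
        summary[service] = {
            "rate": len(reqs) / window_seconds,                      # requests per second
            "error_ratio": sum(r.is_error for r in reqs) / len(reqs),
            "p95_ms": durations[idx],
        }
    return summary
```

Sorting the result by `error_ratio` or `p95_ms` gives a quick shortlist of services worth opening traces for.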
Finding the right balance: Zero-code versus manual instrumentation
How should teams cleanly produce and handle this telemetry data? An open ecosystem depends on standardized building blocks: semantic conventions for consistent labeling, entry-point APIs, SDK implementations, and open protocols like OTLP for delivering data to a backend. But deciding how to instrument your applications means weighing the trade-offs between automatic and hands-on approaches:
Zero-code instrumentation
Zero-code (or automatic) instrumentation lets you set up language-specific SDKs or leverage platform operators to gather telemetry without touching your application’s source code. This approach is perfect for rapid initial deployments or when dealing with third-party software you can’t modify. Advanced solutions like OpenTelemetry eBPF instrumentation (OBI) provide strong visibility into requests, databases, and message queues while enabling correlation between network data and application context. That said, zero-code methods can’t capture internal business logic. And because they hook in automatically, they carry the risk of producing enormous, unmanageable volumes of data if not properly configured.
Manual instrumentation
Manual instrumentation hands full control to engineers, letting them shape tracing precision around their specific business logic and critical custom domains. This targeted approach makes it easier to design traces, logs, and metrics that work together to tell a coherent causal story. On the flip side, manual instrumentation is slow to implement, adds long-term maintenance costs to the codebase, and can result in inconsistent telemetry coverage if development teams don’t maintain strict discipline across different programming languages. There’s also a real danger of over-instrumenting the code, which introduces noisy, low-value details that actually hinder active debugging.
Many teams try to roll out fully manual frameworks right from the start, only to stall and lose executive support because of sluggish progress and spiraling costs. A more practical path is to begin with zero-code auto-instrumentation to immediately establish a telemetry baseline, then examine the data flowing through your pipelines and gradually layer in manual instrumentation wherever deeper context is truly needed.
Day 2: Pipeline optimization strategies

Once telemetry data is being gathered at scale, the focus should shift to refining your data pipelines. This empowers platform teams to respond swiftly to surges in data volume without requiring application teams to repeatedly modify and redeploy their code.
Here are several effective strategies for reducing data volume within an open collector pipeline:
- Intelligent Sampling: Step beyond basic random sampling, which risks losing vital error signals. Adopt tail-based or pattern-based sampling approaches that filter out routine, successful requests while ensuring every anomaly or failure is fully captured.
- Handling High Cardinality: Refrain from tagging system metrics with highly unique identifiers such as user_id or request_id, as these can cause a sudden explosion in dimensions that overwhelms backend query systems. Instead, apply transform processors to anonymize unique IDs (for example, replacing specific URL parameters with a generic $ placeholder), remove unnecessary attributes, or condense detailed IP addresses into broader subnet ranges.
- Cardinality Controls: Deploy pipeline processors that continuously track incoming attribute values. When a particular label exceeds a set uniqueness limit, the pipeline automatically discards that attribute to safeguard metric performance.
- Log Consolidation: Utilize processors that detect duplicate log entries generated within a brief time frame, merging them into a single log record with an accurate count of occurrences.
- Infrastructure Metadata Enrichment: Reduce the burden on individual agents by moving metadata gathering out of each service. Instead, establish consistent semantic conventions and inject shared infrastructure or container orchestrator labels centrally within the collection pipeline.
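To make several of these strategies concrete, here is a minimal standalone sketch of the logic behind tail-based sampling, ID normalization, cardinality limiting, and log deduplication. It is not the configuration or API of any real collector processor; every function name, dictionary key, and threshold below is a hypothetical chosen for illustration.

```python
import re
from collections import defaultdict


def keep_trace(spans, slow_ms=1000.0):
    """Tail-based decision, made once all spans of a trace are available:
    keep the trace if any span failed or was slow, drop routine successes."""
    return any(s.get("error") or s.get("duration_ms", 0) >= slow_ms for s in spans)


# Matches a path segment that is all digits or a 36-character UUID-like token.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F-]{36})(?=/|$)")


def normalize_path(path):
    """Replace unique ID segments with a generic '$' placeholder."""
    return _ID_SEGMENT.sub("/$", path)


class CardinalityLimiter:
    """Drop an attribute once it has shown more than `limit` distinct values."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = defaultdict(set)   # attribute name -> distinct values so far
        self.dropped = set()           # attribute names that exceeded the limit

    def apply(self, attributes):
        kept = {}
        for key, value in attributes.items():
            if key not in self.dropped:
                self.seen[key].add(value)
                if len(self.seen[key]) > self.limit:
                    self.dropped.add(key)
                    del self.seen[key]  # free memory for the dropped attribute
            if key not in self.dropped:
                kept[key] = value
        return kept


def dedupe_logs(records, window_s=10.0):
    """Merge identical (severity, body) records that arrive within `window_s`
    of the first occurrence. `records` is assumed to be time-ordered tuples
    of (timestamp_s, severity, body)."""
    merged = []
    open_idx = {}  # (severity, body) -> index of the current merged record
    for ts, severity, body in records:
        key = (severity, body)
        idx = open_idx.get(key)
        if idx is not None and ts - merged[idx]["first_ts"] <= window_s:
            merged[idx]["count"] += 1
            merged[idx]["last_ts"] = ts
        else:
            open_idx[key] = len(merged)
            merged.append({"first_ts": ts, "last_ts": ts,
                           "severity": severity, "body": body, "count": 1})
    return merged
```

Because all four steps run in the pipeline rather than in application code, their thresholds can be tuned by the platform team without redeploying any service.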
Exploring the probabilistic frontier: Agentic and AI-driven workflows
The panel wrapped up by tackling a major shift in architecture: monitoring agentic and LLM-powered workflows.
Conventional microservices follow predictable logic, where we measure clear success criteria, explicit network errors, and repeatable failure conditions. AI systems disrupt these norms. They function in probabilistic settings where identical prompts can produce vastly different outputs, errors are often qualitative rather than technical, and “success” hinges on the quality of the response.
As a result, our approach to telemetry must evolve. While traditional metrics like latency and error rates remain important, observability must extend to analyzing semantic prompt/response patterns and assessing decision quality rather than merely system availability. Tracing must follow the entire journey from user prompt to the LLM, through iterative tool and agent calls, down to legacy backend microservices, and back up to a final evaluation stage.
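One way to picture such an end-to-end trace is as a tree of spans whose final child is an evaluation step. The sketch below is purely illustrative: the `Span` class, the `kind` values, and the `tokens` and `quality_score` attribute names are assumptions for this example and do not reflect OpenTelemetry's semantic conventions.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                      # hypothetical: "prompt", "llm", "tool", "backend", "evaluation"
    duration_ms: float = 0.0
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this span and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def summarize(root):
    """Combine system-level and quality-level signals for one agent run."""
    spans = list(root.walk())
    scores = [s.attributes["quality_score"] for s in spans if s.kind == "evaluation"]
    return {
        "llm_calls": sum(s.kind == "llm" for s in spans),
        "tool_calls": sum(s.kind == "tool" for s in spans),
        "total_tokens": sum(s.attributes.get("tokens", 0) for s in spans),
        "quality_score": scores[-1] if scores else None,
    }


# Example run: prompt -> LLM plan -> tool call -> backend -> LLM answer -> evaluation.
run = Span("user_prompt", "prompt", children=[
    Span("plan", "llm", 800, {"tokens": 1200}, [
        Span("search_orders", "tool", 120, {}, [Span("orders-api", "backend", 90)]),
    ]),
    Span("answer", "llm", 600, {"tokens": 700}),
    Span("judge", "evaluation", 300, {"quality_score": 0.82}),
])
report = summarize(run)
```

Keeping token counts and the evaluation score on the same trace as latency lets a single query relate cost, speed, and answer quality.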
Ultimately, this shifts our central question from “Is the application performing quickly?” to “Is our system delivering cost-effective, dependable, and accurate results?”
Key insights from the panel
- Connect Network and Application Data: Incidents rarely stay confined to the software layer. Using open tools (such as eBPF-based instrumentation) to seamlessly tie application performance to the actual network paths between your users and your cluster is essential for swift root cause identification.
- Stay Informed on Emerging Architectural Standards: The community is actively developing solutions to tackle data scaling challenges. Watch for new approaches like retroactive sampling, which enables systems to make a centralized sampling decision upfront and then retrieve detailed, granular trace data as needed.
- Maximize Pipeline Flexibility: Avoid embedding filter logic directly within individual services. Depend on scalable collection components to dynamically shape, deduplicate, route, and control your telemetry volume. Periodically review your architecture by asking a critical question: “If this particular data stream ceased tomorrow, what would we truly lose?”



