Why Metrics Might Be Holding Your Observability Strategy Back
Metrics have been the foundation of observability for decades. But what if that foundation is cracked?
Ang Li, director of engineering at Observe, argues that metrics are a relic of outdated infrastructure constraints. They were necessary when systems couldn't handle large volumes of log data. Now, they might be doing more harm than good.
"Metrics force you to predefine what you think will matter before an incident even happens," Li explains. "If you guess wrong, you're effectively blind."
The Problem with Predicting Problems
The core issue isn't just philosophical. It's architectural.
Metrics require teams to decide upfront which labels, aggregation windows, and dimensions matter. But modern distributed systems are unpredictable. A metric might tell you API latency spiked, but it won't tell you that the spike only affected requests from a specific region with a particular data pattern.
That context gets stripped during roll-up. The user, request, payload, deployment, or dependency that triggered the failure disappears. You're left with a number that says something is wrong, but not why.
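A toy example makes the loss concrete. The field names below are hypothetical, but the pattern is the one Li describes: the rolled-up number confirms a problem exists, while the raw events still carry the dimensions that explain it.

```python
from collections import defaultdict

# Raw request events carry the full context of each call.
# (Field names are invented for illustration, not any vendor's schema.)
raw_events = [
    {"latency_ms": 48,   "region": "us-east-1", "payload_kind": "small"},
    {"latency_ms": 52,   "region": "us-east-1", "payload_kind": "small"},
    {"latency_ms": 1980, "region": "eu-west-2", "payload_kind": "bulk-import"},
    {"latency_ms": 2100, "region": "eu-west-2", "payload_kind": "bulk-import"},
]

# A pre-aggregated metric keeps only the dimensions chosen up front.
# Rolled up globally, latency says "something is slow" but not for whom.
avg_latency = sum(e["latency_ms"] for e in raw_events) / len(raw_events)
print(f"avg latency: {avg_latency:.0f} ms")

# With the raw events, the culprit is one group-by away.
worst_by_slice = defaultdict(int)
for e in raw_events:
    key = (e["region"], e["payload_kind"])
    worst_by_slice[key] = max(worst_by_slice[key], e["latency_ms"])
print(worst_by_slice)  # the eu-west-2 bulk imports stand out immediately
```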
Li recalls troubleshooting out-of-memory errors at Observe. Engineers stared at pages of garbage collection metrics without clear insight. "Having individual request traces and logs is still critical to identify the culprit requests that brought down the system," he says.
How Cloud Economics Changed the Game
The technical landscape has shifted dramatically. Cloud infrastructure now makes it economically viable to store and analyze raw telemetry data at scale.
Observe's architecture centers on treating all telemetry as structured event data. They use Amazon S3 for storage with 10x compression, providing low-cost long-term retention for petabytes of data. Above that sits what they call the O11y Knowledge Graph, which transforms raw data into recognizable entities like customers, shopping carts, pods, and containers.
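As a rough mental model (this sketch is illustrative, not Observe's actual data model), the graph-building step can be pictured as parsing raw events and linking the entities they mention:

```python
import json

# Illustrative only: raw telemetry arrives as loosely structured events.
raw_lines = [
    '{"kind": "k8s", "pod": "checkout-7f9c", "container": "app", "msg": "OOMKilled"}',
    '{"kind": "app", "pod": "checkout-7f9c", "customer_id": "c-1842", "cart_id": "cart-9931", "msg": "add_item"}',
]

# A tiny "knowledge graph": entities keyed by (type, id), plus edges between them.
entities: dict[tuple[str, str], dict] = {}
edges: list[tuple[tuple[str, str], tuple[str, str]]] = []

for line in raw_lines:
    event = json.loads(line)
    pod = ("pod", event["pod"])
    entities.setdefault(pod, {})
    if "container" in event:
        container = ("container", event["container"])
        entities.setdefault(container, {})
        edges.append((pod, container))
    if "customer_id" in event:
        customer = ("customer", event["customer_id"])
        cart = ("cart", event["cart_id"])
        entities.setdefault(customer, {})
        entities.setdefault(cart, {})
        edges.extend([(customer, cart), (cart, pod)])

print(list(entities))  # pods, containers, customers, carts
print(edges)           # how the raw events tie them together
```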
This approach required solving three key engineering challenges: scalable ingestion that adjusts to dynamic volume peaks, flexible transformation pipelines that model different system aspects, and intelligent query parallelization that handles 10GB the same way it handles 10TB.
"The customer should not care whether they are querying 10GB of data or 10TB," Li says. "The results should return in similar time frames."
Making Raw Data Analysis Viable
Processing raw telemetry instead of pre-aggregated metrics sounds computationally expensive. But several optimizations make it practical.
Columnar databases like Snowflake enable higher compression ratios than row-based systems, which reduces both storage costs and query scanning overhead. Observe also skips indexing by default, sidestepping the index-maintenance overhead that makes search-based tools like Splunk prohibitively expensive at scale.
Aggressive query caching reuses existing results. Dashboard refreshes only re-query the changed portions, saving significant computational costs.
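A simplified sketch of that kind of incremental caching, assuming time-bucketed dashboard panels (the bucket size, cache policy, and query stub below are invented for illustration):

```python
import time

# Cache keyed by (query text, time bucket). On a refresh, only buckets that are
# not yet cached (typically just the newest one) get re-queried.
CACHE: dict[tuple[str, int], int] = {}
BUCKET_SECONDS = 300  # five-minute buckets

def run_query_for_bucket(query: str, bucket_start: int) -> int:
    """Stand-in for the expensive scan of raw events in one time bucket."""
    return hash((query, bucket_start)) % 1000  # fake result

def dashboard_panel(query: str, window_seconds: int = 3600) -> list[int]:
    now = int(time.time())
    start = now - window_seconds
    results = []
    for bucket_start in range(start - start % BUCKET_SECONDS, now, BUCKET_SECONDS):
        key = (query, bucket_start)
        if key not in CACHE:  # cache miss: re-query only this slice
            CACHE[key] = run_query_for_bucket(query, bucket_start)
        results.append(CACHE[key])
    return results

# Two refreshes in a row: the second recomputes at most the newest bucket.
# (A real system would also invalidate the still-open bucket as data arrives.)
dashboard_panel("count errors by service")
dashboard_panel("count errors by service")
```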
The Role of Metrics Going Forward
This isn't about eliminating metrics entirely. They still have value for infrastructure-level monitoring where detailed information is unavailable, like CPU utilization and OS metrics. They're efficient for capacity planning and catching broad systemic issues.
The shift happens when you need to understand why something broke in complex distributed systems.
"The real answer is combining metrics for high-level indicators and observability for deep contextual investigation," Li explains. "You need to recognize where you need richer data to maintain reliability at scale."
What This Means for Engineering Teams
Moving from metrics-first to direct data analysis changes daily workflows. Instead of jumping between dashboards and guessing which metric might be relevant, engineers can start with the actual data behind the behavior.
This makes troubleshooting more investigative and less reactive. Teams spend less time maintaining dashboards and more time understanding system behavior.
But it requires discipline. Raw data collection demands good instrumentation practices. Schema drift can break downstream pipelines or alerting rules, and important log and trace messages need to be protected against accidental changes, for example with unit tests like the one sketched below.
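One lightweight way to apply that discipline is a contract test over the log lines your alerts and pipelines depend on. The log format and field names below are made up, but the pattern is general: if someone rewords the message, the test fails before the alert silently stops matching.

```python
import logging
import unittest

def log_payment_failure(logger: logging.Logger, order_id: str, reason: str) -> None:
    """Emits a log line that a (hypothetical) alerting rule matches on."""
    logger.error("payment_failed order_id=%s reason=%s", order_id, reason)

class PaymentLogContractTest(unittest.TestCase):
    """Guards the fields the downstream pipeline and alert pattern rely on."""

    def test_payment_failure_log_keeps_alertable_fields(self):
        with self.assertLogs("payments", level="ERROR") as captured:
            log_payment_failure(logging.getLogger("payments"), "o-123", "card_declined")
        message = captured.output[0]
        self.assertIn("payment_failed", message)
        self.assertIn("order_id=o-123", message)
        self.assertIn("reason=card_declined", message)

if __name__ == "__main__":
    unittest.main()
```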
AI as the Investigative Partner
AI is becoming crucial for navigating raw telemetry data. But not in the way you might expect.
"The power of AI lies in the capability to explore the data in steps to narrow down the root cause," Li says. AI can launch an initial query to aggregate results, then drill deeper based on what it finds, repeating the process iteratively.
Human engineers could follow the same routine, but it's time-consuming and difficult to execute rigorously. AI replicates the investigative process a skilled engineer would take during an incident.
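In code terms, that loop might look roughly like the sketch below. The `run_query` callable and the row shape are hypothetical stand-ins, not a real API; the point is the structure: aggregate broadly, pick the most anomalous slice, narrow the filters, and repeat.

```python
def investigate(symptom: str, run_query, max_steps: int = 5) -> list[dict]:
    """Iteratively narrow down a root cause over raw telemetry.
    `run_query` is a hypothetical callable that executes an aggregation with
    the given filters and returns candidate slices as dicts with
    "dimension", "value", and "error_rate" keys. Every step is recorded so an
    engineer can replay or modify the reasoning path."""
    steps = []
    scope = {}  # filters accumulated so far, e.g. {"service": "checkout"}
    for _ in range(max_steps):
        rows = run_query(symptom, filters=scope, group_by="auto")
        steps.append({"filters": dict(scope), "rows": rows})
        suspect = max(rows, key=lambda r: r["error_rate"], default=None)
        if suspect is None or suspect["error_rate"] < 0.01:
            break  # nothing anomalous left: stop narrowing
        scope[suspect["dimension"]] = suspect["value"]  # drill into that slice
    return steps  # the full investigation, step by step
```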
For trustworthiness, Observe presents every step of the reasoning path. Engineers can modify queries and conduct their own analysis to confirm AI findings.
The Migration Question
Legacy observability vendors struggle with this shift because their rigid data models don't fit the diverse schemas of raw telemetry. They also lack capabilities for advanced analytics, such as joins and window functions over massive datasets.
Li's advice for engineering leaders considering a migration? "Think back to a time when you were troubleshooting a tough production issue, only to realize the most relevant raw logs weren't captured because of a cost-saving decision."
Modern cloud economics have eliminated that tradeoff. The question isn't whether you can afford to store raw telemetry data. It's whether you can afford not to.