Why Metrics Might Be Holding Your Observability Strategy Back
Metrics have been the foundation of observability for decades. But what if that foundation is cracked?
Ang Li, director of engineering at Observe, argues that metrics are a relic of outdated infrastructure constraints. They were necessary when systems couldn't handle large volumes of log data. Now, they might be doing more harm than good.
"Metrics force you to predefine what you think will matter before an incident even happens," Li explains. "If you guess wrong, you're effectively blind."
The Problem with Predicting Problems
The core issue isn't just philosophical. It's architectural.
Metrics require teams to decide upfront which labels, aggregation windows, and dimensions matter. But modern distributed systems are unpredictable. A metric might tell you API latency spiked, but it won't tell you that the spike only affected requests from a specific region with a particular data pattern.
That context gets stripped during roll-up. The user, request, payload, deployment, or dependency that triggered the failure disappears. You're left with a number that says something is wrong, but not why.
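A toy example makes the loss concrete. The field names below are hypothetical, but the pattern is the one Li describes: the rolled-up number confirms a problem exists, while the raw events still carry the dimensions that explain it.

```python
from collections import defaultdict

# Raw request events carry the full context of each call.
# (Field names are invented for illustration, not any vendor's schema.)
raw_events = [
    {"latency_ms": 48,   "region": "us-east-1", "payload_kind": "small"},
    {"latency_ms": 52,   "region": "us-east-1", "payload_kind": "small"},
    {"latency_ms": 1980, "region": "eu-west-2", "payload_kind": "bulk-import"},
    {"latency_ms": 2100, "region": "eu-west-2", "payload_kind": "bulk-import"},
]

# A pre-aggregated metric keeps only the dimensions chosen up front.
# Rolled up globally, latency says "something is slow" but not for whom.
avg_latency = sum(e["latency_ms"] for e in raw_events) / len(raw_events)
print(f"avg latency: {avg_latency:.0f} ms")

# With the raw events, the culprit is one group-by away.
worst_by_slice = defaultdict(int)
for e in raw_events:
    key = (e["region"], e["payload_kind"])
    worst_by_slice[key] = max(worst_by_slice[key], e["latency_ms"])
print(worst_by_slice)  # the eu-west-2 bulk imports stand out immediately
```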
Li recalls troubleshooting out-of-memory errors at Observe. Engineers stared at pages of garbage collection metrics without clear insight. "Having individual request traces and logs is still critical to identify the culprit requests that brought down the system," he says.
How Cloud Economics Changed the Game
The technical landscape has shifted dramatically. Cloud infrastructure now makes it economically viable to store and analyze raw telemetry data at scale.
Observe's architecture centers on treating all telemetry as structured event data. They use Amazon S3 for storage with 10x compression, providing low-cost long-term retention for petabytes of data. Above that sits what they call the O11y Knowledge Graph, which transforms raw data into recognizable entities like customers, shopping carts, pods, and containers.
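As a rough mental model (this sketch is illustrative, not Observe's actual data model), the graph-building step can be pictured as parsing raw events and linking the entities they mention:

```python
import json

# Illustrative only: raw telemetry arrives as loosely structured events.
raw_lines = [
    '{"kind": "k8s", "pod": "checkout-7f9c", "container": "app", "msg": "OOMKilled"}',
    '{"kind": "app", "pod": "checkout-7f9c", "customer_id": "c-1842", "cart_id": "cart-9931", "msg": "add_item"}',
]

# A tiny "knowledge graph": entities keyed by (type, id), plus edges between them.
entities: dict[tuple[str, str], dict] = {}
edges: list[tuple[tuple[str, str], tuple[str, str]]] = []

for line in raw_lines:
    event = json.loads(line)
    pod = ("pod", event["pod"])
    entities.setdefault(pod, {})
    if "container" in event:
        container = ("container", event["container"])
        entities.setdefault(container, {})
        edges.append((pod, container))
    if "customer_id" in event:
        customer = ("customer", event["customer_id"])
        cart = ("cart", event["cart_id"])
        entities.setdefault(customer, {})
        entities.setdefault(cart, {})
        edges.extend([(customer, cart), (cart, pod)])

print(list(entities))  # pods, containers, customers, carts
print(edges)           # how the raw events tie them together
```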
This approach required solving three key engineering challenges: scalable ingestion that adjusts to dynamic volume peaks, flexible transformation pipelines that model different system aspects, and intelligent query parallelization that handles 10GB the same way it handles 10TB.
"The customer should not care whether they are querying 10GB of data or 10TB," Li says. "The results should return in similar time frames."
Making Raw Data Analysis Viable
Processing raw telemetry instead of pre-aggregated metrics sounds computationally expensive. But several optimizations make it practical.
Columnar databases like Snowflake enable higher compression ratios than row-based systems, which reduces both storage costs and query scanning overhead. Observe also skips indexing by default, sidestepping the index-maintenance overhead that makes search-based tools like Splunk prohibitively expensive at scale.
Aggressive query caching reuses existing results. Dashboard refreshes only re-query the changed portions, saving significant computational costs.
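A simplified sketch of that kind of incremental caching, assuming time-bucketed dashboard panels (the bucket size, cache policy, and query stub below are invented for illustration):

```python
import time

# Cache keyed by (query text, time bucket). On a refresh, only buckets that are
# not yet cached (typically just the newest one) get re-queried.
CACHE: dict[tuple[str, int], int] = {}
BUCKET_SECONDS = 300  # five-minute buckets

def run_query_for_bucket(query: str, bucket_start: int) -> int:
    """Stand-in for the expensive scan of raw events in one time bucket."""
    return hash((query, bucket_start)) % 1000  # fake result

def dashboard_panel(query: str, window_seconds: int = 3600) -> list[int]:
    now = int(time.time())
    start = now - window_seconds
    results = []
    for bucket_start in range(start - start % BUCKET_SECONDS, now, BUCKET_SECONDS):
        key = (query, bucket_start)
        if key not in CACHE:  # cache miss: re-query only this slice
            CACHE[key] = run_query_for_bucket(query, bucket_start)
        results.append(CACHE[key])
    return results

# Two refreshes in a row: the second recomputes at most the newest bucket.
# (A real system would also invalidate the still-open bucket as data arrives.)
dashboard_panel("count errors by service")
dashboard_panel("count errors by service")
```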
The Role of Metrics Going Forward
This isn't about eliminating metrics entirely. They still have value for infrastructure-level monitoring where detailed information is unavailable, like CPU utilization and OS metrics. They're efficient for capacity planning and catching broad systemic issues.
The shift happens when you need to understand why something broke in complex distributed systems.
"The real answer is combining metrics for high-level indicators and observability for deep contextual investigation," Li explains. "You need to recognize where you need richer data to maintain reliability at scale."
What This Means for Engineering Teams
Moving from metrics-first to direct data analysis changes daily workflows. Instead of jumping between dashboards and guessing which metric might be relevant, engineers can start with the actual data behind the behavior.
This makes troubleshooting more investigative and less reactive. Teams spend less time maintaining dashboards and more time understanding system behavior.
But it requires discipline. Raw data collection demands good instrumentation practices. Schema drift can break downstream pipelines or alerting rules, and important log and trace messages need to be protected against accidental changes, for example with unit tests like the one sketched below.
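One lightweight way to apply that discipline is a contract test over the log lines your alerts and pipelines depend on. The log format and field names below are made up, but the pattern is general: if someone rewords the message, the test fails before the alert silently stops matching.

```python
import logging
import unittest

def log_payment_failure(logger: logging.Logger, order_id: str, reason: str) -> None:
    """Emits a log line that a (hypothetical) alerting rule matches on."""
    logger.error("payment_failed order_id=%s reason=%s", order_id, reason)

class PaymentLogContractTest(unittest.TestCase):
    """Guards the fields the downstream pipeline and alert pattern rely on."""

    def test_payment_failure_log_keeps_alertable_fields(self):
        with self.assertLogs("payments", level="ERROR") as captured:
            log_payment_failure(logging.getLogger("payments"), "o-123", "card_declined")
        message = captured.output[0]
        self.assertIn("payment_failed", message)
        self.assertIn("order_id=o-123", message)
        self.assertIn("reason=card_declined", message)

if __name__ == "__main__":
    unittest.main()
```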
AI as the Investigative Partner
AI is becoming crucial for navigating raw telemetry data. But not in the way you might expect.
"The power of AI lies in the capability to explore the data in steps to narrow down the root cause," Li says. AI can launch an initial query to aggregate results, then drill deeper based on what it finds, repeating the process iteratively.
Human engineers could follow the same routine, but it's time-consuming and difficult to execute rigorously. AI replicates the investigative process a skilled engineer would take during an incident.
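In code terms, that loop might look roughly like the sketch below. The `run_query` callable and the row shape are hypothetical stand-ins, not a real API; the point is the structure: aggregate broadly, pick the most anomalous slice, narrow the filters, and repeat.

```python
def investigate(symptom: str, run_query, max_steps: int = 5) -> list[dict]:
    """Iteratively narrow down a root cause over raw telemetry.
    `run_query` is a hypothetical callable that executes an aggregation with
    the given filters and returns candidate slices as dicts with
    "dimension", "value", and "error_rate" keys. Every step is recorded so an
    engineer can replay or modify the reasoning path."""
    steps = []
    scope = {}  # filters accumulated so far, e.g. {"service": "checkout"}
    for _ in range(max_steps):
        rows = run_query(symptom, filters=scope, group_by="auto")
        steps.append({"filters": dict(scope), "rows": rows})
        suspect = max(rows, key=lambda r: r["error_rate"], default=None)
        if suspect is None or suspect["error_rate"] < 0.01:
            break  # nothing anomalous left: stop narrowing
        scope[suspect["dimension"]] = suspect["value"]  # drill into that slice
    return steps  # the full investigation, step by step
```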
For trustworthiness, Observe presents every step of the reasoning path. Engineers can modify queries and conduct their own analysis to confirm AI findings.
The Migration Question
Legacy observability vendors struggle with this shift because their rigid data models don't fit the diverse schemas of raw telemetry. They also lack capabilities for advanced analytics, such as joins and window functions over massive datasets.
Li's advice for engineering leaders considering a migration? "Think back to a time when you were troubleshooting a tough production issue, only to realize the most relevant raw logs weren't captured because of a cost-saving decision."
Modern cloud economics have eliminated that tradeoff. The question isn't whether you can afford to store raw telemetry data. It's whether you can afford not to.