Prometheus has become the de facto standard for metrics collection in cloud-native environments. Its pull-based model, powerful query language, and deep Kubernetes integration make it an obvious choice for platform teams. But as organizations scale — more services, more replicas, more labels — Prometheus starts showing cracks. Queries slow down, memory usage balloons, and what was once a reliable monitoring backbone becomes an operational liability. This article examines exactly why that happens and what you can do about it, from quick tactical fixes to full architectural overhauls.
The Cardinality Problem: Why It Kills Prometheus
Cardinality is the single most important concept to understand when troubleshooting Prometheus scalability. In the context of time series databases, cardinality refers to the total number of unique label combinations that exist across all your metrics. Every unique combination creates a distinct time series, and Prometheus must store, index, and query each of them independently.
Consider a simple HTTP request counter: http_requests_total. If you label it with method (GET, POST, PUT, DELETE), status_code (200, 201, 400, 404, 500, 503), and endpoint (50 distinct API paths), you already have 4 × 6 × 50 = 1,200 time series from a single metric. Now add a customer_id label with 10,000 distinct values. You have just created 12 million time series from one counter.
This is the cardinality explosion pattern, and it is the most common cause of Prometheus degradation in production. The problem is compounded by labels that have unbounded or high-entropy values:
- User IDs or session tokens embedded in labels
- Request IDs or trace IDs (effectively infinite cardinality)
- Pod names without proper aggregation, especially in autoscaling environments
- Free-form error messages or SQL query strings
- IP addresses, particularly in environments with high churn
The relationship between cardinality and memory consumption is roughly linear in the number of active series, but each series carries non-trivial fixed overhead in the index structures. Prometheus stores its head block (the most recent data) entirely in memory. Each time series in the head block requires approximately 3–4 KB of RAM for the series itself plus index entries, so an instance with 1 million active time series will typically consume 4–6 GB of RAM for the head block alone, before accounting for query processing overhead.
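A quick way to see where series are coming from is to query Prometheus about itself. These queries are illustrative sketches; note that the `{__name__=~".+"}` selector touches every series in the database, so run it sparingly on an already-stressed instance, and `customer_id` is the hypothetical label from the example above:

```promql
# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Number of distinct values of a suspect label on one metric
count(count by (customer_id) (http_requests_total))

# Total active series in the head block
prometheus_tsdb_head_series
```

The `/api/v1/status/tsdb` endpoint returns similar cardinality statistics (highest-cardinality metric names and label pairs) without the cost of a full series scan.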
Memory Explosion Patterns and Real Symptoms
Memory issues in Prometheus rarely announce themselves cleanly. Instead, they manifest through a cascade of symptoms that are easy to misdiagnose. Understanding the failure modes helps you identify the root cause faster and apply the right remedy.
The Head Block Growth Pattern
Prometheus keeps a two-hour window of data in memory as the head block before compacting it to disk. If your series count grows continuously, which happens when pod churn creates new series faster than old ones expire, the head block never shrinks. You can monitor this directly with prometheus_tsdb_head_series and prometheus_tsdb_head_chunks. On a healthy instance these numbers plateau; with a cardinality problem they grow monotonically until the process OOMs.
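A minimal sketch of queries for watching head block growth; the lookback and projection windows here are arbitrary choices, not recommendations:

```promql
# Should plateau on a healthy instance, not climb monotonically
prometheus_tsdb_head_series

# Linear projection of series count four hours out, based on the last six hours
predict_linear(prometheus_tsdb_head_series[6h], 4 * 3600)

# Churn: how quickly brand-new series are being created
rate(prometheus_tsdb_head_series_created_total[5m])
```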
Query Timeout Cascades
As series count grows, even well-written PromQL queries that worked fine at 100k series become unbearably slow at 1M. Grafana dashboards start timing out, alert evaluation lags behind schedule, and Alertmanager begins receiving delayed or duplicated firing alerts. The prometheus_rule_evaluation_duration_seconds metric is a reliable early warning — when p99 evaluation time for your recording rules exceeds your evaluation interval, you have a problem.
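One way to alert on this condition is to compare each rule group's last evaluation duration against its own configured interval, using Prometheus's self-monitoring metrics; the `for` duration and severity below are illustrative:

```yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: RuleGroupEvaluationTooSlow
        # Fires when a group takes longer to evaluate than its own interval,
        # meaning evaluations are falling behind schedule
        expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} cannot keep up with its evaluation interval"
```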
Scrape Failures Under Memory Pressure
When Prometheus is under heavy memory pressure, its Go garbage collector spends more time collecting, which introduces latency into the scrape loop. Scrapes begin timing out, causing gaps in your data. This creates a deceptive situation: the gaps appear precisely when your system is under stress, exactly when you need monitoring most. Watch for intermittent drops of the up metric and rising scrape_duration_seconds values to catch these patterns.
Compaction Pressure
High cardinality also stresses the TSDB compaction process. Prometheus compacts head block data into persistent blocks every two hours. With millions of series, compaction can take tens of seconds to minutes, during which write performance degrades. prometheus_tsdb_compaction_duration_seconds rising above 30 seconds is a warning sign. Compaction failures leave orphaned blocks on disk, gradually consuming storage and potentially corrupting the TSDB if left unaddressed.
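A sketch of alerts covering both symptoms; the 30-second threshold mirrors the warning sign above, and since prometheus_tsdb_compaction_duration_seconds is a histogram, the duration check goes through histogram_quantile:

```yaml
groups:
  - name: tsdb-compaction-health
    rules:
      - alert: CompactionSlow
        # 90th percentile compaction duration over the last 6h exceeds 30s
        expr: |
          histogram_quantile(0.9,
            rate(prometheus_tsdb_compaction_duration_seconds_bucket[6h])
          ) > 30
        for: 1h
        labels:
          severity: warning
      - alert: CompactionFailures
        # Any failed compaction deserves immediate attention
        expr: increase(prometheus_tsdb_compactions_failed_total[3h]) > 0
        labels:
          severity: critical
```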
Short-Term Fixes: Tactical Remediation
When you are dealing with a Prometheus instance under active stress, you need immediate relief before you can implement architectural changes. These techniques can be applied quickly and provide meaningful headroom while longer-term solutions are planned.
Recording Rules: Pre-Computing Aggregations
Recording rules are the most underutilized tool in the Prometheus toolbox. They allow you to pre-compute expensive PromQL expressions and store the results as new time series. The key benefit for scalability is that you can aggregate away high-cardinality dimensions, dramatically reducing the number of series that dashboards and alerts need to query at runtime.
Consider an example where you have per-pod HTTP request rates with labels for pod, namespace, service, method, and status_code. Your dashboards mostly need service-level aggregations, not per-pod breakdowns. A recording rule can produce that aggregation once per evaluation interval:
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: |
          sum by (job, namespace, method, status_code) (
            rate(http_requests_total[5m])
          )
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, namespace, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, status_code) (
            rate(http_requests_total[5m])
          )
Notice that the pod label is dropped in all three rules. If you had 500 pods, you have just reduced the cardinality of these series by a factor of 500. Dashboards querying job:http_requests_total:rate5m instead of computing rate(http_requests_total[5m]) on the fly will return results orders of magnitude faster.
The naming convention level:metric:operations is the Prometheus community standard. Following it consistently makes recording rules self-documenting and helps teams understand the aggregation level at a glance.
Metric Dropping via Relabeling
Relabeling gives you surgical control over what metrics Prometheus actually ingests. There are two stages where relabeling applies: relabel_configs (applied before scraping, based on target metadata) and metric_relabel_configs (applied after scraping, based on scraped metric names and labels). For cardinality control, metric_relabel_configs is your primary tool.
Dropping entire metric families that you do not use is the most impactful change you can make. Many exporters emit dozens of metrics that are irrelevant for most use cases:
scrape_configs:
  - job_name: kubernetes-pods
    metric_relabel_configs:
      # Drop metrics we never query
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*|process_.*'
        action: drop
      # Strip a high-cardinality label while keeping the metric
      # (replacing a label with '' removes it from the series)
      - source_labels: [__name__, pod]
        regex: 'http_requests_total;.*'
        target_label: pod
        replacement: ''
      # Drop unneeded histogram buckets. Note: action 'keep' would be wrong
      # here, since it discards every non-matching series in the scrape;
      # 'drop' removes only the listed buckets (bucket values illustrative)
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(0.005|0.01|0.025)'
        action: drop
      # Replace high-cardinality endpoint paths with normalized versions
      - source_labels: [endpoint]
        regex: '/api/v1/users/[0-9]+'
        target_label: endpoint
        replacement: '/api/v1/users/:id'
Be careful with metric_relabel_configs: they are applied to every scraped sample, so computationally expensive regex patterns on high-frequency scrapes add measurable CPU overhead. Test patterns before rollout, and note that Prometheus uses RE2 (which does not backtrack) and anchors relabel regexes at both ends automatically, so 'http.*' matches only names beginning with http, not names merely containing it.
Cardinality Limits as a Safety Net
Prometheus 2.x introduced per-scrape sample limits as a defensive mechanism. These do not solve cardinality problems but prevent a single misbehaving exporter from taking down your entire Prometheus instance:
global:
  # Global default limit across all scrapes (0 = no limit)
  sample_limit: 0

scrape_configs:
  - job_name: application-pods
    # Reject scrapes that return more than 50k samples
    sample_limit: 50000
    # Maximum number of labels per scraped sample
    label_limit: 64
    # Limit label name and value lengths
    label_name_length_limit: 256
    label_value_length_limit: 1024
    kubernetes_sd_configs:
      - role: pod
When a scrape exceeds sample_limit, Prometheus rejects the entire scrape and marks the target as having failed. This is a hard circuit breaker, not a graceful degradation — the target’s up metric goes to 0. Set limits conservatively above your expected maximum to avoid false positives, and alert on prometheus_target_scrapes_exceeded_sample_limit_total > 0.
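An alert along these lines (group name and windows are illustrative) catches targets that trip the circuit breaker:

```yaml
groups:
  - name: scrape-limit-guard
    rules:
      - alert: ScrapeSampleLimitExceeded
        # Counter increments each time a scrape is rejected for
        # returning more samples than its configured sample_limit
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "A scrape exceeded its sample_limit and was rejected"
```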
Architectural Solutions: Federation and Remote Write
Once you have exhausted tactical optimizations or when your scale genuinely exceeds what a single Prometheus instance can handle, architectural changes become necessary. Prometheus offers two built-in mechanisms for scaling horizontally: federation and remote_write.
Federation: Hierarchical Scraping
Prometheus federation allows one Prometheus instance to scrape aggregated metrics from other Prometheus instances via the /federate endpoint. In a typical setup, leaf-level Prometheus instances collect raw metrics from targets, while a global Prometheus instance federates pre-aggregated recording rule results from the leaves.
# Global Prometheus configuration federating from regional instances
scrape_configs:
  - job_name: federate-regions
    scrape_interval: 15s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Only federate pre-aggregated recording rule metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        - '{__name__=~"cluster:.*"}'
        # Federate key infrastructure signals
        - 'up{job="kubernetes-apiservers"}'
    static_configs:
      - targets:
          - prometheus-eu-west.monitoring.svc:9090
          - prometheus-us-east.monitoring.svc:9090
          - prometheus-ap-south.monitoring.svc:9090
Federation works well for multi-region global dashboards and cross-cluster alerting on aggregated signals. Its limitations are significant, though: the /federate endpoint is a point-in-time snapshot, so you cannot run range queries against federated data effectively. It also creates a single point of failure at the global layer and does not provide true long-term storage. For those requirements, remote_write is the better path.
Remote Write: Streaming to Durable Storage
Remote write allows Prometheus to stream all ingested samples to an external storage backend in real time. The external backend handles long-term retention, multi-tenancy, and global query federation. Prometheus itself becomes a stateless collection agent that maintains only a short local retention window for resilience against network outages.
remote_write:
  - url: https://thanos-receive.monitoring.svc:19291/api/v1/receive
    # Authentication for the remote endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/remote-write-password
    # Tune the write queue for throughput vs. latency
    queue_config:
      # Number of shards (parallel write connections)
      max_shards: 200
      min_shards: 1
      # Samples to batch before flushing
      max_samples_per_send: 500
      # Time to wait before flushing an incomplete batch
      batch_send_deadline: 5s
      # In-memory buffer capacity per shard
      capacity: 2500
      # Retry backoff for failed writes
      min_backoff: 30ms
      max_backoff: 5s
    # Metadata configuration
    metadata_config:
      send: true
      send_interval: 1m
    # Filter what gets remote-written (reduce egress)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop
The queue_config tuning is critical and frequently misunderstood. Each shard maintains its own connection to the remote endpoint and its own in-memory queue. Increasing max_shards increases parallelism and throughput but also increases memory consumption and load on the remote endpoint. The right values depend heavily on your sample ingestion rate and network latency to the remote endpoint. Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds versus prometheus_remote_storage_highest_timestamp_in_seconds — the lag between them tells you how far behind your remote write queue is.
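That lag can be expressed directly in PromQL; the label-matching pattern below mirrors the one used in the upstream Prometheus monitoring mixin, since the sent-timestamp metric carries remote_name and url labels that the ingestion-timestamp metric does not:

```promql
# Seconds the remote write queue is behind ingestion, per remote endpoint
(
  prometheus_remote_storage_highest_timestamp_in_seconds
- ignoring(remote_name, url) group_right
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds
)
```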
Long-Term Solutions: Thanos vs Grafana Mimir vs VictoriaMetrics
For production systems that need long-term storage, global query capability, high availability, and genuine horizontal scalability, purpose-built solutions are the right answer. Three projects dominate this space: Thanos, Grafana Mimir, and VictoriaMetrics. They share similar goals but differ significantly in architecture, operational complexity, and trade-offs.
| Criterion | Thanos | Grafana Mimir | VictoriaMetrics |
|---|---|---|---|
| Architecture | Sidecar + object store; modular components | Fully distributed; Cortex-derived microservices | Single binary or cluster mode |
| Storage backend | Any S3-compatible object store | Any S3-compatible object store | Own TSDB format on local or object store |
| PromQL compatibility | Full PromQL; own query engine | Full PromQL; Mimir-specific extensions | MetricsQL (PromQL superset) |
| Operational complexity | Medium — multiple components, each simple | High — many microservices with complex config | Low — minimal components, simple config |
| Ingest scalability | Scales via Thanos Receive fan-out | Horizontally scalable distributors + ingesters | Excellent; handles millions of samples/sec per node |
| Query performance | Good; Store Gateway caches object store data | Good; query sharding and caching built in | Excellent; highly optimized query engine |
| Multi-tenancy | Limited; tenant isolation via external labels | Native; per-tenant limits and isolation | Enterprise only; basic in cluster mode |
| Deduplication | Built-in; replica dedup at query time | Built-in; ingest-time and query-time dedup | Built-in; dedup with downsampling |
| Downsampling | Yes; Thanos Compactor handles it | Yes; configurable per tenant | Enterprise feature; configured via -downsampling.period |
| License | Apache 2.0 (fully open source) | AGPL-3.0 (open source) + enterprise tier | Apache 2.0 (community); proprietary enterprise |
| Best fit | Teams already running Prometheus wanting minimal disruption | Large orgs needing multi-tenant SaaS-grade monitoring | Teams prioritizing simplicity and raw performance |
Thanos: The Incremental Path
Thanos integrates with existing Prometheus deployments through a sidecar process that runs alongside each Prometheus pod. The sidecar uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API that Thanos Query uses to federate queries across all Prometheus instances plus historical data in the object store. This makes Thanos the lowest-friction path for teams with existing Prometheus infrastructure.
Thanos Receive is an alternative ingest path that accepts remote_write directly, which is useful when you want to decouple Prometheus instances from the query layer or implement active-active HA without relying on Prometheus replication. Thanos Compactor handles block compaction and downsampling on the object store, creating 5-minute and 1-hour resolution downsamples automatically for efficient long-range queries.
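A sketch of what the sidecar wiring looks like in a Kubernetes pod spec; the image tag, mount paths, and objstore secret layout are assumptions for illustration, not a canonical deployment:

```yaml
# Added to the containers list of an existing Prometheus pod/StatefulSet
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.34.0   # hypothetical version pin
  args:
    - sidecar
    - --tsdb.path=/prometheus            # must match Prometheus's storage path
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/objstore.yml
    - --grpc-address=0.0.0.0:10901       # Store API consumed by Thanos Query
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: thanos-objstore-config
      mountPath: /etc/thanos
```

The same data volume is mounted read-only into the sidecar so it can upload completed two-hour TSDB blocks as Prometheus finishes them.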
Grafana Mimir: Enterprise-Grade Multi-Tenancy
Mimir is a fork of Cortex, rewritten by Grafana Labs to address operational complexity issues in Cortex’s architecture. It follows the same microservices pattern — Distributor, Ingester, Querier, Query Frontend, Store Gateway, Compactor, Ruler — but with significantly improved defaults and a monolithic deployment mode that simplifies small-scale deployments. Mimir’s headline feature is native multi-tenancy.