Prometheus Scalability: High Cardinality and How to Fix It

Originally published at alexandre-vazquez.com · 11 min read

Prometheus has become the de facto standard for metrics collection in cloud-native environments. Its pull-based model, powerful query language, and deep Kubernetes integration make it an obvious choice for platform teams. But as organizations scale — more services, more replicas, more labels — Prometheus starts showing cracks. Queries slow down, memory usage balloons, and what was once a reliable monitoring backbone becomes an operational liability. This article examines exactly why that happens and what you can do about it, from quick tactical fixes to full architectural overhauls.

The Cardinality Problem: Why It Kills Prometheus

Cardinality is the single most important concept to understand when troubleshooting Prometheus scalability. In the context of time series databases, cardinality refers to the total number of unique label combinations that exist across all your metrics. Every unique combination creates a distinct time series, and Prometheus must store, index, and query each of them independently.

Consider a simple HTTP request counter: http_requests_total. If you label it with method (GET, POST, PUT, DELETE), status_code (200, 201, 400, 404, 500, 503), and endpoint (50 distinct API paths), you already have 4 × 6 × 50 = 1,200 time series from a single metric. Now add a customer_id label with 10,000 distinct values. You have just created 12 million time series from one counter.

This is the cardinality explosion pattern, and it is the most common cause of Prometheus degradation in production. The problem is compounded by labels that have unbounded or high-entropy values:

  • User IDs or session tokens embedded in labels
  • Request IDs or trace IDs (effectively infinite cardinality)
  • Pod names without proper aggregation, especially in autoscaling environments
  • Free-form error messages or SQL query strings
  • IP addresses, particularly in environments with high churn
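Before changing anything, identify which metrics are actually responsible. The diagnostic queries below are a common starting sketch; the topk argument of 10 is an arbitrary choice, and the first query touches every series, so run it ad hoc rather than in a dashboard:

```promql
# Top 10 metric names by active series count (expensive; run ad hoc)
topk(10, count by (__name__) ({__name__=~".+"}))

# Which label is exploding a specific metric? Count distinct values per label.
count(count by (customer_id) (http_requests_total))

# Series per scrape job, to locate the offending workload
topk(10, count by (job) ({__name__=~".+"}))
```

Prometheus also exposes a cardinality summary at /api/v1/status/tsdb, which reports the highest-cardinality metric names and label pairs without running a query at all.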

The relationship between cardinality and resource consumption is roughly linear, but each series carries significant fixed overhead in the in-memory index structures. Prometheus stores its head block (the most recent data) entirely in memory, and each time series in the head block requires approximately 3–4 KB of RAM for the series itself plus its index entries. An instance with 1 million active time series will therefore typically consume 4–6 GB of RAM for the head block alone, before accounting for query processing overhead.
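You can sanity-check that estimate against a live instance using Prometheus's own self-metrics. A rough sketch, assuming the 3–4 KB per-series figure above:

```promql
# Active series currently in the head block
prometheus_tsdb_head_series

# Back-of-the-envelope head memory estimate in bytes (~3.5 KB/series assumed)
prometheus_tsdb_head_series * 3500

# Compare against the Go heap the server actually holds
go_memstats_heap_inuse_bytes{job="prometheus"}
```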

Memory Explosion Patterns and Real Symptoms

Memory issues in Prometheus rarely announce themselves cleanly. Instead, they manifest through a cascade of symptoms that are easy to misdiagnose. Understanding the failure modes helps you identify the root cause faster and apply the right remedy.

The Head Block Growth Pattern

Prometheus keeps a two-hour window of data in memory as the head block before compacting it to disk. If your series count grows continuously, which happens when pod churn creates new series faster than old ones go stale, the head block never shrinks. You can monitor this directly with prometheus_tsdb_head_series and prometheus_tsdb_head_chunks. On a healthy instance the series count plateaus; with a cardinality problem it grows monotonically until the process is OOM-killed.
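An alert on sustained head-series growth catches this pattern well before the OOM kill. A minimal sketch; the 1M floor, 20% growth factor, and 6h window are illustrative thresholds, not recommendations:

```yaml
groups:
  - name: prometheus-cardinality
    rules:
      - alert: PrometheusHeadSeriesGrowing
        # Fires when the head block holds over 1M series and has grown
        # more than 20% in the last 6 hours
        expr: |
          prometheus_tsdb_head_series > 1e6
          and
          prometheus_tsdb_head_series > 1.2 * (prometheus_tsdb_head_series offset 6h)
        for: 30m
        labels:
          severity: warning
```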

Query Timeout Cascades

As series count grows, even well-written PromQL queries that worked fine at 100k series become unbearably slow at 1M. Grafana dashboards start timing out, alert evaluation lags behind schedule, and Alertmanager begins receiving delayed or duplicated firing alerts. The prometheus_rule_evaluation_duration_seconds metric is a reliable early warning — when p99 evaluation time for your recording rules exceeds your evaluation interval, you have a problem.
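Prometheus also exposes per-rule-group timing metrics that make this condition directly alertable. A sketch comparing each group's last evaluation duration to its own configured interval:

```yaml
- alert: PrometheusRuleGroupTooSlow
  # Evaluation takes longer than the group's own interval, so rules fall behind
  expr: |
    prometheus_rule_group_last_duration_seconds
      > prometheus_rule_group_interval_seconds
  for: 10m
  labels:
    severity: warning
```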

Scrape Failures Under Memory Pressure

When Prometheus is under heavy memory pressure, the Go garbage collector spends more time collecting, which adds latency to the scrape loop. Scrapes begin timing out, leaving gaps in your data. This is deceptive: the gaps appear precisely when your system is under stress, exactly when you need monitoring most. Watch for drops in the up metric and for increases in prometheus_target_scrapes_exceeded_sample_limit_total.

Compaction Pressure

High cardinality also stresses the TSDB compaction process. Prometheus compacts head block data into persistent blocks every two hours. With millions of series, compaction can take tens of seconds to minutes, during which write performance degrades. prometheus_tsdb_compaction_duration_seconds rising above 30 seconds is a warning sign. Compaction failures leave orphaned blocks on disk, gradually consuming storage and potentially corrupting the TSDB if left unaddressed.
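Both symptoms are alertable. A sketch using the compaction duration histogram and the failure counter; the 30-second threshold mirrors the rule of thumb above:

```yaml
- alert: PrometheusCompactionSlow
  # p90 compaction duration over the last hour exceeds 30s
  expr: |
    histogram_quantile(0.9,
      rate(prometheus_tsdb_compaction_duration_seconds_bucket[1h])) > 30
  labels:
    severity: warning

- alert: PrometheusCompactionsFailing
  expr: increase(prometheus_tsdb_compactions_failed_total[3h]) > 0
  labels:
    severity: critical
```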

Short-Term Fixes: Tactical Remediation

When you are dealing with a Prometheus instance under active stress, you need immediate relief before you can implement architectural changes. These techniques can be applied quickly and provide meaningful headroom while longer-term solutions are planned.

Recording Rules: Pre-Computing Aggregations

Recording rules are the most underutilized tool in the Prometheus toolbox. They allow you to pre-compute expensive PromQL expressions and store the results as new time series. The key benefit for scalability is that you can aggregate away high-cardinality dimensions, dramatically reducing the number of series that dashboards and alerts need to query at runtime.

Consider an example where you have per-pod HTTP request rates with labels for pod, namespace, service, method, and status_code. Your dashboards mostly need service-level aggregations, not per-pod breakdowns. A recording rule can produce that aggregation once per evaluation interval:

groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: |
          sum by (job, namespace, method, status_code) (
            rate(http_requests_total[5m])
          )

      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, namespace, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, status_code) (
            rate(http_requests_total[5m])
          )

Notice that the pod label is dropped in all three rules. If you had 500 pods, you have just reduced the cardinality of these series by a factor of 500. Dashboards querying job:http_requests_total:rate5m instead of computing rate(http_requests_total[5m]) on the fly will return results orders of magnitude faster.

The naming convention level:metric:operations is the Prometheus community standard. Following it consistently makes recording rules self-documenting and helps teams understand the aggregation level at a glance.

Metric Dropping via Relabeling

Relabeling gives you surgical control over what metrics Prometheus actually ingests. There are two stages where relabeling applies: relabel_configs (applied before scraping, based on target metadata) and metric_relabel_configs (applied after scraping, based on scraped metric names and labels). For cardinality control, metric_relabel_configs is your primary tool.

Dropping entire metric families that you do not use is the most impactful change you can make. Many exporters emit dozens of metrics that are irrelevant for most use cases:

scrape_configs:
  - job_name: kubernetes-pods
    metric_relabel_configs:
      # Drop metrics we never query
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*|process_.*'
        action: drop

      # Drop high-cardinality label values while keeping the metric
      - source_labels: [__name__, pod]
        regex: 'http_requests_total;.*'
        target_label: pod
        replacement: ''

      # Thin out histogram buckets by dropping le values we never query.
      # (Using action: keep here would be wrong: keep discards every
      # non-matching series in the scrape, including all other metrics.
      # The le values below are illustrative.)
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(0.005|0.01|0.025|0.05)'
        action: drop

      # Replace high-cardinality endpoint paths with normalized versions
      - source_labels: [endpoint]
        regex: '/api/v1/users/[0-9]+'
        target_label: endpoint
        replacement: '/api/v1/users/:id'

Be careful with metric_relabel_configs — they are applied per scraped sample, so computationally expensive regex patterns across high-frequency scrapes can add CPU overhead. Test regex patterns and prefer anchored, non-backtracking expressions.

Cardinality Limits as a Safety Net

Prometheus 2.x introduced per-scrape sample limits as a defensive mechanism. These do not solve cardinality problems but prevent a single misbehaving exporter from taking down your entire Prometheus instance:

global:
  # Global limit across all scrapes
  sample_limit: 0  # 0 = no limit

scrape_configs:
  - job_name: application-pods
    # Reject scrapes that return more than 50k samples
    sample_limit: 50000

    # Limit the number of labels allowed on any single scraped series
    label_limit: 64

    # Limit label name and value lengths
    label_name_length_limit: 256
    label_value_length_limit: 1024

    kubernetes_sd_configs:
      - role: pod

When a scrape exceeds sample_limit, Prometheus rejects the entire scrape and marks the target as having failed. This is a hard circuit breaker, not a graceful degradation — the target’s up metric goes to 0. Set limits conservatively above your expected maximum to avoid false positives, and alert on prometheus_target_scrapes_exceeded_sample_limit_total > 0.
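Since a tripped limit silences an entire target, the corresponding alert is worth wiring up immediately. A minimal sketch:

```yaml
- alert: PrometheusScrapeSampleLimitHit
  # A target returned more samples than its sample_limit and was dropped whole
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
  labels:
    severity: critical
```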

Architectural Solutions: Federation and Remote Write

Once you have exhausted tactical optimizations or when your scale genuinely exceeds what a single Prometheus instance can handle, architectural changes become necessary. Prometheus offers two built-in mechanisms for scaling horizontally: federation and remote_write.

Federation: Hierarchical Scraping

Prometheus federation allows one Prometheus instance to scrape aggregated metrics from other Prometheus instances via the /federate endpoint. In a typical setup, leaf-level Prometheus instances collect raw metrics from targets, while a global Prometheus instance federates pre-aggregated recording rule results from the leaves.

# Global Prometheus configuration federating from regional instances
scrape_configs:
  - job_name: federate-regions
    scrape_interval: 15s
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        # Only federate pre-aggregated recording rule metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        - '{__name__=~"cluster:.*"}'
        # Federate key infrastructure alerts
        - 'up{job="kubernetes-apiservers"}'
    static_configs:
      - targets:
          - prometheus-eu-west.monitoring.svc:9090
          - prometheus-us-east.monitoring.svc:9090
          - prometheus-ap-south.monitoring.svc:9090

Federation works well for multi-region global dashboards and cross-cluster alerting on aggregated signals. Its limitations are significant, though: the /federate endpoint is a point-in-time snapshot, so you cannot run range queries against federated data effectively. It also creates a single point of failure at the global layer and does not provide true long-term storage. For those requirements, remote_write is the better path.

Remote Write: Streaming to Durable Storage

Remote write allows Prometheus to stream all ingested samples to an external storage backend in real time. The external backend handles long-term retention, multi-tenancy, and global query federation. Prometheus itself becomes a stateless collection agent that maintains only a short local retention window for resilience against network outages.

remote_write:
  - url: https://thanos-receive.monitoring.svc:19291/api/v1/receive
    # Authentication for the remote endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/remote-write-password

    # Tune the write queue for throughput vs. latency
    queue_config:
      # Number of shards (parallel write connections)
      max_shards: 200
      min_shards: 1
      # Samples to batch before flushing
      max_samples_per_send: 500
      # Time to wait before flushing an incomplete batch
      batch_send_deadline: 5s
      # In-memory buffer capacity per shard
      capacity: 2500
      # How long to retry failed writes
      min_backoff: 30ms
      max_backoff: 5s

    # Metadata configuration
    metadata_config:
      send: true
      send_interval: 1m

    # Filter what gets remote-written (reduce egress)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop

The queue_config tuning is critical and frequently misunderstood. Each shard maintains its own connection to the remote endpoint and its own in-memory queue. Increasing max_shards increases parallelism and throughput but also increases memory consumption and load on the remote endpoint. The right values depend heavily on your sample ingestion rate and network latency to the remote endpoint. Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds versus prometheus_remote_storage_highest_timestamp_in_seconds — the lag between them tells you how far behind your remote write queue is.
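That lag can be computed directly in PromQL. A sketch of the standard comparison (the group_right modifier accounts for the per-remote remote_name and url labels on the queue metric); the 120-second threshold is illustrative:

```yaml
- alert: PrometheusRemoteWriteBehind
  # Seconds between the newest ingested sample and the newest sample
  # successfully sent to this remote endpoint
  expr: |
    (
      prometheus_remote_storage_highest_timestamp_in_seconds
      - ignoring(remote_name, url) group_right
        prometheus_remote_storage_queue_highest_sent_timestamp_seconds
    ) > 120
  for: 15m
  labels:
    severity: warning
```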

Long-Term Solutions: Thanos vs Grafana Mimir vs VictoriaMetrics

For production systems that need long-term storage, global query capability, high availability, and genuine horizontal scalability, purpose-built solutions are the right answer. Three projects dominate this space: Thanos, Grafana Mimir, and VictoriaMetrics. They share similar goals but differ significantly in architecture, operational complexity, and trade-offs.

| Criterion | Thanos | Grafana Mimir | VictoriaMetrics |
|---|---|---|---|
| Architecture | Sidecar + object store; modular components | Fully distributed; Cortex-derived microservices | Single binary or cluster mode |
| Storage backend | Any S3-compatible object store | Any S3-compatible object store | Own TSDB format on local disk or object store |
| PromQL compatibility | Full PromQL; own query engine | Full PromQL; Mimir-specific extensions | MetricsQL (PromQL superset) |
| Operational complexity | Medium: multiple components, each simple | High: many microservices with complex config | Low: minimal components, simple config |
| Ingest scalability | Scales via Thanos Receive fan-out | Horizontally scalable distributors + ingesters | Excellent; millions of samples/sec per node |
| Query performance | Good; Store Gateway caches object store data | Good; query sharding and caching built in | Excellent; highly optimized query engine |
| Multi-tenancy | Limited; tenant isolation via external labels | Native; per-tenant limits and isolation | Tenants supported in cluster mode; fine-grained limits are enterprise |
| Deduplication | Built-in; replica dedup at query time | Built-in; ingest-time and query-time dedup | Built-in via -dedup.minScrapeInterval |
| Downsampling | Yes; handled by the Thanos Compactor | No native downsampling | Enterprise feature (-downsampling.period) |
| License | Apache 2.0 (fully open source) | AGPL-3.0 (open source) + enterprise tier | Apache 2.0 (community); proprietary enterprise |
| Best fit | Teams already running Prometheus wanting minimal disruption | Large orgs needing multi-tenant SaaS-grade monitoring | Teams prioritizing simplicity and raw performance |

Thanos: The Incremental Path

Thanos integrates with existing Prometheus deployments through a sidecar process that runs alongside each Prometheus pod. The sidecar uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API that Thanos Query uses to federate queries across all Prometheus instances plus historical data in the object store. This makes Thanos the lowest-friction path for teams with existing Prometheus infrastructure.

Thanos Receive is an alternative ingest path that accepts remote_write directly, which is useful when you want to decouple Prometheus instances from the query layer or implement active-active HA without relying on Prometheus replication. Thanos Compactor handles block compaction and downsampling on the object store, creating 5-minute and 1-hour resolution downsamples automatically for efficient long-range queries.
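In Kubernetes, the sidecar is simply an extra container in the Prometheus pod. A minimal sketch, assuming an object store config mounted at /etc/thanos/objstore.yml and the image tag shown (both illustrative):

```yaml
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.0   # illustrative version
    args:
      - sidecar
      - --tsdb.path=/prometheus                        # volume shared with Prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yml
      - --grpc-address=0.0.0.0:10901                   # Store API consumed by Thanos Query
```

For block upload to work reliably, Prometheus itself should run with --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration both set to 2h, so blocks are never locally compacted before the sidecar ships them.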

Grafana Mimir: Enterprise-Grade Multi-Tenancy

Mimir is a fork of Cortex, rewritten by Grafana Labs to address operational complexity issues in Cortex’s architecture. It follows the same microservices pattern — Distributor, Ingester, Querier, Query Frontend, Store Gateway, Compactor, Ruler — but with significantly improved defaults and a monolithic deployment mode that simplifies small-scale deployments. Mimir’s headline feature is native multi-tenancy: tenants are isolated by an ID passed with every request, and per-tenant limits on series count, ingestion rate, and query load are enforced throughout the stack.
