Grafana Dashboard Metrics Pitfalls

In DevOps, observability is our eyes and ears. We display metrics on beautiful Grafana dashboards, place them on monitors, and rely on them to decide if our systems are running smoothly. But what if the lines on those screens are silently warping the data, hiding actual spikes or magnifying minor deviations?

In this guide, we'll cover common PromQL query mistakes and Prometheus scrape mechanics that produce misleading dashboard graphs, and explore how to fix them.

The Golden Rule: Dashboards are only as good as the queries powering them. An incorrect rate calculation or step sizing can mask brief outages entirely.

Mistake 1: Applying `rate()` to Gauges

Prometheus categorizes metrics into distinct types: Counters (which only increase, resetting to 0 on restart) and Gauges (which can fluctuate up and down, like memory usage or temperature).

A major error is writing a rate query on a gauge:

PromQL — Invalid Gauge Rate

# INCORRECT: Calculating change speed on fluctuating gauge metrics
rate(node_memory_Active_bytes[5m])

Why it's wrong: The rate() function calculates the per-second increase of a counter and is specifically designed to compensate for counter resets. If a gauge decreases (e.g. memory goes from 8GB to 4GB), rate() assumes a counter reset occurred. Instead of reporting the drop, it treats the new value as growth from zero, so a falling gauge shows up as a positive rate.
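To make the reset handling concrete, here is a minimal Python sketch of counter-style increase logic. It is a simplification and not Prometheus's actual implementation: it ignores extrapolation and the division by the window length, and the function name and sample values are hypothetical.

```python
def counter_increase(samples):
    """Sum the increases between consecutive samples, treating any drop
    as a counter reset (the new value is counted as growth from zero)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Counter semantics: a drop means "restarted from 0", so the
            # whole current value is counted as an increase.
            total += cur
        else:
            total += cur - prev
    return total


GIB = 1024 ** 3
# Hypothetical gauge readings: memory falls from 8 GiB to 4 GiB, then holds.
gauge_samples = [8 * GIB, 4 * GIB, 4 * GIB]

actual_change = gauge_samples[-1] - gauge_samples[0]
reported = counter_increase(gauge_samples)

print(actual_change / GIB)  # -4.0
print(reported / GIB)       # 4.0
```

The gauge really fell by 4 GiB, but counter-style logic reports a 4 GiB increase, which is why the resulting graph points the wrong way.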

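The Fix: For gauges, use deriv() to estimate the per-second rate of change (it fits a linear regression over the window), delta() for the change across the window, or an aggregation function such as avg_over_time() to smooth the series.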

PromQL — Correct Gauge Deriv

# CORRECT: Calculate the derivative of memory bytes over a 5m window
deriv(node_memory_Active_bytes[5m])
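As a rough illustration of what a regression-based derivative reports, the sketch below computes a least-squares slope over a few samples in Python. It is meant to mirror the idea behind deriv(), not its exact implementation; the helper name and the sample data are made up for the example.

```python
def least_squares_slope(points):
    """Slope of the best-fit line through (timestamp_seconds, value) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    return cov / var


# Hypothetical gauge samples every 60s: memory shrinking by 600 bytes per scrape.
points = [(0, 10_000), (60, 9_400), (120, 8_800), (180, 8_200)]
slope = least_squares_slope(points)

print(slope)  # about -10 bytes per second
```

Unlike counter-style logic, the slope keeps its sign, so a shrinking gauge yields a negative value.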

Mistake 2: Rate Windows Smaller than the Scrape Interval

Imagine your Prometheus instances scrape targets once every 60 seconds. You configure a Grafana graph with the following query:

PromQL — Unaligned Rate Window

# DANGEROUS: Querying with a window smaller than your collection interval
rate(http_requests_total[30s])

Why it's wrong: rate() needs at least two data points inside the window to calculate a rate of change. With a 60s scrape interval, a 30-second window never contains more than one data point, and often contains none. The query returns no result for those evaluations, leaving the graph empty or full of gaps.
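The arithmetic is easy to check with a small simulation. This Python sketch counts how many scrape timestamps fall inside a lookback window at different evaluation times. It assumes scrapes land exactly on multiples of the interval and that the window excludes its left boundary; both are simplifying assumptions, and the function name is hypothetical.

```python
def samples_in_window(scrape_interval, window, eval_time):
    """Count scrape timestamps (0, interval, 2*interval, ...) that fall
    inside the range (eval_time - window, eval_time]."""
    start = eval_time - window
    count = 0
    t = 0
    while t <= eval_time:
        if t > start:
            count += 1
        t += scrape_interval
    return count


# Evaluate every 15s across ten minutes of 60s scrapes.
counts_30s = [samples_in_window(60, 30, t) for t in range(60, 601, 15)]
counts_4m = [samples_in_window(60, 240, t) for t in range(240, 601, 15)]

print(max(counts_30s))  # 1
print(min(counts_4m))   # 4
```

No 30-second window ever holds the two samples a rate needs, while a 4-minute window always holds four.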

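The Fix: Because rate() needs at least two samples, the window must be comfortably longer than the scrape interval. A common rule of thumb is to make it at least 4 times the scrape interval, which leaves room for a missed scrape. If you scrape every 15s, use a minimum window of 1m: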

PromQL — Aligned Rate Window

# CORRECT: Rate window accommodates 4 scrape iterations (15s scrape * 4 = 1m)
rate(http_requests_total[1m])
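If you generate dashboard queries from code, the 4x rule is simple to encode. The following Python sketch is one way to do it; the helper names are hypothetical and the factor of 4 is the rule of thumb from this guide, not a hard Prometheus requirement.

```python
def min_rate_window(scrape_interval_s, factor=4):
    """Smallest rate window (in seconds) under the 4x rule of thumb."""
    return scrape_interval_s * factor


def format_promql_duration(seconds):
    """Render whole seconds as a PromQL duration such as 1m or 45s."""
    if seconds % 3600 == 0:
        return f"{seconds // 3600}h"
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"


def rate_query(metric, scrape_interval_s):
    """Build a rate() query whose window follows the 4x rule."""
    window = format_promql_duration(min_rate_window(scrape_interval_s))
    return f"rate({metric}[{window}])"


print(rate_query("http_requests_total", 15))  # rate(http_requests_total[1m])
print(rate_query("http_requests_total", 60))  # rate(http_requests_total[4m])
```

Grafana's Prometheus data source also offers a `$__rate_interval` variable intended for this purpose; check the documentation for your Grafana version before relying on its exact behavior.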

Mistake 3: Nesting aggregators inside `rate()`

When summing cluster metrics, the order of operations in PromQL is critical. Mixing up functions leads to calculation errors when counters reset:

PromQL — Summing Counters Incorrectly

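# INCORRECT: Aggregating counters before rate() can handle resets.
# rate() needs a range vector, so wrapping sum() requires a subquery ([5m:]);
# the plain sum(...)[5m] form is rejected by the PromQL parser.
rate(sum(http_requests_total)[5m:])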

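Why it's wrong: sum() collapses the per-target series into a single combined series. If one target container restarts, its counter falls back to zero and the combined value drops. The outer rate() cannot tell which target reset; it treats the drop as a reset of the whole total and counts the entire post-drop sum as new increase, producing a large false spike on your dashboard.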

The Fix: Always calculate the rate of individual targets first to handle restarts cleanly, and then sum the resulting rates:

PromQL — Summing Rates Correctly

# CORRECT: Rate calculated per target, then aggregated by sum
sum(rate(http_requests_total[5m]))
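The difference between the two orderings is easy to see with numbers. This Python sketch uses simplified counter-reset logic (no extrapolation, no division by the window length) and invented sample values for two hypothetical targets, one of which restarts.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total


# Hypothetical counters from two targets, one sample per scrape.
target_a = [1000, 1010, 1020, 1030]
target_b = [5000, 5010, 5, 15]  # restarts between the 2nd and 3rd scrape

# Rate first, then sum: the reset is handled inside target_b's own series.
per_target = counter_increase(target_a) + counter_increase(target_b)

# Sum first, then rate: the combined series drops from 6020 to 1025.
summed = [a + b for a, b in zip(target_a, target_b)]
aggregated = counter_increase(summed)

print(per_target)  # 55.0
print(aggregated)  # 1065.0
```

Handling resets per target reports 55 requests, while handling them on the summed series reports 1065, because target A's entire running total is counted again as if it were new traffic.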

Conclusion

Understanding how Prometheus processes data ensures your dashboards remain accurate. By matching rate windows to scrape intervals, separating counter metrics from gauges, and applying aggregations in the correct order, you'll build reliable dashboards that display accurate system trends.

Review your dashboard queries and verify that your system metrics represent actual server activity!
