In DevOps, observability is our eyes and ears. We display metrics on beautiful Grafana dashboards, place them on monitors, and rely on them to decide if our systems are running smoothly. But what if the lines on those screens are silently warping the data, hiding actual spikes or magnifying minor deviations?
In this guide, we'll cover common PromQL query mistakes and Prometheus scrape mechanics that lead to corrupted dashboard graphs, and explore how to fix them.
The Golden Rule: Dashboards are only as good as the queries powering them. An incorrect rate calculation or step sizing can mask brief outages entirely.
Mistake 1: Applying rate() to Gauges
Prometheus categorizes metrics into distinct types: Counters (which only increase, resetting to 0 on restart) and Gauges (which can fluctuate up and down, like memory usage or temperature).
A major error is writing a rate query on a gauge:
# INCORRECT: Calculating change speed on fluctuating gauge metrics
rate(node_memory_Active_bytes[5m])
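Why it's wrong: The rate() function calculates the per-second increase of a counter and is specifically designed to compensate for counter resets. If a gauge decreases (e.g. memory goes from 8GB to 4GB), rate() assumes a counter reset occurred. Rather than reporting the drop, it treats the new, lower value as growth from zero since the supposed reset, so a falling gauge shows up as a positive rate that has nothing to do with the real change.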
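To make the reset behavior concrete, here is a minimal Python sketch of a simplified model of counter-reset handling. It is not Prometheus's actual implementation: it ignores the window-boundary extrapolation that the real rate() performs, and the function name and sample values are made up for illustration.

```python
def increase_with_reset_handling(samples):
    """Total increase across samples, reading any decrease as a counter reset.

    Simplified model only: no extrapolation to the window boundaries.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # A drop is read as "the counter restarted at 0 and climbed to cur".
            total += cur
        else:
            total += cur - prev
    return total


GB = 10 ** 9
# Hypothetical gauge readings: memory falls from 8GB to 4GB inside a 5m window.
gauge = [8 * GB, 8 * GB, 4 * GB, 4 * GB]
window_seconds = 300

true_change = gauge[-1] - gauge[0]
modelled_increase = increase_with_reset_handling(gauge)

print(true_change / GB)                    # -4.0 (the gauge really fell)
print(modelled_increase / GB)              # 4.0 (the model reports growth)
print(modelled_increase / window_seconds)  # per-second "rate" in bytes
```

Under this model, a 4GB drop is reported as a 4GB increase, which is why a rate() graph of a gauge can look plausible while being meaningless.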
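The Fix: For gauges, use deriv() to measure the per-second rate of change, or delta() to get the absolute change over the window. If the goal is to smooth a noisy gauge rather than measure how fast it changes, use an aggregation-over-time function such as avg_over_time().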
# CORRECT: Calculate the derivative of memory bytes over a 5m window
deriv(node_memory_Active_bytes[5m])
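deriv() estimates the per-second derivative with a simple linear regression over the samples in the window. The Python sketch below shows that idea as a plain least-squares slope. The helper name and the sample values are hypothetical, and the code is an illustration of the technique, not Prometheus's own code.

```python
def least_squares_slope(points):
    """Slope of the best-fit line through (timestamp_seconds, value) points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    # Assumes the timestamps are not all identical (variance > 0).
    return covariance / variance


MB = 10 ** 6
# Hypothetical gauge samples, one per minute: memory falling by 60MB per minute.
samples = [(0, 500 * MB), (60, 440 * MB), (120, 380 * MB),
           (180, 320 * MB), (240, 260 * MB)]

slope = least_squares_slope(samples)
print(slope)  # bytes per second; negative because the gauge is falling
```

Unlike the reset-aware logic in rate(), a fitted slope keeps its sign, so a shrinking gauge produces a negative value instead of a phantom increase.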
Mistake 2: Rate Windows Smaller than the Scrape Interval
Imagine your Prometheus instances scrape targets once every 60 seconds. You configure a Grafana graph with the following query:
# DANGEROUS: Querying with a window smaller than your collection interval
rate(http_requests_total[30s])
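Why it's wrong: rate() needs at least two samples inside the window to calculate a rate of change. With a 60s scrape interval, a 30-second window never contains more than one sample, and often contains none. rate() therefore returns no result, and the panel shows "No data" or a graph with gaps. A window only slightly larger than the scrape interval is still fragile, because a single late or failed scrape leaves it with one sample again.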
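The sample-count argument can be checked with a few lines of Python. This sketch assumes scrapes land exactly on multiples of the scrape interval and models the window as the half-open range (eval_time - window, eval_time]. Both are simplifications, and the function name is made up for illustration.

```python
def samples_in_window(scrape_interval, window, eval_time):
    """Count scrape timestamps in (eval_time - window, eval_time].

    All arguments are whole seconds; scrapes are assumed to happen exactly
    at multiples of scrape_interval.
    """
    start = eval_time - window
    # Multiples of scrape_interval that are > start and <= eval_time.
    return eval_time // scrape_interval - start // scrape_interval


# Evaluate a query every 15s over two minutes, with a 60s scrape interval.
eval_times = range(600, 720, 15)
counts_30s = [samples_in_window(60, 30, t) for t in eval_times]
counts_4x = [samples_in_window(60, 240, t) for t in eval_times]

print(max(counts_30s))  # 1 -> never the two samples a rate needs
print(min(counts_4x))   # 4 -> always enough samples with a 4x window
```

In this model the 30s window never reaches two samples, while a window of four scrape intervals always holds four, leaving headroom for a missed scrape.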
The Fix: Ensure your rate window is at least 4 times your scrape interval. If you scrape every 15s, use a minimum window of 1m:
# CORRECT: Rate window accommodates 4 scrape iterations (15s scrape * 4 = 1m)
rate(http_requests_total[1m])
Mistake 3: Taking rate() of an Aggregated Sum
When summing cluster metrics, the order of operations in PromQL is critical. Mixing up functions leads to calculation errors when counters reset:
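# INCORRECT: Aggregating counters before rate() can handle resets
rate(sum(http_requests_total)[5m:])
Why it's wrong: sum() merges the individual series into a single combined value and drops their labels. If one target container restarts, its counter falls back to 0 and the aggregated sum drops. The outer rate() cannot tell which target restarted, so it assumes the entire combined counter reset and counts the whole post-restart sum as new growth, which produces a large false spike in your dashboard charts. Note that rate(sum(http_requests_total)[5m]) without the colon does not parse at all, because sum() returns an instant vector and rate() requires a range vector. The subquery form shown above is the variant that evaluates, and it is the one that silently gives wrong answers.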
The Fix: Always calculate the rate of individual targets first to handle restarts cleanly, and then sum the resulting rates:
# CORRECT: Rate calculated per target, then aggregated by sum
sum(rate(http_requests_total[5m]))
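The effect of the ordering can be reproduced with a small Python sketch. It uses a simplified model of reset handling (any decrease is read as a restart from zero, with no window extrapolation), and the two targets and their sample values are hypothetical.

```python
def increase_with_reset_handling(samples):
    """Total increase across samples, reading any decrease as a counter reset."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # After a drop, the new value is counted as growth from zero.
        total += cur if cur < prev else cur - prev
    return total


# Hypothetical request counters from two targets, sampled at the same instants.
# Target b restarts between the 3rd and 4th sample and starts again from 0.
target_a = [1000, 1100, 1200, 1300, 1400]
target_b = [5000, 5100, 5200, 50, 150]

# Order 1: sum the counters first, then apply reset handling to the total.
summed_first = [a + b for a, b in zip(target_a, target_b)]
sum_then_rate = increase_with_reset_handling(summed_first)

# Order 2: apply reset handling per target, then add the results.
rate_then_sum = (increase_with_reset_handling(target_a)
                 + increase_with_reset_handling(target_b))

print(sum_then_rate)  # 1950 -> inflated by target a's 1300 at the "reset"
print(rate_then_sum)  # 750  -> 400 from a plus 350 from b
```

In this example the two targets served 750 requests between them, but summing first reports 1950, because the healthy target's full counter value is counted as new growth at the moment the other target restarts.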
Conclusion
Understanding how Prometheus processes data ensures your dashboards remain accurate. By matching rate windows to scrape intervals, separating counter metrics from gauges, and applying aggregations in the correct order, you'll build reliable dashboards that display accurate system trends.
Review your dashboard queries and verify that your system metrics represent actual server activity!