Last active
January 7, 2022 03:33
A Deep Dive into an Alert on Prometheus
<section>
<h1>A Deep Dive into an Alert on Prometheus</h1>
<h2>Grant Sherrick</h2>
</section>
<section id="where-we-started">
<h2>We started with an alert.</h2>
<br>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5
  for: 1m
  labels:
    app: entitled
  annotations:
    description: '{{$labels.instance}} of job {{$labels.job}} has experienced increased
      error rates for more than 1 minute.'
    summary: Instance {{$labels.instance}} is experiencing increased error rates</code></pre>
</section>
<section id="an-aside-on-rate">
<h2>A brief aside on <code>rate()</code></h2>
<br>
<div class="fragment">
<p style="text-align: left; margin-left: 1.3em;"><code>rate(v range-vector)</code> calculates the per-second average rate of increase of the time series in the range vector.</p>
<p style="text-align: left; margin-left: 1.3em;"><code>rate</code> should only be used with counters.</p>
</div>
<p style="text-align: left; margin-left: 1.3em;"><a href="https://github.com/prometheus/prometheus/blob/release-2.0/promql/functions.go#L135">The rate computation</a></p>
</section>
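<section id="an-aside-on-rate-example">
<h2>What does <code>rate() * 60 * 60</code> work out to?</h2>
<br>
<p style="text-align: left; margin-left: 1.3em;">A rough sketch, not part of the original alert: because <code>rate()</code> is per-second, scaling a 1h rate by 3600 gives the estimated increase over that hour. PromQL's <code>increase()</code> should express the same quantity directly. The selector here is trimmed down for illustration.</p>
<pre><code># per-second average over the last hour, scaled back up to "per hour"
rate(http_requests_total{app="entitled",statusClass="5XX"}[1h]) * 60 * 60

# the same quantity, written with increase()
increase(http_requests_total{app="entitled",statusClass="5XX"}[1h])</code></pre>
</section>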
<section id="issue1">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">1. It's not specific enough.</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5</code></pre>
<a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D&g0.tab=1"><pre class="fragment"><code>http_requests_total{app="entitled",path!="/metrics/healthcheck"}</code></pre></a>
</section>
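<section id="issue1-sketch">
<h2>One way to be more specific</h2>
<br>
<p style="text-align: left; margin-left: 1.3em;">A minimal sketch, assuming <code>path</code> is the dimension we care about: keep that label when aggregating, so each path is compared against the threshold on its own instead of everything being folded into a single number.</p>
<pre><code># one result per path, instead of one result for the whole app
sum by (path) (
  rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
) * 60 * 60 > 5</code></pre>
</section>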
<section id="issue2">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">2. What does <code>rate[1h]</code> <code>for: 1m</code> mean?</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5
  for: 1m</code></pre>
<div class="fragment">
<p>To fire, the expression has to stay true for 1 minute. But the expression itself is averaged over the past hour...</p>
</div>
</section>
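<section id="issue2-sketch">
<h2>Lining up the window and <code>for</code></h2>
<br>
<p style="text-align: left; margin-left: 1.3em;">A sketch only: with a 1h window, a short burst of errors can keep the expression above the threshold for up to an hour after the errors stop. A shorter window lets old errors age out sooner. The <code>5m</code> values below are illustrative assumptions, not tuned numbers.</p>
<pre><code>- alert: EntitledErrorRateIncreased
  # 5m window: errors drop out of the average after 5 minutes, not an hour
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[5m])
    * 60 * 60) > 5
  # the condition has to hold for this long before the alert fires
  for: 5m</code></pre>
</section>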
<section id="issue3">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">3. Why do we care if the error rate > 5?</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5</code></pre>
<div class="fragment">
<p>We've seen more than 5 5XX responses over the past hour, summed across all paths (except the healthcheck).</p>
</div>
</section>
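<section id="issue3-sketch">
<h2>A threshold relative to traffic</h2>
<br>
<p style="text-align: left; margin-left: 1.3em;">An absolute count says little without knowing how many requests there were. One possible alternative, sketched here, is the fraction of requests that returned 5XX. The <code>5m</code> window and the <code>0.05</code> (5%) threshold are placeholders, not recommendations.</p>
<pre><code>  # 5XX requests per second
  sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[5m]))
/
  # all requests per second
  sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck"}[5m]))
> 0.05</code></pre>
</section>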
<section id="issue4">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">4. The description and summary are not clear and do not show the data we'd like to see.</h4>
<pre><code>annotations:
  description: '{{$labels.instance}} of job {{$labels.job}} has experienced increased
    error rates for more than 1 minute.'
  summary: Instance {{$labels.instance}} is experiencing increased error rates</code></pre>
<div class="fragment">
<p><a href="https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators">Aggregation operators, like sum, min, max, avg, etc.</a></p>
<p><a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=sum(rate(http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D%5B1h%5D)%20*%2060%20*%2060)%20&g0.tab=1">There's not even instance data in this alert.</a></p>
</div>
</section>
<section id="issue4">
<h2>Let's <a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=sum(rate(http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D%5B1h%5D)%20*%2060%20*%2060)%20&g0.tab=1">fix</a> it!</h2>
</section>
<section id="conclusion">
<h2>Thanks!</h2>
</section>
<section id="useful-links">
<h2>Some Useful Links:</h2>
<ul>
<li><a href="https://github.com/google/re2/wiki/Syntax">RE2, the regular expression syntax for p8s</a></li>
<li><a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">Querying Basics with Prometheus</a></li>
<li><a href="https://www.weave.works/blog/">WeaveWorks Blog</a></li>
</ul>
</section>