Excluding down targets in Prometheus

August 11, 2023 
Excluding down targets in Prometheus is easy. It allows you to monitor volatile systems that start and shut down on demand.

Introduction

Prometheus is an open source metrics and alerting solution used to monitor a wide range of things. Unlike many classic network monitoring systems, Prometheus is focused on collecting metrics. This makes it quite easy to do things that would be difficult in other systems. For example, excluding down targets in Prometheus is quite trivial. This can be highly useful when you have volatile, statically configured targets that come and go on demand - for example Jenkins slaves (or build agents). Even on such targets you'd want to monitor things like systemd service status, memory usage and disk usage.

The "up" metric

When Prometheus scrapes a target (an HTTP endpoint), one of the standard metrics it gets is "up". The other automatically generated metrics are mostly related to scrape performance. If the "up" metric is 0, the scrape was unsuccessful. The reason for the scrape failure can vary, but in most cases it is irrelevant.
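For example, each scrape of a target produces synthetic series like the ones below (the label and sample values here are illustrative):

up{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}                       1
scrape_duration_seconds{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}  0.012
scrape_samples_scraped{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}   473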

Trivial alerting for scrape targets that are down

If all your scrape targets are supposed to be up all the time, you can get away with a trivial Prometheus alert rule such as this:

- alert: InstanceDown
  expr: up == 0
  for: 5m
  annotations:
    summary: 'Instance {{ $labels.instance }} down'
    description: "{{ $labels.instance }} of job {{ $labels.job }} has
      been down for more than 5 minutes."

This works fine, but if any of your targets goes down you will get an alert. You will still get some metrics out of a down target, such as the scrape performance metrics as well as the all-important "up" metric itself.
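For completeness, alert rules like this live in a rule group inside a rules file that prometheus.yml points to via rule_files; a minimal sketch (the group and file names are illustrative):

# alerts.yml - referenced from "rule_files" in prometheus.yml
groups:
- name: instance_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    annotations:
      summary: 'Instance {{ $labels.instance }} down'
      description: "{{ $labels.instance }} of job {{ $labels.job }} has
        been down for more than 5 minutes."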

Excluding down targets in Prometheus

One of the interesting features of Prometheus is that it is focused on current metrics. If a metric does not have an up-to-date value, Prometheus will not send an alert even if the metric values would warrant one. A good example is AWS Cloudwatch: the metrics obtained from Cloudwatch tend to be quite old. In order to alert on Cloudwatch metrics you often need to use the offset modifier to "look in the past".
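For example, a Cloudwatch-based alert rule might shift its view a few minutes into the past with offset; a minimal sketch, assuming a cloudwatch_exporter-style metric name and an illustrative threshold:

- alert: HighCpuOnEc2
  expr: aws_ec2_cpuutilization_average offset 10m > 90
  for: 15m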

This Prometheus behavior means that if a target goes down Prometheus stops being interested in its metrics, excluding the built-in ones related to the scrape itself. In practice this means that the only alert rule that would kick in is the "up" alert.
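For example, a typical node_exporter-based disk usage alert like the sketch below simply stops producing samples once the target disappears, so it cannot fire against a down target (the threshold is illustrative):

- alert: DiskAlmostFull
  expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m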

Knowing all this, it is quite easy to exclude down targets in Prometheus. What we need to do is properly label the scrape targets we know are going to go down, and filter the "up" alert so that we don't alert on targets we know may be down. I used the "volatile" label for this:

- job_name: on_demand_worker
  scrape_interval: 60s
  scrape_timeout: 10s
  static_configs:
  - targets:
    - on-demand-worker.example.org:9100
    labels:
      volatile: "true"

Note that label values are strings, so the boolean-looking value is quoted. For every target that was supposed to be always up I added volatile: "false". Then, in the "up" alert rule I only considered targets with volatile="false":

- alert: InstanceDown
  expr: up{volatile="false"} == 0
  for: 5m
  annotations:
    summary: 'Instance {{ $labels.instance }} down'
    description: "{{ $labels.instance }} of job {{ $labels.job }} has
      been down for more than 5 minutes."
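For reference, the scrape config of an always-up target looks the same as the volatile one except for the label value; a minimal sketch (the job and host names are illustrative):

- job_name: always_on_server
  scrape_interval: 60s
  scrape_timeout: 10s
  static_configs:
  - targets:
    - always-on-server.example.org:9100
    labels:
      volatile: "false"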

When a volatile target goes down, the "up" alert is not triggered. Additionally, metrics are not received from it, but that's ok, because Prometheus is not interested in old metrics. When the target comes back up, we start receiving metrics normally. This way Prometheus triggers alerts only when the target is up.
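To check which volatile targets are currently up, a quick ad-hoc query in the expression browser is enough:

up{volatile="true"} == 1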

Samuli Seppänen