Excluding down targets in Prometheus

August 11, 2023 
Excluding down targets in Prometheus is easy. It allows you to monitor volatile systems that start and shut down on demand.

Introduction

Prometheus is an open source metrics and alerting solution used to monitor a wide range of things. Unlike many classic network monitoring systems, Prometheus is focused on collecting metrics. This makes it quite easy to do things that would be difficult in other systems. For example, excluding down targets in Prometheus is quite trivial. This can be highly useful when you have volatile, statically configured targets that come and go on demand - for example Jenkins slaves (or build agents). Even on such targets you'd want to monitor things like systemd service status, memory usage and disk usage.

The "up" metric

When Prometheus scrapes a target (an HTTP endpoint), one of the standard metrics it gets is "up". The other automatically generated metrics are mostly related to scrape performance. If the "up" metric is 0, the scrape was unsuccessful. The reason for the scrape failure can vary, but in most cases it is irrelevant.
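For example, each scrape of a target produces synthetic series like the ones below (the label and sample values here are illustrative):

up{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}                       1
scrape_duration_seconds{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}  0.012
scrape_samples_scraped{instance="on-demand-worker.example.org:9100", job="on_demand_worker"}   473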

Trivial alerting for scrape targets that are down

If all your scrape targets are supposed to be up all the time, you can get away with a trivial Prometheus alert rule such as this:

- alert: InstanceDown
  expr: up == 0
  for: 5m
  annotations:
    summary: 'Instance {{ $labels.instance }} down'
    description: "{{ $labels.instance }} of job {{ $labels.job }} has
      been down for more than 5 minutes."

This works fine, but if any of your targets goes down you will get an alert. You will still get some metrics out of a down target, such as the scrape performance metrics as well as the all-important "up" metric itself.
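For completeness, alert rules like this live in a rule group inside a rules file that prometheus.yml points to via rule_files; a minimal sketch (the group and file names are illustrative):

# alerts.yml - referenced from "rule_files" in prometheus.yml
groups:
- name: instance_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    annotations:
      summary: 'Instance {{ $labels.instance }} down'
      description: "{{ $labels.instance }} of job {{ $labels.job }} has
        been down for more than 5 minutes."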

Excluding down targets in Prometheus

One of the interesting features of Prometheus is that it is focused on current metrics. If a metric does not have an up-to-date value, Prometheus will not send an alert even if the metric values would warrant one. A good example is AWS Cloudwatch: the metrics obtained from Cloudwatch tend to be quite old. In order to alert on Cloudwatch metrics you often need to use the offset modifier to "look in the past".
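For example, a Cloudwatch-based alert rule might shift its view a few minutes into the past with offset; a minimal sketch, assuming a cloudwatch_exporter-style metric name and an illustrative threshold:

- alert: HighCpuOnEc2
  expr: aws_ec2_cpuutilization_average offset 10m > 90
  for: 15m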

This Prometheus behavior means that if a target goes down Prometheus stops being interested in its metrics, excluding the built-in ones related to the scrape itself. In practice this means that the only alert rule that would kick in is the "up" alert.
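For example, a typical node_exporter-based disk usage alert like the sketch below simply stops producing samples once the target disappears, so it cannot fire against a down target (the threshold is illustrative):

- alert: DiskAlmostFull
  expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m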

Knowing all this, it is quite easy to exclude down targets in Prometheus. What we need to do is properly label the scrape targets we know are going to go down, and filter the "up" alert so that we don't alert on targets we know may be down. I used the "volatile" label for this:

- job_name: on_demand_worker
  scrape_interval: 60s
  scrape_timeout: 10s
  static_configs:
  - targets:
    - on-demand-worker.example.org:9100
    labels:
      volatile: "true"

Note that label values are strings, so the boolean-looking value is quoted. For every target that was supposed to be always up I added volatile: "false". Then, in the "up" alert rule I only considered targets with volatile="false":

- alert: InstanceDown
  expr: up{volatile="false"} == 0
  for: 5m
  annotations:
    summary: 'Instance {{ $labels.instance }} down'
    description: "{{ $labels.instance }} of job {{ $labels.job }} has
      been down for more than 5 minutes."
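For reference, the scrape config of an always-up target looks the same as the volatile one except for the label value; a minimal sketch (the job and host names are illustrative):

- job_name: always_on_server
  scrape_interval: 60s
  scrape_timeout: 10s
  static_configs:
  - targets:
    - always-on-server.example.org:9100
    labels:
      volatile: "false"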

When a volatile target goes down, the "up" alert is not triggered. Additionally, metrics are not received from it, but that's ok, because Prometheus is not interested in old metrics. When the target comes back up, we start receiving metrics normally. This way Prometheus triggers alerts only when the target is up.
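To check which volatile targets are currently up, a quick ad-hoc query in the expression browser is enough:

up{volatile="true"} == 1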

Samuli Seppänen