Prometheus is a monitoring system that is designed to process a large number of metrics, centralize them on one (or multiple) servers and serve them with a well-defined API. That API is queried through a domain-specific language (DSL) called "PromQL" or "Prometheus Query Language". Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see Grafana).

Tutorial

Looking at pretty graphs

The Prometheus web interface is available at:

https://prometheus.torproject.org

A simple query you can try is to pick any metric in the list and click Execute. For example, this link will show the 5-minute load over the last two weeks for the known servers.
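The same kind of query can also be sent to the HTTP API directly. The following is a minimal sketch, assuming the 5-minute load is the node exporter's `node_load5` metric and that the standard `/api/v1/query_range` endpoint is reachable from where you run it. The end timestamp and the one-hour step are arbitrary choices for illustration.

```python
from urllib.parse import urlencode

# Base URL from the text above; /api/v1/query_range is the standard
# Prometheus HTTP API endpoint for queries over a time range.
PROMETHEUS_URL = "https://prometheus.torproject.org"

TWO_WEEKS = 14 * 24 * 3600  # seconds


def range_query_url(query, start, end, step, base=PROMETHEUS_URL):
    """Build a query_range URL for a PromQL expression.

    start and end are Unix timestamps in seconds; step is the resolution
    of the returned series, also in seconds.
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"


# Hypothetical end timestamp, fixed only to keep the example reproducible.
end = 1_600_000_000
url = range_query_url("node_load5", end - TWO_WEEKS, end, step=3600)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return a JSON document with one series per monitored host, provided the server accepts the request.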

The Prometheus web interface is crude: it's better to use Grafana dashboards for most purposes other than debugging.

How-to

Pager playbook

TBD.

Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely rebuildable from Puppet. Non-configuration data should be restored from backup, with /var/lib/prometheus/ being sufficient to reconstruct history. If even backups are destroyed, history will be lost, but the server should still recover and start tracking new metrics.

Migrating from Munin

Here's a quick cheat sheet for people used to Munin and switching to Prometheus:

What               Munin            Prometheus
Scraper            munin-update     prometheus
Agent              munin-node       prometheus node-exporter and others
Graphing           munin-graph      prometheus or grafana
Alerting           munin-limits     prometheus alertmanager
Network port       4949             9100 and others
Protocol           TCP, text-based  HTTP, text-based
Storage format     RRD              custom TSDB
Downsampling       yes              no
Default interval   5 minutes        15 seconds
Authentication     no               no
Federation         no               yes (can fetch from other servers)
High availability  no               yes (alert-manager gossip protocol)

Basically, Prometheus is similar to Munin in many ways:

  • it "pulls" metrics from the nodes, although it does it over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin

  • the agent running on the nodes is called prometheus-node-exporter instead of munin-node. It scrapes only a set of built-in parameters like CPU, disk space and so on. Different exporters are necessary for different applications (like prometheus-apache-exporter), and any application can easily implement an exporter by exposing a Prometheus-compatible /metrics endpoint

  • like Munin, the node exporter doesn't have any form of authentication built in. We rely on IP-level firewalls to avoid leakage

  • the central server is simply called prometheus and runs as a daemon that wakes up on its own, instead of munin-update which is called from munin-cron and before that cron

  • graphics are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by munin-graph

  • samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad-hoc) RRD standard

  • unlike RRD, Prometheus performs no downsampling. It relies on smart compression to save disk space, but it still uses more than Munin

  • Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable

  • Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers to a central one with a different sampling frequency) natively - munin-update and munin-graph can only run on a single (and same) server

  • Prometheus can act as a high availability alerting system thanks to its alertmanager that can run multiple copies in parallel without sending duplicate alerts - munin-limits can only run on a single server
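To illustrate the point above about applications exposing their own /metrics endpoint, here is a minimal sketch of an exporter using only the Python standard library. The metric names, the `collect` function and the port are hypothetical. In practice an official client library is the more robust choice, since this sketch does not escape label values or emit HELP/TYPE lines.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text format.

    Label values are not escaped here, so they must not contain
    quotes, backslashes or newlines.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical application metrics; a real exporter would measure something."""
    return [
        ("myapp_up", {}, 1),
        ("myapp_queue_length", {"queue": "incoming"}, 42),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics forever; the port is an arbitrary choice for this example."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then fetching `http://localhost:9999/metrics` should show the two sample lines. As with the node exporter, there is no authentication here, so access would have to be restricted at the firewall.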

Reference

Installation

Puppet implementation

Every node is configured as a node-exporter through the roles::monitored role, which is included everywhere. The role might eventually be expanded to cover alerting and other monitoring resources as well. This role, in turn, includes the profile::prometheus::client profile, which configures each client with the right firewall rules.

The firewall rules are exported from the server, defined in profile::prometheus::server. We hacked around limitations of the upstream Puppet module to install Prometheus using backported Debian packages. The monitoring server itself is defined in roles::monitoring.

The Prometheus Puppet module was patched to allow scrape job collection and use of Debian packages for installation. Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.

Manual node configuration

External services can be monitored by Prometheus, as long as they comply with the OpenMetrics protocol, which simply means exposing metrics in this form over HTTP:

metric{label="label_val"} value

A real-life (simplified) example:

node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node alberti has the device /dev/sda1 mounted on /, formatted as an ext4 filesystem which has 16160059392 bytes (~16GB) free.
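To make the structure of such a line concrete, here is a small sketch of a parser that splits a sample into metric name, labels and value. It is deliberately simplified: it assumes no escaped quotes inside label values and no trailing timestamp, both of which the full text format allows.

```python
import re

# metric name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# a single key="value" pair; values containing escaped quotes are not handled
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Split one sample line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = {m["key"]: m["val"] for m in LABEL_RE.finditer(match["labels"] or "")}
    return match["name"], labels, float(match["value"])


EXAMPLE = (
    'node_filesystem_avail_bytes{alias="alberti.torproject.org",'
    'device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392'
)

name, labels, value = parse_sample(EXAMPLE)
print(name, labels["device"], labels["mountpoint"], f"{value / 1e9:.1f} GB free")
```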

System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:

  • On Debian Buster and later:

     apt install prometheus-node-exporter
    
  • On Debian stretch:

     apt install -t stretch-backports prometheus-node-exporter
    

    ... assuming that backports is already configured. If it isn't, a line like this in /etc/apt/sources.list.d/backports.debian.org.list should suffice:

     deb https://deb.debian.org/debian/  stretch-backports   main contrib non-free
    

    ... followed by an apt update, naturally.

The firewall on the machine needs to allow traffic on the exporter port from the server prometheus2.torproject.org. Then open a ticket for TPA to configure the target. Make sure to mention:

  • the hostname for the exporter
  • the port of the exporter (varies according to the exporter, 9100 for the node exporter)
  • how often to scrape the target, if non-default (default: 15s)

Then TPA needs to hook those as part of a new node job in the scrape_configs, in prometheus.yml, from Puppet, in profile::prometheus::server.
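As an illustration of how the three items from the ticket map onto a scrape job, here is a sketch that builds one entry for the scrape_configs list. It is not how profile::prometheus::server generates the configuration. The function name and the example hostname are hypothetical, and only the standard `job_name`, `scrape_interval` and `static_configs` keys are used.

```python
import json


def scrape_job(job_name, hostname, port=9100, scrape_interval=None):
    """Build one scrape_configs entry for a single target.

    scrape_interval is only set when it differs from the global
    default (15s in this setup), e.g. "1m".
    """
    job = {
        "job_name": job_name,
        "static_configs": [{"targets": [f"{hostname}:{port}"]}],
    }
    if scrape_interval is not None:
        job["scrape_interval"] = scrape_interval
    return job


# Hypothetical external host running the node exporter on its usual port.
job = scrape_job("node", "host.example.org")
print(json.dumps({"scrape_configs": [job]}, indent=2))
```

The output is JSON, which YAML parsers generally accept, but in practice the stanza would be written as YAML by Puppet.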

SLA

Prometheus is currently not doing alerting, so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time so we can do proper long-term resource planning.

Design

Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:

[Figure: a drawing of Prometheus' architecture, showing the push gateway and exporters adding metrics, service discovery through file_sd and Kubernetes, alerts pushed to the Alertmanager and the various UIs pulling from Prometheus]

As you can see, Prometheus is somewhat tailored towards Kubernetes but it can be used without it. We're deploying it with the file_sd discovery mechanism, where Puppet collects all exporters into the central server, which then scrapes those exporters every scrape_interval (by default 15 seconds). The architecture graph also shows the Alertmanager which could be used to (eventually) replace our Nagios deployment.
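For readers unfamiliar with file_sd, the target files it reads are plain JSON (or YAML) lists of target groups, each with a `targets` list and optional `labels`. The sketch below shows what generating such a file could look like. The hostnames and the use of an `alias` label are assumptions for illustration, not a description of what Puppet actually writes.

```python
import json


def file_sd_groups(exporters):
    """Turn (hostname, port, labels) tuples into file_sd target groups."""
    return [
        {"targets": [f"{hostname}:{port}"], "labels": dict(labels)}
        for hostname, port, labels in exporters
    ]


# Hypothetical exporters, as they might be collected from each node.
exporters = [
    ("host1.example.org", 9100, {"alias": "host1.example.org"}),
    ("host2.example.org", 9100, {"alias": "host2.example.org"}),
]

document = json.dumps(file_sd_groups(exporters), indent=2)
print(document)
```

Prometheus rereads file_sd files when they change, so rewriting such a file is enough to add or remove targets without restarting the server.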

It does not show that Prometheus can federate to multiple instances and that the Alertmanager can be configured for high availability.

Issues

There is no issue tracker specifically for this project. File or search for issues in the generic internal services component.

Monitoring and testing

Prometheus doesn't have specific tests, but there is a test suite in the upstream prometheus Puppet module.

The server is monitored for basic system-level metrics by Nagios. It also monitors itself, for both system-level and application-specific metrics.

Discussion

Overview

The Prometheus and Grafana services were set up after anarcat realized that there was no "trending" service set up inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-March 2019 (ticket 29683) and remaining traces of Munin were removed in early April 2019 (ticket 29682).

Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244) with the hope this could eventually be expanded further with a downsampling server in the future.
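The effect of that retention change on disk usage can be estimated with the rule of thumb from the upstream storage documentation: retention time multiplied by ingested samples per second multiplied by bytes per sample, at roughly 1 to 2 bytes per sample. The host and series counts below are hypothetical placeholders, not TPA's actual figures, which are in the ticket.

```python
def disk_bytes(retention_days, hosts, series_per_host,
               scrape_interval_s=15, bytes_per_sample=2):
    """Rough disk estimate: retention * samples per second * bytes per sample."""
    samples_per_second = hosts * series_per_host / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Hypothetical fleet: 75 hosts exposing 1000 time series each.
for days in (15, 365):
    estimate = disk_bytes(days, hosts=75, series_per_host=1000)
    print(f"{days:3d} days: {estimate / 1e9:.1f} GB")
```

Because the estimate is linear in retention time, going from 15 days to one year multiplies the required space by about 24, whatever the real fleet size is.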

Eventually, a second Prometheus/Grafana server was set up to monitor external resources (ticket 31159) because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns on the metrics team about exposing those metrics publicly.

It was originally thought Prometheus could completely replace Nagios as well (ticket 29864), but this turned out to be more difficult than planned. The main difficulty is that Nagios checks come with built-in thresholds of acceptable performance, whereas Prometheus metrics are just that: metrics, without thresholds. This makes it more difficult to replace Nagios because a ton of alerts need to be rewritten to replace the existing ones. A lot of reports and functionality built into Nagios, like availability reports, acknowledgements and other reports, would need to be reimplemented as well.

Goals

This section didn't exist when the project was launched, so this is really just second-guessing...

Must have

  • Munin replacement: long-term trending metrics to predict resource allocation, with graphing
  • free software, self-hosted
  • Puppet automation

Nice to have

Non-Goals

  • 1 year data retention

Approvals required

The primary Prometheus server was decided at the Brussels 2019 devmeeting, before anarcat joined the team (ticket 29389). The secondary Prometheus server was approved on 2019-04-08. Storage expansion was approved on 2019-11-25.

Proposed Solution

Prometheus was chosen, see also Grafana.

Cost

N/A.

Alternatives considered

No alternatives research was performed, as far as we know.