This documentation is for sysadmins who need to figure out what to do when things go wrong. If you don't have the required access and haven't been trained for such situations, you are probably better off waking up someone who can deal with them. See the how-to-get-help documentation instead.

Specific situations

Server down

If a server is non-responsive, you can first check if it is actually reachable over the network:

ping -c 10 server.torproject.org
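
As a rough sketch, the check above can be wrapped in a small shell function that reports which of the two cases applies. The function name check_reachable and the PING_CMD override are hypothetical, introduced here only so the logic can be tried with a stand-in command instead of a real network probe.

```shell
# Hypothetical helper: report whether a host answers ping.
# PING_CMD is an assumed override (defaults to ping), not an existing setting.
PING_CMD="${PING_CMD:-ping}"

check_reachable() {
    # $1: host name, e.g. server.torproject.org
    # ping normally exits 0 when at least one reply came back.
    if "$PING_CMD" -c 10 "$1" >/dev/null 2>&1; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}
```

Calling check_reachable server.torproject.org prints either reachable or unreachable, which tells you whether to look at monitoring or to start tracing the hosting chain.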

If it does respond, you can try to diagnose the issue by looking at Nagios and/or Grafana to analyse what, exactly, is going on.

If it does not respond, check whether it is a virtual machine and, if so, which server is hosting it. This information is available in LDAP (or the web interface, under the physicalHost field). Then log in to that host to diagnose the issue.
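
The lookup of the physicalHost field might be scripted along the following lines. This is only a sketch: the LDAP URI, the search base, and the "hostname" filter attribute are placeholders assumed for illustration, and the function names are hypothetical. Only the physicalHost attribute name comes from the text above.

```shell
# Assumed placeholders, not the real directory settings.
LDAP_URI="${LDAP_URI:-ldap://ldap.example.org}"
LDAP_BASE="${LDAP_BASE:-dc=example,dc=org}"

# Print the first physicalHost value found in LDIF read from stdin.
# Prints nothing when the attribute is absent.
physical_host_from_ldif() {
    sed -n 's/^physicalHost: *//p' | head -n 1
}

# Hypothetical lookup: the "hostname" filter attribute is an assumption.
# -x: simple authentication, -LLL: plain LDIF without comments.
lookup_physical_host() {
    ldapsearch -x -LLL -H "$LDAP_URI" -b "$LDAP_BASE" \
        "(hostname=$1)" physicalHost | physical_host_from_ldif
}
```

An empty result from lookup_physical_host means no physicalHost value was returned for that entry.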

If the physical host is not responding, or if the physicalHost field is empty (in which case the server is itself a physical host), you need to file a ticket with the upstream provider. The provider can be found in Nagios:

  1. search for the server name in the search box
  2. click on the server
  3. drill down the "Parents" until you find something that resembles a hosting provider (e.g. hetzner-hel1-01 is Hetzner, gw-cymru is Cymru, gw-scw-* are at Scaleway, gw-sunet is Sunet)
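
The naming examples in the last step can be captured in a small lookup, sketched below. The function name provider_for_parent is hypothetical, and matching every hetzner-* name is an assumption that generalizes from the single hetzner-hel1-01 example. Names not listed above fall through to "unknown".

```shell
# Hypothetical helper: map a Nagios parent name to a hosting provider,
# based only on the examples given in the list above.
provider_for_parent() {
    case "$1" in
        hetzner-*) echo "Hetzner" ;;   # assumption: generalizes hetzner-hel1-01
        gw-cymru)  echo "Cymru" ;;
        gw-scw-*)  echo "Scaleway" ;;
        gw-sunet)  echo "Sunet" ;;
        *)         echo "unknown" ;;   # keep drilling down the parents
    esac
}
```

For instance, provider_for_parent hetzner-hel1-01 prints Hetzner.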

Emergency policies

These still need to be defined more clearly, but we can consider that there are three "support levels" for emergencies:

  • code red: house is on fire, go go go
  • code yellow: Houston, we have a problem, but we'll live for a day
  • routine: file a bug report, we'll get to it soon!

Code red

A "code red" is a critical condition that requires immediate action. It's what we consider an "emergency".

Code yellow

A "code yellow" is a situation where we are overwhelmed but there isn't exactly an immediate emergency to deal with. There is a separate process, also called a "code yellow" (SRECON19 presentation, slides) and distinct from the code red above, which we might want to consider for fixing longer-term issues.

Routine

TBD.