This documentation is for sysadmins to figure out what to do when things go wrong. If you don't have the required accesses and haven't been trained for such situation, you might be better off just trying to wake up someone that can deal with them. See the how-to-get-help documentation instead.

Specific situations

Server down

If a server is non-responsive, you can first check if it is actually reachable over the network:

ping -c 10 server.torproject.org

If it does respond, you can try to diagnose the issue by looking at Nagios and/or Grafana and analyse what, exactly is going on.

If it does not respond, you should see if it's a virtual machine, and in this case, which server is hosting it. This information is available in ldap (or the web interface, under the physicalHost field). Then login to that server to diagnose this issue.

If the physical host is not responding or is empty (in which case it is a physical host), you need to file a ticket with the upstream provider. This information is available in Nagios:

  1. search for the server name in the search box
  2. click on the server
  3. drill down the "Parents" until you find something that ressembles a hosting provider (e.g. hetzner-hel1-01 is Hetzner, gw-cymru is Cymru, gw-scw-* are at Scaleway, gw-sunet is Sunet)

What follows are per-provider instructions:

Hetzner robot (physical servers)

  1. Visit the Heztner Robot server page (password in tor-passwords/hosts-extra-info)
  2. Select the right server (hostname is the second column)
  3. Select the "reset" tab
  4. Select the "Execute an automatic hardware reset" radio button and hit "Send". This is equivalent to hitting the "reset" button on a computer.
  5. Wait for the server to return for a "few" (2? 5? 10? 20?) minutes, depending on how hopeful you are this simple procedure will work.
  6. If that fails, Select the "Order a manual hardware reset" option and hit "Send". This will send an actual human to attend the server and see if they can bring it back online.

If all else fails, Select the "Support" tab and open a support request.

Hetzner Cloud (virtual servers)

  1. Visit the Hetzner Cloud console (password in tor-passwords/hosts-extra-info)
  2. Select the project (usually "default")
  3. Select the affected server
  4. Open the console (the >_ sign on the top right), and see if there are any error messages and/or if you can login there (using the root password in tor-passwords/hosts)
  5. If that fails, attempt a "Power cycle" in the "Power" tab (on the left)
  6. If that fails, you can also try to boot a rescue system by selecting "Enable Rescue & Power Cycle" in the "Rescue" tab

If all else fails, create a support request. The support menu is in the "Person" menu on the top right of the page.

Cymru

Open a ticket by writing support@cymru.com.

Sunet

TBD

Support policies

We consider there are three "support levels" for problems that come up with services:

  • code red: immediate emergency, fix ASAP
  • code yellow: serious problem that doesn't require immediate attention but that could turn into a code red if nothing is donw
  • routine: file a bug report, we'll get to it soon!

We do not have 24/7 oncall support, so requests are processed during work times of available staff. We do try to provide continuous support as much as possible, but it's possible that some weekends or vacations are unattended for more than a day. This is the definition of a "business day".

The TPA team is currently small and there might be specific situations where a code RED might require more time than expected and as a organization we need to do an effort in understanding that.

TPA is responsible for the base operating system and not all services running on TPO infrastructure, see the service admin definition for details on that distinction.

Debian GNU/Linux is the only supported operating system, and we support only the "stable" and "oldstable" distributions, until the latter becomes EOL. We do not support Debian LTS. It is the responsability of service admins to upgrade their services to keep up with the Debian release schedule.

Code red

A "code red" is a critical condition that requires immediate action. It's what we consider an "emergency". Our SLA for those is 24h business days, as defined above. Services qualifying for a code red are:

Other services fall under "routine" or "code yellow" below, which can be upgraded in priority.

Examples of problems falling under code red include:

  • website unreachable
  • emails to torproject.org not reaching our server

Some problems fall under other teams and are not the responsability of TPA, even if they can be otherwise considered a code red.

So, for example, those are not code reds for TPA:

  • website has a major design problem rendering it unusable
  • donation backend failing because of a problem in CiviCRM
  • gmail refusing all email forwards
  • encrypted mailing lists failures
  • gitolite refuses connexions

Code yellow

A "code yellow" is a situation where we are overwhelmed but there isn't exactly an immediate emergency to deal with. A good introduction is this SRECON19 presentation (slides). The basic idea is that a code yellow is a "problem [that] creeps up on you over time and suddenly the hole is so deep you can’t find the way out".

There's no clear timeline on when such a problem can be resolved. If the problem is serious enough, it may eventually be upgraded to a code red by the approval of a team lead after a week's delay, regardless of the affected service. In that case, a "hot fix" (some hack like throwing hardware at the problem) may be deployed instead of fixing the actual long term issue, in which case the problem becomes a code yellow again.

Examples of a code yellow include:

Routine

Routine tasks are normal requests that are not an emergency and can be processed as part of the normal workflow.

Example of routine tasks include:

  • account creation
  • group access changes
  • email alias changes
  • static web component changes
  • examine disk usage warning
  • security upgrades
  • server reboots