How-to

Creating a queue

On the RT web interface:

  1. authenticate to https://rt.torproject.org/
  2. head to the Queue creation form (Admin -> Queues -> Create)
  3. pick a Queue Name, set the Reply Address to QUEUENAME@rt.torproject.org and leave the Comment Address blank
  4. hit the Create button
  5. grant a group access to the queue in the Group rights tab (create a group if necessary); grant the following to the group:
    • all "General rights"
    • in "Rights for staff":
      • Delete tickets (DeleteTicket)
      • Forward messages outside of RT (ForwardMessage)
      • Modify ticket owner on owned tickets (ReassignTicket)
      • Modify tickets (ModifyTicket)
      • Own tickets (OwnTicket)
      • Sign up as a ticket or queue AdminCc (WatchAsAdminCc)
      • Take tickets (TakeTicket)
      • View exact outgoing email messages and their recipients (ShowOutgoingEmail)
      • View ticket private commentary (ShowTicketComments)
      That is, everything but:
      • Add custom field values only at object creation time (SetInitialCustomField)
      • Modify custom field values (ModifyCustomField)
      • Steal tickets (StealTicket)
  6. if the queue is public (and it most likely is), grant the following to the Everyone, Privileged, and Unprivileged groups:
    • Create tickets (CreateTicket)
    • Reply to tickets (ReplyToTicket)

On the RT server (currently rude):

  1. edit the /etc/aliases file to add a line like:

    rt-QUEUENAME: rtmailarchive+QUEUENAME,      "|/usr/bin/rt-mailgate --queue QUEUENAME --action correspond --url https://rt.torproject.org/"
    
  2. regenerate the alias database:

    newaliases
    
  3. add an entry in the virtual table (/etc/postfix/virtual):

    QUEUENAME@rt.torproject.org rt-QUEUENAME
    
  4. regenerate the virtual database:

    postmap /etc/postfix/virtual
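
For example, for a hypothetical queue named "foo" (the name is purely illustrative), the whole server-side sequence would look roughly like this; in practice the files are normally edited by hand rather than appended to blindly:

    # run as root on the RT server; "foo" is a placeholder queue name
    printf '%s\n' 'rt-foo: rtmailarchive+foo, "|/usr/bin/rt-mailgate --queue foo --action correspond --url https://rt.torproject.org/"' >> /etc/aliases
    newaliases
    printf '%s\n' 'foo@rt.torproject.org rt-foo' >> /etc/postfix/virtual
    postmap /etc/postfix/virtual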
    

In Puppet:

  1. add an entry in the main mail server virtual file (currently tor-puppet/modules/postfix/files/virtual) like:

    QUEUENAME@torproject.org         QUEUENAME@rt.torproject.org
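
For the hypothetical "foo" queue used above, that line would read foo@torproject.org foo@rt.torproject.org. Once both the RT server and Puppet changes are deployed, a rough way to smoke-test the whole chain is to send a test mail and check that a ticket shows up in the queue; this is only a sketch and assumes a host with a working MTA and mail(1) installed:

    # a new ticket should appear in the "foo" queue shortly afterwards
    echo "test message, please ignore" | mail -s "queue setup test" foo@torproject.org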
    

TODO: the above should be automated. Ideally, mail sent to QUEUENAME@rt.torproject.org should automatically be delivered to the matching queue. That way, RT admins could create queues without requiring the intervention of a sysadmin.

Discussion

Spam filter training design

The RT setup is designed so that the spam filter can be trained from it: RT users move spam into the "Spam" queue, and a set of scripts then runs in the background to train SpamAssassin, based on a mail archive that procmail keeps of every incoming mail.

This runs as a cron job under the rtmailarchive user, and looks like this:

/srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete
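
Spelled out with comments, the same pipeline looks like this (it runs from the rtmailarchive home directory, so the relative paths are relative to /srv/rtmailarchive):

    # 1. sort archived mail into Maildir/.spam.learn and Maildir/.xham.learn,
    #    depending on the status of the matching ticket in RT (first script below)
    /srv/rtstuff/support-tools/train-spam-filters/train_spam_filters &&
    # 2. feed those folders to sa-learn, moving each message to the matching
    #    *.learned folder (the spam-learn script further down)
    bin/spam-learn &&
    # 3. delete the messages that have already been learned
    find Maildir/.spam.learned Maildir/.xham.learned -type f -delete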

The first part is the following Python script (from rude):

#!/usr/bin/python
#
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, as published by Sam Hocevar. See
# http://sam.zoy.org/wtfpl/COPYING for more details.

from __future__ import print_function

import email.parser
import psycopg2
import os
import os.path
from datetime import datetime, timedelta

DEBUG = False

MAILDIR_ROOT = os.path.join(os.environ['HOME'], 'Maildir')
SPAM_MAILDIR = '.spam.learn'
HAM_MAILDIR = '.xham.learn'

KEEP_FOR_MAX_DAYS = 100

RT_CONNINFO = "host=localhost sslmode=require user=rtreader dbname=rt"

SELECT_HAM_TICKET_QUERY = """
    SELECT DISTINCT Tickets.Id
      FROM Queues, Tickets, Transactions
           LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
     WHERE Queues.Name LIKE 'help%%'
       AND Tickets.Queue = Queues.Id
       AND Tickets.Status = 'resolved'
       AND Transactions.ObjectId = Tickets.Id
       AND Transactions.ObjectType = 'RT::Ticket'
       AND Attachments.MessageId = %s;
"""

SELECT_SPAM_TICKET_QUERY = """
    SELECT DISTINCT Tickets.Id
      FROM Queues, Tickets, Transactions
           LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
     WHERE Queues.Name = 'spam'
       AND Tickets.Queue = Queues.Id
       AND Tickets.Status = 'rejected'
       AND Transactions.ObjectId = Tickets.Id
       AND Transactions.ObjectType = 'RT::Ticket'
       AND Attachments.MessageId = %s;
"""

EMAIL_PARSER = email.parser.Parser()

if DEBUG:
    def log(msg):
        print(msg)
else:
    def log(msg):
        pass

def is_ham(msg_id):
    global con

    cur = con.cursor()
    try:
        cur.execute(SELECT_HAM_TICKET_QUERY, (msg_id,))
        return cur.fetchone() is not None
    finally:
        cur.close()

def is_spam(msg_id):
    global con

    cur = con.cursor()
    try:
        cur.execute(SELECT_SPAM_TICKET_QUERY, (msg_id,))
        return cur.fetchone() is not None
    finally:
        cur.close()

def handle_message(path):
    msg = EMAIL_PARSER.parse(open(path), headersonly=True)
    msg_id = msg['Message-Id']
    if not msg_id.startswith('<') or not msg_id.endswith('>'):
        log("%s: bad Message-Id, removing." % path)
        os.unlink(path)
        return
    msg_id = msg_id[1:-1]
    if is_ham(msg_id):
        os.rename(path, os.path.join(MAILDIR_ROOT, HAM_MAILDIR, 'cur', os.path.basename(path)))
        log("%s: ham, moving." % path)
        return
    if is_spam(msg_id):
        os.rename(path, os.path.join(MAILDIR_ROOT, SPAM_MAILDIR, 'cur', os.path.basename(path)))
        log("%s: spam, moving." % path)
        return
    mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
    limit = datetime.now() - timedelta(days=KEEP_FOR_MAX_DAYS)
    if mtime <= limit:
        log("%s: too old, removing." % path)
        os.unlink(path)
        return
    # well, it's not identified ham, not identified spam, and not too old
    # let's keep the message for now
    log("%s: unknown, keeping." % path)

def scan_directory(dir_path):
    for filename in os.listdir(dir_path):
        path = os.path.join(dir_path, filename)
        handle_message(path)

con = None

if __name__ == '__main__':
    con = psycopg2.connect(RT_CONNINFO)
    for filename in os.listdir(MAILDIR_ROOT):
        if filename.startswith('.help'):
            for subdir in ['new', 'cur', 'tmp']:
                scan_directory(os.path.join(MAILDIR_ROOT, filename, subdir))
    con.close()

It is unclear whether this program was written for TPO or comes from elsewhere. It is included here for reference but might have changed since this documentation was written. What it does, basically, is:

  1. for each mail in the archive:
  2. find its Message-Id header
  3. look up the corresponding ticket in RT:
    • if it is in the Spam queue and marked "rejected", the mail is spam.
    • if it is in a help-* queue and marked "resolved", the mail is ham.
  4. move the mail into the matching mail folder (.spam.learn or .xham.learn); mail that matches neither case is kept for a later run, unless it is older than 100 days, in which case it is deleted
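
To check by hand how the script would classify a particular message, the same kind of query can be run through psql; this is just a sketch reusing the connection parameters from the script above, with a made-up Message-Id to substitute:

    # does RT consider this Message-Id spam? (query copied from the script above)
    psql "host=localhost sslmode=require user=rtreader dbname=rt" -c "
        SELECT DISTINCT Tickets.Id
          FROM Queues, Tickets, Transactions
               LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
         WHERE Queues.Name = 'spam'
           AND Tickets.Queue = Queues.Id
           AND Tickets.Status = 'rejected'
           AND Transactions.ObjectId = Tickets.Id
           AND Transactions.ObjectType = 'RT::Ticket'
           AND Attachments.MessageId = 'some-message-id@example.com';"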

Then the rest of the cron job continues. spam-learn is this shell script:

#!/bin/bash

dbpath="/var/cache/spampd"

learn() {
    local what="$1"; shift;
    local whence="$1"; shift;
    local whereto="$1"; shift;

    (
        cd "$whence"
        find -type f | \
          while read f; do
            sudo -u spampd -H sa-learn --dbpath "$dbpath" --"$what" < "$f"
            mv "$f" "$whereto/$f"
        done
    )
}

set -e

learn spam /srv/rtmailarchive/Maildir/.spam.learn /srv/rtmailarchive/Maildir/.spam.learned
learn ham /srv/rtmailarchive/Maildir/.xham.learn /srv/rtmailarchive/Maildir/.xham.learned

# vim:set et:
# vim:set ts=4:
# vim:set shiftwidth=4:

which, basically, calls sa-learn on each individual email in the .spam.learn and .xham.learn folders, moving each message to .spam.learned or .xham.learned when done.

Then, in the last step of the cron job, those learned emails are deleted (the find ... -delete at the end of the pipeline). It's unclear why that is not done in the spam-learn step directly.

Possible improvements

The above design has a few problems:

  1. it assumes "ham" queues are named "help-*" - but there are other queues in the system
  2. it might be slow: if there are lots of emails to process, it will do one SQL query and one file move per email, instead of batching them
  3. it is split over multiple scripts (shell and Python) that are not under version control

I would recommend the following:

  1. reverse the logic of the queue checks: instead of only looking at folders and queues named help-*, treat everything that is not named spam* or xham* as ham
  2. batch the work: use a generator to yield Message-Id values, then process the emails in chunks, sending one query per batch to PostgreSQL and doing the renames in one pass
  3. do all operations in one place: query PostgreSQL, move the files into the learning folders, and train, possibly in parallel, but at least all from the same script
  4. sa-learn can read from a folder directly nowadays, so the wrapper shell script is not needed in any case (a sketch follows this list)
  5. commit the script to version control and, even better, deploy it through Puppet
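
For point 4, a minimal sketch of what calling sa-learn directly on the folders could look like, reusing the dbpath and the spampd user from the current setup (untested, paths as on the current server):

    # sa-learn accepts directories of messages, so no per-message loop is needed
    sudo -u spampd -H sa-learn --dbpath /var/cache/spampd --spam /srv/rtmailarchive/Maildir/.spam.learn/cur
    sudo -u spampd -H sa-learn --dbpath /var/cache/spampd --ham /srv/rtmailarchive/Maildir/.xham.learn/cur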