mailfiles

Robbert Haarman

2010-12-11

Introduction

There are a variety of formats for storing email. The best known are mbox (the traditional UNIX mailbox format), the MH format, and maildir. Besides these fairly standard formats, there are a number of proprietary formats and some lesser-used ones. All of these have known strengths and weaknesses. mailfiles is yet another format, which aims to be simple, scalable, and free of locking issues, and to play nicely with replication and archiving.

Introduction to Mailbox Formats

mbox

The mbox format stores the contents of an entire mailbox in a single file. New messages are added by simply appending them to the file. The mbox format is probably the most widely supported format. Nearly any UNIX MUA can use it, and many of them use it as their default or even only format.

Fetching the entire contents of the mailbox (as is often done when the mailbox is accessed by a POP3 client) is efficient. Fetching message headers requires scanning through the file, but there is no overhead from opening and closing files, so this can still be reasonably efficient. Emptying the mailbox is as easy as truncating the file.

The mbox format has many shortcomings. It has trouble with random access. Deleting a message that is not at the end of the mailbox (something IMAP clients and MUAs that access the mailbox locally do frequently) requires reading and rewriting everything that follows it. Concurrent access is also an issue: if the mailbox is modified while you have it open, you need to re-read the entire mailbox to find out what changed. No more than one program can deliver messages to the mailbox at a time, or messages will be partially overwritten. An error in handling the mailbox can cause the entire mailbox (or everything after the error) to become unusable.
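To make the cost of deletion concrete, here is a minimal sketch in Python. It assumes a simplified mbox in which every line starting with "From " begins a new message (real mbox variants add quoting rules that are ignored here); the function names split_mbox and delete_from_mbox are hypothetical.

```python
def split_mbox(data: bytes) -> list:
    """Split raw mbox data into messages at lines beginning with 'From '.

    Simplification: real mbox variants have extra quoting rules.
    """
    messages = []
    current = []
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From ") and current:
            messages.append(b"".join(current))
            current = []
        current.append(line)
    if current:
        messages.append(b"".join(current))
    return messages


def delete_from_mbox(path: str, index: int) -> int:
    """Delete message `index` and return how many bytes had to be rewritten."""
    # The whole file is read just to find where the messages start.
    with open(path, "rb") as f:
        messages = split_mbox(f.read())
    offset = sum(len(m) for m in messages[:index])
    tail = b"".join(messages[index + 1:])
    # Everything after the deleted message is written again, shifted down.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(tail)
        f.truncate()
    return len(tail)
```

Deleting the first message of a large mailbox therefore rewrites nearly the whole file, whereas deleting the last one rewrites nothing.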

MH

The MH format represents a mailbox as a directory and each message as a file inside that directory. This allows mailboxes to be manipulated with regular file utilities; mv can be used for moving messages between mailboxes, rm removes messages, etc. Each message is assigned a sequence number as part of the filename, so that each message has a unique name.

The MH format is more elegant than mbox, because it fits in more naturally with existing concepts; a message is a file, and a collection of messages is a collection of files. It overcomes many of the weaknesses of mbox; random access to messages is straightforward and efficient. Modifications to one message require only that message to be reloaded. Viewing, say, the latest 50 messages requires performing a query on the directory listing, rather than a scan through all data in the mailbox.

The MH format is not perfect, though. Using sequence numbers in filenames leads to potential race conditions on delivery: if two agents start delivering a message around the same time, they might both generate the same sequence number, and one of the messages may be overwritten or the file might even be completely garbled.
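The race can be illustrated with a naive delivery routine. This is a sketch, not how any particular MH implementation works; next_sequence_number and naive_deliver are hypothetical names.

```python
import os


def next_sequence_number(mailbox_dir: str) -> int:
    """Return one more than the highest numeric filename in the mailbox."""
    numbers = [int(name) for name in os.listdir(mailbox_dir) if name.isdecimal()]
    return max(numbers, default=0) + 1


def naive_deliver(mailbox_dir: str, message: bytes) -> str:
    """Deliver a message in two separate steps, which is what makes it racy."""
    number = next_sequence_number(mailbox_dir)  # step 1: choose a name
    # A second agent that runs step 1 at this point sees the same directory
    # listing and therefore chooses the same number.
    path = os.path.join(mailbox_dir, str(number))
    with open(path, "wb") as f:                 # step 2: create the file
        f.write(message)
    return path
```

Because choosing the name and creating the file are separate steps, two agents that interleave between them end up writing to the same file.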

maildir

maildir uses one file per message, but instead of using sequence numbers for naming messages, it generates a unique id (consisting of time, a unique number, and the hostname that made the delivery). The filename also contains message flags. A mailbox is represented by a directory containing 3 subdirectories: cur, new and tmp. A message that is being delivered is written to tmp. When the message has been written out completely, it is atomically moved to new. When the message has been seen by the user, it is moved to cur. This ensures that no incomplete messages are ever seen by MUAs, and that new messages are easily listed, even in very large mailboxes.
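The delivery steps described above can be sketched as follows. This is an illustration under stated assumptions rather than a complete maildir implementation: the "unique number" is assumed here to be the process id plus a per-process counter, and the names unique_name and deliver are hypothetical.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()


def unique_name() -> str:
    """Build a name from the time, a unique number, and the hostname."""
    # Assumption: pid plus a per-process counter serves as the unique number.
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    return f"{int(time.time())}.{os.getpid()}_{next(_counter)}.{host}"


def deliver(maildir: str, message: bytes) -> str:
    """Write the message to tmp, then move it atomically to new."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    name = unique_name()
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)
    # Mode "xb" fails if the file already exists, so nothing is overwritten.
    with open(tmp_path, "xb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())
    # rename() within one filesystem is atomic: readers of new see either
    # the complete message or nothing.
    os.rename(tmp_path, new_path)
    return new_path
```

A reader that only ever looks in new and cur never observes a half-written message, because the file becomes visible there only after the rename.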

The maildir format was carefully designed to solve the problems of other formats and does a very good job at this. Concurrent mailbox access is never a problem; even multiple agents delivering to the mailbox at the same time do not interfere with one another. New messages can be found very fast. Random access to messages is efficient. Modifying message flags does not require altering the message itself; just renaming the file is enough.

What's not to like about maildir? Well, imagine you make a backup of your mailbox, or that you keep your mailbox in multiple locations, for redundancy or other reasons. Now you modify your mailbox in one place. Messages will be in different locations or have different names than they used to. If you now synchronize your replicas, you will suddenly have duplicate messages: the ones without the modifications, and the ones with. The multiple directories and flags in filenames are not only inelegant, but also lead to problems. These could be overcome, but it would require software written specifically for that purpose. Not very elegant.
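The duplication problem can be shown without touching the filesystem by modelling each replica as a set of relative paths. This is a sketch: naive_sync stands in for any generic tool that copies files missing on the other side, base_id is a hypothetical helper, and the example filenames are made up.

```python
def naive_sync(a: set, b: set) -> set:
    """Model a generic file synchronizer: keep every path seen on either side."""
    return a | b


def base_id(relpath: str) -> str:
    """Strip the subdirectory and the flags suffix to recover the unique id."""
    name = relpath.rsplit("/", 1)[-1]
    return name.split(":", 1)[0]


# The backup was taken before the message was read.
backup = {"new/1292025600.123_0.host"}
# In the live mailbox the message has since been read: it moved from new
# to cur and gained a flags suffix.
live = {"cur/1292025600.123_0.host:2,S"}

merged = naive_sync(live, backup)
paths = len(merged)                            # number of files after the sync
messages = len({base_id(p) for p in merged})   # number of distinct messages
```

After the sync there are two files for what is really one message. Avoiding that requires a tool that understands the maildir naming scheme, as base_id does here.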

mailfiles

Design Goals

Before presenting my proposed mail storage format (which I have dubbed ‘mailfiles’), let me restate the design goals:

Efficient Random Access

Since mailfiles is primarily a method for local mail storage, we want random access (as opposed to leech and delete access) to be efficient. Users will be viewing and deleting messages from anywhere in the mailbox; we don't want to have to scan all the data in the mailbox each time such an action is performed.

Scalability

We want messages to be quickly accessible, ideally regardless of the size of the mailbox. While it will probably not be possible to eliminate all effects of mailbox size, we can at least avoid having to scan through data linearly.

Concurrency

We want to allow not only multiple delivery agents but also multiple users to access the mailbox at the same time.

Persistence

If a message can be found in some location at one time, we want it to stay in that location until it is either deleted or moved explicitly by the user.

Interoperability

We don't want to tie the format to one implementation. Ideally, any program that deals with mail would be able to handle mailfiles.

Elegance

We want the format to achieve the design goals as well as possible, but without making the format unnecessarily complex or convoluted. Messages should be readily accessible to humans using general-purpose software wherever possible - after all, the idea behind email is usually that people read it.

Format

The mailfiles format closely resembles the MH format. Messages are files, and mailboxes are directories. Message filenames must be unique. To guarantee uniqueness, the filename contains the Message-ID (see RFC 2822) as the last element. If a message that is to be added to a mailbox does not have a Message-ID, one must be generated.

The full filename consists of the contents of the Subject, From and Message-ID headers, separated by double percent signs. Any characters not suitable for use in filenames must be escaped. The convention is to use % escape codes as in URLs, except that spaces can be escaped with underscores for better human readability. The Subject and From fields may be truncated to yield shorter filenames, but must preserve enough information to give the user an overview of what is in the mailbox without opening the files. On systems that cannot cope with long filenames, a different naming strategy will have to be adopted, but this is not discussed here.
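As an illustration of the naming scheme, the following Python sketch builds a filename from a raw message. Several details are assumptions that the text above leaves open: which characters count as safe (here letters, digits, ".", "-", "~" and "@"), that literal underscores are written as %5F so that "_" always means a space, and the 40-character truncation limit. The names escape_field and mailfile_name are hypothetical.

```python
from email import message_from_bytes
from email.utils import make_msgid
from typing import Optional
from urllib.parse import quote

SEPARATOR = "%%"
SAFE = "@"  # assumption: kept unescaped in addition to letters, digits, ".-~"


def escape_field(value: str, max_len: Optional[int] = None) -> str:
    """Escape one header value for use as part of a filename."""
    value = " ".join(value.split())      # collapse folded header whitespace
    if max_len is not None:
        value = value[:max_len]          # truncate before escaping
    escaped = quote(value, safe=SAFE)    # URL-style % escapes
    escaped = escaped.replace("_", "%5F")  # assumption: literal "_" is escaped
    return escaped.replace("%20", "_")   # spaces become underscores


def mailfile_name(raw: bytes, max_field_len: int = 40) -> str:
    """Return Subject%%From%%Message-ID for a raw message."""
    msg = message_from_bytes(raw)
    subject = str(msg.get("Subject", ""))
    sender = str(msg.get("From", ""))
    # Generate a Message-ID if the message lacks one. Whether it should also
    # be written back into the message is not decided in this sketch.
    message_id = str(msg.get("Message-ID") or make_msgid())
    return SEPARATOR.join([
        escape_field(subject, max_field_len),
        escape_field(sender, max_field_len),
        escape_field(message_id),
    ])
```

Truncation is applied before escaping so that an escape sequence is never cut in half. Because an escaped field never contains two consecutive percent signs and never ends in one, the filename can be split back into its three parts at the first two occurrences of the separator.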

Besides the message files (and possibly subdirectories), the mailbox directory contains one additional file named .mailfiles_version_0_1. As the name implies, this indicates the version of the format in use. Later versions may update the specification to include additional information or use different formats. Where these formats are compatible with existing specifications, only the last number will be incremented. If the update introduces incompatibilities, the first number will be incremented, and the last one reset to 0.
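A program could use the version file roughly as follows. This is a sketch: the function names are hypothetical, and treating any mailbox with the same first number as readable is an assumption about what "compatible" means here.

```python
import os
import re
from typing import Optional, Tuple

VERSION_RE = re.compile(r"^\.mailfiles_version_(\d+)_(\d+)$")


def mailbox_version(mailbox_dir: str) -> Optional[Tuple[int, int]]:
    """Return the (first, last) version numbers, or None if there is no marker."""
    for name in os.listdir(mailbox_dir):
        match = VERSION_RE.match(name)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def can_read(mailbox_dir: str, supported_first: int = 0) -> bool:
    """Assumption: a mailbox is readable when the first number matches."""
    version = mailbox_version(mailbox_dir)
    if version is None:
        return False  # not a mailfiles mailbox, as far as we can tell
    first, _last = version
    return first == supported_first
```

Encoding the version in a filename means it can be checked from the directory listing alone, without opening any file.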

Evaluation

TBD