FFFF

FFFF

Robbert Haarman

2010-11-04


Introduction

FFFF is a data encoding designed for efficiency, flexibility, and ease of processing. Like s-expressions and XML, FFFF allows arbitrarily deep nesting and can be used to express any kind of data structure or program. Unlike s-expressions and XML, FFFF is a binary format, optimized for being quickly parsed and/or transmitted.


Status

FFFF is currently in draft status. Experimental implementations are being written. The experience gathered during this process will be used to update and finalize the definition of the format. Until then, this document is subject to change without notice.


Overview

The primary purpose of FFFF is encoding data. Ideally, it should be possible to encode any data that one might wish to store and/or transfer in such a way that the result preserves the structure of the original data, takes up few bytes, and can be read by another program with little work.

The goals set out for FFFF are very ambitious, and it is possible that experience or specific applications will lead to new insights and the development of better encoding schemes. FFFF anticipates this by providing a method for switching to a specific language, where a language has complete freedom in defining its own encoding scheme. Of course, a plethora of different languages is detrimental to interoperability. Therefore, care has been taken to make FFFF as described in this document good enough to cover the vast majority of cases. The use of this language is indicated by setting the language to FFFF version 0.2.

In FFFF, data is encoded as a stream of individual data items, each called a datum. Each datum consists of a tag identifying what kind of datum it is. The tag is followed by a value of zero or more bytes, whose interpretation depends on the kind of datum.

Tags in FFFF are natural numbers, encoded using FFFF's numeral encoding. In order to parse an FFFF datum, a program reading the datum must know the tag's definition. These definitions are provided by the language specification. The tags defined in this version of FFFF can be found in the section on tags.

In addition to encoding data directly, values may also be encoded by reference. FFFF allows defining a tag to refer to a specific value. After such a definition is made, the tag can be used as a reference to the value, which preserves object identity and conserves space. More information can be found in the description of definitions.


Encoding

Two commonly used encoding schemes used in FFFF are numeral encoding and blob encoding. These are described below.

Numeral Encoding

In FFFF, numerals are used to encode arbitrarily large numeric values. The numeral encoding consists of one or more bytes and can express any integer value of zero or greater.

The numeral encoding consists of a number of code bytes, where each byte contains a continuation bit (the most significant bit) and 7 data bits (the least significant bits). If the continuation bit is set to 1, this indicates that the code byte is followed by another code byte. The last code byte has its continuation bit set to 0. The data bits in the code bytes correspond to the bits in the encoded value, where the first code byte holds the 7 least significant bits of the encoded value, and each successive code byte holds the 7 next least significant bits of the encoded value.

Examples

Any value from 0 to 127 (both inclusive) can be encoded in a single byte:

Decimal0
Binary0
Encoding00000000
Decimal42
Binary101010
Encoding00101010
Decimal127
Binary1111111
Encoding01111111

The value 128 needs two bytes in FFFF numeral encoding:

Decimal128
Binary1 0000000
Encoding1000000 00000001

The value 1387055 needs three bytes:

Decimal1387055
Binary1010100 1010100 0101111
Encoding10101111 11010100 01010100

Blob Encoding

A blob can hold arbitrary binary data. In FFFF, blob encoding consists of a numeral indicating the number of bytes in the blob, followed by the bytes comprising the blob data.

Example

Unencoded bytes (hex)e0 a4 ae e0 a5 87 e0 a4 a4 e0 a5 8d e0 a4 a4 e0 a4 be
Blob encoding (hex)12 e0 a4 ae e0 a5 87 e0 a4 a4 e0 a5 8d e0 a4 a4 e0 a4 be

Tags

This section gives an overview of the tags defined in FFFF version 0.2.

Integers

Integer values are encoded entirely in the tag. To mark the tag as encoding an integer, the least significant bit of the tag is set to 1 (all tags for non-integer values have this bit set to 0). The rest of the tag bits are used to encode the integer's value in two's complement encoding. The following provide examples of integers encoded using this scheme:

Decimal value0
Two's complement0
FFFF encoding00000001
Decimal value1
Two's complement01
FFFF encoding00000011
Decimal value-1
Two's complement1
FFFF encoding01111111
Decimal value31
Two's complement011111
FFFF encoding00111111
Decimal value-32
Two's complement100000
FFFF encoding01000001
Decimal value32
Two's complement0 100000
FFFF encoding11000001 00000000
Decimal value100 000
Two's complement01100 0011010 100000
FFFF encoding11000001 10011010 00001100

Available since: FFFF version 0.1.

Booleans

There are two possible boolean values: true and false. Each of these has its own tag in FFFF:

ValueTag
false0
true2

Available since: FFFF version 0.1.

Blobs

A blob is a container for arbitrary binary data; a blob contains zero or more bytes. The tag value for blobs is 4. The tag is followed by the value, in blob encoding. For example:

Bytes to be encoded (hex) e9 9b bb e5 ad 90 e8 a8 88 e7 ae 97 e6 a9 9f
Encoding (hex) 04 0f e9 9b bb e5 ad 90 e8 a8 88 e7 ae 97 e6 a9 9f

Available since: FFFF version 0.1.

Strings

In FFFF, a tag value of 6 indicates a string. This is followed by a blob-encoded value consisting of a numeral indicating the number of characters in the string, followed by the contents of the string, encoded as UTF-8. Note that, in UTF-8, the number of characters is not necessarily equal to the number of bytes.

For example, the string Hello, world! is encoded as follows:

StringHello, world!
UTF-8 (hex)48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21
Number of characters13 (0d hex)
Encoded string06 0e 0d 48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21

The string Здравствуй, мир! (Hello, world! in Russian) is encoded as follows:

String Здравствуй, мир!
UTF-8 (hex) d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 2c 20 d0 bc d0 b8 d1 80 21
Number of characters 16 (10 hex)
Encoded string 06 1e 10 d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 2c 20 d0 bc d0 b8 d1 80 21

Available since: FFFF version 0.1.

Symbols

Symbols are named identifiers. There are two encodings for symbols in FFFF: the short encoding and the long encoding. The short encoding consists of only the symbol name, whereas the long encoding also includes a namespace. In both cases, the name is stored in UTF-8 with a character count prepended, similar to string encoding.

The short encoding for a symbol uses tag value 8, but is otherwise identical to the encoding for strings. That is, the tag is followed by a numeral indicating the number of bytes until the end of the encoded symbol (excluding the tag and this numeral), followed by a numeral indicating the number of characters in the symbol name, followed by the symbol name in UTF-8. For example:

Symbol namefoo
UTF-8 (hex)66 6f 6f
Short encoding08 04 03 66 6f 6f

The long encoding for a symbol uses tag value 10 (0a hex). The tag is followed by a numeral indicating the number of bytes until the end of the encoded symbol. After this numeral comes the namespace. The namespace can be any FFFF value, following the tag-value encoding. After the namespace, the symbol name is stored, as an UTF-8 string preceded by a numeral denoting the number of characters in the string.

The example below shows how a symbol quuz in the namespace foo is encoded. The symbol foo is encoded using short encoding, as shown in the earlier example.

Symbol namequuz
UTF-8 (hex)71 75 75 7a
Namespacefoo (symbol)
Namespace, encoded08 04 03 66 6f 6f
Long encoding0a 0b 08 04 03 66 6f 6f 03 71 75 75 7a

Both types of symbol encoding have been available since FFFF version 0.1.

Arrays

An array is a collection of zero or more values. There are two array encodings in FFFF: one with variable-size elements, and one with fixed-size elements.

An array with variable-size elements is encoded as the tag 12 (0c hex), followed by the size of the following data in bytes (as a numeral), followed by the number of elements in the array (as a numeral), followed by the elements themselves (as FFFF tag-value pairs). The elements follow directly after one another. Examples:

  • An empty array: 0c 01 00
  • An array containing the integers 1, 2, and 3: 0c 04 03   03   05   07
  • An array containing the strings foo and bar: 0c 0d 02   06 04 03 66 6f 6f   06 04 03 62 61 72

In arrays with fixed-size elements, elements are stored in fixed-size blocks, each the same number of bytes in size. If an element doesn't actually need all bytes allocated to it, the bytes are unused and are expected to be set to zero.

The encoding for an array with fixed-size elements is as follows: the tag 14 (0e hex), followed by the size of the following data in bytes (as a numeral), followed by the number of bytes per element (as a numeral), followed by the elements themselves. The number of elements in the array can be calculated by dividing the remaining bytes after the element size indicator by the number of bytes per element. Examples:

  • An array with two elements of six bytes each, containing the strings foo and bar: 0e 0d 06   06 04 03 66 6f 6f   06 04 03 62 61 72
  • An array with three elements of 7 bytes each, containing the integer 1, the string quuz, and the integer 2: 0e 16 07   03 00 00 00 00 00 00   06 05 04 71 75 75 7a   05 00 00 00 00 00 00

Both types of array encoding have been available since FFFF version 0.1.

Blocks

Blocks are used to group data. In FFFF, a block also provides scoping: tag definitions and language changes in a block persist until the end of the block, after which the definitions that were present at the beginning of the block are restored. This allows blocks to be processed independently and in parallel.

A block is encoded as a tag, followed by a numeral indicating the number of bytes until the end of the block, followed by zero or more FFFF values in the regular tag-value encoding. The tag value for blocks is 16 (10 hex).

The following example is the encoding for a block containing the symbol foo (in short encoding), the string bar, and the integer 42: 10 0e   08 04 03 66 6f 6f   06 04 03 62 61 72   d5 00

Available since: FFFF version 0.1.

Definitions

To allow the same datum to be referenced from multiple locations, and to allow for more compact encoding, FFFF allows defining a tag to refer to a specific datum. Such definitions are introduced by the tag 18 (12 hex), followed by the numeral which will act as a reference, followed by the FFFF datum it will refer to. For example, the following code defines the numeral 32 (20 hex) to refer to the string foo:

12   20   06 04 03 66 6f 6f

Given this definition, the tag 20 can subsequently be used to refer to the string foo in any place where an FFFF datum is expected. For example, with the above definition in effect, the following encodes an array containing 3 references to the string foo:

0e 04 01   20   20   20

Any tag that is not an integer may be defined this way, including tags that have previously been defined by earlier definitions or language definitions. Redefining an integer is an error, and the consequences of doing so are not defined by this document.

A definition takes effect immediately after it occurs. If it occurs inside a block, it is in effect until the end of the block. Otherwise, it is in effect until the end of the stream.

Available since: FFFF version 0.1.

Language

FFFF provides support for different languages, where each language can define its own encoding, which may or may not be compatible with the encoding described in this document. To support such languages, FFFF provides a tag to indicate the language of the content following it.

A language directive consists of the tag 16256 (3f80 hex), followed by a blob that identifies the language, followed by two numerals that indicate the major version and the minor version of the language, in order.

The major version and minor version allow new versions of languages to be released, while indicating to programs reading FFFF whether the language being used in an FFFF stream is compatible with the version of the language the program knows. Language version numbers must be assigned in such a way that a program that can read language L with major version X and minor version Y can also read language L with major version X and minor version Z, where Z is less than Y. Thus, if not all streams that are valid under an existing language version are valid under a new language version, the new language version must have a different major version number.

An FFFF stream that uses the language described in this document can indicate this by specifying a language identified by a blob containing four bytes of value 46 hex (the Unicode code point for the capital letter F), major version 0, and minor version 1. Encoded as an FFFF datum, this becomes: 80 7f 04 46 46 46 46 00 01

If the language is switched inside a block, the language switch is in effect until the end of the block.

Available since: FFFF version 0.1.

Imports

Tag 16258 (3f82 hex), identifier (blob), major version (numeral), minor version (numeral), offset (numeral). Imports a named set of definitions, at the given offset. Definitions are defined by export directives. Offset is added to the numbers mentioned in the export directives.

The imported definitions become available immediately after the import directive. If the import occurs inside a block, its effect lasts until the end of the block.

Available since: FFFF version 0.2.

Exports

Tag 16260 (3f84 hex), reference (numeral), datum. Same thing as a definition, but is made avaialable to the import statement.

Example of export and import: TODO

Available since: FFFF version 0.2.