How SNOW works

This document gives a description of the encoding scheme used by snow.

The Nature of Steganography

Steganography is the science of concealing messages in other messages. Some historical techniques have involved invisible ink, subtle indentations in paper, and even tattooing messages under the hair of messengers. In this digital age, steganography provides means for hiding messages in digital audio files, in some kinds of images, and even for generating pseudo-English text which encodes the message.

Ideally, the original message is not noticeably degraded by presence of a hidden message. As a result, the most effective techniques tend to make use of data that contains a lot of redundancy, such as raw audio and image files. Steganography works much less effectively, if at all, with efficient compressed formats such as JPEG and MPEG.

Unfortunately, sending large amounts of raw audio and image data can arouse suspicion, and the pseudo-English encoding schemes are not sophisticated enough to fool a human observer.

Whitespace Steganography

The encoding scheme used by snow relies on the fact that spaces and tabs (known as whitespace), when appearing at the end of lines, are invisible when displayed in pretty well all text viewing programs. This allows messages to be hidden in ASCII text without affecting the text's visual representation. And since trailing spaces and tabs occasionally occur naturally, their existence should not be sufficient to immediately alert an observer who stumbles across them.

The snow program runs in two modes - message concealment, and message extraction. During concealment, the following steps are taken.

Message -> optional compression -> optional encryption -> concealment in text

Extraction reverses the process.

Extract data from text -> optional decryption -> optional uncompression -> message

Each of the steps are described in detail below.

Compression

The compression scheme used by snow is a fairly rudimentary Huffman encoding scheme, where the tables are optimised for English text. This was chosen because the whitespace encoding scheme provides very limited storage space in some situations, and a compression algorithm with low overhead was needed. In other words, short messages had to compress to even shorter data. Depending on the text, you can usually get 25 - 40% compression.

If you want to compress a long message, or one not containing standard text, you would be better off compressing the message externally with a specialized compression program, and bypassing snow's optional compression step. This usually results in a better compression ratio.

Encryption

The encryption algorithm built in to snow is ICE, a 64-bit block cipher also designed by the author of snow. It runs in 1-bit cipher-feedback (CFB) mode, which although inefficient (requiring a full 64-bit encryption for each bit of output), provides the best possible security when different messages are encrypted with the same password. Although using the same password many times is theoretically a big no-no, in the real world it often can't be avoided.

The lower 7 bits of each character in the password are packed into an array, which is used to set the encryption key. The ICE encryption algorithm can operate at different levels, with higher levels using longer keys and providing more security. The ICE level appropriate for the password length is used.

CFB mode makes use of an initialization vector (IV), which is initially set to the first 64 bits of the key encrypted by itself. Each time a bit is encrypted, the IV is encrypted, and the leftmost bit of the encrypted IV is XORed with the bit. The IV is then shifted left one bit, and the ciphertext bit is added to the right. Decryption reverses this process.

The Encoding Scheme

To show the beginning of a message, a tab is added immediately after the text on the first line where it will fit. This prevents the insertion of mail and news headers containing trailing spaces from corrupting the message, since a trailing tab must be found before extraction begins.

Data is written 3 bits at a time, coding for 0 to 7 spaces. Any messages not a multiple of 3 bits will be padded by zeroes. During extraction, an extra one or two bits at the end will be ignored (fortunately there are no two-bit Huffman codes to confuse things).

An alternative scheme was considered, where bits were written one at a time as either a space or a tab. Although this scheme adds fewer characters per bit (1 vs 1.5), it requires more columns per bit (4.5 vs 2.67), and column space is the limiting factor.

Tabs are used to separate the blocks of spaces. Thus 3 bits are usually coded in 8 columns of text, and given that the default line length is 80 characters, this allows 30 bits to be stored on empty lines. A tab is not appended to the end of a line unless the last 3 bits coded to zero spaces, in which case it is needed to show some bits are actually there.

If a message will not fit into the available text, empty lines will be appended and used to contain the overflow. A warning message will also be produced, since this affects the look of the original text.