Until someone figures out a better way to do it, the only way to break DES is by exhaustive key search. To minimize that search time on general purpose computers, the fastest possible DES implementation is required.

At Fast Software Encryption 4 (FSE4), held in Haifa, Israel in
January 1997, Dr Eli Biham
presented his paper *A Fast New DES Implementation in Software*.
It described a non-standard implementation of DES whereby the algorithm
is broken down to AND, OR, NOT, and XOR gates, and these gates are
implemented as machine instructions. For example, on a 64-bit machine,
this results in DES being executed 64 times in parallel. Depending on
the machine architecture (word size, RISC v CISC, etc), this can be
significantly faster than traditional DES implementations.

Obviously, the speed depends greatly on the number of gates required to implement DES. In particular, this means minimizing the number of gates used in the eight S-boxes.

My first (and probably greatest) contribution to the field of bitslice
DES was to use the word *bitslice* to describe the technique.
A few hours after Eli's talk, I was discussing the implications with
some other attendees over beers at the Hotel Marom, and that's where
I first used the word.

I had some bogus (?) theory that the security of a cipher can be determined by its gate count. To test this theory, and to put together a fast DES implementation for the RSA challenge (which was also announced at the conference), I decided to reproduce Eli's technique.

Well, my gate count theory has gone nowhere, but the name *bitslice*
has stuck.

In Eli's paper, he describes S-boxes with an average count of 100 gates. By making a modification to his gate generation algorithm, I got the average count down to a little over 88.

S-box | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
---|---|---|---|---|---|---|---|---|

Gates | 95 | 84 | 89 | 77 | 96 | 87 | 86 | 88 |

The modification involved testing all 6! (720) orderings of the S-box input bits, and choosing the ordering that produced the lowest gate count after optimization. The technique is described in more detail on a separate page. Similar results were obtained independently around the same time by Shiho Moriai and Anil Das (and possibly others).

However, I recently heard that Rocke Verser had produced a bitslice version of DES where the average S-box gate count was 65. I heard this second-hand, and because Rocke takes the US export restrictions seriously, I wasn't likely to see it first-hand. True or not, I wasn't going to take this lying down! (It later turned out that Darrell Kindred had produced S-boxes with 66.1 gates, and Rocke had boxes with 61, both using non-standard instructions).

So, with a long weekend ahead of me, I sat down to code a new algorithm to generate S-boxes. Fortunately the algorithm was effective, and I got the average gate count down to 62. Further improvements have brought the gate count down to 56

S-box | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
---|---|---|---|---|---|---|---|---|

Gates | 63 | 56 | 57 | 42 | 62 | 57 | 57 | 54 |

I won't describe the technique here just yet. Firstly, it might be worth publishing. But secondly, other people may be inspired to beat my results, and I'd rather not give them any pre-conceived notions, since my algorithm is by no means perfect. Coming at the problem with a fresh mind may produce even better results.

Postscript: It has been eight years since I wrote that paragraph, and since then I have presented the technique at Eurocrypt'99 and copies of the paper are floating around. So I'll make it available here. Finally. Actually, I just forgot about it.

It turns out that many CPUs these days have instructions for non-standard
logical operations, such as NAND, NOR, NXOR, AND-NOT, and OR-NOT.
It was a fairly minor modification to my algorithm to generate S-boxes
using these gates (although it required a *lot* more CPU time).
All NAND and NOR gates were then removed so that the code could run
efficiently on SPARC and Alpha architectures. Just for the hell of it,
most of the NXOR and all of the OR-NOT gates were also removed. The resulting
S-boxes had an average of 55.4 gates. This was later improved to 51

S-box | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
---|---|---|---|---|---|---|---|---|

Gates | 56 | 50 | 53 | 39 | 56 | 53 | 51 | 50 |

On the previous incarnation of this web site I was giving away code that was optimised for competing in the RSA challenge. Well, the challenge is over, and it turns out my optimizations aren't in the same league as those of people who do it seriously, so I'll stick to what I'm good at, i.e. functions that implement the S-boxes.

- Just the S-boxes sboxes.c (13027 bytes)
- With sample code sboxes.tar.gz (8922 bytes)
- S-boxes with non-standard gates nonstd.c (12250 bytes)

Feel free to incorporate this code into your password crackers or whatever, but remember to give credit where it's due. To see examples of this code in use, have a look at the distributed.net DES client, and the password cracker John the Ripper.

Post-postscript: Thirteen years after generating the original S-boxes, they have finally been improved upon, by an impressive 17 percent. Details here.

Document written by Matthew Kwan, 12 May 1998

Last modified by Matthew Kwan, 2 July 2011

Please send any comments or corrections to mkwan@darkside.com.au

Last modified by Matthew Kwan, 2 July 2011

Please send any comments or corrections to mkwan@darkside.com.au