Bitslice DES

The need for speed

Until someone figures out a better way to do it, the only way to break DES is by exhaustive key search. To minimize that search time on general purpose computers, the fastest possible DES implementation is required.

At Fast Software Encryption 4 (FSE4), held in Haifa, Israel in January 1997, Dr Eli Biham presented his paper A Fast New DES Implementation in Software. It described a non-standard implementation of DES whereby the algorithm is broken down to AND, OR, NOT, and XOR gates, and these gates are implemented as machine instructions. For example, on a 64-bit machine, this results in DES being executed 64 times in parallel. Depending on the machine architecture (word size, RISC v CISC, etc), this can be significantly faster than traditional DES implementations.

Obviously, the speed depends greatly on the number of gates required to implement DES. In particular, this means minimizing the number of gates used in the eight S-boxes.

My humble contribution

My first (and probably greatest) contribution to the field of bitslice DES was to use the word bitslice to describe the technique. A few hours after Eli's talk, I was discussing the implications with some other attendees over beers at the Hotel Marom, and that's where I first used the word.

I had some bogus (?) theory that the security of a cipher can be determined by its gate count. To test this theory, and to put together a fast DES implementation for the RSA challenge (which was also announced at the conference), I decided to reproduce Eli's technique.

Well, my gate count theory has gone nowhere, but the name bitslice has stuck.

Gate counts count

In Eli's paper, he describes S-boxes with an average count of 100 gates. By making a modification to his gate generation algorithm, I got the average count down to a little over 88.

S-box S1 S2 S3 S4 S5 S6 S7 S8
Gates 95 84 89 77 96 87 86 88

The modification involved testing all 6! (720) orderings of the S-box input bits, and choosing the ordering that produced the lowest gate count after optimization. The technique is described in more detail on a separate page. Similar results were obtained independently around the same time by Shiho Moriai and Anil Das (and possibly others).

However, I recently heard that Rocke Verser had produced a bitslice version of DES where the average S-box gate count was 65. I heard this second-hand, and because Rocke takes the US export restrictions seriously, I wasn't likely to see it first-hand. True or not, I wasn't going to take this lying down! (It later turned out that Darrell Kindred had produced S-boxes with 66.1 gates, and Rocke had boxes with 61, both using non-standard instructions).

So, with a long weekend ahead of me, I sat down to code a new algorithm to generate S-boxes. Fortunately the algorithm was effective, and I got the average gate count down to 62. Further improvements have brought the gate count down to 56

S-box S1 S2 S3 S4 S5 S6 S7 S8
Gates 63 56 57 42 62 57 57 54

I won't describe the technique here just yet. Firstly, it might be worth publishing. But secondly, other people may be inspired to beat my results, and I'd rather not give them any pre-conceived notions, since my algorithm is by no means perfect. Coming at the problem with a fresh mind may produce even better results.

Postscript: (29 November 2006) It has been eight years since I wrote that paragraph, and since then I have presented the technique at Eurocrypt'99 and copies of the paper are floating around. So I'll make it available here. Finally. Actually, I just forgot about it.

Non-standard gates

It turns out that many CPUs these days have instructions for non-standard logical operations, such as NAND, NOR, NXOR, AND-NOT, and OR-NOT. It was a fairly minor modification to my algorithm to generate S-boxes using these gates (although it required a lot more CPU time). All NAND and NOR gates were then removed so that the code could run efficiently on SPARC and Alpha architectures. Just for the hell of it, most of the NXOR and all of the OR-NOT gates were also removed. The resulting S-boxes had an average of 55.4 gates. This was later improved to 51

S-box S1 S2 S3 S4 S5 S6 S7 S8
Gates 56 50 53 39 56 53 51 50

Get your free source code!

On the previous incarnation of this web site I was giving away code that was optimised for competing in the RSA challenge. Well, the challenge is over, and it turns out my optimizations aren't in the same league as those of people who do it seriously, so I'll stick to what I'm good at, i.e. functions that implement the S-boxes.

Feel free to incorporate this code into your password crackers or whatever, but remember to give credit where it's due. To see examples of this code in use, have a look at the distributed.net DES client, and the password cracker John the Ripper.

Post-postscript: (2 July 2011) Thirteen years after generating the original S-boxes, they have finally been improved upon, by an impressive 17 percent. Details here.

Post-postscript: (2 May 2016) Recent implementations using Nvidia's LOP3.LUT 3-input lookup table have achieved gate counts as low as 23.5, as demonstrated by DeepLearningJohnDoe. One example of its use is in Meriken's Tripcode Engine.


Document written by Matthew Kwan, 12 May 1998
Last modified by Matthew Kwan, 2 May 2016
Please send any comments or corrections to mkwan@darkside.com.au