Last semester, Jordi Riemens and I built an implementation of the Mysterion block cipher for the ARM Cortex-M4 microarchitecture. This block cipher has a 128-bit state and claims security similar to that of the AES block cipher. Its design is called an “XLS-design”, which is essentially a variant of the traditional substitution-permutation network (SPN).

The original paper gives a rough outline of how the block cipher works, and the original authors have kindly provided us with a reference implementation of the cipher for AVR. This reference implementation is, however, fully lookup-based, and lookup tables are hard to protect against side-channel attacks, so we cannot use it in practical applications.


The state

The block cipher state consists of one 128-bit block, which is separated into 4 sub-blocks of 32 bits each. Each sub-block is in turn divided into 4 rows of 1 byte. The following table illustrates this. I have annotated where the input bytes end up in the state (e.g. input byte 5 ends up at the position marked 55555555).

00000000    44444444    88888888    ........
11111111    55555555    99999999    ........
22222222    66666666    ........    ........
33333333    77777777    ........    ........
^~~~~~~~ one 32-bit `sub-block`
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ one 128-bit `block`

The cipher does 12 rounds, in each of which it performs 4 operations: the S-box, the L-box, ShiftColumns, and key/constant addition.


The S-box

The first operation during a round is the S-box. In these kinds of ciphers (SPNs), the S-box makes sure that the cipher is nonlinear. In the Mysterion cipher, the S-box is defined by a simple function of 8 bitwise operations:

def sbox(block):
    """Returns a new block after applying the Mysterion S-box"""
    a = (block[0] & block[1]) ^ block[2]
    c = (block[1] | block[2]) ^ block[3]
    d = (a & block[3]) ^ block[0]
    b = (c & block[0]) ^ block[1]
    return [a,b,c,d]
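Since the S-box acts on the four rows bit by bit, we can sanity-check it column by column. The snippet below (repeating the definition above so it runs standalone, and assuming row 0 holds the most significant bit of a column; the check does not depend on that choice) confirms that the S-box is a permutation of the 16 possible 4-bit columns:

```python
def sbox(block):
    """Returns a new block after applying the Mysterion S-box"""
    a = (block[0] & block[1]) ^ block[2]
    c = (block[1] | block[2]) ^ block[3]
    d = (a & block[3]) ^ block[0]
    b = (c & block[0]) ^ block[1]
    return [a, b, c, d]

# Evaluate the S-box on every possible 4-bit column (one bit per "row").
table = []
for v in range(16):
    bits = [(v >> (3 - i)) & 1 for i in range(4)]  # row 0 = top bit
    out = sbox(bits)
    table.append(out[0] << 3 | out[1] << 2 | out[2] << 1 | out[3])

assert len(set(table)) == 16  # the S-box is a bijection on 4 bits
```

The same enumeration is a handy way to cross-check the bitsliced code against the cipher specification.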


The L-box

The L-box is the next operation in the round. Its purpose is to diffuse changes through the state. In other words, this operation makes sure that if some bits in the input state are changed, many bits in the output state change. Because of the L-box, a difference of one bit will already have propagated to the complete state after just two rounds (recall that Mysterion does 12).

This is implemented by multiplying the state with an MDS matrix whose elements live in a finite field of order 16. MDS matrices are specially crafted matrices which optimise the diffusion. In the special case of Mysterion, this matrix is the eighth power of a transposed companion matrix. This reduces the number of registers that are needed during the application of the L-box, which came in handy in the implementation.
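To make this concrete, here is a sketch of how one bitsliced multiplication by a constant GF(2^4) element works as a double-and-add. The helper names are mine, and I have assumed the reduction polynomial x^4 + x + 1 and that row 0 holds the x^3 bits; the paper fixes the actual field, so treat this purely as an illustration:

```python
def times_x(s):
    # Multiply every nibble column by x, reducing with x^4 + x + 1 (assumed
    # polynomial). s[0] holds the x^3 bits of all columns, s[3] the x^0 bits.
    b3, b2, b1, b0 = s
    return [b2, b1, b0 ^ b3, b3]

def gf16_mul_const(c, state):
    # Bitsliced double-and-add: multiply the whole state by the constant c.
    acc = [0, 0, 0, 0]
    for i in range(3, -1, -1):
        acc = times_x(acc)
        if (c >> i) & 1:
            acc = [a ^ s for a, s in zip(acc, state)]
    return acc

# x * x = x^2, and (x^3 + 1) * x = x^4 + x = (x + 1) + x = 1:
assert gf16_mul_const(0b0010, [0, 0, 1, 0]) == [0, 1, 0, 0]
assert gf16_mul_const(0b1001, [0, 0, 1, 0]) == [0, 0, 0, 1]
```

In the real implementation the constants are known at assembly time, so the conditional additions unroll into a fixed sequence of XORs.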


ShiftColumns

The next operation is ShiftColumns, comparable to the permutation layers found in other SPNs (such as ShiftRows in AES). This operation scrambles the locations of bits in the state, so that differences in one sub-block propagate to all the others.

Key and constant addition 

Up until now, we have done nothing to make the ciphertext dependent on the key. So to do this, we add the key to the state. Because we want every round to be different, we also add a round-specific constant value.

The Mysterion block cipher has no key schedule. In every round, the same key is added to the state.

Implementation strategies 

This section goes into the nitty-gritty details of the implementation of the block cipher. If you really want to understand what we are doing and you haven’t read the paper yet, you may want to read it now.

We have implemented the cipher in ARM assembly, but for our own overview we have also implemented it in Python. In this part I’ll show the assembly where possible. However—because of the optimisations—the assembly is not always very readable, so I will sometimes fall back to the Python version. You can always find the assembly code in mysterion.s.

When you look closely at the cipher’s state, you see 4 rows of 32 bits each. For us—working on a 32-bit platform—it was obvious to exploit this and bitslice the rows over four registers. The first step here is to get the right bits from the input (which is grouped by sub-block) into the right registers (which are grouped by row). For this we have written the byteslice subroutine. In Python, this function looks like this:

def byteslice(message):
    """Gather the rows of the state into 4 row registers"""
    registers = [0] * 4
    for i in range(4):      # i indexes the row
        for j in range(4):  # j indexes the sub-block
            registers[i] |= message[4*j + i] << ((3 - j) * 8)
    return registers

We execute this function for both the message and the key. You might expect that, because we are working in assembly, we would have to do a masking operation in each loop iteration. But we only have to byteslice and unbyteslice once per block encryption, so we can combine the byteslicing with reading from and writing to memory. This lets us use the ldrb and strb instructions: instead of working on 32-bit words, these instructions operate on one byte each, allowing us to skip the masking operation.
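The unbyteslice operation is not shown above, but it simply inverts the loop. Here is a Python sketch of my own, mirroring the byteslice function (which is repeated so the snippet runs standalone):

```python
def byteslice(message):
    """Gather the rows of the state into 4 row registers"""
    registers = [0] * 4
    for i in range(4):
        for j in range(4):
            registers[i] |= message[4*j + i] << ((3 - j) * 8)
    return registers

def unbyteslice(registers):
    """Inverse of byteslice: put every byte back in its place in memory"""
    message = [0] * 16
    for i in range(4):
        for j in range(4):
            message[4*j + i] = (registers[i] >> ((3 - j) * 8)) & 0xff
    return message

# Round-tripping a 16-byte message returns it unchanged:
assert unbyteslice(byteslice(list(range(16)))) == list(range(16))
```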


The S-box

Remember that the S-box was fully defined by bitwise operations. Now that we have our state in a nice bitsliced representation, we can exploit this and calculate the S-box on all 32 bit positions at the same time, which makes the computation very efficient. The resulting code looks like this:

.macro sbox
    /* r2..r5: state
     * r0..r1: temporary registers */
    and r0, r2, r3
    eor r0, r0, r4
    orr r4, r4, r3
    eor r4, r4, r5
    and r5, r5, r0
    eor r5, r5, r2
    and r1, r4, r2
    eor r3, r3, r1
    mov r2, r0
.endm


The L-box

Of course, calculating the S-box was the easy part. The L-box is a whole lot more complex. Our first intuition was that it would probably be easier, and thus more efficient, to compute the L-box by exploiting the fact that the MDS matrix is the eighth power of a (transposed) companion matrix. This means that we could replace one matrix multiplication by eight vector multiplications. We expected this to be at least as fast as the matrix multiplication method, while needing fewer registers. So we built this function in Python, and then in assembly:

def lbox(state):
    # state is a list of 4 32-bit numbers (in bitsliced form!)
    # poly() and _gf16_mul2() (a bitsliced GF(2^4) multiplication) are
    # helpers defined elsewhere in our code
    for clock in range(8):
        x = _gf16_mul2(poly(clock), state)
        for reg in range(4): # reg for register
            acc = 0
            for i in range(8-clock):
                acc ^= x[reg] << i
            for i in range(1, clock+1):
                acc ^= x[reg] >> i
            state[reg] ^= acc & (0x80808080 >> clock)
    return state

We were fairly happy with this implementation. It uses register rotation, an optimisation in which we slide through the values instead of having to shift them on every loop iteration. To still add to the correct polynomial in the state, we accumulate the new values inside the registers and (while adding them) shift them to exactly the right spot in the accumulator, so that the value in the accumulator stays aligned with the vector term that is being calculated. Note that we now only have to do 8 bitsliced multiplications in GF(2^4).

There is a downside to this: we can still only calculate one value of the resulting state at a time. After implementing this in assembly, we found that the accumulation step took a lot of cycles (see for yourself).

Later we argued that this could obviously be done more efficiently. So we also implemented the L-box by “just” doing the matrix multiplication. All matrix elements could be precomputed anyway, so chances were that this implementation would be a lot faster.

def lbox(state):
    mat = [
        [0b0001, 0b1000, 0b0011, 0b1111, 0b0101, 0b1111, 0b0011, 0b1000],
        [0b1000, 0b1101, 0b0011, 0b0010, 0b0001, 0b0100, 0b0100, 0b1111],
        [0b1111, 0b1001, 0b1111, 0b1001, 0b0100, 0b1011, 0b0110, 0b0101],
        [0b0101, 0b0001, 0b0110, 0b1001, 0b1011, 0b0010, 0b0100, 0b1000],
        [0b1000, 0b1001, 0b1010, 0b0111, 0b0111, 0b1010, 0b1001, 0b1000],
        [0b1000, 0b0100, 0b0010, 0b1011, 0b1001, 0b0110, 0b0001, 0b0101],
        [0b0101, 0b0110, 0b1011, 0b0100, 0b1001, 0b1111, 0b1001, 0b1111],
        [0b1111, 0b0100, 0b0100, 0b0001, 0b0010, 0b0011, 0b1101, 0b1000]
    ]
    for i in range(8):
        # Rotate for rotated registers
        mat[i] = mat[i][i:] + mat[i][:i]
    mat = list(zip(*mat)) # Transpose the matrix
    mat = [bitslice_32x4(row * 4) for row in mat] # Bitslice all matrix columns
    # ^ In the assembly code, these values are all precomputed, so there is no
    #   computational overhead. :)

    acc = [0] * 4
    for i in range(8):
        for j, x in enumerate(_gf16_mul2(mat[i][:], state[:])):
            acc[j] ^= x
        # rotate state left inside bytes by 1
        state = [((x << 1) & 0xfefefefe) | (x >> 7) & 0x01010101 for x in state]
    return acc

This immediately turns out to be a much cleaner piece of code than the previous one. After measuring, the new method also turns out to be much faster. (“Yay!”) Our assembly code looked much cleaner, too. One round of the L-box now looked like this:

    @ All rounds look the same, except for the polynomials
    gf16_mul_lit r1,r10,r11,r12, 0xbbbbbbbb, 0x03030303, 0x5b5b5b5b, 0x77777777
    eor r6, r6, r1
    eor r7, r7, r10
    eor r8, r8, r11
    eor r9, r9, r12
    lbox_rotate_left /* Rotate the state towards the left inside byte
                        (macro of 14 cycles) */

It turned out that we actually needed those extra registers (exactly 4 of them). Because of this, we spill the values in r6-r9 (the bitsliced key) to the stack at the beginning of the L-box, and pop them from the stack at the end. This adds about 14 cycles to each L-box computation, but that is nothing compared to the speedup that we achieved with the new method.


ShiftColumns

ShiftColumns was easy. For each register, we do 4 masking and operations, and then merge (bitwise OR) the rotated results:

.macro shiftcolumns_inner reg
    and r0, \reg, #0xc0c0c0c0
    and r1, \reg, #0x30303030
    orr r0, r0, r1, ror #8
    and r1, \reg, #0x0c0c0c0c
    orr r0, r0, r1, ror #16
    and r1, \reg, #0x03030303
    orr \reg, r0, r1, ror #24
.endm
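For reference, this macro translates almost one-to-one to Python. The ror helper is hypothetical (it mimics the ARM rotate-right operand), and the column numbering in the comments assumes column 0 is the most significant bit:

```python
def ror(x, n):
    """32-bit rotate right, mimicking the ARM `ror` operand"""
    return ((x >> n) | (x << (32 - n))) & 0xffffffff

def shiftcolumns_inner(reg):
    # Columns 0-1 stay put; columns 2-3, 4-5 and 6-7 move one, two and
    # three sub-blocks (bytes) to the right, respectively.
    r = reg & 0xc0c0c0c0
    r |= ror(reg & 0x30303030, 8)
    r |= ror(reg & 0x0c0c0c0c, 16)
    r |= ror(reg & 0x03030303, 24)
    return r
```

Because every 2-bit column group moves by a whole number of sub-blocks, applying the operation four times yields the identity, which makes for an easy sanity check.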

Key and constant addition 

This is just trivial. :wink:

.macro add_key
    /* r2..r5: state
     * r6..r9: key */
    eor r2, r2, r6
    eor r3, r3, r7
    eor r4, r4, r8
    eor r5, r5, r9
.endm

Stitching it all together 

After having made all the building blocks, the only task left was stitching them together, which—apart from some register allocation—was not too much of a hassle:

.macro mysterion_round round_number
    /* Do one round of the Mysterion block cipher
     * r2..r5: state
     * r0, r1, r6..r12: temporary registers */
    add_const \round_number
    /* ...followed by the remaining round operations; see mysterion.s */
.endm

    /* Do Mysterion block encryption on msg pointed to by r0, with the values
     * pointed to by r1 as key. */

    /* Spill the pointer to the output buffer to the stack, because we need a
    register for temporary values. Other temporary registers are pushed
    according to the C calling convention. */
    push {r4-r12}
    push {r0}


    /* From this point [r2..r9] are in use */
    mysterion_round 1
    mysterion_round 2
    mysterion_round 3
    mysterion_round 4
    mysterion_round 5
    mysterion_round 6
    mysterion_round 7
    mysterion_round 8
    mysterion_round 9
    mysterion_round 10
    mysterion_round 11
    mysterion_round 12

    /* Put the ciphertext back in the input buffer */
    pop {r0}

    pop {r4-r12}
    bx lr


Decryption

Every encryption algorithm of course also needs a decryption algorithm. Luckily, the algorithm is very symmetric in all aspects; so much so that there is not a lot to write about. The inverse S-box is just another sequence of bitwise operations. The L-box operation is exactly the same, with the slight difference that we multiply the state by the inverse matrix. The ShiftColumns operation is easily inverted. The key and constant additions are XOR operations, so they are exactly the same as in the encryption algorithm; we only have to remember to add the round constants in reverse order.
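To illustrate the first point: the inverse S-box exists because the forward S-box is a permutation. The snippet below (repeating the S-box from earlier; the bit ordering is my assumption and does not affect the check) tabulates the forward map and inverts it. This table is only for illustration; the actual implementation uses a bitwise instruction sequence instead:

```python
def sbox(block):
    a = (block[0] & block[1]) ^ block[2]
    c = (block[1] | block[2]) ^ block[3]
    d = (a & block[3]) ^ block[0]
    b = (c & block[0]) ^ block[1]
    return [a, b, c, d]

# Tabulate the forward S-box on all 4-bit values (row 0 = most significant bit).
fwd = []
for v in range(16):
    bits = [(v >> (3 - i)) & 1 for i in range(4)]
    o = sbox(bits)
    fwd.append(o[0] << 3 | o[1] << 2 | o[2] << 1 | o[3])

# Invert the permutation; decryption needs exactly this mapping.
inv = [fwd.index(v) for v in range(16)]
assert all(inv[fwd[v]] == v for v in range(16))
```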


Results

The result is an encryption (and decryption) algorithm that encrypts one 128-bit block in 5201 cycles. To our knowledge there exist no similar implementations of the Mysterion block cipher, so I think we are safe to claim that we have the fastest (and only) implementation of Mysterion for the ARM architecture.

Without any other implementations, we still don’t know whether our implementation is reasonably fast, or just a piece of garbage. So we looked up the performance of AES (which is also an SPN) on the same platform. The current best implementation of AES takes 2651 cycles. I should mention that this implementation bitslices over multiple blocks, which gains the implementers a large advantage that we do not have. Having built an implementation whose performance is in the same league as a highly optimised academic implementation of AES, we are pretty satisfied with the result.


Conclusion

In the end, we like the design of the Mysterion block cipher. We would however argue that the paper and the reference implementation are a bit vague and hard to grasp. Compared to other papers, this paper could have been a lot clearer in our opinion. It almost feels like the cipher was not proposed to be actually implemented and used, as it was sometimes ambiguous and left out certain details that are crucial for implementers (like the composition of the state when applying the S-box, or the definition of the round constants).

Furthermore, the cipher’s design chooses an L-box that works on 32 bits at the same time. I myself am not a cryptanalyst, so I do not know how much this extra diffusion adds to the security of the block cipher. One thing that I can say is that it is the bottleneck of the cipher in terms of performance: because the L-box has a complexity of O(n^2) in its width, it has the most impact on the speed of the computation. When thinking only about performance, it would be preferable to have an S-box with a larger width, and a diffusion layer with a smaller width. (Here again, AES seems to have figured it out.)

We hope our code may be a good guideline for future (aspiring) crypto-engineers: not only for implementing this cipher on other platforms, but also for making custom implementations of one of the many other ciphers and hash functions out there. As for the Mysterion block cipher, we think that the techniques we used can easily be extended to other platforms. On architectures that support vector operations, a larger bitsliced approach is possible. On smaller platforms like AVR and other microcontrollers, we expect that bitslicing over 8 bits—the width of one sub-block—will result in fast code as well.


Acknowledgements

We built this implementation for the Cryptographic Engineering course at Radboud University. Joost Rijneveld guided us during the project, and figured out how to derive an instruction sequence for the inverse S-box using a SAT solver.

And of course we thank Anthony Journault for having published the block cipher, providing us with a reference implementation and kindly answering our emails when we bugged him with questions. :smile: