An Algorithm For MD5 Single-Block Collision Attack Using High-Performance Computing Cluster
Anton A. Kuznetsov
Program Systems Institute of Russian Academy of Sciences
[email protected]
October 22, 2014
Abstract. A parallel algorithm for performing a single-block collision attack on MD5 and its
implementation are described. The algorithm is implemented as an MPI program based on the source
code of Dr Marc Stevens' sequential collision search program. In this paper we present the parallel
single-block MD5 collision search algorithm and the details of its implementation. We also disclose
a new pair of single-block messages colliding under MD5 that was found using our algorithm on a
high-performance computing cluster.
1. Introduction
Hash functions are one-way functions that map arbitrary input messages to fixed-length hash
values. A hash can be regarded as a signature of the original message and can be used to check
message integrity and authenticity after the message has been delivered over a network.
Hash functions are designed to be fast to compute but difficult to invert (that is, to calculate
f^(-1)(hash)). MD5 is one of the most widely used hash functions. It was designed by R. Rivest and
published as RFC 1321 in 1992 [1].
A pair of distinct messages (M, M') is called a collision if the hashes of both messages are
equal. In 2004 Wang et al. [2] disclosed a differential method for finding MD5 collisions and
presented a collision between input messages of 1024 bits each (a two-block collision). In 2010 Xie
and Feng presented single-block colliding messages [3] but, for security reasons, did not disclose
any details of their collision search method. They challenged the cryptology community to
construct a different MD5 single-block collision. In 2012 Dr Marc Stevens answered that
challenge [4] by presenting a single-block collision attack on MD5 and an example colliding
message pair.
In this paper we describe a parallel algorithm for finding single-block MD5 collisions and
its MPI implementation, which is based on Dr Marc Stevens' sequential method. We also present a
new single-block colliding message pair that was found using our algorithm on a high-performance
computing cluster in 11 hours.
To our knowledge, only one parallel collision search method has been published so far. Citation
from [5]: "it is a simple technique of parallelizing methods for solving search problems which seek
collisions in pseudo-random walks. According to that method, to perform a parallel collision
search, each processor proceeds as follows. Select a starting point x_0 ∈ S and produce the trail
of points x_i = f(x_(i-1)), for i = 1, 2,... until a distinguished point x_d is reached based on
some easily testable distinguishing property such as a fixed number of leading zero bits. Add the
distinguished point to a single common list for all processors and start producing a new trail
from a new starting point. Depending on the application of collision search, other information
must be stored with the distinguished point (e.g., one must store x_0 and d in order to quickly
locate the points a and b such that f(a) = f(b)). A collision is detected when the same
distinguished point appears twice in the central list."
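The quoted procedure can be illustrated with a minimal single-threaded sketch. It is not code
from [5]: the map toy_f, the modulus, the distinguished-point rule (value divisible by 64) and the
trail length cap are illustrative assumptions, and the "processors" are simulated by successive
starting points.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>

using u64 = std::uint64_t;

constexpr u64 kModulus  = 1000003;   // size of the toy search space (assumption)
constexpr u64 kMaxTrail = 1u << 16;  // give up on trails that never hit a distinguished point

// Toy non-injective step function standing in for the real f (assumption).
u64 toy_f(u64 x) { return (x * x + 1) % kModulus; }

// Easily testable distinguishing property (assumption: six low zero bits).
bool is_distinguished(u64 x) { return x % 64 == 0; }

struct Trail { u64 start; u64 length; };  // x_0 and d stored with the distinguished point

// Iterate from x0 until a distinguished point is reached; returns (x_d, d).
std::optional<std::pair<u64, u64>> walk(u64 x0) {
    u64 x = x0;
    for (u64 d = 1; d <= kMaxTrail; ++d) {
        x = toy_f(x);
        if (is_distinguished(x)) return std::make_pair(x, d);
    }
    return std::nullopt;
}

// Given two trails ending at the same distinguished point, find a != b with f(a) == f(b).
std::optional<std::pair<u64, u64>> locate(Trail a, Trail b) {
    u64 x = a.start, y = b.start;
    while (a.length > b.length) { x = toy_f(x); --a.length; }  // align remaining lengths
    while (b.length > a.length) { y = toy_f(y); --b.length; }
    if (x == y) return std::nullopt;  // one start lies on the other trail: no real collision
    for (u64 i = 0; i < a.length; ++i) {
        const u64 fx = toy_f(x), fy = toy_f(y);
        if (fx == fy) return std::make_pair(x, y);
        x = fx;
        y = fy;
    }
    return std::nullopt;
}

// Central list of distinguished points; each new starting point plays one processor's turn.
std::optional<std::pair<u64, u64>> find_collision() {
    std::unordered_map<u64, Trail> seen;
    for (u64 x0 = 1; x0 < kModulus; ++x0) {
        const auto end = walk(x0);
        if (!end) continue;
        const auto [point, length] = *end;
        const auto it = seen.find(point);
        if (it == seen.end()) {
            seen.emplace(point, Trail{x0, length});
        } else if (auto hit = locate(it->second, Trail{x0, length})) {
            return hit;
        }
    }
    return std::nullopt;
}
```

In a real parallel run each processor would execute walk() independently and only the
distinguished points would be sent to the common list, which keeps communication low.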
We found no publicly available research papers on parallel methods of collision search
that use Wang et al.'s differential method.
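#include <stdio.h>
#include <mpi.h>

// Waits until all ranks in MPI_COMM_WORLD reach this point; reports a failure on stdout.
void mpi_barrier() {
    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        printf("Error in MPI_Barrier\n");
        fflush(stdout);
    }
}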
Listing 1: mpi_barrier() wrapper subroutine
Do not declare vector and numeric variables in every iteration of the inner loop; declare
them once, before the outermost 'do..while' loop. The innermost loop is executed several
billion times in the worst case, so repeated variable declarations consume a noticeable
amount of CPU time.
Using a free code profiler, we found that during a program run most of the CPU time is
consumed by calls to four routines:
rotate_right()
rotate_left()
md5_ff()
md5_gg()
The first two were optimized using Intel compiler intrinsics:
rotate_right() was rewritten using _rotr()
rotate_left() was rewritten using _rotl()
The md5_ff() routine was optimized by rewriting it from:
D ^ (B & (C ^ D))
to:
(B & C) | (~B & D)
The md5_gg() routine was optimized by rewriting it from:
C ^ (D & (B ^ C))
to:
(D & B) | (~D & C)
The md5compress() C++ function was rewritten in assembly language, which yields a speed-up of
about 20%.
The source code was refactored by running a small Tcl script on it. All substrings in the source
code that match the "offset+%i" mask were replaced by the actual sum of the 'offset'
constant (which equals 3) and the integer %i. This was done solely to improve code
readability and to make it easier to examine data dependencies between program subroutines.
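The routine rewrites listed above do not change results, which can be checked mechanically. The
following is a minimal sketch, not the program's actual source: the function names are ours, and
portable shift-based rotations stand in for the _rotl()/_rotr() intrinsics. Because the
operations are bitwise, testing all-zero and all-one words covers every input bit combination.

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;

// Portable 32-bit rotations, valid for 0 < n < 32.
constexpr u32 rotate_left(u32 x, unsigned n)  { return (x << n) | (x >> (32 - n)); }
constexpr u32 rotate_right(u32 x, unsigned n) { return (x >> n) | (x << (32 - n)); }

// Both forms of the md5_ff() expression.
constexpr u32 ff_xor(u32 b, u32 c, u32 d) { return d ^ (b & (c ^ d)); }
constexpr u32 ff_mux(u32 b, u32 c, u32 d) { return (b & c) | (~b & d); }

// Both forms of the md5_gg() expression.
constexpr u32 gg_xor(u32 b, u32 c, u32 d) { return c ^ (d & (b ^ c)); }
constexpr u32 gg_mux(u32 b, u32 c, u32 d) { return (d & b) | (~d & c); }

// Exhaustive check over the 8 combinations of all-zero / all-one input words.
constexpr bool forms_agree() {
    const u32 values[2] = {0u, ~0u};
    for (u32 b : values)
        for (u32 c : values)
            for (u32 d : values) {
                if (ff_xor(b, c, d) != ff_mux(b, c, d)) return false;
                if (gg_xor(b, c, d) != gg_mux(b, c, d)) return false;
            }
    return true;
}

static_assert(forms_agree(), "both forms must compute the same Boolean function");
static_assert(rotate_right(rotate_left(0x80000001u, 7), 7) == 0x80000001u,
              "rotations must be inverse to each other");
```

Which form runs faster depends on the compiler and the target instruction set, so the choice
should be confirmed with a profiler on the actual hardware.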
initialization stage.
The parallel algorithm is highly scalable because all iterations of the inner loop are split
equally among ranks. The total number of iterations is very high: in the worst case, even a
10-petaflop/s HPC cluster could take weeks to find a collision.
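One common way to split a loop equally among ranks is a contiguous block partition. The sketch
below is an assumption about how such a split can be done, not the program's actual scheme; the
names Range and rank_range are ours. In an MPI program, rank and size would come from
MPI_Comm_rank() and MPI_Comm_size().

```cpp
#include <cassert>
#include <cstdint>

// Half-open iteration range [begin, end) assigned to one rank.
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

// Contiguous share of `total` iterations for `rank` out of `size` ranks.
// The first (total % size) ranks receive one extra iteration, so shares differ by at most 1.
constexpr Range rank_range(std::uint64_t total, int rank, int size) {
    const std::uint64_t n = static_cast<std::uint64_t>(size);
    const std::uint64_t r = static_cast<std::uint64_t>(rank);
    const std::uint64_t base = total / n;
    const std::uint64_t extra = total % n;
    const std::uint64_t begin = r * base + (r < extra ? r : extra);
    return Range{begin, begin + base + (r < extra ? 1u : 0u)};
}
```

For example, 10 iterations on 3 ranks are split as [0, 4), [4, 7) and [7, 10). Each rank then
runs its own range without communicating with the others until a collision is found.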
We did not use any of the accelerator devices, such as Intel Xeon Phi, that are present on the
cluster, although doing so is feasible for our implementation.
M
5D 11 69 3E 1E 33 4B 2C B3 88 EF AA F0 D0 EC F3
91 2D 73 0A 1C DD 7A AC 6E 3C E0 E4 CE 06 7B B1
8E 73 C7 BA A2 6A A8 19 66 C2 86 16 B3 4F 3D 07
AA B7 C8 1E 32 94 89 64 7C 11 73 4A 3F AF 03 EA
M'
5D 11 69 3E 1E 33 4B 2C B3 88 EF AA F0 D0 EC F3
91 2D 73 0A 1C DD 7A AC 6E 3C E0 E4 CE 06 7B B1
8E 73 C7 BC A2 6A A8 19 66 C2 86 16 B3 4F 3D 07
AA B7 C8 1E 32 94 89 E4 7C 11 73 4A 3F AF 03 EA
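The colliding pair can be checked with any MD5 implementation. The sketch below is a compact,
unoptimized MD5 written from RFC 1321 [1] for illustration only; it is not the md5compress()
routine of the search program, and all names in it are ours. It assumes the bytes of M are taken
in message order (byte 0 first), with M' differing from M in bytes 35 and 55, counting from 0.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;
using Bytes = std::vector<std::uint8_t>;
using Digest = std::array<u32, 4>;  // state words A, B, C, D after the last block

inline u32 rotl(u32 x, unsigned n) { return (x << n) | (x >> (32 - n)); }

Digest md5(const Bytes& msg) {
    static const unsigned S[64] = {
        7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
        5,  9, 14, 20, 5,  9, 14, 20, 5,  9, 14, 20, 5,  9, 14, 20,
        4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
        6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21};
    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as 8 little-endian bytes.
    Bytes data(msg);
    const std::uint64_t bits = static_cast<std::uint64_t>(msg.size()) * 8;
    data.push_back(0x80);
    while (data.size() % 64 != 56) data.push_back(0);
    for (int i = 0; i < 8; ++i) data.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));

    Digest h = {0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 0x10325476u};
    for (std::size_t off = 0; off < data.size(); off += 64) {
        u32 m[16];
        for (int i = 0; i < 16; ++i)  // little-endian message words
            m[i] = u32(data[off + 4 * i]) | u32(data[off + 4 * i + 1]) << 8 |
                   u32(data[off + 4 * i + 2]) << 16 | u32(data[off + 4 * i + 3]) << 24;
        u32 a = h[0], b = h[1], c = h[2], d = h[3];
        for (int i = 0; i < 64; ++i) {
            u32 f;
            int g;
            if (i < 16)      { f = (b & c) | (~b & d); g = i; }
            else if (i < 32) { f = (d & b) | (~d & c); g = (5 * i + 1) % 16; }
            else if (i < 48) { f = b ^ c ^ d;          g = (3 * i + 5) % 16; }
            else             { f = c ^ (b | ~d);       g = (7 * i) % 16; }
            // Additive constant floor(|sin(i + 1)| * 2^32), as defined in RFC 1321.
            const u32 k = static_cast<u32>(std::floor(std::fabs(std::sin(i + 1.0)) * 4294967296.0));
            const u32 old_d = d;
            d = c;
            c = b;
            b = b + rotl(a + f + k + m[g], S[i]);
            a = old_d;
        }
        h[0] += a; h[1] += b; h[2] += c; h[3] += d;
    }
    return h;
}

Bytes message_M() {
    return Bytes{
        0x5D, 0x11, 0x69, 0x3E, 0x1E, 0x33, 0x4B, 0x2C, 0xB3, 0x88, 0xEF, 0xAA, 0xF0, 0xD0, 0xEC, 0xF3,
        0x91, 0x2D, 0x73, 0x0A, 0x1C, 0xDD, 0x7A, 0xAC, 0x6E, 0x3C, 0xE0, 0xE4, 0xCE, 0x06, 0x7B, 0xB1,
        0x8E, 0x73, 0xC7, 0xBA, 0xA2, 0x6A, 0xA8, 0x19, 0x66, 0xC2, 0x86, 0x16, 0xB3, 0x4F, 0x3D, 0x07,
        0xAA, 0xB7, 0xC8, 0x1E, 0x32, 0x94, 0x89, 0x64, 0x7C, 0x11, 0x73, 0x4A, 0x3F, 0xAF, 0x03, 0xEA};
}

Bytes message_M_prime() {
    Bytes m = message_M();
    m[35] = 0xBC;  // differs from M by 2 in this byte
    m[55] = 0xE4;  // differs from M in the top bit of this byte
    return m;
}

// Sanity check of the sketch itself: MD5 of the empty string is d41d8cd98f00b204e9800998ecf8427e.
bool md5_self_test() {
    return md5(Bytes{}) == Digest{0xD98C1DD4u, 0x04B2008Fu, 0x980980E9u, 0x7E42F8ECu};
}

// True when the two distinct 64-byte messages have the same MD5 hash.
bool pair_collides() {
    return message_M() != message_M_prime() && md5(message_M()) == md5(message_M_prime());
}
```

Calling md5_self_test() first guards against a transcription error in the sketch; pair_collides()
then compares the hashes of the two messages.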
Q-3=0x67452301
Q-2=0x10325476
Q-1=0x98BADCFE
Q0 =0xEFCDAB89
Q1 =0xD9A89593
Q2 =0xDA361481
Q3 =0x0660DFEA
Q4 =0x04812801
Q5 =0xEB78D1DC
Q6 =0x77D76EFF
Q7 =0xBE675C82
Q8 =0x29F20526
Q9 =0x3E1893ED
Q10=0x00000040
Q11=0xFFFFFDFE
Q12=0xB62EA109
Q13=0x062DA1C8
Q14=0x1661D7EA
Q15=0x00050621
Q16=0x14810A21
Q17=0xA8009748
Q18=0xADABC8E8
Q19=0x410F3F70
Q20=0x71936434
Q21=0xF7D2E265
Q22=0x09D6ECD5
Q23=0xF8B84FB6
Q24=0xBCCE16A3
Q25=0x463268A8
Q26=0x34EFF95F
Q27=0x5E7E0F7D
Q28=0xE8514E70
Q29=0xC677D867
Table 2: Q values list
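The first entries of the list can be reproduced from the message. The sketch below assumes the
usual convention in which Q-3..Q0 hold the MD5 initial state and each round-1 step computes
Q(t+1) = Q(t) + RL(F(Q(t), Q(t-1), Q(t-2)) + Q(t-3) + m(t) + AC(t), s(t)); the function names are
ours. The words m0 = 0x3E69115D and m1 = 0x2C4B331E are the first eight bytes of M read as
little-endian words, and the additive constants and rotation amounts are those of RFC 1321 [1].

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;

constexpr u32 rotl(u32 x, unsigned n) { return (x << n) | (x >> (32 - n)); }

// Round-1 Boolean function of MD5.
constexpr u32 ff(u32 b, u32 c, u32 d) { return (b & c) | (~b & d); }

// One round-1 step: returns Q(t+1) from Q(t), Q(t-1), Q(t-2), Q(t-3).
constexpr u32 step(u32 qt, u32 qt1, u32 qt2, u32 qt3, u32 m, u32 ac, unsigned s) {
    return qt + rotl(ff(qt, qt1, qt2) + qt3 + m + ac, s);
}

// Initial state values as listed for Q0, Q-1, Q-2, Q-3.
constexpr u32 q0 = 0xEFCDAB89u, qm1 = 0x98BADCFEu, qm2 = 0x10325476u, qm3 = 0x67452301u;

constexpr u32 q1 = step(q0, qm1, qm2, qm3, 0x3E69115Du, 0xD76AA478u, 7);
constexpr u32 q2 = step(q1, q0, qm1, qm2, 0x2C4B331Eu, 0xE8C7B756u, 12);

static_assert(q1 == 0xD9A89593u, "Q1 must match the listed value");
static_assert(q2 == 0xDA361481u, "Q2 must match the listed value");
```

Both computed values agree with the Q1 and Q2 entries listed above.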
7. Conclusion
We presented a parallel collision search algorithm derived from Dr Marc Stevens' original
method. It is implemented as an MPI program and was successfully used to find a pair of
messages colliding under MD5.
Dr Marc Stevens' algorithm has a runtime cost of about 2^50 md5compress() calls. We believe that
the single-block collision search algorithm can be substantially improved so that it requires much
less computational power. This is a subject for further research.
The collision search program can be adapted to run on other massively parallel devices, such as
multi-core CPUs, Nvidia CUDA devices and Intel Xeon Phi accelerators. This could greatly speed up
collision search on a workstation and/or a computational cluster.
Acknowledgements
This work is supported by the Russian Academy of Sciences through the project
No.01201354596.
We express gratitude to Dr Marc Stevens for permission to modify the source code of his
single-block collision search program [8].
References
1. Ronald L. Rivest, The MD5 Message-Digest Algorithm, Internet Request for Comments, April
1992, RFC 1321
2. Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu, Collisions for hash functions MD4,
MD5, HAVAL-128 and RIPEMD, Cryptology ePrint Archive, Report 2004/199, 2004
3. Tao Xie and Dengguo Feng, Construct MD5 Collisions Using Just A Single Block Of Message,
Cryptology ePrint Archive, Report 2010/643, 2010
4. Marc Stevens, Single-block collision attack on MD5, Cryptology ePrint Archive, Report
2012/040, 2012
5. Paul C. Van Oorschot, Michael J. Wiener, Parallel collision search with cryptanalytic
applications, Journal of Cryptology, 1999, vol.12, pp. 1-28
6. http://supercomputer.susu.ac.ru/computers/tornado/
7. http://www.botik.ru/~botik/rnd/message1ak , http://www.botik.ru/~botik/rnd/message2ak
8. http://marc-stevens.nl/research/md5-1block-collision/