Are Your Passwords Safe: Energy-Efficient Bcrypt Cracking With Low-Cost Parallel Hardware
1 Introduction

Password hashing was introduced in the 1970s to avoid storing passwords in plaintext. Instead, passwords are hashed and only the hashes are stored, which prevents attackers from directly obtaining the actual passwords. However, in many cases passwords may nevertheless be inferred by probing likely (and even not so likely) candidate passwords against the hashes. To mitigate these attacks, specialized password hashing schemes were designed, and bcrypt is one of them [13, 4]. It was designed to remain secure despite hardware improvements and to be resistant to brute-force attacks. The original goal is mostly achieved (so far) as it relates to bcrypt hash cracking [...]

2 Background

Bcrypt is a password hashing scheme based on the Blowfish block cipher, which is structured as a 16-round Feistel network [13]. Blowfish encryption uses a 64-bit input and a P-box (in bcrypt, it initially holds the password being hashed) to calculate addresses used to access four 1 KB S-boxes. These memory accesses are pseudorandom and 32 bits wide. Blowfish encryption is used by the EksBlowfish algorithm (Algorithm 1) in the function ExpandKey() to derive a state determined by the values stored in the S-boxes and the P-box. This algorithm has three inputs: cost, salt and encryption key (the password being hashed). Cost determines how expensive the key setup process is, salt is a 128-bit random value used so that the same password does not always have the same hash value, and the encryption key is the user-chosen password (after trivial pre-processing) [13]. Although this is how EksBlowfish is defined in [13], the order of lines 4 and 5 in Algorithm 1 is swapped in actual implementations, including in OpenBSD's original that pre-dates the USENIX 1999 paper by two years.

Algorithm 1 EksBlowfishSetup(cost, salt, key) [13]
1: state ← InitState()
2: state ← ExpandKey(state, salt, key)
3: repeat (2^cost)
4:     state ← ExpandKey(state, 0, salt)
5:     state ← ExpandKey(state, 0, key)
6: return state

Bcrypt (Algorithm 2) runs in two phases. The first phase uses the expensive key schedule Blowfish algorithm (Algorithm 1) to initialize the Blowfish state. In the second phase, the obtained state is used with Blowfish in electronic codebook (ECB) mode to encrypt the 192-bit string "OrpheanBeholderScryDoubt" 64 times. The returned value is the bcrypt hash [13].

Algorithm 2 bcrypt(cost, salt, pwd) [13]
1: state ← EksBlowfishSetup(cost, salt, key)
2: ctext ← "OrpheanBeholderScryDoubt"
3: repeat (64)
4:     ctext ← EncryptECB(state, ctext)
5: return Concatenate(cost, salt, ctext)

3 Energy-Efficient Implementations

This section gives details of the bcrypt implementation on different energy-efficient platforms: the Parallella board with a 16- or 64-core Epiphany manycore accelerator [2, 3] and ZedBoard with Zynq 7020 reconfigurable logic [16, 17]. In these implementations, most effort is spent on optimizing the most time-consuming part of bcrypt, which is the loop executed 2^cost times (Algorithm 1, lines 3 to 5).

[...] complex addressing modes and the inability of the floating-point unit to issue logic instructions. One-cycle instruction latency (including load instructions from local memory) and 32 KB of local memory overcome bcrypt's random memory access pattern, while the floating-point unit used in integer mode can further exploit some of the available instruction-level parallelism. An atypically high number of registers allows the P-box to be preloaded, which further improves performance. Performance improvement is constrained by the missing support for complex addressing modes (which wastes registers during S-box lookups) and by a floating-point unit without support for logic instructions (which prevents additional exploitation of instruction-level parallelism). However, the advantages of this architecture outweigh its limitations to a great extent, which results in fast and energy-efficient bcrypt implementations.

To exploit all advantages of the underlying hardware architecture, bcrypt was implemented on the Epiphany manycore accelerator, where each core computes bcrypt hashes while the ARM CPU is used by John the Ripper [9] to generate candidate passwords and send them to the Epiphany cores for hash computations. In order to hide the four-cycle latency of the FPU configured in integer mode and the FPU's inability to issue logic instructions, it was necessary to introduce more instruction-level parallelism. Computation of a single bcrypt hash per core could not exploit the available resources. Therefore, we overlapped two bcrypt computations on each core. Instruction-level parallelism was further exploited by partially interleaving the computations of two Blowfish encryption rounds for each instance, which sums up to four Blowfish encryption rounds being interleaved. These optimizations allow some instructions to be calculated for free, i.e. one instruction is executed on the ALU while another one is executed on the FPU. Apart from the additional instruction-level parallelism, the P-box for both instances was preloaded in 36 registers to avoid additional load instructions. The described optimizations were implemented with portions of code in assembly, namely for lines 3 to 5 of Algorithm 1. With this approach we managed to achieve 3/4 of the per-MHz per-core speed of a full integer dual-issue architecture.
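The loop in lines 3 to 5 of Algorithm 1 dominates the work, and its cost can be made concrete with a small call-order model. The following is a minimal sketch only: it performs no cryptography, ExpandKey is not implemented, and the function name eks_blowfish_setup_trace and its implementation_order flag are hypothetical names introduced for illustration. It records which ExpandKey calls would be issued, in the order printed in Algorithm 1 or in the swapped order used by actual implementations (key first, then salt).

```python
from typing import List, Tuple

# One recorded ExpandKey call: (salt argument, key argument), as labels only.
Call = Tuple[str, str]


def eks_blowfish_setup_trace(cost: int, implementation_order: bool = True) -> List[Call]:
    """Return the sequence of ExpandKey calls made by Algorithm 1.

    Hypothetical helper for illustration: it models call order and count,
    not the Blowfish state itself.
    """
    calls: List[Call] = [("salt", "key")]      # line 2
    for _ in range(1 << cost):                  # line 3: repeat 2^cost times
        if implementation_order:
            # Order used by actual implementations (lines 4 and 5 swapped).
            calls.append(("0", "key"))
            calls.append(("0", "salt"))
        else:
            # Order as printed in Algorithm 1.
            calls.append(("0", "salt"))         # line 4
            calls.append(("0", "key"))          # line 5
    return calls


if __name__ == "__main__":
    for cost in (5, 8, 10, 12):
        total = len(eks_blowfish_setup_trace(cost))
        print(f"cost {cost}: {total} ExpandKey calls")
```

The number of ExpandKey calls is 1 + 2 * 2^cost, so each increment of the cost parameter roughly doubles the key setup work.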
3.2 ZedBoard/Zynq 7020
[...] AXI4 bus. The arbiter receives data and stores it in the corresponding block RAMs. After receiving the data, the arbiter starts the bcrypt instances running in parallel and waits for them to finish their computation. When the computation is finished, the data is sent back to the ARM CPU.

Figure 1: bcrypt implementation on ZedBoard

The bcrypt implementation in reconfigurable logic is fast and uses a small portion of the available resources, which allows for a high number of bcrypt instances running in parallel. For example, in a Zynq 7020 device, if four BRAMs are used to store S-boxes for four bcrypt instances and one BRAM is used to store other data (P-box, expanded key, salt and cost) for four bcrypt instances, this gives a maximum of 112 bcrypt instances running in parallel. This memory layout fully utilizes the available BRAM resources (140 BRAMs) because all available ports of the true dual-port BRAMs are used on every clock cycle of a Blowfish encryption round: eight lookups from BRAMs holding S-boxes and two lookups from the BRAM holding P-boxes and other data for both instances.

However, this design was initially unreliable because of a combination of Zynq PS core voltage drop and insufficient decoupling from the PL main voltage supply. On ZedBoard and on the Parallella board, both of these are provided by the same 1.0 V voltage regulator output. Our ZedBoard was rebooting right away, and on Parallella (its revision with Zynq 7020) we measured (via Zynq's own ADC) a voltage drop from 960 mV (somewhat low) to 890 mV (unacceptable). To overcome this limitation, we modified ZedBoard by adding a wire going from C357 on the back of the board (near the relevant voltage regulator) to C217 near Zynq (C217 is a capacitor among those decoupling VCCPINT, the PS core voltage), thereby reducing the resistance of this specific path. We also added a 10 nF capacitor (which might or might not have mattered) and a couple of 470 µF electrolytic capacitors (one wasn't quite enough per our testing, albeit possibly in terms of ESR rather than capacitance) in parallel with C217. With these changes, the 112 bcrypt instances design could finally work long enough for us to take voltage measurements, and the lowest we could capture with a multimeter was over 970 mV on C217 (of course, this would have been more appropriately measured with a high frequency oscilloscope). The next hurdle was heat, which we solved by adding a 12 V, 0.08 A, 40x40 mm cooling fan onto the Zynq heatsink (the fan looks huge compared to the heatsink!) and powering it from one of the pins of the J21 "current sense" connector and a ground pin in one of the Pmod connectors. With these modifications in place, the 112 bcrypt instances design became stable and can be used reliably (on this specific board). Disconnecting the fan temporarily so that we could use J21 for its intended purpose, we measured (via J21) that ZedBoard's power consumption increases by around 1.5 W when we load and start to use this bitstream on bcrypt cost 12 hashes (thus deliberately achieving the maximum power consumption by keeping communication delays to a minimum). Another 1 W is consumed by the fan.

The described 112 instances design artificially splits the computation across two cycles because only two lookups from S-boxes can be done in a single clock cycle. But if the S-boxes for a single bcrypt instance are stored in two BRAMs instead of one, it is possible to fetch all four 32-bit values in a single clock cycle by performing eight S-box lookups from four BRAMs. This reduces the maximum number of instances running in parallel to 56, limited by the available BRAM. However, these two designs have the same performance because halving the number of parallel instances is compensated by computation that is twice as fast. A limitation of both designs is communication overhead, which impacts performance at lower cost settings but becomes negligible at higher cost settings. The communication overhead comes in part from transferring 56 (or 112) 4 KB sets of S-boxes filled with initial values from the ARM cores to the reconfigurable logic, and it can be avoided by storing those initial values in unused portions of the available BRAM. Since each BRAM block on Zynq can hold 4 KB of data, the BRAM blocks used to hold data other than S-boxes (whose size is 164 B) are mostly empty and can be used to store the initial values of the S-boxes. However, this design is unstable at more than 28 bcrypt instances (in our testing) because of physical limitations of ZedBoard (presumably, insufficient decoupling between PS and PL power despite our modifications so far).

To overcome these limitations, experiments were conducted on a bigger device from the Zynq family, the ZC706 board with Zynq 7045 reconfigurable logic [15]. Unfortunately, this device was not much more reliable at high instance counts than ZedBoard without modification was. However, it was possible to port the 56 instances design, which was not working on ZedBoard, to ZC706. The maximum number of concurrent instances is 216 (limited by the available BRAM), but it is not reliable. The highest instance count working reliably is 196. Apart from this, we used the Zynq 7045 device and tested our unstable ZedBoard design to obtain performance figures for [...]
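The BRAM budget behind the two Zynq 7020 designs can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the 140-BRAM total, the 4 KB S-box set and the five-BRAMs-per-four-instances layout come from the text, whereas the grouping used for the single-cycle design (two S-box BRAMs per instance plus one BRAM of other data shared by two instances) is an assumption chosen because it reproduces the stated limit of 56 instances. The function names are hypothetical.

```python
TOTAL_BRAMS_ZYNQ_7020 = 140   # stated in the text
SBOX_SET_BYTES = 4 * 1024     # four 1 KB S-boxes per bcrypt instance


def max_instances(total_brams: int, instances_per_group: int, brams_per_group: int) -> int:
    """Instances that fit when BRAMs are allocated in whole groups."""
    return (total_brams // brams_per_group) * instances_per_group


def initial_sbox_transfer_bytes(instances: int) -> int:
    """Bytes of initial S-box values sent from the ARM cores for all instances."""
    return instances * SBOX_SET_BYTES


# Layout from the text: 4 S-box BRAMs + 1 BRAM of other data per 4 instances.
two_cycle_design = max_instances(TOTAL_BRAMS_ZYNQ_7020, 4, 5)

# Assumed layout for the single-cycle design: 2 S-box BRAMs per instance and
# 1 BRAM of other data shared by 2 instances (5 BRAMs per 2 instances).
single_cycle_design = max_instances(TOTAL_BRAMS_ZYNQ_7020, 2, 5)

if __name__ == "__main__":
    for name, n in (("two-cycle", two_cycle_design), ("single-cycle", single_cycle_design)):
        kib = initial_sbox_transfer_bytes(n) // 1024
        print(f"{name} design: {n} instances, {kib} KiB of initial S-box data")
```

Under these assumptions the two designs come out at 112 and 56 instances, and the initial S-box transfer is 448 KiB or 224 KiB per batch, which is the part of the communication overhead that storing the initial values in BRAM would remove.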
[Figure: bar chart of performance (c/s) and energy efficiency (c/s/W) for Epiphany 16, Zynq 7020, Epiphany 64, Zynq 7020 (emulated with 7045), Zynq 7045, HD 7970, FX-8120, Xeon Phi 5110P and i7-4770K]
4 Experimental Results

Figure 2 gives the comparison of performance and energy efficiency of bcrypt implementations for different platforms including our energy-efficient implementations: Epiphany 16, Epiphany 64 and ZedBoard. In addition we implemented and measured the performance and energy efficiency on commodity multicore CPUs [...]
[Figure: bar chart of performance for cost 12 (c/s), theoretical performance for cost 5 (c/s) and measured performance for cost 5 (c/s) for Epiphany 16, Zynq 7020, Zynq 7045, HD 7970, FX-8120, Xeon Phi 5110P and i7-4770K]
Figure 3: Theoretical performance for cost 5 derived from performance for cost 12
Device / Cost     12          10          8           5
Epiphany 16       9.64 c/s    38.7 c/s    151.3 c/s   1207 c/s
Zynq-7020         64.83 c/s   253.1 c/s   932.6 c/s   4571 c/s
Zynq-7045         226.3 c/s   888.6 c/s   3371 c/s    20538 c/s
HD 7970           35.76 c/s   142.9 c/s   569.2 c/s   4556 c/s
FX-8120           42.93 c/s   171.2 c/s   680.2 c/s   5275 c/s
Xeon Phi 5110P    50.18 c/s   200.7 c/s   800.8 c/s   6285 c/s
i7-4770K          53.67 c/s   214.2 c/s   852.8 c/s   6615 c/s
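The derivation of theoretical cost 5 performance from measured cost 12 performance can be sketched as a ratio of Blowfish encryption counts. This is a minimal sketch that takes the two counts quoted in the text at face value (585 encryptions outside the most costly loop and 1024 per loop iteration, with 2^cost iterations); the function names are hypothetical and introduced only for illustration.

```python
# Counts quoted in the text.
ENCRYPTIONS_OUTSIDE_LOOP = 585
ENCRYPTIONS_PER_ITERATION = 1024


def blowfish_encryptions(cost: int) -> int:
    """Total Blowfish encryptions for one bcrypt hash at the given cost."""
    return ENCRYPTIONS_OUTSIDE_LOOP + ENCRYPTIONS_PER_ITERATION * (1 << cost)


def scale_rate(measured_cps: float, from_cost: int, to_cost: int) -> float:
    """Scale a measured c/s rate to another cost, assuming the time per hash
    is proportional to the number of Blowfish encryptions (no other overhead)."""
    return measured_cps * blowfish_encryptions(from_cost) / blowfish_encryptions(to_cost)


if __name__ == "__main__":
    ratio = blowfish_encryptions(12) / blowfish_encryptions(5)
    print(f"cost 12 to cost 5 ratio: {ratio:.2f}")
    for device, cps_cost12 in (("Zynq 7020", 64.5), ("Zynq 7045", 226.3)):
        print(f"{device}: {scale_rate(cps_cost12, 12, 5):.0f} c/s at cost 5 (theoretical)")
```

With these counts the cost 12 to cost 5 ratio is about 125.8, so a rate of 64.5 c/s at cost 12 corresponds to roughly 8112 c/s at cost 5. The gap between such a theoretical figure and the measured 4571 c/s for Zynq 7020 reflects overhead that this model deliberately ignores, such as communication.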
[...] most costly loop (Algorithm 1, line 2) and 64 or 192 times after it (Algorithm 2, lines 3, 4). We use the number 64 because during attacks only the first 64 bits of the hash are computed most of the time, which gives 585 Blowfish encryptions done outside the most costly loop. In a single iteration of the most costly loop, 1024 Blowfish encryptions are done. The ratio of the resulting encryption counts for cost 12 and cost 5 is multiplied by the measured performance for cost 12 to derive the theoretical performance for cost 5. Figure 3 shows the results of this derivation. Zynq 7020 on ZedBoard is not only comparable to high-end desktop CPUs and GPUs but outperforms them in terms of performance.

Another important aspect of platform comparison is platform cost. Figure 5 shows how different platforms compare to each other in cracks per second per dollar metrics. We use the various platforms' prices at introduction for our comparison. System prices include board prices for Epiphany 16¹, Epiphany 64² and ZedBoard³, and an estimated system price of $300 for the system needed to run a CPU, including motherboard, RAM and PSU. Chip and device prices are the prices of the devices themselves, not including any other components. As to PCIe cards, on one hand it is possible to install up to 8 per system, but on the other hand system prices are typically way higher than our estimate of $300 for a bare-bones CPU-only system. Due to variance in possible GPU and Xeon Phi whole system prices, we use device prices only. Our energy-efficient platforms are in the same price category as the CPUs we compare them against, and are a lot cheaper than Xeon Phi and some of the GPUs. When looking at c/s/$ performance considering system price, energy-efficient platforms outperform desktop CPUs. However, when considering chip and device prices, CPUs outperform Epiphany 16 while Zynq 7020 has the best performance. Even though CPUs perform comparably to the Epiphany manycore, energy efficiency plays a role when attacking bcrypt hashes. It is not just the cost of the hardware that is important but also the cost of the power consumption as well as cooling equipment, because attacking thousands of bcrypt hashes can take days, months, or years even with focused wordlists.

¹ Current price is $119.
² Intended price; it is not available on the market.
³ Academic price is either $299 or $319.

5 Theoretical Peak Performance Analysis

Apart from measured performance in c/s and energy efficiency in c/s/W, it is possible to compare various hardware platforms using a theoretical c/s figure derived from the platform characteristics. This figure can be calculated [...]
[Figure: bar chart of c/s/$ by system price and by chip or device price for Epiphany 16, Epiphany 64, Zynq 7020, HD 7970, FX-8120, Xeon Phi 5110P and i7-4770K]
Zynq 7020 should be comparable to high-end CPUs for both low and high cost settings. The Parallella board supports up to 64 Epiphany chips. If 64 Epiphany 64 chips are used, this sums up to 4096 cores. Based on the scalability of the implementation between E16 and E64, the theoretical performance of Parallella boards with 64 E64 chips is ~300,000 c/s. Apart from this, future work on Parallella includes using both Epiphany and Zynq 7020 at once. With four Spartan 6 FPGAs, the estimated performance for a bcrypt implementation on the ZTEX board is tens of thousands of c/s, which will outperform currently available CPUs and GPUs.

Existing energy-efficient bcrypt implementations and future work with very promising performance estimates have shown that it is possible to achieve decent performance in executing bcrypt on hardware. What is worrying is the fact that it can be achieved with low-cost hardware, which outperforms multicore CPUs and GPUs in terms of performance and energy efficiency. This shows that bcrypt will not remain secure forever and that new, more advanced and attack-resistant password hashing algorithms have to be devised.

References

[1] Adapteva. Epiphany Architecture Reference. http://adapteva.com/docs/epiphany_arch_ref.pdf, 2013.
[2] Adapteva. Parallella Computer Specifications. http://www.parallella.org/board/, 2013.
[3] Adapteva. Parallella Reference Manual. http://www.parallella.org/docs/parallella_manual.pdf, 2013.
[4] Designer, S., and Marechal, S. Password security: past, present, future. http://www.openwall.com/presentations/Passwords12-The-Future-Of-Hashing/, 2012.
[5] F. Wiemer, R. Z. Speed and Area-Optimized Password Search of bcrypt on FPGAs.
[6] Foundation, E. F. EFF DES cracker. http://en.wikipedia.org/wiki/EFF_DES_cracker, 1998.
[7] Kiran, L. K., Abhilash, J. E. N., and Kumar, P. S. FPGA Implementation of Blowfish Cryptosystem Using VHDL.
[8] Malvoni, K., and Designer, S. Energy-efficient bcrypt cracking. http://www.openwall.com/presentations/Passwords13-Energy-Efficient-Cracking/, 2013.
[9] Openwall. John the Ripper password cracker. http://www.openwall.com/john/.
[10] Openwall. Modern password hashing for your software and your servers. http://www.openwall.com/crypt/.
[11] Patel, M. C. R., Gohil, P. N. B., and Shah, P. V. FPGA-hardware based DES and Blowfish symmetric cipher algorithms for encryption and decryption of secured wireless data communication.
[12] Poppitz, M. FPGA Based UNIX Crypt Hardware Password Cracker. http://www.sump.org/projects/password/, 2006.
[13] Provos, N., and Mazières, D. A Future-Adaptable Password Scheme. Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference (1999).
[14] Simmler, H., Kugel, A., Manner, R., Vieira, A., Galvez-Durand, de Alcantara, F., J.M.S., and Alves, V. Implementation of cryptographic applications on the reconfigurable FPGA coprocessor microEnable.
[15] Xilinx. Xilinx Zynq-7000 All Programmable SoC ZC706 Evaluation Kit. http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC706-G.htm.
[16] Xilinx. XUP ZedBoard. http://www.xilinx.com/support/university/boards-portfolio/xup-boards/XUPZedBoard.html.
[17] Xilinx. Zynq-7000 All Programmable SoC Family of Reconfigurable Devices.
[18] ZTEX. USB-FPGA Module 1.15. http://www.ztex.de/usb-fpga-1/usb-fpga-1.15.e.html.