Post on 29-Jul-2015
2014
1. “Beyond Modes: Building a Secure Record Protocol from a Cryptographic Sponge Permutation.” CT-RSA 2014. LNCS 8366, pp. 270–285, Springer (2014)
2. “CBEAM: Efficient Authenticated Encryption from Feebly One-Way ϕ Functions.” CT-RSA 2014. LNCS 8366, pp. 251–269, Springer (2014)
3. “STRIBOB: Authenticated Encryption from GOST R 34.11-2012 LPS Permutation.” CTCrypt ’14. To appear in Mathematical Aspects of Cryptography, Steklov Mathematical Institute of RAS (2014)
4. “Simple AEAD Hardware Interface (SÆHI) in a SoC: Implementing an On-Chip Keyak/WhirlBob Coprocessor.” TrustED 2014, ACM CCS 2014 Workshops, 03 November 2014, Scottsdale AZ US. To appear. ACM (2014)
5. “Lighter, Faster, and Constant-Time: WHIRLBOB, the Whirlpool variant of STRIBOB.” With Billy Bob Brumley. IACR ePrint 2014/501. Submitted (2014)
6. “BRUTUS: Identifying Cryptanalytic Weaknesses in CAESAR First Round Candidates.” IACR ePrint 2014/850. Submitted (2014)
+ Invited Talks.
1 / 17
Simple AEAD Hardware Interface (SÆHI) in a SoC: Implementing an On-Chip Keyak/WhirlBob Coprocessor
Dr. Markku-Juhani O. Saarinen
mjos@item.ntnu.no
NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY
TrustED ’14 – 03 November 2014 – Scottsdale AZ
2 / 17
Authenticated Encryption with Associated Data
An Authenticated Encryption with Associated Data (AEAD) primitive provides:
▶ Encryption, or confidentiality / privacy protection, and
▶ Authentication, or integrity protection, for encrypted and associated data.
Preferably in a single pass over the data.
Security protocols such as IPSec and SSL/TLS usually required two processing steps for each packet in the 1990s and 2000s.
▶ Authentication was handled with an HMAC (Hash-based Message Authentication Code).
▶ Encryption was provided either with a block cipher such as 3DES-CBC or AES-CBC, or with a stream cipher such as RC4.
Hardware implementation of such a twin set-up is cumbersome.
Transition to AE has been swift in recent years because of AES-GCM’s status in Suite B (Classified COTS) and many attacks such as CRIME, LUCKY13, and POODLE.
3 / 17
Background: CAESAR project for new AEAD algorithms 2014-2017
NIST-sponsored international Competition for Authenticated Encryption: Security, Applicability, and Robustness.
http://competitions.cr.yp.to/caesar-call.html
▶ Jan 2013 Announced by Dan Bernstein (secretary)
▶ Mar 2014 Deadline for first-round submissions (57)
▶ May 2014 Deadline for first-round software
▶ Aug 2014 DIAC ’14 Workshop, UCSB
▶ Jan 2015 Second round candidates announced
▶ Feb 2015 Second round tweaks (fixes)
▶ Feb 2015 Second round Verilog / VHDL (this talk)
▶ Dec 2015 Third round candidates
▶ Dec 2016 Final round candidates
▶ Dec 2017 Final CAESAR portfolio announcement
4 / 17
Hardware API for Authenticated Encryption
CAESAR candidates came in many shapes and sizes. Here’s a rough breakdown:
8 are clearly based on a SHA3-style Sponge construction.
9 are (somehow) constructed from AES components.
19 are AES modes of operation.
21 are based on other design paradigms or are entirely ad hoc.
We want consistent testing across second round candidates.
Signalling. How to communicate with the hardware? Can a consistent, high-level “hardware API” be constructed?
Memory access. Some prominent proposals (AEZ and SIV) require two passes over the data, so APIs in the style of hash functions don’t really work.
What to test. Realistic test profiles via operating system and application integration.
5 / 17
System-on-Chip (SoC) Designs
[Figure: chart of total global shipments 2014 (million units) for Android, Other mobile, and PC total; values shown: 1241.664, 853.829, 314.065.]
Majority of Internet and communication devices are Android Linux-based tablets or smart phones.
System-on-Chip (SoC) designs integrate all the necessary components of a computing application on a single chip.
Mobile electronics such as (smart) phones and tablets are built on SoCs. They are also found in Internet-of-Things (IoT) appliances, modems, routers, home media, cars, etc.
Security of transmitted and stored data is even more relevant to mobile devices than to traditional PC systems.
Limited CPU performance.
Energy efficiency critical.
Coprocessors: Audio and video codecs, RF processing, 3D display rendering, M7/CCP motion, natural language, etc.
→ Our evaluation target.
6 / 17
Zynq-7000 FPGA Artix 7 / ARM Cortex A9 SoC
[Figure: Zynq-7000 block diagram. A Processing System (two Cortex-A9 MPCore cores with NEON DSP/FPU engines and 32/32 KB I/D caches, 512 Kbyte L2 cache, 256 Kbyte on-chip memory, DRAM and flash controllers, DMA, timers, and I/O peripherals) is joined by AMBA interconnect and AXI ports to the Programmable Logic (system gates, DSP, RAM).]
On a single chip:
▶ Dual-core ARM Cortex A9 CPU @ 650 MHz.
▶ Artix-7 or Kintex-7 type FPGA logic fabric.
▶ Can run Linux and Android.
▶ Realistic target for SoC implementations.
▶ Full devkit under $200.
We Study:
▶ Hardware assisted implementations vs. software vs. hardware implementations.
▶ FPGA and software footprint, speed, power.
▶ Integration in applications e.g. via OpenSSL.
7 / 17
Implementation 1: Keyak (SHA3 Keccak) Core
[Figure: RTL schematic of module k1600 (inputs clk, rnd[4:0], in[1599:0]; output out[1599:0]) with a round-constant ROM (keccak_rc) and arrays of 64-bit XOR and AND gates.]
Single round of Keccak/Keyak 1600-bit core permutation drawn with 64-bit data paths. Mainly XORs visible.
SHA3 algorithm Keccak and Keyak AEADs:
▶ Keccak is a sponge hash with a 1600-bit core permutation, selected as SHA3 in 2012.
▶ Designed by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche.
▶ Same team proposed the Keyak family of AEADs that utilize the same permutation in the CAESAR project.
We implemented:
▶ The 1600-bit core in Verilog for the Artix-7 FPGA core of Zynq 7000.
▶ The module can be utilized for both hashing and authenticated encryption.
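As a software reference for what such a round module computes, the sketch below models one round of the 1600-bit permutation in C. It is not the Verilog from the talk: the function names k1600_round and k1600_permute are made up here, the state is held as 25 lanes of 64 bits, and the round index plays the role of the rnd input that addresses the round-constant ROM. The reduced-round helper assumes the usual convention that a shortened variant runs the last rounds of the 24-round schedule.

```c
#include <assert.h>
#include <stdint.h>

/* Iota round constants: the contents of a 24-entry ROM addressed by rnd. */
static const uint64_t keccak_rc[24] = {
    0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
    0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
    0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008aULL,
    0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000aULL,
    0x000000008000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL,
    0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
    0x000000000000800aULL, 0x800000008000000aULL, 0x8000000080008081ULL,
    0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL
};

/* Rho rotation amounts and Pi destination lanes, in chained order. */
static const int keccak_rot[24] = {
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
    27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44
};
static const int keccak_pi[24] = {
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
    15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1
};

/* Rotate left; n is always in 1..63 for the tables above. */
static uint64_t rotl64(uint64_t x, int n)
{
    return (x << n) | (x >> (64 - n));
}

/* One round on 25 lanes of 64 bits; rnd selects the round constant. */
void k1600_round(uint64_t st[25], unsigned rnd)
{
    uint64_t bc[5], t;
    int i, j;

    assert(rnd < 24);

    /* Theta: mix each lane with the parities of two neighbouring columns. */
    for (i = 0; i < 5; i++)
        bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20];
    for (i = 0; i < 5; i++) {
        t = bc[(i + 4) % 5] ^ rotl64(bc[(i + 1) % 5], 1);
        for (j = 0; j < 25; j += 5)
            st[j + i] ^= t;
    }

    /* Rho and Pi: rotate each lane and move it to its new position. */
    t = st[1];
    for (i = 0; i < 24; i++) {
        j = keccak_pi[i];
        bc[0] = st[j];
        st[j] = rotl64(t, keccak_rot[i]);
        t = bc[0];
    }

    /* Chi: the only nonlinear step, one AND per output lane. */
    for (j = 0; j < 25; j += 5) {
        for (i = 0; i < 5; i++)
            bc[i] = st[j + i];
        for (i = 0; i < 5; i++)
            st[j + i] ^= ~bc[(i + 1) % 5] & bc[(i + 2) % 5];
    }

    /* Iota: XOR the round constant into lane 0. */
    st[0] ^= keccak_rc[rnd];
}

/* Apply the last `rounds` rounds of the 24-round schedule (assumption). */
void k1600_permute(uint64_t st[25], unsigned rounds)
{
    unsigned r;

    assert(rounds <= 24);
    for (r = 24 - rounds; r < 24; r++)
        k1600_round(st, r);
}
```

Calling k1600_permute(st, 24) gives the full hashing permutation, while k1600_permute(st, 12) corresponds to the 12-round setting quoted for Keyak. A model like this is mainly useful for generating test vectors to compare against a hardware simulation.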
8 / 17
Implementation 2: WhirlBob / Whirlpool Core
[Figure: WhirlBob Core.]
StriBob, WhirlBob, and Whirlpool:
▶ StriBob is a CAESAR proposal by Markku-Juhani O. Saarinen.
▶ WhirlBob is a 2nd round tweak proposed by M.-J. O. Saarinen and Billy Bob Brumley (submitted to INSCRYPT ’14).
▶ WhirlBob is based on the permutation of the (ISO) Standardized Whirlpool 3.0 hash by Paulo Barreto and Vincent Rijmen.
We implemented:
▶ The 512-bit, 1 cycle per round core permutation in Verilog for the Artix-7 FPGA core of Zynq 7000.
▶ The module can be utilized for both Whirlpool hashing and WhirlBob authenticated encryption.
9 / 17
Hardware Performance
With one “extra” reloading cycle per block, the maximum theoretical throughput of these implementations is:

Parameter        WhirlBob  Keyak
Rounds           12        12
Cycles           13        13
Rate (bits)      256       1344
Speed (bit/clk)  19.7      103.4

Processing speeds are significantly slower when the Keccak core is used in the 24-round SHA3 hashing mode. Speed ranges from 23.0 (SHA3-512) to 47.5 (SHA3-224) bits/clock. Whirlpool, in comparison, is slightly faster than WhirlBob.
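The speed row of the table is the rate divided by the cycle count, where the cycle count is one clock per round plus the reloading cycle. A minimal sketch of that arithmetic follows; the function name peak_bits_per_clock is introduced here only for illustration.

```c
#include <assert.h>

/* Peak bits per clock for a core that absorbs rate_bits per block and
 * spends one clock per round plus extra_cycles of reloading. */
double peak_bits_per_clock(unsigned rate_bits, unsigned rounds,
                           unsigned extra_cycles)
{
    unsigned cycles = rounds + extra_cycles;

    assert(cycles > 0);
    return (double)rate_bits / (double)cycles;
}
```

With the table's parameters, peak_bits_per_clock(256, 12, 1) is 256/13, about 19.7, and peak_bits_per_clock(1344, 12, 1) is 1344/13, about 103.4. For SHA3-512, whose rate is 576 bits, 24 rounds plus one reload cycle gives 576/25, about 23.0.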
10 / 17
CAESAR Software API vs. Hardware API
A simple C API was specified by the CAESAR secretariat for reference software implementations of the first round candidates.

    int crypto_aead_encrypt(
        uint8_t *c, uint64_t *clen,             // Ciphertext
        const uint8_t *m, uint64_t mlen,        // Message
        const uint8_t *ad, uint64_t adlen,      // Associated Data
        const uint8_t *nsec,                    // (Secret IV)
        const uint8_t *npub,                    // Nonce
        const uint8_t *k);                      // Secret Key

Decryption and integrity verification can be performed with crypto_aead_decrypt(), which has an equivalent interface.
SÆHI utilizes the same software API and a simple memory-mapped hardware API. The software side is essentially a driver suitable for bare metal implementation.
11 / 17
Proposed Baseline Hardware API
Our cryptographic coprocessor has a simple, almost universal memory-mapped interface. The module or hardware pin interface is the same as for generic single-port RAM (with an optional interrupt request line).

Signal  Dir  Purpose
ADDR    In   Address
DI      In   Data Write
WE      In   Write enable
EN      In   Enable/Select
CLK     In   Clock
DO      Out  Data Read
IRQ     Out  Interrupt Req.

[Diagram: AEAD Core block with inputs ADDR, DI, WE, EN, CLK and outputs DO, IRQ.]

The signaling between the software component and this API is defined by the driver.
Faster (DMA, AXI) alternatives can be used – this is just the baseline interface.
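To make the RAM-style interface concrete, here is a minimal driver-side sketch in C. Everything about the layout is an assumption made for illustration: the slides do not give a register map, so the word offsets, the command register, the busy flag, and the saehi_* names are all hypothetical. The only idea taken from the slide is that software sees the coprocessor as a window of memory it can write to and read from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word offsets inside the coprocessor's address window. */
enum {
    SAEHI_REG_CTRL     = 0,   /* write: command code */
    SAEHI_REG_STATUS   = 1,   /* read: nonzero while the core is busy */
    SAEHI_REG_DATA     = 16,  /* first word of the data block */
    SAEHI_WINDOW_WORDS = 64   /* size of the whole window in words */
};

/* Copy n words into the data area; each store is one write-enable cycle. */
void saehi_write_block(volatile uint32_t *base, const uint32_t *src, size_t n)
{
    size_t i;

    assert(n <= (size_t)(SAEHI_WINDOW_WORDS - SAEHI_REG_DATA));
    for (i = 0; i < n; i++)
        base[SAEHI_REG_DATA + i] = src[i];
}

/* Copy n words back out of the data area. */
void saehi_read_block(volatile uint32_t *base, uint32_t *dst, size_t n)
{
    size_t i;

    assert(n <= (size_t)(SAEHI_WINDOW_WORDS - SAEHI_REG_DATA));
    for (i = 0; i < n; i++)
        dst[i] = base[SAEHI_REG_DATA + i];
}

/* Issue a command and poll until the (hypothetical) busy flag clears. */
void saehi_command(volatile uint32_t *base, uint32_t cmd)
{
    base[SAEHI_REG_CTRL] = cmd;
    while (base[SAEHI_REG_STATUS] != 0)
        ;   /* an IRQ-driven driver would wait for the interrupt instead */
}
```

On bare metal, base would point at the physical address where the window is mapped; on a host it can point at an ordinary array, which is enough to exercise the driver logic without hardware.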
12 / 17
Comparing Implementations
Code lines in our WhirlBob (StriBob) and Keyak reference implementations:

Component        WhirlBob  Keyak
Interface        Verilog   99     114
Round            Verilog   228    129
Driver           C         60     60
API Interface    C         261    ≈ 250
Total code                 639    553

Post synthesis and route utilization within the Artix-7 FPGA fabric of Xilinx Zynq 7010:

Logic        WhirlBob  Keyak
LUTs         3,795     4,574
Flip-Flops   1,060     3,237
MUXs         90        159
Other        1         2
Total logic  4,946     7,972
13 / 17
Implementations
We first developed the implementations with a homemade VGA module (not utilizing the CPU at all). The implementations were then integrated into Xillinux and made accessible to user space daemons.
14 / 17
What to test?
We hope to measure for each candidate:
A Area. FPGA Slices or ASIC Gate Equivalents.
W Power. Power consumption (Watts = J/s).
R Speed. Ideal throughput (Bytes/Second).
One key goal is to maximize
$$e = \frac{R}{W}.$$
Note that doubling A for factor 2 parallelism will approximately double both R and W, and e will remain constant.
The same is true for doubling the clock frequency, since power consumption is almost linearly dependent on clock frequency for most (CMOS) circuits.
Hence Bytes/Joule is perhaps the most relevant metric for mobile devices.
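The metric is simple enough to state as code. The sketch below computes e in bytes per joule; the function name is made up for illustration, and any figures fed to it are placeholders rather than measurements from this work.

```c
#include <assert.h>

/* Energy efficiency e = R / W, in bytes per joule.
 * bytes_per_second is the throughput R, watts is the power draw W. */
double bytes_per_joule(double bytes_per_second, double watts)
{
    assert(watts > 0.0);
    return bytes_per_second / watts;
}
```

For example, a hypothetical core moving 100e6 bytes/s at 0.5 W and a two-way parallel version moving 200e6 bytes/s at 1.0 W both come out at 2e8 bytes per joule, which is the invariance described above.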
15 / 17
Integration path for Linux/Android Testing
[Figure: software stack. User space processes (browser application, SSH application, command-line utilities) sit on the TLS API (libssl.so), the SSH API (libssh.so), and the OpenSSL Crypto API (libcrypto.so), which loads an AEAD plugin “engine” (libmyaead.so). The engine talks to the SÆHI daemons by interprocess communication, with a software cipher path labelled “SÆHI daemon not available”. Below the kernel, the System-on-Chip contains the CPU core and the SÆHI AEAD 1 and SÆHI AEAD 2 coprocessors.]
▶ The dominant underlying API for Linux is based on OpenSSL: libcrypto, libssl. Supported by browsers etc.
▶ OpenSSL supports configurable plugin “engines”.
▶ After recent bugs (Heartbleed), new forks:
    ▶ Google: BoringSSL.
    ▶ OpenBSD group: LibreSSL (upcoming ressl API).
▶ Since the hardware accelerator is a shared resource, implement as a user space daemon.
▶ Utilize experimental ciphersuite identifiers in applications and TLS, SSH, IPSec. Plug in CAESAR ciphers to replace AES-GCM.
▶ Measure utilization, power, time, throughput with realistic usage profiles.
16 / 17
Conclusions
▶ CAESAR is a project to find next-generation Authenticated Encryption algorithms.
▶ Proposed SÆHI, a simple memory-mapped hardware API for CAESAR ciphers.
▶ Realistic hardware target: System-on-Chip with FPGA logic and ARM Cortex A9.
▶ FPGA implementations of Keyak and WhirlBob algorithms.
▶ Integration path for Applications in Android.
next.. a little demo!
17 / 17