Parallelizing MD5 Checksum Computation to Speed Up S3-Compatible Data Movement
The Problem
S3 compatibility threatens to slow down data movement at scale.
I work for a company that manages unstructured data for large enterprises. As we back up data from our customers’ data centers, we are expected to integrate easily with their existing systems. For this reason, the Igneous Storage API, like many modern storage APIs, is largely Amazon S3-compatible.
For data integrity, the S3 protocol specifies an MD5 content hash (the Content-MD5 header) for each object in a PUT request. MD5 is a cryptographic hash function published in 1992, and even though the security community now considers it broken for cryptographic purposes, it is still in daily use worldwide.
Ultimately, because the S3 API requires the computation of MD5 checksums, the Igneous platform’s performance can be bottlenecked by how fast we can compute MD5 checksums for the thousands of objects per second we persist during a write episode. Because the platform must be able to move billions of files quickly and reliably, we need to address this bottleneck.
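As a concrete illustration of what the protocol asks for, here is a minimal sketch of an S3-style PUT from the client side, using only the Go standard library. The endpoint URL is hypothetical; the point is that the Content-MD5 header carries the base64-encoded 128-bit digest of the object body, which the server recomputes and verifies on receipt.

package main

import (
	"bytes"
	"crypto/md5"
	"encoding/base64"
	"fmt"
	"net/http"
)

// putWithContentMD5 builds a PUT request whose Content-MD5 header is the
// base64-encoded MD5 digest of the body, as the S3 protocol expects.
func putWithContentMD5(url string, body []byte) (*http.Request, error) {
	sum := md5.Sum(body) // the 128-bit digest the server verifies
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-MD5", base64.StdEncoding.EncodeToString(sum[:]))
	return req, nil
}

func main() {
	req, _ := putWithContentMD5("https://example.invalid/bucket/key", []byte("hello"))
	fmt.Println(req.Header.Get("Content-MD5")) // XUFAKrxLKna5cZ2REBfFkg==
}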
The Context: Our Environment and the Tools We Use
At Igneous, we use Go (golang) throughout our software stack: it drives our on-premises storage equipment (Intel CPU servers and armv7 storage processors) as well as our cloud-resident management plane. Most of the heavy lifting in our system is done by our Intel servers. Among the many thousands of lines of Go code we have written, we have used assembly language in a handful of places to accelerate low-level data motion or checksum functions, such as computing the MD5 checksum.
Although Go is a high-level language, it provides convenient access to its assembler. One reason to use assembly language is for access to architecture-specific CPU instructions that the compiler cannot generate. This applies especially to “vectorized” inner loops where an architecture-specific vector unit may be available to process blocks of data in special registers 128, 256 or even 512 bits at a time. In Intel’s nomenclature, the vector unit is called “AVX.”
A Little More Background on MD5
MD5 is a one-way hash function; it can be used to verify file integrity, typically against unintentional corruption. The MD5 hash works by initializing a 128-bit state vector and modifying it using a series of shifts, adds and bitwise logical operations over every 64 bytes of input. Each 64-byte block is processed in four rounds of 16 steps, each round built around one of four nonlinear functions, and each step costs roughly 8-10 machine instructions. This works out to about 8-10 machine instructions per byte of input. In practice, a modern pipelined CPU can compute MD5 at a rate of about 4.5 cycles per byte.
Although pipelining helps, 4.5 cycles per byte at typical server clock speeds implies a throughput of around 600MB/sec, and in a world of 10Gb networking it’s easy to see how MD5 can become a bottleneck. Ideally, we would like to parallelize MD5, but a single hash is inherently not parallelizable, since each round modifies the state vector in a way that depends on all previous rounds.
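To make the per-step cost (and the serial dependency) concrete, here is one round-1 step of MD5 written out in plain Go; this is an illustrative sketch rather than a complete implementation. Note that the new value of a depends on the previous a, b, c and d, which is why a single checksum cannot be spread across multiple lanes.

package main

import (
	"fmt"
	"math/bits"
)

// step1 performs one round-1 step of MD5: a nonlinear mix of b, c and d,
// two 32-bit adds (message word m and constant k), a rotate, and a final
// add. Only a handful of 32-bit instructions per step.
func step1(a, b, c, d, m, k uint32, s int) uint32 {
	f := (b & c) | (^b & d) // round-1 nonlinear function
	return b + bits.RotateLeft32(a+f+m+k, s)
}

func main() {
	// One step applied to the standard MD5 initial state, with the first
	// per-step constant (0xd76aa478) and shift (7), just to show the shape.
	a, b, c, d := uint32(0x67452301), uint32(0xefcdab89), uint32(0x98badcfe), uint32(0x10325476)
	fmt.Printf("%08x\n", step1(a, b, c, d, 0, 0xd76aa478, 7))
}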
How can we improve this process, then?
Considering that we batch-process many thousands of PUT requests per second, there would be a benefit to processing many distinct MD5 checksums in parallel, and this is something we can do using AVX. Let’s see how it works.
The Solution
Standard Library
It helps to first look at what the scalar MD5 checksum code does before trying to improve it. In fact, the Go standard library already has a highly optimized scalar implementation of MD5 using assembly language, and each round is written as a macro. The running state of the MD5 algorithm is saved in a digest defined as:
// digest represents the partial evaluation of a checksum.
type digest struct {
	s   [4]uint32   // 128-bit state vector
	x   [chunk]byte // partially-filled 64-byte block
	nx  int         // number of bytes buffered in x
	len uint64      // count of bytes checksummed
}
And when a large run of bytes is consumed, data is checksummed in bulk (at multiples of the block size) by calling a block function with the signature:
func block(dig *digest, p []byte)
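For context, the Write method that feeds this function looks roughly like the following (paraphrased from crypto/md5): any previously buffered tail is topped up to a full block first, then all the complete 64-byte chunks go to block in one call, and whatever remains is buffered for next time.

// Write is a paraphrase of the standard library's driver loop around block.
func (d *digest) Write(p []byte) (nn int, err error) {
	nn = len(p)
	d.len += uint64(nn)
	if d.nx > 0 { // finish a partially-filled block from a previous call
		n := copy(d.x[d.nx:], p)
		d.nx += n
		if d.nx == chunk {
			block(d, d.x[:])
			d.nx = 0
		}
		p = p[n:]
	}
	if len(p) >= chunk { // checksum all complete 64-byte blocks in bulk
		n := len(p) &^ (chunk - 1)
		block(d, p[:n])
		p = p[n:]
	}
	if len(p) > 0 { // buffer the trailing partial block
		d.nx = copy(d.x[:], p)
	}
	return
}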
The Intel version of this inner function must first unpack the digest as it is passed on the stack. Its function prologue looks like this:
TEXT ·block(SB),NOSPLIT,$0-32
	MOVQ dig+0(FP), BP       // BP = pointer to the digest
	MOVQ p+8(FP), SI         // SI = start of the input slice
	MOVQ p_len+16(FP), DX    // DX = len(p)
	SHRQ $6, DX              // round the length down to a
	SHLQ $6, DX              //   multiple of the 64-byte block size
	LEAQ (SI)(DX*1), DI      // DI = end of the last whole block
	MOVL (0*4)(BP), AX       // load the four 32-bit state words
	MOVL (1*4)(BP), BX       //   a, b, c, d from the digest
	MOVL (2*4)(BP), CX       //   into AX, BX, CX, DX
	MOVL (3*4)(BP), DX
You can see that the state is loaded directly into the four registers AX, BX, CX, DX, and each round operates on these registers. As an example, this is what the fourth-round macro looks like:
#define ROUND4(a, b, c, d, index, const, shift) \
LEAL const(a)(R8*1), a; \
ORL b, R9; \
XORL c, R9; \
ADDL R9, a; \
MOVL (index*4)(SI), R8; \
MOVL $0xffffffff, R9; \
ROLL $shift, a; \
XORL c, R9; \
ADDL b, a
The important thing to notice here is that MD5 splits its state vector into four distinct 32-bit words (named a, b, c, d) and that in fact all the Intel instructions operate on 32-bit quantities (the “L” suffix on ADDL, MOVL, XORL, etc. denotes a 32-bit register width). Therefore, MD5 is, in essence, a 32-bit algorithm, and by using a 256-bit vector unit we could in principle achieve an 8-way parallelization (8x32 = 256).
After all the rounds of the block checksummer have been applied, the new state vector is stored back into the digest:
end:
MOVL AX, (0*4)(BP)
MOVL BX, (1*4)(BP)
MOVL CX, (2*4)(BP)
MOVL DX, (3*4)(BP)
RET
AVX2
Intel’s second-generation AVX architecture (AVX2) specifies 256-bit registers, and we can imagine running 8 simultaneous calculations of MD5 in parallel, provided that we can restrict each instance of MD5 to its own 32-bit “lane”. (The idea of a lane is that neighboring calculations in a wide register do not interfere with one another. As an example, imagine a two-lane machine which stores single decimal digits in each lane, so that: (2, 2) + (9, 9) = (1, 1). In other words, the carry digit is never propagated from one lane to the next.)
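In Go terms, a lane-wise 32-bit add over a 256-bit register behaves like an element-wise add over a [8]uint32: each lane wraps around on its own, and no carry ever crosses into a neighbor. A small sketch of the idea:

package main

import "fmt"

// laneAdd models what a 32-bit lane add (VPADDD) does on a 256-bit
// register: eight independent additions, each wrapping within its own lane.
func laneAdd(x, y [8]uint32) [8]uint32 {
	var r [8]uint32
	for i := range x {
		r[i] = x[i] + y[i] // wraps mod 2^32; neighboring lanes are untouched
	}
	return r
}

func main() {
	x := [8]uint32{0xffffffff, 1, 2, 3, 4, 5, 6, 7}
	y := [8]uint32{1, 1, 1, 1, 1, 1, 1, 1}
	fmt.Println(laneAdd(x, y)) // [0 2 3 4 5 6 7 8]: lane 0 wrapped, lane 1 saw no carry
}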
Vectorization
Running these calculations in parallel is called “vectorization.” Let’s do that now.
The first task in vectorizing the ROUND macros is to find a vector-equivalent instruction for each instruction in their sequences. Many have a one-to-one representation. In particular, the bitwise operations translate trivially to vector registers: bitwise XOR is bitwise XOR regardless of register width.
The other instructions must be kept within a 32-bit lane. Fortunately, Intel offers shifts and adds on 32-bit lanes (the “D” suffix on VPADDD, VPSRLD, etc. denotes a 32-bit integer element in AVX), and the shifts can be used to synthesize a 32-bit roll. So after translating the macro above to AVX, it looks like this:
#define ROUND4(a, b, c, d, index, const, shift) \
	VPADDD 32*const(consts), a, a; \   [ ADDL -> VPADDD ]
	VPADDD mem, a, a; \                [ ADDL -> VPADDD ]
	VORPS b, tmp, tmp; \               [ ORL  -> VORPS  ]
	VXORPS c, tmp, tmp; \              [ XORL -> VXORPS ]
	VPADDD tmp, a, a; \                [ ADDL -> VPADDD ]
	load(index); \
	roll(shift, a); \                  [ ROLL -> roll macro ]
	VXORPS c, ones, tmp; \             [ XORL -> VXORPS ]
	VPADDD b, a, a                     [ ADDL -> VPADDD ]

#define roll(shift, a) \
	VPSLLD $shift, a, rtmp1; \
	VPSRLD $32-shift, a, a; \
	VORPS rtmp1, a, a
(The annotations in the right-hand column show how the new instructions correspond to the scalar instructions in the old ROUND4 macro.)
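The roll macro is the only place where one scalar instruction becomes three vector instructions. In Go terms the synthesis is simply:

package main

import (
	"fmt"
	"math/bits"
)

// rotl32 mirrors the roll macro: a left shift, a right shift by the
// complementary amount, and an OR, all within each 32-bit lane.
func rotl32(a uint32, shift uint) uint32 {
	return (a << shift) | (a >> (32 - shift))
}

func main() {
	x := uint32(0x80000001)
	fmt.Println(rotl32(x, 4) == bits.RotateLeft32(x, 4)) // true
}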
Constants
Finally, something must be done about the immediate constants on which MD5 depends. These are handled as immediates in the scalar code (the LEAL instruction earlier folds the immediate constant into an addition), but the AVX instructions we need have no immediate-operand form. To solve this, we precompute the MD5 constants and store them in a lookup table, “inflated” and replicated 8-wide.
var md5consts = [64]uint32{
	0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf,
	[ … 55 words elided … ],
	0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391,
}

var avx256md5consts = inflate(md5consts[:], 8)

func inflate(c []uint32, x int) []uint32 {
	inf := make([]uint32, x*len(c))
	for i := range c {
		for j := 0; j < x; j++ {
			inf[(i*x)+j] = c[i]
		}
	}
	return inf
}
(Curiously, the MD5 constants are derived from the binary expansion of the sine function: the i-th constant (for i = 1…64) is the integer part of 2^32 · |sin(i radians)|, so the output of this hash function depends implicitly on the value of π!).
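The layout this inflation produces is what the 32*const(consts) operand in the round macro relies on: byte offset 32*i into the inflated table holds eight identical copies of md5consts[i] (8 lanes × 4 bytes = 32 bytes). A sketch of a unit test for that invariant, assuming it lives in a _test.go file in the same package as the tables above:

// TestInflateLayout checks the invariant the AVX round macros rely on:
// the eight 32-bit words starting at index i*8 (byte offset 32*i) are all
// equal to md5consts[i].
func TestInflateLayout(t *testing.T) {
	for i, c := range md5consts {
		for j := 0; j < 8; j++ {
			if got := avx256md5consts[i*8+j]; got != c {
				t.Fatalf("constant %d, lane %d: got %#x, want %#x", i, j, got, c)
			}
		}
	}
}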
“Gathering”
This sleight-of-hand is good enough for the MD5 kernel, but unless the source bytes can be loaded efficiently into a 256-bit register, the optimized inner loop is worthless. To this end, we make use of a special “gather” instruction, which can load 32-bit words from 8 separate source addresses into one 256-bit register. It has a very exotic format:
VMOVAPD mask,tmp;\
VGATHERDPS tmp,index*4(base)(off*1),dst
Eight 32-bit integers are loaded from addresses specified by a base register (64-bit scalar) with a 256-bit vector register offering 8 distinct offsets for the 8 source locations. Finally, a bitmask specifies which subset of these 8 offsets is valid, in the event that the gather instruction is used to load fewer than 8 sources at once.
One restriction here is that the offsets are limited to 32 bits, and so the sources must all lie in the same 32-bit “window” of a 64-bit address space. This is an inevitable consequence of using a 256-bit register to supply 8 offsets, and at Igneous we work around this by making sure to pool-allocate memory such that any source buffers for parallel MD5 are guaranteed to be within a 32-bit span of one another.
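The pool allocator itself is outside the scope of this post, but the idea can be sketched in a few lines: carve every candidate source buffer out of one contiguous arena, so that any two buffers are guaranteed to sit within a 32-bit span of each other. (The names and sizes below are illustrative, not the Igneous implementation.)

package main

import "fmt"

// arena hands out buffers from a single contiguous backing slice. Because
// every buffer is a sub-slice of the same allocation, the distance between
// any two of them fits in 32 bits as long as the arena is under 4GiB.
// (A real pool would also handle freeing and reuse; this sketch only appends.)
type arena struct {
	backing []byte
	off     int
}

func newArena(size int) *arena { return &arena{backing: make([]byte, size)} }

func (a *arena) alloc(n int) []byte {
	if a.off+n > len(a.backing) {
		return nil // out of space; a real pool would grow or recycle
	}
	buf := a.backing[a.off : a.off+n : a.off+n]
	a.off += n
	return buf
}

func main() {
	a := newArena(1 << 20)
	b1, b2 := a.alloc(4096), a.alloc(4096)
	fmt.Println(len(b1), len(b2)) // both carved from the same 32-bit window
}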
Digests
The last aspect worth touching on is how this MD5 implementation can use the same MD5 digest as the scalar routine in the standard library. In fact, we define a “write N” function as follows:
func WriteN(h []hash.Hash, p [][]byte) (n int, err error)
This function takes any N MD5 digests and associated source buffers and updates the digests after parallelizing as many of the checksums as possible. For the sake of simplicity, we require that len(p[i]) is the same for all i, since the slotting and scheduling of parallel checksumming happens at a higher level.
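From the caller’s point of view, usage might look like the following sketch (the md5vec import path is taken from the benchmark output further down; everything else here is illustrative):

package main

import (
	"crypto/md5"
	"fmt"
	"hash"

	"igneous.io/common/crypto/md5vec"
)

func main() {
	// Eight standard-library MD5 digests and eight equally sized buffers.
	const lanes = 8
	hashes := make([]hash.Hash, lanes)
	bufs := make([][]byte, lanes)
	for i := range hashes {
		hashes[i] = md5.New()
		bufs[i] = make([]byte, 1<<20) // all len(p[i]) equal, as WriteN requires
	}

	// One call advances all eight digests, checksumming them in parallel.
	if _, err := md5vec.WriteN(hashes, bufs); err != nil {
		panic(err)
	}

	// Each digest is still an ordinary hash.Hash and can be finished as usual.
	for i := range hashes {
		fmt.Printf("%x\n", hashes[i].Sum(nil))
	}
}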
Since the standard library does not export its digest definition, we must spoof it in the parallel MD5 with a structure cast, and a unit test ensures that the standard library definition does not drift out from under us.
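One way such a structure cast can be written, shown here only as a sketch (the real package may differ), is to mirror the unexported layout and reinterpret the data word of the hash.Hash interface value. This assumes the hash and unsafe packages are imported, and that the interface holds the *digest returned by md5.New():

// digest mirrors the unexported layout of crypto/md5's digest struct;
// the unit test mentioned above guards against this layout drifting.
type digest struct {
	s   [4]uint32
	x   [chunk]byte
	nx  int
	len uint64
}

// digestOf reinterprets the data word of a hash.Hash interface value as a
// *digest. An interface value is a (type, data) pair of words, so index 1
// is the pointer to the concrete md5 digest.
func digestOf(h *hash.Hash) *digest {
	return (*digest)((*[2]unsafe.Pointer)(unsafe.Pointer(h))[1])
}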
After spoofing the hash interfaces this way, the digests can be packed onto the stack as a sequence of 32-bit integers that load directly into four 256-bit registers: all eight digests’ s[0] comprise the 8-way state vector for “a,” all eight digests’ s[1] comprise the state vector for “b,” and so on:
// condense up to 8 hash states at a time into
// 4x8 uint32 vectors
var s digest8
for i := range h {
	d := digestOf(&h[i])
	s.v0[i] = d.s[0]
	s.v1[i] = d.s[1]
	s.v2[i] = d.s[2]
	s.v3[i] = d.s[3]
}
And similarly, after calling the 8-way block function, the digests are re-inflated into the caller structs:
// burst hash state into constituent digests
for i := range h {
	d := digestOf(&h[i])
	d.s[0] = s.v0[i]
	d.s[1] = s.v1[i]
	d.s[2] = s.v2[i]
	d.s[3] = s.v3[i]
	d.len += uint64(n)
}
The advantage of using the same digest format is that the standard library hash can be used to “mop up” trailing bytes:
// write trailing bytes
if n != m {
	for i := range h {
		h[i].Write(p[i][n:])
	}
	n = m
}
Performance of Our Solution
The benchmarking built into Go is a pleasure to use, and it gives immediate performance feedback:
go test -bench . -run none
goos: darwin
goarch: amd64
pkg: igneous.io/common/crypto/md5vec
BenchmarkMd5-4 100 13331769 ns/op 629.22 MB/s
BenchmarkMd5by8-4 500 2392278 ns/op 3506.54 MB/s
A “perfect” speedup would be 8x, but in practice we get closer to 5.5x (3506 / 629). There are a few likely explanations for this: the 8-way gather instruction is probably a bottleneck on the way to the source data; the vector unit may not pipeline operations as well as the scalar unit; and the translation from scalar to vector instructions is not strictly one-to-one, because sequences like roll must be expanded into shifts and an OR. Nevertheless, this represents a huge improvement over the scalar code.
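For reference, the 8-way benchmark follows the usual Go pattern: hash a fixed amount of data per iteration and let b.SetBytes report throughput. The sketch below is illustrative (the buffer count and size are assumptions, and it presumes it sits in the md5vec package next to WriteN):

func BenchmarkMd5by8(b *testing.B) {
	const lanes, size = 8, 1 << 20 // eight 1MiB buffers per iteration (illustrative)
	hashes := make([]hash.Hash, lanes)
	bufs := make([][]byte, lanes)
	for i := range hashes {
		hashes[i] = md5.New()
		bufs[i] = make([]byte, size)
	}
	b.SetBytes(lanes * size)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := WriteN(hashes, bufs); err != nil {
			b.Fatal(err)
		}
	}
}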
And there you have it! I hope you’ve enjoyed following along with how to parallelize MD5 checksum computation, and that this gives you a fun glimpse into one situation where it makes a lot of sense to use the Go assembler.
UPDATE: Check out the Git repository.