This is a strategy we use in BoringSSL to compute the value in Montgomery reduction. I haven’t found it written up anywhere else, so I thought I’d jot it down:
At least before post-quantum, almost all asymmetric cryptography (RSA, ECDSA, ECDH, etc.) requires multiplication over some large modulus: given , we want to compute .
The challenge is that is large, and division is usually neither “constant-time” nor fast. Many of our implementation concerns revolve around how to avoid division. When we can, we pick convenient moduli and write specialized code for them. Curve25519 is a pseudo-Mersenne prime, and the NIST curves are Solinas primes. (Though NIST’s use of base 232 is not as useful anymore on modern 64-bit CPUs.) Otherwise, for primitives like RSA that need an arbitrary odd modulus, the usual technique is Montgomery reduction.
Wikipedia and BearSSL give good descriptions of Montgomery reduction, but briefly:
Pick some value R that is greater than and relatively prime with our modulus, N. R should be easy to divide by, so round N to a power of our base (232 or 264). Montgomery reduction takes some T, such that , and computes . That is, we reduce it, but pay an extra factor of to do it.
We compensate for this extra factor by representing numbers differently. A number A in Montgomery form is . Now every product builds up an extra factor of R, which the extra term cancels. That is, working modulo N, we have:
So, given A and B in Montgomery form, we can compute the reduced product in Montgomery form. To implement modular exponentation in RSA, first convert inputs into Montgomery form, do all the computation, and finally convert out of Montgomery form at the end.
Converting a value from Montgomery form is simple: run Montgomery reduction. This multiplies by and cancels the R term.
Converting to Montgomery form requires one more step: precompute . Now take the Montgomery product of A and the precomputed . The term cancels one factor of R, giving . This is A in Montgomery form.
When N is a constant (e.g. elliptic curve parameters), we can just ship in code. If N is not secret, division with side channels is fine. However, when using the Chinese remainder theorem to optimize RSA, we work modulo the secret primes. This means even Montgomery setup should be side-channel-resistant.
(Some specialized platforms may also not have a division circuit, or an implementation may want a single function for both public and secret moduli.)
One approach, described in Section 2.3 of this paper, is to iteratively double mod N. Modular addition only needs a single conditional subtraction to reduce the sum. Here is a sketch of an implementation in Go:
Sample code represents large non-negative integers as little-endian slices of 64-bit words in base 264. This assumes a platform where 64-bit multiplications are efficient. All input and output slices are the same size. Unless otherwise stated, slices do not need to be minimal. That is, 42 may be represented as {42, 0, 0, 0}
if working with four-word (256-bit) values.
// computeRR returns R^2 mod n. n must be odd and represented in // the minimum number of words. I.e. n[len(n)-1] must be non-zero. func computeRR(n []uint64) []uint64 { log2RR := 2 * 64 * len(n) // Start with 2^(nBits-1), which is already reduced. nLastBits := 1 for n[len(n)-1]>>nLastBits != 0 { nLastBits++ } nBits := 64*(len(n)-1) + nLastBits r := make([]uint64, len(n)) r[len(n)-1] = uint64(1) << (nLastBits - 1) // Double until we get to RR. for i := nBits - 1; i < log2RR; i++ { r = modAdd(r, r, n) } return r }
// add returns r = a + b, with a carry bit that is set on overflow. func add(a, b []uint64) (r []uint64, carry uint64) { r = make([]uint64, len(a)) carry = 0 for i := range a { r[i], carry = bits.Add64(a[i], b[i], carry) } return } // sub returns r = a - b, with a borrow bit that is set on underflow. func sub(a, b []uint64) (r []uint64, borrow uint64) { r = make([]uint64, len(a)) borrow = 0 for i := range a { r[i], borrow = bits.Sub64(a[i], b[i], borrow) } return } // conditional returns ifOne if cond is one and ifZero if cond is zero. func conditional(cond uint64, ifOne, ifZero []uint64) []uint64 { mask := uint64(0) - cond r := make([]uint64, len(ifOne)) for i := range ifOne { r[i] = (mask & ifOne[i]) | (^mask & ifZero[i]) } return r } // reduceOnce returns x mod n where x is the integer represented by // append(a, carry). x must be less than 2*n. func reduceOnce(a []uint64, carry uint64, n []uint64) []uint64 { r, borrow := sub(a, n) return conditional(borrow-carry, a, r) } // modAdd returns a + b mod n. a and b must be less than n. func modAdd(a, b, n []uint64) []uint64 { r, carry := add(a, b) return reduceOnce(r, carry, n) }
While this works, it is not very fast. For, say, a 1024-bit modulus (a prime factor of RSA-2048), we end up running modAdd
1025 times. Instead, we can just use our existing Montgomery multiplication routine.
is just R in Montgomery form, and for some r. (r is 64 * len(n)
in the sample code.) Suppose we exponentiate , all in Montgomery form, with square-and-multiply and Montgomery multiplication:
To bootstrap the process, and can be computed with the doubling trick above. We can improve on this a bit:
First, each multiply step in square-and-multiply is just doubling the value, so replace each multiply step with modular addition. This is faster and avoids needing to keep a copy of .
Next, each square step takes some to . This is equivalent to doubling k times. In the first few iterations of the algorithm, k is small, so doubling k times is faster. In later iterations, k is large, so a single Montgomery square is faster. Adding is linear and multiplication is quadratic, so we can expect the transition point for k to be roughly linear in the size, in words, of the modulus.
In BoringSSL, we found switching at and to be good estimates for 64-bit and 32-bit CPUs, respectively. This is particularly convenient because there are no multiply steps left at this point. Different transition points may be optimal for different implementations, however.
On 64-bit, this transition point gives:
The second step will take between and iterations, depending on . For a 1024-bit modulus, the total is 17 modular doubles and 6 Montgomery squares, compared to 1025 modular doubles with the doubling-only method.
Here is a sketch of an implementation in Go:
// computeRR returns R^2 mod n. n must be odd and represented in // the minimum number of words. I.e. n[len(n)-1] must be non-zero. // n0 must be -n[0]^-1 mod 2^64, one of values needed to set up // Montgomery reduction. See negInv in the support code. func computeRR(n []uint64, n0 uint64) []uint64 { log2R := 64 * len(n) // Start with 2^(nBits-1), which is already reduced. r := make([]uint64, len(n)) nLastBits := 1 for n[len(n)-1]>>nLastBits != 0 { nLastBits++ } nBits := 64*(len(n)-1) + nLastBits r[len(n)-1] = uint64(1) << (nLastBits - 1) // Double until we get to 2^len(n) * R. This takes // (65 - nLastBits) + len(n) iterations. for i := nBits - 1; i < log2R+len(n); i++ { r = modAdd(r, r, n) } // Montgomery square 6 times. // 2^((2^6)*len(n)) * R = 2^(64*len(n)) * R = 2^log2R * R = R * R for range 6 { r = montgomeryMul(r, r, n, n0) } return r }
// mulAdd adds a * b to r. The result is written to r, except // for the top-most word, which is returned. func mulAdd(r, a []uint64, b uint64) uint64 { var carry uint64 for i := range r { hi, lo := bits.Mul64(a[i], b) lo, carryBit := bits.Add64(lo, carry, 0) hi, _ = bits.Add64(hi, 0, carryBit) // Can't overflow r[i], carryBit = bits.Add64(r[i], lo, 0) hi, _ = bits.Add64(hi, 0, carryBit) // Can't overflow carry = hi } return carry } // negInv returns -n^-1 mod 2^64. n must be odd. func negInv(n uint64) uint64 { // See https://crypto.stackexchange.com/a/47496 and // https://bearssl.org/bigint.html#montgomery-reduction-and-multiplication // for why this works. r := n for range 6 { r *= 2 - n*r } return -r } // montgomeryMul returns the Montgomery multiplication of a and b // mod n. n0 must be negInv(n[0]) func montgomeryMul(a, b, n []uint64, n0 uint64) []uint64 { var carry uint64 t := make([]uint64, 2*len(n)) for i := range n { hi1 := mulAdd(t[i:i+len(n)], a, b[i]) hi2 := mulAdd(t[i:i+len(n)], n, n0*t[i]) t[i+len(n)], carry = bits.Add64(hi1, hi2, carry) } return reduceOnce(t[len(n):], carry, n) }