mod_1 and div_1

Wed May 5 22:20:01 CEST 2010

And speaking of Montgomery's mod_1 trick, I just realized that it *can*
be extended to compute the quotient too. To be concrete, consider the
division of a four limb number N = n_0 + n_1 B + n_2 B^2 + n_3 B^3 by a
single limb number d. For simplicity, assume d is in the range B/2 < d <
B.

Current code computes this as a loop around udiv_qrnnd_preinv, which
takes three iterations. A total of 3 umul, 3 umullo, in a fully
dependent chain.

Montgomery's trick, in its simplest form, computes N mod d as follows.
First precompute B' = B^2 mod d. Then we can write

  N = n_0 + B (n_1 + n_2 B + n_3 B') (mod d)

The expression in parentheses is a two limb number (for simplicity, I
ignore the possibility of overflows here and in the rest of this mail),
denote it n_1' + n_2' B. We now have

  N = n_0 + n_1' B + n_2' B^2 = n_0 + n_1' B + n_2' B' (mod d)

where the right hand side is is now a two limb number. Do one final
divide, using udiv_qrnnd_preinv,

  n_0 + n_1 B + n_2' B' = q_0 d + r

So we need one full umul per limb reduced, plus an additional umullo for
adjustment in the final udiv_qrnnd_preinv. So the total multiplication
count is three umul and one umullo.

What if we precompute the quotient as well? Let B^2 = d m + B' (m is
the familiar approximate reciprocal, and it's one limb plus an
additional one bit which can be represented implicitly). Then if we
substitute this for B^2 in the above computation to keep track of the
multiples of d, we get

  n_1 + n_2 B + n_3 B^2 = n_1 + n_2 B + n_3 B' + n_3 m d

and

  n = n_0 + n_1 B + n_2' B' + (n_2' m + n_3 m B) d
    = r + (q_0 + n_2' m + n_3 m B) d

So we get the quotient by two additional *independent* umul, and an
addition chain to combine the resulting double-limb products. One could
add these together in each iteration (but then one risks doing carry
propagation more than once), or store high and low limbs separately and
do a single mpn_add_n at the end).

For large n, we do two full umul per limb, but since the next iteration
depends on only one of them (the multiplication by B') we only have a
single multiplication latency to wait between iterations. So it could
potentially be twice as fast as the udiv_qrnnd_preinv loop.

Does this make sense? Since m is the good old approximative reciprocal,
we generate each quotient limb in more or less the same way as we get
each candidate quotient in the udiv_qrnnd_preinv loop. But then we don't
adjust the partial remainder multiplying the quotient candidate by d and
subtracting. Instead we compute a double-limb remainder by multiplying
with B' and add this to the less significant words. And then all
adjustments are postponed until the end (except we might have to handle
the overflow cases I ignore above).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.