%% revision Jan 10, 2005, RJF
%Manuscript number: TOMS-2003-0055
\documentclass{article}
\usepackage{fullpage}
\title{Can you save time in multiplying
polynomials by encoding them as integers? (Revised 2010)}
\author{Richard J. Fateman\\
University of California\\
Berkeley, CA 94720-1776}
\begin{document}
\maketitle
\begin{abstract}
Encoding a univariate polynomial as a long integer conceptually
unifies some processing tasks on polynomials with those on integers,
including multiplication. An evaluation of a polynomial at a single
point is sufficient to encode and decode all its coefficients if the point is
sufficiently larger than the coefficients. This is sometimes referred
to as ``Kronecker's trick''. We show that this idea has practical
implications for efficient programs.
\end{abstract}
\section{Introduction}
I was recently prompted\footnote{by a note from Daniel Lichtblau
concerning the use of GMP for multiplying polynomials in Mathematica
5.0. DL's note is in the appendix.} to re-examine the practical
possibilities of packing coefficients of polynomials into long
integers or ``bignums'' and using efficient integer library routines
to do polynomial multiplication.
{{This kind of transformation, evaluating a polynomial to map it to an
integer, is sometimes called ``Kronecker's trick'' (see von zur Gathen
and Gerhard \cite{vzg} section 8 for some references and history suggesting this
dates back to at least 1882). The essential equivalence of dense
polynomials and long integers is fairly widely-known in the symbolic
computation community, to the extent that in the first (1969) edition
of Knuth's {\em Art of Computer Programming,} volume 2 \cite{knuth} we
see ``polynomial arithmetic modulo $b$ is essentially identical to
multiple-precision arithmetic with radix $b$, except that all carries
are suppressed. ... In view of this strong analogy with
multiple-precision arithmetic, it is unnecessary to discuss polynomial
addition, subtraction, and multiplication any further in this section
[4.6.1].''}}
Yet during the early development of computer algebra systems starting
in the mid-1960s, these possibilities, if entertained at all, were
generally dismissed as of theoretical interest only. Circumstances
have changed. In particular we are provided with a competitive
environment for the deployment of high-quality (free) software
libraries that might relieve the computer algebra systems programmers
from many concerns about implementation of arithmetic on
multi-precision integers. Additionally, we are presented with far
more powerful and complicated computers.
In fact, the main practical motivation here for using this technique
is to provide fast implementation of a polynomial multiplication
{\em with less programming} by (re-)using
others' efficient software. Additionally, when we use (others') library programs,
we can benefit from their continued refinement as well as
adaptation to new and improved hardware.
Somewhat paradoxically, we reduce the
polynomial arithmetic problems to integer problems which are solved
(conceptually at least) by viewing the integers as polynomials.
This still leaves us a choice as to which of the elaborate
implementation libraries for bignum arithmetic to incorporate as a
library. We currently use GMP \cite{gmp} but could probably use
one of its forks (MPIR). In the past we have made use of a similar
package, ARPREC \cite{arprec}, but this seems to have fallen into
relative disuse. NTL \cite{shoup} was another library, but its author
seems to recommend linking it to GMP for best performance.
GMP seems to have an enthusiastic and substantial effort mounted to
optimize it for different hardware architectures, as well as for subtle
variations within an architecture. A major concern is
choosing the best long-integer multiplication algorithm
(from classical to FFT, with variations between), including for
inputs of different (e.g. unbalanced) sizes.
Using more than one ``core'' CPU may also be of interest\footnote{In our experience with GMP we found it is not always
possible to install the latest, most-specific and efficient version
on a target machine, but we suspect that a slightly more generic
version, pre-compiled for a class of similar machines
and available with less effort, may be nearly as efficient.}.
%% DL claims a factor of 2 in WRI tests.
Given such a standard library, we can view a system for computer
algebra polynomial manipulation as one that is initially machine
independent and stated at a high level in a common language, without
concern for the underlying computer architecture. Starting from this
perspective we can achieve efficiency on any problem {\em for each
machine} because -- once turned over to the corresponding bignum
package -- the principal operations are effectively {\em optimized for
that architecture by the authors of GMP}.
Just to be clear, let us ask and then answer this
obvious question: ``Would it be better to directly write an FFT
program for fast polynomial multiplication, instead of using this
route?'' The answer: In principle, yes. In practice, probably not.
We've written such programs; frankly it is laborious to get all the
details right, and an endless task to continue to refine such programs
in the face of changes in computer architecture and memory
implementation. We are also sure that our programs are not the best
possible, even for the architecture we have used for our tests:
certainly they are not targeted to specific {\em other}
architectures. They are not as good as the GMP versions. Shoup in his
NTL system \cite{shoup} {\em does} do this hard work; our technique
consequently is likely to be a retrograde step for NTL. (This is
confirmed by our simple benchmark.) Perhaps we can find a library FFT
routine that is not only right, but has been constructed with the same
obsessive attention to detail and efficiency that is seen in GMP? We
have in fact experimented somewhat successfully with FFTW \cite{FFTW}.
Such an FFT performed on fixed-size ``floating-point'' words is quite
fast (and correct) in certain examples: when the result coefficients
are somewhat smaller than the floating-point fraction (significand).
However, if the coefficients are much smaller, then the Kronecker trick,
by packing several per word, may win. For example,
computing with 8 coefficients packed into a single 64-bit word clearly has
the potential to speed up calculations over using 8 separate 64-bit words.
Intel's MMX architecture
provides a view of the same potentiality with applications in signal
processing and graphics. Vector or array super-scalar operations on
some computers may make the argument plausible for larger assemblages,
say if one can add two vectors of 128 words at once.
Yet another technology argument for this potentially more compact
encoding: if the version of the polynomial as an integer
keeps the total size within some cache limit,
it may provide a substantial performance gain.
Since the encodings of polynomials require an initial scan of the
data to compute a bound on coefficient sizes prior to packing into
an array, the behavior of this scan with respect to caches should
be incorporated in any detailed analysis.
%reworked from ref #1
\section{Multiplication}
Take the task of multiplying two polynomials in a single
variable with non-negative integer coefficients. For simplicity of
exposition we will talk about multiplying a polynomial
by itself, so as to require encoding only one input,
but the method is not limited to squaring.
Consider the problem of squaring $p= 34x^3+56x^2+78x+90$.
The result is this polynomial:
% 1156*X^6+3808*X^5+8440*X^4+14856*X^3+16164*X^2+14040*X+8100
$$ 1156\,x^{6}+3808\,x^{5}+8440\,x^{4}+14856\,x^{3}+16164\,x^{2}+%
14040\,x+8100 $$
Now take $p$ evaluated at $x=10^5$ and square this integer to
get $1156038080844014856161641404008100$. One can just read off
the coefficients in the answer, modulo $10^5$, $10^{10}$ etc.
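This worked example is easy to check mechanically. The following sketch (in Python, whose integers are arbitrary-precision; the variable names are ours) packs $p$ at $x=10^5$, squares once, and reads the coefficients back off:

```python
# Square p = 34x^3 + 56x^2 + 78x + 90 via Kronecker's trick, base 10^5.
coeffs = [90, 78, 56, 34]          # lowest degree first
v = 10**5                          # the evaluation point (padding)
k = sum(c * v**i for i, c in enumerate(coeffs))   # k = p(10^5)
r = k * k                          # one big-integer multiplication
out = []                           # read coefficients off modulo v, v^2, ...
while r:
    r, c = divmod(r, v)
    out.append(c)
print(out)   # [8100, 14040, 16164, 14856, 8440, 3808, 1156]
```

The list printed is exactly the coefficient sequence of the product polynomial, lowest degree first.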
If we choose a power of two instead of a power of 10, and if the
bignum representation is binary or some power of two (say $2^{32}$),
then evaluation into an integer can be done by shifting and masking:
packing the coefficients padded by zeros. Unpacking the coefficients
is similarly a matter of shifting and masking.
{\em Doing this efficiently requires attention\footnote{But not
necessarily attention by US!} to nasty details of the representation
of bignums!} Nevertheless, ignoring efficiency, it is possible to
state the process very simply as arithmetic, as shown below.
We seem to be done with the task, except for figuring out how much
padding is necessary.
An exact bound could be computed by multiplying the two polynomials, but
that is exactly the work we hope to avoid.
%Given a polynomial $p$ define its abs-norm $N_1(p)$ to be the sum of
%its coefficients\footnote{In general we would take absolute values, but at
%this point we allow only non-negative coefficients anyway.}.
%The largest coefficient of a
%product of two polynomials $p$ and $q$ is bounded by $N_1(p)\times N_1(q)$.
%This bound is reached if $p$ and $q$ are simply constants.
If we define max-norm $N(p)$ to be the maximum of the absolute values
of the coefficients in a polynomial $p$, and $d_p$ to be the degree of
$p$ in the variable $x$, then the largest coefficient in $p\times q$
is bounded by $(1 + \min (d_p,d_q))\times N(p)\times N(q)$.
[Sketch of proof: if $p=\sum{a_ix^i}$ and $q=\sum{b_kx^k}$ then the
largest coefficients in $r=p\times q$ for fixed $N(p)$ and $N(q)$
occur when $a_0=a_1=\cdots=N(p)$ and $b_0=b_1=\cdots=N(q)$. We can
factor out $N(p)$ and $N(q)$ simplifying the question then to what is
the largest coefficient in a product of two polynomials with
coefficients all equal to 1? The coefficient of $x^k$ in $r$
is $\sum_{i+j=k}{a_ib_{j}}.$ There are at most $(1 + \min (d_p,d_q))$
terms in such a sum, since you run out of $a_i$ or $b_j$ terms beyond
that point.]
This bound is tight (i.e. achievable in the case of all equal
coefficients), but we expect that for some applications of polynomials
over the integers the coefficients will often be much smaller than the
computed bound, and there will be substantial over-padding. For
applications of polynomials with coefficients in a finite field the
padding may be just adequate.
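As a quick sanity check of the bound and its tightness (a sketch; the helper name is ours), multiply two all-ones polynomials and inspect the largest product coefficient:

```python
# Check that the bound (1 + min(dp, dq)) * N(p) * N(q) is attained
# when every coefficient equals the norm (here, all ones).
def poly_mul_naive(p, q):
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

p = [1] * 5    # degree 4, N(p) = 1
q = [1] * 8    # degree 7, N(q) = 1
r = poly_mul_naive(p, q)
print(max(r))  # 5, which equals (1 + min(4, 7)) * 1 * 1
```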
After computing this bound $B$, assuming we have an underlying
binary representation, we choose the smallest $n$ such that $2^n>B$,
pack, multiply, and unpack.
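The whole pipeline, using the bound above and binary shifting and masking, can be sketched as follows (Python; the function name is ours, and this handles only non-negative coefficients):

```python
# Kronecker multiplication of polynomials with non-negative integer
# coefficients, given as lists, lowest degree first.
def kron_mul(p, q):
    # Coefficient bound (1 + min(dp, dq)) * N(p) * N(q); note len = degree + 1.
    B = min(len(p), len(q)) * max(p) * max(q)
    n = B.bit_length()                 # smallest n with 2^n > B
    kp = sum(c << (n * i) for i, c in enumerate(p))   # pack by shifting
    kq = sum(c << (n * i) for i, c in enumerate(q))
    kr = kp * kq                       # the single big-integer multiply
    mask = (1 << n) - 1
    out = []
    while kr:                          # unpack by masking and shifting
        out.append(kr & mask)
        kr >>= n
    return out

print(kron_mul([90, 78, 56, 34], [90, 78, 56, 34]))
# [8100, 14040, 16164, 14856, 8440, 3808, 1156]
```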
\section {Extensions}
\subsection{Negative Coefficients}
Assume the leading coefficient is positive. (This is not a
restriction because we can convert the problem to the negation of the
polynomial and then change its sign before returning the result.) We
will compute using a kind of complement notation. Consider again our
decimal example, but with some negative signs: $34x^3-56x^2+78x-90.$
To evaluate at $10^5$ we rewrite this polynomial as $33*(10^5)^3
~+~(10^5-56)*(10^5)^2 ~+~77(10^5) ~+~(10^5-90)$ where we construct
each of the negative numbers by ``borrowing'' $10^5$ from the next
higher digit. If that digit is itself negative, we borrow
again. There is no change in converting to the integer representation;
converting back simply follows the rewriting shown above. Assuming we
use a binary base, we should choose an $n$ such that $2^n>2B$ so that
we can get both positive and negative coefficients reconstituted. That is, we change
the padding, bumping it up by a factor of two (one bit more), and if
the value (remainder) extracted is more than half the padding, it is a
negative number and should be adjusted. It is sometimes suggested
that negative coefficients could be dealt with by splitting polynomials
into positive and negative parts (\cite{vzg}, section 8.4). Using complement notation
is much simpler.
The description as a program to do all this can be quite short, and perhaps
easier to understand than our description above.
\begin{verbatim}
/* this is Macsyma Code */
/* convert the polynomial p(x) to an integer, with padding v */
tobig(p, x, v) := subst(v, x, p);
/* notation:
[q,r]:divide(i,v) sets q to quotient, r to remainder of i by v */
frombig (i,x,v):= if (i=0) then 0
else if (i<0) then -frombig(-i,x,v)
else block([q,r],
[q,r]:divide(i,v),
if r>v/2 then r-v+x*frombig(q+1,x,v)
else r +x*frombig(q,x,v));
\end{verbatim}
e.g. if $p= 34 * x^3 + 56 * x^2 + 78 * x + 90,$
$h= {\rm tobig}(p,x,10^6)$ gives 34000056000078000090
and ${\rm frombig}(h,x,10^6)$ gives
$x * (x * (34 * x + 56) + 78) + 90$ which is equivalent to $p$.
Actually, creating this polynomial may not be as useful as
creating a list of the coefficients, which is what this next
program does. The first coefficient in the list is of highest degree.
We have also rewritten this to use iteration rather than recursion.
\begin{verbatim}
frombigL(i,v):=block([ans:[],q,r,signum:if (i<0) then -1 else 1],
i:i*signum,
[q,r]:divide(i,v),
while (i>0) do
(if r>v/2 then (push(signum*(r-v),ans),
i:q+1)
else (push(signum*r,ans),
i:q),
[q,r]:divide(i,v)),
ans);
\end{verbatim}
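For readers without Macsyma, here is a transcription of the decoder (and a companion encoder) in Python; the names are ours, and unlike {\tt frombigL} this returns the coefficients lowest degree first:

```python
# Recover signed coefficients (lowest degree first) from an integer
# encoding with padding v, using the complement/borrow rule above.
def from_big(i, v):
    sign = -1 if i < 0 else 1
    i *= sign
    coeffs = []
    while i > 0:
        q, r = divmod(i, v)
        if r > v // 2:                # complement notation: a negative digit
            coeffs.append(sign * (r - v))
            i = q + 1                 # propagate the "borrow"
        else:
            coeffs.append(sign * r)
            i = q
    return coeffs

def to_big(coeffs, v):                # encode by evaluation at v
    return sum(c * v**k for k, c in enumerate(coeffs))

h = to_big([-90, 78, -56, 34], 10**6)   # 34x^3 - 56x^2 + 78x - 90
print(from_big(h, 10**6))               # [-90, 78, -56, 34]
```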
\subsection{More Than One Variable}
It is possible to use the same trick recursively. That is, consider a
polynomial $p$ in main variable $y$ whose coefficients are polynomials
in variable $x$. That is, $p= \sum_{0\le i\le n}q_i(x)y^i$. Then $p$
can be encoded by finding some sufficient uniform padding for {\em
ALL} the coefficients within the $\{q_i\}$ to maintain their separation,
as well as another padding to maintain the separation between these
%$\{q_i\}$. This is not particularly appealing encoding
$\{q_i\}$. This is not a particularly appealing encoding
unless the polynomials $\{q_i\}$ are dense and of uniform degree.
\subsection{Other Operations}
It is possible to do more operations on the bignum form as long as the
number remains a faithful representation of the polynomial with
respect to the operation. Clearly we must avoid any ``carry'' from one
coefficient to another. We can try guarding these coefficients more
carefully by some heuristic computation for extra padding in the case
where we do not have a convenient bound, and test to see if our result
is consistent. One approach is to replace $x$ by $y^2$ and thus our
true polynomials have only even powers of the variable; any time there
is a non-zero odd coefficient, there has been some ``leakage''.
Addition is easily supported. If, after doing an add, there are any
non-zero coefficients of the odd powers, we have detected an invalid
operation. (This idea of having guard space between fields packed
into a single word was used in the ALTRAN system of the 1960's. In
their design only one bit between fields was allocated, since that
is all that is necessary.) For other
operations we must also make sure that the absence of odd powers
guarantees a valid operation, as is the case with addition of two such
polynomials.
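The guard-slot idea can be sketched directly on the packed representation: place coefficient $i$ at bit offset $2in$ (the $x \to y^2$ substitution) and verify that every odd $n$-bit field remains zero. This is our illustrative sketch, with a heuristically guessed field width:

```python
# Encode p(x) as p(y^2): coefficient i goes at bit offset 2*i*n,
# leaving every odd n-bit field as guard space.
def pack_guarded(coeffs, n):
    return sum(c << (2 * i * n) for i, c in enumerate(coeffs))

def unpack_checked(k, n):
    mask = (1 << n) - 1
    out = []
    while k:
        c = k & mask
        k >>= n
        g = k & mask                  # the guard (odd) field
        k >>= n
        if g != 0:                    # leakage: a carry escaped its field
            raise OverflowError("guard field nonzero; padding too small")
        out.append(c)
    return out

n = 8                                 # 8-bit fields, guessed heuristically
a = pack_guarded([200, 100, 3], n)
b = pack_guarded([50, 7, 1], n)
print(unpack_checked(a + b, n))       # [250, 107, 4]: no carries, so valid
```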
There are other operations that can be reduced to integer operations
{\em at least sometimes} or whose results can be computed with
high probability, and then tested. The most prominent is the polynomial
greatest common divisor (GCD) calculation which is notoriously
expensive and frequently needed in simplifying rational expressions.
There is a literature on the technique here, called the Heuristic GCD
(GCDHeu); it is used in the Maple and MuPAD computer algebra systems
\cite{gcdheu,schon88}.
The checking to make sure the (heuristic) integer
answer really maps to the correct polynomial is non-trivial, and a
complete GCDHeu algorithm must consider that the procedure may require
more padding than suggested by the initial heuristic, and that it may
still fail and that another GCD algorithm entirely should be used.
(Testing of the result is included in the appendix; fallback algorithms are not.)
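The core evaluate/gcd/decode step, stripped of the confirmation and retry machinery a complete GCDHeu needs, can be sketched as follows (our simplification, not the full algorithm of \cite{gcdheu}; names are ours):

```python
from math import gcd

# GCDHeu core sketch: evaluate both inputs at a large point v, take the
# integer gcd, decode with the signed-digit (complement) rule. The
# decoded candidate must still be verified by trial division.
def decode_signed(i, v):
    coeffs = []
    while i > 0:
        q, r = divmod(i, v)
        if r > v // 2:
            coeffs.append(r - v)
            i = q + 1
        else:
            coeffs.append(r)
            i = q
    return coeffs

def gcd_heu_candidate(p, q, v):
    pv = sum(c * v**k for k, c in enumerate(p))
    qv = sum(c * v**k for k, c in enumerate(q))
    return decode_signed(gcd(pv, qv), v)

# p = (x+2)(x+1) and q = (x+2)(x+3); the candidate decodes to x + 2.
print(gcd_heu_candidate([2, 3, 1], [6, 5, 1], 2**64))   # [2, 1]
```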
There are other suggestions for using this kind of Kronecker evaluation
in the reduction of multivariable polynomial factoring to eliminate
one or more variables (e.g. to bivariate \cite{ek82}).
An important difference between our proposal here and that in the
GCDHeu or factoring approach is that we advocate rapid encoding of coefficients
via shifting and masking of binary quantities, not evaluation of the
polynomial at some ``randomized'' point\footnote{A starting evaluation point
for at least one version of GCDHeu is 29 + twice the minimum of the
maximum absolute values of the coefficients in the input, hardly
likely to be a power of two.}.
%This represents a significant difference in implementation efficiency.
In some earlier experiments \cite{liao}, we found areas where GCDHeu
was superior to other tested algorithms, even though those tests
used the built-in Lisp (Allegro CL) bignum package (classical arithmetic) and were
therefore not especially speedy compared to what we know we can do with GMP.
Using a basic arithmetic suite that includes
{\em much} speedier code for much larger sizes should make GCDHeu competitive
for somewhat larger problems than shown in \cite{liao}, since the integer
GCD calculation can be a major portion of the expense.
More recently, von zur Gathen and Gerhard \cite{vzg} (in figure 6.4)
show tests that using a power-of-two evaluation point for GCDHeu is
(slightly) faster on their randomly generated examples compared to
other points. They do not discuss the reasons for this, or the
probability of the GCDHeu heuristic failing in this case.
It is beyond the scope of this paper to exhaustively test a
power-of-two GCDHeu algorithm, but worth asking a feasibility question
concerning how fast we can calculate the GCDHeu. Here is one data
point: consider two polynomials $p$ and $q$, each of degree 2000 with
random coefficients bounded by 10,000. To compute the correct GCD of
$p\times q$ and $p$ (the GCD is $p$) took 78 ms including encoding and
decoding. To compute the product of $p$ and $q$ took 31 ms using GMP
encoding. To compute the product $p\times q$ by ``regular'' arithmetic
took 343 ms.
% added
The full cost of the GCDHeu would have to include additional
confirmation steps which might dominate the time quoted, so this is a
``most favorable'' comparison. Yet the conventional polynomial
subresultant sequence GCD algorithm took 245 seconds, or over 3,000
times longer, using programs from an earlier comparison \cite{liao},
leaving considerable room for testing or re-performing the heuristic
GCD with different settings. (Arguably this instance of GCD
calculation is implausibly large for most uses of a CAS, although much
emphasis appears to be made of far larger examples in von zur Gathen
and Gerhard \cite{vzg} section 6.13.)
{Further analysis of the GCDHeu idea and
extensions may be found, for example, in a paper
by Geddes, Gonnet and Smedley \cite{ggs}. Although the cited
paper is primarily concerned with GCD, it includes an algorithm for
polynomial trial division. Also see Davenport/Padgett \cite{dp},
%added Jan 18,2006
and Geddes {\em et al.} \cite{gcl} section 7.7, theorem 7.7,
where single-point evaluation and interpolation are discussed and the
bound on an evaluation point is defended; the
emphasis is, however, on using a much smaller evaluation point as
a heuristic.
}
Other possible compound operations can be performed, with
suitable checks for validity, on point-evaluations and single-point
``interpolation'' to integers. For some operations, validity is not
guaranteed by the mere absence of carries. In some cases it appears
that the cost of the checks may exceed the cost of doing the job in
the first place. We include in the category of ``possible, but be
careful'' the evaluation of determinants of polynomials, because
algorithms of an iterative nature with many
sequential additions and multiplications can be translated into
integer operations. It remains to determine if we can
succeed by finding a suitable ``tight'' padding for the whole sequence
of operations, or a probable testable padding.
Another notion we have explored briefly is an application to
factorization of an integer $I$. Consider ``guessing'' the padding
$p$, and convert $I$ to a polynomial $q(x)$. Using any of the popular
techniques for polynomial factorization, try to factor $q$. If there
are factors, these can be immediately mapped to integers. Thus $I=
19624459 = 4409 \times 4451$ can be factored by guessing some padding,
say 4403, which produces a polynomial which factors as
$(x+6)(x+48)$. Any guess between 4379 and 4481 will produce a useful
factorization. Note that $4429$, a good guess, is near $\sqrt{I}$. There
are many reasons for this method not to work in general. If
there are two large prime factors of substantially different sizes,
the heuristic of choosing $\sqrt{I}$ does not work. We have not
explored the relationship of this heuristic to current integer
factorization algorithms.
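The worked instance can be reproduced without a polynomial factorizer: the sketch below converts $I$ to base-4403 digits and verifies that $(x+6)(x+48)$, evaluated at 4403, recovers the factors of $I$ (we take the quadratic's factorization as given rather than compute it):

```python
# Convert I to base-4403 digits, giving the polynomial x^2 + 54x + 288,
# and check that its factorization (x+6)(x+48) maps back to factors of I.
# (The polynomial factorization itself is taken as given, not computed.)
I = 19624459
v = 4403                      # the guessed padding, near sqrt(I)
digits = []
n = I
while n:
    n, d = divmod(n, v)
    digits.append(d)
print(digits)                 # [288, 54, 1], i.e. x^2 + 54x + 288
print((v + 6) * (v + 48) == I, v + 6, v + 48)   # True 4409 4451
```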
%% find references ??
\section{More Compact Encodings}
{David Harvey \cite{harvey} has explored an alternative, of knowingly
choosing insufficient padding and performing the multiplication two or
four different ways in essence by choosing several evaluation
points. Consider padding with a smaller $n$, as we have done with
$2^n$, but also $2^{-n}$, shifted to allow for integer
coefficients. This gives us the same coefficients but in reverse
order. Or we may use $-2^n$, which alternates signs. If we can unpack and
disentangle possible overlaps efficiently (Harvey shows that we can)
there are regions (ranges of input sizes) in which there is an advantage:
when the inputs are neither too small nor very large. This essentially
takes advantage of the observation that, with the guaranteed padding
bound, one may be wasting time since many of the digits (whole words of
them!) will consist of zeros. Harvey's alternative gives us a chance
to pack them tighter and use a clever unpacking to disentangle the
overlaps. Note that for very large examples where GMP uses FFTs,
Harvey's benchmarks (see figures 1,2 \cite{harvey}) illustrate there
is not much advantage. For small examples the overhead of packing and
unpacking swamps any improvements. A factor of about 2 is achieved
in a range of polynomial degrees between 100 and 5000 for coefficients
in a 48-bit finite field.}
\section{Finite Field Operations}
A particularly simple way of taking advantage of the
mapping to bignums is available if the arithmetic is to be
performed in a finite computation structure (usually a finite
field, but weaker conditions could be supported; no inverses
are needed, for example).
In the useful case of univariate polynomials over $GF(P)$, the input
coefficients might as well be integers $k$ with $0 \le k < P $ for
some odd prime $P$. This is particularly appealing if $P$ is less,
perhaps much less, than the maximum single-word integer, as suggested
previously.
Some caution is needed: Even though
we know the coefficients can all be reduced modulo $P$, the bignum
representation needs padding. In particular, for multiplication we
must allow for a maximum coefficient size of $(1 + \min
(d_p,d_q))\times P^2$, and the translation back to conventional
polynomial form must do a modulo $P$ reduction for each coefficient.
We know of no especially quick method here; it seems to
require unpacking and computing a remainder, then re-packing coefficients.
However, knowing the domain of computation, with a finite field
we do not have to look at the coefficients to find their maximum. This
saves a linear-time scan through the memory storing the polynomial.
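Putting the pieces together for $GF(P)$ (a sketch; the function name is ours): the field width comes from $P$ and the degrees alone, with no scan of the coefficients, and each unpacked coefficient is reduced modulo $P$ on the way out.

```python
# Multiply two polynomials over GF(P) (coefficient lists, lowest degree
# first, entries already reduced mod P) via Kronecker substitution.
def gfp_mul(p, q, P):
    B = min(len(p), len(q)) * P * P    # bound: (1 + min(dp, dq)) * P^2
    n = B.bit_length()                 # field width, 2^n > B; no max-scan
    kp = sum(c << (n * i) for i, c in enumerate(p))
    kq = sum(c << (n * i) for i, c in enumerate(q))
    kr = kp * kq
    mask = (1 << n) - 1
    out = []
    for _ in range(len(p) + len(q) - 1):
        out.append((kr & mask) % P)    # unpack, then reduce each coefficient
        kr >>= n
    return out

# (3x + 2)(4x + 5) = 12x^2 + 23x + 10, which is 5x^2 + 2x + 3 mod 7
print(gfp_mul([2, 3], [5, 4], 7))      # [3, 2, 5]
```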
In such finite field or modular operations
it is possible to take one or several such operations and through
an interpolation or lifting operation, obtain results in a larger
computation structure \cite{vzg}.
\section{Benchmarks}
It is not our intention in this paper to exhaustively explore all
possible ranges of coefficients and degrees and to compare to all possible
other implementations. {\bf This would be a task that\footnote{In spite of
some referees' suggestions to the contrary} is simply not worthwhile.} It would
require comparison to systems that are obscure and special-purpose, that
may or may not be available for convenient use by the author or
by the reader, and whose characteristics may
change without notice. Indeed some will certainly continue to improve, and
others may cease to function as particular computers and operating systems
go out of fashion. We have considerable experience with attempting to
publish such figures in a reproducible fashion, and collect comments such
as ``Your comparison was done with version X, but you must
use version X+1 which is faster; also what about system XYZ (unknown except
to the referee!)''
Instead we wish to establish a feasibility criterion. Does this technique
work for a sufficiently interesting example such that implementations
should be considered for other systems? That is, based on the information
here, should the authors of existing systems X, Y, or Z, look at the idea
here?
We choose as our first point of comparison for polynomial multiplication speed,
a fixed large polynomial, say $p=(x+1)^{1000}$ and translate it to a
single bignum $k$ with suitable padding. Then we compute (and time) the
multiplication $r=k \times k$. Compare that to the time to
directly multiply $p$ by itself. Another comparison would be to
multiply polynomials of substantially
different sizes, say $p^3\times p$.
This test is unfair to the conventional representation (CR) because
the CR is far more versatile. CR allows
\begin{enumerate}
\item more than one variable, stored sparsely.
\item wide variation in the size of the coefficients, without much loss in efficiency. That
is, some may be individually bignums, others zero.
\item the result of the operation (typically multiplication) to be
immediately presented in the ``normal'' form which does not have to be
decoded or re-converted to a representation with more (or possibly
less) padding for a subsequent operation.
\end{enumerate}
In fact, the notion of having to change
representation depending upon what you are going to do subsequently
with the object is discomforting. How would you feel about a
representation for a linked list which had to be entirely recomputed
depending upon how many new links would be appended to it? (We are
willing to manage some discomfort for a major speed improvement, and
if we can encode/decode fast. To be more precise, the encoding time
is linear in the input size and the decoding time is linear in the output size.
If we can make a major improvement in the computation time, say from an
$O(N^2)$ algorithm to $O(N \log N)$, in theory this presents
the prospect of an improvement. Of course benchmarks may show that this
is not actually worthwhile in practice.)
Thus we propose to eliminate the last of these objections
in a second version where we include in the bignum algorithm the cost of
encoding and decoding operations. This requires looking for the
maximum coefficient in each input and finding the appropriate bound for
the product.
Yet this version is somewhat unfair to the bignum program performance,
since we can do multiple operations in the encoding. It would be grossly
unreasonable if we used the simple-minded programs given above in Macsyma's top-level
language. Even if the conversions are compiled into Lisp, as is possible
from the Macsyma language, they are far
from as efficient as possible. As indicated earlier, we have written a better program
for encoding ({\tt tobig}) that can use bignum shifting or masking and not
really ``evaluate'' a polynomial. A better program for decoding ({\tt
frombig}) can use bignum operations equivalent to division with
remainder, but really just multi-word shift and logical mask
operations. As a result, the coefficients would presumably be laid out
in a convenient data structure of an array or list.
In our earlier draft of this paper, we discussed and presented what we
believed to be an efficient version of these transformations; a challenge from
Bill Hart, who is currently (2010) maintaining FLINT \cite{flint},
showed that we were not really so efficient! In fact both FLINT and
our programs were essentially encoding polynomials to present to
the same library (GMP) to multiply as bignums, and decoding them.
If we were using the same version of GMP, the ``middle'' of the algorithm
would be the same and thus a comparison would be in the programming
skills (and choice of language!) for packing and unpacking.
One consideration is that Hart is using C and our programs were in Lisp\footnote{
An alternative, though, is for our Lisp program to call FLINT to do the packing
and unpacking.}.
An observation: our original programs for interfacing Lisp with GMP were adequate
for most of our experiments because in all cases we converted numbers to GMP
form and did not convert them back to ordinary numbers in Lisp until they were
to be printed. This meant that any inefficiency in conversion was masked by
the output costs. This Kronecker encoding/decoding must be done repeatedly within
the computation -- to extract coefficients, to present the results in conventional
form. This presents a much higher demand on efficiency. The old programs
could be lackadaisical about (for example) using GMP programs to produce character-string
versions of decimal numbers. The new programs access the machine-representation
of GMP integers extracting ``limbs'' (the 32-bit sections of multi-precision numbers).
This requires an implementation-dependent (non-ANSI Lisp) program. Even so, our
new Lisp programs have some inefficiencies because the version we use has a different
integer representation. Other Lisps that have adopted GMP directly may be more
efficient in some regards, though just using GMP through a ``functional programming''
veneer may not be as efficient as using the raw GMP, as we do.
A note for implementors:
If it is to be as general as the Macsyma program given here,
the polynomial encoding/decoding program must be prepared to work on
coefficients {\em which are themselves bignums}, and so the masking
and shifting can't be written trivially in a language that has only
fixed-length arithmetic (such as C or Fortran). This
subtlety can be avoided if the polynomials are multiplied in a finite
field whose modulus is substantially less than a word-size. There are
such applications, particularly in factoring of polynomials, where a
simple C program might perform the task and such a speedup is
potentially worthwhile.
A final consideration is the issue of multiplying polynomials in the
``normal'' $O(nm)$ way, where $n$ and $m$ are the polynomial degrees
of each input. Is this just too easy a target for improvement? Is it
easy to improve this without going into this proposed encoding? In
fact, for a large enough polynomial multiplication problem we can run
faster with a modicum of additional coding using a ``divide and
conquer'' or ``Karatsuba'' style multiplication. This has asymptotic
growth $O(n^{1.585})$, where $n$ is the maximum degree of the inputs,
but bookkeeping costs make it more expensive for smaller polynomials.
(The asymptotically faster FFT requires more than a modicum of coding.)
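For concreteness, here is a minimal polynomial Karatsuba (our sketch, in Python; real implementations tune the cutoff below which naive $O(nm)$ multiplication wins):

```python
# Minimal Karatsuba for coefficient lists, lowest degree first.
def naive_mul(p, q):
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def padd(a, b):   # add coefficient lists of possibly different lengths
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(max(len(a), len(b)))]

def psub(a, b):
    return [(a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0)
            for i in range(max(len(a), len(b)))]

def kara_mul(p, q):
    if min(len(p), len(q)) <= 4:       # cutoff: bookkeeping beats savings
        return naive_mul(p, q)
    m = min(len(p), len(q)) // 2       # split: p = p0 + x^m * p1
    p0, p1 = p[:m], p[m:]
    q0, q1 = q[:m], q[m:]
    z0 = kara_mul(p0, q0)
    z2 = kara_mul(p1, q1)
    z1 = psub(kara_mul(padd(p0, p1), padd(q0, q1)), padd(z0, z2))
    r = [0] * (len(p) + len(q) - 1)    # recombine the three half-products
    for i, c in enumerate(z0): r[i] += c
    for i, c in enumerate(z1): r[i + m] += c
    for i, c in enumerate(z2): r[i + 2 * m] += c
    return r

print(kara_mul([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]))
# [6, 17, 32, 50, 70, 91, 70, 50, 32, 17, 6]
```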
%2005
(2005) Consider again the problem of multiplying $p \times p$. On our current
desktop computer\footnote{2.6GHz Pentium 4, Allegro CL 6.2}, we can
perform this calculation in 23.6 seconds. Using a Karatsuba version,
it takes as little as 3.7 seconds. Converting this problem to a GMP
number takes 63 milliseconds. For each of two similar-sized inputs,
this would be 126 ms\footnote{This conversion is written in Lisp using
GMP arithmetic. It might be done even faster by coding directly on the
GMP number representation, closer to the assembler
level.}. The multiplication itself takes only 485 ms, producing a
number with 31,251 words. The reverse transformation, also phrased in

compiled Lisp but consisting mostly of calls to special GMP {\em
division and remainder by powers of 2}, (implemented as shifts) takes
an additional 186ms, for a total time of 0.8 seconds. This total then
can be compared to 3.7 seconds using Karatsuba.
%2010
Consider again the multiplication $p \times p$. On our current (2010)
desktop computer (2.99GHz Pentium 4, GMP 4.1.2, Allegro CL 8.0), with
better packing/unpacking programs than previously, we can encode the
polynomial $p$ as a GMP number $K$ with padding 2002 in 53 ms. (We
would do this twice, since we are oblivious to the fact that the two
inputs are the same.)
Squaring $K$ takes 109 ms, resulting in a GMP integer requiring
125,126 32-bit words. Unpacking $K^2$ takes 200 ms, for a total of 0.415 sec.
A Karatsuba version of polynomial multiplication, but with Lisp
bignum arithmetic for the coefficients, takes 3.28 sec., and
conventional naive multiplication takes 12.9 seconds. Kronecker beats
the next best by about 8X.
One alteration to the technique here is to use GMP bignum arithmetic
on the coefficients together with either naive or Karatsuba multiplication.
Note that the largest coefficient in $(x+1)^{1000}$ is 995 bits long, and
so a relatively slow bignum package hampers the native Lisp program.
In fact, rewriting the Lisp programs to always use GMP coefficients
reduces the time from 12.9 seconds cited earlier, to 2.19 seconds.
For the Karatsuba version, the time is reduced from 3.28 seconds to 0.72 seconds,
switching to naive multiplication at size 50.
Kronecker still wins, but only by a factor of 1.7X over Karatsuba.
Let us modify the problem slightly to eliminate most of the cost related
to the coefficient size; this shows Kronecker to considerable advantage.
Indeed, all the programs become much faster (up to an astounding 300X!).
Here is the experiment: we change $p$ to be a polynomial with all coefficients equal to 1.
How long does it take to square $p$ (including, in the case of Kronecker, all packing and unpacking)?
Note that the necessary padding is not 2001, but 11 bits, so several coefficients
can be packed in one word (word sizes might be 32 or 64 bits, or perhaps a few bits less
if there is some memory sacrificed for tags).
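One way to compute the needed slot width is the bound our padding programs use: the largest product coefficient is at most $(1+\min(\deg p,\deg q))\cdot\mathrm{maxcoef}(p)\cdot\mathrm{maxcoef}(q)$, plus one guard bit. A Python sketch (names ours; our Lisp {\tt padpower2} uses a slightly coarser sum of integer lengths):

```python
def pad_bits(p, q):
    """Slot width (in bits) sufficient for multiplying p by q:
    bound the largest column sum, then add one guard bit."""
    bound = min(len(p), len(q)) * max(map(abs, p)) * max(map(abs, q))
    return bound.bit_length() + 1

ones = [1] * 1001            # the all-ones polynomial of degree 1000
bits = pad_bits(ones, ones)  # bound is 1001, so 11 bits suffice

# round trip: pack, square as an integer, unpack
K = 0
for c in reversed(ones):
    K = (K << bits) | c
K2 = K * K
mask = (1 << bits) - 1
sq = [(K2 >> (bits * i)) & mask for i in range(2 * len(ones) - 1)]
# the middle coefficient of (1 + x + ... + x^1000)^2 is 1001
```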
Squaring $p$ can now be done in 5.9 ms rather than 415 ms with Kronecker.
It can be done in 43 ms rather than 12,900 ms using naive multiplication.
It can be done in 33 ms rather than 3,280 ms using Karatsuba multiplication.
Kronecker beats the next best by about 5.6X for small coefficients.
Note that if all of the coefficients are ones, but they are GMP ones, then the times
are considerably {\em longer} in our tests. That is, the native Lisp integer representation has
much less overhead than the GMP representation and operations {\em for small integers}.
For example the Karatsuba version with GMP ones takes 160 ms rather than 33 ms. The naive version
takes 516 ms rather than 43 ms.
From these examples we can see there is sometimes a speed advantage in
mapping whole (univariate, dense) polynomial multiplication problems
into large integer multiplications. This mapping also may be useful
for some other operations previously noted. One might ask -- why not
just keep the polynomials like this, never unpacking? While this may
work for finite-field representations, unfortunately for exact integer coefficients,
this is generally not acceptable: the padding may have to be
renegotiated periodically, in fact at each multiplication operation!
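A tiny illustration of the renegotiation problem (a Python toy of our own, not from the paper's code): a pad adequate for one product silently fails on the next, because the product's coefficients outgrow their slots.

```python
def pack(coeffs, bits):
    n = 0
    for c in reversed(coeffs):
        n = (n << bits) | c
    return n

def unpack(n, bits, length):
    mask = (1 << bits) - 1
    return [(n >> (bits * i)) & mask for i in range(length)]

p = [1, 1, 1, 1]   # 1 + x + x^2 + x^3
bits = 3           # 2^3 = 8 exceeds 4, the largest coefficient of p^2
K = pack(p, bits)

# one multiplication at this pad is fine:
assert unpack(K * K, bits, 7) == [1, 2, 3, 4, 3, 2, 1]

# staying packed for a second multiplication is not: p^3 has coefficients
# as large as 12, which overflow 3-bit slots and carry into their neighbors
p_cubed = [1, 3, 6, 10, 12, 12, 10, 6, 3, 1]
assert unpack(K * K * K, bits, 10) != p_cubed
```

Over a finite field the modulus caps the coefficients, so one fixed pad serves forever; over the exact integers the pad would have to grow at each multiplication.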
For our example problem, the Allegro CL bignum package falls short, and
the Kronecker trick gains about a factor of 1.7 over what we would get by
simply using GMP, with a Karatsuba multiplication, as the preferred
library routines for our Lisp system's bignums.
The prospective user of such programs should be aware
that converting a very small problem such as $p=(x+1)^{10}$ to a GMP
number and squaring takes about
%% 2005: 3 times longer than the naive calculation in
4.5 times longer than the naive calculation in our program trials. In
a small example the encoding and decoding dominates possible savings,
at least in our programming framework.
%%2005
How important is it to use a power-of-two pad? Intuitively it seems
that shifting should be much preferable to evaluation of a polynomial.
%2010
Earlier experiments suggest that using some arbitrary number as a pad
might be some 4 times slower in packing and 14 times slower unpacking.
Since our current hacked-up packing and unpacking is substantially
faster, and could apparently be even faster if written in a lower-level
language, it seems that there is quite a handicap for a non-binary-power.
If an algorithm truly requires that the padding must be done with
(say) powers of some large prime instead of powers of two, our testing
suggests that such slower encoding may severely impact the efficiency
of encoding and decoding. On the other hand, for carefully chosen
numbers there might be compensating tricks that could speed them up.
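The distinction can be seen in miniature (a Python sketch of our own): a power-of-two pad decodes with shifts and masks, while an arbitrary base needs a full division with remainder at every slot. Both recover the same coefficients; the divisions are the slower path.

```python
def decode_pow2(n, logv, length):
    """Decode with shifts and masks; only valid when the pad is 2^logv."""
    mask = (1 << logv) - 1
    return [(n >> (logv * i)) & mask for i in range(length)]

def decode_base(n, base, length):
    """Decode by repeated division with remainder; works for any base."""
    out = []
    for _ in range(length):
        n, r = divmod(n, base)
        out.append(r)
    return out

coeffs = [5, 0, 7, 2, 9]
n256 = sum(c * 256 ** i for i, c in enumerate(coeffs))
assert decode_pow2(n256, 8, 5) == decode_base(n256, 256, 5) == coeffs

# an arbitrary (say prime) base also round-trips, but only divmod can decode it
n1009 = sum(c * 1009 ** i for i, c in enumerate(coeffs))
assert decode_base(n1009, 1009, 5) == coeffs
```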
A final observation: one might propose to replace {\em all} integer
arithmetic in a Lisp implementation by GMP numbers. We think that this
would likely be a mistake. Most arithmetic is likely to be on small
(single-word) numbers. In the system we used here, Allegro Common
Lisp (version 6.2)
%%in 2010, Allegro version 8.0
on a typical 32-bit computer, the fixnum data type
provides 30 bits: the largest positive fixnum is 536,870,911.
Counting down from this number to zero using compiled Lisp takes 0.3
seconds, but using GMP arithmetic takes 267 seconds. This factor of
890 on common small-number arithmetic is unacceptable since it is used
for array indexing and almost every other integer arithmetic
application. (Compare this to, say, C, where the language offers no
real alternative to fixed-length arithmetic.) Given this
reality, most Lisp systems, as well as computer algebra systems, have
at least two {\em underlying} integer types. Such systems tend to
cover over this distinction for most operations (e.g. arithmetic has
a uniform higher-level model that covers the data type
``integer'') but at least in compiled Lisp, advisory implementation
declarations can specify that some integers will always be small,
so that compilation can produce tight code \cite{gmplisp}.
\section{Conclusions}
We have illustrated a case in which we can get a factor of 26 speedup
by mapping our naive polynomial arithmetic into clever arithmetic on
GMP bignums. This speedup is real, but we can also improve the naive
(but commonly used) old algorithm. That is, we can improve the
standard polynomial multiplication program which is brief and
reasonably straightforward, but perhaps not the most appropriate for
very large examples.
Hacking on the naive algorithm, the improvement factor is reduced to
about 6 or even less: we replace the naive polynomial arithmetic by a
better polynomial multiplication program (using Karatsuba divide and
conquer).
A further improvement on the naive versions is possible: Given this
particular example in which a polynomial $p=(x+1)^{1000}$ is
multiplied by itself, we observe that the polynomial $p$ has
coefficients as large as $10^{299}$. If we replace the Lisp bignum
arithmetic with GMP arithmetic, then the factor is reduced further, to
about a factor of 2 improvement. Still, there is considerable merit in
being ``only'' twice as fast.
The Kronecker version really shines, however, when the original coefficients
are small. In this case (e.g. coefficients all single digits), it can be
100X faster than the next best, even for substantial degree (1000).
Although we have not presented benchmarks here to support it, in our tests
the Kronecker method may be relatively less advantageous for drastically unequal-sized
input polynomials (depending on the qualities of the underlying bignum multiplication program)
and furthermore will ordinarily fall behind (along with other dense-oriented algorithms)
when compared to algorithms specialized for sparse polynomials, given sufficiently sparse
inputs. Separately we have written about sparse polynomial experiments; here we note that
for some apparently sparse inputs, the answer may be rather dense, and dense algorithms may
be the fastest prospects.
%{2010: Times for $p=(x+1)^{1000}$ squared: by Kronecker, 0.5 seconds,
%Naive, 13.9 seconds, Karatsuba, 3.5 seconds.}
%Other (smaller) examples result in less favorable results for this
%encoding. These examples chosen to include very sparse or
%multivariate examples show that that the Kronecker encoding can be
%factors of 5, 10, or more slower than alternatives.
Daniel Lichtblau of Wolfram Research has suggested that a suitable
implementation in Mathematica 5.0 specifically to use GMP provides
advantages in certain internal algorithms even for relatively small
polynomials {\em over a finite field}. This may be, in part, because
Mathematica does not have a particularly efficient encoding for
ordinary polynomials, lacks speedups like the Karatsuba
multiplication, or like the Lisp we were using, has a relatively
inefficient bignum library compared to GMP, once the coefficients
are more than a few words long.
{%%% added section
On the suggestion of a referee, and against our better judgment
(see comments above) we include some comparison timings:
In Maple 7, on the same machine: let \verb|q:=expand((x+1)^1000);| then \verb|expand(q*q);|
takes 11.20 seconds.
In Mathematica 4.1 on the same machine: let \verb|q=Expand[(x+1)^1000]|; then \verb|Expand[q*q]|
takes 13.156 seconds.
GP/PARI version 2.2.7(alpha) takes 1.391 seconds.
% Xmaxima5.9, severely underconfigured memory 270. seconds.
Macsyma 2.4 takes 89.2 seconds (43 in garbage collection).
This can be contrasted with a multiplication time of about 0.485 seconds using GMP bignums,
called from Lisp (Allegro Common Lisp).
In some sense the ``purest'' comparison may be a system like NTL,
assuming you are willing to write and compile a C++ program to
construct the inputs and call the arithmetic routines. Using NTL's
default ``LIP'' arithmetic takes 0.344 seconds, faster by a factor of
1.4. Recompiling NTL to use GMP 4.2.1 arithmetic (on a fresh system)
takes about 0.047 seconds. This is remarkably fast, about 10.3 times
faster than encoding as a single GMP number. NTL's ZZX module uses
one of four multiplication methods (classical, Karatsuba, and two FFT
variations); in this case an FFT. This method could obviously also be
programmed in Lisp, or, since it is written in C, called from
Lisp.
That is, it might be even faster to
translate the whole problem's data into NTL format, link to NTL for
the polynomial arithmetic, and translate back if necessary; or to
perform the calculation entirely in NTL. Or perhaps there is
another system even faster than NTL, in which case that could be
linked to Lisp.
%(There is a point of reductio ad absurdum here; one could reject
%all work on trying to make C programs faster on the basis that
%(historically) there have almost always been superior FORTRAN compilers.
%Thus the right way to optimize C is to translate it into FORTRAN)
Our major point remains: it appears that we obtained a substantial
speedup {\em without having to write an FFT program ourselves}. When
the GMP multiplication program is improved or tuned for a new architecture (or
merely debugged!), we get that benefit without recoding our own
programs.}
Conclusion in brief: this Kronecker technique has promise for
speedups of multiplication for carefully chosen domains of large dense
univariate polynomials, and requires little more than access to a fast
arbitrary-precision integer arithmetic library such as GMP. Systems in which
GMP arithmetic is easily available may achieve substantial speedups.
There is one catch: efficient packing and unpacking must be available.
Previously (2005) we speculated that if one were to devote
extensive time and effort to specifically coding an equivalently clever
polynomial multiplication routine, with appropriate cut-off heuristics
between methods, including an FFT routine (as in NTL), then mapping to
long integers would not likely run faster. It appears that this was done
by FLINT. In 2010 we learned from Bill Hart that this Kronecker trick has
a more substantial range of
usefulness than we had expected. Hart demonstrated this by benchmarks in FLINT 1.5,
and by more careful programming we were able, to some extent, to reproduce
the relative rankings as compared with
a Karatsuba method, or even naive multiplication.
We had previously discounted
Kronecker because the packing and unpacking routines in our earlier code
were not efficient enough.
(We expect that they are still considerably less efficient than FLINT's!)
Consequently, the choice of algorithm depends more on these ``linear time''
overheads than we realized.
All Lisp programs used are included in an on-line directory linked from
the author's home page, {\tt lisp/mul/} along with additional documentation.
The only other programs needed are those available (free)
from the GMP web site \cite{gmp}, and Allegro
Common Lisp (the Express/free version).
Recommendation to builders of computer algebra systems:
We could spend far more time on more benchmarks and vary parameters
such as denseness and size of coefficients, as well as degree. We
could use implementations in C, C++, GMP, ARPREC, Lisp, Maple,
Mathematica, etc., on different machine architectures, and even
different versions of compilers under different optimizations. Rather
than attempt such an inevitably fruitless comparison in the hope of
providing CAS system-building advice\footnote{Fruitless because it
will be unlikely to tell you how much {\em your} precise circumstances
will benefit from this technique.}, we hope that this paper provides
the essential impetus to try something that is likely to be simple to
implement and that will possibly be quite fast if you have a good
arbitrary-precision library, and pack/unpack routines.
(Another recommendation\footnote{Which, since 2005, seems to have
caught on.}: if you are
considering implementing a CAS from scratch, or implementing bignum
arithmetic for a language like Lisp, and you intend to build
very fast polynomial arithmetic yet again, have you considered just using
one or more of the free libraries?)
\section{Acknowledgments}
This research was supported in part by NSF grant CCR-9901933
administered through the Electronics Research Laboratory, University
of California, Berkeley. Thanks to D. Lichtblau for spurring this
exercise and subsequent corrections and comments. Thanks also to
(anonymous) reviewers of an earlier draft for their comments. In
2010, I was prompted to revise this paper by extensive email
exchanges with Bill Hart regarding FLINT, Kronecker substitution, and
alternatives.
\begin{thebibliography}{99}
\bibitem{arprec} D. Bailey, ARPREC.
http://crd.lbl.gov/~dhbailey/mpdist/
\bibitem{gcdheu} B. Char, K. O. Geddes, and G. H. Gonnet. ``Gcdheu:
Heuristic polynomial GCD algorithm based on integer GCD computation.''
{\em J. of Symbolic Computation 7}:31--48, 1989. The algorithm has
been modified for implementation, and the analysis cleaned up. See for
example http://arxiv.org/PS\_cache/cs/pdf/0206/0206032.pdf
\bibitem{dp}J. H. Davenport, J. A. Padget, ``HEUGCD: How Elementary Upperbounds Generated Cheaper Data,'' European Conference on Computer Algebra (2) 1985: 18-28
\bibitem{gmplisp}
R. Fateman,{``Importing the Gnu Multiple Precision Package (GMP) into Lisp, and implications for Functional Programming''}, draft. 2003.
\bibitem{flint} FLINT: Fast Library for Number Theory. www.flintlib.org
\bibitem{FFTW} The FFTW Home Page, www.fftw.org.
\bibitem{ggs} K. O. Geddes, G. H. Gonnet, T. J. Smedley, ``Heuristic Methods for Operations With Algebraic Numbers,'' (Extended Abstract). ISSAC 98 475-480
%added Jan 18,2006
\bibitem{gcl} K. O. Geddes, S. R. Czapor, G. Labahn:
{\em Algorithms for Computer Algebra}, Kluwer Academic, 1992.
\bibitem{gmp} The Gnu MP Home page. http://swox.com/gmp/
\bibitem{harvey}David Harvey.
``Faster Polynomial Multiplication via Kronecker Substitution,''
arXiv:0712.4046v1 [cs.SC] 25 Dec 2007
\bibitem{ek82}
E. Kaltofen, ``A polynomial reduction from multivariate to bivariate integral polynomial factorization,''
14th ACM Symposium on Theory of Computing 1982, p 261 -- 266.
ISBN:0-89791-070-2
\bibitem{knuth} D.~E.~Knuth, {\em The Art of Computer Programming}, volume 2.
(1969, 1981). Addison-Wesley.
\bibitem{liao}
P. Liao, R. Fateman.
``Evaluation of the heuristic polynomial GCD''
{\em Proc ISSAC 1995} 240--247.
http://doi.acm.org/10.1145/220346.220376
\bibitem{schon88} A. Sch\"{o}nhage,
``Probabilistic Computation of Integer Polynomial GCDs,''
{\em J. Algorithms} 9 (1988) 365--371.
\bibitem{vzg} J. von zur Gathen and J. Gerhard,
{\em Modern Computer Algebra}, Cambridge Univ. Press 1999, 2003.
\bibitem{shoup} V. Shoup. NTL, (Number Theory Library)
{\tt http://shoup.net/ntl/}.
\end{thebibliography}
\section{Appendix}
DL's email note (excerpt)
``For univariate polynomials mod p we do what is called (I believe) binary
segmentation. Based on polynomial length and size of p we create large
integers appropriately zero-packed, then multiply them and recover the
polynomial product coefficients. An analysis will show that this
compares reasonably well with direct FFT convolution methods, especially
when the modulus is large compared to polynomial length. I'm pretty sure
this is discussed in von zur Gathen and Gerhard\footnote{in section 8.4}.
As we use GMP for integer multiplication, this becomes a case where the
library software decides whether to use schoolbook, Karatsuba,
Toom-Cook, or NTT. My experience is this general approach is quite
efficient in practice, especially as GMP is a fairly robust library. Not
perfect, but fairly robust.''
%Strange to say, I've run across examples where I rather wish we had
%implemented this multiplication for characteristic zero as well, that
%is, multiplication of polynomials with integer coefficients. We may do
%so in future. An efficient implementation along the lines of the mod p
%case is not terribly difficult.
%The code we use is actually accessible in Mathematica, by the way. If
%you have a pair of lists of coefficients, they may be multiplied as
%below.
%\begin{verbatim}
%In[52]:= Algebra`PolynomialTimesModList[{1,2,3,4},{5,6,7,8,9},11]
%Out[52]= {5, 5, 1, 5, 4, 4, 4, 3}
%
%Quick check using more standard Mathematica functions:
%
%In[54]:= CoefficientList[PolynomialMod[{1,2,3,4}.x^Range[0,3] *
% {5,6,7,8,9}.x^Range[0,4], 11], x]
%Out[54]= {5, 5, 1, 5, 4, 4, 4, 3}
%
%\end{verbatim}
\section{Macsyma Programs, Some Repeated, with tests}
\begin{verbatim}
/*padpower2 computes the power k such that 2^k is an adequate
pad for the multiplication of polynomials p1 and p2. Double it
if you expect positive and negative numbers. */
(pad(p1,p2,x):=(1+min(hipow(p1,x),hipow(p2,x)))*maxcoef(p1,x)*maxcoef(p2,x),
padpower2(p1,p2,x):= ceiling(log(bfloat(pad(p1,p2,x)))/log(2)))$
/* Maxcoef is a kludge to compute max abs value of coefficients in a
polynomial. A simple lisp program does this much faster */
maxcoef(p,x):=max (map (abs, subst(1,x,
substinpart("[", ratdisrep(p),0))))$
/* For comparison with the result of frombigL, here is
a program to make a list of the coefficients in a polynomial, zeros too.
They should be the same values if the multiplication worked. */
allcoeflist(p,x):=block([h:hipow(p,x),ans:[],head,rest],
for k:h thru 1 step -1
do ([head,rest]: bothcoef(p,x^k),
push(head,ans),
p: rest),
push(p,ans), /* last piece is the constant */
reverse(ans))$
tobig(p, x, v) := subst(v, x, p);
/* [q,r]:divide(i,v) sets q to quotient, r to remainder of i by v */
frombigL(i,v):=block([ans:[],q,r,signum:if (i<0) then -1 else 1],
i:i*signum,
[q,r]:divide(i,v),
while (i>0) do
(if r>v/2 then (push(signum*(r-v),ans),
i:q+1)
else (push(signum*r,ans),
i:q),
[q,r]:divide(i,v)),
ans);
\end{verbatim}
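For readers who prefer not to parse the Macsyma, here is a direct Python transcription (ours) of the balanced-digit idea in {\tt frombigL}, returning coefficients low-order first: a residue above $v/2$ encodes a negative coefficient, and the borrowed $v$ is repaid by carrying 1 into the quotient.

```python
def to_big(coeffs, logv):
    """Encode low-order-first (possibly negative) coefficients as the
    single integer p(2^logv)."""
    n = 0
    for c in reversed(coeffs):
        n = (n << logv) + c
    return n

def from_big(n, logv):
    """Decode n back into coefficients, mirroring frombigL."""
    v = 1 << logv
    sign = -1 if n < 0 else 1
    n = abs(n)
    out = []
    while n > 0:
        n, r = n >> logv, n & (v - 1)
        if r > v // 2:
            out.append(sign * (r - v))  # residue encodes a negative coefficient
            n += 1                      # repay the borrowed v
        else:
            out.append(sign * r)
    return out

# round trip, including a negative leading coefficient:
assert from_big(to_big([3, -2, 5, -1], 8), 8) == [3, -2, 5, -1]
# and a signed multiplication: (1 - x)(1 + x) = 1 - x^2
assert from_big(to_big([1, -1], 4) * to_big([1, 1], 4), 4) == [1, 0, -1]
```

Note that the decoder stops when the quotient reaches zero, so trailing zero coefficients (of degree above the true degree) are dropped.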
\section{Appendix: Some Common Lisp Code}
\begin{verbatim}
;;; SEE files http://www.cs.berkeley.edu/~fateman/lisp/mul/
;;; and especially the file just3GMP.lisp and justGMP-run.lisp
\end{verbatim}
\section {Appendix II - GMP and Lisp interface}
\begin{verbatim}
;; this is a sample of interface code from file loadgmp2.cl.
;; Set up the Allegro Common Lisp foreign function interface to GMP
;; other Common Lisp implementations have similar but not identical
;; foreign function interfaces.
(eval-when (compile load eval)
(ff:def-foreign-type mpz (* :int))
(ff:def-foreign-call
(mpz_init "__gmpz_init")
((x (* :int)))
:returning :int
:arg-checking nil
:call-direct t)
(ff:def-foreign-call
(mpz_set_str "__gmpz_set_str") ;changes value of mpz x.
((x mpz)
(s (* :char))
(base :int))
:strings-convert t :returning :int
:arg-checking nil)
;; :call-direct t
(ff:def-foreign-call
(mpz_get_str "__gmpz_get_str")
((s :int);;let it allocate the space
(base :int)
(op mpz))
:returning ((* :char) )
:arg-checking nil :call-direct t)
(ff:def-foreign-call
(mpz_get_d "__gmpz_get_d") ((x mpz))
:returning :double
:arg-checking nil :call-direct t)
(ff:def-foreign-call;; remove gmp number from memory
(mpz_clear "__gmpz_clear") ((x mpz))
:returning :int
:arg-checking nil :call-direct t)
(ff:def-foreign-call;; remove gmp number from memory
(mpz_cmp "__gmpz_cmp") ((x mpz) (y mpz) )
; returns pos if x>y, zero if x=y, neg if x i 0)))
(setf i (abs i))
(while (> i 0)
(setf q (ash i mv))
(setf r (logand i mask))
(cond ((> r vby2)
(push (if signum (- r v)(- v r)) ans)
(setf i (1+ q)))
(t
(push(if signum r (- r)) ans)
(setf i q))))
(coerce (nreverse ans)'array)))
;;padpower2 computes the power k such that 2^k is an adequate pad
;;for the multiplication of polynomials p1 and p2 in variable x.
(defun padpower2(p1 p2) ;; p1 and p2 are arrays of lisp numbers
(assert (arrayp p1))
(assert (arrayp p2))
(+ (integer-length (min (length p1) (length p2)))
(integer-length (maxcoef p1))
(integer-length (maxcoef p2))))
;; maxcoef is needed for computing padpower2. It returns the abs val of
;; the largest (in absolute value) of the coefficients in a polynomial.
(defun maxcoef(p1)
;; accumulate pos and negs separately to avoid computing/storing abs.
;; compare at end and make positive.
(declare (optimize (speed 3)(safety 0)(debug 0))
(type (array t) p1 ))
(let ((pans 0)(mans 0) temp)
(do ((i (1- (length p1)) (1- i)))
((< i 0)(max pans (- mans)))
(declare (fixnum i))
(setf temp (aref p1 i))
(if (< temp 0)(setf mans (min mans temp))
(setf pans (max pans temp))))))
;; round trip test to see if tobig and frombig work
(defun rt(p)(let((logv (padpower2 p p)))
(frombig (tobig p logv) logv)))
;; use LISP bignum arithmetic to multiply polys with Lisp coefs
(defun polymultb(p q)
(let((logv (padpower2 p q )))
(frombig (* (tobig q logv)(tobig p logv)) logv)))
;; use LISP bignum arithmetic to compute a power of p
(defun ppowerb(p n) (if (= n 1) p (polymultb p (ppowerb p (1- n)))))
;;Now try the same thing with GMP arithmetic
(defun tobigG(p logv)
(declare (optimize (speed 3)(safety 0)(debug 0)))
(declare (fixnum logv))
(assert (arrayp p))
(assert (integerp logv))
(let ((res (create_mpz_zero)))
(do ((i (1- (length p)) (1- i)))
((< i 0) res)
(declare(fixnum i))
(mpz_mul_2exp res res logv) ;; res <- res*2^logv
(mpz_add res res (togmp (aref p i))))))
(defun frombigG(i logv)
(declare (optimize (speed 3)(safety 0)(debug 0)) )
(let* ((ans nil)
(q (create_mpz_zero))
(r (create_mpz_zero))
(one (lisp2gmp 1))
(vby2 (create_mpz_zero)) ; will be v/2
(v (create_mpz_zero)) ; will be v
(signum (>= (signum (aref i 1)) 0))) ; t if i non-neg
(mpz_mul_2exp vby2 one (1- logv)) ;set to v/2
(mpz_mul_2exp v one logv) ;set to v
(if signum nil (mpz_neg i i)) ; make i positive
(while (> (aref i 1) 0) ; will be 0 when i is zero
(setf r (create_mpz_zero))
(mpz_fdiv_r_2exp r i logv) ;set r to remainder
(mpz_fdiv_q_2exp q i logv) ;set q to quotient
;; (format t "~% q=~s, r=~s" (gmp2lisp q)(gmp2lisp r))
(cond ((> (mpz_cmp r vby2) 0) ; it is a negative coef.
(if signum (mpz_sub r r v) (mpz_sub r v r))
(push r ans)
(mpz_add i q one))
(t
(if signum nil (mpz_neg r r))
(push r ans)
(setf i q)))
)
(coerce (nreverse ans) 'array)))
(defun padpower2g(p1 p2) ;; p1 and p2 are arrays of gmps
(assert (arrayp p1))
(assert (arrayp p2))
(+ (integer-length (min (length p1) (length p2)))
(mpz_sizeinbase(maxcoefg p1) 2)
(mpz_sizeinbase (maxcoefg p2) 2)))
(defun maxcoefg(p1) ;assume coefs are gmp numbers, use gmp arith
;; accumulate pos and negs separately to avoid computing/storing abs.
;; compare at end and make positive.
(declare (optimize (speed 3)(safety 0)(debug 0))
(type (array t) p1))
(let ((pans (create_mpz_zero))(mans (create_mpz_zero)) temp)
(do ((i (1- (length p1)) (1- i)))
((< i 0)(mpz_neg mans mans) ; make minus ans into pos
(if (> (mpz_cmp mans pans) 0) mans pans)) ; return larger
(declare (fixnum i))
(setf temp (aref p1 i))
(if (< (aref temp 1) 0)
(if (< (mpz_cmp mans temp)0) nil (setf mans temp))
(if (> (mpz_cmp pans temp) 0)nil (setf pans temp))))))
(defun polymultg(p q) ;; use GMP bignum stuff to multiply polys with gmp coefs
(let((logv (padpower2g p q)))
(frombigG (mp* (tobigG q logv)(tobigG p logv)) logv)))
(defun ppowerg(p n) (if (= n 1) p (polymultg p (ppowerg p (1- n)))))
;;; ..........................some GMP/lisp utilities
;; convert numbers to gmp numbers if they aren't already
(defun togmp(x)(if (numberp x)(lisp2gmp x) x))
;; gmp sequence to lisp sequence
(defun gs2l (s)(map 'array #'gmp2lisp s))
;; lisp sequence to gmp sequence
(defun ls2g (s)(map 'array #'lisp2gmp s))
;;Here are programs to deal with bignums. I don't know how specifically
;; it is dependent on implementation, but it works on intel allegro 6.2.
;;Convert a bignum into an array of 16-bit fixnums
(eval-when (compile load)(require :llstructs))
(defun nth-bigit (x n) (sys:memref x -14 (* 2 n) :unsigned-word))
(defun bignum-size(x) ;; in 16-bit quantities
(excl::bm_size x))
;; btoa works for any bignum or fixnum. It converts the
;; MAGNITUDE of the number into a sequence of 16-bit integers stored
;; in an array.
;; I suspect this would be a no-op in C...
;; (Re-typing an existing structure)
(defun btoa(x) ;;bignum or fixnum to array of bigits, 16 bits
(if (fixnump x) (vector (abs x))
(let* ((size (bignum-size x))
(ans (make-array size :element-type '(unsigned-byte 16))))
(declare (optimize (speed 3)(safety 0)(debug 0))(fixnum size))
(do ((i 0 (1+ i)))
((= i size) ans)
(declare (fixnum i))
(setf (aref ans i)(nth-bigit x i))))))
(defun atob(a) ; array to POSITIVE integer (bignum)
(let ((num 0))
(declare (optimize (speed 3)(safety 0)(debug 0)))
(do ((i (1- (length a)) (1- i)))
((< i 0) num)
(declare (fixnum i))
(setf num (+ (aref a i) (ash num 16))))))
;; signum works on bignums or fixnums.
;; convert array of unsigned-byte-32 to a bignum
(defun atob32(a)
(let ((num 0))
(do ((i (1- (length a)) (1- i)))
((< i 0) num)
(setf num (+ (aref a i) (ash num 32))))))
;; convert lisp integer to gmp-like stuff
(defun btoa32(x)
;;bignum or fixnum to array of unsigned-byte-32 bigits.
;;This is, I think, the same as the contents of GMP integer.
;; (other parts of GMP integer are k, number of bigits*sign,
;; and allocated size >=k)
(if (fixnump x) (vector (abs x))
(let* ((size (bignum-size x))
(newlen (ceiling size 2))
(a32 (make-array newlen :element-type '(unsigned-byte 32)
:initial-element 0)))
(do ((i 0 (+ 2 i))) ;; collect low order 16 bits of each bigit
((>= i size) nil)
(setf (aref a32 (truncate i 2))
(nth-bigit x i)))
(do ((i 1 (+ 2 i)))
;; collect high order. [We do this order so even/odd sizes work]
((>= i size) a32)
(incf (aref a32 (truncate i 2)) (ash (nth-bigit x i) 16))))
))
;; convert from a lisp number to gmp.
(defun lisp2gmp(x)
(let ((r (create_mpz_zero)))
;; returns a gmp number equal to any lisp integer
(cond ((fixnump x) (mpz_set_si r x));; works faster for fixnums
((bignump x) (let ((a (btoa x)) ;; bignum to array.
(h (create_mpz_zero)))
(do ((i (1- (length a))(1- i)))
((< i 0) r)
(declare (fixnum i))
(mpz_set_si h (aref a i))
(mpz_mul_2exp r r 16)
(mpz_add r r h))
(if (< x 0) (setf (aref r 1)(- (aref r 1))))
))
(t (error "can't convert ~s to gmp" x) ))
r))
;; convert a gmp number to lisp
(defun gmp2lisp(g);;
(let* ((size (abs (aref g 1)));; number of limbs
(signum (signum (aref g 1)))
(ans 0)) ; the bignum result
(do ((i (1- size) (1- i)))
((< i 0) (* signum ans))
(declare (fixnum i))
(setf ans (+ (let ((k (mpz_limbn g i)))
(if (< k 0)(+ k #.(ash 1 32)) k));compensate for unsigned
(ash ans 32))))))
(defun rtg(x) ;round trip test for number x
(if (= x (gmp2lisp(lisp2gmp x))) 'true 'false))
;; append two arrays
(defun append-seq(r s)(let* ((rlen (length r))
(slen (length s))
(new (make-array (+ rlen slen))))
(do ((i 0 (1+ i)))
((= i rlen))
(setf (aref new i)(aref r i)))
(do ((i 0 (1+ i)))
((= i slen) new)
(setf (aref new (+ i rlen ))(aref s i)))))
;;;;;;;;;;;;;;now some regular polynomial arithmetic (n^2 method)
(defun polymult(p q) ;; by ordinary poly multiplication.
(assert (arrayp p))
(assert (arrayp q))
(let* ((plim (length p)) ;; degree is plim-1
(qlim (length q))
(ans(make-array (+ plim qlim -1) :initial-element 0)))
(declare(fixnum qlim plim)(type (array t) ans))
(do ((i 0 (1+ i)))
((= i plim) ans)
(declare (fixnum i))
(do ((j 0 (1+ j)))
((= j qlim) nil)
(declare (fixnum j))
(incf (svref ans (+ i j))
(* (svref p i)(svref q j)))))))
;; compute a power of p
(defun ppower(p n) (if (= n 1) p (polymult p (ppower p (1- n)))))
;;An implementation of Karatsuba multiplication
#|
Assume p is of degree 2n-1, i.e. p= a*x^n+ b where a is degree n-1 and b is degree n-1.
assume q is also of degree 2n-1, i.e. q= c*x^n+d
compute U= (a+b)*(c+d) = a*c+a*d+b*c+b*d, a polynomial of degree k=2*(n-1)
compute S= a*c of degree k
compute T= b*d of degree k
compute U = U-S-T = a*d+b*c of degree k
Then V=p*q = S*x^(2n)+U*x^n + T.
From first principles, V =p*q should be of degree 2*(2n-1) = 4n-2.
from the form above, k+2n= 4n-2.
note that U*x^n is of degree 2*(n-1)+n = 3n-1 so that it OVERLAPS the S*x^(2n) component.
note that T is of degree 2*(n-1) = 2n-2 so that it OVERLAPS the U*x^(n) component.
(the trick is that, used recursively to multiply a*c etc,
this saves coefficient mults: instead of n*m, about max(n,m)^1.585)
If the smallest of p and q is of some small degree *karatlim* 16 or so,
we will use conventional arithmetic. There may be better characterizations
of the switch-over size, but some fiddling around gave us this guess.
Is it critical that the size be odd?
What if it is not 2n-1 business? Find the max degree of p, q. If it is odd, we
are set. If it is even, 2n, do this:
p=a*x^n+ b where b is of degree n-1, but a is of degree n. Same deal works. I think.
|#
(defvar *karatlim* 16) ;; an educated guess -- don't use karatsuba for small probs.
(defun polymultk(pp qq);; by Karatsuba.
(assert (arrayp pp))
(assert (arrayp qq))
(let* ((plim (length pp))
(qlim (length qq)))
    (declare (fixnum qlim plim)(type (array t) pp qq))
(if (< (min plim qlim) *karatlim*)(polymult pp qq); don't use for small polys
(let* (aa
(h (truncate (max plim qlim) 2))
(A (subseq pp (min h plim) plim))
(B (subseq pp 0 (min h plim)))
(C (subseq qq (min h qlim) qlim))
(D (subseq qq 0 (min h qlim)))
(U (polymultk(polyadd A B) (polyadd C D)))
(TT (polymultk A C))
(S (polymultk B D)))
(polysub-into S U 0) ; change U to U-S
(polysub-into TT U 0) ; change U to U-S-T
(setf aa (polyshiftleft TT (ash h 1))) ;aa = T*x^(2h)
(polyadd-into U aa h) ; change aa to T*x^(2h)+U*x^h
(polyadd-into S aa 0) ; change aa to T*x^(2h)+U*x^h +S
(pnorm aa) ;remove trailing 0's if any
))))
(defun polyshiftleft(p k) ; like multiplying by x^k
;; k is positive integer, p is a raw polynomial, no header
(let* ((oldlen (length p))
(newlen (+ oldlen k))
(newpoly (make-array newlen :initial-element 0)))
(do ((i (1- oldlen) (1- i)))
((< i 0) newpoly)
(setf (svref newpoly (+ k i))(svref p i)))))
(defun polyadd(p q)
(let* ((plim (length p))(qlim (length q))
(aa (make-array (max plim qlim) :initial-element 0)))
(if (> plim qlim)
(polyadd-1 p q plim qlim aa)
(polyadd-1 q p qlim plim aa))
;; We could normalize here, removing all trailing zeros but one.
;; We would rather wait until we know we care about it.
;;(pnorm aa)
))
(defun polyadd-1 (p q pl ql aa)
;; p is longer, length pl. same length as array aa, filled with zeros.
;; p and q are unchanged.
(do ((i 0 (1+ i)))
((= i pl))
(setf (svref aa i)(svref p i))) ; copy longer input to answer
(do ((i 0 (1+ i)))
((= i ql) aa)
(incf (svref aa i)(svref q i)))); add other arg
(defun polyadd-into (p aa start)
;; aa must be at least as long as p shifted by start
;; p is unchanged. aa is changed to aa + p*x^start
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(incf (svref aa i)(svref p (- i start)))))
(defun polysub-into (p aa start)
;; aa must be at least as long as p shifted by start
;; p is unchanged. aa is changed to aa - p*x^start
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(decf (svref aa i)(svref p (- i start)))))
;; remove trailing zeros.
(defun pnorm (x) ; x is a vector
(declare (optimize (speed 3)(safety 0)(debug 0)))
(let ((x x) pos)
(declare (simple-vector x)) ; pos may be nil, so it is not declared fixnum
(setq pos (position-if-not #'zerop x :from-end t))
(cond
((null pos) #(0)) ; change if we want #() to be zero!
((= pos (1- (length x))) x) ; already normal
(t (subseq x 0 (1+ pos) )))))
;; Whew! That simple idea results in a lot of programming.
;; compute a power of p using Karatsuba
(defun ppowerk(p n) (if (= n 1) p (polymultk p (ppowerk p (1- n)))))
;;; return to the previous classical multiplication but assume
;;; that Lisp bignum arithmetic is slow; use GMP instead.
;; USE GMP coefficients with classical and with karatsuba multiplication
(defun polymultGMP(p q) ;; by ordinary poly multiplication on GMP numbers
(declare (optimize (speed 3)(safety 0)(debug 0)))
(assert (arrayp p))
(assert (arrayp q))
(let* ((plim (length p)) ;; degree is plim-1
(qlim (length q))
(ans(make-array (+ plim qlim -1) :initial-element 0)))
(declare(fixnum qlim plim)(type (array t) ans))
(setf ans (ls2g ans)); convert to gmp zeros
(do ((i 0 (1+ i)))
((= i plim) ans)
(declare (fixnum i))
(do ((j 0 (1+ j)))
((= j qlim) nil)
(declare (fixnum j))
(mpz_addmul (svref ans (+ i j)) (svref p i)(svref q j))))))
;; Karatsuba version with GMP numbers
(defun polymultkG(pp qq);; by Karatsuba.
(assert (arrayp pp))
(assert (arrayp qq))
(let* ((plim (length pp))
(qlim (length qq)))
(declare (fixnum qlim plim)(type (array t) pp qq)) ; aa is bound in the inner let*
(if (< (min plim qlim) *karatlim*)(polymultGMP pp qq); don't use for small polys
(let* (aa
(h (truncate (max plim qlim) 2))
(A (subseq pp (min h plim) plim))
(B (subseq pp 0 (min h plim)))
(C (subseq qq (min h qlim) qlim))
(D (subseq qq 0 (min h qlim)))
(U (polymultkG(polyaddG A B) (polyaddG C D)))
(TT (polymultkG A C))
(S (polymultkG B D)))
(polysub-intoG S U 0) ; change U to U-S
(polysub-intoG TT U 0) ; change U to U-S-T
(setf aa (polyshiftleftG TT (ash h 1))) ;aa = T*x^(2h)
(polyadd-intoG U aa h) ; change aa to T*x^(2h)+U*x^h
(polyadd-intoG S aa 0) ; change aa to T*x^(2h)+U*x^h +S
(pnormG aa) ;remove trailing 0's if any
))))
(defun polyshiftleftG(p k) ; like multiplying by x^k
;; k is positive integer, p is a raw polynomial, no header
(let* ((oldlen (length p))
(newlen (+ oldlen k))
(newpoly (make-array newlen :initial-element 0)))
(setf newpoly (ls2g newpoly))
(do ((i (1- oldlen) (1- i)))
((< i 0) newpoly)
(declare (fixnum i))
(setf (svref newpoly (+ k i))(svref p i)))))
(defun polyaddG(p q)
(let* ((plim (length p))(qlim (length q))
(aa (make-array (max plim qlim) :initial-element 0)))
(setf aa (ls2g aa))
(if (> plim qlim)
(polyadd-1G p q plim qlim aa)
(polyadd-1G q p qlim plim aa))
;; We could normalize here, removing all trailing zeros but one.
;; We would rather wait until we know we care about it.
;;(pnorm aa)
))
(defun polyadd-1G (p q pl ql aa)
;; p is longer, length pl. same length as array aa, filled with zeros.
;; p and q are unchanged.
(do ((i 0 (1+ i)))
((= i pl))
(mpz_add (svref aa i)(svref aa i)(svref p i))) ; copy longer input to answer
(do ((i 0 (1+ i)))
((= i ql) aa)
(mpz_add (svref aa i)(svref aa i)(svref q i)))); add other arg
(defun polyadd-intoG (p aa start)
;; aa must be at least as long as p shifted by start
;; p is unchanged. aa is changed to aa + p*x^start
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(mpz_add (svref aa i)(svref aa i)(svref p (- i start)))))
(defun polysub-intoG (p aa start)
;; aa must be at least as long as p shifted by start
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(mpz_sub (svref aa i) (svref aa i)(svref p (- i start)))))
;; remove trailing zeros.
(defun pnormG (x) ; x is a vector
(declare (optimize (speed 3)(safety 0)(debug 0)))
(let ((x x) pos)
(declare (simple-vector x)) ; pos may be nil, so it is not declared fixnum
(setq pos (position-if-not #'(lambda(r)(= (aref r 1) 0)) x :from-end t))
(cond
((null pos) (vector (create_mpz_zero))) ; change if we want #() to be zero!
((= pos (1- (length x))) x) ; already normal
(t (subseq x 0 (1+ pos) )))))
;; compute a power of p using Karatsuba and GMP
(defun ppowerkG(p n) (if (= n 1) p (polymultkG p (ppowerkG p (1- n)))))
;; benchmarks
#|
(setf ones (make-array 1000 :initial-element 1)) ;1
(setf bigs(make-array 1000 :initial-element (expt 2 40))); 1,000
(setf moreones(make-array 10000 :initial-element 1)) ;10,000
(setf short #(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30))
(time (prog nil (polymult ones ones))) ;;62ms
(time (prog nil (polymultk ones ones)));; 16ms
(time (prog nil (polymult bigs bigs))) ;; 410ms
(time (prog nil (polymultk bigs bigs)));; 109ms
(time (prog nil (polymultb bigs bigs)));; 188ms
(time (prog nil (setf m (ls2g bigs)))) ;; 0ms
(time (prog nil (setf ans (polymultg m m)))) ;;15 ms
(time (prog nil (gs2l ans))) ;;16 ms
;; total 31 ms
(time (prog nil (polymult moreones moreones))) ;; 5484ms
(time (prog nil (polymultk moreones moreones)));; 656ms *karatlim* is 20.. 922ms if *karatlim*=10
(time (prog nil (polymultb moreones moreones)));; 4017ms
(time (prog nil (setf m (ls2g moreones)))) ;; 16ms ;twice
(time (prog nil (setf ans (polymultg m m)))) ;; 203 ms
(time (prog nil (gs2l ans)));; 32 ms
;; total 267 ms.
(time (prog nil (ppower short 50))) ;;1140ms
(time (prog nil (ppowerk short 50)));;1159ms
(time (prog nil (ppowerb short 50)));;9884ms
(time (prog nil (gs2l(ppowerg (ls2g short) 50)))) ;; 687ms
(time (prog nil (gs2l(ppowerkG (ls2g short) 50)))) ;; 1874 ms
(setf p1000 (ppower #(1 1) 1000)) ;(x+1)^1000
(time (prog nil (polymult p1000 p1000)));; 23,611ms
(time (prog nil (polymultk p1000 p1000)));; 3,723ms with *karatlim*=3
(time (prog nil (polymultk p1000 p1000)));; 5,423ms with *karatlim*=16
(time (prog nil (polymultk p1000 p1000)));; 7,000ms with *karatlim*=20
(time (prog nil (polymultb p1000 p1000))) ;; fails: creates too large a bignum for Lisp
(time (prog nil (gs2l (polymultg (ls2g p1000)(ls2g p1000))))) ;785ms total
;;;; suggested by WKC: use GMP with classical or Karatsuba multiplication.
(time (prog nil (gs2l (polymultkG (ls2g p1000)(ls2g p1000))))); 1,266 ms
;;; how much is conversion, how much arithmetic?
(time (setf m (ls2g p1000))) ;47 ms
(time (prog nil (setf ans (polymultkG m m)))) ;;1,050 ms ;*karatlim*=16
(time (prog nil (gs2l ans))) ;;200 ms
(time (prog nil (setf ans (polymultGMP m m)))) ;; 3,578 ms
We made minor modifications to the classical polynomial mult (polymult),
creating polymultGMP. The changes were to
(a) assume all inputs are arrays of GMP numbers, and (b) do all
the arithmetic in GMP data. This reduced the time from 23.6 seconds
to 3.578 seconds. Turning then to converting
the Karatsuba multiplication to use GMP numbers
gets the time down to 1.05 seconds.
Thus encoding it all into
one number allows us to do the internal multiplication in 485 ms versus
the collection of smaller (but still multi-precision) multiplications
and the associated bookkeeping in 1.05 seconds: about 2.2 times faster.
There are various ways one might prefer the answers to be presented.
We could choose to use GMP forms throughout a system, though having
them all packed into one big number is not really acceptable: the
padding has to be renegotiated periodically, at each operation!
So for this problem, the Lisp bignum package really falls short, and
the Kronecker trick gains about a factor of 2.2.
;; times on 2.6GHz Pentium 4 Allegro CL 6.2 512 MB RAM Sept 2003.
|#
;; this is from file loadgmp2.cl. Set up the foreign function interface to GMP
(eval-when (compile load eval)
(ff:def-foreign-type mpz (* :int))
(ff:def-foreign-call
(mpz_init "__gmpz_init")
((x (* :int)))
:returning :int
:arg-checking nil
:call-direct t)
(ff:def-foreign-call
(mpz_set_str "__gmpz_set_str") ;changes value of mpz x.
((x mpz)
(s (* :char))
(base :int))
:strings-convert t :returning :int
:arg-checking nil)
;; :call-direct t
(ff:def-foreign-call
(mpz_get_str "__gmpz_get_str")
((s :int);;let it allocate the space
(base :int)
(op mpz))
:returning ((* :char) )
:arg-checking nil :call-direct t)
(ff:def-foreign-call
(mpz_get_d "__gmpz_get_d") ((x mpz))
:returning :double
:arg-checking nil :call-direct t)
(ff:def-foreign-call;; remove gmp number from memory
(mpz_clear "__gmpz_clear") ((x mpz))
:returning :int
:arg-checking nil :call-direct t)
(ff:def-foreign-call
(mpz_cmp "__gmpz_cmp") ((x mpz) (y mpz)); returns pos if x>y, zero if x=y, neg if x<y
:returning :int
:arg-checking nil :call-direct t)
;; def-foreign-call forms of the same shape are needed for the other GMP
;; entry points used below: mpz_add, mpz_sub, mpz_neg, mpz_addmul, mpz_gcd,
;; mpz_gcdext, mpz_fdiv_qr, mpz_fdiv_q_ui, mpz_divisible_p.
;; Decode the big integer i (= p(v)) back to an array of GMP coefficients,
;; using balanced remainders so negative coefficients are recovered.
(defun frombigGx (i v) ; header reconstructed; the body below is original
(let* ((ans nil)
       (q (create_mpz_zero))
       (vby2 (create_mpz_zero))
       (one (togmp 1))
       r
       (signum (>= (signum (aref i 1)) 0))) ; t if i non-neg
(setf v (togmp v)) ; just in case v is not gmp already
(mpz_fdiv_q_ui vby2 v 2) ;set to v/2
(if signum nil (mpz_neg i i)) ; make i positive
(while (> (aref i 1) 0) ; will be 0 when i is zero
(setf r (create_mpz_zero))
(mpz_fdiv_qr q r i v) ;set q, r to quotient, remainder of i/v
;; (format t "~%q=~s r=~s" (gmp2lisp q)(gmp2lisp r))
(cond ((> (mpz_cmp r vby2) 0) ; it is a negative coef.
(if signum (mpz_sub r r v) (mpz_sub r v r))
(push r ans)
(mpz_add i q one))
(t
(if signum nil (mpz_neg r r))
(push r ans)
(setf i q)))
)
(coerce (nreverse ans) 'array)))
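The balanced-remainder decoding loop above, and the whole Kronecker
multiplication it supports, can be sketched in Python, where bignum arithmetic
is built in. The names here (tobig, frombig, kmul) are illustrative, not the
Lisp entry points:

```python
def tobig(p, x):
    # encode coefficient list p (low order first) as the integer p(x)
    acc = 0
    for c in reversed(p):
        acc = acc * x + c
    return acc

def frombig(i, x):
    # decode i = p(x) back to coefficients; a remainder above x/2 is
    # taken as a negative coefficient (balanced representation), with a
    # carry into the quotient, mirroring the mpz_fdiv_qr loop above
    sign = 1 if i >= 0 else -1
    i = abs(i)
    ans = []
    while i != 0:
        q, r = divmod(i, x)
        if r > x // 2:        # negative coefficient: use r - x, carry 1
            r -= x
            q += 1
        ans.append(sign * r)
        i = q
    return ans or [0]

def kmul(p, q, x):
    # Kronecker multiplication: one big-integer product, then decode;
    # x must exceed twice the largest result coefficient in magnitude
    return frombig(tobig(p, x) * tobig(q, x), x)
```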
(defun pgh(p q) ;univariate gcd.
;; returns G=gcd(p,q), U=p/G, V=q/G
(let* ((pcontent (reducegcd p))
(qcontent (reducegcd q))
(g (gcd pcontent qcontent)))
;; first remove content
(unless (= g 1)(setf p (map 'array #'(lambda(r)(/ r g)) p))
(setf q (map 'array #'(lambda(r)(/ r g)) q)))
(let* ((pmax (maxcoef p)) ;infinity norm of p and q
(qmax (maxcoef q))
(pdeg (length p))
(qdeg (length q))
(lcp (aref p (1- pdeg))) ;leading coefs of p and q
(lcq (aref q (1- qdeg)))
(beta (+ 29 (* 2 (min pmax qmax))))
(xi (max (min beta (* 1000 (isqrt (ceiling beta 101))))
(+ 2 (* 2 (min (ceiling pmax lcp)(ceiling qmax lcq))))))
(res (togmp 0))
(U (togmp 0))
(V (togmp 0))
PP QQ)
;; (format t"~%xi=~s beta=~s" xi beta)
(if ;;(> (*(integer-length xi) pdeg qdeg) 4000) ;; b1
nil ;; never fail this way..
'failure
;; insert here other heuristics, other tests... maybe a loop
;; to bump up xi
(let (ans anscontent)
(mpz_gcd res (setf PP(tobigGx p xi)) (setf QQ(tobigGx q xi)))
;; (format t "~%res =~s pp=~s qq=~s" (gmp2lisp res)(gmp2lisp PP)(gmp2lisp QQ))
;; (mpz_fdiv_q U PP res)
;; (mpz_fdiv_q V QQ res)
(setf ans(gs2l(frombigGx res xi)))
(setf anscontent (reducegcd ans))
(unless (= anscontent 1) (setf ans (map 'array #'(lambda(r)(/ r anscontent)) ans)))
;; test stuff here as well to see if answer is credible, eg ans divides p and q
;; fiddle with cofactors of p and q otherwise.
(if (= g 1) ans (map 'array #'(lambda(r)(* r g)) ans))
;;(values ans (frombigGx U)(frombigGx V))
ans )))))
(defun pghext(p q) ;univariate extended gcd. buggy?
;; returns G=gcd(p,q), A B, a*p+b*q=G
(let* ((pcontent (reducegcd p))
(qcontent (reducegcd q))
(g (gcd pcontent qcontent)))
;; first remove content
(unless (= g 1)(setf p (map 'array #'(lambda(r)(/ r g)) p))
(setf q (map 'array #'(lambda(r)(/ r g)) q)))
(let* ((pmax (maxcoef p)) ;infinity norm of p and q
(qmax (maxcoef q))
(pdeg (length p))
(qdeg (length q))
(lcp (aref p (1- pdeg))) ;leading coefs of p and q
(lcq (aref q (1- qdeg)))
(beta (+ 29 (* 2 (min pmax qmax))))
(xi (max (min beta (* 1000 (isqrt (ceiling beta 101))))
(+ 2 (* 2 (min (ceiling pmax lcp)(ceiling qmax lcq))))))
; (setf xi (+ 100000000000000000000 xi)) ;; to reconstruct a,b?
(res (togmp 0))
(U (togmp 0))
(A (togmp 0))
(B (togmp 0))
(V (togmp 0))
PP QQ)
;; (format t"~%xi=~s beta=~s" xi beta)
(if ;;(> (*(integer-length xi) pdeg qdeg) 4000) ;; b1
nil ;; never fail this way..
'failure
;; insert here other heuristics, other tests... maybe a loop
;; to bump up xi
(let ((ans nil))
(mpz_gcdext res A B
(setf PP(tobigGx p xi))
(setf QQ(tobigGx q xi)))
;; (format t "~%res =~s pp=~s qq=~s" (gmp2lisp res)(gmp2lisp PP)(gmp2lisp QQ))
;; (mpz_fdiv_q U PP res)
;; (mpz_fdiv_q V QQ res)
(setf ans(primpart(gs2l(frombigGx res xi))))
;; test stuff here as well to see if answer is credible, eg ans divides p and q
;; fiddle with cofactors of p and q otherwise.
(if (= g 1) ans (setf ans (map 'array #'(lambda(r)(* r g)) ans)))
(values ans (primpart(gs2l (frombigGx A xi) ))
(primpart(gs2l (frombigGx B xi) ))))))))
(defun primpart(p);divide the polyn through by integer content
(let ((pcontent (reducegcd p)));; could do this faster, probably
(if (= pcontent 1) p
(map 'array #'(lambda(r)(/ r pcontent))p ))))
(defun gcdtest3(d m)
(let* ((a1 (randpoly d m))
(a2 (randpoly d m))
)
(multiple-value-bind (g a b)(pghext a1 a2)
(let((g2 (pnorm (polyadd (polymult a1 a)(polymult a2 b)))))
(if (and (arrayp g2)
(every #'= g g2))
0 1 )))))
(defun gcdtest3a(d m)
(let* ((a1 (randpoly d m))
(a2 (polymult a1(randpoly d m)))
)
(multiple-value-bind (g a b)(pghext a1 a2)
(let((g2 (pnorm (polyadd (polymult a1 a)(polymult a2 b)))))
(if (and (arrayp g2)
(every #'= g g2))
0 1 )))))
(defun gcdtest4(a1 a2) ;; two polynomials
(multiple-value-bind (g a b)(pghext a1 a2)
(let((g2 (pnorm (polyadd (polymult a1 a)(polymult a2 b)))))
(if (and (arrayp g2)
(every #'= g g2))
0 1 ))))
(defun pghco(p q)
;;univariate gcd. test cofactors.
;; returns G=gcd(p,q), U=p/G, V=q/G
(let* ((pcontent (reducegcd p))
(qcontent (reducegcd q))
(g (gcd pcontent qcontent)))
;; first remove content (make p, q primitive)
(unless (= g 1)
(setf p (map 'array #'(lambda(r)(/ r g)) p))
(setf q (map 'array #'(lambda(r)(/ r g)) q)))
(let* ((pmax (maxcoef p)) ;infinity norm of p and q
(qmax (maxcoef q))
(pdeg (length p))
(qdeg (length q))
(lcp (aref p (1- pdeg))) ;leading coefs of p and q
(lcq (aref q (1- qdeg)))
(beta (+ 29 (* 2 (min pmax qmax))))
(xi (max (min beta (* 1000 (isqrt (ceiling beta 101))))
(+ 2 (* 2 (min (ceiling pmax lcp)(ceiling qmax lcq))))))
(res (togmp 0))
(U (togmp 0))
(V (togmp 0))
PP QQ)
;; (format t"~%xi=~s beta=~s" xi beta)
(if ;;(> (*(integer-length xi) pdeg qdeg) 4000) ;; b1
nil ;; never fail this way..
'failure
;; insert here other heuristics, other tests... maybe a loop
;; to bump up xi
(let ((ans))
(mpz_gcd res (setf PP(tobigGx p xi))
(setf QQ(tobigGx q xi)))
;; (format t "~%res =~s pp=~s qq=~s" (gmp2lisp res)(gmp2lisp PP)(gmp2lisp QQ))
;; this next test is wrong. We shouldn't test p divided by res
;; as integers (surely that would return yes), but
;; test division as polynomials.
;; we would rather find cofactors u, v, such that p=u*g, q=v*g
;; proves the case.
(cond
((or (= 0 (mpz_divisible_p PP res)) ; test the integer encodings PP, QQ
(= 0 (mpz_divisible_p QQ res))) 'failure)
(t
(setf ans(primpart(gs2l(frombigGx res xi))))
(if (= g 1) ans (map 'array #'(lambda(r)(* r g)) ans))
ans )))))))
(defun reducegcd
(p)
;; semantically the same as (reduce #'gcd p) but with a short-cut.
;; As soon as a gcd result is 1, the rest of the terms need not be
;; examined.
(let ((size (1- (length p))))
(declare (fixnum size))
(do* ((i size (1- i)) ; start at size: the steps then visit p[size-1] ... p[0]
(gcdsofar
(if (> size 0)(abs (aref p size)) 0);initially last item or 0
(gcd gcdsofar (aref p i))))
((or (= 1 gcdsofar)(= i 0)) gcdsofar)
(declare (fixnum i)))))
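What reducegcd computes, with the same early exit, in a short Python sketch
(illustrative only):

```python
from math import gcd

def reducegcd(p):
    # gcd of all coefficients (the integer content), stopping as soon
    # as the running gcd hits 1; returns 0 for an empty input
    g = 0
    for c in reversed(p):
        g = gcd(g, abs(c))
        if g == 1:
            break
    return g
```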
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Here are comments from the TOMS refereeing process; some items have
been incorporated in the text, e.g. typo fixes, and perhaps other
corrections.
The original tex as sent to ACM-TOMS is stored separately.
Subject:
TOMS submission(s)
From:
George Labahn
Date:
Thu, 22 Apr 2004 20:46:26 -0400 (EDT)
To:
Richard Fateman
Hi Richard,
I will be handling the editorial duties for your latest TOMS submission.
As for your first submission I have now gotten four reviews back and am
forced to reject the paper. Just to be fair I also ran the reviews by
the Editor in Chief and he basically agreed with referees 2 and 3. Indeed
his comment was:
In spite of the fact that the idea is not new, a paper on this topic could
indeed be publishable, but only if it was a more complete and systematic
study (e.g., using several different systems, using a substantial test
suite, determining cutoff points in sparsity and size, etc.) In fact, a
polyalgorithm that was self-tuning and optimal in each case might well be
very useful.
For me I agree that it needs more depth for TOMS. ISSAC might work because
the computer algebra community would be quite interested in revisiting the
idea. But it might also have the same objections - the CA community still
seems to want ISSAC to be a mini journal for computer algebra.
Sorry for the news.
George
The four reviews:
=== review 1 ===
Review of "Can you save time in multiplying polynomials by encoding them
as integers?" by Richard Fateman, submitted to ACM Transactions on
Mathematical Software.
For author and editor
=====================
The paper discusses the efficiency of polynomial multiplication by
evaluating the inputs at a suitable power of two and using fast
integer multiplication. Sample code and benchmarks in Macsyma and
Common Lisp are provided. The author's main conclusion is that this
approach is useful for large dense univariate polynomials.
I like the idea of re-using known efficient code as much as possible
instead of re-inventing the wheel over and over, and I tend to agree
with the conclusion for a general purpose system; for specific
applications, a carefully optimized hand-coded implementation will
very likely be faster.
The idea itself is not new. (Interestingly, it is somewhat paradoxical,
since fast integer multiplication a la Schoenhage & Strassen in turn
reduces to multiplication of polynomials, at least conceptually.) The
main original contribution of the paper is the evaluation together
with the benchmarks. I'm not sure about the relevance of the
benchmarks, though, and I would expect that the results may differ
significantly in a C/C++ environment. It would be interesting to see a
comparison with Victor Shoup's NTL (which is also based on GMP). Why
re-use only the basic operation of integer multiplication, but not
carefully optimized univariate polynomial multiplication as well?
The paper is clearly in the scope of the journal. It is well written,
and the style of presentation is clear. I did not review the source
code. I found about 10 typos, listed below. Relevant literature is
being cited, although I'm missing references to Karatsuba's and
Schoenhage & Strassen's work.
Detailed remarks and corrections
--------------------------------
- General remark: The Kronecker substitution method x <-- 2^n can be
viewed as a single modulus (x - 2^n) modular algorithm.
- p 1, Abstract, l 3: Typo: superfluous "its"
- p 1, Section 1, l 4: Typo: the second author of [7], Jurgen Gerhard,
is missing
- p 1, Section 1, l -4: Typo: superfluous "choice"
- p 2, 1st paragraph: Remark: This level of abstraction may be
somewhat too strong.
- p 2, 2nd paragraph: Typo: ungrammatical sentence "... our programs not"
- p 2, last paragraph of Section 1: I don't understand the relevance to the
problem of the remarks in this paragraph.
- p 4, 3rd paragraph: Two typos: "called" instead of "Called", and the
correct spelling is "MuPAD" (only the "u" should be lowercase)
- p 4, last paragraph: These remarks are true for all kinds of modular
algorithms.
- p 5, Section 4: Suggestion: make this subsection 3.4
- p 7: Typo: Missing comma between "integer" and "but" in line -3
before Section 6
- p 8, Section 8: Possible typos, not sure whether they were part
of DL's email (in which case I would consider correcting them anyway):
The "We" in the first line should be lowercase, and "fft" in the
second last line of the first paragraph should be all caps.
Moreover, closing quotes are missing at the end of the paragraph.
- p 8, Section 8: Indeed, this approach is discussed in Section 8.4
of [7].
- p 16, l 12: Typo: should be max(n,m)^1.585 instead of ^1.5
== review 2 ===
Review of
Can you save time in multiplying polynomials by encoding them as integers?
by Richard Fateman.
There exists a number of extremely efficient libraries for handling arithmetic
for large integers. These libraries have the advantage of being optimized for
both various architectures and for choice of fast algorithms. Taking a practical
point of view, the author proposes to construct univariate polynomial arithmetic
on top of large integer arithmetic using an evaluation/decoding type of
mechanism. The univariate integer polynomials are converted into large integers,
the arithmetic operations are then done via large integer operations using
the fast, tuned libraries and then the final results converted back into
coefficient/power form. The primary focus is on polynomial multiplication.
In the paper itself, the first two sections are devoted to showing how the
conversion into integer data is done, section 4 looks at the case of polynomials
over finite fields while section 5 discusses some timings. The main results
appear to be in this last section.
My one big complaint is that, after reading the paper, I have no idea if the
author has answered his own question. For one thing, the paper only uses one
example - (x + 1)^1000 to base its conclusions (or at least this was the
impression I got reading the paper). The author admits at the start of section
5 that a conventional sparse representation of polynomials has some important
advantages and so points out that one would only do a conversion to integer
data for some arithmetic operations, for example univariate polynomial
multiplication. In this case one has the issue that small problems now run
much slower. Where has the system gained? Where is the cut off? What about the
fact that this forces a dense representation of polynomial data?
The paper is written in a very informal way. This makes reading the introduction
quite interesting and is even fine for section 2 and 3.1. However, this
informality becomes quite unsatisfying when we get to the technical nuts and
bolts of the paper, particularly section 5, since the study itself does not seem
to have been done in any formal way. At the end of the day the entire paper seemed
incomplete, an exercise more than a study. I would say that this would be fine
for a conference where initial ideas or small studies are given. However it
does not seem to be (at least for me) enough for a journal paper.
===== review 3 ====
RE: Can you save time in multiplying polynomials by
encoding them as integers? by R. Fateman.
Given two polynomials f(x) and g(x) in Z[x], the paper investigates
the efficiency of the ``single point integer evaluation/interpolation''
method for computing f(x)*g(x) by multiplying two integers f(z)*g(z)
where z is an integer > (1+deg(f)+deg(g))*maxnorm(f)*maxnorm(g)) using
GMP arithmetic, which has an FFT based integer multiplication algorithm,
then recovering f(x)*g(x) using single point interpolation.
The idea is that since a lot of work has been spent in optimizing the
long integer multiplication algorithm, even for specific architectures,
we can exploit this with little effort for polynomial operations in Z[x]
(and Z[x,y] too) rather than trying to code an FFT based
polynomial multiplication method.
The paper is nicely written and conveys the idea clearly. The idea is a
good idea but it is not new. It was exploited on a non-trivial problem
by Char, Geddes and Gonnet in [1] as mentioned in the paper, but also later
analyzed by A. Schoenhage in "Probabilistic Computation of Integer Polynomial
GCDs" (J of Algorithms, Vol 9, 1988) and extended by Geddes, Gonnet and
Smedley to operations on polynomials over Q(alpha) a number field with one
field extension in "Heuristic Methods for Operations With Algebraic Numbers"
(ISSAC '98 proceedings). This latter paper focusses on GCD computation but
also shows how to do polynomial trial division. There are at least two other
papers in the computer algebra literature on this subject; one by Davenport
et. al., and the other by Smedley (both in ISSAC proceedings).
Reading through the paper: In Section 2 Multiplication via the single point
interpolation algorithm is described, then described again in section 3.1
Negative Coefficients where code is presented. But this algorithm is already
described in [1] with a proof and in the text Algorithms for Computer Algebra
by Geddes, Labahn and Czapor. The algorithm is not new and should be
properly referenced. Section 3.3 mentions some other operations, citing [1],
mentioning that the algorithm in [1] is used in Maple and MuPAD (it is also
used in Axiom) and ends with a ``suggestion'' that the idea could be used to
compute determinants of matrices of polynomials in Z[x], but, nothing is
actually demonstrated in this section.
Section 5 gives ONE benchmark. For the problem being tested, squaring
p=(x+1)^1000, we have 0.8 seconds for the integer based method, versus 3.58
seconds for a simple polynomial representation and a quadratic multiplication
algorithm (the timing where GMP arithmetic is also used for the coefficient
arithmetic for a fair comparison). This is a factor of 5 improvement.
I actually think the paper would be of some interest to the reader simply
because it is a nice idea and perhaps should be further exploited. But, as
an original research paper, this paper does not properly reference the
literature and has far too little original content; basically, a single
benchmark comparing a known idea applied to a basic problem (polynomial
multiplication) demonstrating a factor of 3.58/0.8 = 5 improvement. Thus I
conclude that the paper does not warrant publication in TOMS as an original
research paper. It is more of a "look, this idea might have potential" paper.
It is more a "since we now have GMP arithmetic, maybe we should take
another look at this single point interpolation method" because we can
use it to, with relatively little effort, get an asymptotically fast method.
I suggest that the author submit it to the SIGSAM bulletin where it will have
greater interest.
The paper would be stronger if there was more than one benchmark and
more than one problem investigated, e.g. polynomial trial division in Z[x]
or sqrt in Z[x] (NB: GMP has an asymptotically fast integer long division
and sqrt but no asymptotically fast integer gcd I believe).
==== review 4 ====
Review of: Can you save time in multiplying polynomials by
encoding them as integers? by R. Fateman.
This paper evaluates the use of large integer ("bignum") representations
for polynomials in order to take advantage of highly-tuned integer
multiplication routines in libraries such as GMP potentially to speed up
polynomial multiplication.
I am somewhat concerned about the depth of the paper's contribution, as
the techniques it describes are already quite well-known (the author
admits as much). It does contain some extensions to what is already
described in the literature, but I find those to be fairly trivial. The
primary contribution revolves around empirical analysis of the
performance of integer representation-based multiplication versus
traditional algorithms. I think the exposition thereof could be improved
-- the paper gives a two-page textual description (cf. section 5) of
results that could be better placed in a table or graph.
Section 3.2 could be fleshed out somewhat better -- a proper bound on
the size of the integer modulus would be useful. Also, there are some
passing references to using a very small modulus (say, multiplication of
two polynomials over a small finite field) and thus packing multiple
coefficients into the same machine word. There does not, however, appear
to be much empirical analysis of this case.
...............
more notes, RJF 1/18/06
For trial division, what one needs is a bound on the size of coefficients
in a factor of a polynomial.
This is provided by
http://mathworld.wolfram.com/Landau-MignotteBound.html
where the Landau-Mignotte
bound computes
B = binomial(d-1, floor(d/2)-1) * |lc(P)|
+ binomial(d-1, floor(d/2)) * twonorm(P),
with
d = floor(deg(P)/2)
|lc(P)| = absolute value of the leading coefficient of P
twonorm(P) = sqrt(sum of squares of coeffs in P).
B gives
an upper bound for the absolute value of the coefficients of any nontrivial
factor of P, a univariate polynomial in Z[x].
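A Python sketch of computing this bound. The |lc(P)| factor on the first
binomial follows the usual textbook statement of the Mignotte bound; the
two-norm is rounded up so B stays a true bound, and deg(P) >= 2 is assumed:

```python
from math import comb, isqrt

def mignotte_bound(p):
    # p: integer coefficients of P, low order first, p[-1] != 0.
    # Bound on |coefficients| of any factor of degree <= floor(deg(P)/2).
    d = (len(p) - 1) // 2
    s = sum(c * c for c in p)
    twonorm = isqrt(s)
    if twonorm * twonorm < s:       # round the square root up
        twonorm += 1
    lc = abs(p[-1])                 # leading coefficient of P
    first = comb(d - 1, d // 2 - 1) * lc if d >= 2 else 0
    return first + comb(d - 1, d // 2) * twonorm
```

For example, for P = (x+1)^6 the bound is 63, comfortably above the largest
coefficient (20) of its degree-3 factor (x+1)^3.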
......
more benchmarks, august 19,2010
:cd c:/lisp/mul
;; for pentium 4
:ld p4-gmp.dll
:ld gmp-gen.fasl
:ld just3GMP.fasl
... set up examples e.g.
(setf lones '(#(0 1 2 3 4 5 6 7 8 9) . #(1 1 1 1 1 1 1 1 1 1))) ;; small example
(time (setf l2 (mul lones lones)))
(time (setf l4 (mul l2 l2)))
(time (setf l8 (mul l4 l4)))
(time (setf l16 (mul l8 l8)))
(time (setf l32 (mul l16 l16)))
:ld 4f.fasl ;; get fft
:ld min5karat.fasl ;; get karatsuba/ using "unbalanced" hack
#| Some timing data 8/14/10, pentium 4 (times averaged over many runs)
Timing on a Pentium D CPU, 2.99GHz, 3.0GB of RAM, Microsoft Windows XP Professional SP3
Allegro Common Lisp 8.0
ones is a polynomial of all 1's coefficients, dense to degree 999.
one2 is ones X ones, and is of degree 1,998
(mulfft3 one2 ones) is 12.5 ms floating point FFTW. (limited coef size)
(m-ks one2 ones) is 25.0 ms karatsuba, karatlim 50, degree 50 switches to mul-dense
(mul one2 ones) is 12.5 ms kronecker, GMP 4.1.4
(mul-dense one2 ones) 75.0 ms dense multiplication.
for one2 X one2 i.e. degree 1998 X 1998
fft: 15.6 ms
m-ks: 79.7 ms
mul: 25.0 ms
mul-dense 429.7 ms
for ones X ones i.e. degree 999 X 999
fft: 6.0 ms
m-ks: 12.8 ms
mul: 6.0 ms
mul-dense 37.5 ms
consider L32 X L32 , max coefficient 2443553412892220489195278947230, degree 288 each
fft: 12.2 ms
m-ks: 23.0 ms
mul: 9.0 ms
mul-dense 44.5 ms
consider L32 X L8 degrees 288 and 72. max coeff of L8 is 4816030
fft: 12.2 ms ;; float approx only
m-ks: 16.6 ms
mul: 3.9 ms
mul-dense 21.1 ms
SMALLER SIZES
for lones X lones, degree 9 X degree 9
fft: .1641 ms
m-ks: .0172 ms
mul: .0750 ms
mul-dense .0156 ms
for L8 X L8 degree 72 X 72
fft: .1250 ms
m-ks: .1125 ms
mul: .0656 ms
mul-dense .1329 ms
for L4 X L4 degree 36 X 36
fft: .453 ms
m-ks: .062 ms ; just calls mul-dense
mul: .219 ms
mul-dense .062 ms
Unbalanced
for L4 x L32
fft: 57.8 ms
m-ks: 9.8 ms
mul: 3.6 ms
mul-dense 10.2 ms
Conclusion: Kronecker subst is pretty good except for small examples
(degree 36 X 36, 36 X 72). Earlier we did not find Kronecker to be so
advantageous, but prompted by comparison with FLINT (courtesy Bill
Hart) we rewrote programs for coding/decoding. It may also be relevant
that we here are using a specific pentium 4 GMP library.
There may be prospects for improved speed from upgrades to the GMP
library or its fork MPIR, especially for recent improvements for
unbalanced multiplication. It may also be possible to use Harvey's
suggestions on using 2 or 4 multiplications and disentangling the
overlapping results. Also, Allegro Common Lisp is now at version 8.2.
These results include times to convert inputs from and to a
representation which is a pair of arrays: the exponents and the
coefficients. Compared to FLINT using Kronecker substitution, the
time spent "in GMP multiplication" should be identical.
|#