%% begun, May 1. now May 5, 2005, and May 17, 2005. RJF
%% again, July 12, 2005 RJF
%% looked at briefly on Jan 17, 2006 RJF
%% and yet again, March 2, 2006
%%% comments from ISSAC 2006 reviewers at end.
%% no substantial change since March 2006.
%% again, June, 2010, July, 2010
\documentclass{article}
\usepackage{fullpage}
\title{When is FFT multiplication of arbitrary-precision polynomials practical?}
\author{Richard J. Fateman\\
University of California\\
Berkeley, CA 94720-1776}
\begin{document}
\maketitle
\begin{abstract}
{It is well recognized in the computer algebra theory and systems
communities that the Fast Fourier Transform (FFT) can be used for
multiplying polynomials. Theory predicts that it is fast for ``large
enough'' polynomials.
We set out some benchmarks for polynomial multiplication to see when the FFT
is practical. We also suggest an alternative version of FFT
polynomial multiplication using floating-point numbers that has some
nicer properties than the typical finite-field ``Number Theoretic
Transform'' approach. In particular, in spite of the fact that a
floating-point FFT gives {\em approximate} answers, we can
nevertheless use it to produce EXACT answers to a class of polynomial
multiplication problems for arbitrary-precision coefficient
polynomials. A sufficiently accurate computed coefficient, namely one
whose error is less than 0.5, rounds to the exact integer answer.
We also discuss how degree, size of coefficients,
and density affect the notion of ``large enough'' to warrant the
use of FFT. }
\end{abstract}
\section{Introduction and Some History}
In 1981 we wrote about inter-language communication to improve
performance of a computer algebra system written in Lisp, but a Lisp
system in which one could easily call programs written in C. In
particular we illustrated the use of an FFT for polynomial
multiplication\cite{fftvsC}. The examples given were the squaring of
$(1+x)^n$ for various values of $n$. The three programs compared were
an implementation of FFT in Lisp, the default Macsyma Lisp code, and a
third program ``hsCmult'' which was an implementation of a
conventional $O(n^2)$ multiplication program written in C (so-called
high-school method). Our conclusion was that the standard Macsyma
code, oriented toward sparse polynomials, was slowest on these
tasks. We expected the FFT to be good for larger $n$, but the hsCmult
was up to several times faster for $n<32$, and continued to be good
for $n<128$ or more. Finally, for all $n \ge 256$, the FFT routine
was better. Since these routines were timed computing the products of
equal-sized, totally dense polynomials in a single-word-size finite
field, the tests constituted a domain in which the FFT should have
excelled.
The conclusion: the naive program was a winner for the kinds of problems we believed
were most common, including all small sizes.
Twenty-four years later (2005), we were prompted by statements made in
the literature praising the FFT as a {\em practical} program for
polynomial multiplication to re-examine
implementations using the FFT. This resulted in
an unpublished report suggesting it was not so great\footnote{Our draft 2005 paper was not received kindly by
ISSAC referees, but it sat in an online file for years.}.
Then in 2010 we revisited the issue, since we had another
motivation to bring up an FFT library, and we wanted it to be a fast one.
\section{The FFT and Speed}
{Although polynomial multiplication via FFT ``should be'' fast based on asymptotic complexity
analysis, it appears that no major {\em general-purpose} computer
algebra system uses the FFT as a general default technique for this
purpose. Some more specialized systems\footnote{in particular, MAGMA,
NTL} may use an FFT method, and others may provide (optional) access
to FFT techniques, and GMP and other systems probably will use an FFT
method for long integers (which is tantamount to polynomial multiplication,
suitably encoded).
It is possible that CAS designers are making a rational decision
against using FFT, or it may be that the appeal of the FFT is limited
to the implementors of those systems typically specializing in
univariate, dense polynomials over finite or infinite fields. Some
implementors may assume the FFT is good enough, in practice, even for
smallish problems, but that such problems are deemed to be infrequent and the time
devoted to them inconsequential anyway. A possible alternative view
for a system design (which we endorse) is that a CAS should make a
quick estimate of which algorithm would be fastest, and choose that,
just as purveyors of super-long-integer multiplication routines
compute various cross-over points for one technique or another. }
We described, in a recent paper \cite{polysbyGMP}, our experiments
using a kind of shorthand method for packing coefficients of
polynomials into long integers or ``bignums'' and using efficient
integer library routines to do polynomial multiplication. The
algorithm consisted in scanning the coefficients and with a simple
linear-time computation finding a bound on the coefficients in the
answer. This bound provides a ``spacing'' for packing and unpacking
coefficients. We then called the ``bignum'' multiply on the packed
numbers, and unpacked the result. This technique is sometimes called
Kronecker substitution, and is really only attractive if
you have very efficient bignum multiplication.
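The packing scheme just described can be sketched in a few lines of pure
Python; this is an illustration, not the Lisp-plus-GMP implementation, with
Python's built-in bignums standing in for GMP. The particular coefficient
bound and the restriction to nonnegative coefficients are our simplifying
assumptions:

```python
def kronecker_multiply(p, q):
    """Multiply polynomials given as coefficient lists (low degree first)
    by packing them into big integers, multiplying once, and unpacking.
    Simplifying assumption: nonnegative integer coefficients."""
    # Bound on any coefficient of the product: the largest coefficient
    # product times the number of terms that can collide at one position.
    bound = max(p) * max(q) * min(len(p), len(q))
    bits = bound.bit_length() + 1          # "spacing" between packed slots
    pack = lambda c: sum(a << (i * bits) for i, a in enumerate(c))
    prod = pack(p) * pack(q)               # the single bignum multiply
    mask = (1 << bits) - 1
    out = []
    while prod:                            # unpack the answer, slot by slot
        out.append(prod & mask)
        prod >>= bits
    return out

# (1 + x)^2 = 1 + 2x + x^2
assert kronecker_multiply([1, 1], [1, 1]) == [1, 2, 1]
```

The packing and unpacking are linear-time scans, as described above; all the
real work happens in the one big-integer multiplication.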
In our experiments we used the GMP \cite{gmp} (GNU Multiprecision
number package) library routines. This package uses various
tricks, including FFTs, for bignum
arithmetic when it deems them to be appropriate. We found some cases
\cite{polysbyGMP} in which such an apparently roundabout method was
indeed faster than the most obvious classical method, a method which
takes $O(n^2)$ coefficient operations for two polynomials of degree
$n-1$. Note that the coefficient operations are not constant-time: they depend
on the lengths of the coefficients, and the Kronecker substitution
technique is the only one that can take substantial advantage of really small coefficients:
in this (perhaps unusual) case we can effectively pack more than one coefficient in a word. More often than not,
though, we expect the spacing between coefficients to be at least a full word.
In 2005, we pursued the situation a bit more: Just because a new
program runs faster than one other program does not mean it is the
program of choice. We went back to our problems and thought about the
other clever polynomial methods, none of them in common use in CAS,
but that may have been because the designers were unwilling to
optimize for the same kinds of problems we were forced to accept in
order to implement the bignum-GMP-FFT version. (Univariate, dense,
similar-size inputs). In fact, with a little programming effort we
could substantially improve on the classical method, and push the
crossover point where the FFT finally is advisable out some distance.
In particular, our results in 2005 showed that a ``Karatsuba'' style
multiplication was a much better competitor for any but rather small
cases, and therefore it should be compared to the FFT implementation
we had at hand.
More recently we found that, by hacking on the obvious classical program
some more, or perhaps by using a more sophisticated compiler, the program
improved. We were also now willing to use an FFT library (FFTW)
at hand. For these reasons it appeared appropriate to run some
new benchmarks.
In brief: the Karatsuba version faded from prominence, remaining the program
of choice in only a narrow region; the FFT program showed considerable strength,
but a dense $n^2$ algorithm was hard to beat for a major chunk of problems.
Looking back at comments from reviewers of the 2005 paper, we see that
we followed their suggestion that we implement the FFT ourselves on
the coefficients of the polynomial, rather than packing them into
bignums.\footnote{These and other suggestions were of the kind sometimes referred
to as ``Sure, your program works in practice, but does it
work in theory?''}
Is this faster? In practice, somewhat. Yet the point of that earlier paper
{\em was to avoid writing and tuning a bignum polynomial FFT, and to just
use someone else's carefully tuned program for integer multiplication.}
%In practice, we would be testing our
%program against the prowess of the GMP programmers who may have tuned
%their FFT using pipelined assembly code, etc. We would have to
%do a careful balancing act to try to come up with optimization setting
%of a compiler relative to particular machines and problems.
\section{Which FFT program to use? and what else to compare to it?}
We've written some FFT programs; frankly it is tedious to get all the
details right to run {\em as efficiently as possible}, and an endless
task to continue to refine such programs in the face of changes in
computer architecture and memory implementation, compiler software,
and the occasional theoretical
advance\footnote{http://www.fftw.org/benchfft/}. Compounding this is
the realization that an FFT-based program is, according to both theory
and practice, {\em expected to work poorly for what is (we speculate)
probably the common case in computer algebra systems: sparse
polynomials of modest degree in many variables}. This may be the
reason system builders are reluctant to use FFT routines.
This paper is not going to describe in detail the fast multiplication
algorithms or FFT; we presume the reader has some familiarity with
them. For descriptions, see for example, Gathen and Gerhard \cite{vzg},
or (especially) Knuth \cite{knuth}.
In the list below, the algorithm we call Toom is the second of a
possible sequence of variations of polynomial multiplication that can
be devised by dividing the multiplication into subproblems
\cite{knuth}. Our Toom divides each input in three, and uses five
multiplications of one-third the size. Its asymptotic growth rate for
multiplying polynomials of degree $n$ is about $O(n^{1.4649})$. The
better-known Karatsuba, which is Toom with ``$r=1$'' \cite{knuth},
divides each input in two, and uses 3 multiplications of half the
size. Its asymptotic growth rate for multiplying polynomials of
degree $n$ is about $O(n^{1.5849})$.
{For exact polynomial multiplication of two polynomials of degree $n$,
one variable, small ``fixnum''
integer coefficients in the input and output, we
can arrange these algorithms from least (asymptotically) efficient to most efficient this way:
\noindent
1. Classical ($O(n^2)$) \\
2. Karatsuba ($O(n^{1.58})$), dropping to Classical for small problems\\
3. Toom ($O(n^{1.46})$), dropping to Karatsuba for small problems\\
4. FFT ($O(n \log n)$).} \\
In this list we are neglecting to add an additional (asymptotically dominant!)
term which arises if one has a sparse data structure for the polynomials
and if one must consider the possible costs for sorting the resulting
coefficients in the answer. In this case the cost
may involve an additional log term, whereas in the case of a dense representation,
typically the terms are already sorted, and the asymptotic complexity
is as indicated above.
To elaborate for a moment, in the case of an algorithm oriented
towards generating terms for a sparse answer, the number of
exponent-comparison operations needed to sort the results will grow
at a faster rate than the coefficient-arithmetic operations. Whether
the total costs of exponent comparisons and the accompanying
rearrangements for sorting in fact are a significant component of the
actual cost, or are dominated by coefficient arithmetic costs depends
on the details of the algorithm and the test data.
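Item 2 of the list above (Karatsuba, dropping to classical below a cutoff)
can be sketched compactly. This is a generic pure-Python illustration, not
the Lisp implementation that was benchmarked, and the cutoff value is an
arbitrary tuning assumption:

```python
from itertools import zip_longest

def classical_multiply(p, q):
    """O(n^2) schoolbook product of coefficient lists (low degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def karatsuba_multiply(p, q, cutoff=32):
    """Karatsuba: 3 half-size products instead of 4, dropping to the
    classical method for small problems.  cutoff is a tuning guess."""
    if min(len(p), len(q)) <= cutoff:
        return classical_multiply(p, q)
    n = max(len(p), len(q))
    P = p + [0] * (n - len(p))            # pad to a common length n
    Q = q + [0] * (n - len(q))
    m = n // 2
    p0, p1 = P[:m], P[m:]                 # P = p0 + x^m * p1
    q0, q1 = Q[:m], Q[m:]
    low  = karatsuba_multiply(p0, q0, cutoff)
    high = karatsuba_multiply(p1, q1, cutoff)
    mid  = karatsuba_multiply([a + b for a, b in zip_longest(p0, p1, fillvalue=0)],
                              [a + b for a, b in zip_longest(q0, q1, fillvalue=0)],
                              cutoff)
    # result = low + x^m (mid - low - high) + x^{2m} high
    out = [0] * (2 * n - 1)
    for i, c in enumerate(low):
        out[i] += c
        out[i + m] -= c
    for i, c in enumerate(high):
        out[i + 2 * m] += c
        out[i + m] -= c
    for i, c in enumerate(mid):
        out[i + m] += c
    return out[:len(p) + len(q) - 1]      # trim the zero padding
```

Note the representational point made below: the scheme wants polynomials that
can be easily split in two, and mismatched input sizes force padding.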
The Kronecker method, mapping to a bignum and a consequent integer multiplication
(whatever GMP chooses, which might be FFT, Karatsuba, or something else), is yet another approach.
It works, but it is rarely faster than FFTW, and when it is faster, not by much.
On the other hand, it is faster than the $O(n^2)$ method and, much more
importantly, gets the fully accurate answer. For example, multiplying dense degree-500
polynomials with 332-bit coefficients takes 900 ms for the dense method
and 470 ms for the Kronecker method, but only 15 ms for the FFT; this last method,
however, uses floating point, and its coefficients are not fully accurate. Truly competing
would require doing the FFT about 13 times on 15-bit modular images and interleaving
the results with the Chinese Remainder Algorithm, and that might take
close to 470 ms.
\section{Recent claims}
E. Kaltofen, in a talk at ECCAD 2005, Ashland, Ohio, suggested that
the FFT for polynomial multiplication was {\bf quite practical even
for modest size polynomials} and cited a paper by Victor Shoup,
\cite{shoup95}, in his assertion that even polynomials of degree less
than 100 could be multiplied faster with FFT. Quoting that paper:
\begin{quote}
``In all of the known factoring methods, a key operation is
multiplication of polynomials over ${\bf F _p}$. We shall assume that
multiplication of two degree $d$ polynomials uses $O(M(d))$ scalar
operations, where $M(d)=d \log d \log \log d$. This running-time bound
is attained using the Fast Fourier Transform or FFT (Aho {\em et
al.} 1974), and (as we shall later see) is quite realistic in
practice.''
\end{quote}
As we see, in Shoup's paper\footnote{Incidentally, a very nice
blending of theory and implementation}, the task being considered in
this paragraph is polynomial multiplication {\em in a finite field,}
not over the integers. This changes the ground-rules {\em
substantially} since it eliminates algorithmic issues that must be
addressed if there is the potential for unanticipated large
coefficients in the field. In Shoup's careful tests, his FFT becomes
better than a traditional multiplication at degree 30. At degree 511,
with 332-bit coefficients, Table 4 of \cite{shoup95} shows that the FFT
is some 24 times faster than Shoup's own implementation of classical
arithmetic.
This is impressive. But as we have said, for the FFT to be
``practical'' at a given size and with a given coefficient bound it
should be the fastest of all methods, not merely faster than some
other\footnote{We have already given the FFT an enormous advantage
by asking it only to do single-variable dense cases; sparse or
multivariate programs are probably inefficient, viewed as a function
of the number of non-zero terms}.
In our own earlier (2005) implementation in Common Lisp, using an
off-the-shelf Lisp program for the FFT, we found that at size 512
(i.e. degree 511), the FFT does not win versus using a Karatsuba or
Toom (\cite{knuth} 4.3) algorithm or even what we thought of as a
carefully-coded Lisp classical version, {\em optimized for use with
fixed-precision arithmetic}. How could it be that Shoup sees a
factor of 24X in favor of FFT, and for us it is slower than the
competition? In the details of different computers, different
implementation languages, and different versions of arithmetic
libraries, it was difficult to make a comparison that was clear enough
to the referees of that draft paper.
For other reasons unrelated to this multiplication task, we took up the
task of linking a carefully tuned FFT program library (written in
C) to Lisp, in 2010. Using this FFT library (FFTW), for which the
authors claim very high performance, we find that for dense $100 \times 100$
problems, the fastest method is Karatsuba, some 1.4X faster than the
FFT. For a 100-term $\times$ 100-term problem with average spacing 7.5 between
nonzero terms, a simple dense method is about 2.4X faster than the
FFT.
So our conclusions are more or less consistent with our own paper of 1981,
and somewhat at odds with our colleagues' views.
%To see if we are just not using a good enough FFT, we have also tried
%to interface NTL to Lisp\footnote {these programs are posted on the
%author's web site ``papers/lisp2ntl''directory.}. If we use NTL, we
%cannot be accused of using a poor implementation of the FFT: we are
%using Shoup's, the same one cited previously. That is, we are timing
%calls directly to NTL's polynomial multiplication routines.
There are many peripheral issues that matter substantially in
assessing costs: for example how to account for
the time and space needed to coerce a polynomial into the form needed
for the FFT programs to work. This comes into play in three contexts
below:
\begin{enumerate}
\item Perhaps we must translate (small) integers to complex
double-float machine numbers (cheap), or translate
(arbitrary-precision) integers into high-precision quad-double (QD)
software-implemented floating point (not so cheap), or into
arbitrary-precision floating point (also not so cheap). However, these are
all linear-time costs.
\item Perhaps we must translate between representations: how are the
polynomials initially presented, and how is the result delivered? For
example, we can use a presentation that consists of a list of
exponents and a list of corresponding coefficients. This is
convenient for, say, $3x^{10}+45$, encoded as the exponent list (0, 10)
and the coefficient list (45, 3). If we
believe that all or most of the coefficients are nonzero, we can use
a dense array, e.g. (45 0 0 0 0 0 0 0 0 0 3), which for this
particular polynomial looks wasteful, but for others is quite
efficient. In particular, even though the time taken to make this
conversion is again linear in the size of the polynomial, it can
take half as long as the multiplication itself.
\item For the FFT algorithm, the input polynomials $p$, $q$ probably
{\em each} need to be copied to arrays $p'$ and $q'$ of length
size($p$)+size($q$), perhaps rounded up to a power of 2 (actually, the
FFTW software does not require this). Furthermore, the coefficients
must be converted to floats. For the Karatsuba algorithm, a quite
different representation is convenient for programming, a
representation where one polynomial can be easily partitioned into
two, while also considering the necessary padding to match the size
of the other multiplicand.
\end{enumerate}
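The two presentations in item 2 can be illustrated with a pair of
linear-time conversions. This is a pure-Python sketch using the
$3x^{10}+45$ example from the text; the function names are ours:

```python
def sparse_to_dense(exps, coefs):
    """Convert (exponent list, coefficient list) to a dense array,
    e.g. ([0, 10], [45, 3]) representing 45 + 3x^10."""
    out = [0] * (max(exps) + 1)
    for e, c in zip(exps, coefs):
        out[e] = c
    return out

def dense_to_sparse(dense):
    """Keep only the nonzero terms, returning the two parallel lists."""
    pairs = [(e, c) for e, c in enumerate(dense) if c != 0]
    return [e for e, _ in pairs], [c for _, c in pairs]

dense = sparse_to_dense([0, 10], [45, 3])
assert dense == [45, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
assert dense_to_sparse(dense) == ([0, 10], [45, 3])
```

Both directions are single scans, which is why the text can meaningfully
compare their cost to the multiplication itself.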
(MUST REDO EXAMPLES -- new computer, new software 2010)
(From 2005) First we pick a single example and examine times for
multiplying polynomials of size 512 with small coefficients, namely
all 1's. We think this is a large problem in terms of degree, though
hardly the largest we can accommodate, and hardly the largest that some people suggest is
worth testing\footnote{see Appendix 3}.
The coefficients are small, but
the results accumulate in one direction; if the coefficients were of
mixed sign there would be a chance for cancellation and smaller
results.
Notes:
Converting from a pair of arrays (sparse) to a single dense array
takes about 25 ms. (per polynomial). This includes the time to
allocate the arrays. Here's a trick, though. If you know you are
going to use the array(s) temporarily, pre-allocate working arrays,
enlarging them if you need to, and copy data into / out of the working
arrays, ultimately storing the results, as needed.
The naive dense multiplication has an advantage in that no padding is
necessary (as with FFT or Karatsuba). Note that the padding for FFT
depends on {\em both} inputs, and thus it is not possible to
``pre-digest'' the individual polynomials.
FFT methods using double-floats have potential issues regarding
accuracy. Although these can be overcome, it adds to the complexity
and the cost.
?? How did fftw do it so fast?? I think by copying into a pre-allocated array.
OLD DATA
\noindent
{\begin{table}[h!]
\begin{center}
\begin{tabular}{|l|r|l|}
\hline
Method& Time (ms)& Notes\\
\hline
FFT & 3.1 &over double-floats, including conversion from ints\\
\hline
FFT & 267 & over QD, including conversion from ints\\
\hline
Karatsuba & 2.3 & \\
\hline
Classical & 2.6& \\
\hline
NTL (in C) & 7.8 & (jumping to 12.1 ms if size is 513)\\
\hline
\end{tabular}
\end{center}
\end{table}}
\begin{verbatim}
(new data)
on Pentium D 3.00GHz 3GB RAM, 511X511 degree, all ones, All programs except for fftw library
written in Allegro CL version 8.0
June 23, 2010 RJF
fftw 2.18 ms (interface hacked mercilessly, not preconverted!!)
fftw 1.09 ms (omitting the fft, including data conversion and product only)
mulfftd 1.76 ms (already converted to array of double floats!, using fft!!)
mul-dense 20.47 ms (classical, from muldense.lisp, starting sparse)
polymult 24.20 ms (classical, from newmultpa.lisp, pre-converted to arrays) ??
mul-karat 64.69 ms
karat 11.72 ms (data pre-converted to arrays). so this conversion is about 53 ms.
raw,old fft 2.9 ms (already converted to array of coefs, but not floated.) so add 53ms. (PA2D is 25.5 ms. X 2)
polymultd 47.21 ms (already converted to array of double floats!)
REVISE, see good-data-summary.txt
how about 512 X 11? polymultd is .96 ms, mulfftd is 1.28 ms
how about 490 X 10? .94 ms 1.56 ms. ; could save time rounding up to power of 2.
how about 512 X 256 2.26 ms 1.56 ms
how about 512 X 512 46.61 ms 1.87 ms
............... introducing sparseness, switching N^2 algorithm and representation to mul-dense ............
512 X 512 32.64 ms 1.87 ms
512 X 512 with 3/4 zero 8.75 ms 1.87 ms ; one argument is dense, the other 1 0 0 0 1 ..
512 X 512 with 7/8 zero 4.53 ms 1.76 ms ; one argument is dense,
512, 7/8 zero, squared 0.46 ms 1.87 ms ; both arguments are 1, 7x0, 1, ...
512 X 512, 15/16 zero 2.26 ms 1.76 ms ; one argument is dense
512, 15/16 zero, squared 0.29 ms 1.87 ms ; both arguments are 1, 15x0, 1 ...
\end{verbatim}
Note the rigidity of the FFT time. Its running time is predictable from
the degree of the output. It does not change if the inputs
are of dramatically different sizes. It does not change
if most of the coefficients are zero. This is in marked contrast to
(some implementations of) the standard $O(n^2)$ method. Carefully
programmed, this can excel even for squaring a degree-512 polynomial
of sufficient sparsity.
In terms of complexity, the FFT cost is $O(N \log N)$
where $N$ is the maximum of the {\em degrees} of the inputs,
and thus it is a relatively poor algorithm for an $n$-term by $m$-term problem if
$m$ is asymptotically smaller than $O(\log n)$, whatever that might
mean in practice for finite examples. Note also that the number of
{\em terms} in a sparse polynomial may be much smaller than the {\em
degree}.
Now consider in more detail the alternative ``obvious'' algorithm, for
sparse inputs. Consider the last of these examples, squaring a
polynomial $p$ that has only 32 terms, but is of degree 496. It looks
like $1 + x^{16}+x^{32}+ ...+x^{496}$. If we represent $p$ as
denserep = (1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 ....) and multiply, it takes about 2.15 ms. If we
represent $p$ as sparserep = ((0 16 32 ...) (1.0 1.0 1.0 ...)), it
takes only 0.29 ms. That is, if we step along only the non-zero
elements, we have only $32^2 = 1024$ coefficient multiplies instead of
$496^2=246,016$ multiplies.
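A sketch of that term-by-term method over the sparserep form (pure Python,
ours, not the Lisp code that was timed); the final sort is where the
exponent-comparison cost discussed earlier shows up:

```python
def sparse_multiply(p, q):
    """Multiply polynomials in sparserep form ((exponents), (coefficients)),
    stepping only along the nonzero terms: for t-term inputs this costs
    O(t^2) coefficient multiplies, regardless of degree."""
    acc = {}                                   # exponent -> coefficient
    for e1, c1 in zip(*p):
        for e2, c2 in zip(*q):
            acc[e1 + e2] = acc.get(e1 + e2, 0) + c1 * c2
    exps = sorted(acc)                         # the exponent-sorting cost
    return exps, [acc[e] for e in exps]

# squaring 1 + x^16 + x^32: only 9 coefficient multiplies
p = ([0, 16, 32], [1.0, 1.0, 1.0])
assert sparse_multiply(p, p) == ([0, 16, 32, 48, 64],
                                 [1.0, 2.0, 3.0, 2.0, 1.0])
```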
How much time does it take to convert the input $p$ from dense to sparse? About 0.08 ms.
How much time does it take to convert the input $p$ from sparse to dense? About 0.009 ms.
Consequently, if we knew we had a sparse problem expressed in a dense representation, and
really wanted the answer in the denserep form, we could convert the two inputs, do the multiplication,
and convert the output in time 0.16+0.29+0.009 = 0.459 ms, rather than 2.15 ms. for the direct multiplication.
How should we display the results of our testing?
We can perhaps reduce the issues to four parameters:
\begin{itemize}
\item degree of the answer,
\item number of terms in the answer,
\item average size (number of bits) of the coefficients,
\item ratio of largest multiplicand to smallest (1 = same size),
\end{itemize}
and the data we benchmark: the time to complete the computation for each algorithm.
We can reduce the issues this way.
Assume completely dense polynomials with floating-point coefficients of degree $d$.
Graph $p_1 p_2$, where $p_1$ is of degree $d/n$ and $p_2$ is of degree $d(1-1/n)$, for $n = 2, 4, 8, 16$
(note that $n=2$ is squaring).
Draw a graph for the FFT and for each other algorithm.
Next, assume polynomials with floating-point coefficients
at sparsity $x$ percent of degree $d$. Same graphs.
NOT REVISED BELOW
Now, with 332-bit initial coefficients, as suggested by Shoup,
the computer must do more work.
\noindent
{\begin{table}[h!]
\begin{center}
\begin{tabular}{|l|r|l|}
\hline
Method& Time (ms)& Notes\\
\hline
FFT using double & 15 & (but answer has wrong coefficients!)\\
\hline
FFT using double & 9 & (if 332-bit integers pre-converted to double)\\
\hline
FFT with 212 bits & 485 & (that is, using QD arithmetic \cite{})\\
\hline
FFT with 212 bits & 284 & (if 332-bit integers pre-converted to QD)\\
\hline
Karatsuba & 438 & [65\% in bignum multiply] lim=100\\
\hline
Classical & 985 & [67\% in bignum mult, 12\% in add]\\
\hline
NTL (in C)& 100 &\\
\hline
NTL (in C) & 183 & (size is 513 instead of 512)\\
\hline
\end{tabular}
\end{center}
\end{table}}
%%% here, should be able to do NTL+GMP, though not in lisp.
The floating-point double FFT is expected to get the answer wrong for three reasons:
\begin{enumerate}
\item The
coefficients in the inputs were substantially truncated to fit them
into double-floats.
\item Even if there were 332 available bits for input,
the arithmetic would have to be done to higher precision than
available with hardware floating point.
\item We would have to find
sufficient space to represent the numbers of up to 673 bits in the answer.
\end{enumerate}
The QD FFT also gets the wrong answer because even 212 bits is not
enough here. Consequently, neither of the first two versions of the FFT
is a contender.
Yet, if the FFT is done (say) 23 times over with 23 different inputs,
modulo ``single precision'' primes and stitched together with the
Chinese Remainder Algorithm (CRA), {\em we could get the right answer}. This
would cost at least $15 \cdot 23=345$ ms (plus time for CRA).
But if we wanted to do that, it seems that we would be better off
using Karatsuba or even classical multiplication with small coefficients
23 times over, plus the CRA.
But then, how do we explain that the NTL polynomial multiplier program, which
is {\em relatively slower for degree 512 with small coefficients}, is
so much faster when the only change is that the coefficients are larger? By our own
measurements, Shoup's FFT is about 10 times faster (maybe really 5
times faster if you choose a ``pessimal'' rather than an ``optimal''
problem size) compared to classical, and about 4.4 times faster than
Karatsuba (2.4 discounting the exact power of two).
What if we could beat down the time for integer multiplication by a
factor of 3 or more when multiplying two numbers of about
332 bits? This is entirely feasible by using better bignum arithmetic
rather than our native Lisp's\footnote {The bignum package in Allegro
CL 6.2 uses only 16 bit ``bigits'' in its arithmetic. Multiplication
can be expected to be at least 4 times slower than a comparable system
which uses 32 bit units, assuming that a $32 \times 32$ multiplier is
available; even higher efficiency may be obtained with 53-bit fraction
double-floats. This is without using any clever algorithms.}. At a
factor of 3, we get down to about 270 ms. After corresponding with
NTL's author, Victor Shoup, we learned that native NTL arithmetic is
not as fast as GMP arithmetic, and for best performance we should use
GMP. We encountered (so far insurmountable) difficulties compiling
NTL with GMP under Windows, but we compiled the NTL-5.4 + GMP 4.1.4
system using a version of Cygwin with GCC 3.4\footnote{The timing of
some of these systems is very ticklish: there are large numbers of
optimization parameters in NTL, GMP, and the compiler, some of which
may affect times substantially. We compiled the
system as suggested by the NTL documentation.} We used the default
configurations computed for a 933 MHz Pentium III and found that, over
repeated trials of multiplying degree 1023 polynomials modulo a
1024-bit number, the version using GMP was on average about 40 percent
faster than the NTL version. We can also use GMP arithmetic in our
Lisp: we could also get about a 40 percent speedup using 332-bit
coefficients for classical or Karatsuba multiplication.\footnote{In
our tests of using Lisp + NTL+GMP, we got incorrect answers except on
numbers with more than 250 bits, and so we are cautious about trying
to make a direct comparison in exactly the same environment.}
%%(We are using NTL version 5.4 compiled in MS visual studio C++ 6.0).
Our basic finding is that NTL is slower than Lisp for sufficiently
small coefficients (fixnums), even if
we spend time to discover that the coefficients are ``fixnums''. For
(considerably) larger coefficients, our tests {\em eventually} agree
with the predictions of algorithm complexity theory that the FFT
should be faster. But even at degree 10,000, if the routines are
compiled to know the types of input, the FFT advantage is only about a
factor of 3 in speed versus the best of the others.
%*karatlim* 50 gives 125ms on terms all 2^323
% classical gives maybe 344ms.
Turning to a standard reference in computer algebra algorithms,
we looked at Figures 9.10 and 9.11 of Gathen/Gerhard \cite{vzg}, based on
benchmarks of Shoup's NTL. They seem to suggest that even at degree 63, if
coefficients are of 1024 bits, the ``Fermat number FFT'' wins; and if
coefficients are of 64 bits, then the ``modular FFT'' wins over all
other methods tested, at some rather low degree. It is not possible to
be sure from their graph but it appears to be something less than degree
500. The alternatives include a Karatsuba-style multiplication routine
as well as a ``classical'' $O(n^2)$ version for multiplying degree-$n$
polynomials. As we have already remarked, any such comparison is
strongly influenced by the cleverness of the implementations of the
algorithms being compared, the details of the compiler and memory
management used, the setting of parameters for optimization, and
especially for large polynomials, the size of the cache relative
to the size of the polynomials. We have seen that a number of
the comparisons between runs on a Pentium 2 and Pentium 4 computer
do not reflect the speedup of the ratio (2.5GHz / .933GHz) but
rather the ratio of data cache sizes.
%MOSTLY OLD TEXT..
%We corresponded with Victor Shoup, who confirmed that he
%believed the FFT to be practical in some settings\footnote
%{especially dense, univariate, finite field} for rather small
%polynomial degrees. (see Appendix 1)
%
%Implementing an efficient numeric FFT for a large sequence that
%essentially stands in for a dense univariate polynomial, is
%tricky. Which set of tricks works best is primarily an exercise
%in tuning parameters.
\section{Review of FFT implementations}
The usual way of running an FFT for a computer algebra algorithm is
to use a different field from the (approximate, floating-point)
complex numbers. Typically the computations
use a finite field, e.g. the integers
modulo a prime $P$. To perform an FFT on a sequence of length $n=2^k$,
a primitive $n$th root of unity $\omega_n$ must be available in the
field. $P$ is either chosen sufficiently large so as to bound the
size of the coefficients in the answer, or a more complicated
arrangement is set up, based on doing the FFT computation in some
small number\footnote
{T. Granlund suggests this strategy doesn't make sense for more than
about 5 primes, in his experience.}
of finite fields, that is, mod $P_1$, $P_2$, ... and then
pasting the results together with the Chinese Remainder Algorithm.
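For two primes, that pasting step is a short exercise in modular inverses;
a pure-Python sketch (the function name is ours):

```python
def crt_pair(r1, p1, r2, p2):
    """Combine residues r1 mod p1 and r2 mod p2 (p1, p2 coprime) into the
    unique residue mod p1*p2, as when pasting per-prime FFT results.
    Solves x = r1 + p1*t with x congruent to r2 (mod p2)."""
    t = ((r2 - r1) * pow(p1, -1, p2)) % p2   # modular inverse of p1 mod p2
    return r1 + p1 * t

# 7 mod 3 is 1 and 7 mod 5 is 2; recombining the residues recovers 7
assert crt_pair(1, 3, 2, 5) == 7
```

Applied left to right over a list of primes, this extends to the
several-prime case mentioned in the footnote.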
These FFTs all produce exact integers along the way, rather than
floats (in particular, $\omega_n$ looks like an integer between
$-(P-1)/2$ and $(P-1)/2$ in a so-called ``balanced representation''
of the field), and arithmetic is integer arithmetic {\em where each operation
requires a remainder computation} (remainder modulo $P$). This represents
a substantial overhead. Of course for the complex float FFT, each
multiplication is a ``complex float'' multiplication, typically using
four ``real float'' multiplications and two additions (and slightly
more work if unnecessary overflow is to be
avoided)\footnote{If multiplications are far more expensive than
additions, the computation can be done with only three multiplications.}.
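The footnote's three-multiplication trick can be made concrete. The helper below is our own illustrative code (the name is invented); it computes $(a+bi)(c+di)$ with three real multiplications and five additions/subtractions:

```python
def complex_mul_3m(a, b, c, d):
    """(a + b*i) * (c + d*i) using three real multiplications instead of four."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    # real part: k1 - k3 = ac - bd;  imaginary part: k1 + k2 = ad + bc
    return (k1 - k3, k1 + k2)
```

For example, $(1+2i)(3+4i)$ returns $(-5, 10)$, the real and imaginary parts of $-5+10i$.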
For computers where
integers are 32 bits and double-floats are 64 bits, we are probably getting
more work done faster from the floating-point processor, replacing
the integer remainder calculation\footnote{See the
GMP Manual for a brief discussion of exact remainder by word-sized P.}.
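To make the finite-field (``Number Theoretic Transform'') route concrete, here is a toy sketch, entirely our own: a radix-2 NTT modulo the small prime 17, in which 4 is a primitive 4th root of unity. Note that, exactly as described above, every arithmetic operation ends in a remainder computation.

```python
def ntt(a, omega, p):
    """Radix-2 number-theoretic transform of a (length a power of two) mod p."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % p, p)
    odd = ntt(a[1::2], omega * omega % p, p)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % p                  # every multiply is followed by mod p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * omega % p
    return out

def ntt_multiply(pp, qq, p=17, omega=4, n=4):
    """Multiply coefficient lists pp, qq (low order first) modulo p.
    omega must be a primitive n-th root of unity mod p, and the true product
    coefficients must already be less than p for the answer to be exact."""
    a = ntt(pp + [0] * (n - len(pp)), omega, p)
    b = ntt(qq + [0] * (n - len(qq)), omega, p)
    c = ntt([x * y % p for x, y in zip(a, b)], pow(omega, -1, p), p)
    ninv = pow(n, -1, p)                    # inverse transform scales by 1/n mod p
    return [x * ninv % p for x in c][:len(pp) + len(qq) - 1]
```

Multiplying $1+2x$ by $3+4x$ this way yields `[3, 10, 8]`, i.e.\ $3+10x+8x^2$, exact because the answer's coefficients are below 17.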
The FFT is (eventually) going to be fast as the problem size grows. The cutoff
point where it is superior to all alternatives
depends, of course, on their implementation and on particular characteristics
of the input (density, size of coefficients).
Are there ways of making it more practical, perhaps even for relatively
small sequences? It could be improved in several ways: computing the
complex roots of unity more accurately, and making
it faster by numerous strategies, including memory-allocation tricks
as well as fitting the FFT to other (non-power-of-two) sequence lengths.
We could also build upon someone else's library, as we eventually did,
using FFTW.
{\em
Given that the usual floating-point FFT that is the object of
so much numerical interest gives approximate answers, can we use it
to produce EXACT answers to multiplication problems for
arbitrary-precision coefficient polynomials?}
Can we use the information from the floating-point FFT to
\begin{itemize}
\item Inform us when the answer is right exactly.
\item In case it is inaccurate,
allow us to get the correct answer (with more work, but still fast).
\item Do so with a guaranteed modest amount of work.
\end{itemize}
Consider that the IEEE double-precision significand has 53 bits (52
stored bits plus an implicit leading bit; a sign bit and an 11-bit
exponent fill out the 64-bit word), compared to the 32 bits for an
integer (on most machines). Given the possibility of several
arithmetic units, we may have access to
a better encoding of numbers, even integers, as parts of
floats. Not all numbers can be represented by floats:
numbers exceeding about $1.8 \times 10^{308}$ exceed the range,
and the 53-bit significand gives something less than 17 decimal digits.
Some polynomial problems never need more than what can be comfortably
represented in the fraction of double-floats. The exact integer
answer can be obtained by coercing to the nearest integer. We must be
able to guarantee that the number we are rounding has error less than
half in the last place!
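A minimal sketch of this floating-point route (a pure-Python, textbook radix-2 FFT of our own devising, standing in for a tuned library routine): multiply integer polynomials through the complex FFT and round each output coefficient to the nearest integer. The result is exact precisely when the accumulated floating-point error in every coefficient is below 0.5.

```python
import cmath

def fft(a, invert=False):
    """Textbook radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_multiply(p, q):
    """Integer polynomial product via the complex FFT, rounded to integers."""
    need = len(p) + len(q) - 1          # number of coefficients in the product
    n = 1
    while n < need:                     # pad to a power of two
        n *= 2
    fp = fft([complex(c) for c in p] + [0j] * (n - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (n - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    # exact only while every coefficient's accumulated error is below 0.5
    return [round(c.real / n) for c in prod[:need]]
```

Squaring $1+x$ this way gives `[1, 2, 1]` exactly.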
Sometimes though, the answer cannot be represented, e.g.
$r$=123,456,789,123,456,789 requires 57 bits to represent to full
accuracy, so it seems we cannot use the floating FFT for a polynomial
with this coefficient.
{\em NOT SO!}
%Consider the two primes $m_0= 39582418599937$ and $m_1= 79164837199873$.
%We can represent $r mod m_0 =38807928853223$ exactly as a double-float,
%and similarly for $r mod m_1 = 38807928854782$. We can use these numbers
%in two separate {\em floating point FFTs}, and combine them using the Chinese
%Remainder Algorithm.
%Problem: these primes appear to be way too big. The accuracy of the fft
%is not good enough.
{Consider the primes $m_0=33554393$, $m_1=33444383$, and
$m_2=33554371$, all less than $2^{25}$, so their pairwise products fit
comfortably\footnote{How comfortably they must fit is an important
question.} in 53 bits. We can represent $r \bmod m_0 =27733531$
exactly as a double-float, and similarly for $r \bmod m_1 = 7390453$
and $r \bmod m_2 = 5709040$. We can use these numbers in three
separate {\em floating-point FFTs}, and combine them using the Chinese
Remainder Algorithm. For example, if we are squaring
$(1+123456789123456789x)$, we would instead square $(1+27733531x)$,
$(1+5709040x)$, etc. As long as these computations can be done
exactly in a floating-point FFT, we are safe.}
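The recombination step can be checked directly. The sketch below is our own code (a Garner-style incremental reconstruction); it computes the residues of $r$ modulo the three primes and recovers $r$ exactly:

```python
def crt(residues, moduli):
    """Garner-style incremental CRT: returns the x with x = r_i (mod m_i)
    for all i, 0 <= x < product(moduli); moduli must be pairwise coprime."""
    x, m = residues[0], moduli[0]
    for r_i, m_i in zip(residues[1:], moduli[1:]):
        t = (r_i - x) * pow(m, -1, m_i) % m_i   # solve x + m*t = r_i (mod m_i)
        x += m * t
        m *= m_i
    return x

r = 123456789123456789
moduli = [33554393, 33444383, 33554371]
residues = [r % m for m in moduli]      # the three sub-2^25 images
assert crt(residues, moduli) == r       # recombination recovers r exactly
```

Since the product of the three moduli is about $3.8 \times 10^{22}$, comfortably exceeding $r$, the three images determine $r$ uniquely.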
How did we know that three FFTs were needed? We compute a bound for
the coefficients in the product of two polynomials. Based on this, as
well as the comfort bound (see below) we can tell how many images we
need. We can then modularly reduce the inputs so that the answer's
coefficients can be represented correctly in the floating-point
fraction. One coefficient bound for the answer is easy to
compute. For polynomials $p_1$ and $p_2$, the bound is
maxcoef($p_1$)$\cdot$maxcoef($p_2$)$\cdot$(1+min(deg($p_1$),deg($p_2$))).
(Proof left as an exercise: consider squaring a polynomial with
coefficients that are all 1).
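The bound is easy to check numerically. In this sketch (our own helper names), classical convolution stands in for whatever multiplication routine is actually used:

```python
def classical_multiply(p, q):
    """Schoolbook product of coefficient lists (low order first)."""
    ans = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            ans[i + j] += pi * qj
    return ans

def coef_bound(p, q):
    """maxcoef(p) * maxcoef(q) * (1 + min(deg p, deg q)); degree is len - 1."""
    return max(map(abs, p)) * max(map(abs, q)) * (1 + min(len(p) - 1, len(q) - 1))
```

For the all-ones polynomial of length 8, squaring gives a middle coefficient of 8, which is exactly the bound; that is the point of the exercise above.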
How comfortable do we need to be? That is, how much head-room do we
need to make sure the answer, rounded to the nearest integer, is
correct? A substantial discussion can be found in Percival
\cite{percival}. A simpler characterization can be found,
however. According to the FFTW
website\footnote{http://www.fftw.org/accuracy/method.html}, we can
expect the error to grow only in proportion to $\sqrt{\log N}$ for a
length-$N$ transform. Since the computed bounds tend to be somewhat
pessimistic, we tried some experiments. We can choose a particular sequence
size, and try to find the largest $k$ such that an input polynomial $p$
whose coefficient bound is $2^k$, can be squared and the exact answer
obtained, using only double-precision floats. This is what we found:\\
\noindent
\begin{center}
\begin{tabular}{r r}
size&k\\
32&23\\
64&23\\
128&21\\
256&20\\
512&20\\
1024&19\\
2048&19\\
4096&18\\
8000&17\\
16000&17\\
\end{tabular}
\end{center}
That is, if you have a polynomial $p$ of length 2048, whose coefficients
are random integers less than $2^{19}$, you can multiply $p$ by $p$ and
get the {\em exact answer} by running the floating-point FFT procedure
we tested, and rounding the coefficients in the answer to the nearest integer.
We believe these numbers may be somewhat optimistic, and probably
should
be subjected to some kind of worst-case analysis.
The details will depend on the particular FFT implementation.
If the problem can't meet these bounds, there are several common approaches, and
two more which we suggest below.
\begin{enumerate}
\item Don't believe the bound. Compute a better one. {\em We think
this idea is new.} If the input coefficients fit in the double-float
format, compute the product with the double-float FFT, or for that
matter, using classical multiplication, with double-floats. Or even
single-floats if that is faster. Our only concern is that the output
coefficients must have magnitude less than the overflow threshold
(about $1.8 \times 10^{308}$). This operation takes very little time,
and it might not get the exact right answer, but except for the
possibility of overflow, it will compute the approximate coefficients
in the answer. For example, if the largest coefficient in this
approximate answer is a floating point $3.0 \times 10^{80}$ we know
that we should have at least 81 decimal digits of accuracy in our
FFT, and based on the other components of the error bound, we can
determine a precision width W for the FFT.
\item If W can be met by doubled-double (about 32 decimal digits) or
quad-double arithmetic (about 64 decimal digits), we can redo the FFT in that
precision. The truly arbitrary-precision (e.g. MPFR using GMP)
software is more flexible but substantially slower.
(Example:
Consider that the number of bits B needed in the largest coefficient is
computed by a double-float FFT and found to be 59. That is,
the magnitude of the coefficients in $p^2$ is bounded by $2^{59}$.
If you wish to compute exactly $p \times p$ of length 2048, our table suggests
19 more bits, so quad-double should be fine.)
\item Find a bound $B$ for the largest coefficient in the
answer, as above. Identify a single finite field of size at least $2B$ which also contains a
value $\omega$ which is a primitive $2^n$th root of unity, where
$d=2^n$ exceeds the degree of the answer. Computing
two forward transforms, $d $ multiplications,
and one reverse transform gives us the product.
\item Do the arithmetic {\em not} in a single finite field of size greater
than $2B$, but in a number of finite fields whose moduli, when multiplied
together, exceed $2B$. The Chinese Remainder Algorithm (CRA)
allows these pieces to be put together, in a final step.
(There are numerous minor variations \cite{vzg}, some of which can
make substantial timing differences.)
\item {\em We think this idea is new.} We do the arithmetic not in a
single finite field, nor in a collection of finite fields each with an
$\omega$, nor over a high-precision floating-point domain. Rather, we
use the existing, presumably highly-tuned floating-point computations
familiar in numerical computation, {\em repeatedly.} The novelty is
that the {\em inputs only} have been reduced modulo selected primes
$q_1$, $q_2, \cdots$ (e.g. each less than $2^{19}$ if the inputs are
1024-term polynomials). The arithmetic is (as we have observed) done
rapidly {\em in floating-point} and then rounded to integers. We know
these integers are exactly correct, and so we can reduce them modulo
those same primes $q_1$, $q_2, \cdots$. The resulting exact images
can be collected via the Chinese Remainder Algorithm (Garner's
algorithm) in a final step.
\item {\em We think this idea is new.} Use one finite field image
(perhaps even computed using traditional multiplication!) to check the
TRAILING bits of the coefficients, to make sure the rounding is done
to the right integer. As a simple example, reduce the inputs modulo 2
and multiply them. This preserves the information of which of the
answer's coefficients are even and which are odd.
\end{enumerate}
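Approach 5 can be sketched end to end: reduce the inputs modulo each small prime, form each image product with an ordinary floating-point FFT rounded to integers, re-reduce those (exact) integers modulo the same primes, and recombine coefficientwise with the Chinese Remainder Algorithm. Everything below is our own toy code, not the tuned libraries discussed in the text; the primes 32003, 32749, and 65537 are merely convenient examples, and we assume nonnegative coefficients so that no balanced lift is needed at the end.

```python
import cmath

def fft(a, invert=False):
    """Textbook radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def float_fft_multiply(p, q):
    """Integer polynomial product via the complex FFT, rounded to integers."""
    need = len(p) + len(q) - 1
    n = 1
    while n < need:
        n *= 2
    fp = fft([complex(c) for c in p] + [0j] * (n - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (n - len(q)))
    pr = fft([a * b for a, b in zip(fp, fq)], invert=True)
    return [round(c.real / n) for c in pr[:need]]

def crt_pair(x0, m0, x1, m1):
    """Combine x = x0 (mod m0) and x = x1 (mod m1) into x mod m0*m1."""
    t = (x1 - x0) * pow(m0, -1, m1) % m1
    return x0 + m0 * t

def multiply_via_float_images(p, q, primes):
    # only the INPUTS are reduced; each image product is exact in doubles
    images = [float_fft_multiply([c % qi for c in p], [c % qi for c in q])
              for qi in primes]
    ans = []
    for k in range(len(p) + len(q) - 1):
        # the rounded outputs are exact integers: re-reduce, then recombine
        x, m = images[0][k] % primes[0], primes[0]
        for img, qi in zip(images[1:], primes[1:]):
            x, m = crt_pair(x, m, img[k] % qi, qi), m * qi
        ans.append(x)     # nonnegative coefficients assumed; no balanced lift
    return ans
```

With three primes whose product is about $6.9 \times 10^{13}$, products of polynomials with coefficients in the hundreds of thousands are recovered exactly, even though no single double-precision FFT could hold the answer's coefficients together with adequate headroom.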
The big win here is to observe that any time a better {numerical} FFT
is available on whatever computer we are using, we can use it. The
FFTW website shows how intense this development activity has been.
We recently (May-June, 2010) looked at the FFTW website for reasons
not initially related to polynomial multiplication, but for obtaining
a good discrete cosine transform (DCT) for Chebyshev series
calculations. Revisiting (sparse) multiplication through discussions with
M. Monagan and R. Pearce, it occurred to us to reconsider the FFTW
algorithms (called from Lisp) for this task. R. Pearce suggested an
improvement on the FFT whereby the polynomial inputs p, q, are placed in
the real and imaginary parts of one array, so that a single FFT can,
with some extra calculations, provide the transforms P, Q of both
inputs. This saves some time and space compared to our previous
programs.
\section{Conclusion}
{Benchmarking is always prone to misinterpretation. In comparing two
or more algorithms there is always the temptation to continually
embroider your program's implementation until it is faster than the
competition, which, grabbed at a particular point in time, tends to be
stationary. Or one can run the programs on datasets, improbable or
not, that especially show your ideas in the best light. Benchmarks
are also subject to variations in compiler optimization, memory speed
and cache size, or in our case whether GMP or other underlying
arithmetic is used.}
We find rather little support in our own experiments for the idea that
the FFT should be used for {\em typical} polynomial multiplication
problems which, we contend, are unlikely to be univariate and dense or
very large.
In our original testing, we found that even in the FFT's favored realm
(univariate, dense), it might be advantageous to switch to a version
of the Karatsuba algorithm, but only if, in turn, it switches to the
classical method for inputs of degree below 12 or so.
Our Toom program has advantages, but benefits if it
switches to Karatsuba at about degree 300. (Choosing these cutoff numbers for
a particular environment would probably have to be done by testing and tuning;
it would be preferable to have a more principled approach to this art. Examples
of software projects that try to provide a framework for making choices based
on test include ATLAS and FFTW. A similar approach might work for polynomial
multiplication, but would require a better characterization of the dense vs. sparse
spectrum).
Basically, since our environment is very likely to be different from yours, we cannot produce
cutoffs for your computer.
If
you know that your inputs have small coefficients and a single
floating-point FFT can either get you an answer, or a good distance
toward the answer, FFT starts looking promising. But if you have
small coefficients and tell this to the classical method, it beats the
FFT for polynomials of degree even 1,024 or more in our spot check of
polynomials with all-one coefficients, given our original FFT
implementation
(Lisp, translated from Numerical Recipes).
For our testing in 2005, in this set of cases, only
beyond degree 2048 did the FFT become superior to alternatives.
But we now have the considerably faster FFTW, not written in Lisp,
and not copied from a relatively naive implementation.
Combined with a much more substantial effort to
optimize the conversion of lists to arrays suitable for the FFT,
we have found a more favorable (for the FFT) cutoff point, which in our case appears to be
roughly at degree 100 dense, where a naive dense multiplication,
a Karatsuba-based program, and the FFTW program are about equal in
speed. Unfortunately, the FFT is still a floating-point based version,
and the FFTW code does not easily generalize to bignums (June, 2010).
Indeed, the FFTW version now is far superior at size 2000, being about 15 times faster
than the next-best version (Karatsuba).
However, for degree 40 dense, the FFTW code is 3X slower, and if
most problems are of that size, you would pay a substantial cost if this
were the default program.
In 2005 we considered that if you have extremely
large degree and large coefficient {\em univariate} problems, at some
point the cost of conversion of the data from Lisp into NTL is
negligible compared to the time in polynomial multiplication, and,
since it takes almost no effort to load NTL into Lisp as we have done,
we suggested that it might be plausible to consider just using NTL or similar programs from
Lisp-based CAS.
Now it seems that one could take the effort to incorporate the FFTW library
and have somewhat more immediate access -- FFTW can operate on Lisp double-float
arrays without any conversion -- though NTL may have useful trade-off calculations
as to how to do a particular problem.
Note that the general idea of using multiple representations for
efficiency or convenience was used in the Macsyma CAS since the late
1960's, and in MATHLAB 68 before that, when a distinct ``rational
function'' package was included. Along this line, NTL encoding could
be just another possibility along with arrays, lists, trees, etc.
The SAGE project\footnote{SAGE www.sagemath.org}
has taken something of that view, attempting
to load essentially every computer program of possible interest\footnote{and compatible
licensing} into one package.
At this point in time (2010) we are disinclined to repeat our
experiments with NTL; however, this (or FFTW) may still be a viable alternative for someone
building a system by picking and choosing among existing alternatives.
(Incidentally, anticipating some readers' questions, we have not included times for
Maple and Mathematica because they are slower.)
\section{Acknowledgments}
{I would like to thank Dr. Kamal Abdali, of the US National Science
Foundation who was visiting UC Berkeley from the NSF in 2005, for
numerous useful discussions on the topics in this paper and much
more. Student Yozo Hida provided advice on accuracy of the
FFT. Additionally Charles Cox provided advice on the C++ NTL interface
to Lisp. Numerous email exchanges with Victor Shoup were helpful in
understanding NTL. Thanks to the MPFR team, who helped in getting pieces
of their software running in Windows, which will enable us to run FFTs to
higher precision. Also thanks for comments by P.~Zimmermann and
T.~Granlund. Special thanks to Brian Gladman and Mikel Bancroft in
supplying critical help to overcome my ignorance of producing and
using Windows dlls.
In revisiting the FFTW library, Duane Rettig assisted in decoding the
interface and memory allocation requirements, an apparently daunting
prospect which in retrospect seemed fairly simple.
}
\begin{thebibliography}{99}
\bibitem{arprec} D. Bailey, ARPREC.
{\tt http://crd.lbl.gov/\~{}dhbailey/mpdist/}
\bibitem{gmp} The GNU MP Home page. http://swox.com/gmp/
\bibitem{knuth} D.~E.~Knuth, {\em The Art of Computer Programming}, volume 2.
(1969, 1981, 1998). Addison-Wesley.
\bibitem{fftvsC}
R. Fateman,
``A Case Study in Interlanguage Communication: Fast LISP Polynomial
Operations Written in `C','' 
Proc. of the 1981 ACM Symposium on Symbolic and Algebraic Computation, pp. 122--124.
\bibitem{polysbyGMP} R. Fateman, ``Can you save time in multiplying
polynomials by encoding them as integers?'', draft, 2004
\bibitem{vzg} J. von zur Gathen and J. Gerhard,
{\em Modern Computer Algebra}, Cambridge Univ. Press 1999, 2003.
\bibitem{percival} Percival, Colin. ``Rapid multiplication modulo the sum and
difference of highly composite numbers,''
{\em Mathematics of Computation, 72:241} (Jan 2003) pp. 387--395.
\bibitem{preiss}
William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling.
{\em Numerical Recipes in Fortran} (various editions), Lisp version online
at\\
www.library.cornell.edu/nr/cornell\_only/othlangs/lisp.1ed/kdo/readme.htm.
\bibitem{shoup} V. Shoup. NTL, (Number Theory Library)
{\tt http://shoup.net/ntl/}.
\bibitem{shoup95} Victor Shoup.
`` A New Polynomial Factorization Algorithm and its Implementation.''
{\em J. Symbolic Comp., 20(4)} pp. 363--397, 1995.
\end{thebibliography}
Program Appendices omitted, all source programs available from the author.
\end{document}
\section{Appendix 1}
Extract from correspondence with Victor Shoup. (slightly edited)
\begin{verbatim}
Victor Shoup wrote:
Hi, I still support and maintain NTL, but I don't have any good
numbers for comparison right now. My Z[X] multiplication routine
chooses from among a number of different implementations, and there
are certainly cases when FFT is faster than Karatsuba, for quite small
inputs. If the coefficients are rather small, then the ``small prime''
FFT is the method of choice. In this case, the FFT itself does not
use any multi-precision arithmetic, and so is quite fast. The CRA is
also quite fast.
-- Victor
\end{verbatim}
\section{Appendix 2: Program listings}
\begin{verbatim}
program listings available from the author's web site.
\end{verbatim}
\section{Appendix 3}
{If we consider polynomials that we encounter in a CAS as functions
reflecting an input-output relationship with some connection to the real world,
then the kinds of polynomials that I, at least, am using for some of these tests
are unlikely-looking functions. A better target for simple functional
representation would be something like piecewise cubic splines.
For example,
consider $p(x):= \sum_{i=0}^{1000}{x^i}$, a polynomial with all coefficients equal to 1.
Evaluating it outside the interval $[-1,1]$ gives such huge numbers that it is
unlikely, in my opinion, to be helpful there. For example, $p(2)$ is about $2 \times 10^{301}$,
and $p(-2)$ is about $7 \times 10^{300}$.
Now you may think that it is just that polynomial, and ``the others'' are not like that.
I did the same test but with $p(x):= \sum_{i=0}^{1000}{r(10)\,x^i}$ where $r(10)$ is a random
integer in $[0,9]$. $p(2)$ came out about $1.6 \times 10^{302}$ and
$p(-2)$ about $3.6 \times 10^{301}$.
If the random numbers are given random signs, I found $p(2)$ is about $1.1 \times 10^{302}$ and $p(-2)$ is about $3 \times 10^{301}$.
Your results may differ, but probably not by much!
Evaluating inside [-1,1], or more generally inside the unit circle in the complex plane,
such a function will probably behave better, but can, I think, be approximated extremely well by a
polynomial of substantially lower degree.
So what might large polynomials be used for? Number-theoretic calculations of some
sort? Maybe some conjecture involving series and generating functions? Largely pointless
(my opinion) attempts to solve symbolically systems of polynomial equations via Grobner bases,
when the sensible approach via numerics to find (some) roots or perhaps just one root is well-studied and
highly successful.
%(Personal experience in this regard includes studying prospects
%for GB solution of "inverse problems" involving (a) oil reservoirs (b) CT imaging.)
A better choice of basis, say using Chebyshev polynomials $T_i$ instead of the power basis $x^i$,
may make ``higher degree'' expressions more useful, though a polynomial of degree more than
200 (say) still suggests that the expression should be broken up into pieces.
Chebyshev polynomials and their zeros have
these nice relationships with minimax approximation, Fourier series, quadrature formulas, etc.
I have written programs that multiply Chebyshev series, and now
programs for fast dense approximate Chebyshev series as well. Once you
bite the bullet and accept that you are dealing with approximations, you can
take the cosine of a polynomial $p$ in various ways: by composing
the Chebyshev series of cosine with the Chebyshev series for $p$; or by
computing an inverse discrete cosine transform (DCT) of $p$ to get its
value at $n$ points, computing cos at those places, and doing a DCT to get
a Chebyshev series.
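A toy version of that DCT route (our own code; Chebfun does this far more carefully, and with a fast transform): sample $\cos(p(x))$ at Chebyshev points, recover Chebyshev coefficients with a direct DCT-II-style sum, and evaluate the series with the Clenshaw recurrence.

```python
import math

def cheb_coeffs(f, n):
    """Chebyshev coefficients of f on [-1,1], sampled at the n Chebyshev roots.
    A direct O(n^2) DCT-II-style sum; a real implementation would use a fast DCT."""
    xs = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    fx = [f(x) for x in xs]
    c = [2.0 / n * sum(fx[j] * math.cos(math.pi * k * (j + 0.5) / n)
                       for j in range(n)) for k in range(n)]
    c[0] /= 2.0
    return c

def cheb_eval(c, x):
    """Clenshaw recurrence for sum_k c[k] * T_k(x)."""
    b1 = b2 = 0.0
    for ck in reversed(c[1:]):
        b1, b2 = ck + 2.0 * x * b1 - b2, b1
    return c[0] + x * b1 - b2

# cosine of the polynomial p(x) = x^2 - 0.5, as a 24-term Chebyshev series
p = lambda x: x * x - 0.5
c = cheb_coeffs(lambda x: math.cos(p(x)), 24)
```

Since $\cos(p(x))$ is entire, the coefficients decay extremely fast, and 24 terms already reproduce the function to near machine precision on $[-1,1]$.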
This kind of thing is explored in great detail by the Chebfun team
at www.maths.ox.ac.uk/chebfun.
}
\end{document}
% old stuff
;; the FFT file (dfft.lisp 5/4/05 RJF)
;; FFT partly based on numerical recipes in lisp
(eval-when (compile load)
(declaim (optimize (speed 3) (safety 0) (space 0) (compilation-speed 0)))
(declaim (special modulus hmodulus)))
; Fourier Transform Spectral Methods
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;Routines translated with permission by Kevin A. Broughan from ;;;;;;;;;;;
;;Numerical Recipes in Fortran Copyright (c) Numerical Recipes 1986, 1989;;;;;;
;;;;;;;;;;;;;;;Modified by Ken Olum for Common Lisp, April 1996;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(in-package :cl-user) ;; changed RJF
;; notes 4/27/05, RJF
;; We use four1 only from NR12.lisp
;; Caveats.
;; the input data will be over-written by the result.
;; the inverse fft is indicated by :isign -1, though it must be
;; multiplied by 1/nn to be really correct.
;; The only NR function we use is four1:
;; four1: fourier transform (FFT) in one dimension
;; The rest of the file is written by RJF.
;; Programs to transform the data for the FFT.
(defun v2dfa(a &optional (m (length a)) )
;;coerce a real vector to a double-float-array of length m. (really 2m)
;; since here, complex numbers are stored in 2 adjacent real locations.
(let* ((k (length a))
(ans (make-array (+ m m) :element-type
'double-float :initial-element 0.0d0)) )
(dotimes (i k ans) ;just copy the actual numbers
(setf (aref ans (* 2 i))(coerce (aref a i) 'double-float)))))
(defun dfa2v(a &optional (m (/ (length a)2) ))
;;coerce real parts back to integers, more or less.
;; a is an an array of even length.
;; if you know that there are trailing zeros, set the actual length
;; with the optional second parameter m.
(let* ((k (/ (length a) 2))
(ans (make-array m)))
(dotimes (i m ans)
(setf (aref ans i)(round (aref a (* 2 i)) k)))))
;; example
;; (four1 (v2dfa #(1 2 3) 16) 16) -->
;; #(6.0d0 0.0d0 4.969079408582216d0 2.886687208289823d0 2.4142135623730923d0..... )
;; (dfa2v (four1 * 16 :isign -1) 3) --> #(1 2 3)
;; utility to copy an array 'cause fft tends to clobber inputs!
(defun copyarray (a)(make-array (length a):initial-contents a))
(defun prodarray(r s len)
;; r and s are the same length arrays
;; compute, for i=0, 2, ..., len-2
;; ans[i]:= r[i]*s[i]-r[i+1]*s[i+1] ;; real part
;; ans[i+1]:=r[i]*s[i+1]+s[i]*r[i+1] ;; imag part
(let ((ans (make-array (* 2 len) :element-type 'double-float)))
(dotimes (i len ans)
(let* ((ind (* 2 i))
(ind1 (1+ ind))
(a (aref r ind))
(b (aref r ind1))
(c (aref s ind))
(d (aref s ind1)))
(declare (double-float a b c d)(fixnum ind ind1))
(setf (aref ans ind)(- (* a c)(* b d)))
(setf (aref ans ind1)(+ (* a d)(* b c)))))))
;; put all the pieces together to do fft mult in double precision
;; starting with integer array
;; This is the part from NR in Lisp
(defun four1 (data nn &key (isign 1))
(declare (type (simple-array double-float (*)) data))
(declare (type fixnum nn isign))
(prog ((wr 0d0) (wi 0d0) (wpr 0d0) (wpi 0d0) (wtemp 0d0)
(theta 0d0) (tempr 0d0) (tempi 0d0) (j 0) (n 0) (m 0)
(mmax 0) (istep 0))
(declare (type double-float wr wi wpr wpi wtemp theta tempr tempi))
(declare (type fixnum j n m mmax istep))
(setf n (* 2 nn))
(setf j 1)
(do ((i 1 (+ i 2)))
((> i n) t)
(declare (type fixnum i))
(when (> j i)
(setf tempr (aref data (1- j)))
(setf tempi (aref data j))
(setf (aref data (1- j)) (aref data (1- i)))
(setf (aref data j) (aref data i))
(setf (aref data (1- i)) tempr)
(setf (aref data i) tempi))
(setf m (floor (/ n 2)))
label1
(when (and (>= m 2) (> j m))
(setf j (- j m)) (setf m (floor (/ m 2)))
(go label1))
(setf j (+ j m)))
(setf mmax 2)
label2
(when (> n mmax)
(setf istep (* 2 mmax))
(setf theta (/ 6.28318530717959d0 (* isign mmax)))
(setf wpr (* -2.0d0 (expt (sin (* 0.5d0 theta)) 2)))
(setf wpi (sin theta)) (setf wr 1.0d0) (setf wi 0.0d0)
(do ((m 1 (+ m 2)))
((> m mmax) t)
(declare (type fixnum m))
(do ((i m (+ i istep)))
((> i n) t)
(declare (type fixnum i))
(setf j (+ i mmax))
(setf tempr (+ (* wr (aref data (1- j)))
(* (* -1d0 wi) (aref data j))))
(setf tempi (+ (* wr (aref data j))
(* wi (aref data (1- j)))))
(setf (aref data (1- j)) (+ (aref data (1- i)) (- tempr)))
(setf (aref data j) (+ (aref data i) (* -1d0 tempi)))
(setf (aref data (1- i)) (+ (aref data (1- i)) tempr))
(setf (aref data i) (+ (aref data i) tempi)))
(setf wtemp wr)
(setf wr (+ (+ (* wr wpr) (* (- wi) wpi)) wr))
(setf wi (+ (+ (* wi wpr) (* wtemp wpi)) wi)))
(setf mmax istep)
(go label2))
(return data)))
;;end of [selections from] nr12.l
;; pieces needed for modular arithmetic
(defun fmod(r)
(let ((hp hmodulus)(p modulus))
(declare (special hmodulus modulus))
(cond ((<= (- hp) r hp) r)
((< 0 r p) (- (- p r)))
((< (- p) r 0 )(+ p r))
(t (fmod (mod r p))))))
(defun mod* (a b)(fmod (* a b)))
(defun mod+ (a b)(fmod (+ a b)))
(defun modpow(x n mod) ;; x^n modulo mod
(cond((= n 1) x)
((= n 0) 1)
((evenp n)
(let ((h (modpow x (ash n -1) mod) ))
(mymod (* h h) mod))) ; x^(2n)= (x^n)^2
(t(mymod (* x (modpow x (1- n) mod)) mod)))) ; x^(2n+1) = x*x^(2n);
(defun mymod (r p)
(let ((halfp (ash p -1)))
(cond ((<= (- halfp) r halfp) r)
((< 0 r p) (- (- p r)))
((< (- p) r 0 )(+ p r))
(t (mymod (mod r p) p)))))
;;;;;;;;;;;;;;;;;
;; file polymultx.cl
;;; experimentation in encoding polynomials for faster multiplication.
;;; Richard Fateman
;;; September, 2003
;;; This file just does polynomial mult of univariate arrays of bignums.
;;; The variable is NOT mentioned anywhere.
;;; More notes, added 4/15/05 to 5/4/05 RJF
;;; This is simpler than using tags for certain implementations of algorithms.
;;; In particular, using a Karatsuba or Toom splitting.
;;; If your polynomial is #(x 10 20 30 40) then half-splitting
;;; means creating #(x 10 20) and #(x 30 40). Compare to the case of
;;; #(10 20 30 40). Then the splitting can be done by indexing
;;; into the array or vector to get #(10 20) and #(30 40), and
;;; no new data. Or as we are doing here, by using subseq to pick out
;;; subsequences. It would be kind of ugly to pick out a subsequence
;;; and then paste a header on the front of it.
;;; Also, we explore the programming of a Toom-type
;;; algorithm which uses n^1.46 mults, by splitting into triples,
;;; i.e. ax^2+bx+c.
;;; How to encode zero? #(0) works better, I think, than #().
;;; A normal polynomial should have no trailing zeros (not enforced);
;;; a defect is that we don't normalize, so p-p might be #(0 0 0 0...).
;; Example: #(10 30 21) means 10+30*x+21*x^2 = 10+x*(30+x*21)
(eval-when (compile load)
(declaim (optimize (speed 3) (safety 0) (space 0) (compilation-speed 0)))
(declaim (inline svref = eql make-array)))
(defun polymult(p q) ;; by ordinary poly multiplication.
(assert (arrayp p))
(assert (arrayp q))
(let* ((plim (length p)) ;; degree is plim-1
(qlim (length q))
(p_i 0)
(ans(make-array (+ plim qlim -1) :initial-element 0)))
(declare(fixnum qlim plim)(type (array t) ans))
(do ((i 0 (1+ i)) )
((= i plim) ans)
(declare (fixnum i))
(setf p_i (aref p i))
(do ((j 0 (1+ j)))
((= j qlim) nil)
(declare (fixnum j))
(incf (aref ans (+ i j))
(* p_i ; (aref p i)
(aref q j)))))))
(defun v2d(a) ;; vector to declared array
(let* ((m (length a))
(ans (make-array m :element-type 'double-float)))
(declare (fixnum m))
(dotimes (i m ans) ;just copy the actual numbers
(declare (fixnum i))
(setf (aref ans i)(coerce (aref a i) 'double-float)))))
(defun polymultfixnum(p q)
;; by ordinary poly multiplication. Assume FIXNUM coeffs. This is fast.
(declare (optimize (speed 3)(safety 0)) (type (simple-array fixnum(*)) p q))
(assert (arrayp p))
(assert (arrayp q))
(let* ((plim (length p)) ;; degree is plim-1
(qlim (length q))
(ans(make-array (+ plim qlim -1):element-type 'fixnum :initial-element 0))
(p_i 0))
(declare(fixnum qlim plim)
(type (simple-array fixnum (*)) ans))
(do ((i 0 (1+ i)))
((= i plim) ans)
(declare (fixnum i))
(setf p_i (aref p i))
(do ((j 0 (1+ j)))
((= j qlim) nil)
(declare (fixnum j))
(incf (aref ans (+ i j))
(the fixnum (* (the fixnum p_i)(the fixnum(aref q j))))))) ))
;; compute a power of p
(defun ppower(p n) (if (= n 1) p (polymult p (ppower p (1- n)))))
(defun ppowerk(p n) (if (= n 1) p (polymultk p (ppowerk p (1- n)))))
;;An implementation of Karatsuba multiplication
#|Assume p is of degree 2n-1, i.e. p= a*x^n+ b where a is degree n-1 and b is degree n-1.
assume q is also of degree 2n-1, i.e. q= c*x^n+d
compute U= (a+b)*(c+d) = a*c+a*d+b*c+b*d, a polynomial of degree k=2*(n-1)
compute S= a*c of degree k
compute T= b*d of degree k
compute U = U-S-T = a*d+b*c of degree k
Then V=p*q = S*x^(2n)+U*x^n + T.
from first principles, V =p*q should be of degree 2*(2n-1) = 4n-2.
from the form above, k+2n= 4n-2.
note that U*x^n is of degree 2*(n-1)+n = 3n-1 so that it OVERLAPS the S*x^(2n) component.
note that T is of degree 2*(n-1) = 2n-2 so that it OVERLAPS the U*x^(n) component.
(the trick is that, used recursively to multiply a*c etc., this saves
coefficient mults: instead of n*m, about max(n,m)^1.585)
if p, q, are of degree (say) 10 or so, use conventional arithmetic. Just a guess.
What about this 2n-1 business? Find the max degree of p, q. If it is odd, we
are set. If it is even, 2n, do this:
p=a*x^n+ b where b is of degree n-1, but a is of degree n. Same deal works. I think.
|#
(defvar *karatlim* 12) ;; an educated guess -- don't use karatsuba for small probs.
(defun polymultk(pp qq);; by Karatsuba.
(assert (arrayp pp))
(assert (arrayp qq))
(let* ((plim (length pp))
(qlim (length qq)))
(declare (fixnum qlim plim)(type (array t) pp qq))
(if (< (min plim qlim) *karatlim*)(polymult pp qq); don't use for small polys
(let* (aa
(h (truncate (max plim qlim) 2))
(A (subseq pp (min h plim) plim))
(B (subseq pp 0 (min h plim)))
(C (subseq qq (min h qlim) qlim))
(D (subseq qq 0 (min h qlim)))
(U (polymultk(polyadd A B) (polyadd C D)))
(TT (polymultk A C))
(S (polymultk B D)))
(polysub-into S U 0) ; change U to U-S
(polysub-into TT U 0) ; change U to U-S-T
(setf aa (polyshiftleft TT (ash h 1))) ;aa = T*x^(2h)
(polyadd-into U aa h) ; change aa to T*x^(2h)+U*x^h
(polyadd-into S aa 0) ; change aa to T*x^(2h)+U*x^h +S
(pnorm aa) ;remove trailing 0's if any
))))
(defun polyshiftleft(p k) ; like multiplying by x^k
;; k is positive integer, p is a raw polynomial, no header
(let* ((oldlen (length p))
(newlen (+ oldlen k))
(newpoly (make-array newlen :initial-element 0)))
(do ((i (1- oldlen) (1- i)))
((< i 0) newpoly)
(setf (aref newpoly (+ k i))(aref p i)))))
(defun polyadd(p q)
(let* ((plim (length p))(qlim (length q))
(aa (make-array (max plim qlim) :initial-element 0)))
(if (> plim qlim)
(polyadd-1 p q plim qlim aa)
(polyadd-1 q p qlim plim aa))
;; we could normalize here, removing all trailing zeros but one
;; we would rather wait until we know we care about it.
;;(pnorm aa)
))
(defun polysub(p q) ;; subtraction. p-q
(let* ((plim (length p))(qlim (length q))
(aa (make-array (max plim qlim) :initial-element 0)))
(polysub-1 p q plim qlim aa)
;; we could normalize here, removing all trailing zeros but one
;; we would rather wait until we know we care about it.
;;(pnorm aa)
))
(defun polyadd-1 (p q pl ql aa)
;; p is longer, length pl. same length as array aa, filled with zeros.
;; p and q are unchanged.
(do ((i 0 (1+ i)))
((= i pl))
(setf (aref aa i)(aref p i))) ; copy longer input to answer
(do ((i 0 (1+ i)))
((= i ql) aa)
(incf (aref aa i)(aref q i))))
(defun polysub-1 (p q pl ql aa)
(do ((i 0 (1+ i)))
((= i pl))
(setf (aref aa i)(aref p i))) ; copy first arg
(do ((i 0 (1+ i)))
((= i ql) aa)
(decf (aref aa i)(aref q i)))) ; subtract second arg
;; defined in case p is not normalized
;; and has trailing zeros
(defun polyadd-into (p aa start) ;; hacked up a bit..
;; p is shorter than answer aa
;; p is unchanged. aa is changed to aa+ p*x^start
(declare (fixnum start))
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(declare (fixnum i))
(let ((m (aref p (- i start))))
(unless (= 0 m)
(incf (aref aa i) m)))))
(defun polysub-into (p aa start) ;; hacked up a bit..
;; p is shorter than answer aa
;; p is unchanged. aa is changed to aa- p*x^start
(declare (fixnum start))
(do ((i (+ (length p) start -1) (1- i)))
((< i start) aa)
(declare (fixnum i))
(let ((m (aref p (- i start))))
(unless (= 0 m)
(decf (aref aa i) m)))))
;; remove trailing zeros.
(defun pnorm (x) ; x is a vector
(declare (optimize (speed 3)(safety 0)(debug 0)))
(let ((x x) pos) ; pos may be nil, so do not declare it fixnum
(declare (simple-vector x)
(inline position-if-not))
(setq pos (position-if-not #'zerop x :from-end t))
(cond
((null pos) #(0)) ; change if we want #() to be zero!
((= pos (1- (length x))) x) ; already normal
(t (subseq x 0 (1+ pos) )))))
;;; The Toom r=2 version,
(defun evpnum(a b c)
;; return a list of evaluating a*x^2+b*x+c at x= -2,-1,0,1,2
;; a,b,c are numbers.
(let* ((w (+ a c))
(k2 (+ w b))
(k (+ k2 (* 3 a) b)))
(list (- k (* 4 b)) (- w b) c k2 k)))
;; example (evpnum 3 2 1) --> (9 2 1 6 17)
(defun evp(a b c)
;; return a list of evaluating a*x^2+b*x+c at x= -2,-1,0,1,2
;; a,b,c are polynomials. Note that the constant 3 as a polynomial is
;; the vector #(3). To construct it once at read-time, use #.#(3).
(let* ((w (polyadd a c))
(k2 (polyadd w b))
(k (polyadd k2 (polyadd (polymult #.#(3) a) b))) )
(list (polyadd k (polymult #.#(-4) b))
(polysub w b)
c k2 k)))
;; example (evp #(3) #(2) #(1)) --> (#(9) #(2) #(1) #(6) #(17))
(defvar *toomlim* 300)
(defun polymulttc(pp qq);; by T-C
(assert (arrayp pp))
(assert (arrayp qq))
(let* ((plim (length pp))
(qlim (length qq)))
(declare (fixnum qlim plim)(type (array t) pp qq)(special *toomlim*))
(if (< (min plim qlim) *toomlim*)
(polymultk pp qq) ; don't use for small polys, use karat
(let* ((ans (make-array (+ -1 plim qlim) :initial-element 0))
(h ;(round (max plim qlim) 3)
(truncate (max plim qlim) 3) )
;; split pp as A + B*x^h + C*x^(2h):
;; A and B have length h; C may be shorter than h.
(i1 (min (* 2 h) plim))
(C1 (subseq pp i1 plim))
(i0 (min h plim))
(A1 (subseq pp 0 i0))
(B1 (subseq pp i0 i1))
(j1 (min (* 2 h) qlim))
(C2 (subseq qq j1 qlim))
(j0 (min h qlim))
(A2 (subseq qq 0 j0))
(B2 (subseq qq j0 j1))
;; the following could be done by multiple value returns, in-line code..
;; if it seems to be slower than necessary.
(p1 (evp A1 B1 C1)) ;; evaluate at points -2, -1, 0, 1, 2
(p2 (evp A2 B2 C2))
;; recursively call toom-cook multiplication
(q (mapcar #'polymulttc p1 p2))
;; q is list of q(-2) q(-1) q(0) q(1) q(2)
;; (elt 0) (elt 1) (elt 2) (elt 3) (elt 4)
(t1 (polymult #.#(1/2) (polysub (elt q 3) (elt q 1))))
(t3 (polysub (polymult #.#(1/2) (polyadd (elt q 3) (elt q 1)))
(elt q 2)))
(b (polymult #.#(1/3)
(polysub (polymult #.#(1/4)
(polysub (elt q 4) (elt q 0)))
t1)))
(a (polymult #.#(1/3)
(polysub
(polymult #.#(1/4)
(polysub
(polymult #.#(1/2)
(polyadd (elt q 4) (elt q 0)))
(elt q 2)))
t3))))
(declare (fixnum h i0 i1 j0 j1))
(polyadd-into
(elt q 2)
(polyadd-into
(polysub t1 b)
(polyadd-into (polysub t3 a)
(polyadd-into b
(polyadd-into a ans 0)
h)
(* h 2))
(* h 3))
(* h 4))
))))
(defun polymultbyC (c q) ;; constant X poly, in place.
(assert (numberp c))
(assert (arrayp q))
(let ((qlim (length q)))
(declare(fixnum qlim))
(do ((i 0 (1+ i)))
((= i qlim) q)
(declare (fixnum i))
(setf (aref q i)
(* c(aref q i))))))
#| Kamal Abdali's suggestion for a 5-mult quadratic X quadratic program
qmulk(a1,b1,c1,a2,b2,c2):= /* take 6 coefficients as input */
/* two quadratics, a1*x^2+b1*x+c1, a2*x^2+b2*x+c2, with
product q(x)=a*x^4+b*x^3+c*x^2+d*x+e.
qat1 =q(1), qatm1=q(-1),qat2=q(2) */
block([a,b,e,qat1,qatm1,qat2,t8,t9,t10],
a: a1*a2,
e: c1*c2,
qat1: /* a+b+c+d+e,*/ (a1+b1+c1)*(a2+b2+c2),
qatm1: /* a-b+c-d+e,*/ (a1-b1+c1)*(a2-b2+c2),
qat2 : /*16*a+8*b+4*c+2*d+e,*/ (4*a1+2*b1+c1)*(4*a2+2*b2+c2),
/*4 shifts and one mult*/
t9: (qat1-qatm1)/2,
t8: (qat1+qatm1)/2-a -e,
t10: (qat2-16*a-4*t8-e)/2, /* t10= 4*b+d */
b: (t10-t9)/3 ,
/* returns the coefficients of the product, correct, 4/10/05 */
[a,b, t8, t9-b, e])$
|#
(defvar *abdalim* 30)
;; works!! not so fast as tc, though
;; polymult by KA to make it faster
(defun pka(pp qq);; by Kamal Abdali
(assert (arrayp pp))
(assert (arrayp qq))
(let* ((plim (length pp))
(qlim (length qq)))
(declare (fixnum qlim plim)(type (array t) pp qq))
(if (< (min plim qlim) *abdalim*)
(polymultk pp qq) ; don't use for small polys, use karat
(let* ((ans (make-array (+ -1 plim qlim) :initial-element 0))
(h ;(round (max plim qlim) 3)
(truncate (max plim qlim) 3) )
;; split pp as C + B*x^h + A*x^(2h):
;; C and B have length h; A may be shorter than h.
(i1 (min (* 2 h) plim))
(A1 (subseq pp i1 plim))
(i0 (min h plim))
(C1 (subseq pp 0 i0))
(B1 (subseq pp i0 i1))
(j1 (min (* 2 h) qlim))
(A2 (subseq qq j1 qlim))
(j0 (min h qlim))
(C2 (subseq qq 0 j0))
(B2 (subseq qq j0 j1))
(A (pka A1 A2))
(E (pka C1 C2))
(A1plusC1 (polyadd A1 C1))
(A2plusC2 (polyadd A2 C2))
(qat1 (pka (polyadd A1plusC1 B1)(polyadd A2plusC2 B2)))
(qatm1(pka (polysub A1plusC1 B1)(polysub A2plusC2 B2)))
(qat2 (pka
(polyadd
(polyadd
(polymultbyC 4 A1) ;; destroy A1
(polymultbyC 2 B1)) ;; destroy B1
C1)
(polyadd
(polyadd
(polymultbyC 4 A2) ;; destroy A2
(polymultbyC 2 B2)) ;; destroy B2
C2)))
(t9 (polymultbyC 1/2 (polysub qat1 qatm1)))
(t8 (polysub (polysub(polymultbyC 1/2 (polyadd qat1 qatm1))A)E))
(t10 (polymultbyC 1/2 (polysub
(polysub
(polysub qat2
(polymult #.#(16) A)) ;; keep A for later
(polymult #.#(4) t8)) ;; keep t8 for later
E)))
(B (polymultbyC 1/3 (polysub t10 t9))))
(declare (fixnum h i0 i1 j0 j1))
(polyadd-into
A
(polyadd-into
B
(polyadd-into t8
(polyadd-into (polysub t9 B) (polyadd-into E ans 0)
h)
(* h 2))
(* h 3))
(* h 4)) ))))
\end{verbatim}
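As a cross-check on Abdali's 5-multiplication scheme transcribed in the comment block above, here is a small Python sketch (added for illustration; the function names are mine, and this is not part of the original Lisp or Maxima code). It reconstructs the product coefficients of two quadratics from $q(1)$, $q(-1)$, $q(2)$ and the two outer products, and compares against classical multiplication:

```python
# Sketch (illustration only): Abdali's 5-multiplication scheme for the
# product q(x) = a*x^4 + b*x^3 + c*x^2 + d*x + e of two quadratics
# a1*x^2 + b1*x + c1 and a2*x^2 + b2*x + c2.
def qmulk(a1, b1, c1, a2, b2, c2):
    a = a1 * a2                                       # leading coefficient
    e = c1 * c2                                       # constant coefficient
    qat1  = (a1 + b1 + c1) * (a2 + b2 + c2)           # q(1)
    qatm1 = (a1 - b1 + c1) * (a2 - b2 + c2)           # q(-1)
    qat2  = (4*a1 + 2*b1 + c1) * (4*a2 + 2*b2 + c2)   # q(2)
    t9 = (qat1 - qatm1) // 2                          # b + d (divisions exact)
    t8 = (qat1 + qatm1) // 2 - a - e                  # c
    t10 = (qat2 - 16*a - 4*t8 - e) // 2               # 4b + d
    b = (t10 - t9) // 3
    return [a, b, t8, t9 - b, e]                      # [a, b, c, d, e]

def classical(a1, b1, c1, a2, b2, c2):
    # product coefficients, highest degree first, by the schoolbook method
    return [a1*a2, a1*b2 + b1*a2, a1*c2 + b1*b2 + c1*a2,
            b1*c2 + c1*b2, c1*c2]

assert qmulk(3, 2, 1, 5, 4, 7) == classical(3, 2, 1, 5, 4, 7)
```

Only five coefficient multiplications occur, versus nine for the classical method; the rest is additions and exact small divisions, as in the Maxima version above.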
{\footnote {Private communication, Yozo Hida.}
Let the lengths of the vectors be $2^n$, let
$u$ be the machine precision (here, $2^{-53}$), and let $b$ be
the absolute accuracy of the weights.
If they are rounded from exact
values, $b=u/\sqrt{2}$. Let $r=3u+(3+1/n)\sqrt{5}\,u+3b$ (this assumes $rn<1$);
then
$||z^\prime-z||_\infty < ||x||_2\,||y||_2\, r n /(1-rn).$
%;; r is like 1.5615944520686534d-15 ; log r base 2 is -49.18
%
%;; bound is sort of like 1.56X10^{-15}*xnorm*ynorm*n.
While the details depend on some improvements in the FFT
to get the weights exactly right, suppose we approximate the
2-norm by the $\infty$-norm and take rather large inputs and outputs
(polynomials of degree $2^n$, with $n \le 16$).
To make sure that any coefficient in the answer,
bounded by about
$2^n||x||_1||y||_1$, is rounded correctly to the nearest integer,
we are limited to about 29 correct bits of coefficient data.
So if we have arbitrary-precision coefficients in each of the
inputs, we could try reducing all of them to 14-bit moduli:
hardly an appealing prospect.
}
{Appendix I
Richard,
Indeed. In fact, I ran across your paper on this topic
just last week, while I was googling for something (I can't remember
what). As I recalled from your paper, it seemed like this idea
did not work so well -- but I agree, it certainly does simplify things.
As it happens, I did implement this Kronecker substitution idea
in implementing polynomial multiplication GF(2^n)[X], when n is less than the
word size. This is definitely a good idea, since in either case,
I just use Karatsuba.
-- Victor
On Mar 22, 2005, at 5:40 PM, Richard Fateman wrote:
> Thanks. But then small enough coefficients can be packed several in a
> word, so the multiplication of polynomials of small degree can
> be done in one machine instruction! I have done so with GMP, packing
> (any number of ...) coefficients, appropriately padded, into long integers.
>
> This provides a method for multiplying polynomials which inherits
> from the integer GMP arithmetic the choice of classical vs. various
> alternatives. In some sense this reduction from polynomial to
> long-integer arithmetic (essentially evaluating at a power of 2)
> seems "backwards". But it works.
>
> Regards
> RJF
>
>
{> Victor Shoup wrote:
>
>> Hi,
>> I still support and maintain NTL, but I don't have
>> any good numbers for comparison right now.
>> My Z[X] multiplication routine chooses from among a number
>> of different implementations, and there are certainly cases when FFT
>> is faster than Karatsuba, for quite small inputs.
>> If the coefficients are rather small, then the "small prime" FFT is the method of choice.
>> In this case, the FFT itself does not use any multi-precision arithmetic,
>> and so is quite fast. The CRA is also quite fast.
>> -- Victor}
>> On Mar 22, 2005, at 3:54 PM, Richard Fateman wrote:
>>
>>> Re JSC article, 1995, on new poly. factorization etc...
>>>
>>> It has been a while since this was published, but I didn't
>>> look at it carefully when it first appeared. I was prompted
>>> to look at it from a comment made by Erich Kaltofen, regarding
>>> FFT being practical for polynomial multiplication.
>>>
>>> My impression continues to be that FFT is not good, and that
>>> its primary purpose is as an asymptotic marker for M(n), allowing
>>> theoreticians to make claims of asymptotic superiority of
>>> one algorithm over another. That is, the actual practicality
>>> may differ.
>>>
>>> Your work has been far more practical than some, involving actual
>>> high speed implementation, so I figure I should perhaps explain
>>> my impression, and ask your opinion.
>>>
>>> Using FFT in asymptotic arguments is not a problem.
>>>
>>> To claim that FFT is a practical algorithm for polynomials of
>>> degree 30 or so suggests that one should actually use some
>>> version of the FFT. In my own programming experience, some
>>> version of a Karatsuba-like algorithm is faster than classical
>>> multiplication, as well as faster than FFT, for polynomials of
>>> substantial degree. There may be a cross-over point where FFT
>>> is faster, practically, than Karatsuba, but it is far larger than 30.
>>>
>>> Have you experimented with this (since 1995)? Do you have any
>>> more data?
>>>
>>>
>>> Karatsuba has many of the same defects as FFT in practice, because
>>> it benefits from equal-sized inputs, preferably dense. But FFT
>>> has more defects in practice because of the need to bound coefficients
>>> and use the CRA.
>>>
>>>
>>> Any thoughts?
>>>
>>>
>>> Best wishes
>>>
>>> Richard Fateman
>>>
>>>
>
>
}...........................
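The packing idea in the correspondence above (Kronecker substitution: evaluate each polynomial at a sufficiently large power of 2, multiply once as big integers, then unpack the product coefficients) can be sketched as follows. This is a minimal illustration assuming non-negative coefficients; it is not the GMP-based code mentioned in the letter:

```python
# Sketch of Kronecker substitution for polynomials with non-negative
# integer coefficients (illustration only; a list index i holds the
# coefficient of x^i).
def kronecker_mul(p, q, coeff_bits):
    # Choose a slot width so no product coefficient overflows into its
    # neighbor: each is at most min(len(p), len(q)) * (2^coeff_bits)^2.
    slot = 2 * coeff_bits + max(len(p), len(q)).bit_length()
    xp = sum(c << (i * slot) for i, c in enumerate(p))  # p evaluated at 2^slot
    xq = sum(c << (i * slot) for i, c in enumerate(q))
    prod = xp * xq                         # one big-integer multiplication
    mask = (1 << slot) - 1
    out = []
    for _ in range(len(p) + len(q) - 1):   # unpack the product coefficients
        out.append(prod & mask)
        prod >>= slot
    return out

# (1 + x)^2 via packing: kronecker_mul([1, 1], [1, 1], 1) -> [1, 2, 1]
```

The integer multiply then inherits whatever algorithm the bignum library chooses (classical, Karatsuba, or FFT), which is the "backwards" reduction described in the letter.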
Fresh times, 5/25/05
construct two polynomials of length 512 with random pos. coefficients
of 232 bits.
On windsor 933Mhz pentium III, NTL 5.4 :
Classical multiply programmed in
Lisp using ordinary Lisp Bignums: 22.903 sec ///THIS DATA IS WRONG
Classical multiply programmed in
Lisp using NTL's integers.
the multiply itself takes 3.485
the in/out conversion 200ms 3.685 sec
I think this could be sped up by
doing a unitary a:=a+b*c
instead of separate add, multiply.
Karatsuba multiplication programmed in
Lisp using Lisp bignums, 0.440 sec
USING NTL totally, calling mul(zzx,zzx)
Pre-converting to NTL,(taking about 200 ms)
then the multiply takes 281 ms
then the out conversion, 71 ms. total 0.552 sec
Note that this is size critical:
bumping up the degree to 514 takes 440 ms
plus the conversion time, total 0.711 sec
Karatsuba multiplication,
using NTL bignums. not yet programmed.
(Probably it will be somewhat faster)
(I speculate that a Toom version
would not be faster.)
....
My conclusion: NTL's use of FFT at this degree
and coefficient size is of marginal utility.
Other tests show that
NTL is significantly slower at this degree
if the coefficents are say, all 1.
.....
Any thoughts?
....
5/25, later..
;; data from polymultx.cl
;; times on windsor, 933Mhz pentium
(setf h (randpol 512 332))
(setf j1 (lisp2NTLZZX h)) ;;.140 sec
(setf j2 (lisp2NTLZZX h)) ;;.140 sec
(time (LMul2 j1 j2)) ;; just call NTL polynomial multiplier ;;.401 sec
(time (NTLZZX2lisp *)) ;; .120 sec
;; total is .802 sec
;; I changed code in ZZX1.cpp so that only classical multiplication is
;; used. Then the time taken is 5.527 seconds!!
(time (polymultk h h)) ;;.571 sec
(time (polymultclass h h)) ;; 2.453 sec
(time (polymultclassntl h h)) ;; 6.720 sec includes conversions
;;;
(setf h (randpol 512 232)) ;;degree 512 coef between 0 and 2^232
(time (setf j1 (lisp2NTLZZX h))) ;;.100 sec
(time (setf j2 (lisp2NTLZZX h))) ;;.100 sec
(time (LMul2 j1 j2)) ;; just call NTL polynomial multiplier ;;.300 sec
;;; note: if using (randpol 512 232), the time jumps to .500 sec
(time (NTLZZX2lisp *)) ;; .130 sec
;; total is .630 sec
(time (setf ja1 (map 'array #'Lispint2NTLZZ h))) ;.100 sec
(time (polymultk h h)) ;;.331 sec
(time (polymultclass h h)) ;; 1.452 sec
(time (polymultclassntl h h)) ;; 4.216 sec includes conversions ???
(time (polymultntl ja1 ja1)) ;; 3.65 sec NO conversion time.
(time (polymult h h)) ;; 1.783 sec ;; adaptive.
another piece of data: on 933Mhz, Maple 7:
q:=randpoly([x],degree=511,dense,coeffs=rand(0..2^232)):
q1:=q+1;
t1:=time():
expand(q*q1):
time()-t1 is 1.972 seconds
Slower than most, and answer is unsorted.
maxima [rat] does this in about 7.22 seconds; if the coefficients
are all one instead of 2^232, the time is about 0.13 sec
RJF
7/12/05, on windsor, 933Mhz
.........
C:\cygwin\usr\fateman\ntl-5.4usingGMP\src>QuickTest
QuickTest
This is NTL version 5.4
Basic Configuration Options:
NTL_STD_CXX
NTL_GMP_LIP
Resolution of double-word types:
long long
unsigned long long
Performance Options:
NTL_SPMM_ASM
NTL_AVOID_BRANCHING
NTL_GF2X_ALTCODE
NTL_GF2X_NOINLINE
test is OK
time for 1024-bit mul: 9.01us
time for 2048/1024-bit rem: 11.82us
time for 1024-bit modular inverse: 170us
time to multiply degree 1023 polynomials
modulo a 1024-bit number: 0.29695s
......................
C:\cygwin\usr\fateman\ntl-5.4\src>QuickTest
QuickTest
This is NTL version 5.4
Basic Configuration Options:
NTL_STD_CXX
Resolution of double-word types:
long long
unsigned long long
Performance Options:
test is OK
time for 1024-bit mul: 25.14us
time for 2048/1024-bit rem: 36.65us
time for 1024-bit modular inverse: 391us
time to multiply degree 1023 polynomials
modulo a 1024-bit number: 0.5027s
|#
Date: Tue, 14 Mar 2006 13:56:36 +0100
From: Torbjorn Granlund
Subject: When is FFT practical?
Sender: tege@king.swox.se
To: fateman@cs.berkeley.edu
Cc: zimmerma@loria.fr
Hello,
Paul Zimmermann showed me your paper in FFT on arbitrary large
coefficient polynomials.
I have some remarks:
* page 2, under Theoretically ... Switch log_2(3) and log_3(5).
* Mr Cook translated Toom's Russian paper to English, but I don't
think that feat makes him deserve credit for the algorithm.
Therefore, the proper term is "Toom's algorithm".
* 933MHz Pentium Pro must be a mistake. The "Pro" never ran more than
about 200MHz. Perhaps you mean Pentium 3? (Pentium 2 also didn't
reach 933MHz.) (These processors share their guts, the only
differences are that they got more and more SIMD instructions.)
* 0.930 GHz should be 0.933 GHz for consistency with "933MHz".
* Nitpick: You make space before GHz but not before MHz.
* Section 4, around line 6: suggest "in some number of finite fields"
is changed to "in some small number of finite fields" for clarity.
It doesn't make sense to use more than about 5 primes, from my
(quite extensive) experience.
* Section 4, 2nd paragraph. I think no "remainder hardware" needs to
be involved in computing mod these word sized P. I think either
Montgomery's REDC can be used, or one of several other methods
suggested by various authors. In fact, Victor Shoup has some
tricks involving precomputation of the roots tables wrt each P that
might be better than REDC.
* (see the FFTW reference). => (See the FFTW reference.)
* A IEEE double has 53 bits plus sign (although the msb is implicit,
but that should be irrelevant in the present context).
* I don't think testing the accuracy with plain random input estimates
the needed precision very well. I think you'll end up too
optimistic, and risk getting incorrect results.
* Please write "GNU", not "Gnu".
* The latest edition of Knuth's Seminumerical Algorithms is from 1998.
* Any serious FFT implementation today needs Bailey's twostep (or
something like it). You might want to mention it.
--
Torbjörn
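Shoup's and Granlund's remarks above concern the "small prime" approach: perform the FFT as a number-theoretic transform modulo one or more word-sized primes and recombine by CRA. A minimal single-modulus sketch in Python (the prime $p = 998244353 = 119\cdot 2^{23}+1$ with primitive root 3 is a standard illustrative choice, not taken from NTL; a real implementation would use several such primes, iterative transforms, and the precomputed-root tricks mentioned above):

```python
# Number-theoretic transform modulo one word-sized prime (illustration).
P, G = 998244353, 3          # P = 119*2^23 + 1, G a primitive root mod P

def ntt(a, invert=False):
    n = len(a)               # n must be a power of two dividing 2^23
    if n == 1:
        return a
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    w = pow(G, (P - 1) // n, P)        # primitive n-th root of unity mod P
    if invert:
        w = pow(w, P - 2, P)           # use the inverse root for the inverse
    out, wk = [0] * n, 1
    for k in range(n // 2):            # Cooley-Tukey butterfly
        t = wk * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        wk = wk * w % P
    return out

def ntt_mul(p, q):
    n = 1
    while n < len(p) + len(q) - 1:     # pad to a power of two
        n *= 2
    fa = ntt(p + [0] * (n - len(p)))
    fb = ntt(q + [0] * (n - len(q)))
    fc = ntt([x * y % P for x, y in zip(fa, fb)], invert=True)
    ninv = pow(n, P - 2, P)            # scale the inverse transform by 1/n
    return [x * ninv % P for x in fc[:len(p) + len(q) - 1]]
```

No multi-precision arithmetic appears inside the transform, which is why this variant is fast when the coefficients (or their residues) fit in a word.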
.........................ISSAC 2006 reviewers
-------------------- review 238 --------------------
Overall comment: The paper explores the important issue of the crossover point between classical and asymptotically fast multiplication of polynomials. However, the paper is poorly organized and does not present enough data to make the point. The paper is also vague in its methodology: measurement methods, data presentation, and language and compiler use.
p. 2: What version of Common Lisp was used?
p. 2: The timing data - how was it obtained? It would be better to have data for more than one particular degree.
p. 3: The CRA "stitching together" performance estimation. I would actually measure this.
p. 3: Could we have results on a computer that is a bit more updated than Pentium Pro?
p. 3: Please report results for debugged code only.
p. 3: The reason why Lisp is faster for small coefficients is because both NTL and GMP have high overhead.
p. 4: On implementing efficient numeric FFT, why not use automatic tuning techniques?
p. 4: Common Lisp - What is the highest-runtime-speed optimization setting?
p. 4: The 2.33 Intel machine has which processor – Pentium 4?
p. 5: Experimental data for degree 9,999 is insufficient. I would want to see data for many degrees. From degree 100 to 10000 would be nice.
p. 5: The comment that computers with 32-bit ints and 64-bit doubles will have faster floating point arithmetic is not always true. This will depend on the processor -- how many int and fp ALUs are available for the ILP and how well your implementation of the FFT uses the superscalar features of the target processor.
p. 6: Which memory allocation "tricks" are implied?
p. 6: On current machines, GMP arithmetic uses 64-bit digits to represent "bignum" integers.
-------------------- review 237 --------------------
The submitted paper discusses the implementation and
usage of FFT-based univariate polynomial multiplication.
The paper contains some interesting ideas such as
(1) using a floating-point FFT for multiplying polynomials
of small degree (say, less than 2^{11}) and modulo
a small prime (less than 2^{19}, say).
(2) using a combination of (1) and CRA for handling
polynomials with larger degrees or larger coefficients.
However:
(a) The paper does not contain any precise analysis that,
given two input polynomials, would tell us whether (1)
or (2) could be used for computing their product. Such
analysis is trivial for the classical multiplication, but
requires one to write things down (more precisely than
what is done in Section 5 of the paper) for an FFT-based
multiplication algorithm.
(b) The paper contains poor experimental results, in particular for (1)
(c) the paper does not contain experimental results for (2)
whereas this is, in my opinion, the main application of
FFT-based polynomial arithmetic.
Additional weak points:
(1) In Section 3, the author mentions an Appendix 2
which is not included in the paper
(2) The author claims in the abstract that
<< no major general-purpose computer algebra system uses the FFT as a
general default technique for this purpose. >>
Actually, MAGMA does use FFT-based arithmetic with its dense
polynomials!
(3) In the conclusion, the author tries to suggest a recipe
for multiplying univariate polynomials, covering the
different cases, based on input degrees and coefficient sizes.
I think that the reported experiments are too sketchy to
allow this conclusion