\documentclass{article}
\usepackage{fullpage}
\title{DRAFT 11: What's it worth to write a short program for polynomial multiplication?}
\author{Richard Fateman}
\begin{document}
\maketitle
\begin{abstract}
One of the major advantages of writing computer programs in a
higher-level language is that the programs tend to be shorter, easier
to write, read, modify, and debug. Numerous programs for
multiplication of univariate polynomials over the integers are
available in open-source packages. What is the smallest expression in
a programming language for such a program? (Assuming that the language
does not have ``multiplication of polynomials'' as a primitive!) Is
it possible that this program runs acceptably fast? How much
elaboration is required to make it faster? How much elaboration is
required to make it work for multivariate polynomials? Or coefficient
domains other than integers? For concreteness, we provide programs in
Lisp, and we also assume a certain representation that may or may not
be best for any given application.
\end{abstract}
\section{The task}
Abstractly we can consider a polynomial as a list of
coefficient-exponent pairs. For example,
$7+9x+34x^5$ might be
{\tt ((0 . 7)(1 . 9)(5 . 34))}.
For concreteness we have decided that the list
will be in increasing order of exponent, and the pairs will be ordered as
{\tt (exponent . coefficient)}. Furthermore, in order to avoid a situation in
which two ``equal'' polynomials will have different representations,
we require that only non-zero coefficients are recorded. The value $0$ is
represented by an empty list.
The task is to write a program that takes two such polynomials and
returns a new polynomial which is their product, in the same form, fast.
Complicating the mapping of this abstraction into a conventional
linked list is that one does not ordinarily have easy $O(1)$ access to
the length of the list.\footnote{Unless one associates a length with
the list and updates it as necessary.} Writing in Lisp, where a
linked list is easily manipulated, a programmer would ordinarily use
lists for all intermediate and final results. As we shall see, this
is not necessarily efficient in time or space. Fortunately, lists can
usually be converted to (and from) other useful representations in
time linear in the length of the inputs. One representation is as a
pair of vectors: we separate the coefficient vector and the exponent
vector, considering that the exponents will all be integers, but the
coefficients might be any sort of number. For the example
above, still ordering the terms from low to high degree, this pair
would be ([0 1 5], [7 9 34]).
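As a concrete illustration, a conversion from the list form to this
pair-of-vectors form takes only a few lines (the function name
{\tt list2vectors} is ours, not from any package):
\begin{verbatim}
;; convert ((exp . coef) ...) with increasing exponents into a pair
;; of parallel vectors: one of exponents, one of coefficients.
(defun list2vectors (L)
  (list (map 'vector #'car L)    ; exponent vector
        (map 'vector #'cdr L)))  ; coefficient vector
;; (list2vectors '((0 . 7)(1 . 9)(5 . 34))) => (#(0 1 5) #(7 9 34))
\end{verbatim}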
Among other plausible representations, we can store the coefficients
{\em only} and let the exponents be implicit in the order. We then
have to represent zero coefficients. For our example: [7 9 0 0 0
34]. If there are large gaps, this representation requires large
sequences of zeros and is potentially very uneconomical. This
hazard of ``dense'' representation must be balanced against
the simplicity and speed of accessing coefficients:
it is trivial in such a representation to retrieve any coefficient or
store a newly-computed replacement coefficient: Just index into that
vector.
At this point let us elaborate on the nature of coefficients in a
computer algebra system intending to ``do arithmetic right.'' It is
necessary to consider the possibility that the set of coefficients may
include very large numbers that do not fit into a single computer
``word''. Indeed, some tasks require the manipulation of exact
integers of hundreds or thousands of digits. It may also be useful to
consider other types of coefficients, typically rational numbers,
hardware or software floating-point numbers, or elements from a finite
field.
Since it is also possible to construct cases where the exponents are larger
than can be stored in a single word, any representation must allow
for this, unlikely as it may seem at first glance. This may happen as
a consequence of ``packing'' multivariate polynomials into univariate ones.
One technique we programmed is to assume that exponents fit in single words,
and in the case of larger exponents, construct a superstructure which maps
the problem into collections of problems with smaller exponents, with
a ``scaling factor''.
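To make the packing idea concrete, here is a hypothetical sketch (ours,
not the program referred to above) of encoding a bivariate exponent pair
as a single integer exponent, assuming every exponent of one variable is
below a bound $d$:
\begin{verbatim}
;; pack the exponent pair (ex, ey) into one integer, assuming 0 <= ex < d
(defun pack-expt (ex ey d) (+ ex (* ey d)))
(defun unpack-expt (e d) (cons (mod e d) (floor e d)))
;; (pack-expt 2 3 100) => 302 ; (unpack-expt 302 100) => (2 . 3)
\end{verbatim}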
\section {Solutions}
There is a long history of trying different methods for solving this
problem ``best''. See for example, the paper by Robert Moenck
\cite{moenck}. The ground has shifted somewhat since then, partly
because people are testing larger problems in which they believe that
the time for multiplication of sparse polynomials of size $N$ and $M$
is not so much dominated by the $NM$ coefficient multiplications
(typically arbitrary precision integers), but by the time to coalesce
and sort the $K=NM$ coefficients into a sorted collection. This time
is asymptotically larger: $K\log K$, and can potentially dominate the
calculation for large polynomials with simple coefficients.
In these larger problems and using newer computer designs, cache
misses related to patterns of memory access may affect execution time
more than operation counts. Furthermore, even in a conventional
memory model, finding space for the coefficients may routinely cost
more than computing them. For example, the time for cache-resident
arithmetic instructions to compute multiple-precision multiplication
and addition needed for a coefficient might be 50 or 100 cycles, a
substantially shorter time than the period needed to find an
appropriate new storage location for the result.
Furthermore, some attention may be paid (at least lip service) to
parallelism, now that multi-core machines are more common.
{\section{How likely is it that a polynomial product will be sparse?}
First of all, let us say what we think of as sparse. Start with two
univariate polynomials $f$ and $g$ with $N$ and $M$ non-zero terms
respectively. We know that their product will have a maximum degree
that is the sum of the degrees of $f$ and $g$, $d_f$ and $d_g$
respectively, but with some of the intermediate terms zero. A
non-sparse answer will have size (that is, number of non-zero terms)
not much larger than $N+M$, and perhaps even less if cancellations
occur. A sparse answer will have size close to $NM$. It can't have
more than $d_f + d_g + 1$ terms, so for a sparse
output the degrees $d_f$ and $d_g$ must be (extravagantly) higher than the
numbers of terms $N$ and $M$.
For a concrete example, take two polynomials $f$ and $g$ each with
$N=1000$ terms and compute $fg$. If the result $fg$ is to have nearly
the maximum possible number of distinct terms ($1,000,000$) then the
sum $d_f + d_g$ must be at least $1,000,000$. Say each is $500,000$. That
means that only about one in 500 coefficients in $f$ will be non-zero,
and similarly for $g$. Mighty sparse, it seems. Indeed, a little
testing on randomly generated polynomials suggests that this is an
underestimate, and the degree of each had better be at least
$5,000,000$ and perhaps even $50,000,000$. This means the input
polynomials will contain a non-zero coefficient for every $50,000$
zeros. This is our estimate if you want to get 99.3 percent of the
total possible number of terms in the result.
Our conclusions are based on tests where we in fact generate two polynomials
and multiply them. We first choose a specified spacing $s$ such that
the spacing between coefficients in $f$ and $g$ is a random
number between 1 and $s$, generate two polynomials, multiply them,
and look at the result.
\begin{verbatim}
size(f)=size(g)=1000
s deg(f) deg f*g size f*g
1 999 1998 1999 dense
10 5493 11009 10970
100 49789 99659 94751
500 247878 493993 373392 250 zeros for every non-zero coef
1000 499820 1003671 569750 500 zeros for every non-zero coef
5000 2371811 4812134 875713 2500 zeros for every non-zero coef
10000 5001026 9916978 935760 5000 zeros. Close to fully sparse answer.
100000 49257894 98668403 993174
1000000 496930601 990887951 999345
10000000 4931905803 9815952362 999936
\end{verbatim}
What this means is that it is probably acceptable to do extra work
which is proportional to the number of terms in the input, or, often,
proportional to the number of terms in the output, as a preliminary
step to doing the multiplications. This will burn you only in
extraordinarily sparse situations. Fortunately it is probably
easy to detect those cases if you care to look: compare the degree and
the number of terms. This is not linear but even better: $O(1)$ in cost.}
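Such a check might look like the following sketch (our code, not from
any package); note that on the bare list representation the degree and
the length each take linear time to obtain, so the $O(1)$ claim presumes
they are already known or cached:
\begin{verbatim}
;; crude sparseness test: degree much larger than the number of terms.
;; L is ((exp . coef) ...) in increasing exponent order.
(defun very-sparse-p (L &optional (factor 100))
  (and L (> (caar (last L))            ; the degree
            (* factor (length L)))))   ; vs. the number of terms
\end{verbatim}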
\subsection{Dense arrays for internal multiplication}
\begin{verbatim}
;; convert a list to a dense array of coefficients
(defun list2array(L) ;; ((low-exponent . coeff) ...)
;;(list2array '((4 . 3.14d0)(10 . 1/2 )(12 . 300)) )
;;#(0 0 0 0 3.14d0 0 0 0 0 0 1/2 0 300)
(let ((ans (if (null L) #()
(make-array
(+ 1 (caar (last L))) ; array length = highest exponent + 1
:initial-element 0))))
(dolist (i L ans) (setf (aref ans (car i))(cdr i)))))
(defun array2list(A) ;; inverse of above, convert array to list
(let ((ans nil))
(do ((i (1- (length A))(1- i))) ;for each element in array, store in list
((< i 0) ans)
(unless (= 0 (aref A i)) (push (cons i (aref A i)) ans)))))
(defun ptimes (L1 L2)(array2list (ptimesA (list2array L1)(list2array L2))))
(defun ptimesA(A1 A2)
(let ((ans (make-array (+ (length A1)(length A2) -1) :initial-element 0)))
(dotimes (i (length A1) ans)
(dotimes (j (length A2))
(incf (aref ans (+ i j)) (* (aref A1 i)(aref A2 j)))))))
;;test
(ptimes '(( 0 . -1) (1 . 1)) '((0 . 1) (1 . 1)))
;; answer is (should be ..)
((0 . -1) (2 . 1))
\end{verbatim}
This solution is not bad if the input data is relatively small and low
degree\footnote{We have not quantified ``relatively small and low
degree'' but at least in some tests it appears that we could use the
phrase ``less than ginormous'' ... thousands of terms. That is,
arrays are often fast.}, and the built-in Lisp arithmetic for
arbitrary-precision arithmetic, rational numbers, floats, etc. is
needed for generality, and the arithmetic implementation is good, and
array referencing is compiled nicely. Some optimizations might be
worthwhile in {\tt ptimesA} such as declaring the index variables as
fixnums. If they are not, it is unlikely that you would be able to
allocate the temporary array, or if you could do so, iterate through
the task in a reasonable time. A simple improvement is to take {\tt
(aref A1 i)} out of the inner loop, if the compiler has not already
figured this out.
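For instance, a sketch of such an optimized variant (untested against
any particular compiler's output) might read:
\begin{verbatim}
;; ptimesA with fixnum declarations and (aref A1 i) hoisted out
;; of the inner loop.
(defun ptimesA-opt (A1 A2)
  (let ((ans (make-array (+ (length A1)(length A2) -1) :initial-element 0)))
    (dotimes (i (length A1) ans)
      (declare (fixnum i))
      (let ((c (aref A1 i)))           ; hoisted out of the inner loop
        (dotimes (j (length A2))
          (declare (fixnum j))
          (incf (aref ans (+ i j)) (* c (aref A2 j))))))))
\end{verbatim}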
If the input polynomials are sufficiently large and dense, and the
input coefficients are of modest size as well as the answer
coefficients: that is, they easily fit (exactly) in a double-precision
word, then it can pay off to replace the program {\tt ptimesA} with a
more elaborate program based on a numerical ``convolution''
implemented via a Fast Fourier Transform (FFT). We have discussed this
elsewhere \cite{fateman}, and have programs (in Lisp) to support
this. Calling an FFT library written in some (primarily numerical)
language like Fortran or C is an alternative to using Lisp as an
implementation language; competitions for the cleverest, perhaps
parallel, FFT are more commonly held using these
languages. Convolution is useful in various contexts, and is, for
example, a primitive function in Matlab ({\tt conv}).
With an FFT-based multiplication algorithm one computes two FFTs of
size sufficient to represent the answer, with coefficients in a
sufficiently-large computation structure. We compute the FFT of each
input, then multiply these values point-wise, and then compute the
inverse FFT of this product. The details of this are shown in an
appendix for multiplying by FFT, taking in total less than 80
non-comment lines. The program {\tt ptimesfft} is a direct replacement
for {\tt ptimesA}, but will get exactly correct answers {\em only if
the computation of the FFT in floating double-precision is
sufficiently accurate that the answer rounds to the correct integer
result}. This may be likely for modest sizes \cite{fateman}. If it is
not sufficiently accurate, it still may be worth doing in order to
find the approximate value (and magnitude) of the coefficients, which
would then give us valuable information (in particular, {\em the
precision needed to re-do the FFT to get the answer rounded exactly to
the nearest integer.}) It may also be the case that the inputs or the
results overflow the floating-point format, although this may be
excluded by prior analysis if numbers exceeding 308 decimal digits
cannot occur. An alternative to higher precision is a version of the
FFT in a discrete domain that is large enough to encode the problem;
alternatively we can compute a collection of FFTs in different finite
fields, each with a single-word modulus. These can be combined via the
Chinese Remainder Algorithm, for fully accurate results.
There are other schemes which do not go so far afield from basic
arithmetic but still provide an advantage, especially for multiplying
mostly-dense nearly-equal-size polynomials. The simplest of these is
the Karatsuba algorithm\footnote{more complex versions are available
under the general concept of Toom algorithms}. Since these do not
require, as does the FFT, the computation of sine or cosine, or roots
of unity, these can be done directly in exact arithmetic, rather than
floating-point or using finite-field remaindering.
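As an illustration of the idea (a minimal sketch only, not a tuned
implementation; the helper {\tt padd} and the cutoff of 16 are our
choices), Karatsuba on dense coefficient vectors can be written around
{\tt ptimesA}:
\begin{verbatim}
(defun padd (A B)                       ; dense vector addition
  (let ((ans (make-array (max (length A) (length B)) :initial-element 0)))
    (dotimes (i (length A)) (incf (aref ans i) (aref A i)))
    (dotimes (i (length B)) (incf (aref ans i) (aref B i)))
    ans))

(defun karat (A B)
  (let ((n (min (length A) (length B))))
    (if (< n 16)                        ; small: use the classical method
        (if (zerop n) #() (ptimesA A B))
        (let* ((m (floor n 2))
               (a0 (subseq A 0 m)) (a1 (subseq A m))
               (b0 (subseq B 0 m)) (b1 (subseq B m))
               (z0 (karat a0 b0))       ; low*low
               (z2 (karat a1 b1))       ; high*high
               (z1 (karat (padd a0 a1) (padd b0 b1)))) ; (a0+a1)(b0+b1)
          ;; answer = z0 + x^m*(z1-z0-z2) + x^(2m)*z2
          (let ((ans (make-array (+ (length A) (length B) -1)
                                 :initial-element 0)))
            (dotimes (i (length z0)) (incf (aref ans i) (aref z0 i)))
            (dotimes (i (length z1))
              (incf (aref ans (+ i m))
                    (- (aref z1 i)
                       (if (< i (length z0)) (aref z0 i) 0)
                       (if (< i (length z2)) (aref z2 i) 0))))
            (dotimes (i (length z2))
              (incf (aref ans (+ i (* 2 m))) (aref z2 i)))
            ans)))))
\end{verbatim}
This trades the $NM$ coefficient multiplications for roughly $N^{1.58}$
of them, at the cost of the extra additions and temporary vectors.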
Another observation is that since A1 and A2 are referenced
sequentially, we could leave them as lists, which brings us to this
solution.
\subsection{Dense arrays for internal multiplication, data as lists}
\begin{verbatim}
;; array2list same as above.
(defun ptimes (L1 L2)(array2list (ptimesA2 L1 L2)))
(defun ptimesA2(L1 L2)
(if (or (null L1)(null L2)) nil
(let ((ans (make-array (+ 1(caar (last L1))(caar (last L2))) :initial-element 0))
(e 0)(c 0))
(dolist (i L1 ans)
(setf e (car i) c (cdr i)) ;the compiler should do this for us
(dolist (j L2)
(incf (aref ans (+ e (car j)))
(* c (cdr j))))))))
\end{verbatim}
Note that there are only three programs needed here, the one-line {\tt
ptimes}, the 7-line {\tt ptimesA2}, and the 4-line {\tt
array2list}. These line counts are somewhat arbitrary since
line-breaks and indentation are not legislated, and the body of {\tt
ptimesA2} could just be stuffed into {\tt ptimes}, since it is called
only once. Anyway, it looks like 13 lines.
\subsection{Lists only, collect terms after}
In this version we just multiply everything and put the results into a
big list. This has some problems because we will, in typical cases,
have many terms with the same exponent that must be combined. That is,
for inputs of size $n$, we will produce $n^2$ terms, which may
collapse to size $n$ or even less if some coefficients are zero.
Nevertheless, the program is fairly small, and looks like what we had above.
\begin{verbatim}
(defun ptimes (L1 L2)(compress-terms (sort (ptimesL L1 L2) #'< :key #'car)))
(defun ptimesL(L1 L2)
(if (or (null L1)(null L2)) nil
(let ((ans nil))
(dolist (i L1 ans)
(dolist (j L2)
(push (cons(+ (car i)(car j)) (* (cdr i)(cdr j))) ans))))))
;; here's the hard part.
(defun compress-terms(L)
(cond ((null L) L)
((eql (cdar L) 0) (compress-terms (cdr L)))
((null(cdr L)) L)
((eql (caar L)(caadr L)) ;same exponent for 2 terms
(setf (cdadr L)(+ (cdar L)(cdadr L))); add coefs in place
(compress-terms (cdr L)))
(t (cons (car L)(compress-terms (cdr L))))))
;; optional data abstraction to explain compress-terms
(defun exponent-of(H) (car H))
(defun coefficient-of(H) (cdr H))
;; thus (coefficient-of(first L)) turns out to be (cdr(car L)) or (cdar L)
;; and (exponent-of (second L)) turns out to be (car(cadr L)) or (caadr L)
;; and (coeffient-of (second L)) turns out to be (cdr(cadr L)) or (cdadr L)
(setf w '((0 . 1)(1 . 1) (2 . 1) (3 . 1)(4 . 1)(5 . 1)))
;;(ptimes w w);; compress-terms reduces 36 terms to 11 in the answer, which is
;;((0 . 1) (1 . 2) (2 . 3) (3 . 4) (4 . 5) (5 . 6) (6 . 5) (7 . 4)
;; (8 . 3) (9 . 2) (10 . 1))
\end{verbatim}
The program above is 15 lines, and in some examples can be very fast,
basically when {\tt sort} and {\tt compress-terms} do not have to do much. It avoids
the problem with the dense array, an approach that allocates space
that is proportional to the {\em degree} of the answer, rather than
the number of {\em non-zero terms}. This program is the first of the
programs that is able to, for example, multiply $x^{100000}+1$ by
itself in negligible time. A somewhat longer but more efficient
version of this program interleaves the compression with the generation
of terms; we can also produce terms in order so that sorting is not necessary,
though not without its own costs.
\subsection{Hash tables instead of arrays}
Here we assume that each polynomial is a list but we should accumulate
the result in a hash-table.
\begin{verbatim}
(defun ptimesLH(r s)
  (let ((ans (make-hash-table)))
    (dolist (i r ans)
      (let ((co (cdr i)) (ex (car i)))
        (dolist (j s)
          (incf (gethash (+ ex (car j)) ans 0) (* co (cdr j))))))))
\end{verbatim}
An alternative model supported below is that each polynomial is {\em
already converted to a hash-table} prior to multiplication, and the
result {\em can remain in the form of a hash-table}. In the slot
indexed by exponent {\tt ex} there is the value {\tt co}, the coefficient.
\begin{verbatim}
(defun ptimesH(r s)
(let ((ans (make-hash-table)))
(maphash #'(lambda(ex co) ; exponent, coefficient
(maphash #'(lambda (ex2 co2)
(incf (gethash (+ ex ex2) ans 0) (* co co2))) r))
s) ans))
\end{verbatim}
That program is 6 lines. We could leave all of our polynomials in
hash-table form for internal consumption, but it is perhaps unfair
given our initial task statement. The Common Lisp default output
format for printing a hash-table leaves us somewhat in the dark as to
its constituent elements when it responds with an answer like
\begin{verbatim}
#.
\end{verbatim}
We could either change the output format in the printing program, or
explicitly add programs to convert from list to hash-table:
\begin{verbatim}
(defun list2hash(k) ;;k is a list of pairs, Return a hash table
(let ((ans (make-hash-table)))
(dolist (i k ans)(setf (gethash (car i) ans) (cdr i)))))
(defun hash2list(r) ;; r is a hash table
(let ((ans nil))
(maphash #'(lambda(key val)(unless (= 0 val)(push (cons key val) ans))) r)
(sort ans #'< :key #'car)))
(defun ptimes(L1 L2)(hash2list(ptimesH(list2hash L1)(list2hash L2))))
\end{verbatim}
This version adds 8 lines, for a total of 14. The nice part about this
program is that it scales up to enormous numbers of terms and
arbitrary degrees, using memory proportional to the number of terms,
not the degree. If the number of terms is about equal to the degree,
the overhead of hash-tables makes this version costlier than using arrays.
\subsection{Hashing with GMP arithmetic}
Let's elaborate on the program as stated in the previous section, and
replace the coefficient arithmetic with GMP arithmetic. GMP, the GNU
Multiple Precision arithmetic library, is free software that can be loaded into
Lisp\footnote{usually; details follow.}
and provides various facilities for arbitrary precision
arithmetic, duplicating but also extending the repertoire of
arithmetic, while at the same time providing substantial additional
efficiency. GMP uses architecture-dependent routines, some written in assembly
language, to speed up processing, and (unless the Lisp system already
uses GMP!) provides a boost in speed when the numbers exceed a few
word-lengths.
The line {\tt (incf (gethash (+ ex ex2) ans 0) (* co co2))}
shown in the program above needs to be changed to something more elaborate.
\begin{verbatim}
(setf (gethash (+ ex ex2) ans)
      (mp+ (gethash (+ ex ex2) ans (create_mpz_zero)) (mp* co co2)))
\end{verbatim}
We also need to change this program slightly:
\begin{verbatim}
(defun list2hash(k)
(let ((ans (make-hash-table :test 'eql))) ; ANSI CL tests: eq, eql, equal, equalp
(dolist (i k ans)(setf (gethash (car i) ans) (lisp2gmp (cdr i))))))
;; and a corresponding use of gmp2lisp in hash2list.
\end{verbatim}
This is not entirely as good as can be done, at least with our handle
on the GMP routines, which includes a routine for in-place replacement
of a numeric value, that is, a ``merged'' operation that executes
$x:=x+a*b$. In particular, the GMP operation
\verb| (mpz_addmul x a b)| will update a value $x$ (which might be an
entry in a hash-table). This saves on instructions executed as well
as memory allocation: temporary memory can immediately be re-used, at
least when it is truly temporary and not part of an answer
coefficient. The program below has an additional change in the code
so the hash code is computed only once, though at the cost of storing
more, namely {\tt (coef . exp)}, in the hash-table. This trick can be
used with the earlier hash multiplier as well, if it appears that
computing the hash code is a major component of the cost\footnote {In
our tests this is trivial if the exponent which is being hashed is an
integer. If the exponent is instead a vector of exponents as might be
used for a multivariate polynomial, the encoding can represent a
substantial improvement.} The basic program has gotten longer
though. We assume that the exponents are relatively small: we could
declare them to fit in a single word (whose size would depend on the
architecture; somewhat less than 32 or 64 bits) to generate faster code; it
is certainly unlikely to be a good idea to use GMP numbers for the exponents.
\begin{verbatim}
(defun ptimesg(r s);; multiply 2 hash tables with GMP
(let ((ans (make-hash-table :test 'eq )))
(maphash
#'(lambda(ex val);; index is exponent, val is (coef . exponent)
(maphash
#'(lambda (ex2 val2)
(let* ((expon (+ ex ex2)) (spot (gethash expon ans)))
(cond (spot (mpz_addmul (car spot) (car val)(car val2)))
(t (setf spot (cons (create_mpz_zero) expon))
(mpz_mul (car spot)(car val)(car val2))
(setf (gethash expon ans) spot) ))))r)) s) ans))
\end{verbatim}
Here the internal {\tt ptimes} program is now 10 lines long, up from
6. Is it faster? Certainly it is, if the underlying Lisp has a
typical arbitrary precision package (that is, not nearly as fast as
GMP) and some of the coefficients are large enough (say over 30 decimal
digits). Because ``foreign function'' interfaces (FFI) are not
currently part of the ANSI standard for Common Lisp, and not every
Lisp conforms to a proposed standard ``universal'' FFI or UFFI, the details
for attaching the GMP library to Lisp still depend on the Lisp
implementation. The version we use is in the generic library
available online,
{\verb|http://www.cs.berkeley.edu/~fateman/generic|}. Its specific
details are aimed at Allegro Common Lisp.
We have left unspecified the implementation of a hash-table, using the
default provided by our Lisp system. As is the case for almost any
general procedure, it is possible that this could be implemented in a
more specialized fashion, and by eliminating unused generality, made
faster. For example, we know that the keys are the term exponents:
non-negative integers of a specified maximum size. The Lisp hash-table
system allows for the keys to be any Lisp expression. As a kind
of specialization of a hash table for integers, we wrote out a
``radix-tree'' structure implementation where each level in the tree
represents a fixed number of bits in the index (the exponent),
and the exponents are not stored explicitly, but can be inferred
by position in the tree. In one experiment
involving a polynomial with 5,151 terms, squared to a polynomial of
20,301 terms, the radix tree representation used slightly less memory
than the hash-table (each node had up to 256 entries), and the radix tree was
about 4X faster. For smaller less-dense problems the times are closer.
% another experiment, square size 1771 to get 12,341 term result..
%\end{document}
The radix tree is, in some sense, an intermediate position between an
array and a tree, since a large enough node allocation would result in
all the terms in a polynomial fitting in just one dense node. In that
case the terms would be indexed as though they were in an array! In
the case of a sufficiently-sparse and widely-scattered collection of
exponents (keys), the storage would effectively be proportional to the
number of keys (times the node-size). Large portions of the (empty
parts) of the polynomial might effectively use no storage even though
access time would be constant. Another variant of the radix tree
(which we suspect is novel) is to use different size nodes. We
programmed a system which has an arbitrary specification for the size
of nodes at different levels. Thus a top-level node of size 1024 will
fit all polynomials of degree less than 1024 into one node. Making the
next level size 32, and then size 8 for all subsequent levels means
that larger polynomials may be stored more sparsely, with polynomials
of degree up to 32,768 taking only 2 levels, and degree 262,144 only 3
levels. This data structure is essentially isomorphic to a hash table
of size 1024 with elements that can overflow into buckets of size 32,
which in turn can overflow into buckets of size 8 (etc.). The hash
function is simply extracting n bits from the right end of the key via
a logical AND; the key used for the rest of the search is that portion
remaining after the n bits are shifted off the right.
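The key split just described is essentially a one-liner; a sketch (with
our function name):
\begin{verbatim}
;; split a key for a node of 2^n entries: the low n bits index this
;; node, and the rest of the key continues the search in the subtree.
(defun split-key (key n)
  (values (logand key (1- (ash 1 n)))  ; index within this node
          (ash key (- n))))            ; remaining key
;; (split-key 302 5) => 14, 9   since 302 = 9*32 + 14
\end{verbatim}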
In the case of very sparse polynomials, we have modified the radix
tree notion and have special (terminal only) ``shortkey'' nodes with 2
entries: a displacement in the tree to the key, and the value. These
are effectively used only if there are no elements that would be in
the same regular node (nearby). That is, taking an empty tree and
inserting an element with a huge key does NOT initially populate all
the subtrees needed to reach that key; rather a shortkey node is
generated to store (essentially) the path to that node, and its value.
(Recall that in the usual case, the key is not explicitly stored: it
is implicit in the path to the node.)
Continuing on in this vein of trying out tree variations, we found
that a re-implementation using binary search trees was terrible (200X
slower) because the order of insertions turned out to be too nearly
one of the well-known exactly wrong orders, resulting in the building
of a very unbalanced tree. Countering this trend, we found that using
a balanced tree insertion scheme (AVL tree) was approximately the same
speed as the radix tree on the same example. We tried an additional
alternative using yet another data structure known as a skiplist,
finding that this too had no outstanding performance, and finally a
heap structure, which we discuss in the next section. Another
direction to pursue is an implementation of a data structure again
comparable in semantics to hash-tables or search trees for exact
match, but especially suitable for parallel processing. There are
several issues pertinent especially to Lisp and hash-tables that may
not appear in some other language contexts. Since Lisps may be (and
often are) hosts for multiprocessing, the atomicity of a hash-table
transaction must be assured if several processors are to be used for
this task. Furthermore, insertions into the hash-table may cause a
storage reallocation to grow the hash-table. If so, external pointers
from data or programs directly into the hash-table itself cannot be
used. Additionally, a garbage collection triggered (perhaps by
another Lisp process) can move the hash-table to another location in
heap-allocated storage. Any Lisp that hopes to communicate via
structures with C and FORTRAN must make it possible to ``lock'' in
memory locations, so one can, with some (non-ANSI CL) coding effort
lock down hash-tables, too. In a case where the hashing overhead
dominates the arithmetic, and where atomicity of transactions in a
parallel-processing environment needs to be handled by the user, it
may pay to write a fixed-in-memory hash-table package, perhaps
distributing the hash-table itself to various processors. We have not
tried to duplicate the hash-table facilities with these additional
constraints.
\subsection {Heaps}
One alternative for storage is the ``heap'' made famous by ``heapsort''.
In this approach we assemble the answer by extracting all
sub-products contributing to a particular exponent by bubbling them up
through an ordered heap and adding them together. We programmed this
approach, advocated by Monagan and Pearce \cite{monagan}. It was
close in speed to hashing on answers that were relatively
dense, but not as fast as a method especially good for dense results (simple arrays).
To its advantage: it was faster than hashing
and more conserving of storage on very sparse multiplications.
Our initial implementation was disappointing: Correspondence with
Michael Monagan and Roman Pearce on using a heap structure for
multiplication resulted in our developing a more polished
implementation of this basic idea. After finding that various
suggested programming tricks did not make much of a difference, we
found that the way to make this approach win was to change the nature
of the input data. That is, choosing the right benchmark
is the key: the inputs must be
extremely sparse, with the gap between exponents larger than the
number of terms (e.g. a 1,000-term polynomial with gaps of 5,000
between successive exponents.) This reflects the fact that the heap
method will not be advantageous for ``mostly dense'' results, where
hashing or a Karatsuba method or FFT, or just a carefully-coded tight
loop of ``naive'' multiplication may be faster.\footnote{
Somewhat orthogonally to this argument, if the coefficient arithmetic is
hugely expensive, and we can implement the FFT (or one of the other
asymptotically improved algorithms) in spite of this complication,
we can reduce the number of such operations, and save time this way.}
Here is the argument: In a very sparse multiplication of $N$ by $M$
terms there can be $N\times M$ terms in the answer. If each of these
is established in a separate entry in a hash-table, that hash-table
will have grown to substantial size, and to very little benefit in
return for the cost of hashing: no merging of coefficients with equal
keys will happen, and furthermore, the results will not be sorted (if
that matters). Compare this to the heap implementation where the size
of the heap, by contrast, will be proportional to the smaller of $N$
and $M$. As the coefficients in sub-products are generated, the
components of the appropriate exponent will be pushed out of the heap,
to be examined, added together, and then stored away, never to be
processed again. Monagan and Pearce assert an advantage that this
enables all bignum arithmetic to be done essentially with one
accumulator: they accumulate the indices of the coefficients in
chains, on the heap, and then only later perform the arithmetic. In
our implementation storing such chains does not seem to be
particularly advantageous; we just add the terms as they appear on the
top of the heap. And indeed, for large super-sparse polynomials, this
heap idea is faster.
There are (at least) two ways of maintaining the heap as needed. The size
of the heap is equal to the number of terms of the smaller polynomial for
most of the computation. As terms are exhausted, the heap shrinks. Inserting
into the heap can be done from the top (as the top-most term is removed, place
a successor in its place and percolate it downward), or from the bottom (after
removing the top-most term and reforming the heap, insert the new term at the
bottom). Somewhat to our surprise, insertion from the bottom was usually faster.
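The ``insert from the top'' variant can be sketched as follows (a
min-heap of {\tt (exponent . coefficient)} pairs held in a vector is
assumed; this code is ours, not the implementation timed above):
\begin{verbatim}
;; after overwriting the root with a successor term, restore the heap
;; property by percolating the root downward.  heap holds n entries,
;; each an (exponent . coefficient) pair, ordered by exponent.
(defun sift-down (heap n)
  (let ((i 0))
    (loop
      (let ((l (+ (* 2 i) 1)) (r (+ (* 2 i) 2)) (m i))
        (when (and (< l n) (< (car (aref heap l)) (car (aref heap m))))
          (setf m l))
        (when (and (< r n) (< (car (aref heap r)) (car (aref heap m))))
          (setf m r))
        (when (= m i) (return heap))   ; heap property restored
        (rotatef (aref heap i) (aref heap m))
        (setf i m)))))
\end{verbatim}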
\subsection{Encoding polynomials as long integers}
There are other methods that we (and presumably others) have implemented and tested.
It is possible to encode any polynomial $p(x)$ as a long integer by
(essentially) evaluating it for some value of $x$ that is sufficiently
large. Since we can round up the value $x$ to be a power of two, this
evaluation can be done by shifting and adding. This idea is sometimes
attributed to Kronecker. Decoding the polynomial can also be done simply by
shifting and masking.
Here is a program to compute the product of two polynomials $p$
and $q$:
\begin{verbatim}
(defun ptimesI(p q)
(let((logv (padpower2 p q))); compute the padding needed
(int2list (* (list2int q logv)(list2int p logv)) logv)))
\end{verbatim}
The details of the three programs (one to find the log of the bound,
plus the two conversions between list and integer) are given below. The
proof for the size of the padding has been provided in a previous
paper \cite{fateman}.
It is worth noting that the production of these, in general, rather
long integers, as well as the solitary multiplication, can be done in
ordinary Lisp, but unless the Lisp has very fast large-number arithmetic,
this is not a great idea. By using Lisp to call the GMP library,
we relieve the implementors of the Lisp of the burden of hacking together
a truly superior bignum facility.\footnote{Why would not every Lisp have
a superfast bignum facility? It appears that there are limited ``market'' forces at
work for this objective. The need for frequent revision in the
face of new hardware variants has been more easily met by the GMP project.}
We published such linkage programs
in Sept. 2003. See Appendix 2.
\begin{verbatim}
(defun list2int(p pad)
(let ((ans 0))
(dolist (j p ans)(incf ans (ash (cdr j)(* pad (car j)))))))
(defun int2list(i pad)
  ;; NOTE: exp is bound locally here (it was left undeclared);
  ;; WHILE is assumed available as a macro, e.g.
  ;; (defmacro while (test &rest body) `(loop (unless ,test (return)) ,@body))
  (let* ((ans nil)(q 0)(r 0)
         (exp 0)
         (mv (- pad))
         (vby2 (ash 1 (1- pad))) ;2^(v-1)
         (v (ash vby2 1))
         (mask (1- v))
         (signum (> i 0)))
    (setf i (abs i))
    (while (> i 0)
      (setf q (ash i mv))
      (setf r (logand i mask))
      (cond ((> r vby2)
             (push (cons exp (if signum (- r v)(- v r))) ans)
             (setf i (1+ q)))
            ((= r 0) (setf i q))
            (t (push (cons exp (if signum r (- r))) ans)
               (setf i q)))
      (incf exp))
    ans))
(defun maxcoef(p1)
;; accumulate pos and negs separately to avoid computing/storing abs.
;; compare at end and make positive.
(let ((pans 0)(mans 0) temp)
(dolist (i p1 (max pans (- mans)) )
(setf temp (cdr i))
(if (< temp 0)(setf mans (min mans temp))
(setf pans (max pans temp))))))
(defun padpower2 (p1 p2)
(* 2 (ceiling (log (*(maxcoef p1)(maxcoef p2)(1+
(min (caar (last p1))(caar (last p2))))) 2))))
\end{verbatim}
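As a self-contained illustration of the Kronecker encoding
(simplified here to non-negative coefficients, unlike {\tt int2list}
above, which handles signs; the names {\tt encode} and {\tt decode}
are our own), consider:

```lisp
;; Minimal Kronecker-style encoding demo for non-negative coefficients.
;; A polynomial is a list of (exponent . coefficient) pairs; PAD is the
;; number of bits reserved per coefficient.
(defun encode (p pad)
  (let ((ans 0))
    (dolist (j p ans)
      (incf ans (ash (cdr j) (* pad (car j)))))))  ; shift-and-add

(defun decode (i pad)
  (let ((mask (1- (ash 1 pad))) (e 0) (ans nil))
    (loop while (> i 0)
          do (let ((r (logand i mask)))            ; shift-and-mask
               (unless (zerop r) (push (cons e r) ans))
               (setf i (ash i (- pad)))
               (incf e)))
    (nreverse ans)))  ; ascending exponent order
```

Multiplying the two encoded integers and decoding recovers the product
polynomial, provided {\tt pad} is large enough that no coefficient of
the product overflows its field.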
\section{Sparse arrays}
Another kind of data structure that is easily built in ANSI standard
Common Lisp, and is, at least in some implementations, handled very
efficiently, is an encoding optimized toward sparse polynomials,
consisting of two arrays. In the exponent array of length N we have a
sequence of exponents, and for each exponent a coefficient in the
corresponding position in the coefficient array. Thus instead of {\tt ((10
. 20) (24 . 90))} we have the arrays {\tt [10 24]} and {\tt [20 90]}. Compiled Lisp code
can sequence through these arrays as fast as is done in other
languages; whether this is faster than sequencing through a linked
list may depend on hardware memory issues. The array version will
ordinarily save space by not having to store links. Another associated positive
consequence of the use of arrays is the greater likelihood that we
will have localized data in a memory cache when needed.
There is a particular programming convenience in using arrays that we can use.
An array can be more easily ``split'' into sections as compared to a linked list. For
example, two ``virtual'' halves can be designated with each half being
treated separately. The two pieces could perhaps be handed to
different processors, or just used in a kind of
divide-and-conquer algorithm. That is, we can multiply $A \times B$
by splitting $A$ into two sections $A_1$ and $A_2$, and computing
$A_1\times B + A_2\times B$. The addition is done by some kind of
merge. This splitting can be done on the subproblems too, splitting $B$,
$A_1$, $A_2$, each into smaller pieces. The idea is to repeat the
splitting until the lowest-level multiplications are small enough to
fit (entirely or mostly) in some targeted memory cache (L1, L2, L3).
In the case of such huge problems,
merging of results should, to the extent possible, avoid
examining in their entirety the potentially large coefficients; only
the exponents are needed for sorting. One way to do this is to have the
sparse array of coefficients consist of pointers to the coefficients,
each of which may be an arbitrary-precision number, stored elsewhere
than the sparse array\cite{yozo}. We attempted to repeat our
earlier studies using cache-measurement tools but were frustrated by
the mismatch of tools (PAPI and Windows, Intel's VTUNE and Lisp, on
our current hardware, a Pentium 4). We expect that at this point
in the technology and size of caches, any results would
be tedious to collect and hard to relate to other setups. Testing
as in ATLAS \cite{atlas} might be appropriate, should one try
to achieve the utmost efficiency
of an algorithm on a particular platform.
If memory is monolithic, the splitting into sections can be done by
adjusting array endpoints (a bookkeeping arrangement), easily done
with ``displaced arrays,'' a feature of ANSI Common Lisp.
Given an array {\tt z} of length {\tt 2h}, we can, in Common Lisp,
with negligible computation and no array storage allocation, rename
its two halves:
\begin{verbatim}
(setq lower (make-array h :displaced-to z))
(setq upper (make-array h :displaced-to z :displaced-index-offset h))
\end{verbatim}
To review: the scheme to take advantage of this splitting approach
is to multiply $A$ by $B$ by splitting one or the other into halves
until the pieces are each within some target size, multiplying, and
then adding all the pieces together in a tree merge where short
results are merged to create longer and longer sequences, culminating
in the final merge to produce an answer.
One program for this framework looks like this. (Details are
available in the program listing.)
\begin{verbatim}
(defun mul-sac(x y)
;; mul-Sparse-Array-Cache(x y)
;; x and y are sparse polys in arrays
(let* ((xexps (car x)) ; an array of exponents for x
(yexps (car y))) ; an array of exponents for y
(cond ((< (length xexps) *cachelim*)
(if (< (length yexps) *cachelim*)
;; both x and y are small enough
(mul-sparse-inside-cache x y);; something good for cache
;; x is small enough, y is not. split it.
(multiple-value-bind (low hi)
(splitpol y)
(add-sparse-array (mul-sac low x)(mul-sac hi x)))))
(t (multiple-value-bind (low hi)
(splitpol x)
(add-sparse-array (mul-sac low y)(mul-sac hi y)))))))
\end{verbatim}
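The helper {\tt splitpol} assumed above can be realized without
copying by using displaced arrays. The following is one plausible
sketch, under our own assumption that a sparse-array polynomial is a
pair (exponent-array . coefficient-array):

```lisp
;; One plausible SPLITPOL: split a sparse-array polynomial, taken here
;; to be a pair (exponent-array . coefficient-array), into two halves
;; with no copying, using displaced arrays.
(defun splitpol (p)
  (let* ((exps (car p)) (cofs (cdr p))
         (n (length exps)) (h (floor n 2)))
    (values
     ;; lower half: elements 0 .. h-1
     (cons (make-array h :displaced-to exps)
           (make-array h :displaced-to cofs))
     ;; upper half: elements h .. n-1
     (cons (make-array (- n h) :displaced-to exps :displaced-index-offset h)
           (make-array (- n h) :displaced-to cofs :displaced-index-offset h)))))
```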
Even if we are not trying to multiprocess, another reason exists for
splitting these polynomials: for execution speed, we would prefer
a limit on the largest exponent in a
polynomial, at least in intermediate calculations.
There is a speed advantage in knowing that the array of
exponents is an array of numbers that fit in a single word (that is,
an array of fixnums), rather than containing any multiple-word
``arbitrary precision'' number exponents. On the other hand, we do
not want to limit the exponent range in the abstract model. Therefore,
if the input polynomials have exponents that exceed the single-word
size, then the multiplication can be re-cast by segmenting the inputs
as $A_0+A_1x^{k_1}+ A_2x^{k_2}+\cdots+A_nx^{k_n}$ where the $k_i$ are
chosen so that two successive exponents are no further away than half
of the most-positive fixnum. The exponents in the polynomials
$\{A_i\}$ as well as the exponents in the product are necessarily
fixnums. The total polynomial multiplication can proceed by taking
two segmented polynomials using the data assumption of single-word
exponents, and piecewise multiplying them (fast), and then adding up the
results, suitably shifted by the $\{k_i\}$ (using arbitrary exponent sizes).
The $\{k_i\}$ are chosen so that for each $A_i$, the smallest
exponent is 0.
\section{What about multivariate polynomials?}
There are at least two different styles of extension to accommodate
multivariate polynomials. One extends the notion of coefficient
multiplication to coefficients which are themselves polynomials in
other variables. This requires a specification of order among the
variables, with the ``most main'' variable providing a kind of
scaffold upon which other polynomials hang. A mechanism must be
provided to order these variables so that the multiplication of (say)
$x+1$ by $y+1$ would result in $(x+1)y^1 + (x+1)y^0$ with $y$ the main
variable, or, as an alternative, $(y+1)x^1+(y+1)x^0$ with $x$ the main
variable. This {\em recursive} representation is used in several
computer algebra systems, including Macsyma's canonical rational
expression system. It is more efficient for some operations, in
particular division with respect to the main variable, since accessing
the highest-degree term is a constant-time operation in this
representation.
Another popular approach is to store a polynomial as a collection of
monomials where a monomial is a coefficient times the product of
powers of variables. Ordinarily we expect most powers will be rather
small, coming from terms that look like $x^2y^3$, so it may make sense
to pack the numbers 2 and 3 into fields in a single integer. This
``exponent vector'' integer can represent any of the anticipated
values of exponents packed together into a single sufficiently-long
integer. To do this effectively we need to bound each variable's
power so as to space them out in the bits of that integer. In effect
we encode all the variables as suitably high powers of a single
variable. For example in the expression $f=x+y+z+1$ let $x=t$,
$y=t^{100}$, and $z=t^{10000}$. The expression
$g=t^{10000}+t^{100}+t+1$ behaves in many ways the same as
$x+y+z+1$. In this case, the 20th power of each of $f$ and $g$ has 1771 terms,
and it is easy to map from one form to the other. Furthermore, $g$ has
a term $465585120t^{31005}$ which, after a moment's thought, must
correspond to a term in $f$ of $465585120x^5y^{10}z^3$. This kind of
encoding was used prominently in the ALPAK and ALTRAN systems
\cite{brown}, where the user is required to provide, in advance, the
limits of exponent range for the exponent layout. The ALTRAN system
provides a mechanism that cleverly includes guard bits between the
exponent fields to protect against overflow from one field to the
next.
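As a small sketch of such packing with power-of-two fields (a toy
layout of our own devising: 7 bits per exponent, three variables, no
guard bits; not the ALTRAN layout), consider:

```lisp
;; Toy illustration: pack the exponents of x, y, z into one integer
;; using BITS bits per field (so in effect y = x^(2^BITS), z = x^(2^(2*BITS))),
;; and unpack by shifting and masking.
(defun pack-exps (ex ey ez &optional (bits 7))
  (logior ex
          (ash ey bits)
          (ash ez (* 2 bits))))

(defun unpack-exps (packed &optional (bits 7))
  (let ((mask (1- (ash 1 bits))))
    (list (logand packed mask)                  ; exponent of x
          (logand (ash packed (- bits)) mask)   ; exponent of y
          (ash packed (* -2 bits)))))           ; exponent of z
```

Adding two packed exponent words corresponds to multiplying the two
monomials, which is exactly why the packing pays off, provided no
field overflows.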
We show a program which scans each of the polynomials and finds the
smallest power of two (rather than a power of 10, as used in the illustration
above) that suffices to separate the maximum exponents. Using a power
of two means that binary shifting and masking can encode and extract
the exponents rapidly. The program converts each of the multivariate
polynomials to a univariate polynomial, then to a hash-table, does the
multiplication, and converts the hash-table back to list form. We do
not need to protect against overflow of exponent fields since we have
computed a size that is sufficient. If we were to keep data in
hash-tables over many operations, we could emulate ALTRAN, require user
declarations, etc. In our simple case of packing exponents for a
single multiplication, this takes about 40 lines of non-comment code
in total. See Appendix 3.
A further extension of this approach would be to reduce the 1771
terms in our example above even further by packing all the terms into
a single huge integer, as suggested earlier.
Alternatively, if the hackery above seems too obscure, requiring too
much hassle to pre-compute bounds, a program that just jumps in to the
multiplication process can be done by using the exponents in a vector
or a list, associated with each monomial. This requires only a modest
change to our hash program. In practice this may be slow if run
literally as shown, since the exponent hashing is now done on a list
of integers rather than a single integer of the packed exponents.
Furthermore, we expect that the list takes much more
storage. Nevertheless, the program is 10 lines.
\begin{verbatim}
;; allow any number of powers associated with each coefficient.
;; the list of variables, e.g. x,y,z ... is implicit in the order of powers
(defun list2hash(k) ;; k looks like ((30 . (5 4 3))(21 . (9 7 0))) 30*x^5*y^4*z^3+ ..
(let ((ans (make-hash-table :test 'equal)))
(dolist (i k ans)(setf (gethash (cdr i) ans) (car i)))))
(defun ptimes(r s)
(let ((ans (make-hash-table :test 'equal)))
(maphash #'(lambda(ex co) ; exponent, coefficient-list
(maphash #'(lambda (ex2 co2)
(incf (gethash (mapcar #'+ ex ex2) ans 0)
(* co co2))) r)) s)
ans))
\end{verbatim}
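Since {\tt ptimes} returns a hash-table, one may also want the inverse
of {\tt list2hash}. A plausible helper (hypothetical; we did not show
one above, and note that {\tt maphash} traverses entries in no
particular order) is:

```lisp
;; A plausible inverse of LIST2HASH: collect the hash-table back into
;; the ((coefficient . (powers...)) ...) list form.  MAPHASH imposes no
;; ordering, so the result may need sorting afterward.
(defun hash2list (h)
  (let ((ans nil))
    (maphash #'(lambda (ex co) (push (cons co ex) ans)) h)
    ans))
```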
%% revised 11/21/08 RJF
\section{Can we make these programs faster?}
Yes. It would be pointless to compare these programs as-is to make an
informed choice as to which is ``the best'' without knowing the
context of the problems to be solved, profiling the behavior of the
different programs, compiling them with suitable levels of
``optimization'' or (lack of) error-checking, inserting type
declarations when possible. (In Lisp, all declarations are optional,
but they often are useful for some compilers, enabling them to produce
faster or more compact code.) On some computers and with some systems
it is possible to find parallel implementations of hash-tables and FFT
which could be used.
As a general note, for most of these programs it seems plausible to
write a collection of specific targeted programs for different
scenarios and then a general ``poly-algorithm'' program that first
tries to categorize the situation and then calls one of the specialty
routines. Unless you already know the situation, it is worthwhile to
do some diagnosis. That is, except for the very smallest inputs, you
can spend time ``linear in the size of the input'' or even ``linear in
the size of the output'' for pre- or post- processing that would help
choose the right special algorithm, or even coerce the input and
output into the forms needed by those routines. In spite of this
extra time, the net effect can be to have an algorithm that is only
slightly slower than the time for the best routine. We
can build a kind of expert system to choose the right algorithm and
mostly ignore the small costs for converting between integers and floats,
packing and unpacking arrays, converting between lists and arrays. These
are dominated by the subsequent multiplication. The diagnostic information that is
readily available includes some heuristic estimates for density
(comparing the degree and the number of terms), and computing some
norms to see if all the exponents and/or coefficients in the input and
output can be bounded by some convenient limit and packed in arrays.
(This is described below in some detail.)
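A runnable miniature of such a poly-algorithm follows. Here
{\tt sparse-p} is a crude stand-in for the diagnosis just described,
and the two multipliers stand in for the real specialty routines
(heap, FFT, and so on); all the names are our own.

```lisp
;; Miniature "poly-algorithm" dispatcher.  Polynomials are lists of
;; (exponent . coefficient) pairs in descending exponent order.
(defun hash-ptimes (p q)             ; general routine: hash accumulation
  (let ((tbl (make-hash-table)) (ans nil))
    (dolist (a p)
      (dolist (b q)
        (incf (gethash (+ (car a) (car b)) tbl 0) (* (cdr a) (cdr b)))))
    (maphash #'(lambda (e c) (unless (zerop c) (push (cons e c) ans))) tbl)
    (sort ans #'> :key #'car)))

(defun dense-ptimes (p q)            ; dense routine: array accumulation
  (let* ((d (+ (caar p) (caar q)))   ; degree of the product
         (a (make-array (1+ d) :initial-element 0))
         (ans nil))
    (dolist (i p)
      (dolist (j q)
        (incf (aref a (+ (car i) (car j))) (* (cdr i) (cdr j)))))
    (dotimes (e (1+ d) ans)          ; collect non-zero terms, descending
      (unless (zerop (aref a e)) (push (cons e (aref a e)) ans)))))

(defun sparse-p (f)                  ; crude density diagnostic
  (let ((span (- (caar f) (caar (last f)))))
    (and (> span 0) (< (/ (length f) span) 1/5))))

(defun ptimes-dispatch (p q)
  (if (or (sparse-p p) (sparse-p q))
      (hash-ptimes p q)              ; stand-in for a heap-based routine
      (dense-ptimes p q)))           ; stand-in for a dense-array routine
```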
In writing the detailed programs for these methods, as well as some
other methods we have ignored in the text above because they never
exhibited superior performance in our tests, we found it useful to
look at the following ideas.
\begin{itemize}
\item Declaring all known types, and constraining the problem domain
so that the types are known; writing separate versions with minor
alterations (e.g. ``fixnum-only coefficient'' version). Lisp compilers are
likely to produce better code if we promise that all {\em
exponents} are small integers in the ``fixnum'' range (depending on
the architecture (64- or 32- bit), this would be something less than
64 bits or 32 bits). Separately, it is advantageous if {\em
coefficients} are all fixnums, or all double-precision floats, or all
complex, rational, bytes, etc. Our multivariate encoding allows
several exponents to be packed into a single word, saving space and
time.
\item Even if the coefficients are not known in advance to be all
small, or densely arranged, it may be that in a particular problem
instance they are, and it is to our advantage to scan through them
so we can convert a calculation
over the ``arbitrary precision integers'' to a calculation in fixnums,
or (in our experiments, more advantageously) in double-floats. At
least for some 32-bit architectures, using double-floats can improve speed
while allowing for more bits for representing numbers precisely. If
the integers needed are exactly representable in the floating-point
fraction field, and operations preserve accuracy, then we can win with
this transformation (returning the result to integers by rounding if
need be).
There are easily computed bounds on the sizes of coefficients. Here's
a good one: Let maxnorm(p) be the largest coefficient in absolute
value in the polynomial p.
Let sumnorm(p) be the sum of absolute values of coefficients in p.
Then a bound on the coefs in the answer is min (sumnorm(p)*maxnorm(q),
sumnorm(q)*maxnorm(p)). This bound is achieved, for example, for
$x^n+x^{n-1}+ \cdots +1$ squared.
As we have written earlier, if the floating-point numbers are not
sufficiently precise, the calculation can be mapped to results in
(several) finite fields and matched up using the Chinese Remainder
Theorem (Garner's algorithm) to get higher precision results.
\footnote{see Appendix 4}
\item Parallel implementation of the FFT, perhaps using a program from
FFTW, \verb|http://www.fftw.org/|. (Not used by us)
\item Parallel implementation of maphash, get/put hash,
encoding/decoding multivariate polynomials. (Not used by us)
\item Packing multiple (short) coefficients in words. (Not used by us)
\item For polynomials in arrays, or perhaps even in lists, consider
blocking to improve cache performance \cite{yozo}. (Requires
hardware-specific tools and testing, not done for this paper.)
\item hash-tables are just containers. Other container designs with
similar operations can be used. These include binary search trees,
B-trees, AVL trees, heaps \cite{monagan,scj}, skiplists, etc. These
can, from one perspective be seen as efforts to solve the ``sorting
X+Y'' \cite{fredman} problem (practically) fast, by (asymptotically)
faster algorithms, or by using less storage, or storage in some
disciplined fashion that may improve memory cache performance. Efforts
to implement a program with an algorithmic complexity better than
$O(n)$, where $n=\mbox{size}(X)\times\mbox{size}(Y)$, seem to run into problems; $O(n
\log(n))$ is achievable in simple ways \cite{scj}. In view of today's
computer designs, improvements seem to depend on cache-performance
techniques. See \verb|http://maven.smith.edu/~orourke/TOPP/P41.html|.
% see appendix sortx+y.lisp
\item For sparse multivariate polynomials especially, or polynomials
which have been reduced to a sparse univariate polynomial by packing
exponents, we can consider another approach: We can compute the
density of a result heuristically (or indeed we can compute an exact
layout) by performing a kind of virtual multiplication. In effect, we
replace all the coefficients by 1, and redefine multiply and add
operations by binary logic. This produces a skeleton of the non-zero
coefficients in the product (though some of these coefficients may
turn out to be zero from cancellation when the real product is
computed). This skeleton can be used as the basis for various ideas
that might improve performance. If we use a hash-table, now the
maximum number of entries in the hash-table is known, as well as all
the hash keys. We could, if we wished, store at each hash location a
recipe for exactly which products to sum up to obtain the necessary
coefficient corresponding to that exponent. This would be expensive
in memory, and can be done in stages by using a heap; we did not,
however, find this to be as good as a hash-table.
\item Another indexing structure could be manufactured from this
information, including perhaps some distributed hash-table for
parallel processing. (All pre-packaged DHT facilities we found assume
the entries in the hash-table are more substantial, not just numbers,
so we don't expect they would be of direct use). Another thought,
specifically for multivariate polynomials, is that we could use this
information to go back to the original polynomials, and use modular
evaluation. Rather than conventional interpolation and the Chinese
Remainder Algorithm, we could use sparse interpolation\cite{zippel} to
compute the coefficients that are not zero. This technique works well
for greatest common divisor calculations, but it would be quite a
stretch to expect it to work fast for multiplications. (Not used)
\item A stream version of multiplication in which $P$ and $Q$ of size
$n\le m$ respectively, can be arranged, implementing the heap idea
but thinking about the computation functionally, rather than
with arrays. In effect, the program sets up $n$
computations, each of which can be represented by a functional stream:
as a practical matter, a
short record of information storing a state of a
computation which can be quickly picked up and run to get the next output
from that stream. If $P= p_0x^{e_0}+p_1x^{e_1}+\cdots$ and $Q=
q_0x^{f_0}+q_1x^{f_1}+\cdots$, then stream $0$ delivers the sparse exponent
and coefficient pairs $[e_0+f_0, p_0q_0]$, $[e_0+f_1, p_0q_1]$, etc.,
while stream $n-1$ delivers $[e_{n-1}+f_0, p_{n-1}q_0]$, etc. These
streams can be allocated on up to $n$ different processors. The main
multiplication is done by arranging pointers to these streams in a
heap so that the top of the heap has (one of) the lowest-valued streams judging
by the $e_i+f_j$ exponent of any of the streams. As items are removed
from the top of the heap, all equal exponents have their coefficients
added together and then sent out to the answer. Each time we use one
of the pairs from a stream we advance the stream and (if there is
another monomial on it) insert it back into the heap. Several tricks
can be used: the insertion into the heap can sometimes be done faster
``from the top'' rather than the usual ``from the bottom.'' Also it
may pay not to compute $p_i \times q_j$ ahead of time, but to defer
the multiplication until the accumulation is being done. This is
because the cost of computing the update $c:=c+p_i\times q_j$ may be
lower if it is done in a single ``multiply-add''. Monagan and Pearce
\cite{monagan} advocate storing a list of the $[i,j]$ pairs in this
activity and not doing {\em any} coefficient arithmetic until they are all
assembled; this allows them to execute the high-precision arithmetic
for the accumulation all at one time (per coefficient in the answer),
storing the result as part of the answer.
After considerable twiddling, the best stream multiplication / heap
program is about as good as the hash-table version for dense
polynomial answers (though neither is nearly as good as a dense
method) but heap is superior to hash for very sparse answers. A
common-sense reason for this is that in such a situation the
hash-table program is mostly wasting time allocating unique locations for each
exponent in the table, never to be re-visited except when repeatedly
enlarging the hash-table to accommodate the growing number of terms.
\end{itemize}
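The coefficient bound stated in the list above transcribes directly
into a few lines; {\tt sumnorm}, {\tt maxnorm}, and {\tt coef-bound}
are our own helper names. For $x^2+x+1$ squared the bound evaluates to
3, and it is attained.

```lisp
;; Bound on the coefficients of p*q:
;;   min(sumnorm(p)*maxnorm(q), sumnorm(q)*maxnorm(p))
;; where p, q are lists of (exponent . coefficient) pairs.
(defun sumnorm (p)
  (reduce #'+ p :key #'(lambda (m) (abs (cdr m)))))
(defun maxnorm (p)
  (reduce #'max p :key #'(lambda (m) (abs (cdr m)))))
(defun coef-bound (p q)
  (min (* (sumnorm p) (maxnorm q))
       (* (sumnorm q) (maxnorm p))))
```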
\section{Benchmarks}
Before running benchmarks, it is appropriate to see to what extent
each of the programs should be specialized for the particular class of
inputs being tested. In particular, it seemed, initially, pretty safe
to assume that, programming in Lisp, we could declare exponents to
be ``fixnums'' in ordinary polynomial cases (when do we have
exponents exceeding $2^{30}$?). Where would such polynomials occur?
On the other hand, if we are encoding multivariate polynomials (say in
30 variables) packing them all in one exponent, then the resulting
single exponent may be rather large.
We can write the program out twice, and apply the appropriate one
based on the size of the exponents in the particular instance.
A program where the coefficients are arbitrary Lisp numbers can be
specialized in several ways. Without further specification, a Lisp
number can be a limited-precision integer, an exact arbitrary
precision integer, a rational number, a single- or double-precision
floating point number, a complex number (with rational or float
components), or with some additional effort in our case, a
quad-precision float, or an arbitrary multiple-precision (MPFR) float,
or a GMP arbitrary-precision integer. The last of these duplicates
the built-in Lisp facility for integers, but using an external library
that could be both more efficient and more highly tuned to the
particular hardware. Given all these choices, the operations of ``+''
and ``*'' are subroutine calls whose execution time can be dominated
by type-dispatch decision-making rather than arithmetic. Declaring
that all coefficients are (say) double-float, changes the nature of
the generated code, and makes the benchmarking more elaborate: 10
kinds of coefficients, and within some classes of coefficients we
should consider the contents: e.g. arbitrary-precision shortish
numbers, and longish numbers.
For some of our tests we set up a generator for random polynomials.
The input was $n$, the integer number of distinct monomials, $m$ the
(maximum) integer space between successive exponents: $m=1$ means
completely dense. $m=100$ means that $n$ different times a random
number $k$ between 1 and 100 is generated, and if the previous
exponent was $e$, the next exponent is $e+k$. The third parameter $c$
is the coefficient size. For each of $n$ different monomials, the
coefficient is set to a random integer between 0 and $c-1$.
Setting $m=1$ gives dense polynomials. We can produce a sparse
polynomial with $n=100,~m=5000$, for example.
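A sketch of such a generator follows; here we draw coefficients from
$1$ to $c$, keeping them non-zero as the representation requires (a
slight tightening of the description above).

```lisp
;; Random polynomial generator: N terms; successive exponents spaced by
;; a random gap between 1 and M (M = 1 gives a fully dense polynomial);
;; coefficients drawn from 1..C.  Result is a list of
;; (exponent . coefficient) pairs in descending exponent order.
(defun random-poly (n m c)
  (let ((e -1) (ans nil))
    (dotimes (i n ans)
      (setf e (+ e 1 (random m)))          ; next exponent, gap 1..m
      (push (cons e (1+ (random c))) ans)))) ; non-zero coefficient 1..c
```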
Taking one of these random polynomials and squaring it provides a
dramatically different distribution of spaces between exponents. There
is a distinct clustering of monomial exponents in the ``central portion''
of the exponent range of the polynomial.
We also consider squaring polynomials along the lines that we have
proposed for testing in earlier papers, but also picked up by others:
$p = (1+x+y+z)^{20}$ converted to a univariate polynomial with $y=x^{41}$ and $z=x^{41^2}$,
$q = x^{20000}+1$,
$r = (x^{10}+2^{64})^{100}$,
$t = 1+x+\cdots+x^{1000}$.
If we can predict the sparsity of the answer, we can choose between
algorithms appropriate for dense and sparse results. That is, using a
direct naive method is good for anything that is small enough. If the
coefficients are small enough, methods compiled for declared types
that are single-precision integers ``fixnums'' are considerably
faster, even using an ``$n^2$'' method, carefully coded. For larger
coefficients and dense enough, especially where the inputs are of
about equal degree, problems can be done with a dense method such as
Karatsuba or (if the coefficient domain is suitable, or suitably
hacked, and the degree is large enough) the FFT. However, if only half
or a quarter of the coefficients are non-zero, or if the inputs have
widely different degrees, the FFT loses its advantage in our tests.
A good very sparse method for large size polynomials, not necessarily
of similar size, where the answer is expected to be sparse (meaning
the inputs are probably very very sparse), seems to be a carefully
implemented heap algorithm.
What measurement and computation on the inputs might work to predict
the sparsity of the answer, at least as far as needed to choose the
best among a set of existing algorithms?
It must be inexpensive to compute (no worse than linear in the size of the
polynomials to be multiplied together), and not be far wrong on large
sizes. (For small sizes it almost does not matter since all algorithms
are fast enough.)
Here's one. Define DegreeSpan of a polynomial $P$ as the difference
between the maximum monomial exponent and the minimum in $P$. This
equals the degree of the polynomial (one less than the number of
possible terms) when there is a non-zero constant term. Let $\#f$ be the
count of monomial terms in $f$.
Define the spacing $S(f)$ as the ratio $\#f/\mbox{DegreeSpan}(f)$.
Say that ``Sparse'' means $S < 0.2$. Note: $S=1$ means 100 percent dense.
$S=0.2$ means that of all the possible terms up to that given degree,
only $1/5$ of them have non-zero coefficients.
Alternatively: each coefficient present in the product is (on average)
the sum of 5 sub-products.
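The definitions above transcribe directly; note that a one-term
polynomial has DegreeSpan 0 and is not handled in this sketch.

```lisp
;; Density heuristic.  F is a list of (exponent . coefficient) pairs
;; in descending exponent order.
(defun degree-span (f)
  (- (caar f) (caar (last f))))   ; max exponent minus min exponent
(defun spacing (f)                ; S(f) = #f / DegreeSpan(f)
  (/ (length f) (degree-span f)))
(defun sparsep (f)                ; ``Sparse'' means S < 0.2
  (< (spacing f) 1/5))
```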
Consider polynomials $f,~g$ with random spacing between exponents.
For $n>1$, if the degree-span of either $f$ or $g$ exceeds $n\,\#f\times \#g$
then $f\times g$ usually has spacing $S < 1/n$. [Proof? We do not have a proof;
it is not even always true, just some test data.]
So in particular, if we use 0.2 as a cutoff, the degree-span of
$f$ or $g$ should exceed $5\,\#f\times \#g$.
Whether 0.2 is a critical cutoff point for anything depends on your
algorithms, your computer, your polynomials, your Lisp compiler.
In our experience precise benchmarks become mostly irrelevant as a
choice of language, compiler and implementation can change a balance
by a factor of up to 20; the choice of a hardware platform with different
size cache and perhaps alternative instruction sets can create substantial
changes as well. Memory sizes affect the benchmarks as well: some algorithms
use less memory; this becomes critical when they are compared to algorithms which
use slightly more memory: enough to require paging.
Versions of operating systems also affect testing data.
Substantial timing data, mostly reproducible on one machine, have been generated.
Because of the caveats above, we include only a very small selection; instead
the discussion below attempts to summarize.
Here is a summary of the algorithms; each takes two polynomials and returns their product;
the polynomials are pairs of arrays containing exponents and coefficients.
\begin{itemize}
\item [dense]
Allocates an array large enough for all the answer coefficients, fills with zeros.
Computes the product, then runs through the array counting non-zero coefficients.
Allocates two fixed-length arrays and stores exponents and coefficients in them, returning the pair.
\item [dense fixed coefs]
Same as dense but assumes coefficients are fixnums. Not really a good
assumption, but it makes a fairer comparison with the FFT, which assumes
that coefficients are exactly representable as double-floats.
\item [heap]
Allocates a heap the size of the smaller input; cycles all subproducts through the
heap collecting results into a stack. Rewrites the stack into two fixed-length arrays.
\item [naive]
Collects terms as generated into a sorted linked list. Rewrites into two fixed-length arrays.
\item[hashing]
Allocates a hash table which grows as required to be
large enough for all the answer coefficients. Computes the product,
then collects data (key, value) in the hash table into an array which
is then sorted and rewritten into two fixed-length arrays.
\item [fft]
Like dense, except allocates larger arrays suitable for FFT, and multiplies using a floating-point double-precision FFT. Rounds the results to integers and rewrites the non-zero terms
into two fixed-length arrays.
\item[radix tree]
Allocates a radix tree which grows as required to be
large enough for all the answer coefficients. Computes the product,
then collects data (key, value) in the tree into an array which
is then sorted and rewritten into two fixed-length arrays.
\item [karatsuba, toom]
Like dense, except multiplies by subdividing arrays into halves recursively (until the pieces are small enough that an $O(n^2)$ method is faster), using an asymptotically faster method. Then rewrites non-zero terms
into two fixed-length arrays.
\end{itemize}
\begin{verbatim}
polynomials q and r have 10,000 terms, spaced 50 apart, with coefficients
between 1 and 5. Time the computation of q*r.
Algorithm Time in seconds
radix-tree 42.7
naive fixed coefs 50.2
heap 64.2
other algorithms ran out of memory
polynomials q and r have 5,000 terms, and 1,000 terms respectively,
spaced 50 apart, with coefficients between 1 and 5.
dense-fixed coefs 0.09
dense 0.22
radix-tree 1.62
hashing 1.68
fft 2.08
heap 2.53
naive 4.03
polynomials q and r have 5,000 terms, spaced 500 apart, with coefficients
between 1 and 5.
dense-fixed coefs 0.20
dense 0.37
heap 3.37
radix-tree 8.4
hashing 8.1
naive 129.19
fft out of space
polynomials q and r have 5,000 terms, spaced 1 apart, with coefficients
between 1 and 5. (Fully dense inputs)
fft 0.02
dense-fixed coefs 0.03
dense 0.19
toom 0.22
karatsuba 0.50
radix-tree 0.67
hashing 0.81
heap 2.17
naive 2.76
polynomials q and r have 5,000 terms, spaced 10,000 apart, with coefficients
between 1 and 5. (very sparse)
heap 4.70
radix-tree 22.88 ;; 6.85 without sorting answer
hashing 22.88
naive 320.81
\end{verbatim}
%****
\section{Conclusions}
A really small program won't be the fastest, but given a selection of
a few, one of them will be fairly efficient. If we can find an easily
computed characterization of the problem (especially an estimate of
the density of the result), we can probably decide, in a particular
hardware/software situation, which of several polynomial
multiplication programs will be fastest, incorporating the cost of
sorting exponents. Automatic tuning can probably help with choosing
the cutoff parameter for spacing versus degree, as well as with
dividing large polynomials into sections to improve cache
performance\cite{yozo}, an exercise we do not pursue here. We expect
that the tradeoff will generally be between a simple array-based
algorithm (for nearly all problems) and heap-based multiplication for
extravagantly large and sparse cases.
Other choices, including asymptotically faster methods (such as dense
FFT or dense Karatsuba), may pay off, sometimes by a factor of 10. While
the FFT has a significant advantage in some circumstances, the
implementation we timed requires that approximate double-precision
floating-point coefficients be sufficient for accuracy.
Regarding the FFT, note that if the other programs are compiled to be
similarly type-restricted to fixed-size coefficients, and are also given
dense problems, the advantage of the FFT on problems that fit in
memory seems to be more like a factor of two. Given a substantially
sparse problem, the FFT- or Karatsuba-style algorithms are quite slow,
and may in fact run out of memory when other algorithms remain feasible,
because they use memory proportional to the degree, not to the number of
terms in the answer.
%****
A rude, but nevertheless plausible, question to ask is: what purpose
could possibly be served by multiplying polynomials of 10,000 or
more terms? We do not address it here, but merely remark that we have
discussed the question with other researchers who have written up
algorithms recently. They don't seem to care.
\section{About Lisp}
We have unapologetically provided Lisp programs without explaining the
details of the Lisp language usage. There are excellent books as well
as online guides to Lisp. While we do not claim these Lisp programs
are ``self-explanatory,'' each of them uses Lisp in normal (though
occasionally somewhat sophisticated) idiomatic style. If the reader
is unfamiliar with Lisp, perhaps the brevity of these programs
will be an incentive to gain familiarity.
{\section{Appendix 1: FFT multiplication}
In the spirit of providing explicit programs in ANSI Standard Common Lisp,
we provide a Lisp FFT program.
This program, which can easily be improved in various ways (as shown in
the generic arithmetic package from which it is extracted), can be run
with no changes in any supported floating-point precision, which
usually means single or double precision in Lisp. A ``generic''
version can be compiled from the same program using quad-doubles, from
which coefficients of 60 decimal digits or so can be
reconstructed. Furthermore, arbitrary other precisions can be set using MPFR or
other software packages, but these involve some subtleties, like the
need for computing $\pi$ to sufficient precision.
Details in {\verb|http://www.cs.berkeley.edu/~fateman/generic|}.
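As a rough heuristic (our own estimate, not a guarantee derived from the code below): if the inputs have at most $n$ nonzero terms with coefficients bounded in magnitude by $c$, the answer coefficients are bounded by $n c^2$, and rounding the double-precision FFT result to the nearest integer can be trusted only when the accumulated floating-point error stays below $1/2$, roughly when
$$n\,c^2 \cdot C\,\epsilon \log_2 N < \frac{1}{2}, \qquad \epsilon = 2^{-53},$$
for a modest constant $C$, where $N$ is the transform length.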
\begin{verbatim}
(defun ptimesfft(r s) ; r and s are arrays. Compute the size Z of the
;; answer; Z is a power of 2. Compute complex array r, increased to
;; size Z; compute complex array s, increased to size Z; compute two
;; FFTs, then multiply pointwise, then compute the inverse FFT * 1/n;
;; finally convert back to an array and return the answer.
(let* ((lr (length r))
(ls (length s))
(lans (+ lr ls -1))
(z (ash 1 (ceiling (log lans 2)))) ; round up to power of 2
(rfft (four1 (v2dfa r z) z))
(sfft (four1 (v2dfa s z) z)))
(dfa2v(four1 (prodarray rfft sfft z ) z :isign -1) lans)))
(defun v2dfa(a &optional (m (length a)) )
;; coerce a vector of numbers (of length at most m) to a double-float
;; array of length 2m, since here complex numbers are stored in 2
;; adjacent locations.
(let ((ans (make-array (* 2 m) :initial-element 0.0d0)))
(dotimes (i (length a) ans)
(setf (aref ans (* 2 i)) (coerce (aref a i) 'double-float)))))
(defun dfa2v(a &optional (m (/ (length a)2)))
;; Coerce real parts back to integers, more or less. a is an
;; array of even length. If you know that there are trailing zeros,
;; set the actual length with the optional second parameter m.
(let* ((k (/ (length a) 2))
(ans (make-array m)))
(dotimes (i m ans)
(setf (aref ans i)(round (aref a (* 2 i)) k)))))
;; a simple fft, in-place, not optimized
(defun four1 (data nn &key (isign 1))
(let ((wr 0.0d0) (wi 0.0d0)
(wpr 0.0d0) (wpi 0.0d0)
(wtemp 0.0d0) (theta 0.0d0)
(tempr 0.0d0) (tempi 0.0d0)
(j 1) (n (* 2 nn)) (m 0) (mmax 0) (istep 0))
(do ((i 1 (+ i 2)))
((> i n) t)
(when (> j i)
(setf tempr (aref data (1- j)))
(setf tempi (aref data j))
(setf (aref data (1- j)) (aref data (1- i)))
(setf (aref data j) (aref data i))
(setf (aref data (1- i)) tempr)
(setf (aref data i) tempi))
(setf m (floor (/ n 2)))
label1
(when (and (>= m 2) (> j m))
(setf j (- j m)) (setf m (floor (/ m 2)))
(go label1))
(setf j (+ j m))) ;end do
(setf mmax 2)
label2
(when (> n mmax)
(setf istep (* 2 mmax))
(setf theta (/ #.(* 2 pi) (* isign mmax)))
(setf wpr (* -2 (expt (sin (* 1/2 theta)) 2)))
(setf wpi (sin theta)) (setf wr 1.0d0) (setf wi 0.0d0)
(do ((m 1 (+ m 2)))
((> m mmax) t)
(do ((i m (+ i istep)))
((> i n) t)
(setf j (+ i mmax))
(setf tempr (- (* wr (aref data (1- j)))
(* wi (aref data j))))
(setf tempi (+ (* wr (aref data j))
(* wi (aref data (1- j)))))
(setf (aref data (1- j)) (- (aref data (1- i)) tempr))
(setf (aref data j) (- (aref data i) tempi))
(setf (aref data (1- i)) (+ (aref data (1- i)) tempr))
(setf (aref data i) (+ (aref data i) tempi)))
(setf wtemp wr)
(setf wr (+ (* wr wpr) (* -1 wi wpi) wr))
(setf wi (+ (* wi wpr) (* wtemp wpi) wi)))
(setf mmax istep)
(go label2))
(return data)))
(defun prodarray(r s len)
;; r and s are the same length arrays
;; compute, for i=0, 2, ..., len-2
;; ans[i]:= r[i]*s[i]-r[i+1]*s[i+1] ;; real part
;; ans[i+1]:=r[i]*s[i+1]+s[i]*r[i+1] ;; imag part
(let ((ans (make-array (* 2 len))))
(dotimes (i len ans)
(let* ((ind (* 2 i))
(ind1 (1+ ind))
(a (aref r ind))
(b (aref r ind1))
(c (aref s ind))
(d (aref s ind1)))
(setf (aref ans ind) (- (* a c)(* b d)))
(setf (aref ans ind1) (+ (* a d)(* b c)))))))
\end{verbatim}
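As a quick sanity check (our own example, assuming the functions above are loaded), one can verify $(1+2x+3x^2)^2 = 1+4x+10x^2+12x^3+9x^4$:
\begin{verbatim}
(ptimesfft #(1 2 3) #(1 2 3))  ; expect #(1 4 10 12 9)
\end{verbatim}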
}
{\section{Appendix 2: polynomials mapped to numbers}
\begin{verbatim}
;;; source for polysbyGMP.lisp September 2003
;;; Using arrays of gmps.
(defun tobigG(p logv)
(let ((res (create_mpz_zero)))
(do ((i (1- (length p)) (1- i)))
((< i 0) res)
(mpz_mul_2exp res res logv);; res <- res*2^logv
(mpz_add res res (togmp (aref p i))))))
(defun frombigG(i logv)
(let* ((ans nil)
(q (create_mpz_zero))
(r (create_mpz_zero))
(one (create_mpz_zero)) ; will be one
(vby2 (create_mpz_zero)) ; will be v/2
(v (create_mpz_zero)) ; will be v
(signum (>= (signum (aref i 1)) 0))); t if i non-neg
(mpz_set_si one 1) ; set to one
(mpz_mul_2exp vby2 one (1- logv)) ; set to v/2
(mpz_mul_2exp v one logv) ; set to v
(if signum nil (mpz_neg i i)) ; make i positive
(while (> (aref i 1) 0) ; will be 0 when i is zero
(setf r (create_mpz_zero))
(mpz_fdiv_r_2exp r i logv) ; set r to remainder
(mpz_fdiv_q_2exp q i logv) ; set q to quotient
(cond ((> (mpz_cmp r vby2) 0) ; it is a negative coef.
(if signum (mpz_sub r r v) (mpz_sub r v r))
(push r ans)
(mpz_add i q one) )
(t
(if signum nil (mpz_neg r r))
(push r ans)
(setf i q))))
(coerce (nreverse ans) 'vector)))
(defun padpower2g(p1 p2);; p1 and p2 are arrays of gmps
(+ (integer-length (min (length p1) (length p2)))
(mpz_sizeinbase(maxcoefg p1) 2)
(mpz_sizeinbase (maxcoefg p2) 2)))
(defun maxcoefg(p1)
;; assume coefs are gmp numbers, use gmp arith
;; accumulate pos and negs separately to avoid computing/storing abs.
;; compare at end and make positive.
(let ((pans (create_mpz_zero))(mans (create_mpz_zero)) temp)
(do ((i (1- (length p1)) (1- i)))
((< i 0)(mpz_neg mans mans) ; make minus ans into pos
(if (> (mpz_cmp mans pans) 0) mans pans)); return larger
(setf temp (aref p1 i))
(if (< (aref temp 1) 0)
(if (< (mpz_cmp mans temp)0) nil (setf mans temp))
(if (> (mpz_cmp pans temp) 0)nil (setf pans temp))))))
(defun polymultg(p q);; use GMP bignum stuff to multiply polys with gmp coefs
(let((logv (padpower2g p q)))
(frombigG (mp* (tobigG q logv)(tobigG p logv)) logv)))
(defun ppowerg(p n) (if (= n 1) p (polymultg p (ppowerg p (1- n)))))
\end{verbatim}}
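The idea of the mapping (Kronecker substitution) is visible in a tiny example. To multiply $p = 1+2x$ by $q = 3+4x$, pick $\log v = 7$, i.e.\ $v = 2^7 = 128$, comfortably larger than any coefficient that can appear in the answer. Encode $p \mapsto 1 + 2\cdot 128 = 257$ and $q \mapsto 3 + 4\cdot 128 = 515$. Then $257 \cdot 515 = 132355$, and rewriting that integer in radix 128 gives $3 + 10\cdot 128 + 8\cdot 128^2$, that is, $3 + 10x + 8x^2 = pq$.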
{\section{Appendix 3: Multivariate polynomials mapped to univariate}
\begin{verbatim}
;;; Simple code for multiplying multivariate polynomials by mapping to univariate
(defun max-expons-multivar(p)
;; p is a polynomial in the form below. Return a list of the max exponents
;; (#((5 4 3) (9 7 0)) . ;;exponents
;; #(30 21) ) ;;coefs
;; 30*x^5*y^4*z^3+ 21*x^9*y^7*z^0
(let ((h (map 'vector #'(lambda(z)0) (aref (car p) 0)))) ;zero out an array
(map nil #'(lambda(r) (setf h (map 'vector #'max h r)))(car p))
h))
(defun bound-expons-product(p1 p2)
;; returns a list of the maximum exponents that will occur in p1*p2, polynomials
(map 'list #'+ (max-expons-multivar p1)(max-expons-multivar p2)))
(defun sumlist(h n)(cond((null h) nil) ;; (sumlist '(a b c) 0) returns list (a a+b a+b+c).
(t (let((z(+ (car h) n)))
(cons z (sumlist (cdr h) z))))))
(defun mul-multivar-direct (p q)
;; use hash tables, multiply directly WITHOUT packing exponents
(let* ((ans (make-hash-table :test 'equal ))
(ep (car p)) (cp (cdr p))
(eq (car q)) (cq (cdr q))
(cpi 0) (epi 0))
(dotimes (i (length ep)
;; this answer is not in the right form.
;; needs to be sorted and converted to pair of arrays.
(l2pam
(sort
(loop for x being the hash-key in ans
and y being the hash-value in ans unless (= y 0)
collect (cons x y))
#'lexgp :key #'car)
(hash-table-count ans)))
(setf epi (aref ep i) cpi (aref cp i))
(dotimes (j (length eq))
(incf (gethash (map 'list #'+ epi(aref eq j)) ans 0)
(* cpi(aref cq j)))))))
(defun l2pam (z &optional (n (length z)))
;; list to pair of arrays, multivar
(let ((anse (make-array n))
(ansc (make-array n))
(exponsize (length (caar z)))
(j (1- n)))
(declare (fixnum j n exponsize))
(loop (if (< j 0) (return (cons anse ansc)))
(setf (aref anse j) (make-array exponsize :initial-contents (caar z)))
(setf (aref ansc j) (cdar z))
(decf j)
(pop z))))
(defun lexgp(a b)
(cond((null b) nil)
((> (car a)(car b)) t)
(t (and (= (car a)(car b))(lexgp (cdr a)(cdr b))))))
;; this version encodes multivariate into univariate, then multiplies
(defun mul-multivar(r s &optional (mulprog #'mul-naive))
;; multiply r and s; mulprog is the univariate mult prog to use
(multiple-value-bind (in out)(make-coders r s) ; set up the encoder and decoder
(let((prod
(funcall mulprog
(cons (map 'vector in (car r))(cdr r))
(cons (map 'vector in (car s))(cdr s)))))
(cons ;; the re-formed exponents
(map 'vector out (car prod)) (cdr prod)))))
(defun make-coders(p1 p2);; returns 2 programs for encoding and decoding
(let* ((z (nreverse(bound-expons-product p1 p2)))
(fields (mapcar #'integer-length z))
(sumfields (cons 0(sumlist fields 0)))
(maskfields (mapcar #'(lambda(r)(1- (ash 1 r))) fields)))
;; encoder vector to integer
(values
#'(lambda(v)
(reduce #'+ (map 'list #'ash (reverse v) sumfields)))
;; decoder integer to vector
#'(lambda(i)(let ((q (make-array 5 :adjustable t :fill-pointer 0)))
(do ((u fields (cdr u))
(masks maskfields (cdr masks)))
((null u) (nreverse q))
(vector-push-extend (boole boole-and (car masks) i) q)
(setf i (ash i (- (car u))))) )))))
\end{verbatim}}
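To illustrate the packing performed by {\tt make-coders} (our own worked example): suppose the bound on the product's exponent vectors, from {\tt bound-expons-product}, is $(9,7,3)$ for variables $(x,y,z)$. The field widths are the integer-lengths $4$, $3$, and $2$ bits, and since the list is reversed before the cumulative shifts are computed, an exponent vector $(e_x,e_y,e_z)$ is encoded as the single integer $e_z + 2^2 e_y + 2^{2+3} e_x$. Products of encoded monomials then add exponents field-by-field with no carries, since each field was sized for the largest exponent in the answer.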
\section{Appendix 4: Radix Tree multiplication}
{\begin{verbatim}
;;; A multi-way tree for storing sparse indexed items. Although I just
;;; made this up, it looks like it is basically around in the
;;; literature, sometimes called a radix tree or a crit bit tree. It
;;; is an alternative to a hash table or an array, and is, in some
;;; sense, a data structure that can be made more array-like or more
;;; tree-like. The index or key is always an integer, perhaps a
;;; bignum.
;;; The size of the nodes is parameterized: each internal node is a
;;; small array, say of length 8 = 2^logsize. An item is then
;;; stored in a node by decomposing the key into logsize = 3 bits at
;;; a time, one chunk per layer of the tree; thus 30-bit keys reach
;;; depth 10 at most. The initial sub-keys are the rightmost bits, and
;;; the remaining keys are computed by shifting the keys right by
;;; logsize at each level.
;;; The key is never in fact stored in the tree, but computed by position.
;;; There is one subtlety worth noting.
;;; Note that with logsize 3, we can store items with keys
;;; [0,1,...,7] in the topmost node. If we need to store an item with
;;; key 9, it has the same trailing 3 bits as 1. To accommodate both, we
;;; move the 1 down exactly one level. That is, the topmost node
;;; looks like [0,p,...,7] where p=[1,9,nil,nil,...,nil],
;;; so a key like 1 may be at the top level (or maybe one level down).
;;; CON: Keys must be integers or mapped to integers. Traversing the
;;; tree "in order" is tricky. Numerically adjacent keys will not be
;;; adjacent in the tree, except in the trivial case where all keys
;;; fit in one node. Nodes may be largely empty. A key with N bits
;;; will take about (N/logsize) probes.
;;; PRO: we need not know how long the longest key is, which would be
;;; the case if we were using a string-type decomposition starting
;;; from the left. Short keys will be located near the top of the
;;; tree. The keys are not stored, saving space. By increasing logsize
;;; we can make the performance closer to that of an array. We never
;;; compare keys; we only extract sub-keys and use them as indexes
;;; into arrays.
;;; author RJF June 13, 2008
;;; made faster, Oct 22, 2008, RJF
;;;; THIS SHOULD BE REPLACED ENTIRELY WITH CODE FROM Nov 21, 2008
(eval-when (:compile-toplevel :load-toplevel :execute)
(declaim (optimize (speed 3)(safety 1)))
;; (defconstant logsize 5) ; intermediate node will have 2^logsize slots
(defconstant logsize 3) ; best for debugging, not bad for running some tests.
)
(defmacro init-rtree()`(make-array #.(expt 2 logsize) :initial-element 0))
(defmacro leafp(x) `(numberp ,x))
(defmacro nodep(x) `(arrayp ,x))
(defun update-rtree(key val tree &aux node)
;; we use this for multiplication or addition of polynomials
;; but we could do anything with the update operation below
;; Declarations for speed.
(declare (optimize (speed 3)(safety 0))
(type (simple-array t (#.(expt 2 logsize))) tree))
(cond ((fixnump key)(update-rtree-fix key val tree))
;;above, optimized for fixnum key
((< key #.(expt 2 logsize))
;; we have found the level, or level-1 for the key.
(cond ((nodep (setf node (aref tree key)))
;; if a node here, must descend one level further in tree
(update-rtree-fix 0 val node)); and update that node's location 0.
(t(setf (aref tree key) (+ node val)))));; put it here.
;;The key has too many bits to insert at this level.
;;Compute the subkey and separate the rest of the key by masking and shifting.
(t (let ((ind(logand key #.(1- (expt 2 logsize)))))
;; mask off the relevant bits for subkey
(declare (fixnum ind))
(setf node (aref tree ind))
(cond ((arrayp node)
(update-rtree (ash key #.(- logsize)) val node))
;;descend into tree with rest of key.
(t (setf (aref tree ind)
(update-rtree
(ash key #.(- logsize)) val
(if (leafp node)
(update-rtree-fix 0 node (init-rtree))
(init-rtree)))))))))
tree)
(defun query-rtree(key tree) ;works. though we don't use it..
(declare (optimize (speed 3)(safety 0))
(type (simple-array t (#.(expt 2 logsize))) tree))
(cond ((< key #.(expt 2 logsize))
(if (arrayp (aref tree key)) (query-rtree 0 (aref tree key))
(aref tree key)))
(t (let* ((ind(logand key #.(1- (expt 2 logsize))))
(h (aref tree ind)))
(if h (query-rtree (ash key #.(- logsize)) h) nil)))))
(defun maptree (fn mt);; order appears jumbled, actually reversed-radix order
;; apply fn, a function of 2 arguments, to key and val, for each entry in tree.
;; order is "radix reversed"
(declare (optimize (speed 3)(safety 0)))
(labels
((tt1 (path displace mt);; path = path to the key's location
(cond ((null mt) nil)
((leafp mt) (funcall fn path mt));; function applied to key and val.
(t;; must be an array
(do ((i 0 (1+ i)))
((= i #.(expt 2 logsize)))
(declare (fixnum i))
(tt1 (+ (ash i displace) path) (+ displace logsize)(aref mt i)))))))
(tt1 0 0 mt)))
#| example
(setf mt (init-rtree))
(dotimes (i (expt 8 3))(update-rtree i i mt))
mt ;;; a fully populated b-rtree with value n at location n, 0<= n < 8^3
(ordertree-nz mt)
|#
(defun mul-multivar-rtree(r s)
(declare(optimize (speed 3)(safety 0)))
(multiple-value-bind (in out)(make-coders r s)
(labels ((incode (z)(mapcar #'(lambda(h)(cons (car h)(funcall in (cdr h)))) z)))
(setf r (ordertree-nz (mul-rtree (incode r) (incode s))))
(dolist (h r r)(setf (car h)(funcall out (car h)))))))
(defun A2PA(A) ;; array of (exp . coef) to 2 arrays.
(let* ((LA (length A))
(V nil)
(anse (make-array LA))
(ansc (make-array LA)))
(dotimes (i LA (cons anse ansc))
(setf V (aref A i))
(setf (aref anse i) (car V))
(setf (aref ansc i) (cdr V)))))
(defun mul-rtree (r s) ;multiply two polynomials, in arrays
(declare (optimize (speed 3)(safety 0)))
(let* ((er (car r)) (es (car s))
(cr (cdr r)) (cs (cdr s))
(cri 0) (eri 0)
(ler (length er))
(les (length es))
(ans (init-rtree)) )
(declare (type (simple-array integer (*)) cr cs er es)
(fixnum ler les)) ;maybe some others are fixnums too..
;; do the NXM multiplies into the answer tree
(dotimes (i ler (A2PA (ordertree-nz ans)))
(declare (fixnum i))
(setf cri (aref cr i) eri(aref er i))
(dotimes (j les)
(declare (fixnum j))
(update-rtree (+ eri(aref es j)) (* cri(aref cs j)) ans)))))
(defun mul-rtree-nosort (r s);multiply two polynomials, in arrays, result in tree
(declare (optimize (speed 3)(safety 0)))
(let* ((er (car r)) (es (car s))
(cr (cdr r)) (cs (cdr s))
(cri 0) (eri 0)
(ler (length er))
(les (length es))
(ans (init-rtree)) )
(declare (type (simple-array integer (*)) cr cs er es)
(fixnum ler les)) ;maybe some others are fixnums too..
;; do the NXM multiplies into the answer tree
(dotimes (i ler ans ;(L2PA (ordertree-nz ans))
)
(declare (fixnum i))
(setf cri (aref cr i) eri(aref er i))
(dotimes (j les)
(declare (fixnum j))
(update-rtree (+ eri(aref es j)) (* cri(aref cs j)) ans)
))))
;;; The program update-rt1 below is a more general update function.
;;; First we show how we could use it for multiplication or addition.
(defun update-rtree+ (key val tree) ;; same as update-rtree, using rt1
(update-rt1 key val tree #'(lambda(val oldval)(+ val (or oldval 0)))))
(defun update-rt1(key val tree fn)
(labels((update-rtree
(key val tree)
;; we use this for multiplication or addition of polynomials
;; but we could do anything with the update operation below
;; carefully, perhaps overly, declared for speed.
(declare (optimize (speed 3)(safety 0))
(type (simple-array t (#.(expt 2 logsize))) tree))
(cond ((fixnump key)(update-rt1-fix key val tree fn))
((< key #.(expt 2 logsize)) ; we have found the level, or level-1 for the key.
(cond ((nodep (aref tree (the fixnum key))); if a node here, must descend one level further in tree
(update-rtree 0 val (aref tree (the fixnum key)))); and update that node's location 0.
(t
;; we have a leaf node to update.
(setf (aref tree (the fixnum key))
(funcall fn val (aref tree (the fixnum key)))))))
;;The key is still too big. Compute the subkey and the rest of the key by masking and shifting.
(t (let* ((ind(logand key #.(1- (expt 2 logsize)))); mask off the relevant bits for subkey
(h (aref tree ind)))
(declare (type (simple-array t (#.(expt 2 logsize))) tree)(fixnum ind))
(cond ((arrayp h)
(update-rtree (ash key #.(- logsize)) val h));descend into tree with rest of key.
((leafp h) ; leaf, not link. We make a subtree and move the leaf down one level.
(setf (aref tree ind)
(update-rtree (ash key #.(- logsize)) val (update-rtree 0 h (init-rtree)))))
;; h is nil means the node was empty. Just make a new subtree and insert.
(t (setf (aref tree ind)(update-rtree (ash key #.(- logsize)) val (init-rtree))))))))
tree))
(update-rtree key val tree)))
(defun ordertree-nz(mt);; order the tree, removing zeros; sort after the whole tree is traversed
(declare (optimize (speed 3)(safety 0)))
(let ((ans (make-array 10 :adjustable t :fill-pointer 0)))
(declare (type (array t (*)) ans))
(labels
((tt1 (path displace mt);; path = path to the key's location
(cond ((null mt) nil)
;; line below changed to return nil if mt is zero
((leafp mt)(unless (zerop mt)
(vector-push-extend (cons path mt) ans)))
(t;; mt must be an array
(do ((i #.(1- (expt 2 logsize)) (1- i)))
((< i 0) ans)
(declare (fixnum i) (type (array t (*)) mt))
(tt1 (+ (ash i displace) path) (+ displace logsize)(aref mt i)))))))
(sort (tt1 0 0 mt) #'< :key #'car))))
(defun update-rtree-fix(key val tree &aux node) ;; optimization for key being fixnum; minor hacks
(declare (optimize (speed 3)(safety 0))
(type (simple-array t (#.(expt 2 logsize))) tree)
(fixnum key))
(cond ((< key #.(expt 2 logsize)) ; we have found the level, or level-1 for the key.
(cond ((nodep (setf node (aref tree key))); if a node here, must descend one level further in tree
(update-rtree-fix 0 val node)); and update that node's location 0.
(t(setf (aref tree key) (+ node val)))));; put it here.
(t (let ((ind(logand key #.(1- (expt 2 logsize))))); mask off the relevant bits for subkey
(declare (fixnum ind))
(setf node (aref tree ind))
(cond ((arrayp node)
(update-rtree-fix (ash key #.(- logsize)) val node));descend into tree with rest of key.
(t (setf (aref tree ind)
(update-rtree-fix (ash key #.(- logsize)) val
(if (leafp node)
(update-rtree-fix 0 node (init-rtree))
(init-rtree)))))))))
tree)
(defun update-rt1-fix(key val tree fn)
(labels((update-rtree;;internally, use fixnum keys
(key val tree)
(declare (optimize (speed 3)(safety 0))
(type (simple-array t (#.(expt 2 logsize))) tree)
(fixnum key))
(cond ((< key #.(expt 2 logsize)); we have found the level, or level-1 for the key.
(cond ((nodep (aref tree (the fixnum key))); if a node here, must descend one level further in tree
(update-rtree 0 val (aref tree (the fixnum key)))); and update that node's location 0.
(t
(setf (aref tree (the fixnum key))
(funcall fn val (aref tree (the fixnum key)))))))
(t (let* ((ind(logand key #.(1- (expt 2 logsize)))); mask off the relevant bits for subkey
(h (aref tree ind)))
(declare (type (simple-array t (#.(expt 2 logsize))) tree)(fixnum ind))
(cond ((arrayp h)
(update-rtree (ash key #.(- logsize)) val h));descend into tree with rest of key.
((leafp h) ; leaf, not link. We make a subtree and move the leaf down one level.
(setf (aref tree ind)
(update-rtree (ash key #.(- logsize)) val (update-rtree 0 h (init-rtree)))))
(t (setf (aref tree ind)(update-rtree (ash key #.(- logsize)) val (init-rtree))))))))
tree))
(update-rtree key val tree)))
\end{verbatim}}
\section{Appendix 5: Chinese Remainder Algorithm}
{How does this work? Let $N = n_0 n_1 \cdots n_k$. Then the Chinese
Remainder Theorem tells us that we can represent any number $x$ in the
range $-(N-1)/2$ to $+(N-1)/2$ by its residues modulo $n_0,~ n_1,~ n_2,~ \ldots,~ n_k$.
Conversion to Normal Form (Garner's Alg.)
Converting to normal rep. takes $k^2$ steps.
Beforehand, compute
inverse of $n_1$ mod $n_0$, inverse of $n_2$ mod $n_0n_1$, and also the products $n_0n_1$, etc.
Aside: how to compute these inverses: Extended Euclidean Algorithm.
Input: x as a list of residues $\{u_i\}$: $u_i = x$ mod $n_i$
Output: x as an integer in $[-(N-1)/2,(N-1)/2]$. (Other possibilities
include x in another range, also x as a rational fraction!)
Consider the mixed radix representation
$$x = v_0 + v_1 n_0 + v_2 (n_0 n_1) + \cdots + v_k (n_0 \cdots n_{k-1}) \eqno(*)$$
if we find the $v_i$, we are done, after $k$ more mults.
$v_0 = u_0$, each being $x$ mod $n_0$.
Next, computing the remainder mod $n_1$ of each side of (*),
$u_1 = v_0 + v_1 n_0$ mod $n_1$, with everything else dropping out,
so $v_1 = (u_1-v_0)\, n_0^{-1}$, all mod $n_1$.
In general,
$v_k = ( u_k - [v_0+v_1 n_0 +\cdots+v_{k-1} (n_0\cdots n_{k-2}) ]) \cdot (n_0\cdots n_{k-1})^{-1}$ mod $n_k$.
Note that all the $v_k$ are small numbers and the products of $\{n_i\}$ are precomputed.
Cost: If we find all these values in sequence, we have $k^2$
multiplication operations, where $k$ is the number of primes needed. In
practice we pick the $k$ largest primes we can, subject to the other constraints
(e.g. we would like the arithmetic to be single-precision, and fast). If
we can do 31-bit signed arithmetic fast, and $L$ is the largest number
in the answer, $k$ is
about $2\log (L/2^{31})$.}
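A tiny worked instance (our own illustration): take $n_0 = 3$, $n_1 = 5$, and suppose $x$ has residues $u_0 = 1$, $u_1 = 2$ (in fact $x = 7$). Then $v_0 = u_0 = 1$, and since $n_0^{-1} \bmod n_1 = 2$ (because $3\cdot 2 \equiv 1 \bmod 5$), we get $v_1 = (u_1 - v_0)\cdot 2 \bmod 5 = 2$, so $x = v_0 + v_1 n_0 = 1 + 2\cdot 3 = 7$ as expected.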
\section{Appendix 6: Heap management}
Code is provided in the shortprog.tex file. It still needs cleaning up, as well as testing data.
\begin{thebibliography}{99}
{\bibitem{atlas} ATLAS, Automatically Tuned Linear Algebra Software,
{\tt http://math-atlas.sourceforge.net/}}
{\bibitem{brown} Brown, W.S. ``A language and system for symbolic algebra on a digital computer'',
in Kuo, F. F. and J. F. Kaiser (eds.), {\em System Analysis by Digital Computer}, John Wiley, New York, 1966 349--369 }
\bibitem{fateman} Richard Fateman. ``Comparing the speed of programs for sparse polynomial multiplication,''
{\em SIGSAM Bulletin} 37, 1 (March 2003).
\bibitem{fredman} Fredman, M. L. `` How good is the information
theory bound in sorting?'' {\em Theoret. Comput. Sci.}, 1:355-361,
1976.
\bibitem{scj} Johnson, S.C., ``Sparse Polynomial Arithmetic'',
{\em SIGSAM Bull.}, vol. 8, issue 3 (August 1974) 63--71.
\bibitem{moenck} Moenck, R. ``Practical Fast Polynomial Multiplication,'' Proc. SYMSAC '76, ACM, 1976.
\bibitem{monagan} Michael Monagan and Roman Pearce.
{``Polynomial Division Using Dynamic Arrays, Heaps, and Packed Exponent Vectors,''
in {\em Computer Algebra in Scientific Computing},
Lecture Notes in Computer Science vol.~4770, 2007, 295--315.}
\bibitem {yozo} Yozo Hida,
``Data Structures and Cache Behavior of Sparse Polynomial Multiplication,'' Class project CS282, UC Berkeley, May, 2002. {\verb|http://www.cs.berkeley.edu/~fateman/papers/yozocache.pdf|}
\bibitem{zippel}
Zippel, R.
``Interpolating polynomials from their values,''
{\em J. Symb. Comput.} 9, 3 (1990), 375--403.
\end{thebibliography}
\end{document}
(defun pdumps(r varlist) ;list to string output, various hacks included
(cond ((numberp r) r)
(t
(fix+- ;; fix up a+-43*b to a-43*b.
(subseq
(apply #'concatenate 'string
(map 'list #'(lambda(k)
(format nil "+~{~s~}"
;; do something later for negative coeffs, also -1.
(let ((h (apply #'append
(map 'list
#'(lambda(r s)
(cond ((= s 0) nil)
((= s 1) (list r))
(t (list '* r '^ s))
))
varlist (cdr k)))))
(cond ((= (car k) 1) h)
((= (car k) -1)(print h) (cons '- (if (eq (car h) '*)(cdr h) h)) )
(t(cons (car k) h))))))
(sort (copy-list r) #'lex> :key #'cdr))) 1))
)))
;; if you don't like s= "x+-45", do (fix+- s)
(defun lex> (a b)(cond ((null b) nil)((null a) t)
((= (car a)(car b))(lex> (cdr a)(cdr b)))
((> (car a)(car b)) t)
(t nil)))
;; (setf k '((630 14 74 3) (900 10 8 33) (441 18 77 0) (630 14 11 30)))
;; (pdumps k '(x y z))
;;(setf r '((1 0 1)(1 1 0))) ;x + y
;;(setf s '((1 0 1)(-1 1 0))) ;x - y
;;(pdumps (ptimes-multivar r r)'(x y)) ==> "x^2+2xy+y^2" .. could put "*" in there too..
(defun ptimes-univar-faster(r s) ;; store (coef) rather than coef in hash-table. Faster updating we hope
(declare(optimize (speed 3)(safety 0)))
(let ((ans (make-hash-table)))
(maphash #'(lambda(ex co) ; exponent, coefficient in a list.
(maphash #'(lambda (ex2 co2)
(let* ((newex (+ ex ex2)))
(incf (car (or (gethash newex ans nil)(setf (gethash newex ans) (list 0))) )
(* (car co)(car co2)))))r))s)
ans))
(defun list2hash-faster (k) ;; k looks like ((30 . (5 4 3))(21 . (9 7 0))) 30*x^5*y^4*z^3+ ..
(let ((ans (make-hash-table :test 'equal)))
(dolist (i k ans)(setf (gethash (cdr i) ans) (list (car i))))))
(defun ptimes-multivar-faster(r s)
(multiple-value-bind (in out)(make-coders r s)
(labels ((incode (z)(mapcar #'(lambda(h)(cons (car h)(funcall in (cdr h)))) z)))
(mapcar #'(lambda(h)(cons (car h)(funcall out (cdr h))))
(hash2list-faster (ptimes-univar-faster (list2hash-faster (incode r))(list2hash-faster (incode s)))))
)))
(defun hash2list-faster(h)
(let ((ans nil))
(maphash #'(lambda(key val)(unless (= 0 (car val)) (push (cons (car val) key) ans))) h)
ans))
(defun ppower(x n)(if (= n 1) x (ptimes-multivar-faster x (ppower x (1- n)))))
(defun fix+-(s) ;; take a string like "a b+c+-d+e+-ef" and change it to "a b+c-d+e-ef"
(do ((h (position #\+ s)(position #\+ s :start h)))
((null h) s)
(if (char= (elt s (incf h)) #\-) ;look for a following #\- ;string must not end with +
(setf s (delete #\+ s :start (1- h) :end h)))))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; sortx+y.lisp
(eval-when (compile load) (declaim (optimize (speed 3)(safety 0))))
;; sort X+Y various ways.
(defun make-rl(n d)
  ;; make random exponent list of length n with spacing k, 0<k<=d
  ;; [the body of this definition was garbled in this draft]
  (declare (ignore n d)))
(defun percolate-down (a hole x)
  ;; [header reconstructed; the original was garbled in this draft]
  "Private. Moves the HOLE down until it is in a location suitable for
holding X. Returns the new index of the HOLE."
  (declare (fixnum hole) (type (simple-array t (*)) a)
           (optimize (speed 3)(safety 0)))
  (do ((child (lesser-child a hole) (lesser-child a hole)))
      ((or (>= child fill-pointer) (hlessfun x (aref a child)))
       hole)
    (declare (fixnum child))
    (setf (aref a hole) (aref a child)
          hole child)))
(defun percolate-up (a hole x)
"Private. Moves the HOLE until it's in a location suitable for holding
X. Does not actually bind X to the HOLE. Returns the new
index of the HOLE. The hole itself percolates down; it's the X
that percolates up."
(declare (fixnum hole) (type (simple-array t (*)) a)
(optimize (speed 3)(safety 0)))
(setf (aref a 0) x)
(do ((i hole parent)
(parent (ash hole -1) (ash parent -1) ))
((not (hlessfun x (aref a parent))) i)
(declare (fixnum i parent))
(setf (aref a i) (aref a parent))))
(defun create-heap-ordered
(initial-contents)
;; used only if initial-contents is already sorted list.
;; important: initial-contents must have the first element duplicated.
;; e.g. (3 5 8) should be (3 3 5 8)
;; (format t "*")
"Initialize the indicated heap. If INITIAL-CONTENTS is a non-empty
list, the heap's contents are initialized to the values in that
list; they are assumed already ordered according to hlessfun. INITIAL-CONTENTS must
be a list or NIL."
(let* ((n (length initial-contents))
(heap
(make-array n
:initial-contents initial-contents
:element-type t)))
(setf fill-pointer n) ;global value for this heap
heap))
(defun heap-count (heap)
(declare (ignore heap))
(1- fill-pointer))
(defun heap-empty-p (heap)
(declare (ignore heap))
"Returns non-NIL if & only if the heap contains no items."
(= fill-pointer 1))
(defun heap-insert (a x)
"Insert a new element into the heap. Returns the heap." ;; rjf
;; Append a hole for the new element.
;; (vector-push-extend nil a)
;; assume enough room...
;; (vector-push nil a)
(declare (type (simple-array t (*)) a)(fixnum fill-pointer)
(optimize (speed 3)(safety 0)))
;;(format t ".")
(setf (aref a fill-pointer) nil)
(incf fill-pointer)
;; Move the hole from the end towards the front of the
;; queue until it is in the right position for the new
;; element.
(setf (aref a (percolate-up a (1- fill-pointer) x)) x))
(defun heap-remove-fast(a)
;; assumes non-empty!! if empty next answer is bogus.
;; answer after that is an error (fill pointer can't pop)
(declare(type (simple-array t(*)) a)(optimize (speed 3)(safety 0)))
;;(format t "-")
(let* ((x (aref a 1))
(last-object
(progn (decf fill-pointer)(aref a fill-pointer) ) ))
(setf (aref a (percolate-down a 1 last-object)) last-object)
x))
(defun lesser-child (a parent)
"Return the index of the lesser child. If there's one child,
return its index. If there are no children, return
(FILL-POINTER (HEAP-A HEAP))."
(declare (type (simple-array t (*)) a)
(optimize(speed 3)(safety 0)) (fixnum parent) )
(let* ((left (ash parent 1))
(right (1+ left))
(fp fill-pointer))
(declare (fixnum left fp right))
(cond ((>= left fp) fp)
((= right fp) left)
((hlessfun (aref a left) (aref a right)) left)
(t right))))
;;; --- end of heap stuff ---
;;; -*- Lisp -*-
;;; msortx.lisp multiplication of polyns based on copying coefs into arrays.
;;; this uses double-float coefficients
(eval-when (compile load) (declaim (optimize (speed 3)(safety 0))))
;; sort X+Y various ways.
(defun make-rac(n d)
;;make random list of length n with spacing k, with 0 ...
(defun splitpol(z) ;; take a polynomial and return its two halves
(let* ((h (length (car z)))
(half (ash h -1))
(rest (- h half)))
(declare (fixnum h half rest)(optimize (speed 3)(safety 0)))
(values
(cons (make-array half :displaced-to (car z))
(make-array half :displaced-to (cdr z)))
(cons (make-array rest :displaced-to (car z) :displaced-index-offset half)
(make-array rest :displaced-to (cdr z) :displaced-index-offset half)))))
(defun mular(x y);; x and y are sparse polys in arrays as produced by (make-racg n m)
(let* ((xexps (car x))
(xe 0)
(xc 0)
(xlen (length xexps)))
(declare (fixnum xlen)(optimize (speed 3)(safety 0)))
(cond ((= 0 xlen)
'(#() . #())) ; empty polynomial
((= 1 xlen) ; a monomial times a polynomial!
(setf xe (aref xexps 0))
(setf xc (aref (cdr x) 0))
(cons (map 'vector #'(lambda (r)(+ r xe)) (car y))
(map 'vector #'(lambda (r)(* r xc)) (cdr y))))
(t (multiple-value-bind (low hi)
(splitpol x)
(addar (mular low y)(mular hi y)))))))
;; this works. not particularly compact. or fast. mulhashg is much faster.
(defun addar(x y);; x and y are sparse polys in arrays as produced by (make-racg n m)
(cond
((= (length(car x)) 0) y)
((= (length(car y)) 0) x)
(t
(let*
((xexps (car x))
(xi 0)
(xlim (1- (length xexps)))
(xcoefs (cdr x))
(yexps (car y))
(yi 0)
(ylim (1- (length yexps)))
(ycoefs (cdr y))
(m (+ xlim ylim 2)) ; maximum possible size: (length x) + (length y) terms.
(zexps (make-array m :fill-pointer 0))
(zcoefs (make-array m :fill-pointer 0)))
(declare (fixnum xi yi ylim xlim)(optimize (speed 3)(safety 0)))
(loop
(cond
;; check first if we have run out of x or y,
((> xi xlim)
(if (> yi ylim) (return (cons zexps zcoefs))
(let()
(loop for i from yi to ylim do
(vector-push (aref yexps i) zexps)
(vector-push (aref ycoefs i) zcoefs))
(setf yi (1+ ylim)))))
((> yi ylim)
(let()
(loop for i from xi to xlim do
(vector-push (aref xexps i) zexps)
(vector-push (aref xcoefs i) zcoefs))
(setf xi (1+ xlim))))
((< (aref xexps xi)(aref yexps yi))
(vector-push (aref xexps xi) zexps)
(vector-push (aref xcoefs xi) zcoefs)
(incf xi))
((= (aref xexps xi)(aref yexps yi))
(vector-push (aref xexps xi) zexps)
(vector-push (+ (aref ycoefs yi) (aref xcoefs xi)) zcoefs)
(incf xi) (incf yi))
(t;;(> (aref xexps xi)(aref yexps yi))
(vector-push (aref yexps yi) zexps)
(vector-push (aref ycoefs yi) zcoefs)
(incf yi))))))))
(defun mulhashg (x y);; exponents and coefs may be arbitrary here.
(let ((ans (make-hash-table :test 'eql)) ;; eql is good enough for bignums too
(xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(xei 0)
(xco 0)
(zcoefs *zcoefsg*)
(zlim *zcoefs-limg*)
(zcounter 0)
(loc nil)
(xy 0))
(declare (type (simple-array t (*)) xexps yexps) ; any exponents
(type (simple-array t (*)) xcoefs ycoefs zcoefs) ; any coefs
(fixnum zcounter zlim)
(integer xei xy)
; (double-float xco)
(optimize (speed 3)(safety 0)))
(dotimes (i (length xexps)(sorthashg ans zcoefs))
(declare (fixnum i))
(setf xei (aref xexps i)) ;exponent for x
(setf xco (aref xcoefs i));coefficient corresponding to that exponent
(dotimes (j (length yexps))
(declare (fixnum j))
;; look in the hash-table at location for this exponent
(setf loc (gethash (setf xy (+ xei (aref yexps j))) ans))
(cond (loc
;; There is an entry in hash-table for this
;; exponent. Update corresponding array location. Note:
;; hash-table is NOT updated. this happens about n*m
;; times, where n=size(x), m=size(y).
(incf (aref zcoefs loc) ;; no number consing here either??
(* xco (aref ycoefs j))))
(t
;; if loc is nil, allocate a space in the answer array.
;; This happens relatively infrequently, about k times, where k=size(x*y).
;; Hash-Table is updated.
(setf (gethash xy ans) zcounter)
;; store the result in zcoefs array.
(setf (aref zcoefs zcounter) (* xco (aref ycoefs j))) ; no number consing??
(incf zcounter)
(cond ((= zcounter zlim)
;; if we reach this, we need more working space.
;; This is rarely used code. If global array is big enough, it is never used.
;; If global array is too small, it is made bigger.
(let* ((newsize (ash zlim 1)) ; double the size
(newarray (make-array newsize)))
(declare (fixnum newsize)
(type (simple-array t(*)) newarray))
;; (format t "~%zcoefs now size ~s" newsize)
(dotimes (m newsize)
(declare (fixnum m))
(setf (aref newarray m)(aref zcoefs m)))
(setf zcoefs newarray zlim newsize
*zcoefs-limg* zlim
*zcoefsg* zcoefs))))))))))
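Stripped of the reusable global array and the type declarations, the hashing strategy of mulhashg fits in a few lines. The sketch below (the name mul-hash-sketch is ours, and it works on lists of (exp . coef) pairs rather than the paired arrays) keys a hash table by exponent sum, accumulates coefficients, then removes zeros and sorts by exponent just as sorthashg does:

```lisp
;; Hedged sketch of the hashing idea; not the tuned code above.
(defun mul-hash-sketch (x y)          ; x, y: lists of (exp . coef)
  (let ((h (make-hash-table :test 'eql)))
    (dolist (xt x)
      (dolist (yt y)
        (incf (gethash (+ (car xt) (car yt)) h 0)  ; key = exponent sum
              (* (cdr xt) (cdr yt)))))
    (let ((ans nil))
      (maphash #'(lambda (e c)
                   (unless (= c 0) (push (cons e c) ans)))  ; normalize
               h)
      (sort ans #'< :key #'car))))
;; (mul-hash-sketch '((0 . 7) (1 . 9)) '((1 . 2))) => ((1 . 14) (2 . 18))
```

The production version avoids most of this consing by storing coefficients in a preallocated array and keeping only array indices in the hash table.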
(defun sorthashg(h ar) ;; utility for mulhashg
(let ((ans (make-array 10 :adjustable t :fill-pointer 0)))
(maphash #'(lambda (i v)
(setf v (aref ar v)); get the double-float, by indirection
(unless (= v 0.0d0) ;; normalize! remove 0 coefs
(vector-push-extend (cons i v) ans)))
h)
(breakup (sort ans #'< :key #'car))
))
(defun breakup (ar)
;; ar is an array of (exp . coef).
;; separate them into a pair: one array of exponents and one of coefficients.
(let* ((ll (length ar))
(exps (make-array ll))
(coefs (make-array ll))
(tt nil))
(dotimes (i ll (cons exps coefs))
(setf tt (aref ar i))
(setf (aref exps i) (car tt))
(setf (aref coefs i)(cdr tt)))))
(defun PA2L (PA)
;; Pair of Arrays to List, length
;; returns L a LIST of (exp . coef) of length n.
;; also returns n.
(let* ((exps (car PA))
(coefs (cdr PA))
(n (length exps)))
(declare (fixnum n))
(do ((i (1- n) (1- i))
(ans nil (cons (cons (aref exps i)(aref coefs i)) ans)))
((< i 0) (values ans n))
(declare (fixnum i)))))
(defun L2PA (L n)
;;List to Pair of Arrays
;; L is a LIST of (exp . coef) of length n.
;; create a pair: one array of exponents and one of coefficients.
(let* ((exps (make-array n))
(coefs (make-array n))
(count 0))
(declare (fixnum count))
(dolist (i L (cons exps coefs))
(setf (aref exps count) (car i))
(setf (aref coefs count)(cdr i))
(incf count))))
(defun addarL(r s);; r and s are sparse polys as Lists of (exp . coef).
;; return a similar list of their sum. destructive.
(declare (optimize (speed 1)(safety 3)(debug 3)))
(let((ans nil) (pt nil))
;;set initial link
(cond((null r)(return-from addarL s))
     ((null s)(return-from addarL r))
     ((< (caar r)(caar s))(setf ans r) (setf r (cdr r))) ;; which arg do we destroy?
     ((= (caar r)(caar s)) ;; equal leading exponents: sum into r's head
      (setf (cdar r)(+ (cdar r)(cdar s)))
      (setf ans r)(setf r (cdr r))(setf s (cdr s)))
     (t (setf ans s)(setf s (cdr s))))
(setf pt ans)
(loop
(cond((null r)(setf (cdr pt) s) (return ans))
((null s)(setf (cdr pt) r) (return ans))
((< (caar r)(caar s)) ;;key is car
(setf (cdr pt) r)
(setf r (cdr r)))
((= (caar r)(caar s))
; (format t "~% r=~s s=~s pt=~s" r s pt)
(setf (cdar r)(+ (cdar r)(cdar s))) ; sum coefficients into r's head, linked next
(setf (cdr pt) r)
(setf s (cdr s))
(setf r (cdr r)))
(t (setf (cdr pt) s)
(setf s (cdr s))))
;; (format t "~% r=~s s=~s pt=~s" r s pt)
(setf pt (cdr pt)))))
(defun mularL(x y);; x and y are sparse polys in arrays as produced by (make-racg n m)
;; but do the intermediate adds using lists.
;; this works but uses a lot of cons cells and is about 5X slower than mulhashg.
(let* ((xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(xe 0)
(xc 0)
(xlen (length xexps))
(ans nil)
(i (1- (length yexps))))
(declare (fixnum i xlen)(optimize (speed 3)(safety 0)))
(cond ((= 0 xlen)
'(#() . #())) ; empty polynomial
((= 1 xlen) ; a monomial times a polynomial!
(setf xe (aref xexps 0))
(setf xc (aref xcoefs 0))
;; probably just as fast to do this next line rather than loop
;;(map 'list #'(lambda (r s)(cons (+ r xe) (* s xc))) (car y)(cdr y))
(loop
(if (< i 0)(return t))
(push (cons (+ xe (aref yexps i)) (* xc (aref ycoefs i))) ans)
(decf i))
ans)
(t (multiple-value-bind (low hi)
(splitpol x)
(addarL (mularL low y)(mularL hi y)))))))
;; subdivide both polys but only to some limited depth.
;;
(defvar *cachelim* 10 )
(defun dosmallmult(x y) (ptimes x y)) ;; make sure result is right form.
(defun mularLd(x y);; x and y are sparse polys in arrays as produced by (make-racg n m)
;; but do the intermediate adds using lists.
(let* ((xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(xe 0)
(xc 0)
(xlen (length xexps))
(ans nil))
(declare (fixnum xlen)(optimize (speed 3)(safety 0)))
(cond ((= 0 xlen)
nil) ; empty polynomial
((= 1 xlen) ; a monomial times a polynomial!
(setf xe (aref xexps 0))
(setf xc (aref xcoefs 0))
(map 'list #'(lambda (r s)(cons (+ r xe) (* s xc))) yexps ycoefs))
((< xlen *cachelim*)(if (< (length yexps) *cachelim*)
;; both x and y are small enough
(mularg x y)
;; x is small enough, y is not. split it.
(multiple-value-bind (low hi)
(splitpol y)
(addarL (mularL low x)(mularL hi x)))))
;; x is not small enough.
(t (multiple-value-bind (low hi)
(splitpol x)
(addarL (mularL low y )(mularL hi y)))))))
(defun testsplit(n z)
(dotimes (i n)(splitpol z)))
(defun mularq(x y);; x and y are sparse polys in arrays as produced by (make-racg n m)
;; divide only down to cachelim
(let* ((xexps (car x))
(yexps (car y))
(xe 0)
(xc 0)
(xlen (length xexps)))
(declare (fixnum xlen)(optimize (speed 3)(safety 0)))
(cond ((= 0 xlen)
'(#() . #())) ; empty polynomial
((= 1 xlen) ; a monomial times a polynomial!
(setf xe (aref xexps 0))
(setf xc (aref (cdr x) 0))
(cons (map 'vector #'(lambda (r)(+ r xe)) (car y))
(map 'vector #'(lambda (r)(* r xc)) (cdr y))))
((< xlen *cachelim*)(if (< (length yexps) *cachelim*)
;; both x and y are small enough
(mulhashgg x y);; or something good for cache
;; x is small enough, y is not. split it.
(multiple-value-bind (low hi)
(splitpol y)
;;(format t "~%low=~s~%~%high=~s" low hi)
(addar (mularq low x)(mularq hi x)))))
(t (multiple-value-bind (low hi)
(splitpol x)
(addar (mularq low y)(mularq hi y)))))))
;; same as mulhashg but simple-array --> array. Slightly slower.
(defun mulhashgg (x y);; exponents and coefs may be arbitrary here.
(let ((ans (make-hash-table :test 'eql)) ;; eql is good enough for bignums too
(xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(xei 0)
(xco 0)
(zcoefs *zcoefsg*)
(zlim *zcoefs-limg*)
(zcounter 0)
(loc nil)
(xy 0))
(declare (type (array t (*)) xexps yexps) ; any exponents
(type (array t (*)) xcoefs ycoefs zcoefs) ; any coefs
(fixnum zcounter zlim)
(integer xei xy)
; (double-float xco)
(optimize (speed 3)(safety 0)))
(dotimes (i (length xexps)(sorthashg ans zcoefs))
(declare (fixnum i))
(setf xei (aref xexps i)) ;exponent for x
(setf xco (aref xcoefs i));coefficient corresponding to that exponent
(dotimes (j (length yexps))
(declare (fixnum j))
;; look in the hash-table at location for this exponent
(setf loc (gethash (setf xy (+ xei (aref yexps j))) ans))
(cond (loc
;; There is an entry in hash-table for this
;; exponent. Update corresponding array location. Note:
;; hash-table is NOT updated. this happens about n*m
;; times, where n=size(x), m=size(y).
(incf (aref zcoefs loc) ;; no number consing here either??
(* xco (aref ycoefs j))))
(t
;; if loc is nil, allocate a space in the answer array.
;; This happens relatively infrequently, about k times, where k=size(x*y).
;; Hash-Table is updated.
(setf (gethash xy ans) zcounter)
;; store the result in zcoefs array.
(setf (aref zcoefs zcounter) (* xco (aref ycoefs j))) ; no number consing??
(incf zcounter)
(cond ((= zcounter zlim)
;; if we reach this, we need more working space.
;; This is rarely used code. If global array is big enough, it is never used.
;; If global array is too small, it is made bigger.
(let* ((newsize (ash zlim 1)) ; double the size
(newarray (make-array newsize)))
(declare (fixnum newsize)
(type (array t(*)) newarray))
;; (format t "~%zcoefs now size ~s" newsize)
(dotimes (m newsize)
(declare (fixnum m))
(setf (aref newarray m)(aref zcoefs m)))
(setf zcoefs newarray zlim newsize
*zcoefs-limg* zlim
*zcoefsg* zcoefs))))))))))
;;end of listing
;;;; yet another heap-merge-multiplier, using a hacked together top-down heap by rjf
;;;
;;; multiplication of polyns based on sortx+y.lisp
;;;
;;; -*- Lisp -*-
;;; msort5.lisp multiplication of polyns based on
;;; sortx+y.lisp msort2-5 uses heaps implemented as lisp lists. seems
;;; like a loser, 4X slower than hash, somewhat slower than arrays.
;;; msort5 uses lisp lists to construct a heap, each internal node n
;;; is, in this design (here . (left . right)) where here is a leaf
;;; node, and left/right may either be nil or an internal node or a
;;; leaf node. A leaf node looks like a dotted pair consisting of
;;; (number . number). They represent an exponent . coefficient pair.
;;; An empty heap looks like this: (nil)
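The shapes just described are worth one concrete example (data invented for illustration):

```lisp
;; msort5 heap shapes, per the comment above:
;;   leaf node:     (exponent . coefficient)
;;   internal node: (here . (left . right)), where here is a leaf
;;   empty heap:    (nil)
(let* ((empty '(nil))
       (leaf '(5 . 9))                           ; the term 9x^5
       (node (cons '(2 . 4) (cons leaf nil))))   ; here=(2 . 4), left=leaf, right=nil
  (assert (null (car empty)))
  (assert (= (caar node) 2))                     ; exponent held at this node
  (assert (equal (cadr node) '(5 . 9))))         ; left child is the leaf
```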
(eval-when (compile load) (declaim (optimize (speed 3)(safety 0))))
;; sort X+Y various ways.
(defun make-rlc(n d)
;;make random list of length n with spacing k, with 0 ...
#|
(time (mulheap2 x y)) in msort4. x is 1000 terms, y is 500 terms
; cpu time (non-gc) 1,000 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 1,000 msec user, 0 msec system
; real time 1,047 msec
; space allocation:
; 2,005,505 cons cells, 0 other bytes, 0 static bytes
......
(time (mulheap2 x y)) in msort5
; cpu time (non-gc) 937 msec user, 0 msec system
; cpu time (gc) 16 msec user, 0 msec system
; cpu time (total) 953 msec user, 0 msec system
; real time 969 msec
; space allocation:
; 1,753,843 cons cells, 32 other bytes, 0 static bytes
.... still, faster and less space is hashing:
(time (mulhash x y))
; cpu time (non-gc) 172 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 172 msec user, 0 msec system
; real time 172 msec
; space allocation:
; 31,501 cons cells, 240,752 other bytes, 0 static bytes
... and using an array-based heap is better than list heap, not as fast as hash.
(time (mulheap x y))
; cpu time (non-gc) 656 msec user, 0 msec system
; cpu time (gc) 16 msec user, 0 msec system
; cpu time (total) 672 msec user, 0 msec system
; real time 688 msec
; space allocation:
; 505,515 cons cells, 2,184 other bytes, 0 static bytes
;;; inserted here... If the coefficients are doublefloats stored in a separate array, we
;;; have times like this:
;[2] cl-user(367): (time (mulhashe x y))
; cpu time (non-gc) 62 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 62 msec user, 0 msec system
; real time 63 msec
; space allocation:
; 4,513 cons cells, 388,856 other bytes, 0 static bytes
#((3 . 2.0d0) (4 . 5.0d0) (5 . 3.0d0) (6 . 2.0d0) (7 . 12.0d0)
|#
;;;;msortxd2. FASTEST HASH MULTIPLIER, set up for arrays of double-floats..
;;; -*- Lisp -*-
;;; msortx.lisp multiplication of polyns based on copying coefs into arrays.
;;; this uses double-float coefficients
(eval-when (compile load) (declaim (optimize (speed 3)(safety 0))))
;; sort X+Y various ways.
(defun make-rac(n d)
;;make random list of length n with spacing k, with 0 ...
;; construct length(x) streams, a_0*y, a_1*y, ....
(defun mularZ2(x y);; x and y are sparse polys in arrays as produced by (make-racg n m). return array
;; hack for heap management with a starter entry in ans
(let* ((xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(yexp0 (aref yexps 0))
(ycoef0 (aref ycoefs 0))
(ylim (length (car y)))
(ans (make-array (1+ (length xexps)))))
(declare (optimize (speed 3)(safety 0))(fixnum j)(type (simple-array t (*)) ans xexps xcoefs yexps ycoefs))
(setf (aref ans 0)(cons 0 (cons 0 #'(lambda() nil)))) ; blocker for heap management
(dotimes (i (length xexps) ans)
(setf (aref ans (1+ i))
(cons (+ yexp0 (aref xexps i)) ;; the lowest exponent
(cons (* (aref xcoefs i)ycoef0) ; its coefficient
(let ((index 0)
(xexpi (aref xexps i))
(xco (aref xcoefs i)))
#'(lambda()
(incf index)
(if (>= index ylim) nil
(values (+ xexpi (aref yexps index)) ; the exponent
(* xco (aref ycoefs index))) ; the coef
)))))))))
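Each element of the array mularZ2 returns is a small term stream: a cons (lowest-exponent . (coefficient . generator)), where calling the generator returns, as multiple values, the next exponent and coefficient of x_i times y, or NIL when the stream is exhausted. Here is a freestanding sketch of one such entry, for made-up data:

```lisp
;; One mularZ2-style stream entry for 3x^2 times y = 1 + 5x,
;; i.e. the term stream 3x^2, 15x^3.
(let ((entry
       (cons 2                 ; lowest exponent of the partial product
         (cons 3               ; its coefficient
           (let ((index 0) (yexps #(0 1)) (ycoefs #(1 5)))
             #'(lambda ()
                 (incf index)
                 (if (>= index 2) nil
                     (values (+ 2 (aref yexps index))        ; next exponent
                             (* 3 (aref ycoefs index))))))))))  ; next coef
  (assert (= (car entry) 2))
  (assert (= (cadr entry) 3))
  (multiple-value-bind (e c) (funcall (cddr entry))
    (assert (and (= e 3) (= c 15))))             ; second term: 15x^3
  (assert (null (funcall (cddr entry)))))        ; stream exhausted
```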
;; set up a context for heap array fill pointer..
(let ((fill-pointer 0))
#+ignore ;; maybe clearer what is going on
(defun percolate-down (aa hole xx fp)
"Private. Move the HOLE down until it's in a location suitable for X.
Return the new index of the hole."
(declare (fixnum hole fp)
(type (simple-array t (*)) aa)(optimize (speed 3)(safety 0)))
(do ((child (lesser-child aa hole fp) (lesser-child aa hole fp)))
((or (>= (the fixnum child) fp) (hlessfun xx (aref aa child)))
hole)
(declare (fixnum child))
(setf (aref aa hole) (aref aa child)
hole child)))
;; #+ignore;; works also, maybe a teensy bit faster.
(defun percolate-down (aa hole xx fp)
"Private. Move the HOLE down until it's in a location suitable for X.
Return the new index of the hole."
(let ((xxval (car xx)))
(declare (fixnum hole fp)(integer xxval)
(type (simple-array t (*)) aa)(optimize (speed 3)(safety 0)))
(do* ((child (lesser-child aa hole fp) (lesser-child aa hole fp))
(aaval (aref aa child)(aref aa child)))
((or (>= (the fixnum child) fp) (< xxval (the fixnum(car aaval))))
hole)
(declare (fixnum child))
(setf (aref aa hole) aaval hole child))))
(defun percolate-up (a hole xx)
"Private. Moves the HOLE until it's in a location suitable for holding
X. Does not actually bind X to the HOLE. Returns the new
index of the HOLE. The hole itself percolates down; it's the X
that percolates up."
(declare (type (simple-array t (*)) a)(fixnum hole)
(optimize (speed 3)(safety 0)))
(setf (aref a 0) xx)
(do ((i hole parent)
(parent (ash hole -1);;(floor (/ hole 2))
(ash parent -1);;(floor (/ parent 2))
))
;; potential to speed up line below by declaration if a, x are fixnum,
((not (hlessfun xx (aref a parent))) i)
(declare (fixnum i parent))
(setf (aref a i) (aref a parent))))
;; important: initial-contents must have the first element duplicated.
;; e.g. (3 5 8) should be (3 3 5 8)
(defun heap-insert (a xx fp)
"Insert a new element into the heap. Returns the heap.";; rjf
(declare (type (simple-array t (*)) a)(fixnum fill-pointer)
(optimize (speed 3)(safety 0)))
;; (format t ".")
(setf (aref a fp) nil)
(incf fill-pointer) ; the global one
;; Move the hole from the end towards the front of the
;; queue until it is in the right position for the new
;; element.
(setf (aref a (percolate-up a fp xx)) xx))
;;; here's a trick to try, if we believe Roman Pearce's comment:
;;; Instead of percolating down, see what is going to be inserted next,
;;; and try to insert it at the top. This saves a lot of time if it works.
(defun heap-remove(a)
;; assumes non-empty!! if empty next answer is bogus.
;; answer after that is an error (fill pointer can't pop)
;; (format t "-")
(declare(type (simple-array t(*)) a)(optimize (speed 3)(safety 0)))
(let* ((xx (aref a 1))
(fp (decf fill-pointer)) ;faster, fp is local now
(last-object (aref a fp)))
(declare (fixnum fp))
(setf (aref a (the fixnum (percolate-down a 1 last-object fp))) last-object)
xx))
#|
[2c] cl-user(931): (time (mulhashg xxx yyy))
; cpu time (non-gc) 94 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 94 msec user, 0 msec system
; real time 93 msec
; space allocation:
; 4,514 cons cells, 511,264 other bytes, 0 static bytes
(#(4 6 7 9 10 11 12 13 14 15 ...) . #(9 6 9 10 2 19 2 3 10 5 ...))
;;; another... with array answers..
[5] cl-user(1002): (time (mulstr3 xxx yyy))
; cpu time (non-gc) 219 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 219 msec user, 0 msec system
; real time 297 msec
; space allocation:
; 2,135 cons cells, 142,856 other bytes, 0 static bytes
(#(4 6 7 9 10 11 12 13 14 15 ...) . #(9 6 9 10 2 19 2 3 10 5 ...))
;; even better, not adjustable arrays, making sure 2nd arg is longer than first
;; notice storage is way down.
(time (mulstr3 x y))
; cpu time (non-gc) 235 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 235 msec user, 0 msec system
; real time 235 msec
; space allocation:
; 1,037 cons cells, 114,344 other bytes, 0 static bytes
(#(1 3 6 7 8 11 12 13 14 15 ...) . #(20 4 25 16 5 36 25 4 1 8 ...))
|#
;;
(defun mulstr3(x y) ;;works
(let* ((xl (length (car x)))
(yl (length (car y)))
(h (if (> (the fixnum xl)(the fixnum yl))(mularZ2 y x)(mularZ2 x y)));; initial heap
(ml (+ (the fixnum xl)(the fixnum yl)))
(r nil)
(anse (make-array ml :adjustable t :fill-pointer 0))
(ansc (make-array ml :adjustable t :fill-pointer 0))
(preve 0)
(prevc 0))
(declare (integer preve prevc ) (fixnum xl yl ml))
; (declare (special fill-pointer))
(setf fill-pointer (length h))
(loop while (not (heap-empty-p)) do
(setf r (heap-remove h)) ; take next term off heap
(cond ((= preve (car r)) ; is this same exponent as previous?
(incf prevc (cadr r))) ; if so, add coefficient.
(t (unless (= prevc 0) ; normalize: remove zero coeffs
(vector-push-extend prevc ansc)
(vector-push-extend preve anse))
(setf prevc (cadr r))
(setf preve (car r)))); initialize new coefficient
(multiple-value-bind (e1 c1);; advance the stream taken from heap
(funcall (cddr r))
(cond (e1 ; check: is there more on this stream--
(setf (car r) e1 (cadr r) c1); update the stream
; (format t "~% inserting r=~s" r)
(heap-insert h r fill-pointer ))
))) ;re-insert in the heap
(vector-push-extend prevc ansc) ;put in last results before exit
(vector-push-extend preve anse)
(cons anse ansc)))
;; well, this part works if update-heap works.
(defun mulstr6(x y);; hack to make heap management cuter.
(declare (optimize(speed 3)(safety 0)))
(let* ((hh (if (< (length (car x))(length (car y)))(mularZ2 x y)(mularZ2 y x)))
;; initial heap
(anse (make-array 10 :adjustable t :fill-pointer 0))
(ansc (make-array 10 :adjustable t :fill-pointer 0))
(preve 0)
(prevc 0))
(declare (integer preve prevc))
(setf fill-pointer (length hh))
(loop while (not (heap-empty-p)) do
(multiple-value-bind
(newe newc)
(update-heap hh)
(declare (integer newe newc))
; take next term off heap and insert its successor
(cond ((= preve newe) ; is this same exponent as previous?
(incf prevc newc)) ; if so, add coefficient.
(t (unless (= prevc 0); normalize: remove zero coeffs
(vector-push-extend prevc ansc)
(vector-push-extend preve anse))
(setf prevc newc)
(setf preve newe))); initialize new coefficient
))
(vector-push-extend prevc ansc) ;put in last results before exit
(vector-push-extend preve anse)
(cons anse ansc)))
;; WORKS
(defun update-heap (h)
;;return top of heap h, which is necessarily not empty.
;; take successor in stream of top of heap, and insert it in heap.
(declare(type (simple-array t(*)) h)(optimize (speed 3)(safety 0)))
(let* ((top (aref h 1))
(left (aref h 2))
(right (aref h 3))
(lefti (car left))
(righti (car right))
(rprog (cddr top))
(ta 0)(tb 0)
;(t1 nil)
(t2 nil))
; (format t "~%top= ~s ~%left=~s ~%right=~s"top left right)
(multiple-value-bind (newe newc)
(funcall (the compiled-function rprog))
;; case, newe <= lefti and newe <= righti. put it at the top.
;; avoiding all heap programs
;; top of heap is location 1 (ignore 0)
;; left branch is location 2
;; right branch is location 3
(cond((null newe)
; (format t "#")
(setf ta (car top) tb (cadr top))
(heap-remove h)
(values ta tb))
((and ;(setf t1 )
(<= newe lefti)
(setf t2 (<= newe righti)))
;; insert this node at the top of the heap
;; and return it.
;; fill-pointer remains the same.
; (format t "^")
(setf ta (car top)tb (cadr top))
(setf (car top) newe (cadr top) newc)
(values ta tb))
))
))
;; the idea here is to burp out one exponent-coefficient pair per call.
;; it works.
(defun mulstr5(x y) ;; try to make a stream output
(let* ((xl (length (car x)))
(yl (length (car y)))
(h (if (> (the fixnum xl)(the fixnum yl))(mularZ2 y x)(mularZ2 x y)));; initial heap
(r nil)
(anse 0)
(ansc 0)
(preve 0)
(prevc 0))
(setf fill-pointer (length h))
; (format t "~%0h=~s" h)
#'(lambda() (prog nil
again
(loop
(cond ((heap-empty-p)(setf anse preve ansc prevc)
(setf preve nil prevc nil)
(return t))) ;exit from loop
;; collect all terms with exponent preve
;; peek at next item
(cond ((not (equal preve (car (aref h 1))))
(setf anse preve ansc prevc)
(setf r (heap-remove h))
(setf preve (car r) prevc (cadr r))
(bubble-up h r)
(return t))
(t ; next exponent equals preve
(setf r (heap-remove h))
(incf prevc (cadr r))
(bubble-up h r)))) ;keep looping
(if (equal ansc 0)(go again) ; don't return terms with 0 coef
(return (values anse ansc)))))))
(defun bubble-up(hh r)
(declare (optimize (speed 3)(safety 0)))
(multiple-value-bind (e1 c1);; advance the stream taken from heap
(funcall (the compiled-function (cddr r)))
(cond (e1 ; check: is there more on this stream--
(setf (car r) e1 (cadr r) c1) ; update the stream
(heap-insert hh r fill-pointer))
)))
) ;; end of let
(defun muls (x y)(let ((k (mulstr5 x y))
(anse nil)
(ansc nil))
(loop (multiple-value-bind (e c) (funcall k)
(cond (e (push e anse)(push c ansc))
(t (return (cons (nreverse anse)(nreverse ansc)))))))))
;; about as fast as muls, but returns answer as pair of arrays.
(defun mula (x y)(let ((k (mulstr5 x y))
(anse (make-array 10 :adjustable t :fill-pointer 0))
(ansc (make-array 10 :adjustable t :fill-pointer 0)))
(loop (multiple-value-bind (e c) (funcall k)
(cond (e (vector-push-extend e anse)
(vector-push-extend c ansc))
(t (return (cons anse ansc))))))))
;; for comparison
(defparameter *zcoefs-limg* 50)
(defparameter *zcoefsg* (make-array *zcoefs-limg*))
(defun mulhashg (x y);; exponents and coefs may be arbitrary here.
(let ((ans (make-hash-table :test 'eql)) ;; eql is good enough for bignums too
(xexps (car x))
(xcoefs (cdr x))
(yexps (car y))
(ycoefs (cdr y))
(xei 0)
(xco 0)
(zcoefs *zcoefsg*)
(zlim *zcoefs-limg*)
(zcounter 0)
(loc nil)
(xy 0))
(declare (type (simple-array t (*)) xexps yexps) ; any exponents
(type (simple-array t (*)) xcoefs ycoefs zcoefs) ; any coefs
(fixnum zcounter zlim)
(integer xei xy)
; (double-float xco)
(optimize (speed 3)(safety 0)))
(dotimes (i (length xexps)(sorthashc ans zcoefs))
(declare (fixnum i))
(setf xei (aref xexps i)) ;exponent for x
(setf xco (aref xcoefs i));coefficient corresponding to that exponent
(dotimes (j (length yexps))
(declare (fixnum j))
;; look in the hash-table at location for this exponent
(setf loc (gethash (setf xy (+ xei (aref yexps j))) ans))
(cond (loc
;; There is an entry in hash-table for this
;; exponent. Update corresponding array location. Note:
;; hash-table is NOT updated. this happens about n*m
;; times, where n=size(x), m=size(y).
(incf (aref zcoefs loc) ;; no number consing here either??
(* xco (aref ycoefs j))))
(t
;; if loc is nil, allocate a space in the answer array.
;; This happens relatively infrequently, about k times, where k=size(x*y).
;; Hash-Table is updated.
(setf (gethash xy ans) zcounter)
;; store the result in zcoefs array.
(setf (aref zcoefs zcounter) (* xco (aref ycoefs j))) ; no number consing??
(incf zcounter)
(cond ((= zcounter zlim)
;; if we reach this, we need more working space.
;; This is rarely used code. If global array is big enough, it is never used.
;; If global array is too small, it is made bigger.
(let* ((newsize (ash zlim 1)) ; double the size
(newarray (make-array newsize)))
(declare (fixnum newsize)
(type (simple-array t(*)) newarray))
;; (format t "~%zcoefs now size ~s" newsize)
(dotimes (m newsize)
(declare (fixnum m))
(setf (aref newarray m)(aref zcoefs m)))
(setf zcoefs newarray zlim newsize
*zcoefs-limg* zlim
*zcoefsg* zcoefs))))))))))
;;;;;;;;chaining version
(defmacro heap-empty-p ()
"Returns non-NIL if & only if the heap contains no items."
`(<= fill-pointer 1))
(defmacro hlessfun (a b) `(< (st-exp ,a) (st-exp ,b)))
(defmacro lesser-child (a parent fp)
"Return the index of the lesser child. If there's one child,
return its index. If there are no children, return
(FILL-POINTER (HEAP-A HEAP))."
`(let* ((left;;(+ (the fixnum ,parent)(the fixnum ,parent))
(ash ,parent 1))
(right (1+ (the fixnum left))))
(declare (fixnum left right ,fp ,parent )
(optimize(speed 3)(safety 0))
(type (simple-array t (*)) ,a))
(cond ((>= left ,fp) ,fp)
((= right ,fp) left)
((hlessfun (aref ,a left) (aref ,a right)) left)
(t right))))
;; put together in a lexical context to make access faster to global vars
(let ((*YYEXPS* nil)
(*YYCOEFS* nil)
(*YYLEN* 0)
(fill-pointer 0))
(declare (fixnum *YYLEN* fill-pointer)
(type (simple-array t (*)) *YYEXPS* *YYCOEFS*))
(defstruct (st)
;; A kind of stream. each structure contains 5 items
;; which by reference to arrays of XX and YY , can burp out
;; successive exponents and coefficients of the product.
(exp 0 :type integer)
(coef 0 :type integer)
(i 0 :type fixnum)
(v 0 :type integer) (e 0 :type integer)); v, e are read-only
(defun make-simple-array (A) (if (typep A 'simple-array) A
(make-array (length A)
:element-type (array-element-type A)
:initial-contents A)))
(defun mularZ4(x y)
(declare (optimize (safety 3)(speed 0)))
(let* ((xexps (car x))
(xcoefs (cdr x))
(yexps (make-simple-array (car y)))
(ycoefs(make-simple-array (cdr y)))
(yexp0 (aref yexps 0))
(ycoef0 (aref ycoefs 0))
(ans (make-array (1+(length xexps))))
(ylen (length (car y)))
(xe 0)
(v 0)) ;
(declare (type (simple-array t (*))
         ans xexps xcoefs yexps ycoefs))
(setf *YYEXPS* yexps *YYCOEFS* ycoefs *YYLEN* ylen)
(setf (aref ans 0) (make-st :exp 0 :coef 0 :i ylen :v 0 :e 0))
(dotimes (j (length xexps) ans)
(setf v (aref xcoefs j))
(setf xe (aref xexps j))
(setf (aref ans (1+ j))
(make-st :exp (+ xe yexp0) :coef (* v ycoef0) :i 1 :v v :e xe) ))))
(defun percolate-down4 (aa hole xx fp)
"Private. Move the HOLE down until it's in a location suitable for X.
Return the new index of the hole."
(let ((xxval (st-exp xx)))
(declare (fixnum hole fp)(integer xxval)
(type (simple-array t (*)) aa)(optimize (speed 3)(safety 0)))
(do* ((child (lesser-child aa hole fp) (lesser-child aa hole fp))
(aaval (aref aa child)(aref aa child)))
((or (>= (the fixnum child) fp) (< xxval (the fixnum(st-exp aaval))))
hole)
(declare (fixnum child) (type st aaval))
(setf (aref aa hole) aaval hole child))))
;; important: initial-contents of heap must have an element 0 that
;; is not really used. for convenience we could have it equal to the first element.
;; e.g. (3 5 8) should be (3 3 5 8)
(defun heap-insert4 (a xx);; xx is new element
"Insert a new element into the heap. Heap is changed";; rjf
(declare (type (simple-array t (*)) a)(fixnum fill-pointer)
(optimize (speed 3)(safety 0)))
(let ((c nil) ; potential chain point
(fp (incf fill-pointer)) )
(declare (fixnum fp)
(optimize (speed 3)(safety 0)))
;; (format t "~%in insert4, fill-pointer is ~s" fill-pointer)
;; (setf (aref a (decf fp)) xx)
(decf fp)
(setf (aref a fp) xx)
(do ((i fp parent)
(parent (ash fp -1);; initialize to parent of fp
(ash parent -1);; and its parent, next time
))
((>= (st-exp xx) (st-exp(aref a parent)));; is xx>= parent? if so, exit
a);; inserted the item at location i.
(declare (fixnum i parent))
(setf c (aref a parent)) ;; move the parent down one level,
(setf (aref a i) c ;; and xx up, unconditionally
(aref a parent) xx))))
(defun heap-remove4(a)
;; returns nil if the heap is empty, otherwise the removed item is returned
(declare(type (simple-array t(*)) a)(optimize (speed 3)(safety 0)))
(if (heap-empty-p) nil
(let* ((xx (aref a 1))
(fp (decf fill-pointer)) ;faster, fp is local now
(last-object (aref a fp)))
(declare (fixnum fp))
(setf (aref a fp) nil) ;; optional. makes tracing neater
(setf (aref a (percolate-down4 a 1 last-object fp)) last-object)
xx)))
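;; Editorial note: the heap is a min-heap on st-exp, kept in slots
;; 1..fill-pointer-1 of a simple vector (slot 0 is the unused dummy).
;; heap-insert4 bubbles a new record up while its exponent is smaller
;; than its parent's; heap-remove4 hands back the record with the
;; smallest exponent and percolates the last element down from the root.
;; E.g. with exponents (dummy 1 3 2), removing the root 1 leaves
;; (dummy 2 3), still a min-heap.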
;; this is the multiplication program
(defun mul-heap4(x y)
(declare (optimize(speed 3)(safety 0)))
(let* ((hh (if (< (length (car x))(length (car y)))(mularZ4 x y)(mularZ4 y x)))
(alen 10)
;; initial heap
(anse (make-array alen :adjustable t :element-type 'integer :fill-pointer 0))
(ansc (make-array alen :adjustable t :element-type 'integer :fill-pointer 0))
(preve 0) (prevc 0)
(top nil)
(newe 0)(newc 0) )
(declare (integer preve prevc newe newc)
(type (array t(*)) anse ansc hh))
(setf fill-pointer (length hh));; global (special) variable used by the heap routines
(loop while (setf top (heap-remove4 hh)) do
(setf newe (st-exp top) newc (st-coef top))
;; remove top item from heap
(cond
((= preve newe) ; is this same exponent as previous?
(incf prevc newc)) ; if so, add coefficient, and loop
(t (unless (= prevc 0) ; normalize: remove zero coeffs
(vector-push-extend prevc ansc)
(vector-push-extend preve anse))
(setf prevc newc) ; initialize new coefficient
(setf preve newe)))
(if (setf top (next-term top)) ;; is there a successor?
(heap-insert4 hh top))) ;; if so, insert it in heap
;; end of loop, push out the final monomial before exit from routine.
(vector-push-extend prevc ansc)
(vector-push-extend preve anse)
(cons anse ansc)))
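;; Editorial usage sketch, assuming the supporting definitions (the st
;; struct, lesser-child, heap-empty-p) and the globals *YYEXPS*,
;; *YYCOEFS*, *YYLEN*, and fill-pointer are all loaded. With 1+x written
;; as (#(0 1) . #(1 1)),
;;   (mul-heap4 '(#(0 1) . #(1 1)) '(#(0 1) . #(1 1)))
;; should yield (1+x)^2 = 1+2x+x^2, i.e. exponent vector #(0 1 2) paired
;; with coefficient vector #(1 2 1) (as adjustable fill-pointered arrays).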
(defun next-term(rec) ; rec is a stream 5-tuple; return its successor or nil
(declare(type (simple-array t(*)) *YYCOEFS* *YYEXPS*)
(optimize (speed 3)(safety 0))
(type st rec))
(let ((i (st-i rec)))
(cond ((>= i *YYLEN*) nil);; no more items on stream of products
(t (incf (the integer (st-i rec)))
(setf (st-coef rec)
(* (st-v rec)(aref *YYCOEFS* i)))
(setf (st-exp rec)
(+ (st-e rec)(aref *YYEXPS* i)))
rec))))
;;;;;;;;;;;;;;;;;;;; dense, naive algorithm, fixnum coefficients
(defun make-fixnum-array (A) (if (typep A '(simple-array fixnum))
A
(make-array (length A)
:element-type 'fixnum
:initial-contents A)))
;; This next program is an attempt to beat the FFT on its home base,
;; but using the most straightforward algorithm, taking n*m multiplies
;; and a storage scheme that is compact if the answer is dense.
(defun mul-dense-fix (x y)
;; fast and simple for small degree dense
;; exponents and coefficients are all fixnums, in the answer too.
;; if they are not, this answer will be wrong!
;; we want y to be shorter, less copying..
(if (< (length (car x))(length (car y))) ; reverse args if x is shorter
(mul-dense-fix y x)
(let* (
(xe (car x)) ;array of exponents of x
(ye (car y)) ;array of exponents of y
(xc (cdr x)) ;array of coefficients of x
;array of coefficients of y, put in typed array so we can iterate faster
(yc (make-fixnum-array (cdr y)))
(xei 0)(xci 0)
(yl (length ye))
(ans (make-array
(+ 1 (aref xe (1- (length xe)))
(aref ye (1- yl))) ; size of ans
:element-type 'fixnum
:initial-element 0)))
(declare (fixnum yl xei xci)
(type (simple-array t (*)) xe ye xc)
(type (simple-array fixnum (*)) ans yc))
(dotimes (i (length (car x)) (DA2PA ans))
(declare (fixnum i))
(setf xci (aref xc i))
(unless (= xci 0)
(setf xei (aref xe i))
(dotimes(j yl)
;; multiply terms
(declare (fixnum j))
(incf (aref ans (the fixnum (+ xei(the fixnum (aref ye j)))))
(the fixnum(* xci (the fixnum (aref yc j)))))))))))
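;; Editorial usage sketch: provided every exponent and coefficient
;; (including those of the answer) fits in a fixnum,
;;   (mul-dense-fix (cons #(0 1) #(1 1)) (cons #(0 1) #(1 1)))
;; accumulates the dense coefficient vector #(1 2 1) for (1+x)^2 and
;; should return it through DA2PA in the pair-of-arrays form.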
(defun mul-dense (x y)
;; simple for small degree dense ;; faster than mul-naive if dense
;; exponents and coefficients are integers. compare to mul-dense-fix
;; if they are not, mul-dense-fix's answer will be wrong!
;; the idea here is to copy one of the polys to a (dense) array.
;; we want y to be shorter, less copying..
(if (< (length (car x))(length (car y))) ; reverse args if x is shorter
(mul-dense y x)
(let* (
(xe (car x)) ;array of exponents of x
(ye (car y)) ;array of exponents of y
(xc (cdr x)) ;array of coefficients of x
;array of coefficients of y, put in typed array so we can iterate faster
(yc (make-simple-array (cdr y)))
(xei 0)(xci 0)
(yl (length ye))
(ans (make-array
(+ 1 (aref xe (1- (length xe)))
(aref ye (1- yl))) ; size of ans
:element-type 'integer
:initial-element 0)))
;; we could do a reality check here: if the degree computed above is huge
;; compared to the number of terms, maybe this is a bad idea.
(declare (fixnum yl)
(type (simple-array integer (*)) xe ye xc)
(type (simple-array integer (*)) ans yc))
(dotimes (i (length (car x)) (DA2PA ans))
(declare (fixnum i))
(setf xci (aref xc i))
(unless (= xci 0)
(setf xei (aref xe i))
(dotimes(j yl)
;; multiply terms
(declare (fixnum j))
(incf (aref ans (+ xei (aref ye j)))
(* xci (aref yc j)))))))))
)
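;; Editorial sketch: mul-dense is the bignum-safe variant. Where
;; mul-dense-fix would silently wrap around, e.g. when squaring
;; (cons #(0 1) (vector (expt 10 30) 1)), mul-dense accumulates the
;; exact coefficients 10^60, 2*10^30, and 1 before handing the dense
;; vector to DA2PA.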
(defun DA2PA (A)
;;DENSE single array to Pair of Arrays
;; A is an ARRAY of (coef coef coef ...), implicit exponents (0 1 2 ...)
;; create a pair: one array of exponents and one of coefficients.
(let* ((n (count-if-not #'(lambda(r)(eql r 0))
A))
(exps (make-array n))
(coefs (make-array n))
(c 0))
(declare (fixnum n)
(type (simple-array t (*)) A exps coefs))
(do ((i 0 (1+ i))
(j 0))
((>= i (length A)) (cons exps coefs))
(declare (fixnum i j))
(unless (= 0 (setf c (aref A i)))
(setf (aref exps j) i (aref coefs j)c)
(incf j)))))
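;; Editorial example: DA2PA compresses a dense coefficient vector, with
;; implicit exponents 0,1,2,..., into the pair-of-arrays form, skipping
;; zero coefficients. (DA2PA #(5 0 0 7)) is intended to give
;; (#(0 3) . #(5 7)).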