DOE2000 Progress Report

Jim Demmel - UC Berkeley

Jack Dongarra - University of Tennessee and ORNL



Recent Awards:

Jack Dongarra was recently named an IEEE Fellow for "contributions and leadership in the field of computational mathematics."

The NetSolve and ATLAS projects both won R&D 100 Awards for 1999.

James Demmel and Mark Adams received the Karl Benz Award for the best industrial application, "Parallel Unstructured Multigrid Finite Element Solvers," at the Mannheim SuParCup '99, held during the Mannheim Supercomputer '99 Conference, June 10-12, 1999.


 

NetSolve

Overview:

NetSolve is a client-server system that enables users to solve complex scientific problems remotely. The system allows users to access both hardware and software computational resources distributed across a network. NetSolve searches for computational resources on a network, chooses the best one available, solves the problem (retrying other resources for fault tolerance), and returns the answer to the user. The NetSolve system uses a load-balancing policy to ensure good performance by using the available computational resources as efficiently as possible. Our framework is based on the premise that distributed computations involve resources, processes, data, and users, and that secure yet flexible mechanisms for cooperation and communication between these entities are the key to metacomputing infrastructures.

Some goals of the NetSolve project are:

Interfaces in Fortran, C, Matlab, and Java have been designed and implemented that enable users to access and use NetSolve more easily. An agent-based design has been implemented to ensure efficient use of system resources.

One of the key characteristics of any software system is versatility. In order to ensure the success of NetSolve, the system has been designed to incorporate any piece of software with relative ease. There are no restrictions on the type of software that can be integrated into the system.
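
As an illustration of how the client interfaces are used, the fragment below sketches a blocking request from C. It assumes NetSolve's blocking client call netsl(); the problem name and argument ordering shown here are illustrative, and in a real program they would follow the problem description registered with the NetSolve servers.

    /* Sketch of a blocking NetSolve request from a C client.  The netsl()
       declaration normally comes from the NetSolve client header; the problem
       name and argument list below are illustrative only. */
    #include <stdio.h>

    extern int netsl(char *problem, ...);   /* NetSolve blocking client call */

    int main(void)
    {
        double A[100], b[10];   /* a 10x10 dense system, filled in elsewhere */
        int n = 10, status;

        /* The NetSolve agent picks a server, the server runs the solver, and
           the answer is shipped back; on failure another server is tried. */
        status = netsl("linear_system_solve()", n, A, n, b);

        if (status < 0)
            fprintf(stderr, "NetSolve request failed (status %d)\n", status);
        else
            printf("solution returned in b\n");
        return 0;
    }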

Results and Interactions:

Currently, a wide range of applications have been integrated with NetSolve and make use of the system. Here we identify a few of these applications and summarize how they have taken advantage of NetSolve.

See NetSolve Application for more details.

NetSolve has integrated numerous systems (either in part or in whole) to support its functionality. Here we mention these systems and summarize which components we have integrated and why.

In addition, NetSolve has been extended to provide access to software from the following DOE2000 software packages: LAPACK, ScaLAPACK, SuperLU, PETSc, and AZTEC.

There have been many enhancements and additions made to the NetSolve project over the past year. The majority of these new features will be distributed as a part of the NetSolve 1.3 release slated for March of this year. Below, we list and briefly describe these enhancements:

 

Plans:


ATLAS

Overview:

ATLAS is an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming task. ATLAS has been designed to automate much of this process. We concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subprograms (BLAS). In particular, our initial work is for general matrix multiply, DGEMM. However, much of the technology and approach developed here can be applied to the other Level 3 BLAS. The general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations, such as sparse operations.

ATLAS uses code generators (i.e., programs that write other programs) in order to provide the many different ways of doing a given operation, and has sophisticated search scripts and robust timing mechanisms in order to find the best ways of performing the operation for a given architecture.
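
To make this concrete, the fragment below is a hand-written sketch (not ATLAS output) of the kind of cache-blocked matrix multiply such a generator emits; the real generator produces many variants, parameterized by block size, loop order, and unrolling, and the search scripts time them to pick the best combination for the machine at hand.

    /* Minimal sketch of a cache-blocked C = C + A*B (square, row-major), in
       the spirit of the kernels ATLAS generates.  ATLAS itself emits many
       variants (block size NB, unrolling, prefetch, ...) and times them. */
    #define NB 64   /* blocking factor; ATLAS would determine this empirically */

    void blocked_gemm(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int jj = 0; jj < n; jj += NB)
                    /* multiply one NB x NB block: operands stay in cache */
                    for (int i = ii; i < ii + NB && i < n; i++)
                        for (int k = kk; k < kk + NB && k < n; k++) {
                            double aik = A[i*n + k];
                            for (int j = jj; j < jj + NB && j < n; j++)
                                C[i*n + j] += aik * B[k*n + j];
                        }
    }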

One of the main performance kernels of linear algebra has traditionally been a standard API known as the BLAS (Basic Linear Algebra Subprograms). This API is supported by hand-tuned efforts of many hardware vendors, and thus provides a good first target for ATLAS, as there is both a large audience for this API, and on those platforms where vendor-supplied BLAS exist, an easy way to see if ATLAS can provide the required level of performance.

Results and Interactions:

The ATLAS project just released version 3.0BETA of the package. Highlights of changes from v2.0 are:

Therefore, this is the first release that supplies a complete, highly optimized BLAS package. Timings (some of which can be found at http://www.netlib.org/atlas) show that ATLAS provides a portable library that competes favorably on a wide variety of architectures with the traditional hand-tuned libraries. As mentioned in the highlights above, we have begun to look at optimizations in other areas as well. There are many areas that require further investigation; we detail a few of them below. We plan to perform preliminary investigations into most, if not all, of these areas in the next 18 months, and expect to investigate several more fully.

 

Plans:

  1. Generalization of ATLAS's configure routines. ATLAS provides an installation and configuration tool suitable for high performance computing. On a purely software engineering level, we hope to generalize this package in order to allow other packages (e.g., LAPACK, ScaLAPACK) to use these techniques to reduce the number one problem for these kinds of packages: installation errors.
  2. Code generation for GEMV and GER. The present dense Level 2 BLAS may not be optimal for all platforms because they rely on a variety of hand-optimized routines for their performance engines. All Level 2 routines use the dense matrix-vector multiply (GEMV) and rank-1 update (GER) as their performance kernels, so for optimal performance portability these routines need to be generated automatically, as is done for the Level 3 dense BLAS with matrix multiply.
  3. Code generation for some Level 1 routines. Many Level 1 routines cannot be optimized much beyond what a standard compiler will do, and so do not need special attention via ATLAS's empirical techniques. However, operations such as AXPY (y <- y + alpha*x) are complex enough that the potential performance benefit makes it worth investigating the optimizations provided by code generation (see the illustrative sketch following this list). This investigation is also necessary in order to support sparse operations more fully, since they use Level 1, or near-Level-1, operations relatively often.
  4. SMP support via pthreads. We plan to continue our investigation of using threads in order to fully utilize SMP machines.
  5. Packed storage optimizations. One important area that has traditionally been handled poorly is packed storage, where only the relevant portion of a triangular or symmetric matrix is stored, allowing larger problems to be solved in the same memory space. Present implementations are orders of magnitude slower than they need to be because of BLAS interface issues and vector-based algorithms. This work may require extending the present generators, or developing specialized routines.
  6. Wide and narrow band optimizations. We have yet to do even a preliminary investigation of this area. Wide band may prove to be as optimizable as packed storage; narrow band is certainly less so. Narrow band, however, exhibits many of the same problems found in sparse optimizations, and is thus of particular interest.
  7. Sparse optimizations. This is an open-ended research area that encompasses many different areas of optimization. We hope to use our experience with dense optimizations to gain insight into the more tractable storage schemes. This will later pave the way for more advanced work, such as structure analysis and dynamic libraries, as well as providing a springboard to handling the less dense-like structures.
  8. Algorithmic improvements and higher-level routines. We have already extended ATLAS beyond the BLAS and into higher-level kernels such as LAPACK's LU and Cholesky. This trend should continue, with perhaps some interesting algorithmic research. For instance, with the known performance provided by ATLAS, alternative algorithms may become attractive in the search for the best performance (an example might be the use of the matrix sign function for eigenvalues, because of the relative performance advantage its Level 3 BLAS operations enjoy over the Level 2 operations used by the traditional methods).
  9. Java versions. One area of great interest is the application of ATLAS techniques to Java. The portability and ease of access of Java make it an appealing development platform for the scientists and engineers using high performance computing. However, this portability and ease of access must often be sacrificed (e.g., by bringing in native methods for optimized libraries) in order to achieve performance. Since the compiler/interpreter is time constrained, optimization at the level enjoyed by C or Fortran 77 is unlikely, even if the Java language inherently lent itself to optimization as well as C and Fortran 77 do, which it clearly does not. For this reason, it seems likely that providing an already optimized program to the compiler/interpreter will yield disproportionately large performance improvements.
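
As an illustration of item 3, the following is a hand-written sketch of the kind of unrolled AXPY kernel a code generator might emit; an actual generator would produce many such variants (different unroll depths, prefetch strategies, alignments) and time them to choose the best for the target machine.

    /* Illustrative, hand-written sketch of an unrolled AXPY (y <- y + alpha*x)
       for unit strides; a code generator would emit many variants like this
       and time them empirically. */
    void daxpy_unroll4(int n, double alpha, const double *x, double *y)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {        /* main loop unrolled by 4 */
            y[i]     += alpha * x[i];
            y[i + 1] += alpha * x[i + 1];
            y[i + 2] += alpha * x[i + 2];
            y[i + 3] += alpha * x[i + 3];
        }
        for (; i < n; i++)                 /* clean-up loop */
            y[i] += alpha * x[i];
    }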

 

Results and Interactions:

We are working with researchers at LLNL to extend these ideas to sparse matrix operations.


 

 

PAPI

Overview:

The purpose of the PAPI project is to specify a standard application-programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis including hand tuning, compiler optimization, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development towards alleviating commonly occurring bottlenecks in high performance computing.
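
To indicate the intended style of use, here is a minimal sketch of counting events around a region of code: create an event set, add preset events, and read the counters before and after the region. Since the API is still being specified, the calls and preset event names below should be read as illustrative of the pattern rather than as the final interface.

    /* Illustrative sketch of the intended usage pattern for hardware counter
       access; exact names and signatures may differ from the final API. */
    #include <stdio.h>
    #include <papi.h>

    static double a[1000000];
    static void compute_kernel(void)       /* toy workload to measure */
    {
        for (int i = 0; i < 1000000; i++) a[i] = 0.5 * i;
    }

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
        PAPI_add_event(eventset, PAPI_L1_DCM);    /* level-1 data cache misses */

        PAPI_start(eventset);
        compute_kernel();
        PAPI_stop(eventset, counts);

        printf("cycles = %lld, L1 D-cache misses = %lld\n", counts[0], counts[1]);
        return 0;
    }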

Results and Interactions:

This project is just beginning. We expect this effort to have an impact on the performance tuning and analysis of users' programs.

This project is part of the Parallel Tools Consortium.

 


 

Faster Eigensolvers and SVD

Overview:

There are two recent algorithmic innovations, made by our group and others, for solving the symmetric eigenvalue problem and the singular value decomposition (SVD). In both cases the algorithms have three phases: First, we reduce the initial matrix from dense (or banded or sparse) form to narrow band form (tridiagonal for the symmetric eigenvalue problem, bidiagonal for the SVD). Second, we solve the eigenproblem or SVD for the narrow band matrix. Third, we transform the solution of the narrow band problem back to a solution of the original problem. Phases 1 and 3 parallelize well and use the BLAS effectively (but see the section on the BLAS Technical Forum for innovations here too). Phase 2, however, is the bottleneck if we use the traditional QR algorithm, which leads us to seek better alternatives.

The first alternative to QR is based on a divide-and-conquer algorithm (DC) introduced by Cuppen in the early 1980s, but not made effective and stable until the early 1990s by the work of Ming Gu and Stan Eisenstat. It drops the O(n^3) complexity of QR to O(n^x) for some x that is typically much less than 3, though still 3 in the worst case. Furthermore, it uses the BLAS effectively, whereas QR uses only Level 1 BLAS. Divide-and-conquer was incorporated into LAPACK v2.0 in 1994, but only for the symmetric eigenvalue problem. It is not straightforward to parallelize because of the communication it requires.

The second alternative is much more recent, based on work by Prof. Beresford Parlett and Inderjit Dhillon at Berkeley. We call it the Holy Grail (HG) because it enjoys two properties: (1) it requires only the minimal O(nk) work to compute the nk entries of k different eigenvectors of dimension n, and (2) it can compute each requested eigenvector in an embarrassingly parallel fashion. The Holy Grail assumes IEEE arithmetic, and exploits NaN and infinity arithmetic to go faster; this is the first LAPACK routine that requires IEEE arithmetic to work correctly, and represents a break from the past, when we insisted on portability to older architectures like the Cray C90.

This is joint work with several other people as well, including Osni Marques at NERSC.

Results and Interactions:

LAPACK v. 3, released in mid 1999, along with an updated Users' Guide, includes both (1) the Holy Grail (HG) implementation of the dense and tridiagonal symmetric eigenvalue problems, and (2) divide-and-conquer (DC) algorithms for the SVD and SVD-based least squares solver.

The figures in Section 3.5 of the Users' Guide (the LAPACK Benchmark) show how much faster the new algorithms are than the old ones.

We first consider the Holy Grail. The figure below (Figure 3.1 from the Users' Guide) compares the performance of three routines, DSTEQR, DSTEDC and DSTEGR, for computing all the eigenvalues and eigenvectors of a symmetric tridiagonal matrix. The times are shown on a Compaq AlphaServer DS-20 for matrix dimensions from 100 to 1000. DSTEQR (QR) was the only algorithm available in LAPACK 1.0, DSTEDC (divide-and-conquer) was introduced in LAPACK 2.0, and DSTEGR (the Holy Grail) was introduced in LAPACK 3.0. As can be seen, for large matrices DSTEGR is about 14 times faster than DSTEDC and nearly 50 times faster than DSTEQR.

Timings of routines for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix. The upper graph shows times in seconds on a Compaq AlphaServer DS-20. The lower graph shows times relative to the fastest routine DSTEGR, which appears as a horizontal line at 1.
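
For concreteness, here is a minimal sketch of calling the fastest of these routines, DSTEGR, from C to compute all eigenpairs of a symmetric tridiagonal matrix (a 1-D Laplacian in this example). The prototype and the workspace sizes follow the LAPACK 3.0 documentation; hidden Fortran character-length arguments, which some C/Fortran calling conventions require, are omitted.

    /* Minimal sketch of calling the LAPACK 3.0 routine DSTEGR from C to
       compute all eigenvalues and eigenvectors of a symmetric tridiagonal
       matrix.  Workspace sizes (18n, 10n) follow the LAPACK documentation. */
    #include <stdio.h>
    #include <stdlib.h>

    extern void dstegr_(const char *jobz, const char *range, const int *n,
                        double *d, double *e, const double *vl, const double *vu,
                        const int *il, const int *iu, const double *abstol,
                        int *m, double *w, double *z, const int *ldz, int *isuppz,
                        double *work, const int *lwork, int *iwork,
                        const int *liwork, int *info);

    int main(void) {
        const int n = 1000;                 /* matrix dimension */
        double *d = malloc(n * sizeof *d);  /* diagonal */
        double *e = malloc(n * sizeof *e);  /* subdiagonal; e[n-1] is workspace */
        for (int i = 0; i < n; i++) { d[i] = 2.0; e[i] = -1.0; }  /* 1-D Laplacian */

        double *w = malloc(n * sizeof *w);                  /* eigenvalues */
        double *z = malloc((size_t)n * n * sizeof *z);      /* eigenvectors */
        int *isuppz = malloc(2 * n * sizeof *isuppz);
        int lwork = 18 * n, liwork = 10 * n;
        double *work = malloc(lwork * sizeof *work);
        int *iwork = malloc(liwork * sizeof *iwork);

        int m, info, il = 0, iu = 0, ldz = n;
        double vl = 0.0, vu = 0.0, abstol = 0.0;  /* unused when range = 'A' */
        dstegr_("V", "A", &n, d, e, &vl, &vu, &il, &iu, &abstol,
                &m, w, z, &ldz, isuppz, work, &lwork, iwork, &liwork, &info);
        if (info == 0)
            printf("computed %d eigenpairs; smallest eigenvalue %g\n", m, w[0]);
        return info;
    }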

We next consider the full singular value decomposition of a dense 1000-by-1000 matrix. We compare DGESVD, the original QR-based routine from LAPACK 1.0, with DGESDD, the new divide-and-conquer based routine from LAPACK 3.0. The table below shows the speeds on several machines. The new routine is 5.7 to 16.8 times faster than the old routine. Part of the speedup comes from doing many fewer operations, and the rest comes from doing them faster (at a higher Mflop rate).



---------------------------------------------------------------------------------------
| Machine              Peak Mflops  Old routine DGESVD   New routine DGESDD   Speedup |
|                       (DGEMM)     Time (s)   Mflops    Time (s)   Mflops    New/Old |
|-------------------------------------------------------------------------------------|
| Compaq AS DS-20          500          143      107          21      323        6.8  |
| IBM Power 3              722          252       61          15      438       16.8  |
| Intel Pentium III        400          245       62          39      171        6.3  |
| SGI Origin 2000          555          176       87          25      267        7.0  |
| SUN Enterprise 450       527          402       38          71       94        5.7  |
---------------------------------------------------------------------------------------

Finally, consider the figure below (Figure 3.3 in the Users' Guide), which compares the performance of five drivers for the linear least squares problem, DGELS, DGELSY, DGELSX, DGELSD and DGELSS, shown in order of decreasing speed. DGELS is the fastest, using just QR without pivoting, and so is also the least reliable for rank-deficient problems. DGELSY and DGELSX use QR with pivoting, and so handle rank-deficient problems more reliably than DGELS, but can be more expensive. DGELSD and DGELSS both use the SVD, and so are the most reliable (and most expensive) ways to solve rank-deficient least squares problems. DGELS, DGELSX and DGELSS were in LAPACK 1.0, and DGELSY and DGELSD were introduced in LAPACK 3.0. DGELSD uses the new divide-and-conquer algorithm. The times are shown on a Compaq AlphaServer DS-20 for square matrices of dimensions from 100 to 1000, with one right-hand side. Note that DGELSD is only 3 to 5 times slower than the fastest algorithm, DGELS, whereas the old algorithm DGELSS that it replaces was 7 to 34 times slower.

Timings of driver routines for the least squares problem. The upper graph shows times in seconds on a Compaq AlphaServer DS-20. The lower graph shows times relative to the fastest routine DGELS, which appears as a horizontal line at 1.

Future Plans:


 

Basic Linear Algebra Subprograms (BLAS) Technical Forum

Overview:

The BLAS Technical Forum is a standards committee with representatives from academia, industry, and national labs. Its mandate is to extend the widely used BLAS with new functionality, including additional dense and banded routines, sparse BLAS, and extended and mixed precision BLAS.

Consideration was also given to parallel and interval BLAS, but no consensus was reached. Regular meetings were held over a period of several years, drafts written, and votes taken, as documented on the BLAS Technical Forum web page.

Results and Interactions:

In the fall of 1999, the final version of the new BLAS standard was approved. This 323-page document describes in great detail the calling sequences and semantics of hundreds of new routines, in Fortran 77, Fortran 95, and C. It is available on the BLAS Technical Forum web page.

We are now engaged in providing a reference implementation, including test code. The sparse BLAS (Chapter 3) are being implemented by Iain Duff and his collaborators at Rutherford Appleton Laboratory in England. Berkeley is implementing Chapter 4 (extended and mixed precision BLAS) using a technique based on M4 macros that will also implement a large fraction of Chapter 2 (dense and banded BLAS with conventional precisions). Collaborators at HP/Convex are using our M4 templates to implement the rest of Chapter 2.

The work at Berkeley is joint with Xiaoye Li at NERSC and a team of 6 undergraduates. The current Berkeley implementation includes all the real versions of three routines, DOT, SPMV and GBMV, as well as test code. There are many versions of each routine because of the use of extended and mixed precision (e.g., taking the dot product of a single precision real array with a single precision complex array, returning a single precision complex result, and doing intermediate computations in double-double, or about 106 bits of precision). This made the use of M4 to automatically generate code from templates essential. For example, 220 lines of M4 for DOT expand into 9526 lines of C, including all versions.
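
To give the flavor of one generated instance, the following hand-written sketch (not the M4 output itself) implements a single case: single precision real inputs and output, with the accumulation carried internally in double precision. The generated routines differ in name, handle strides and all type combinations, and use double-double where extra internal precision is requested.

    /* Hand-written sketch of one mixed-precision combination: single
       precision real inputs and output, double precision accumulation. */
    float sdot_internal_double(int n, const float *x, const float *y)
    {
        double acc = 0.0;                  /* extended internal precision */
        for (int i = 0; i < n; i++)
            acc += (double)x[i] * (double)y[i];
        return (float)acc;                 /* round once at the end */
    }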

Testing is the biggest challenge, since we want to portably test semantics like the one mentioned above (single precision inputs and output, extended precision internally) without resorting to extended precision packages in the test code. We have done this by systematically generating test arrays in any precision and of any length whose dot product is extremely small. Comparing the computed dot product with the true dot product exposes the actual internal precision used. Testing other routines (like matrix-vector multiply) can be reduced to testing dot products.
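
A toy version of the idea, not the actual test generator: with inputs x = (2^30, 1, -2^30) and y = (1, 1, 1) the exact dot product is 1, but a routine that accumulates in single precision loses the 1 when it is added to 2^30 and returns 0, while double internal accumulation returns 1. Comparing the computed result with the exact one therefore reveals the precision actually used internally.

    /* Toy demonstration of how a computed dot product exposes the internal
       precision: the exact result is 1, single-precision accumulation gives 0,
       double-precision accumulation gives 1.  The real test code generates
       such cancelling vectors systematically, for any precision and length. */
    #include <stdio.h>

    int main(void)
    {
        float x[3] = { 1073741824.0f, 1.0f, -1073741824.0f };  /* 2^30, 1, -2^30 */
        float y[3] = { 1.0f, 1.0f, 1.0f };

        float  acc_single = 0.0f;   /* working-precision accumulation */
        double acc_double = 0.0;    /* extended internal precision */
        for (int i = 0; i < 3; i++) {
            acc_single += x[i] * y[i];
            acc_double += (double)x[i] * (double)y[i];
        }
        printf("single internal precision: %g\n", acc_single);  /* prints 0 */
        printf("double internal precision: %g\n", acc_double);  /* prints 1 */
        return 0;
    }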

Future Plans:


 

SuperLU

Overview:

SuperLU is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations on high performance machines. The library is written in C and is callable from either C or Fortran.

The library routines perform an LU decomposition with partial pivoting and triangular system solves through forward and back substitution. The LU factorization routines can handle non-square matrices, but the triangular solves are performed only for square matrices. The matrix columns may be preordered (before factorization) through either library-supplied or user-supplied routines. This preordering for sparsity is completely separate from the factorization. Working-precision iterative refinement subroutines are provided for improved backward stability. Routines are also provided to equilibrate the system, estimate the condition number, calculate the relative backward error, and estimate error bounds for the refined solutions.
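
The drivers take the input matrix in compressed column (Harwell-Boeing) storage. The small, self-contained example below illustrates that storage scheme for a 4 x 4 matrix; the SuperLU calls that would consume these arrays (for example the simple driver dgssv) are omitted here, and the Users' Guide gives the exact calling sequences.

    /* Sketch of the compressed-column (Harwell-Boeing) storage used by the
       SuperLU drivers.  Example matrix (4 x 4, 6 nonzeros):
           [ 4  0  0  1 ]
           [ 0  3  0  0 ]
           [ 2  0  5  0 ]
           [ 0  0  0  6 ]
       nzval  holds the nonzeros column by column,
       rowind holds the row index of each nonzero,
       colptr[j] is the position in nzval/rowind where column j starts. */
    #include <stdio.h>

    int main(void)
    {
        double nzval[]  = { 4.0, 2.0, 3.0, 5.0, 1.0, 6.0 };
        int    rowind[] = { 0,   2,   1,   2,   0,   3   };
        int    colptr[] = { 0, 2, 3, 4, 6 };   /* length n+1; colptr[n] = nnz */
        int    n = 4;

        /* Print the matrix column by column to check the encoding. */
        for (int j = 0; j < n; j++)
            for (int p = colptr[j]; p < colptr[j + 1]; p++)
                printf("A(%d,%d) = %g\n", rowind[p], j, nzval[p]);

        /* These arrays, plus the right-hand side, are what one hands to a
           SuperLU driver after wrapping them in SuperMatrix structures. */
        return 0;
    }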

SuperLU is joint work with Xiaoye Li at NERSC.

Results and Interactions:

We have completed releases of all three versions of the library: sequential SuperLU, multithreaded SuperLU_MT for shared-memory machines, and SuperLU_DIST for distributed-memory machines.

A Users' Guide describes all three libraries.

A talk on SuperLU_DIST was given at the 1999 SIAM Parallel Processing conference, and reported significantly better performance than that in the Supercomputing 98 paper. SuperLU_DIST demonstrated up to 100-fold speedup on the 512-processor Cray T3E at NERSC, and a factorization rate of up to 10.2 Gigaflops.

SuperLU_DIST was used by C. W. McCurdy and collaborators at LBNL to compute the quantum scattering of a 3-body system, achieving good agreement with experimental data for the first time. The computations involved sparse complex linear systems of dimension up to 1.79 million, which were solved on the SGI/Cray T3E at NERSC and the ASCI Blue Pacific computer at LLNL. This work resulted in the 24 Dec 1999 cover of Science, depicting wavefunctions of the three scattered particles.

Future Plans:


 

Prometheus

Overview:

Prometheus is a multigrid solver for finite element matrices on unstructured meshes in solid mechanics. Prometheus is joint work with Prof. R. Taylor of Civil Engineering and postdoc Mark Adams. Prometheus is not directly funded by DOE2000, but we include it since it uses PETSc, a DOE2000-supported project.

Prometheus uses maximal independent sets on modified matrix graphs; these node sets are meshed with a Delaunay mesher to provide coarse "grids", which are used with standard finite element shape functions to provide interpolation operators for multigrid. Coarse grid operators are formed from these interpolation operators and the finer grid matrix in the standard Galerkin way. Prometheus is built on PETSc from Argonne National Laboratory, and uses ParMetis from the University of Minnesota and FEAP from the University of California at Berkeley. Its basic architecture is shown below:
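
Independently of the architecture diagram, the algorithmic core is a standard multigrid cycle. The recursion below is a generic V-cycle sketch, not Prometheus code; the smoothing, restriction, and interpolation routines are placeholders standing in for the PETSc-based operations, and the sketch only shows how the interpolation operators P and the Galerkin coarse-grid matrices described above fit together.

    /* Generic multigrid V-cycle sketch (placeholder routines, not Prometheus
       source).  Illustrates how interpolation operators P and Galerkin coarse
       operators (A_coarse = P^T A P) are used in the solve cycle. */
    typedef struct Level {
        struct Level *coarser;      /* next coarser level; NULL on the coarsest */
        double *x_c, *b_c;          /* scratch vectors for the coarser level */
        /* matrix, interpolation operator P, ... omitted */
    } Level;

    void smooth(Level *lev, double *x, const double *b);        /* e.g. a few Gauss-Seidel sweeps */
    void restrict_residual(Level *lev, const double *x,
                           const double *b, double *b_c);       /* b_c = P^T (b - A x) */
    void zero(Level *lev, double *v);                           /* v = 0 on level lev */
    void interpolate_and_correct(Level *lev, double *x,
                                 const double *x_c);            /* x = x + P x_c */
    void coarse_solve(Level *lev, double *x, const double *b);  /* direct solve on coarsest grid */

    void vcycle(Level *lev, double *x, const double *b)
    {
        if (lev->coarser == NULL) {                /* coarsest grid: solve directly */
            coarse_solve(lev, x, b);
            return;
        }
        smooth(lev, x, b);                         /* pre-smoothing */
        restrict_residual(lev, x, b, lev->b_c);    /* move residual to coarse grid */
        zero(lev->coarser, lev->x_c);              /* coarse correction starts at zero */
        vcycle(lev->coarser, lev->x_c, lev->b_c);  /* recurse on Galerkin coarse problem */
        interpolate_and_correct(lev, x, lev->x_c); /* bring correction back up */
        smooth(lev, x, b);                         /* post-smoothing */
    }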

Results and Interactions:

Prometheus has solved problems of over 39 million degrees of freedom on 960 PowerPC processors at LLNL with about 60% parallel efficiency. It has solved problems of up to 76 million degrees of freedom on 1024 processors; this problem was so large that it caused paging to disk, so the parallel efficiency decreased, but the solve did complete.

Prometheus is publicly available, and is being incorporated into applications within ASCI.

Our work on Prometheus was recognized with the Karl Benz Award for the best industrial application at the Mannheim SuParCup '99, held during the Mannheim Supercomputer '99 Conference, June 10-12, 1999. The title of the paper was "Parallel Unstructured Multigrid Finite Element Solvers."

Future Plans:


 

PBody

Overview:

PBody is a parallel library of N-body methods, i.e., methods for quickly calculating the electrostatic or gravitational forces or potentials on N particles. It is portable across a range of parallel machines, gives the user a wide range of choices in how the computation is done, and is efficient.
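
For reference, the quantity being computed is the direct pairwise sum sketched below; the point of the fast N-body methods in PBody is to approximate this O(N^2) computation much more quickly and in parallel. The code is a generic sketch, not PBody's interface.

    /* Direct O(N^2) evaluation of gravitational/electrostatic potentials:
       phi_i = sum over j != i of q_j / |r_i - r_j| (constants omitted).
       Fast N-body methods compute a controlled approximation to this sum in
       far less time; this direct version is the reference they accelerate. */
    #include <math.h>

    void direct_potential(int n, const double (*pos)[3], const double *q, double *phi)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                sum += q[j] / sqrt(dx*dx + dy*dy + dz*dz);
            }
            phi[i] = sum;
        }
    }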

PBody was not part of the original DOE2000 proposal, but is work that was going on independently. As PBody reached maturity, I began supporting the graduate student doing the work, David Blackston.

Results and Interactions:

We have collaborated with Prof. Andy Neureuther at Berkeley to incorporate PBody into his EBEAM program for simulating electron beam semiconductor fabrication devices. A paper, "Integration of an adaptive parallel N-body solver into a particle-by-particle electron beam interaction simulation," coauthored by D. Blackston, J. Demmel, A. Neureuther and B. Wu, was presented at Technical Conference 3777, Charged Particle Optics IV, part of the annual SPIE meeting, 22-23 July 1999.

Future Plans: