U.C. Berkeley CS267/EngC233

Applications of Parallel Computers

Spring 2011

Tentative Syllabus


High-Level Description

This syllabus may be modified during the semester, depending on feedback from students and the availability of guest lecturers. Topics that we have covered before and intend to cover this time too are shown in standard font below, and possible extra topics (some presented in previous classes, some new) are in italics.

After this high-level description, we give the currently planned schedule of lectures (updated Jan 16; subject to change).

  • Computer Architectures (at a high level, in order to understand what can and cannot be done in parallel, and the relative costs of operations like arithmetic, moving data, etc.).
      • Sequential computers, including memory hierarchies
      • Shared memory computers and multicore
      • Distributed memory computers
      • GPUs (Graphics Processing Units, e.g. NVIDIA cards)
      • Cloud and Grid Computing
  • Programming Languages and Models for these architectures
      • Threads
      • OpenMP
      • Message Passing (MPI)
      • UPC and/or Titanium
      • Communication Collectives (reduce, broadcast, etc.; see the short MPI sketch after the topic list)
      • CUDA/OpenCL etc. (for GPUs)
      • Cilk
  • Sources of parallelism and locality in simulation: The two most important issues in designing fast algorithms are (1) identifying enough parallelism, and (2) minimizing the movement of data between memories and processors (moving data being much slower than arithmetic or logical operations). We discuss how simulations of real-world processes have naturally exploitable parallelism and "locality" (i.e. data that needs to be combined can naturally be stored close together, to minimize its movement).
  • Programming "Patterns": It turns out that there is a relatively short list of basic computing problems that appear over and over again. Good ways to solve these problems exist, and so it is most productive to be able to recognize these "patterns" when they appear, and use the best available algorithms and software to implement them. The list of patterns continues to evolve, but we will present the most common ones, and also illustrate how they arise in a variety of applications.

    Originally, 7 such patterns were identified by examining a variety of high-performance computational science problems. Since there were 7, they were called the "7 dwarfs" of high-performance computing. For each one, we will discuss its structure and usage, algorithms, measuring and tuning its performance (automatically when possible), and available software tools and libraries.

  • Dense linear algebra (matrix multiply, solving linear systems of equations, etc.)
  • Sparse linear algebra (similar to the dense case, but where the matrices have mostly zero entries and the algorithms neither store nor operate on these zero entries; see the CSR sketch after the topic list).
  • Structured Grids (where the data is organized to lie on a "grid", e.g. a 2-dimensional mesh, and the basic operations are the same at each mesh point, e.g. "average the value at each mesh point with its neighbors"; see the stencil sketch after the topic list).
  • Unstructured Grids (similar to the above, but where "neighbor" can be defined by an arbitrary graph)
  • Spectral Methods (the FFT, or Fast Fourier Transform, is typical).
  • Particle Methods (where many "particles" (e.g. atoms, planets, people, ...) are updated (e.g. moved) depending on the values of some or all other particles, e.g. by electrostatic forces, gravity, etc.; see the N-body sketch after the topic list).
  • Monte Carlo, sometimes also called MapReduce (as used by Google), where every task is completely independent but may finish at a different time and require different resources, and where the results of all the tasks may be combined ("reduced") to a single answer (see the Monte Carlo sketch after the topic list).
  • The next 6 patterns of parallel computing were identified by examining a broad array of nonscientific applications that require higher performance via parallelism; not only did the above "7 dwarfs" appear, but so did 6 other computational patterns, which we will probably only have time to cover in part:
      • Finite State Machines, where the "state" is updated using rules based on the current state and most recent input
      • Combinational Logic, performing logical operations (Boolean Algebra) on large amounts of data
      • Graph traversal, traversing a large graph and performing operations on the nodes
      • "Graphical models", special graphs representing random variables and probabilities, used in machine learning techniques
      • Dynamic Programming, an algorithmic technique for combining solutions of small subproblems into solutions of larger problems
      • Branch-and-Bound search, a divide-and-conquer technique for searching extremely large search spaces, such as those arising in games like chess
  • More Patterns - various other patterns are useful for organizing software (parallel or sequential), and we will cover these as well.
  • Measuring performance and finding bottlenecks
  • Load balancing techniques, both dynamic and static
  • Parallel Sorting
  • Correctness
      • Verification and Validation (V&V) of the results (how to convince yourself and others to believe the result of a large computation, important not only with parallelism)
      • Automatic code derivation (sketching)
      • Proofs and testing of code
  • Assorted possible guest lectures (some repeats, some new; depends on availability of lecturers)
      • Performance Measuring Tools
      • Simulating the Human Brain
      • Computational Nanoscience
      • Musical performance and delivery (ParLab application)
      • Volunteer Computing (e.g. how SETI@home and similar projects work)
      • Climate Modeling
      • Computational Astrophysics
      • Computational Biology
      • Image Processing (ParLab application)
      • Speech Recognition (ParLab application)
      • Modeling Circulatory System of Stroke Victims (ParLab application)
      • Parallel Web Browsers (ParLab application)
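
To make a few of the topics above concrete, the short sketches below are illustrative examples written for this syllabus (not course-provided code, and not necessarily the way the lectures will present the material). The first one illustrates the message-passing model and the reduction collective mentioned under Communication Collectives: each MPI process computes a partial sum of the integers 1 through 100, and MPI_Reduce combines the partial sums on rank 0. The problem and all names are arbitrary choices for the illustration.

    /* Illustrative MPI sketch (not course code): each process sums the
     * integers in {1,...,100} assigned to it, then MPI_Reduce combines
     * the partial sums on rank 0. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank takes every size-th integer, starting at rank+1. */
        long local = 0, total = 0;
        for (long i = rank + 1; i <= 100; i += size)
            local += i;

        /* The collective: combine all local sums into one result on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %ld\n", total);   /* expect 5050 */
        MPI_Finalize();
        return 0;
    }

With a typical MPI installation this would be compiled with mpicc and launched with mpirun; the exact commands depend on the local setup.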
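
The next sketch illustrates the Sparse Linear Algebra bullet ("neither store nor operate on these zero entries") using the standard compressed sparse row (CSR) layout; the tiny 2-by-3 example matrix is made up for the illustration.

    /* Illustrative CSR sparse matrix-vector multiply: only the nonzero
     * values are stored (val), together with their column indices (col)
     * and the start of each row (rowptr). */
    #include <stdio.h>

    void spmv_csr(int nrows, const int *rowptr, const int *col,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* touch only stored nonzeros */
            y[i] = sum;
        }
    }

    int main(void) {
        /* The 2x3 matrix [[1 0 2], [0 3 0]] in CSR form. */
        int    rowptr[] = {0, 2, 3};
        int    col[]    = {0, 2, 1};
        double val[]    = {1.0, 2.0, 3.0};
        double x[]      = {1.0, 1.0, 1.0}, y[2];
        spmv_csr(2, rowptr, col, val, x, y);
        printf("y = [%g, %g]\n", y[0], y[1]);   /* expect [3, 3] */
        return 0;
    }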
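
The next sketch illustrates the Structured Grids bullet, and also the locality point made under Sources of Parallelism and Locality: one Jacobi-style sweep that averages each interior mesh point with its four neighbors, with the rows distributed across threads by an OpenMP directive. The mesh size and initial data are arbitrary.

    /* Illustrative structured-grid (5-point stencil) sweep with OpenMP;
     * the mesh size and initial data are arbitrary. */
    #include <stdio.h>

    #define N 8

    /* Replace each interior point with the average of itself and its
     * four nearest neighbors. */
    void stencil_sweep(double u[N][N], double unew[N][N]) {
        #pragma omp parallel for          /* rows can be updated independently */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                unew[i][j] = 0.2 * (u[i][j] + u[i-1][j] + u[i+1][j]
                                            + u[i][j-1] + u[i][j+1]);
    }

    int main(void) {
        double u[N][N] = {{0}}, unew[N][N] = {{0}};
        u[N/2][N/2] = 1.0;                /* one "hot" point in the middle */
        stencil_sweep(u, unew);
        printf("unew[%d][%d] = %g\n", N/2 - 1, N/2, unew[N/2 - 1][N/2]); /* 0.2 */
        return 0;
    }

Compiling with the compiler's OpenMP flag (e.g. -fopenmp for gcc) enables the parallel loop; without it the pragma is ignored and the sweep simply runs sequentially.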
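
The next sketch illustrates the Particle Methods bullet with the simplest possible direct O(n^2) force computation (gravity, in made-up units): every particle's update depends on every other particle. Faster tree- and mesh-based approaches are part of the particle-methods topic itself; this is only the naive version.

    /* Illustrative O(n^2) particle (N-body) force sketch: each particle's
     * acceleration depends on all other particles; the particle count
     * and data are arbitrary. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    int main(void) {
        /* positions, masses, and computed accelerations (2-D, made-up data) */
        double px[N] = {0.0, 1.0, 0.0, 1.0};
        double py[N] = {0.0, 0.0, 1.0, 1.0};
        double m[N]  = {1.0, 2.0, 3.0, 4.0};
        double ax[N] = {0}, ay[N] = {0};
        const double G = 1.0;             /* units chosen so G = 1 */

        for (int i = 0; i < N; i++) {     /* outer loop is parallelizable */
            for (int j = 0; j < N; j++) {
                if (j == i) continue;
                double dx = px[j] - px[i], dy = py[j] - py[i];
                double r2 = dx * dx + dy * dy;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                ax[i] += G * m[j] * dx * inv_r3;   /* sum forces from all j */
                ay[i] += G * m[j] * dy * inv_r3;
            }
        }
        for (int i = 0; i < N; i++)
            printf("particle %d: a = (%g, %g)\n", i, ax[i], ay[i]);
        return 0;
    }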
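
The last sketch illustrates the Monte Carlo / MapReduce bullet: every sample is an independent task, and the per-thread hit counts are "reduced" to a single estimate of pi. The sample count and the small per-thread random number generator are arbitrary choices; because the sketch calls omp_get_thread_num, it needs to be compiled with OpenMP enabled (e.g. gcc -fopenmp).

    /* Illustrative Monte Carlo estimate of pi: independent samples ("map")
     * followed by a reduction of the per-thread hit counts ("reduce"). */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const long nsamples = 10000000;       /* arbitrary sample count */
        long hits = 0;

        #pragma omp parallel reduction(+:hits)
        {
            /* A small per-thread linear congruential generator keeps the
             * samples independent (its quality is fine for an illustration). */
            unsigned long long s = 88172645463325252ULL + omp_get_thread_num();
            #pragma omp for
            for (long i = 0; i < nsamples; i++) {
                s = 6364136223846793005ULL * s + 1442695040888963407ULL;
                double x = (s >> 11) / (double)(1ULL << 53);
                s = 6364136223846793005ULL * s + 1442695040888963407ULL;
                double y = (s >> 11) / (double)(1ULL << 53);
                if (x * x + y * y <= 1.0)
                    hits++;                   /* each sample is independent */
            }
        }
        /* The per-thread counts have been combined into one answer. */
        printf("pi is approximately %f\n", 4.0 * hits / (double)nsamples);
        return 0;
    }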

Detailed Schedule of Lectures (updated Jan 16, subject to change; lecturers shown in parentheses)

  • Jan 18 (Tuesday): Introduction: Why Parallel Computing? (Kathy Yelick)
  • Jan 20 (Thursday): Single processor machines: Memory hierarchies and processor features (Kathy Yelick)
  • Jan 25 (Tuesday): Introduction to parallel machines (Kathy Yelick)
  • Jan 27 (Thursday): Shared memory machines and programming: OpenMP and Threads (James Demmel)
  • Feb 1 (Tuesday): Distributed memory machines and programming in MPI (Kathy Yelick)
  • Feb 3 (Thursday): Sources of parallelism and locality in simulation: Part 1 (James Demmel)
  • Feb 8 (Tuesday): Sources of parallelism and locality in simulation: Part 2; Tricks with Trees (James Demmel)
  • Feb 10 (Thursday): GPUs, and programming with CUDA and OpenCL (Bryan Catanzaro)
  • Feb 15 (Tuesday): Performance and Debugging Tools (NERSC staff)
  • Feb 17 (Thursday): Structured Grids and Performance Modeling (Kathy Yelick)
  • Feb 22 (Tuesday): Dense Linear Algebra: Part 1 (James Demmel)
  • Feb 24 (Thursday): Dense Linear Algebra: Part 2 (James Demmel)
  • Mar 1 (Tuesday): Global Address Space Programming in UPC (Kathy Yelick)
  • Mar 3 (Thursday): Sparse-Matrix-Vector-Multiply (Kathy Yelick)
  • Mar 8 (Tuesday): Graph Partitioning (James Demmel)
  • Mar 10 (Thursday): Particle (N-Body) methods (James Demmel)
  • Mar 15 (Tuesday): Multigrid on structured grids (James Demmel)
  • Mar 17 (Thursday): Spectral Methods (FFT) (James Demmel)
  • Mar 22-24: Spring Break
  • Mar 29 (Tuesday): Patterns of Parallel Programming I (Kurt Keutzer)
  • Mar 31 (Thursday): Patterns of Parallel Programming II (Kurt Keutzer)
  • Apr 5 (Tuesday): Cloud computing with MapReduce and Hadoop (Matei Zaharia)
  • Apr 7 (Thursday): Multiphysics Programming Frameworks (John Shalf)
  • Apr 12 (Tuesday): (Computational Astrophysics - TBD)
  • Apr 14 (Thursday): (Dynamic load balancing and Parallel Sorting - TBD)
  • Apr 19 (Tuesday): (Parallel Graph Algorithms - TBD)
  • Apr 21 (Thursday): Simulation of Blood Flow on 200K cores (2010 Gordon Bell Prize) (Richard Vuduc)
  • Apr 26 (Tuesday): (Climate Modeling - TBD)
  • Apr 28 (Thursday): (Future Exascale Machines - TBD)
  • May 5 (Thursday): Student Poster Session