U.C. Berkeley CS267

Applications of Parallel Computers

Spring 2014

Tentative Syllabus


Abstract

This course teaches both graduate and advanced undergraduate students from diverse departments how to use parallel computers both efficiently and productively, i.e. how to write programs that run fast while minimizing programming effort. This skill is increasingly important since essentially all computers are (becoming) parallel, from supercomputers to laptops. So beyond teaching the basics about parallel computer architectures and programming languages, we emphasize commonly used patterns that appear in essentially all programs that need to run fast. These patterns include both common computations (e.g. linear algebra, graph algorithms, structured grids, ...) and ways to easily compose these into larger programs. We show how to recognize these patterns in a variety of practical problems, present efficient (sometimes optimal) algorithms for implementing them, and show how to find existing efficient implementations when available and how to compose the patterns into larger applications. We do this in the context of the most important parallel programming models today: shared memory (e.g. PThreads and OpenMP on your multicore laptop), distributed memory (e.g. MPI and UPC on a supercomputer), GPUs (e.g. CUDA and OpenCL, which could be both in your laptop and supercomputer), and cloud computing (e.g. MapReduce and Hadoop). We also present a variety of useful tools for debugging correctness and performance of parallel programs. Finally, we have guest lectures by a variety of experts on parallel climate modeling, astrophysics, and other topics.

High-Level Description

This syllabus may be modified during the semester, depending on feedback from students and the availability of guest lecturers. Topics that we have covered before and intend to cover this time too are shown in standard font below, and possible extra topics (some presented in previous classes, some new) are in italics.

After this high-level description, we give the currently planned schedule of lectures (updated Jan 20) (subject to change).

  • Computer Architectures (at a high level, in order to understand what can and cannot be done in parallel, and the relative costs of operations like arithmetic, moving data, etc.).
  • Sequential computers, including memory hierarchies
  • Shared memory computers and multicore
  • Distributed memory computers
  • GPUs (Graphics Processing Units, e.g. NVIDIA cards)
  • Cloud Computing
  • Programming Languages and Models for these architectures
  • Threads
  • OpenMP
  • Message Passing (MPI)
  • UPC and/or Titanium
  • Communication Collectives (reduce, broadcast, etc.)
  • CUDA/OpenCL etc. (for GPUs)
  • Sources of parallelism and locality in simulation: The two most important issues in designing fast algorithms are (1) identifying enough parallelism, and (2) minimizing the movement of data between memories and processors (moving data being much slower than arithmetic or logical operations). We discuss how simulations of real-world processes have naturally exploitable parallelism and "locality" (i.e. data that needs to be combined can naturally be stored close together, to minimize its movement).
  • Programming "Patterns": It turns out that there is a relatively short list of basic computing problems that appear over and over again. Good ways to solve these problems exist, and so it is most productive to be able to recognize these "patterns" when they appear, and use the best available algorithms and software to implement them. The list of patterns continues to evolve, but we will present the most common ones, and also illustrate how they arise in a variety of applications.

    Originally, there were 7 such patterns that were identified by examining a variety of high performance computational science problems. Since there were 7, they were called the "7 dwarfs" of high performance computing. For each one, we will discuss its structure and usage, algorithms, measuring and tuning its performance (automatically when possible), and available software tools and libraries.
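
    The locality point made under "Sources of parallelism and locality in simulation" can be illustrated with a small sketch. It assumes a 1D mesh of n points in which each point is updated from its immediate neighbors, and it counts how many neighbor pairs end up on different processes under two data layouts. The function names (block_owner, cyclic_owner, count_remote_neighbors) are hypothetical and chosen only for this illustration; they are not part of the course material.

```python
def block_owner(i, n, p):
    """Owner of point i when n points are split into p contiguous blocks."""
    return i * p // n


def cyclic_owner(i, n, p):
    """Owner of point i when points are dealt out round-robin to p processes."""
    return i % p


def count_remote_neighbors(n, p, owner):
    """Count neighbor pairs (i, i+1) of a 1D mesh whose two points
    are owned by different processes, i.e. pairs that need data movement."""
    return sum(
        1 for i in range(n - 1)
        if owner(i, n, p) != owner(i + 1, n, p)
    )


if __name__ == "__main__":
    n, p = 16, 4
    print("block layout :", count_remote_neighbors(n, p, block_owner))
    print("cyclic layout:", count_remote_neighbors(n, p, cyclic_owner))
```

    With 16 points and 4 processes, the block layout leaves only the 3 pairs at block boundaries remote, while the round-robin layout makes all 15 neighbor pairs remote. Both layouts expose the same parallelism; they differ only in how much data must move.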

  • Dense linear algebra (matrix multiply, solving linear systems of equations, etc.)
  • Sparse linear algebra (similar to the dense case, but where the matrices have mostly zero entries and the algorithms neither store nor operate on these zero entries).
  • Structured Grids (where the data is organized to lie on a "grid", e.g. a 2-dimensional mesh, and the basic operations are the same at each mesh point, e.g. "average the value at each mesh point with its neighbors").
  • Unstructured Grids (similar to the above, but where "neighbor" can be defined by an arbitrary graph)
  • Spectral Methods (the FFT, or Fast Fourier Transform, is typical).
  • Particle Methods (where many "particles", e.g. atoms, planets, people, ..., are updated (e.g. moved) depending on the values of some or all other particles, e.g. by electrostatic forces, gravity, etc.).
  • Monte Carlo, sometimes also called MapReduce (as used by Google), where every task is completely independent, but may finish at a different time and require different resources, and where the results of all the tasks may be combined ("reduced") to a single answer.
  • The next 6 patterns of parallel computing were identified by examining a broad array of nonscientific applications that require higher performance via parallelism; not only did the above "7 dwarfs" appear, but so did 6 other computational patterns (see here for details). Of these, we will only have time to cover the first, graph algorithms:
  • Graph algorithms, e.g. traversing a large graph and performing operations on the nodes
  • Finite State Machines, where the "state" is updated using rules based on the current state and most recent input
  • Combinational Logic, performing logical operations (Boolean Algebra) on large amounts of data
  • "Graphical models" involve special graphs representing random variables and probabilities, and are used in machine learning techniques
  • Dynamic Programming, an algorithmic technique for combining solutions of small subproblems into solutions of larger problems
  • Branch-and-Bound search, a divide-and-conquer technique for searching extremely large search spaces, like those arising in games like chess
  • More Patterns - there are various other structural patterns that are useful for organizing software (parallel or sequential) that we will cover as well.
  • Programming Frameworks, parallel languages that make certain patterns easy to use and compose for specific applications domains
  • Measuring performance and finding bottlenecks
  • Load balancing techniques, both dynamic and static
  • Parallel Sorting
  • Assorted guest lectures (depending on availability of lecturers)
  • Climate Modeling
  • Computational Astrophysics
  • Computational Materials Science
  • Exascale Computing
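
To make one of the patterns listed above concrete, the structured-grid operation "average the value at each mesh point with its neighbors" can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the course's implementation: the function name jacobi_sweep is hypothetical, "neighbors" is taken to mean the four points directly above, below, left, and right, and boundary points are simply held fixed.

```python
def jacobi_sweep(grid):
    """Return a new 2D grid in which every interior point is replaced by
    the average of itself and its four nearest neighbors.

    Assumptions (for illustration only): grid is a rectangular list of
    lists of floats, and boundary points are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy, so all reads come from the old grid
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (
                grid[i][j]
                + grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
            ) / 5.0
    return new


if __name__ == "__main__":
    mesh = [
        [0.0, 0.0, 0.0],
        [0.0, 5.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    print(jacobi_sweep(mesh)[1][1])
```

Because each new value is computed only from the old grid, the updates of different mesh points are independent of one another, which is the source of parallelism in this pattern. Each update touches only nearby points, which is the source of locality.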

Detailed Schedule of Lectures (updated Jan 20) (subject to change) (lecturers shown in parentheses)

  • Jan 21 (Tuesday): Introduction: Why Parallel Computing? (Jim Demmel)
  • Jan 23 (Thursday): Single processor machines: Memory hierarchies and processor features (Jim Demmel)
  • Jan 28 (Tuesday): Introduction to parallel machines and programming models (Jim Demmel)
  • Jan 30 (Thursday): Sources of parallelism and locality in simulation: Part 1 (Jim Demmel)
  • Feb 4 (Tuesday): Sources of parallelism and locality in simulation: Part 2 (Jim Demmel)
  • Feb 6 (Thursday): Shared memory machines and programming: OpenMP and Threads; Tricks with Trees (Jim Demmel)
  • Feb 11 (Tuesday): Distributed memory machines and programming in MPI (Jim Demmel)
  • Feb 13 (Thursday): Performance and Debugging Tools (David Skinner and Richard Gerber, NERSC)
  • Feb 18 (Tuesday): Programming in Unified Parallel C (UPC) (Kathy Yelick)
  • Feb 20 (Thursday): Cloud Computing (Shivaram Venkataraman)
  • Feb 25 (Tuesday): GPUs, and programming with CUDA and OpenCL (Bryan Catanzaro, NVIDIA)
  • Feb 27 (Thursday): Dense Linear Algebra: Part 1 (Jim Demmel)
  • Mar 4 (Tuesday): Dense Linear Algebra: Part 2 (Jim Demmel)
  • Mar 6 (Thursday): Graph Partitioning: Part 1 (Jim Demmel)
  • Mar 11 (Tuesday): Graph Partitioning: Part 2, and Sparse-Matrix-Vector-Multiply (Jim Demmel)
  • Mar 13 (Thursday): Sparse-Matrix-Vector-Multiply and Autotuning (Jim Demmel)
  • Mar 18 (Tuesday): TBD
  • Mar 20 (Thursday): TBD
  • Mar 24-28: Spring Break
  • Apr 1 (Tuesday): TBD
  • Apr 3 (Thursday): TBD
  • Apr 8 (Tuesday): TBD
  • Apr 10 (Thursday): TBD
  • Apr 15 (Tuesday): TBD
  • Apr 17 (Thursday): TBD
  • Apr 22 (Tuesday): TBD
  • Apr 24 (Thursday): TBD
  • Apr 29 (Tuesday): TBD
  • May 1 (Thursday): TBD
  • May 8 (Thursday): Student Poster Session