## CS 267 Dense Linear Algebra: Parallel Gaussian Elimination

## **James Demmel**

www.cs.berkeley.edu/~demmel/cs267\_Spr16

03/01/2016 CS267 Lecture 13

## Gaussian Elimination (GE) for solving Ax=b

- Add multiples of each row to later rows to make A upper triangular
- Solve resulting triangular system Ux = c by substitution

After i=1

03/01/2016

```
... for each column i
... zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
... for each row j below row i
for j = i+1 to n
... add a multiple of row i to row j
tmp = A(j,l);
for k = i to n
A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)
```

After i=3

CS267 Lecture 13

## **Outline**

- Review Gaussian Elimination (GE) for solving Ax=b
- · Optimizing GE for caches on sequential machines
  - using matrix-matrix multiplication (BLAS and LAPACK)
- · Minimizing communication for sequential GE
  - Not LAPACK, but Recursive LU minimizes bandwidth (latency possible)
- Data layouts on parallel machines
- Parallel Gaussian Elimination (ScaLAPACK)
- · Minimizing communication for parallel GE
- Not ScaLAPACK (yet), but "Comm-Avoiding LU" (CALU)
- Same idea for minimizing bandwidth and latency in sequential case
- Summarize rest of dense linear algebra
- Dynamically scheduled LU for Multicore
- LU for Heterogeneous computers (CPU + GPU)

03/01/2016 CS267 Lecture 13

## Refine GE Algorithm (1)

Initial Version

```
... for each column i
... zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
... for each row j below row i
for j = i+1 to n
... add a multiple of row i to row j
tmp = A(j,i);
for k = i to n
A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)
```

 Remove computation of constant tmp/A(i,i) from inner loop.

```
for i = 1 to n-1

for j = i+1 to n

m = A(j,i)/A(i,i)

for k = i to n

A(j,k) = A(j,k) - m * A(i,k)

j

03/01/2016

CS267 Lecture 13
```





CS267 Lecture 13

03/01/2016

Store all m's here before updating

rest of matrix





## What GE really computes

for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i) ... BLAS 1 (scale a vector) A(i+1:n,i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) \* A(i, i+1:n) ... BLAS 2 (rank-1 update)

- Call the strictly lower triangular matrix of multipliers M, and let L = I+M
- · Call the upper triangle of the final matrix U
- Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero) then A = L\*U
- Solving A\*x=b using GE
  - Factorize A = L\*U using GE (cost = 2/3 n<sup>3</sup> flops)
  - Solve L\*y = b for y, using substitution (cost = n<sup>2</sup> flops)
  - Solve U\*x = y for x, using substitution (cost = n<sup>2</sup> flops)
- Thus  $A^*x = (L^*U)^*x = L^*(U^*x) = L^*y = b$  as desired

03/01/2016 CS267 Lecture 13

## **Pivoting in Gaussian Elimination**

- A = [ 0 1 ] fails completely because can't divide by A(1,1)=0 [1 0]
- But solving Ax=b should be easy!
- When diagonal A(i,i) is tiny (not just zero), algorithm may terminate but get completely wrong answer
  - · Numerical instability
  - · Roundoff error is cause
- · Cure: Pivot (swap rows of A) so A(i,i) large

03/01/2016 CS267 Lecture 13 11

## Problems with basic GE algorithm

```
for i = 1 to n-1

A(i+1:n,i) = A(i+1:n,i) / A(i,i) ... BLAS 1 (scale a vector)

A(i+1:n,i+1:n) = A(i+1:n , i+1:n) ... BLAS 2 (rank-1 update)

- A(i+1:n , i) * A(i , i+1:n)
```

· What if some A(i,i) is zero? Or very small?

03/01/2016

- Result may not exist, or be "unstable", so need to pivot
- Current computation all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (earlier lectures...)



10

## Gaussian Elimination with Partial Pivoting (GEPP)

• Partial Pivoting: swap rows so that A(i,i) is largest in column

```
for i = 1 to n-1 find and record k where |A(k,i)| = \max\{i \le j \le n\} |A(j,i)| ... i.e. largest entry in rest of column i if |A(k,i)| = 0 exit with a warning that A is singular, or nearly so elseif k \ne i swap rows i and k of A end if A(i+1:n,i) = A(i+1:n,i) / A(i,i) ... each |quotient| \le 1 A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) / A(i,i+1:n)
```

- Lemma: This algorithm computes A = P\*L\*U, where P is a permutation matrix.
- This algorithm is numerically stable in practice
- For details see LAPACK code at
  - http://www.netlib.org/lapack/single/sgetf2.f
- Standard approach but communication costs?

## Problems with basic GE algorithm · What if some A(i,i) is zero? Or very small? - Result may not exist, or be "unstable", so need to pivot • Current computation all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (earlier lectures...) for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i)... BLAS 1 (scale a vector) A(i+1:n,i+1:n) = A(i+1:n,i+1:n) ... BLAS 2 (rank-1 update) - A(i+1:n , i) \* A(i , i+1:n) RS2: Level 1, 2 and 3 BLAS BLAS 3 BLAS 2 BLAS 1 03/01/2016 CS267 Lecture 13 13



## Converting BLAS2 to BLAS3 in GEPP

- Blocking
  - Used to optimize matrix-multiplication
  - Harder here because of data dependencies in GEPP
- BIG IDEA: Delayed Updates
  - Save updates to "trailing matrix" from several consecutive BLAS2 (rank-1) updates
  - Apply many updates simultaneously in one BLAS3 (matmul) operation
- · Same idea works for much of dense linear algebra
  - Not eigenvalue problems or SVD need more ideas
- First Approach: Need to choose a block size b
  - Algorithm will save and apply b updates
  - b should be small enough so that active submatrix consisting of b columns of A fits in cache
  - b should be large enough to make BLAS3 (matmul) fast



## Communication Lower Bound for GE

- · Matrix Multiplication can be "reduced to" GE
- Not a good way to do matmul but it shows that GE needs at least as much communication as matmul
- Does blocked GEPP minimize communication?

$$\begin{bmatrix} I & 0 & -B \\ A & I & 0 \\ 0 & 0 & I \end{bmatrix} = \begin{bmatrix} I \\ A & I \\ 0 & 0 & I \end{bmatrix} \cdot \begin{bmatrix} I & 0 & -B \\ & I & A \cdot B \\ & & I \end{bmatrix}$$

03/01/2016

CS267 Lecture 13

17

## **Does LAPACK's GEPP Minimize Communication?**

```
for ib = 1 to n-1 step b .... Process matrix b columns at a time end = ib + b-1 .... Point to end of block of b columns apply BLAS2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U' .... let LL denote the strict lower triangular part of A(ib:end, ib:end) + I A(ib:end, end+1:n) = LL-1 * A(ib:end, end+1:n) .... update next b rows of U A(end+1:n, end+1:n, ib:end) * A(ib:end, end+1:n) .... apply delayed updates with single matrix-multiply .... with inner dimension b
```

- Case 1:  $n \ge M$  huge matrix attains lower bound
  - b = M<sup>1/2</sup> optimal, dominated by matmul
- Case 2:  $n \le M^{1/2}$  small matrix attains lower bound
  - Whole matrix fits in fast memory, any algorithm attains lower bound
- Case 3: M<sup>1/2</sup> < n < M medium size matrix not optimal
  - Can't choose b to simultaneously optimize matmul and BLAS2 GEPP of n x b submatrix
  - Worst case: Exceed lower bound by factor  $M^{1/6}$  when  $n = M^{2/3}$
- · Detailed counting on backup slides

18

## Alternative cache-oblivious GE formulation (1/2)

- Toledo (1997)
  - Describe without pivoting for simplicity
  - "Do left half of matrix, then right half"



function [L,U] = RLU (A) ... assume A is m by n if (n=1) L = A/A(1,1), U = A(1,1) else



[L1,U1] = RLU( A(1:m , 1:n/2)) ... do left half of A ... let L11 denote top n/2 rows of L1 A(1:n/2 , n/2+1 : n ) = L11<sup>-1</sup> \* A(1:n/2 , n/2+1 : n ) ... update top n/2 rows of right half of A A( n/2+1: m, n/2+1:n) = A( n/2+1: m, n/2+1:n) - A( n/2+1: m, 1:n/2 ) \* A(1:n/2 , n/2+1: n)

- A( n/2+1: m, 1:n/2 ) \* A( 1:n/2 , n/2+1: n ) ... update rest of right half of A [L2.U2] = RLU( A(n/2+1:m , n/2+1:n) ) ... do right half of A

03/01/2016

return [ L1,[0;L2] ] and [U1, [ A(.,.) ; U2 ] ]

CS267 Lecture 13

19

```
<u>Alternative cache-oblivious GE formulation (2/2)</u>
```

```
function [L,U] = RLU (A) ... assume A is m by n
if (n=1) L = A/A(1,1), U = A(1,1)
else
[L1,U1] = RLU(A(1:m, 1:n/2)) ... do left half of A
... let L11 denote top n/2 rows of L1
A(1:n/2, n/2+1: n) = L11-1* A(1:n/2, n/2+1: n)
... update top n/2 rows of right half of A
A( n/2+1: m, n/2+1:n) = A( n/2+1: m, n/2+1:n)
- A( n/2+1: m, 1:n/2) * A(1:n/2, n/2+1: n)
... update rest of right half of A
[L2,U2] = RLU(A(n/2+1:m, n/2+1:n)) ... do right half of A
return [L1,[0;L2]] and [U1, [A(.,.); U2]]
```

• W(m,n) = W(m,n/2) + O(max(m·n,m·n<sup>2</sup>/M<sup>1/2</sup>)) + W(m-n/2,n/2)

```
Still doesn't minimize latency, but fixable CLASS PROJECT
```

 $\leq 2 \cdot W(m,n/2) + O(max(m \cdot n,m \cdot n^2/M^{1/2}))$ 

= O( $m \cdot n^2/M^{1/2} + m \cdot n \cdot \log M$ )

=  $O(m \cdot n^2/M^{1/2})$  if  $M^{1/2} \cdot log M = O(n)$ 

## **Explicitly Parallelizing Gaussian Elimination**

- Parallelization steps
  - Decomposition: identify enough parallel work, but not too much
  - Assignment: load balance work among threads
  - Orchestrate: communication and synchronization
  - Mapping: which processors execute which threads (locality)
- Decomposition
  - In BLAS 2 algorithm nearly each flop in inner loop can be done in parallel, so with n<sup>2</sup> processors, need 3n parallel steps, O(n log n) with pivoting

```
| for i = 1 to n-1
| A(i+1:n,i) = A(i+1:n,i) / A(i,i) ... BLAS 1 (scale a vector)
| A(i+1:n,i+1:n) = A(i+1:n, i+1:n) ... BLAS 2 (rank-1 update)
| - A(i+1:n, i) * A(i, i+1:n)
```

- This is too fine-grained, prefer calls to local matmuls instead
- Need to use parallel matrix multiplication
- · Assignment and Mapping
  - Which processors are responsible for which submatrices?







## **Review of Parallel MatMul**

· Want Large Problem Size Per Processor

PDGEMM = PBLAS matrix multiply

## Observations:

- · For fixed N, as P increasesn Mflops increases, but less than 100% efficiency
- · For fixed P, as N increases, Mflops (efficiency) rises

## DGEMM = BLAS routine for matrix multiply

Maximum speed for PDGEMM = # Procs \* speed of DGEMM

## Observations:

- · Efficiency always at least 48%
- · For fixed N. as P increases. efficiency drops
- · For fixed P. as N increases. efficiency increases

|       |    |         | _  | -   |     |    |
|-------|----|---------|----|-----|-----|----|
| Speed | in | Villons | of | מיד | CRV | vr |

| Speed in Milops of PDGEMM |        |       |       |       |       |  |  |
|---------------------------|--------|-------|-------|-------|-------|--|--|
| Machine                   | Procs  | Block | N     |       |       |  |  |
|                           |        | Size  | 2000  | 4000  | 10000 |  |  |
| Cray T3E                  | 4=2x2  | 32    | 1055  | 1070  | 0     |  |  |
|                           | 16=4x4 |       | 3630  | 4005  | 4292  |  |  |
|                           | 64=8x8 |       | 13456 | 14287 | 16755 |  |  |
| IBM SP2                   | 4      | 50    | 755   | 0     | 0     |  |  |
|                           | 16     |       | 2514  | 2850  | 0     |  |  |
|                           | 64     |       | 6205  | 8709  | 10774 |  |  |
| Intel XP/S MP             | 4.     | 32    | 330   | 0     | 0     |  |  |
| Paragon                   | 16     |       | 1233  | 1281  | 0     |  |  |
|                           | 64     |       | 4496  | 4864  | 5257  |  |  |
| Berkeley NOW              | 4      | 32    | 463   | 470   | 0     |  |  |
|                           | 32=4x8 |       | 2490  | 2822  | 3450  |  |  |
|                           | 64     |       | 4130  | 5457  | 6647  |  |  |

| Efficiency = MFlops(PDGEMM)/(Procs*MFlops(DGEMM)) |       |        |       |      |      |       |
|---------------------------------------------------|-------|--------|-------|------|------|-------|
| Machine                                           | Peak/ | DGEMM  | Procs | N    |      |       |
|                                                   | proc  | Mflops |       | 2000 | 4000 | 10000 |
| Cray T3E                                          | 600   | 360    | 4     | .73  | .74  |       |
|                                                   |       |        | 16    | .63  | .70  | .75   |
|                                                   |       |        | 64    | .58  | .62  | .73   |
| IBM SP2                                           | 266   | 200    | 4     | .94  |      |       |
|                                                   |       |        | 16    | .79  | .89  |       |
|                                                   |       |        | 64    | .48  | .68  | .84   |
| Intel XP/S MP                                     | 100   | 90     | 4     | .92  |      |       |
| Paragon                                           |       |        | 16    | .86  | .89  |       |
|                                                   |       |        | 64    | .78  | .84  | .91   |
| Berkeley NOW                                      | 334   | 129    | 4     | .90  | .91  |       |
|                                                   |       |        | 32    | .60  | .68  | .84   |
|                                                   |       |        | 64    | .50  | 66   | .81   |

## **Does ScaLAPACK Minimize Communication?**

- Lower Bound: O(n2 / P1/2) words sent in O(P1/2) mess.
  - Attained by Cannon and SUMMA (nearly) for matmul
- ScaLAPACK:
  - O(n2 log P / P1/2) words sent close enough
  - O(n log P) messages too large
  - Why so many? One reduction costs O(log P) per column to find maximum pivot, times n = #columns
- Need to abandon partial pivoting to reduce #messages
  - Suppose we have n x n matrix on P1/2 x P1/2 processor grid
  - Goal: For each panel of b columns spread over P1/2 procs. identify b "good" pivot rows in one reduction
    - · Call this factorization TSLU = "Tall Skinny LU"
  - Several natural bad (numerically unstable) ways explored, but good way exists
    - SC08, "Communication Avoiding GE", D., Grigori, Xiang

03/01/2016 CS267 Lecture 13

## $\begin{array}{c} \textbf{PDGESV} = \textbf{ScaLAPACK Parallel LU} \\ \hline \textbf{Efficiency} = \textbf{MFlops}(\textbf{PDGESV})/\textbf{MFlops}(\textbf{PDGEMM}) \\ \end{array}$

## Since it can run no faster than its inner loop (PDGEMM), we measure:

Efficiency = Speed(PDGESV)/Speed(PDGEMM)

## Observations:

03/01/2016

- · Efficiency well above 50% for large enough problems
- For fixed N. as P increases, efficiency decreases (just as for PDGEMM)
- For fixed P, as N increases efficiency increases (just as for PDGEMM)
- From bottom table, cost of solving · Ax=b about half of matrix multiply for large enough matrices.
  - From the flop counts we would expect it to be  $(2*n^3)/(2/3*n^3) = 3$ times faster, but communication makes it a little slower.

| Time(PDGESV)/Time(PDGEMM) |       |       |      |      |       |  |
|---------------------------|-------|-------|------|------|-------|--|
| Machine                   | Procs | Block | N    |      |       |  |
|                           |       | Size  | 2000 | 4000 | 10000 |  |
| Cray T3E                  | 4     | 32    | .50  | .40  |       |  |
|                           | 16    |       | .75  | .51  | .40   |  |
|                           | 64    |       | 1.86 | .72  | .45   |  |
| IBM SP2                   | 4     | 50    | .60  |      |       |  |
|                           | 16    |       | 1.16 | .64  |       |  |
|                           | 64    |       | 2.24 | 1.03 | .51   |  |
| Intel XP/S GP             | 4     | 32    | .52  |      |       |  |
| Paragou                   | 16    |       | .89  | .50  |       |  |
|                           | 64    |       | 2.08 | .79  | .44   |  |
| Berkeley NOW              | 4     | 32    | .44  |      |       |  |
| -                         | 32    |       | .88  | .54  | .47   |  |
|                           | 64    |       | 1.18 | .62  | .49   |  |

Performance of ScaLAPACK LU

Block

IBM SP2

Paragou

Intel XP/S MP

Berkeley NOW

32 .67 .82

Size 2000 4000 10000

.32 .66

.42 .75

.44 .65

.18 .47 .75

.56 .52 .29

.64

.37 .66

.16

.76

.38 .62 .71

.28 .54 .84

## Choosing Rows by "Tournament Pivoting"

$$W^{nxb} = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W \end{bmatrix} = \begin{bmatrix} P_1 \cdot L_1 \cdot U_1 \\ P_2 \cdot L_2 \cdot U_2 \\ P_3 \cdot L_3 \cdot U_3 \\ P_3 \cdot L_3 \cdot U_3 \end{bmatrix}$$

Choose b pivot rows of W<sub>1</sub>, call them W<sub>1</sub>' Choose b pivot rows of W<sub>1</sub>, call them W<sub>2</sub>'
Choose b pivot rows of W<sub>3</sub>, call them W<sub>2</sub>'
Choose b pivot rows of W<sub>3</sub>, call them W<sub>3</sub>'

$$\begin{pmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{pmatrix} = \begin{pmatrix} P_{12} \cdot L_{12} \cdot U_{12} \\ P_{34} \cdot L_{34} \cdot U_{34} \end{pmatrix}$$
 Choose b pivot rows, call them  $W_{12}$  Choose b pivot rows, call them  $W_{34}$ 

$$\begin{pmatrix} W_{12}' \\ W_{34}' \end{pmatrix}$$
 =  $P_{1234} \cdot L_{1234} \cdot U_{1234}$  Choose b pivot row

Go back to W and use these b pivot rows (move them to top, do LU without pivoting) Not the same pivots rows chosen as for GEPP Need to show numerically stable (D., Grigori, Xiang, '11)

03/01/2016

CS267 Lecture 13

## **Minimizing Communication in TSLU**

Parallel: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} \xrightarrow{\rightarrow} LU \xrightarrow{} LU \xrightarrow{} LU$$

Sequential: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W \end{bmatrix} \xrightarrow{\longrightarrow} LU \xrightarrow{\longrightarrow} LU \xrightarrow{\longrightarrow} LU$$

Dual Core: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} \xrightarrow{\longrightarrow} UU \xrightarrow{\longrightarrow$$

Multisore / Multisocket / Multirack / Multisite / Out-of-core: ?

Can Choose reduction tree dynamically

03/01/2016

CS267 Lecture 13

20

## Same idea for QR of Tall-skinny matrix (TSQR)

Parallel: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} \xrightarrow{QR} QR \xrightarrow{QR} QR$$

Sequential: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} \xrightarrow{QR} QR \xrightarrow{QR} QR$$

Dual Core: 
$$W = \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} \xrightarrow{\longrightarrow} QR \xrightarrow{\longrightarrow} QR \xrightarrow{\longrightarrow} QR \xrightarrow{\longrightarrow} QR$$

First step of SVD of Tall-Skinny matrix

03/01/2016 CS267 Lecture 13

20

## Performance vs ScaLAPACK LU

- TSLU
  - IBM Power 5
  - Up to 4.37x faster (16 procs, 1M x 150)
  - Cray XT4
    - Up to 5.52x faster (8 procs, 1M x 150)
- CALU
  - IBM Power 5
  - Up to 2.29x faster (64 procs, 1000 x 1000)
  - Cray XT4
    - Up to 1.81x faster (64 procs, 1000 x 1000)
- See INRIA Tech Report 6523 (2008), paper at SC08

03/01/2016 CS267 Lecture 13 31

## **TSQR Performance Results**

- Parallel
- Intel Clovertown
  - Up to 8x speedup (8 core, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH
  - Up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L
  - Up to 4x speedup (32 procs, 1M x 50)
  - Tesla C 2050 / Fermi
    - Up to **13x** (110,592 x 100)
  - Grid 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  - Cloud (Gleich and Benson) ~2 map-reduces
- Seguential
  - "Infinite speedup" for out-of-core on PowerPC laptop
    - As little as 2x slowdown vs (predicted) infinite DRAM
    - LAPACK with virtual memory never finished
- · SVD costs about the same
- Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others





## Summary of dense sequential O(n³) algorithms attaining communication lower bounds • References are from Table 3.1 in "Communication lower bounds and optimal algorithms for numerical linear algebra", Ballard et al, 2014 • #words moved = Ω(n³/M¹/²), #messages = Ω(n³/M³/²) • Cache-oblivious, Ours, LAPACK, Randomized Computation 2-Level Mem BLAS-3 Cholesky LU Symmade [10] [10] [10] [10] [10] [10] [10]

```
Can we do even better?
· Assume nxn matrices on p processors
• Use c copies of data: M = O(cn<sup>2</sup> / p) per processor

    Increasing M reduces lower bounds:

   #words moved = \Omega((n^3/P) / M^{1/2}) = \Omega((n^2/(c^{1/2}P^{1/2})))
                      = \Omega((n^3/P) / M^{3/2}) \sim \Omega(P^{1/2}/c^{3/2})
   #messages
  Attainable for Matmul

    Not attainable for LU, Cholesky

• Thm: #words moved * #me
   • Lowering #words by
                                    must increase #messages by
      same factor

    Cor: Perfer

                    aling impossible for LU, Cholesky,QR
  Both lower
                     atainable for Cholesky, LU,
  QR (via Chc , y QR):
   • #words moved = \Omega( n^2 / (c^{1/2} P^{1/2}) )
   • #messages = \Omega(c^{1/2} P^{1/2})
```





## **Dense Linear Algebra on Recent Architectures**

- Multicore
  - How do we schedule all parallel tasks to minimize idle time?
- GPUs
  - Heterogeneous computer: consists of functional units (CPU and GPU) that are good at different tasks
  - How do we divide the work between the GPU and CPU to take maximal advantage of both?
  - Challenging now, will get more so as platforms become more heterogeneous

39

03/01/2016 CS267 Lecture 13

## Multicore: Expressing Parallelism with a DAG

- DAG = Directed Acyclic Graph
  - S1 → S2 means statement S2 "depends on" statement S1
  - Can execute in parallel any Si without input dependencies
- For simplicity, consider Cholesky A = LLT, not LU
  - N by N matrix, numbered from A(0,0) to A(N-1,N-1)
  - "Left looking" code: at step k, completely compute column k of L

```
for k = 0 to N-1

for n = 0 to k-1

A(k,k) = A(k,k) - A(k,n)*A(k,n)

A(k,k) = sqrt(A(k,k))

for m = k+1 to N-1

for n = 0 to k-1

A(m,k) = A(m,k) - A(m,n)*A(k,n)

A(m,k) = A(m,k) / A(k,k)
```



# Sample Cholesky DAG with #blocks in any row or column = N/b = 5 • Note implied order of summation from left to right • Not necessary for correctness, but it does reflect what the sequential code does • Can process DAG in any order respecting dependences Slide courtesy of Jakub Kurzak, UTK

## Expressing Parallelism with a DAG - Block Cholesky • Each A[i,j] is a b-by-b block for k = 0 to N/b-1for n = 0 to k-1SYRK: S₁(k,n) $A[k,k] = A[k,k] - A[k,n]*A[k,n]^{\mathsf{T}}$ POTRF: $S_2(k)$ A[k,k] = unblocked Cholesky(A[k,k])for m = k+1 to N/b-1for n = 0 to k-1**GEMM:** $S_3(k,m,n)$ $A[m,k] = A[m,k] - A[m,n]*A[k,n]^T$ TRSM: $S_{4}(k,m)$ $A[m,k] = A[m,k] \cdot A[k,k]^{-1}$ S<sub>1</sub>(k,n) Same DAG, but only ≈(N/b)<sup>3</sup>/6 vertices S,(k,m) $S_3(k,m,n)$

## Scheduling options

- Static (pre-assign tasks to processors) vs
   Dynamic (idle processors grab ready jobs from work-queue)
  - If dynamic, does scheduler take user hints/priorities?
- Respect locality (eg processor must have some task data in its cache) vs not
- Build and store entire DAG to schedule it (which may be very large, (N/b)<sup>3</sup>), vs
   Build just the next few "levels" at a time (smaller, but less information for scheduler)
- Programmer builds DAG & schedule vs Depend on compiler or run-time system
  - Ease of programming, vs not exploiting user knowledge
  - If compiler, how conservative is detection of parallelism?
  - Generally useful, not just linear algebra

## Schedulers tested

- Cilk
  - · programmer-defined parallelism
  - spawn creates independent tasks
  - · sync synchronizes a sub-branch of the tree
- SMPSs
  - · dependency-defined parallelism
  - pragma-based annotation of tasks (directionality of the parameters)
- PLASMA (Static Pipeline)
  - programmer-defined (hard-coded)
  - · apriori processing order
  - · stalling on dependencies

Slide courtesy of Jakub Kurzak, UTK









## **Dense Linear Algebra on GPUs**

- Source: Vasily Volkov's SC08 paper
  - Best Student Paper Award (729 citations)
- New challenges
  - More complicated memory hierarchy
  - Not like "L1 inside L2 inside ...",
    - · Need to choose which memory to use carefully
    - Need to move data manually
  - GPU does some operations much faster than CPU, but not all

51

- CPU and GPU fastest using different data layouts

03/01/2016 CS267 Lecture 13

## Scheduling on Multicore - Next Steps

- PLASMA 2.8.0 released 12/2015
  - Includes BLAS, Cholesky, QR, LU, LDLT, eig, svd
  - icl.cs.utk.edu/plasma/
- Future of PLASMA
  - Continue adding functions
  - Add dynamic scheduling
    - QUARK dynamic scheduse released 12/2011
    - DAGs for eigenproblems are too complicated to do by hand
    - Plan to adopt Plan to adopt Plan to adopt Plan to DAG scheduling features
  - Still assume horiogeneity of available cores
    - What about GPUs, or mixtures of CPUs and GPUs?
  - MAGMA
    - · icl.cs.utk.edu/magma

03/01/2016 CS267 Lecture 13

## **Motivation**

- NVIDIA released CUBLAS 1.0 in 2007, which is BLAS for GPUs
- · This enables a straightforward port of LAPACK to GPU
  - · Consider single precision only



- · Goal: understand bottlenecks in the dense linear algebra kernels
  - · Requires detailed understanding of the GPU architecture
  - Result 1: New coding recommendations for high performance on GPUs
- · Result 2: New , fast variants of LU, QR, Cholesky, other routines CS267 Lecture 13

## **GPU Memory Hierarchy**



- Register file is the fastest and the largest on-chip memory
  - Constrained to vector operations only
- · Shared memory permits indexed and shared access
  - However, 2-4x smaller and 4x lower bandwidth than registers
    - Only 1 operand in shared memory is allowed versus 4 register operands

53

- Some instructions run slower if using shared memory

03/01/2016 CS267 Lecture 13

## \_global\_\_void sgemmNN( const float \*A, int Ida, const float \*B, int Idb, float\* C, int Idc, int k, float alpha, float beta ) A += blockldx.x \* 64 + threadldx.x + threadldx.v\*16: B += threadidx.x + ( blockidx.y \* 16 + threadidx.y ) \* idb; C += blockldx.x \* 64 + threadldx.x + (threadldx.y + blockldx.y \* ldc ) \* 16; \_shared\_\_ float bs[16][17]; float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; const float \*Blast = B + k; do #pragma unrol for( int i = 0; i < 16; i += 4 ) bs[threadldx,x][threadldx,v+i] = B[i\*ldb]; B += 16: for( int i = 0: i < 16: i++. A += Ida ) The bottleneck: $c[0] += A[0]^*bs[i][0]; \quad c[1] += A[0]^*bs[i][1]; \quad c[2] += A[0]^*bs[i][2]; \quad c[3] += A[0]^*bs[i][3];$ Read A's columns $c[4] += A[0]^*bs[i][4]; \quad c[5] += A[0]^*bs[i][5]; \quad c[6] += A[0]^*bs[i][6]; \quad c[7] += A[0]^*bs[i][7];$ Do Rank-1 updates c[8] += A[0]\*bs[i][8]; c[9] += A[0]\*bs[i][9]; c[10] += A[0]\*bs[i][10]; c[11] += A[0]\*bs[i][11]; $c[12] += A[0]*bs[i][12]; \\ c[13] += A[0]*bs[i][13]; \\ c[14] += A[0]*bs[i][14]; \\ c[15] += A[0]*bs[i][15]; \\ c[16] += A[0]*bs[i][16]; \\ c[17] += A[0]*bs[i][16]; \\ c[18] += A[0]*bs[i][18]; \\ c[18] += A[0]*bs[i]$ } while( B < Blast ); for( int i = 0; i < 16; i++, C += ldc ) Store C's block to memory C[0] = alpha\*c[i] + beta\*C[0]: CS267 Lecture 13 55 03/01/2016

## (Some new) NVIDIA coding recommendations

- Minimize communication with CPU memory
- · Keep as much data in registers as possible
  - Largest, fastest on-GPU memory
  - Vector-only operations
- · Use as little shared memory as possible
  - Smaller, slower than registers; use for communication, sharing only
  - Speed limit: 66% of peak with one shared mem argument
- Use vector length VL=64, not max VL = 512
  - Strip mine longer vectors into shorter ones
- Final matmul code similar to Cray X1 or IBM 3090 vector codes











## Design of fast matrix factorizations on GPU

- Use GPU for matmul only, not BLAS2 or BLAS1
- Factor panels on CPU
- Use "look-ahead" to overlap CPU and GPU work
  - GPU updates matrix while CPU factoring next panel
- Use row-major layout on GPU, column-major on CPU
   Convert on the fly
- Substitute triangular solves LX= B with multiply by L-1
  - For stability CPU needs to check || L-1 ||
- · Use variable-sized panels for load balance
- For two GPUs with one CPU, use column-cyclic layout on GPUs









## Class Projects

- · Pick one (of many) functions/algorithms
- · Pick a target parallel platform
- Pick a "parallel programming framework"
  - LAPACK all parallelism in BLAS
  - ScaLAPACK distributed memory using MPI
  - PLASMA DAG scheduling on multicore
    - · Parallel Linear Algebra for Scalable Multi-core Architectures
    - http://icl.cs.utk.edu/plasma/
  - MAGMA DAG scheduling for heterogeneous platforms
    - · Matrix Algebra on GPU and Multicore Architectures
    - · http://icl.cs.utk.edu/magma/
  - Spark, Elemental, ...
- · Design, implement, measure, model and/or compare performance
  - Can be missing entirely on target platform
  - May exist, but with a different programming framework

