### Breaking Locality Accelerates Block Gauss-Seidel

Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson,
Alex Gittens, Michael I. Jordan, and Benjamin Recht.


Simple Goal: Fix an $n \times n$ (dense) positive definite matrix $A$ and response vector $b \in \R^n$. Our goal is to solve $Ax=b$.

Caveat: we want $n$ to be really big.

Motivating application is kernel ridge regression (KRR).

KRR: Given dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and kernel function $k(x, y)$, the goal is to solve $$\min_{\alpha \in \R^n} \twonorm{ K\alpha - Y }^2 + \lambda \alpha^\T K \alpha \:.$$

Solution is to solve $(K + \lambda I) \alpha = Y$. For large $n$ (i.e. $n \approx 10^6$), $K$ does not even fit in memory.

If the matrices do not fit in memory, we must turn to iterative methods.

Iterative methods for linear systems is a widely studied area. Classical methods include conjugate-gradient (CG) and Gauss-Seidel (GS).

This talk: A new accelerated variant of randomized GS designed to work with small blocks of data at a time.

#### Randomized block GS

Given a current iterate $x_k$ and a random block of coordinates $I \subseteq [n]$, update only the coordinates in $I$ via $$(x_{k+1})_{I} = (x_k)_{I} - A_{II}^{-1} (A x_k - b)_{I} \:.$$

Equivalently, with sketching matrix $S_k \in \R^{n \times \abs{I}}$, $$x_{k+1} = x_k - S_k (S_k^\T A S_k)^{-1} S_k^\T (A x_k - b) \:.$$

Question: How should we sample blocks?

#### How to sample in block GS?

We focus on two practical distributions for block GS.

Fixed partition: Choose blocksize $p$, divide $[n]$ into blocks $I_1, ..., I_{n/p}$ ahead of time. During the iterates, randomly choose a block $I_{k_t}$ for $k_t \sim \mathrm{Unif}(\{1, ..., n/p\})$.

Random coordinates: At each iteration, choose uniformly from the set $\{ I \in 2^{[n]} : \abs{I} = p \}$.

Fixed partition is preferable from a systems perspective (better cache locality, can replicate blocks, etc.).

Random coordinates suffers from slower memory reads. So should we just use fixed partitioning?

#### Does breaking locality help?

A simple example where the sampling matters in block GS (similar to Lee and Wright, 16).

Define the $n \times n$ matrix $$A_{\alpha,\beta} := \alpha I + \frac{\beta}{n} \ind\ind^\T \:.$$

Now, let us try $n=5000, p=500, \alpha=1, \beta=1000$.

To understand what is going on, let us look at the theory of randomized GS.

(Gower and Richtárik, 16). For all $k \geq 0$, \begin{align*} \E\norm{x_k - x_*}_{A} \leq (1 - \mu)^{k/2} D_0 \:, \end{align*}
where $\mu := \lambda_{\min}(\E[ P_{A^{1/2} S} ])$.

The $\mu$ quantity varies as the sampling distribution changes.

$$A_{\alpha,\beta} := \alpha I + \frac{\beta}{n} \ind\ind^\T \:.$$

\begin{align*} \mu_{\mathrm{part}} &= \frac{p}{n+\beta p} \:, \\ \mu_{\mathrm{rand}} &= \mu_{\mathrm{part}} + \frac{p-1}{n-1} \frac{\beta p}{n + \beta p} \:. \end{align*}

As $\beta \longrightarrow \infty$, $\mu_{\mathrm{part}} \approx 1/\beta$ whereas $\mu_{\mathrm{rand}} \approx p/n$. Gap is arbitrarily large.

Systems point of view: fixed-partition sampling is preferable. Can cache blocks ahead of time, replicate across nodes, etc. Locality is good for performance!

Optimization point of view: random coordinates is preferable. Each iteration of GS will make more progress. Locality is bad for optimization!

#### Questions when accelerating GS

Q1: Does the same sampling phenomenon occur with acceleration?

Q2: Does acceleration provide the $\sqrt{\mu}$ behavior we expect?

#### Accelerated randomized block GS

Add a Nesterov momentum step to the iterates.

\begin{align*} x_{k+1} &= \frac{1}{1+\tau} y_k + \frac{\tau}{1+\tau} z_k \:, \\ H_k &= S_k(S_k^\T A S_k)^{-1} S_k^\T \:, \\ y_{k+1} &= x_{k+1} - H_k(A x_{k+1} - b) \:, \\ z_{k+1} &= z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu} H_k(A x_{k+1} - b) \:. \end{align*}

Parameters $\tau, \mu$ are carefully chosen.

#### Prior theory for accelerated block GS

(Nesterov and Stich, 16). For all $k \geq 0$, accelerated block GS with fixed-partition sampling satisfies $$\E\norm{y_k - x_*}_A \lesssim \left(1 - \sqrt{\frac{p}{n} \mu_{\mathrm{part}}} \right)^{k/2} D_0 \:.$$

Here, $\mu_{\mathrm{part}} = \E[ P_{A^{1/2}} S]$, where $S \in \R^{n \times p}$ represents fixed-partition sampling.

Loses $\sqrt{n/p}$ factor over "ideal" Nesterov rate.

$$\E\norm{y_k - x_*}_A \lesssim \left(1 - \sqrt{\frac{p}{n} \mu_{\mathrm{part}}} \right)^{k/2} D_0 \:.$$

Question: does the guarantee hold for other sampling schemes, with $\mu_{\mathrm{part}}$ replaced by $\mu = \E[ P_{A^{1/2} S} ]$?

(Main result). For all $k \geq 0$, accelerated block GS with any (non-degenerate) sampling scheme satisfies $$\E\norm{y_k - x_*}_A \lesssim (1-\tau)^{k/2} D_0 \:.$$
Here, $\tau := \sqrt{ \frac{1}{\nu} \cdot \mu}$, where $\mu$ is as before, and $\nu$ is a new quantity which behavies roughly like $n/p$.

We also prove the rate is sharp-- there exists a starting point which matches the rate up to constants.

For fixed-partition sampling, we can prove that $\nu = n/p$, thus recovering Nesterov and Stich's earlier result (combined with the sharpness of the rate, this proves that the $\sqrt{n/p}$ loss over ideal Nesterov rate is real).

For random coordinate sampling, we can only prove the weaker claim $$\nu \leq \frac{n}{p} \max_{\abs{J} = p} \frac{\max_{i \in J} A_{ii}}{\lambda_{\min}(A_{JJ})} \:.$$

Empirically, $\nu$ never strays too far from $n/p$, so we leave this as open.

## Experiments

Block sampling distribution is very important for GS (need to trade-off systems performance with optimization performance).

We generalize the theory of block GS to the accelerated case.

Poster session Mon 6:30pm-10:00pm.