### Low-rank Solutions of Linear Matrix Equations via Procrustes Flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi*, and Benjamin Recht.
UC Berkeley, *USC


Low-rank matrix recovery: Given linear operator $\A(\cdot)$ and measurements $b = \A(M) \in \R^{m}$, would like to recover $M \in \R^{n_1 \times n_2}$.

$$\min_{X \in \R^{n_1 \times n_2}} \mathrm{rank}(X) \;\;\mathrm{s.t.}\;\; \A(X) = b \:.$$
$$\min_{X \in \R^{n_1 \times n_2}} \mathrm{rank}(X) \;\;\mathrm{s.t.}\;\; \A(X) = b \:.$$

Framework is general: matrix completion, phase retrieval, metric embedding, etc.

### Not convinced?

Low-rank matrix recovery is NP-hard in general.

Procrustes flow: A procedure to estimate rank-$r$ $M$ to arbitrary accuracy from $m=\widetilde{\Omega}(nr)$ samples for special types of $\A(\cdot)$'s.

(Also known as gradient descent).

### Which operators work?

Restricted isometry: For all matrices $M \in \R^{n \times n}$ with rank at most $r$, there exists $\delta_r$ s.t.

$$(1-\delta_r) \norm{M}^2_F \leq \twonorm{\A(M)}^2 \leq (1+\delta_r) \norm{M}^2_F \:.$$

What operators are RIP (for constant $\delta$)?

Gaussian ensemble with $m = \Omega(nr)$.

Subsampled 2D-Fourier basis [1] with $m = \Omega(nr \log^4{n})$.

Subsampled Pauli basis [2] with $m = \Omega(nr \log^6{n})$.

[1] Oymak, Recht, and Soltanolkotabi (2015).

[2] Liu (2011).

### Abridged history of RIP matrix sensing

Algorithm Workhorse Convergence Sample complexity
Nuclear norm minimization
SDP
SDP
- $nr$
Iterative hard thresholding
Gradient + rank-$r$ SVD
Gradient + rank-$r$ SVD
rank-$r$ SVD
$\log(1/\varepsilon)$ $nr$
Alternating minimization Least squares $\log(1/\varepsilon)$ $nr^2\kappa^4$

### What if we just used gradient descent?

Low-rank recovery problem: $$\min_{X \in \R^{n \times n}} \frac{1}{2} \twonorm{\A(X) - b}^2 \;\;\mathrm{s.t.}\;\; \mathrm{rank}(X) \leq r \:.$$

Procrustes flow: run gradient descent on the following

\begin{align*} \min_{U \in \R^{n \times r}} f(U) &:= \frac{1}{4} \twonorm{ \A(UU^\T) - b }^2 \:. \end{align*}

Instance of popular Burer-Monteiro heuristic.

### Non-convex, but I've seen worse

The surface of $f(U)$ in a $\R^{2 \times 1}$ case.

A complication: $f(U) = f(U {\color{red}R})$ for any orthogonal matrix $\color{red}R$.

Define distance using orthogonal Procrustes distance

$$\dist(U, X) := \min_{\substack{{\color{red}R} \in \R^{r \times r} \\ {\color{red}RR^\T} = {\color{red}R^\T R} = I}} \norm{U - X{\color{red}R}}_F \:.$$

$\dist(U, X) \longrightarrow 0$ implies $\norm{UU^\T - XX^\T}_F \longrightarrow 0$.

Proof idea: Show strong convexity along trajectories given by optimal Procrustes rotation.

Main theorem: ($C_1, C_2$ below are absolute constants)

Let $U_0$ satisfy

$$\dist(U_0, X) \leq \sigma_r^{1/2}/4 \:.$$

Iterating

$$U_{t+1} \gets U_t - {\color{blue} \frac{C_1}{\sigma_1(U_0)^2}} \underbrace{ \color{green} \A^*( \A(U_t U_t^\T) - b ) U_t}_{\nabla f(U_t)} \:,$$
(Update)

achieves

$$\definecolor{rate}{rgb}{0.59, 0.29, 0.0} \dist^2(U_\tau, X) \leq \left( 1 - \frac{C_2}{\color{rate} \kappa} \right)^{\tau} \dist^2(U_0, X) \:.$$

### What about initialization?

Use $\log(r^{1/2} \kappa)$ iterations of hard-thresholding.

Can use approx SVD (see Hegde, Indyk, and Schmidt (2016)).

You don't even need to! All local min are global min, and all saddle points have negative curvature.

### Matrix perturbation insights

Working with the factorized $U$ requires converting from $\dist(U, X)$ to $\norm{UU^\T - XX^\T}_F$ and vice versa.

(1) $\dist(U, X)$ small $\Longrightarrow$ $\norm{UU^\T - XX^\T}_F$ small is easy.

(2) $\norm{UU^\T - XX^\T}_F$ small $\Longrightarrow$ $\dist(U, X)$ small requires more work.

(Standard matrix perturbation arguments are pessimistic by a condition number).

We establish that

$$\dist^2(U, X) \lesssim {\color{green} \frac{1}{\sigma_r(X)^2}} \norm{UU^\T - XX^\T}^2_F \:.$$

(This inequality has come up many extensions of our work, e.g. Bhojanapalli et al.).

For $M_\ell = U_\ell \Sigma_\ell V_\ell^\T$, $\ell=1,2$,

$$\dist^2\left(\begin{bmatrix} U_2\Sigma_2^{1/2} \\ V_2\Sigma_2^{1/2} \end{bmatrix}, \begin{bmatrix} U_1\Sigma_1^{1/2} \\ V_1\Sigma_1^{1/2} \end{bmatrix}\right) \lesssim {\color{green} \frac{1}{\sigma_r(M_1)}} \norm{M_2 - M_1}^2_F \:.$$

### Conclusion

Gradient descent for RIP matrix sensing converges linearly to the unknown matrix $M$.

Sample complexity of $\Omega(nr)$ for the Gaussian ensemble, and $\widetilde{\Omega}(nr)$ for other structured ensembles.

Poster session Wed 10am-1pm.