21) SVD Geometry#

Last time#

  • Condition number of a matrix

  • Least squares and normal equations

  • Intro to SVD

  • Geometry of the Singular Value Decomposition

Today#

  1. Recap on matrix norm and condition number

  2. Condition number via SVD
    2.1. Relationship between conditioning and singular values

  3. Singular Value Decomposition for solving systems

  4. Costs of decompositions seen

using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)

# Vandermonde matrix: column j is x.^(j-1)
function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

function gram_schmidt_naive(A)
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j in 1:n
        v = A[:,j]
        for k in 1:j-1
            r = Q[:,k]' * v
            v -= Q[:,k] * r
            R[k,j] = r
        end
        R[j,j] = norm(v)
        Q[:,j] = v / R[j,j]
    end
    Q, R
end

function gram_schmidt_classical(A)
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j in 1:n
        v = A[:,j]
        R[1:j-1,j] = Q[:,1:j-1]' * v
        v -= Q[:,1:j-1] * R[1:j-1,j]
        R[j,j] = norm(v)
        Q[:,j] = v / R[j,j]
    end
    Q, R
end

function qr_householder(A)
    m, n = size(A)
    R = copy(A)
    V = [] # list of reflectors
    for j in 1:n
        v = copy(R[j:end, j])
        v[1] += sign(v[1]) * norm(v) # <--- pick the sign so the reflection moves v the largest distance (avoids cancellation)
        v = normalize(v)
        R[j:end,j:end] -= 2 * v * v' * R[j:end,j:end]
        push!(V, v)
    end
    V, R
end

# QR via Cholesky of the normal equations: A'A = R'R, Q = A R^{-1}
function qr_chol(A)
    R = cholesky(A' * A).U
    Q = A / R
    Q, R
end

# CholeskyQR2: a second pass of qr_chol to improve the orthogonality of Q
function qr_chol2(A)
    Q, R = qr_chol(A)
    Q, R1 = qr_chol(Q)
    Q, R1 * R
end
qr_chol2 (generic function with 1 method)

1. Recap on matrix norm and condition number#

The condition number of matrix-vector multiplication depends on the vector. The condition number of the matrix is the worst case (maximum) of the condition number for any vector, i.e.,

\[ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert x \rVert}{\lVert A x \rVert} .\]

If \(A\) is invertible, then we can rephrase as

\[ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \max_{A x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \lVert A \rVert \lVert A^{-1} \rVert . \]

Evidently multiplying by a matrix is just as ill-conditioned an operation as solving a linear system with that matrix.

Matrix norms induced by vector norms#

Recall

\[ \lVert A \rVert = \max_{\lVert x \rVert = 1} \lVert A x \rVert . \]
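
As a quick check (a minimal sketch; B is just a random matrix introduced here for illustration), Julia's opnorm computes this induced 2-norm, and cond returns \(\lVert A \rVert \lVert A^{-1} \rVert\) in the 2-norm:

B = randn(4, 4)                    # a random matrix, just for illustration
@show opnorm(B)                    # induced 2-norm: max over unit x of norm(B * x)
@show opnorm(B) * opnorm(inv(B))   # ‖B‖ ‖B⁻¹‖
@show cond(B)                      # the same value (2-norm condition number)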

2. Condition number via SVD#

\[ \kappa(A) = \lVert A \rVert \ \lVert A^{-1} \rVert \]

Or, in terms of the SVD

\[ U \Sigma V^T = \texttt{svd}(A) \]

where

\[\begin{split} \Sigma = \begin{bmatrix} \sigma_{\max} && \\ & \ddots & \\ && \sigma_{\min} \end{bmatrix}, \end{split}\]

where \((\sigma_{1}, \sigma_{2}, \ldots)\) are the non-negative real numbers called the singular values of \(A\) (usually listed in decreasing order).

Then the matrix condition number is given by:

\[ \kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}} = \texttt{cond}(A) \]

2.1 Relationship between conditioning and singular values#

What makes a matrix ill-conditioned?

A = [10 5; .9 .5]
@show cond(A)
svdvals(A) # Julia built-in that returns singular values in decreasing order
cond(A) = 252.116033572376
2-element Vector{Float64}:
 11.22755613596245
  0.0445332888070338

There are more than two orders of magnitude between \(\sigma_{\textrm{max}}\) and \(\sigma_{\textrm{min}}\) (a ratio of about 250).

m = 100
x = LinRange(-1, 1, m)
A = vander(x, 20)
@show cond(A)
svdvals(A)
cond(A) = 7.206778416853811e6
20-element Vector{Float64}:
 11.510601491136654
  8.993572449917352
  5.907360126610745
  3.6029154831946064
  1.9769729310318853
  1.098310376576986
  0.5392322016751737
  0.2820042230178553
  0.12425066006852684
  0.06183086124622112
  0.024242929146349384
  0.011556331531623212
  0.003964435582121844
  0.0018189836735690052
  0.0005298822985647027
  0.00023487660987952337
  5.4870893944525604e-5
  2.3566694052620285e-5
  3.8286365250847765e-6
  1.597190981232044e-6

More than six orders of magnitude here (a ratio of about \(7 \times 10^6\))!

Since

\[ \kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}} = \texttt{cond}(A) \]

we see that the larger the ratio between the largest and smallest singular values, the larger the condition number of the matrix (and hence the more ill-conditioned it is).
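
We can confirm this on the Vandermonde matrix above (a quick check; A is still the matrix built in the previous cell):

sigma = svdvals(A)
@show sigma[1] / sigma[end]   # σ_max / σ_min
@show cond(A)                 # the same value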

3. Singular Value Decomposition for solving systems#

Recall

\[ U \Sigma V^T = A \]
where \(U\) and \(V\) have orthonormal columns and \(\Sigma\) is diagonal with nonnegative entries.

The entries of \(\Sigma\) are called singular values and this decomposition is the singular value decomposition (SVD). It may remind you of an eigenvalue decomposition \(X \Lambda X^{-1} = A\), but

  • the SVD exists for all matrices (including non-square and rank-deficient matrices)

  • \(U,V\) have orthonormal columns (while \(X\) can be arbitrarily ill-conditioned).

  • Indeed, if a matrix is symmetric and positive definite (all positive eigenvalues), then \(U=V\) and \(\Sigma = \Lambda\) (checked numerically right after this list).

  • Once you have the SVD of a matrix \(A\), you can solve the linear system \(Ax=b\) pretty efficiently (more on this in a bit).
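
A quick numerical check of the symmetric positive definite case (a minimal sketch; the matrix C' * C + 3I is just a convenient way to build an SPD example, and the variable names are mine):

C = randn(3, 3)
A_spd = Symmetric(C' * C + 3I)          # an SPD matrix built for illustration
F = svd(A_spd)
E = eigen(A_spd)                        # symmetric eigenvalues come back in increasing order
@show F.S ≈ sort(E.values, rev=true)    # Σ = Λ for SPD matrices
@show F.U ≈ F.V                         # U = V for SPD matrices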

4. Costs of decompositions seen#

Although I mentioned already in class that counting flops is a bad model for modern high-performance computing, going deeply into performance models for modern architectures is not the focus of this course. Let’s review the costs in the classical sense of algorithm complexity (i.e., as a function of problem size).

Recall that we’d say an algorithm costs \(O(n^2)\) if its running time on input of size \(n\) is less than \(c n^2\) for some constant \(c\) and sufficiently large \(n\).

Sometimes we write \(\operatorname{cost}(\texttt{algorithm}, n) = O(n^2)\) or (preferably) \(\operatorname{cost}(\texttt{algorithm}) \in O(n^2)\).

Note that \(O(\log n) \subset O(n) \subset O(n\log n) \subset O(n^2) \subset \dotsb\).

We say the algorithm is in \(\Theta(n^2)\) (“big theta”) if

\[ c_1 n^2 < \operatorname{cost}(\texttt{algorithm}) < c_2 n^2 \]
for some positive constants \(c_1,c_2\) and sufficiently large \(n\).

4.1 Cost of Gram-Schmidt?#

  • We’ll count flops (addition, multiplication, division*)

  • *division is special. Since most processors can do an addition, comparison, or multiplication in a single cycle, those are all counted as one flop. But division always takes longer. How much longer depends on the processor. (Also division can be performed as multiplication by the inverse).

  • Inner product \(\sum_{i=1}^m x_i y_i\)?

  • Vector “axpy”: \(y_i = a x_i + y_i\), \(i \in [1, 2, \dotsc, m]\).

  • In the function gram_schmidt_naive, let’s look at the inner loop (a worked flop count follows the code):

for k in 1:j-1
    r = Q[:,k]' * v
    v -= Q[:,k] * r
    R[k,j] = r
end
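
Answering the two flop-count prompts above: an inner product of length-\(m\) vectors costs \(m\) multiplications and \(m-1\) additions, about \(2m\) flops, and an axpy costs \(2m\) flops as well. For column \(j\), the inner loop above performs \(j-1\) inner products and \(j-1\) axpys, so the whole factorization costs

\[ \operatorname{cost}(\texttt{gram\_schmidt\_naive}) \approx \sum_{j=1}^{n} 4 m (j-1) \approx 2 m n^2 . \]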

Counting flops is a bad model#

  • In the inner loop we load a single matrix entry (8 bytes) and do 2 flops (add + multiply). That’s an arithmetic intensity of 0.25 flops/byte.

  • Current hardware can do about 10 flops per byte, so our best algorithms will run at about 2.5% efficiency.

  • Need to focus on memory bandwidth, not flops.

Dense matrix-matrix multiply libraries#

4.2 Cost of QR (by Householder)#

Having factored \(A = QR\), we solve \(R x = Q^T b\).

  • QR factorization costs \(2 m n^2 - \frac 2 3 n^3\) operations and is done once per matrix \(A\).

  • Computing \(Q^T b\) costs \(4 (m-n)n + 2 n^2 = 4 mn - 2n^2\) (applying the elementary reflectors, which is stable and requires less storage than forming \(Q\) explicitly).

  • Solving with \(R\) costs \(n^2\) operations. Total cost per right hand side is thus \(4 m n - n^2\).

This method is stable and accurate.
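
Julia's built-in qr uses Householder reflections; here is a minimal sketch of the factor-once, solve-per-right-hand-side workflow (the matrix A_ls and vector b_ls are placeholders I introduce for illustration):

m, n = 100, 20
A_ls = randn(m, n); b_ls = randn(m)   # placeholder least-squares problem
F = qr(A_ls)                          # Householder QR: ~2mn^2 - (2/3)n^3 flops, once per matrix
x_qr = F \ b_ls                       # applies Q^T to b and back-substitutes with R
@show x_qr ≈ F.R \ (F.Q' * b_ls)[1:n] # the same two steps spelled out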

4.3 Cost of Cholesky#

The mathematically equivalent system \((A^T A) x = A^T b\) is known as the normal equations. The solution process involves factoring the symmetric and positive definite \(n\times n\) matrix \(A^T A\).

  • Computing \(A^T A\) costs \(m n^2\) flops, exploiting symmetry.

  • Factoring \(A^T A = R^T R\) costs \(\frac 1 3 n^3\) flops. The total factorization cost is thus \(m n^2 + \frac 1 3 n^3\).

  • Computing \(A^T b\) costs \(2 m n\).

  • Solving with \(R^T\) costs \(n^2\).

  • Solving with \(R\) costs \(n^2\). Total cost per right hand side is thus \(2 m n + 2 n^2\).

The product \(A^T A\) is ill-conditioned: \(\kappa(A^T A) = \kappa(A)^2\), which can reduce the accuracy of a least squares solution.
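
A sketch of the normal-equations approach on the same placeholder problem (reusing A_ls and b_ls from the QR sketch above; Symmetric just records that the product is symmetric):

G = Symmetric(A_ls' * A_ls)           # forming A'A: ~m n^2 flops, exploiting symmetry
R = cholesky(G).U                     # A'A = R'R: ~(1/3) n^3 flops
x_chol = R \ (R' \ (A_ls' * b_ls))    # 2mn + 2n^2 flops per right-hand side
@show x_chol ≈ A_ls \ b_ls            # agrees here because A_ls is well conditioned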

4.4 Cost of SVD#

  • We saw that the SVD exists for all matrices (including non-square and rank-deficient matrices)

  • \(U,V\) have orthonormal columns (while \(X\) can be arbitrarily ill-conditioned).

  • Indeed, if a matrix is symmetric and positive definite (all positive eigenvalues), then \(U=V\) and \(\Sigma = \Lambda\).

Computing an SVD requires a somewhat complicated iterative algorithm, but a crude estimate of the cost is \(2 m n^2 + 11 n^3\). Note that this is similar to the cost of \(QR\) when \(m \gg n\), but much more expensive for square matrices.

Once you have the SVD of a matrix \(A\), you can solve the linear system \(Ax=b\), as sketched below.

Solving it with the SVD involves:

  • Compute \(U^T b\) at a cost of \(2 m n\).

  • Solve with the diagonal \(n\times n\) matrix \(\Sigma\) at a cost of \(n\).

  • Apply \(V\) at a cost of \(2 n^2\). The total cost per right hand side is thus \(2 m n + 2n^2\) (not bad at all).
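
Here is a minimal sketch of those three steps (again on the placeholder A_ls and b_ls from above; svd returns the thin factorization for a tall matrix):

U, S, V = svd(A_ls)                # thin SVD: U is m×n, S holds the n singular values
x_svd = V * ((U' * b_ls) ./ S)     # U'b: 2mn flops, divide by Σ: n flops, apply V: 2n^2 flops
@show x_svd ≈ A_ls \ b_ls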

SVD gives the unique minimum norm solution when \(A\) is rank deficient#

Observation:#

Orthogonal transformations don’t affect singular values (or conditioning).
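
A quick numerical check (a minimal sketch; Q here is a random orthogonal matrix extracted from a QR factorization, introduced just for this check):

A2 = randn(4, 3)
Q = Matrix(qr(randn(4, 4)).Q)         # a random orthogonal matrix
@show svdvals(Q * A2) ≈ svdvals(A2)   # singular values (hence conditioning) unchanged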

Recall the “Geometry of the Singular Value Decomposition” activity:

default(aspect_ratio=:equal)

function peanut()
    theta = LinRange(0, 2*pi, 50)
    r = 1 .+ .4*sin.(3*theta) + .6*sin.(2*theta)
    r' .* [cos.(theta) sin.(theta)]'
end

function circle()
    theta = LinRange(0, 2*pi, 50)
    [cos.(theta) sin.(theta)]'
end

function Aplot(A)
    "Plot a transformation from X to Y, first applied to a peanut shape and then to a circle"
    X = peanut()
    Y = A * X
    p = scatter(X[1,:], X[2,:], label="in")
    scatter!(p, Y[1,:], Y[2,:], label="out")
    X = circle()
    Y = A * X
    q = scatter(X[1,:], X[2,:], label="in")
    scatter!(q, Y[1,:], Y[2,:], label="out")
    plot(p, q, layout=2)
end
Aplot (generic function with 1 method)
U, S, V = svd(randn(2,2))
@show S
@show S[1] / S[end]
@show diagm(S) # create diagonal matrix with S values
Aplot(diagm(S))
S = [1.9387613583851788, 0.11322497393989903]
S[1] / S[end] = 17.12308946447003
diagm(S) = [1.9387613583851788 0.0; 0.0 0.11322497393989903]