37) Differentiation#

Last time#

  • Gradient descent

  • Nonlinear models

Today#

  1. Computing derivatives
    1.1 Numeric
    1.2 Symbolic
    1.3 Analytic by hand
    1.4 Algorithmic (automatic) differentiation

  2. Recap on Finite Differences

  3. Ill-conditioned optimization

1. Computing derivatives#

1.1 Numeric differentiation#

We know the definition of the derivative as the limit of a difference quotient from Calculus:

\[\lim_{h\to 0} \frac{f(x+h) - f(x)}{h}\]
  • How should we choose \(h\)?

Taylor series#

Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood. In particular,

\[ f(x+h) = f(x) + f'(x) h + f''(x) \frac{h^2}{2!} + \underbrace{f'''(x) \frac{h^3}{3!} + \dotsb}_{O(h^3)} . \]

The big-\(O\) notation is meant in the limit \(h\to 0\). Specifically, a function \(g(h) \in O(h^p)\) (sometimes written \(g(h) = O(h^p)\)) when there exists a constant \(C\) such that

\[ |g(h)| \le C h^p \]
for all sufficiently small \(h\).
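Substituting the Taylor expansion into the one-sided difference quotient makes the truncation (discretization) error explicit:

\[ \frac{f(x+h) - f(x)}{h} = f'(x) + f''(x) \frac{h}{2} + O(h^2) , \]

so, before any rounding is considered, the error of this approximation is \(O(h)\).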

Rounding error#

We have an additional source of error, rounding error, which comes from not being able to compute \(f(x)\) or \(f(x+h)\) exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of \(\epsilon_{\text{machine}}\). This leads to

\[\begin{split} \begin{split} \tilde f(x) &= f(x)(1 + \epsilon_1) \\ \tilde f(x \oplus h) &= \tilde f((x+h)(1 + \epsilon_2)) \\ &= f((x + h)(1 + \epsilon_2))(1 + \epsilon_3) \\ &= [f(x+h) + f'(x+h)(x+h)\epsilon_2 + O(\epsilon_2^2)](1 + \epsilon_3) \\ &= f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) \end{split} \end{split}\]

Tedious error propagation#

\[\begin{split} \begin{split} \left\lvert \frac{\tilde f(x+h) \ominus \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert &= \left\lvert \frac{f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) - f(x)(1 + \epsilon_1) - f(x+h) + f(x)}{h} \right\rvert \\ &\le \frac{|f(x+h)\epsilon_3| + |f'(x+h)x\epsilon_2| + |f(x)\epsilon_1| + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}}h)}{h} \\ &\le \frac{\big(2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f' x|\big) \epsilon_{\text{machine}} + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)}{h} \\ &= \big(2\max|f| + \max|f'x|\big) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}}) \\ \end{split} \end{split}\]

where we have assumed that \(h \ge \epsilon_{\text{machine}}\). This rounding contribution becomes large relative to \(f'\) (we are, after all, concerned with relative error) in the following situations; a small numerical sketch follows the list.

  • \(f\) is large compared to \(f'\)

  • \(x\) is large

  • \(h\) is too small
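A minimal sketch of the trade-off between truncation and rounding error: measure the error of the one-sided difference of \(\sin\) at \(x = 1\) for a few step sizes. The error is typically smallest near \(h \approx \sqrt{\epsilon_{\text{machine}}}\) and grows for both larger and smaller \(h\).

for h in [1e-2, 1e-4, 1e-8, 1e-12, 1e-15]
    err = abs((sin(1 + h) - sin(1)) / h - cos(1))   # total (truncation + rounding) error
    println("h = $h   error = $err")
end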

Automatic step size selection#

Reference: Numerical Optimization.

  • Walker and Pernice

  • Dennis and Schnabel

diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h   # naive one-sided difference with a fixed step

"""Diff using the Walker and Pernice (1998) choice of step"""
function diff_wp(f, x; h=1e-8)
    h *= (1 + abs(x))   # scale the step with |x| to control rounding error
    (f(x+h) - f(x)) / h
end

x = 1000
diff_wp(sin, x) - cos(x)
-4.139506408429305e-6

1.2 Symbolic differentiation#

using Pkg
Pkg.add("Symbolics")

using Symbolics

@variables x
Dx = Differential(x)

y = sin(x)
Dx(y)
\[ \begin{equation} \frac{\mathrm{d} \sin\left( x \right)}{\mathrm{d}x} \end{equation} \]
expand_derivatives(Dx(y))
\[ \begin{equation} \cos\left( x \right) \end{equation} \]

Awesome, what about products?

y = x
for _ in 1:2
    y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
\[ \begin{equation} \frac{\left( \frac{\cos\left( x^{\pi} \right)}{x} - 3.1416 x^{2.1416} \log\left( x \right) \sin\left( x^{\pi} \right) \right) \cos\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right)}{\log\left( x \right) \cos\left( x^{\pi} \right)} - \left( \frac{3.1416 \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{2.1416}}{x} - 9.8696 \cos^{2.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} x^{2.1416} \sin\left( x^{\pi} \right) \right) \log\left( \log\left( x \right) \cos\left( x^{\pi} \right) \right) \sin\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right) \end{equation} \]
  • The size of these expressions can grow exponentially (the sketch below gives a rough illustration)
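As a rough sketch of this growth, we can nest the expression a few more times and use the length of the printed derivative as a crude proxy for expression size. The helper `nest` below is a hypothetical name introduced just for this illustration.

# nest(y, depth) applies y ↦ cos(y^π) log(y) `depth` times
nest(y, depth) = depth == 0 ? y : nest(cos(y^pi) * log(y), depth - 1)
# length of the printed derivative for nesting depths 1, 2, 3
[length(string(expand_derivatives(Dx(nest(x, d))))) for d in 1:3]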

1.3 Hand-coding (analytic) derivatives#

\[ df = f'(x) dx \]
function f(x)
    y = x
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        y = b * c
    end
    y
end

f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)   # forward mode: propagate the perturbation dy alongside y
    y = x
    dy = dx
    for _ in 1:2
        a = y ^ pi
        da = pi * y^(pi - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = dy / y
        y = b * c
        dy = db * c + b * dc
    end
    y, dy
end

df(1.9, 1)
(-1.5346823414986814, -34.032419599140475)

We can go the other way#

We can differentiate a composition \(h(g(f(x)))\) as

(19)#\[\begin{align} \operatorname{d} h &= h' \operatorname{d} g \\ \operatorname{d} g &= g' \operatorname{d} f \\ \operatorname{d} f &= f' \operatorname{d} x. \end{align}\]

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

\[ \operatorname d h = \frac{dh}{dg} \left(\frac{dg}{df} \left(\frac{df}{dx} \operatorname d x \right) \right) .\]

The expression means the same thing if we rearrange the parentheses,

\[ \operatorname d h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname d x \]

which we can compute in reverse order via

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]

A reverse mode example#

function g(x)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    y
end
(g(1.4), diff_wp(g, 1.4))
(-0.32484122107701546, -1.2559760384500684)
function gback(x, y_)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    # backward pass
    c_ = y_ * b
    b_ = c * y_
    a_ = -sin(a) * b_
    x_ = 1/x * c_ + pi * x^(pi-1) * a_
    x_
end
gback(1.4, 1)
-1.2559761698835525

1.4 Automatic differentiation: Zygote.jl#

using Pkg
Pkg.add("Zygote")

import Zygote
Zygote.gradient(f, 1.9)
(-34.03241959914049,)
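Zygote can also differentiate with respect to several arguments at once; the two-argument function below is just an illustration.

Zygote.gradient((a, b) -> a * sin(b), 2.0, 0.5)   # one gradient per argument: (sin(0.5), 2*cos(0.5))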
g(x) = exp(x) + x^2
@code_llvm Zygote.gradient(g, 1.9)
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:146 within `gradient`
define [1 x double] @julia_gradient_4660(double %0) #0 {
top:
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:147 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:88 within `pullback` @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:90
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface2.jl:87 within `_pullback`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface2.jl within `macro expansion`
; ││││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:224 within `chain_rrule`
; │││││┌ @ /home/valeria/.julia/packages/ChainRulesCore/6Pucz/src/rules.jl:138 within `rrule` @ /home/valeria/.julia/packages/ChainRules/DbWAz/src/rulesets/Base/fastmath_able.jl:56
        %1 = call double @j_exp_4662(double %0)
; └└└└└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:148 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:91 within `#78`
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:212 within `ZBack`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %2 = fmul double %0, 2.000000e+00
; │└└└└
; │┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/lib.jl:17 within `accum`
; ││┌ @ float.jl:409 within `+`
     %3 = fadd double %2, %1
; └└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:149 within `gradient`
  %unbox3.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %3, 0
  ret [1 x double] %unbox3.fca.0.insert
}

How does Zygote work?#

square(x) = x^2
@code_llvm square(1.5)
;  @ In[12]:1 within `square`
define double @julia_square_4684(double %0) #0 {
top:
; ┌ @ intfuncs.jl:332 within `literal_pow`
; │┌ @ float.jl:411 within `*`
    %1 = fmul double %0, %0
    ret double %1
; └└
}
@code_llvm Zygote.gradient(square, 1.5)
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:146 within `gradient`
define [1 x double] @julia_gradient_4688(double %0) #0 {
top:
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:148 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:91 within `#78`
; │┌ @ In[12]:1 within `square`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:212 within `ZBack`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %1 = fmul double %0, 2.000000e+00
; └└└└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:149 within `gradient`
  %unbox2.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
  ret [1 x double] %unbox2.fca.0.insert
}

Kinds of algorithmic differentiation#

  • Source transformation: Fortran code in, Fortran code out

    • Duplicates compiler features, usually incomplete language coverage

    • Produces efficient code

  • Operator overloading: C++ types

    • Hard to vectorize

    • Loops are effectively unrolled/inefficient

  • Just-in-time compilation: tightly coupled with compiler

    • JIT lag

    • Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)

    • Some sharp bits in Python’s JAX

Forward or reverse?#

It all depends on the shape of the Jacobian, i.e., the number of inputs versus outputs.

  • If you have one input, many outputs: use forward mode

    • “One input” can also mean differentiating along a single direction (a directional derivative)

  • If you have many inputs, one output: use reverse mode

    • Will need to traverse execution backwards (“tape”)

    • Hierarchical checkpointing

  • What about square cases (same number of inputs and outputs)? Forward is usually a bit more efficient; a small comparison sketch follows below.
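A minimal comparison sketch, assuming the ForwardDiff and Zygote packages are available; the function `h` below is just an illustration with several inputs and one output.

import ForwardDiff, Zygote

h(v) = exp(v[1]) * sum(sin.(v))   # several inputs, one scalar output
v = randn(5)

ForwardDiff.gradient(h, v)   # forward mode: one sweep per input direction
Zygote.gradient(h, v)[1]     # reverse mode: a single backward sweep for all inputs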

Can you differentiate an algorithm?#

Examples:

  • Optimization: Input \(c\), output \(x\) such that \(f(x,c)=0\) (see the sketch after this list)

  • Finding eigenvalues/eigenvectors: Input \(A\), output \(\lambda\) such that \(Ax = \lambda x\) for some nonzero vector
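For the first example, a short sketch using the implicit function theorem (assuming \(\partial f/\partial x\) is invertible at the solution) shows that the derivative of the output is well defined without differentiating through the iterations of the solver. Differentiating \(f\big(x(c), c\big) = 0\) with respect to \(c\) gives

\[ \frac{\partial f}{\partial x} \frac{d x}{d c} + \frac{\partial f}{\partial c} = 0 \qquad\Longrightarrow\qquad \frac{d x}{d c} = -\left(\frac{\partial f}{\partial x}\right)^{-1} \frac{\partial f}{\partial c} . \]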

2. Recap on Finite Differences#

Finite Differences#

To derive finite-difference approximations of derivatives, we use Taylor expansion:

\[ u_{i+1} = u(x_i + \Delta x) = u(x_i) + \Delta x \partial_x u \bigg\rvert_i + \frac{1}{2} \Delta x^2 \partial^2_x u \bigg\rvert_i + \frac{1}{6} \Delta x^3 \partial^3_x u \bigg\rvert_i + O(\Delta x^4) \]

Similarly,

\[ u_{i-1} = u(x_i - \Delta x) = u(x_i) - \Delta x \partial_x u \bigg\rvert_i + \frac{1}{2} \Delta x^2 \partial^2_x u \bigg\rvert_i - \frac{1}{6} \Delta x^3 \partial^3_x u \bigg\rvert_i + O(\Delta x^4) \]

Consider a uniformly discretized domain (with uniform \(h \equiv \Delta x\) subintervals)

Figure: a 1D uniformly discretized domain.

We can approximate the first derivative at the point \(x_i\) using the forward difference:

\[ \frac{\partial u_i^F}{\partial x} \equiv \frac{ u_{i+1} - u_{i}}{\Delta x} = \frac{\partial u}{\partial x} \bigg\rvert_i + O(\Delta x) \]

this is a first-order approximation.

Similarly, we can approximate the first derivative at the point \(x_i\) using the backward difference:

\[ \frac{\partial u_{i}^B}{\partial x} \equiv \frac{ u_{i} - u_{i-1}}{\Delta x} = \frac{\partial u}{\partial x} \bigg\rvert_i + O(\Delta x) \]

this is also a first-order approximation.

If we use one point to the right and one point to the left of \(x_i\), we obtain the centered difference approximation of the first derivative at \(x_i\):

\[ \frac{\partial u_{i}^C}{\partial x} \equiv \frac{ u_{i+1} - u_{i-1}}{2 \Delta x} = \frac{\partial u}{\partial x} \bigg\rvert_i + O(\Delta x^2) \]

this is a second-order approximation.

Thus we note that the centered difference approximates the first derivative with respect to \(x\) more accurately than either of the one-sided differences: \(O(\Delta x^2)\) versus \(O(\Delta x)\).

We can also approximate the second derivative at the point \(x_i\) with a centered formula, obtained by differencing the first-derivative approximations at the half points \(x_{i+\frac{1}{2}}\) and \(x_{i-\frac{1}{2}}\):

\[ \frac{\partial^2 u^C_i}{\partial x^2} \equiv \frac{\dfrac{u_{i+1} - u_i}{\Delta x} - \dfrac{u_i - u_{i-1}}{\Delta x}}{\Delta x} = \frac{u_{i-1} -2 u_i + u_{i+1}}{\Delta x^2} = \frac{\partial^2 u}{\partial x^2} \bigg\rvert_i + O(\Delta x^2) \]
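A minimal convergence check of the formulas above, using \(u(x) = \sin x\) at \(x_0 = 1\) (any smooth function would do): halving \(\Delta x\) should roughly halve the one-sided errors and quarter the centered ones.

u(x) = sin(x)       # test function
du(x) = cos(x)      # exact first derivative
d2u(x) = -sin(x)    # exact second derivative
x0 = 1.0
for Δx in [0.1, 0.05, 0.025]
    forward  = (u(x0 + Δx) - u(x0)) / Δx                      # O(Δx)
    backward = (u(x0) - u(x0 - Δx)) / Δx                      # O(Δx)
    centered = (u(x0 + Δx) - u(x0 - Δx)) / (2 * Δx)           # O(Δx^2)
    second   = (u(x0 - Δx) - 2 * u(x0) + u(x0 + Δx)) / Δx^2   # O(Δx^2)
    println("Δx = $Δx: ", abs(forward - du(x0)), "  ", abs(backward - du(x0)), "  ",
            abs(centered - du(x0)), "  ", abs(second - d2u(x0)))
end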

3. Ill-conditioned optimization#

\[ L(c; x, y) = \frac 1 2 \lVert \underbrace{f(x, c) - y}_{r(c)} \rVert^2 \]

Recall that computing the gradient of the loss function \(L\) requires the Jacobian, denoted \(J\), of the model \(f\) with respect to the constants \(c\).

\[ g(c) = \nabla_c L = r^T \underbrace{\nabla_c f}_{J} \]

We can find the constants \(c\) for which \(g(c) = 0\) using a Newton method

\[ g(c + \delta c) = g(c) + \underbrace{\nabla_c g}_H\delta c + O((\delta c)^2) \]

The Hessian requires the second derivative of \(f\), which can cause problems.

\[ H = J^T J + r^T (\nabla_c J)\]

Newton-like methods for optimization#

We want to solve

\[ H \delta c = -g(c) \]

and update

\[c \gets c + \gamma \delta c\]
where \(\gamma\) is chosen using a line search or trust region.

  • Gauss-Newton: \(H = J^T J\)

  • Levenberg-Marquardt: \(H = J^T J + \alpha I\) (a sketch of one such step follows below)
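A minimal sketch of one such step, assuming the user supplies a residual function `residual(c)` and its Jacobian `jac(c)` (these names are placeholders, not part of the lecture code):

using LinearAlgebra

function newton_like_step(residual, jac, c; α=0.0)
    r = residual(c)
    J = jac(c)
    H = J' * J + α * I   # α = 0 gives Gauss-Newton; α > 0 gives Levenberg-Marquardt damping
    g = J' * r           # gradient of (1/2)‖r‖²
    δc = -(H \ g)
    c + δc               # in practice, scale the step by γ from a line search or trust region
end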

Outlook on optimization#

  • The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.

  • The Gauss-Newton method is often more practical than Newton while being faster than gradient descent, though it lacks robustness.

  • The Levenberg-Marquardt method provides a sort of middle-ground between Gauss-Newton and gradient descent.

  • Many globalization techniques are used for models that possess many local minima.

  • One pervasive approach is stochastic gradient descent, where small batches (e.g., 1, 10 or 20) are selected randomly from the corpus of observations (500 in the example we’ve seen with many realizations), and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima.

  • Among expressive models \(f(x,c)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).

  • Classification is a very similar problem to regression, but the observations \(y\) are discrete, thus, in this case

    • models \(f(x,c)\) must have discrete output

    • the least squares loss function is not appropriate.

  • Reading: Why momentum really works