36) Differentiation#

Last time#

  • Gradient descent

  • Nonlinear models

Today#

  1. Computing derivatives
    1.1 Numeric
    1.2 Analytic by hand
    1.3 Algorithmic (automatic) differentiation

  2. Recap on Finite Differences

  3. Ill-conditioned optimization

1. Computing derivatives#

We know the definition of the derivative as a limit of difference quotients from calculus:

\[\lim_{h\to 0} \frac{f(x+h) - f(x)}{h}\]
  • How should we choose \(h\)?

Taylor series#

Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood. In particular,

\[ f(x+h) = f(x) + f'(x) h + f''(x) \frac{h^2}{2!} + \underbrace{f'''(x) \frac{h^3}{3!} + \dotsb}_{O(h^3)} . \]

The big-\(O\) notation is meant in the limit \(h\to 0\). Specifically, a function \(g(h) \in O(h^p)\) (sometimes written \(g(h) = O(h^p)\)) when there exists a constant \(C\) such that

\[ \lvert g(h) \rvert \le C h^p \]
for all sufficiently small \(h\).
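As a quick check (a minimal sketch), we can verify numerically that the remainder of the quadratic Taylor model of \(\sin\) behaves like \(O(h^3)\): the ratio of the remainder to \(h^3\) approaches the constant \(f'''(x)/3! = -\cos(x)/6\).

# Sketch: the remainder of the quadratic Taylor model of sin is O(h^3);
# remainder / h^3 should approach f'''(x)/3! = -cos(x)/6 ≈ -0.116 at x = 0.8
x = 0.8
for h in (1e-1, 1e-2, 1e-3)
    remainder = sin(x + h) - (sin(x) + cos(x) * h - sin(x) * h^2 / 2)
    println("h = $h    remainder / h^3 = $(remainder / h^3)")
end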

Rounding error#

We have an additional source of error, rounding error, which comes from not being able to compute \(f(x)\) or \(f(x+h)\) exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of \(\epsilon_{\text{machine}}\). This leads to

\[\begin{split} \begin{split} \tilde f(x) &= f(x)(1 + \epsilon_1) \\ \tilde f(x \oplus h) &= \tilde f((x+h)(1 + \epsilon_2)) \\ &= f((x + h)(1 + \epsilon_2))(1 + \epsilon_3) \\ &= [f(x+h) + f'(x+h)(x+h)\epsilon_2 + O(\epsilon_2^2)](1 + \epsilon_3) \\ &= f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) \end{split} \end{split}\]

Tedious error propagation#

\[\begin{split} \begin{split} \left\lvert \frac{\tilde f(x+h) \ominus \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert &= \left\lvert \frac{f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) - f(x)(1 + \epsilon_1) - f(x+h) + f(x)}{h} \right\rvert \\ &\le \frac{|f(x+h)\epsilon_3| + |f'(x+h)x\epsilon_2| + |f(x)\epsilon_1| + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}}h)}{h} \\ &\le \frac{\bigl(2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f' x|\bigr) \epsilon_{\text{machine}} + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)}{h} \\ &= \bigl(2\max|f| + \max|f'x|\bigr) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}}) \end{split} \end{split}\]

where we have assumed that \(h \ge \epsilon_{\text{machine}}\). This error becomes large relative to \(f'\) (we are concerned with relative error, after all).

What can be problematic (see the sketch after this list):

  • \(f\) is large compared to \(f'\)

  • \(x\) is large

  • \(h\) is too small
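To see the trade-off between truncation error (which shrinks with \(h\)) and rounding error (which grows as \(h\) shrinks), here is a minimal sketch that scans \(h\) for \(\frac{d}{dx} \sin\) at \(x = 1\):

# Sketch: total error of the forward difference for d/dx sin at x = 1.
# Truncation error dominates for large h; rounding error (~eps/h) dominates for small h.
for h in 10.0 .^ (-1:-2:-15)
    err = (sin(1 + h) - sin(1)) / h - cos(1)
    println("h = $h    error = $err")
end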

Automatic step size selection#

References on numerical optimization:

  • Walker and Pernice

  • Dennis and Schnabel

# compute derivative via a forward first-order finite difference scheme
diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h

# and with automatic selection of h
function diff_wp(f, x; h=1e-8)
    # diff using the Walker and Pernice (1998) choice of step
    h *= (1 + abs(x))
    (f(x+h) - f(x)) / h
end

# Let's try it!
x = 1000
@show diff(sin, x) - cos(x) # with first-order FD scheme
@show diff_wp(sin, x) - cos(x) # with automatic selection of h
diff(sin, x) - cos(x) = 4.444392023295052e-7
diff_wp(sin, x) - cos(x) = -4.139506408429305e-6
-4.139506408429305e-6

1.1 Symbolic differentiation#

We can also use a package to symbolically differentiate like we would by hand. (You might have used similar packages in MATLAB, Mathematica, or Maple).

using Pkg
Pkg.add("Symbolics")

using Symbolics

@variables x
Dx = Differential(x)

y = sin(x)
Dx(y)
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
\[ \begin{equation} \frac{\mathrm{d} \sin\left( x \right)}{\mathrm{d}x} \end{equation} \]
expand_derivatives(Dx(y))
\[ \begin{equation} \cos\left( x \right) \end{equation} \]
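To evaluate the symbolic derivative at a numeric point, we can substitute a value for \(x\) (a minimal sketch; substitute is part of the Symbolics API, but consult the package documentation for details):

# Sketch: substitute a numeric value into the symbolic derivative
dydx = expand_derivatives(Dx(y))    # cos(x), symbolically
substitute(dydx, Dict(x => 1.0))    # ≈ cos(1.0) ≈ 0.5403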

Awesome, what about products? This package can apply the product rule (together with the chain rule).

y = x
for _ in 1:2
    y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
\[ \begin{equation} \frac{\left( \frac{\cos\left( x^{\pi} \right)}{x} - 3.1416 x^{2.1416} \log\left( x \right) \sin\left( x^{\pi} \right) \right) \cos\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right)}{\log\left( x \right) \cos\left( x^{\pi} \right)} - \left( \frac{3.1416 \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{2.1416}}{x} - 9.8696 \cos^{2.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} x^{2.1416} \sin\left( x^{\pi} \right) \right) \sin\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right) \log\left( \log\left( x \right) \cos\left( x^{\pi} \right) \right) \end{equation} \]
  • The size of these expressions can grow exponentially (a quick check follows below)
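A rough way to see this growth (a sketch: we simply measure the printed length of the derivative as the nesting depth increases):

# Sketch: printed size of the symbolic derivative as the nesting depth grows
for depth in 1:3
    y = x
    for _ in 1:depth
        y = cos(y^pi) * log(y)
    end
    println("depth $depth: ", length(string(expand_derivatives(Dx(y)))), " characters")
end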

1.2 Hand-coding (analytic) derivatives#

With (mild) algebra abuse, the expression

\[ \frac{df}{dx} = f'(x) \]

is equivalent to

\[ df = f'(x) dx \]
function f(x)
    y = x
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        y = b * c # product
    end
    y
end

f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
    y = x
    dy = dx
    for _ in 1:2
        a = y ^ pi
        da = pi * y^(pi - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = dy / y
        y = b * c # product
        dy = db * c + b * dc # product rule for differentiation
    end
    y, dy
end

df(1.9, 1)
(-1.5346823414986814, -34.032419599140475)

Forward vs reverse mode#

We can differentiate a composition \(h(g(f(x)))\) as

\[\begin{align} \operatorname{d} h &= h' \operatorname{d} g \\ \operatorname{d} g &= g' \operatorname{d} f \\ \operatorname{d} f &= f' \operatorname{d} x. \end{align}\]

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

\[ \operatorname d h = \frac{dh}{dg} \left(\frac{dg}{df} \left(\frac{df}{dx} \operatorname d x \right) \right) .\]

The expression means the same thing if we rearrange the parentheses,

\[ \operatorname d h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname d x \]

which we can compute in reverse order via

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]

A reverse mode example#


Let’s see a worked example:

\[ z = x y + \sin(x) \]

In pseudo-code, this would be represented as:

x = ?
y = ?
a = x * y
b = sin(x)
z = a + b

Evaluating the derivative in the forward mode gives

dx = ?
dy = ?
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db

and evaluating in the reverse mode gives

gz = ?
gb = gz
ga = gz
gy = x * ga
gx = y * ga + cos(x) * gb
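As a check, here is a runnable version of this worked example (a minimal sketch with arbitrary values for x and y); both modes should agree with the analytic derivative \(\partial z/\partial x = y + \cos(x)\):

# Runnable sketch of z = x*y + sin(x): forward and reverse mode by hand
x, y = 1.3, 0.7
# forward mode: seed dx = 1, dy = 0 to get dz/dx
dx, dy = 1.0, 0.0
a = x * y;  da = y * dx + x * dy
b = sin(x); db = cos(x) * dx
z = a + b;  dz = da + db
# reverse mode: seed gz = dz/dz = 1 and propagate adjoints backwards
gz = 1.0
gb = gz
ga = gz
gy = x * ga
gx = y * ga + cos(x) * gb
(dz, gx, y + cos(x))  # all three numbers should agree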

Let’s see another example in code:

function g(x)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    y
end

# check: value and a finite-difference approximation of the derivative
(g(1.4), diff_wp(g, 1.4))
(-0.32484122107701546, -1.2559760384500684)
# backward mode derivative
function gback(x, y_)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    # backward pass
    c_ = y_ * b
    b_ = c * y_
    a_ = -sin(a) * b_
    x_ = 1/x * c_ + pi * x^(pi-1) * a_
    x_
end
gback(1.4, 1)
-1.2559761698835525

1.3 Automatic differentiation: Zygote.jl#

using Pkg
Pkg.add("Zygote")

import Zygote
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
Zygote.gradient(f, 1.9)
(-34.03241959914049,)
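Zygote can also differentiate functions of several arguments, returning one gradient per argument; here is a quick check on the worked example \(z = x y + \sin(x)\):

# gradient with respect to both arguments: (y + cos(x), x), here at (1.3, 0.7)
Zygote.gradient((x, y) -> x * y + sin(x), 1.3, 0.7)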
g(x) = exp(x) + x^2

# Let's look at the LLVM bitcode here
@code_llvm Zygote.gradient(g, 1.9)
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:152 within `gradient`
define [1 x double] @julia_gradient_4263(double %0) #0 {
top:
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:153 within `gradient`
; ┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:94 within `pullback` @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:96
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface2.jl:81 within `_pullback`
; │││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface2.jl within `macro expansion`
; ││││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/chainrules.jl:234 within `chain_rrule`
; │││││┌ @ /home/vbarra/.julia/packages/ChainRulesCore/Vsbj9/src/rules.jl:138 within `rrule` @ /home/vbarra/.julia/packages/ChainRules/14CDN/src/rulesets/Base/fastmath_able.jl:56
        %1 = call double @j_exp_4265(double %0)
; └└└└└└
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:154 within `gradient`
; ┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:97 within `#88`
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/chainrules.jl:222 within `ZBack`
; │││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %2 = fmul double %0, 2.000000e+00
; │└└└└
; │┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/lib/lib.jl:9 within `accum`
; ││┌ @ float.jl:409 within `+`
     %3 = fadd double %2, %1
; └└└
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:155 within `gradient`
  %unbox3.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %3, 0
  ret [1 x double] %unbox3.fca.0.insert
}

How does Zygote work?#

Comparing the compiled code for square with the compiled code of its Zygote gradient shows that Zygote transforms the program at compile time: the gradient of square compiles down to a single multiply by 2.0.

square(x) = x^2

# Let's look at the LLVM bitcode here
@code_llvm square(1.5)
;  @ In[12]:1 within `square`
define double @julia_square_4284(double %0) #0 {
top:
; ┌ @ intfuncs.jl:332 within `literal_pow`
; │┌ @ float.jl:411 within `*`
    %1 = fmul double %0, %0
    ret double %1
; └└
}
@code_llvm Zygote.gradient(square, 1.5)
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:152 within `gradient`
define [1 x double] @julia_gradient_4288(double %0) #0 {
top:
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:154 within `gradient`
; ┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:97 within `#88`
; │┌ @ In[12]:1 within `square`
; ││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/chainrules.jl:222 within `ZBack`
; │││┌ @ /home/vbarra/.julia/packages/Zygote/55SqB/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %1 = fmul double %0, 2.000000e+00
; └└└└└
;  @ /home/vbarra/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:155 within `gradient`
  %unbox2.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
  ret [1 x double] %unbox2.fca.0.insert
}

Kinds of algorithmic differentiation#

  • Source transformation: Fortran code in, Fortran code out

    • Duplicates compiler features, usually incomplete language coverage

    • Produces efficient code

  • Operator overloading: C++ types

    • Hard to vectorize

    • Loops are effectively unrolled/inefficient

  • Just-in-time compilation: tightly coupled with compiler

    • JIT lag

    • Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)

    • Some sharp bits in Python’s JAX

Forward or reverse?#

It all depends on the shape of the map: the number of inputs versus the number of outputs.

  • If you have one input, many outputs: use forward mode

    • “One input” can also mean looking in one direction: forward mode computes a directional derivative (a Jacobian-vector product)

  • If you have many inputs, one output: use reverse mode (see the sketch after this list)

    • Will need to traverse execution backwards (“tape”)

    • Hierarchical checkpointing

  • What about square cases (same number of input/output)? Forward is usually a bit more efficient.
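For instance, a scalar loss of many parameters is exactly the reverse-mode case (a minimal sketch, reusing Zygote from above):

# scalar output, many inputs: reverse mode computes the whole gradient in one pass
loss(c) = sum(abs2, c) / 2      # gradient is simply c
Zygote.gradient(loss, [1.0, 2.0, 3.0])[1]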

Can you differentiate an algorithm?#

Examples:

  • Optimization (or rootfinding): input \(c\), output \(x\) such that \(f(x,c)=0\) (for optimization, this is the first-order optimality condition)

  • Finding eigenvalues/eigenvectors: input \(A\), output \(\lambda\) such that \(Ax = \lambda x\) for some nonzero vector \(x\)

2. Finite Differences (FD)#

Finite Differences#

To approximate derivatives, we use Taylor expansions:

\[ u_{i+1} = u(x_i + \Delta x) \approx u(x_i) + \Delta x \partial_x u \bigg\rvert_i + \frac{1}{2} \Delta x^2 \partial^2_x u \bigg\rvert_i + \frac{1}{6} \Delta x^3 \partial^3_x u \bigg\rvert_i + O(\Delta x^4) \]

Similarly,

\[ u_{i-1} = u(x_i - \Delta x) \approx u(x_i) - \Delta x \partial_x u \bigg\rvert_i + \frac{1}{2} \Delta x^2 \partial^2_x u \bigg\rvert_i - \frac{1}{6} \Delta x^3 \partial^3_x u \bigg\rvert_i + O(\Delta x^4) \]

Consider a uniformly discretized domain with uniform subinterval width \(h \equiv \Delta x\):

Figure: a 1D uniformly discretized domain.

We can define the first-order derivative at the point \(x_i\) using the forward difference:

\[ \frac{\partial u_i^F}{\partial x} \approx \frac{ u_{i+1} - u_{i}}{\Delta x} + O(\Delta x) \]

This is a first-order approximation.

Similarly, we can define the first-order derivative at the point \(x_i\) using the backward difference:

\[ \frac{\partial u_{i}^B}{\partial x} \approx \frac{ u_{i} - u_{i-1}}{\Delta x} + O(\Delta x) \]

This is a first-order approximation.

Using one point to the right and one point to the left of \(x_i\), we obtain the centered difference approximation for the first-order derivative at the point \(x_i\):

\[ \frac{\partial u_{i}^C}{\partial x} \approx \frac{ u_{i+1} - u_{i-1}}{2 \Delta x} + O(\Delta x^2) \]

This is a second-order approximation.

Thus we note that the centered difference approximates the first derivative with respect to \(x\) more accurately than either of the one-sided differences: \(O(\Delta x^2)\) versus \(O(\Delta x)\).

We can now define the second-order derivative at the point \(x_i\) using a centered difference formula:

\[ \frac{\partial^2 u^C_i}{\partial x^2} = \frac{\frac{\partial u_{i+\frac{1}{2}}^F}{\partial x} -\frac{ \partial u_{i-\frac{1}{2}}^B}{\partial x} }{\Delta x} \approx \frac{u_{i-1} -2 u_i + u_{i+1}}{\Delta x^2} + O(\Delta x^2)\]
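A quick numerical check of these orders (a minimal sketch using \(u = \sin\) at \(x = 1\); the forward-difference error should shrink like \(\Delta x\), the centered and second-derivative errors like \(\Delta x^2\)):

# Sketch: observed accuracy of the finite difference formulas for u = sin at x = 1
u = sin
x = 1.0
for dx in (1e-1, 1e-2, 1e-3)
    forward  = (u(x + dx) - u(x)) / dx - cos(x)                     # O(dx)
    centered = (u(x + dx) - u(x - dx)) / (2 * dx) - cos(x)          # O(dx^2)
    second   = (u(x - dx) - 2 * u(x) + u(x + dx)) / dx^2 + sin(x)   # O(dx^2), since u'' = -sin
    println("dx = $dx   forward error = $forward   centered error = $centered   second-derivative error = $second")
end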

3. Ill-conditioned optimization#

\[ L(c; x, y) = \frac 1 2 \lVert \underbrace{f(x, c) - y}_{r(c)} \rVert^2 \]

Recall that the computation of the gradient of the loss function \(L\) requires the Jacobian \(J\) of the model \(f\) with respect to the constants \(c\).

\[ g(c) = \nabla_c L = r^T \underbrace{\nabla_c f}_{J} \]

We can find the constants \(c\) for which \(g(c) = 0\) using a Newton method

\[ g(c + \delta c) = g(c) + \underbrace{\nabla_c g}_H\delta c + O((\delta c)^2) \]

The Hessian requires the second derivative of \(f\), which can be onerous to compute; the methods below approximate or drop that term.

\[ H = J^T J + r^T (\nabla_c J)\]

Newton-like methods for optimization#

We want to solve

\[ H \delta c = -g(c) \]

Update

\[ c \gets c + \gamma \delta c \]

using a line search or trust region.

  • Gauss-Newton: \(H = J^T J\)

  • Levenberg-Marquardt: \(H = J^T J + \alpha I\) (a minimal sketch of both updates follows below)
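Here is a minimal sketch of these updates on a tiny synthetic fit (illustrative only; the residual \(r(c)\) and its Jacobian \(J(c)\) are hand-coded, but could also come from automatic differentiation, e.g. Zygote.jacobian):

using LinearAlgebra

# one Newton-like step for L(c) = ½‖r(c)‖²; α = 0 gives Gauss-Newton, α > 0 gives Levenberg-Marquardt
function newton_like_step(r, J, c; α=0.0)
    rc, Jc = r(c), J(c)
    H = Jc' * Jc + α * I      # approximate Hessian
    g = Jc' * rc              # gradient of the loss
    c - H \ g                 # solve H δc = -g and update
end

# tiny synthetic fit: data from y = 2 exp(x/2), model f(x, c) = c[1] * exp(c[2] * x)
xdata = range(0, 1, length=20)
ydata = 2 .* exp.(0.5 .* xdata)
r(c) = c[1] .* exp.(c[2] .* xdata) .- ydata
J(c) = hcat(exp.(c[2] .* xdata), c[1] .* xdata .* exp.(c[2] .* xdata))
c = [1.0, 0.0]
for _ in 1:20
    c = newton_like_step(r, J, c; α=1e-4)
end
c   # should approach [2.0, 0.5]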

Outlook on optimization#

  • The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.

  • The Gauss-Newton method is often more practical than Newton while being faster than gradient descent, though it lacks robustness.

  • The Levenberg-Marquardt method provides a sort of middle-ground between Gauss-Newton and gradient descent.

  • Many globalization techniques are used for models that possess many local minima.

  • One pervasive approach is stochastic gradient descent, where small batches (e.g., 1, 10, or 20 observations) are selected randomly from the corpus of observations (500 in the example we've seen with many realizations), and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima (a minimal sketch follows at the end of this list).

  • Among expressive models \(f(x,c)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).

  • Classification is a very similar problem to regression, but the observations \(y\) are discrete; thus, in this case

    • models \(f(x,c)\) must have discrete output

    • the least squares loss function is not appropriate.

  • Reading: Why momentum really works
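Here is a minimal sketch of the mini-batch idea above (illustrative only; batch_grad(c, xb, yb) is a hypothetical helper returning the gradient of the loss restricted to a mini-batch, e.g. computed with Zygote):

# Sketch: stochastic gradient descent over random mini-batches
function sgd(batch_grad, c, xdata, ydata; batchsize=10, lr=1e-2, steps=500)
    n = length(xdata)
    for _ in 1:steps
        idx = rand(1:n, batchsize)                  # sample a mini-batch (with replacement)
        c = c .- lr .* batch_grad(c, xdata[idx], ydata[idx])
    end
    c
end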