12) Newton methods#

Last time#

  • Convergence classes

  • Intro to Newton methods

Today#

  1. Newton-Raphson Method
    1.1 An implementation

  2. Convergence of fixed-point (by mean value theorem)

  3. Convergence of fixed-point (by Taylor series)

  4. A fresh derivation of Newton’s method via convergence theory

using Plots
default(linewidth=4, legendfontsize=12)

f(x) = cos(x) - x
hasroot(f, a, b) = f(a) * f(b) < 0  # a sign change on [a, b] means a root is bracketed
function bisect_iter(f, a, b, tol)
    hist = Float64[]
    while abs(b - a) > tol
        mid = (a + b) / 2
        push!(hist, mid)  # record each midpoint so we can study convergence later
        if hasroot(f, a, mid)
            b = mid
        else
            a = mid
        end
    end
    hist
end
bisect_iter (generic function with 1 method)

Convergence classes#

Tip

Check this reading.

A convergent rootfinding algorithm produces a sequence of approximations \(x_k\) such that

\[\lim_{k \to \infty} x_k = x_*\]
where \(f(x_*) = 0\). For analysis, it is convenient to define the errors \(e_k = x_k - x_*\). We say that an iterative algorithm is \(q\)-linearly convergent if
\[\lim_{k \to \infty} |e_{k+1}| / |e_k| = \rho < 1.\]
(The \(q\) is for “quotient”.) A smaller convergence factor \(\rho\) means faster convergence. A slightly weaker condition, \(r\)-linear convergence or simply linear convergence (the “r” here is for “root”), is that
\[ |e_k| \le \epsilon_k \]
for all sufficiently large \(k\) when the sequence \(\{\epsilon_k\}\) converges \(q\)-linearly to 0.

We saw that:

Note

  • Why is \(r\)-convergence a weaker notion than \(q\)-convergence? Root convergence only constrains the overall rate at which the error decreases, while quotient convergence requires the error to shrink at every iteration. Thus \(q\)-convergence is a stronger property than \(r\)-convergence (a small illustration follows this note).

  • \(q\)-convergence implies (\(\implies\)) \(r\)-convergence.
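To make the distinction concrete, here is a small sketch with two made-up error sequences (illustrative only, not produced by any algorithm in this notebook): e_q is \(q\)-linear with \(\rho = 1/2\), while e_r stays under the same envelope \(0.5^k\) but its successive quotients oscillate, so it is \(r\)-linear without being \(q\)-linear.

ks = 1:20
e_q = 0.5 .^ ks                                      # q-linear: |e_{k+1}| / |e_k| = 1/2 for every k
e_r = [isodd(k) ? 0.5^k : 0.25 * 0.5^k for k in ks]  # bounded by the q-linear envelope 0.5^k
e_r[2:end] ./ e_r[1:end-1]                           # quotients alternate between 0.125 and 2: no q-linear limit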

hist = bisect_iter(f, -1, 3, 1e-10)
r = hist[end] # What are we trusting?
hist = hist[1:end-1]
scatter( abs.(hist .- r), yscale=:log10)
ks = 1:length(hist)
ρ = 0.5
plot!(ks, 4 * (ρ .^ ks))

1. Newton-Raphson Method#

Much of numerical analysis reduces to Taylor series, the approximation

\[ f(x) = f(x_0) + f'(x_0) (x-x_0) + f''(x_0) (x - x_0)^2 / 2 + \underbrace{\dotsb}_{O((x-x_0)^3)} \]
centered on some reference point \(x_0\).

In numerical computation, it is exceedingly rare to look beyond the first-order approximation

\[ \tilde f_{x_0}(x) = f(x_0) + f'(x_0)(x - x_0) . \]
Since \(\tilde f_{x_0}(x)\) is a linear function, we can explicitly compute the unique solution of \(\tilde f_{x_0}(x) = 0\) as
\[ x = x_0 - \frac{f(x_0)}{f'(x_0)} . \]
This is Newton’s Method (aka Newton-Raphson or Newton-Raphson-Simpson) for finding the roots of differentiable functions.

1.1 An implementation#

function newton(f, fp, x0; tol=1e-8, verbose=false)
    # Newton-Raphson: replace x by the root of the tangent line at x until |f(x)| < tol
    x = x0
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        if verbose
            println("iteration $k: x=$x,  f(x)=$fx,  f'(x)=$fpx")
        end
        if abs(fx) < tol
            return x, fx, k
        end
        x = x - fx / fpx
    end
    return x, f(x), 100  # not converged within the iteration budget; return the last iterate anyway
end

f(x) = cos(x) - x
fp(x) = -sin(x) - 1
newton(f, fp, 1; tol=1e-15, verbose=true)
iteration 1: x=1,  f(x)=-0.45969769413186023,  f'(x)=-1.8414709848078965
iteration 2: x=0.7503638678402439,  f(x)=-0.018923073822117442,  f'(x)=-1.6819049529414878
iteration 3: x=0.7391128909113617,  f(x)=-4.6455898990771516e-5,  f'(x)=-1.6736325442243012
iteration 4: x=0.739085133385284,  f(x)=-2.847205804457076e-10,  f'(x)=-1.6736120293089505
iteration 5: x=0.7390851332151607,  f(x)=0.0,  f'(x)=-1.6736120291832148
(0.7390851332151607, 0.0, 5)

That’s really fast!#

  • 10 digits of accuracy in 4 iterations.

  • How is this convergence test different from the one we used for bisection?

  • How can this break down?

\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \]
newton(f, fp, -pi/2+.1; verbose=true)
iteration 1: x=-1.4707963267948965,  f(x)=1.5706297434417247,  f'(x)=-0.0049958347219742905
iteration 2: x=312.9170549232224,  f(x)=-312.59435002533314,  f'(x)=-0.05350037037604283
iteration 3: x=-5529.927542752894,  f(x)=5530.676391917825,  f'(x)=-0.33725953180603474
iteration 4: x=10868.945936970244,  f(x)=-10868.376227850798,  f'(x)=-0.17815359146505727
iteration 5: x=-50136.70732252356,  f(x)=50135.70777741902,  f'(x)=-1.0301593101044748
iteration 6: x=-1468.7903453577164,  f(x)=1468.8859787856973,  f'(x)=-1.9954166200403913
iteration 7: x=-732.6603742863299,  f(x)=731.8761094362295,  f'(x)=-1.6204261800544972
iteration 8: x=-281.0038172368656,  f(x)=280.8358913898124,  f'(x)=-1.9857996296872256
iteration 9: x=-139.58174866993488,  f(x)=139.79912372811944,  f'(x)=-0.02391184615360531
iteration 10: x=5706.856156210999,  f(x)=-5707.008659764705,  f'(x)=-1.9883029222393844
iteration 11: x=2836.5648158265976,  f(x)=-2837.5220962814674,  f'(x)=-1.2891610809292957
iteration 12: x=635.503879177757,  f(x)=-634.8839651181479,  f'(x)=-1.784669713127157
iteration 13: x=279.7607629875442,  f(x)=-280.7481464325292,  f'(x)=-0.8416524942745961
iteration 14: x=-53.80700796583557,  f(x)=52.88592082634005,  f'(x)=-1.3893564966145653
iteration 15: x=-15.74196061822768,  f(x)=14.742538472479499,  f'(x)=-1.0339908015219368
iteration 16: x=-1.4840596284124175,  f(x)=1.5706876106501757,  f'(x)=-0.0037592697076601622
iteration 17: x=416.333158287511,  f(x)=-416.4052274406877,  f'(x)=-1.99739963763799
iteration 18: x=207.8594910282616,  f(x)=-206.9888910802111,  f'(x)=-1.4919915959184906
iteration 19: x=69.12620885264326,  f(x)=-68.1262712417355,  f'(x)=-1.0111702413616044
iteration 20: x=1.7525179991632598,  f(x)=-1.933241162917233,  f'(x)=-1.9835340045381016
iteration 21: x=0.7778731690296721,  f(x)=-0.0654654835715871,  f'(x)=-1.7017658368004631
iteration 22: x=0.7394040200105153,  f(x)=-0.0005337303513481828,  f'(x)=-1.6738476794194503
iteration 23: x=0.7390851556610822,  f(x)=-3.756576461011463e-8,  f'(x)=-1.6736120457726615
iteration 24: x=0.7390851332151608,  f(x)=-2.220446049250313e-16,  f'(x)=-1.6736120291832148
(0.7390851332151608, -2.220446049250313e-16, 24)

2. Convergence of fixed-point (by mean value theorem)#

Consider the iteration

\[x_{k+1} = g(x_k)\]
where \(g\) is a continuously differentiable function. Suppose that there exists a fixed point \(x_* = g(x_*)\). By the mean value theorem, we have that
\[ x_{k+1} - x_* = g(x_k) - g(x_*) = g'(c_k) (x_k - x_*) \]
for some \(c_k\) between \(x_k\) and \(x_*\).

Taking absolute values,

\[|e_{k+1}| = |g'(c_k)| |e_k|,\]
which drives the error to zero provided \(|g'(c_k)| \le \rho < 1\) for all \(k\), e.g., when \(|g'| \le \rho < 1\) on an interval around \(x_*\) that contains the iterates.
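For instance, using the problem above: \(g(x) = \cos x\) maps every iterate after the first into \([-1, 1]\), and on that interval \(|g'(c)| = |\sin c| \le \sin 1 \approx 0.84 < 1\), so

\[ |e_{k+1}| = |\sin c_k| \, |e_k| \le (\sin 1) \, |e_k|, \]
and the iteration converges from any starting point.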

3. Convergence of fixed-point (by Taylor series)#

Consider the iteration

\[x_{k+1} = g(x_k)\]
where \(g\) is a smooth (at least twice continuously differentiable) function. Suppose that there exists a fixed point \(x_* = g(x_*)\). Expanding \(g\) in a Taylor series about \(x_*\),
\[ g(x_k) = g(x_*) + g'(x_*)(x_k - x_*) + O((x_k-x_*)^2) \]
and thus

(1)#\[\begin{align} x_{k+1} - x_* &= g(x_k) - g(x_*) \\ &= g'(x_*) (x_k - x_*) + O((x_k - x_*)^2). \end{align}\]

In terms of the error \(e_k = x_k - x_*\),

\[ \left\lvert \frac{e_{k+1}}{e_k} \right\rvert = \lvert g'(x_*) \rvert + O(e_k).\]

Poll 12.1: Convergence class of the fixed-point method#

Is this convergence:

  • A=q-linear

  • B=r-linear

  • C=neither

Recall the definition of q-linear convergence

\[ \lim_{k\to\infty} \left\lvert \frac{e_{k+1}}{e_k} \right\rvert = \rho < 1. \]

So, clearly, we need

\[\lvert g'(x_*) \rvert < 1\]
for the fixed-point iteration to converge (locally); if \(\lvert g'(x_*) \rvert > 1\), the iteration diverges from \(x_*\).
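A tiny illustration of divergence (a made-up example, not part of the lecture code): for \(g(x) = x^2\), the point \(x_* = 1\) is a fixed point with \(g'(1) = 2 > 1\), so iterates started nearby run away from it.

xdiv = [1.1]
foreach(_ -> push!(xdiv, xdiv[end]^2), 1:6)  # iterate g(x) = x^2 starting from x_0 = 1.1
xdiv                                         # 1.1, 1.21, 1.46, 2.14, 4.59, 21.1, 445.8: fleeing x_* = 1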

Aside:#

Limit \(n\to\infty\):#

We’d say an algorithm costs \(O(n^2)\) if its running time on input of size \(n\) is less than \(c n^2\) for some constant \(c\) and sufficiently large \(n\).

Sometimes we write \(\operatorname{cost}(\texttt{algorithm}, n) = O(n^2)\) or (preferably) \(\operatorname{cost}(\texttt{algorithm}) \in O(n^2)\).

Note that \(O(\log n) \subset O(n) \subset O(n\log n) \subset O(n^2) \subset \dotsb\).

We say the algorithm is in \(\Theta(n^2)\) (“big theta”) if

\[ c_1 n^2 < \operatorname{cost}(\texttt{algorithm}) < c_2 n^2 \]
for some positive constants \(c_1,c_2\) and sufficiently large \(n\).
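As a concrete instance (a rough check reusing the bisect_iter cell above): each bisection step halves the bracket, so reaching a tolerance \(\texttt{tol}\) from a bracket of width \(b - a\) takes \(\Theta\big(\log_2 \frac{b-a}{\texttt{tol}}\big)\) iterations.

length(bisect_iter(f, -1, 3, 1e-10)), ceil(Int, log2(4 / 1e-10))  # both should equal 36 for this bracket and tolerance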

Limit \(h \to 0\):#

In numerical analysis the asymptotic parameter is often a small real number \(h\), and the definitions instead take the limit as \(h\) goes to zero. So we say a term in an expression is in \(O(h^2)\) if

\[ \lim_{h\to 0} \frac{\operatorname{term}(h)}{h^2} < \infty . \]
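A quick numerical illustration (a sketch with the hand-picked term \(\operatorname{term}(h) = \cos h - 1\), which is \(O(h^2)\)): the quotient \(\operatorname{term}(h)/h^2\) stays bounded, approaching \(-1/2\), as \(h\) shrinks.

term(h) = cos(h) - 1
[term(h) / h^2 for h in 10.0 .^ (-1:-1:-6)]  # values hover near -0.5, so term(h) is O(h^2)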

Big \(O\) terms can be manipulated as

(2)#\[\begin{align} h O(h^k) &= O(h^{k+1}) \\ O(h^k)/h &= O(h^{k-1}) \\ c O(h^k) &= O(h^k) \\ O(h^k) - O(h^k) &= ? \end{align}\]

Example of a fixed-point iteration#

We wanted to solve \(\cos x - x = 0\), i.e., to find a fixed point of \(g(x) = \cos x\).

xstar, _ = newton(f, fp, 1.)
g(x) = cos(x)
gprime(x) = -sin(x)
@show xstar
@show gprime(xstar)
plot([x->x, g], xlims=(-2, 3), label = ["y=x" "y=g(x)"])
scatter!([xstar], [xstar],
    label="\$x_*\$")
xstar = 0.739085133385284
gprime(xstar) = -0.6736120293089505
function fixed_point(g, x, n)
    xs = [x]
    for k in 1:n
        x = g(x)
        append!(xs, x)
    end
    xs
end

xs = fixed_point(g, 2., 15)
plot!(xs, g.(xs), seriestype=:path, marker=:auto)

Verifying fixed-point method convergence theory#

\[ \left\lvert \frac{e_{k+1}}{e_k} \right\rvert \to \lvert g'(x_*) \rvert \]
@show gprime(xstar)
es = xs .- xstar
es[2:end] ./ es[1:end-1]
gprime(xstar) = -0.6736120293089505
15-element Vector{Float64}:
 -0.9161855415615605
 -0.15197657010596488
 -0.734870205299266
 -0.624132525531327
 -0.7026257933893496
 -0.6523498121376077
 -0.6870971782336925
 -0.664168570025122
 -0.6798044680427148
 -0.6693659427636027
 -0.6764378047956165
 -0.6716930541785153
 -0.6748976495459512
 -0.6727427617641084
 -0.6741962236114177
scatter(abs.(es), yscale=:log10, label="fixed point")
plot!(k -> abs(gprime(xstar))^k, label="\$|g'|^k\$")

Reading: FNC Fixed-point iteration#

Plotting Newton convergence#

function newton_hist(f, fp, x0; tol=1e-12)
    x = x0
    hist = []
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        push!(hist, [x fx fpx])  # store the row (x, f(x), f'(x)) for each iterate
        if abs(fx) < tol
            return vcat(hist...) # concatenate rows into a matrix
        end
        x = x - fx / fpx
    end
    vcat(hist...)  # return the history even if the tolerance was not reached
end
newton_hist (generic function with 1 method)
xs = newton_hist(f, fp, 1.97)
@show x_star = xs[end,1]
@show xs[1:end-1,1] .- x_star
plot(xs[1:end-1,1] .- x_star, yscale=:log10, marker=:auto, xlabel = "# of iterations", ylabel = "abs error")
x_star = xs[end, 1] = 0.7390851332151607
xs[1:end - 1, 1] .- x_star = [1.2309148667848393, 0.003309687313930887, 2.4103207354464473e-6, 1.2827516826519059e-12]

Poll 12.2: Convergence of Newton’s method#

Is this convergence:

  • A=q-linear

  • B=r-linear

  • C=neither?

Formulations are not unique (constants):#

If \(x = g(x)\) then

\[x = \underbrace{x + c(g(x) - x)}_{g_2}\]
for any constant \(c \ne 0\). Can we choose \(c\) to make \(\lvert g_2'(x_*) \rvert\) small?

c = .5
g2(x) = x + c * (cos(x) - x)
g2p(x) = 1 + c * (-sin(x) - 1)
@show g2p(xstar)
plot([x->x, g, g2], ylims=(-5, 5), label=["y=x" "y=g(x)" "y=g2(x)"])
g2p(xstar) = 0.16319398534552476
xs = fixed_point(g2, 1., 15)
xs .- xstar
16-element Vector{Float64}:
  0.26091486661471597
  0.03106601954878585
  0.004893162344945079
  0.0007941171212053622
  0.00012947850276123773
  2.112687301181193e-5
  3.4475537732392425e-6
  5.62475483634195e-7
  9.16501970982253e-8
  1.4814399151852342e-8
  2.2752605355336186e-9
  2.2894852680366284e-10
 -1.0499723313017739e-10
 -1.594951948291623e-10
 -1.683889694348295e-10
 -1.6984036399492197e-10

Formulations are not unique (functions)#

If \(x = g(x)\) then

\[x = \underbrace{x + h(x) \big(g(x) - x\big)}_{g_3(x)}\]
for any smooth \(h(x) \ne 0\). Can we choose \(h(x)\) to make \(\lvert g_3'(x) \rvert\) small?

h(x) = -1 / (gprime(x) - 1)
g3(x) = x + h(x) * (g(x) - x)
plot([x-> x, cos, g2, g3], ylims=(-5, 5), label=["y=x" "y=g(x)" "y=g2(x)" "y=g3(x)"])

Exercise 12.1:#

Compute \(|g_3'(x_*)|\).

Note:

  • We don’t know \(g'(x_*)\) in advance because we don’t know \(x_*\) yet.

  • This method converges very fast.

  • We actually just derived Newton’s method.

4. A fresh derivation of Newton’s method via convergence theory#

  • A rootfinding problem \(f(x) = 0\) can be converted to a fixed point problem

    \[x = x + f(x) =: g(x)\]
    but there is no guarantee that \(g'(x_*) = 1 + f'(x_*)\) will have magnitude less than 1.

  • Problem-specific algebraic manipulation can be used to make \(|g'(x_*)|\) small.

  • \(x = x + h(x) f(x)\) is also a valid formulation for any \(h(x)\) bounded away from \(0\).

  • Can we choose \(h(x)\) such that

    \[ g'(x) = 1 + h'(x) f(x) + h(x) f'(x) = 0\]
    when \(f(x) = 0\)?

Setting \(f(x_*) = 0\) in the condition above leaves \(1 + h(x_*) f'(x_*) = 0\), which we can satisfy by choosing \(h(x) = -1/f'(x)\). The resulting iteration is

\[ x_{k+1} = x_k + \underbrace{\frac{-1}{f'(x_k)}}_{h(x_k)} f(x_k) . \]
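As a quick sanity check (a sketch reusing f, fp, fixed_point, and xstar from the cells above): iterating this map reproduces Newton's method, and the error collapses to the accuracy of xstar within a few steps.

g_newton(x) = x - f(x) / fp(x)          # the fixed-point map with h(x) = -1/f'(x)
fixed_point(g_newton, 1.0, 5) .- xstar  # errors drop to ~1e-10 (the accuracy of xstar) in a few steps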

Quadratic convergence!#

\[ \left\lvert \frac{e_{k+1}}{e_k} \right\rvert \to \lvert g'(x_*) \rvert \]
  • What does it mean that \(g'(x_*) = 0\)?

  • It turns out that Newton’s method has locally quadratic convergence to simple roots,

    \[\lim_{k \to \infty} \frac{|e_{k+1}|}{|e_k|^2} < \infty.\]

  • “The number of correct digits doubles each iteration.” (A quick numerical check follows this list.)

  • Now that we know how to make a good guess accurate, the effort lies in getting a good starting guess.
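A rough numerical check of the quadratic rate, reusing newton_hist and x_star from the cells above (the first quotient is not meaningful because that error is still large):

errs = newton_hist(f, fp, 1.97)[1:end-1, 1] .- x_star
abs.(errs[2:end]) ./ abs.(errs[1:end-1]) .^ 2  # settles near |f''(x_*) / (2 f'(x_*))| ≈ 0.22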

Rootfinding summary and outlook#

  • Newton methods are immensely successful

    • Convergence theory is local; we need good initial guesses

    • Computing the derivative \(f'(x)\) is intrusive (and can be error-prone)

      • Avoided by secant methods (approximate the derivative)

      • Algorithmic or numerical differentiation can also be used instead of computing the derivative by hand

    • Bisection is robust when conditions are met

    • When does Newton diverge?

  • More topics

    • Do we find all the roots?

    • Times when Newton converges slowly

Exercise 12.2:#

  1. Run the fixed-point iteration \(x_{n+1} = g(x_n) = 2 + (x_n - 2)^2\) up to a tolerance of \(10^{-6}\) or a maximum of \(N = 50\) iterations, starting from (a) \(x_0 = 2.8\) and (b) \(x_0 = 3.1\). Which run behaves better, and why?

  2. Run the fixed-point iteration \(x_{n+1} = g(x_n) = 2 + (x_n - 2)^3\) up to a tolerance of \(10^{-6}\) or a maximum of \(N = 50\) iterations, starting from (a) \(x_0 = 2.8\) and (b) \(x_0 = 3.1\). Which run behaves better, and why?