
Regularization Parameter Estimation for Least Squares: A Newton Method Using the χ²-Distribution

Rosemary Renaut, Jodi Mead

Arizona State and Boise State

September 2007



Outline

1 Introduction: ill-posed least squares; some standard methods

2 A statistically based method: the chi-squared method. Background; algorithm; single-variable Newton method; extension for general D (generalized Tikhonov); observations

3 Results

4 Conclusions

5 References


Regularized Least Squares for Ax = b

Assume A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n, and the system is ill-posed.

Generalized Tikhonov regularization, with an operator D acting on x:

    \hat{x} = \arg\min J(x) = \arg\min \{ \|Ax - b\|^2_{W_b} + \|D(x - x_0)\|^2_{W_x} \}.   (1)

Assume N(A) ∩ N(D) = {0}. Statistically, W_b is the inverse covariance matrix for the data b. The standard choice is W_x = λ² I, with λ an unknown penalty parameter, so that (1) becomes

    \hat{x}(\lambda) = \arg\min J(x) = \arg\min \{ \|Ax - b\|^2_{W_b} + \lambda^2 \|D(x - x_0)\|^2 \}.   (2)

Question: What is the correct λ?
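For a fixed λ, (2) is a linear least squares problem. A minimal sketch of the solve via the normal equations, assuming dense matrices and a given SPD weighting W_b; tikhonov_solve is an illustrative name, not code from the talk:

```python
import numpy as np

def tikhonov_solve(A, b, D, x0, lam, Wb):
    """Minimize ||Ax - b||^2_{Wb} + lam^2 ||D(x - x0)||^2 via the normal
    equations (A^T Wb A + lam^2 D^T D) x = A^T Wb b + lam^2 D^T D x0."""
    DtD = D.T @ D
    lhs = A.T @ Wb @ A + lam**2 * DtD
    rhs = A.T @ Wb @ b + lam**2 * (DtD @ x0)
    return np.linalg.solve(lhs, rhs)
```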


Some standard approaches I: L-curve. Find the corner

Let r(λ) = (A(λ) − I_m) b, with influence matrix

    A(\lambda) = A (A^T W_b A + \lambda^2 D^T D)^{-1} A^T W_b.

Plot log(‖D x(λ)‖) against log(‖r(λ)‖) and trade off the two contributions.

Expensive: requires a range of λ.
The GSVD makes the calculations efficient.
Not statistically based.

(Figures: one L-curve with a clear corner to find; one with no corner.)
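A brute-force sketch of how the L-curve is traced, reusing the hypothetical tikhonov_solve above; looping over a whole range of λ is exactly what makes the method expensive without the GSVD:

```python
import numpy as np

def l_curve_points(A, b, D, x0, Wb, lams):
    # One point (log ||r||, log ||D x||) per lambda; the corner is then
    # located on this discrete curve.
    pts = []
    for lam in lams:
        x = tikhonov_solve(A, b, D, x0, lam, Wb)
        r = b - A @ x
        pts.append((np.log(np.linalg.norm(r)), np.log(np.linalg.norm(D @ x))))
    return np.array(pts)
```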


Some standard approaches II: Generalized Cross-Validation (GCV)

Minimize the GCV function

    \frac{\|b - A x(\lambda)\|^2_{W_b}}{[\mathrm{trace}(I_m - A(\lambda))]^2},

which estimates the predictive risk.

Expensive: requires a range of λ.
The GSVD makes the calculations efficient.
Statistically based.
Requires a minimum: there may be multiple minima, and the function is sometimes flat near the minimizer.

(Figures: GCV functions with multiple minima and with a flat minimum.)
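A sketch of evaluating the GCV function for a single λ, assuming x_0 = 0 so that A x(λ) = A(λ) b; it uses dense linear algebra for clarity, whereas in practice the GSVD form is preferred:

```python
import numpy as np

def gcv(A, b, D, Wb, lam):
    m = A.shape[0]
    # Influence matrix A(lambda) = A (A^T Wb A + lam^2 D^T D)^{-1} A^T Wb.
    infl = A @ np.linalg.solve(A.T @ Wb @ A + lam**2 * (D.T @ D), A.T @ Wb)
    r = b - infl @ b                      # residual b - A x(lambda)
    return (r @ Wb @ r) / np.trace(np.eye(m) - infl) ** 2
```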


Some standard approaches III: Unbiased Predictive Risk Estimation (UPRE)

Minimize the expected value of the predictive risk, i.e. minimize the UPRE function

    \|b - A x(\lambda)\|^2_{W_b} + 2\,\mathrm{trace}(A(\lambda)) - m.

Expensive: requires a range of λ.
The GSVD makes the calculations efficient.
Statistically based.
A minimum is needed.
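A sketch of the UPRE function under the same assumptions as the GCV sketch above (x_0 = 0, dense influence matrix); it would be minimized over a grid of λ:

```python
import numpy as np

def upre(A, b, D, Wb, lam):
    m = A.shape[0]
    infl = A @ np.linalg.solve(A.T @ Wb @ A + lam**2 * (D.T @ D), A.T @ Wb)
    r = b - infl @ b
    return r @ Wb @ r + 2.0 * np.trace(infl) - m
```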


Development

A new statistically based method: the chi-squared method
Its background
A Newton algorithm
Some examples
Future work


General Result: Tikhonov (D = I). The cost functional at its minimum is a χ² random variable

Theorem (Rao 1973; Tarantola; Mead 2007)
Let

    J(x) = (b - Ax)^T C_b^{-1} (b - Ax) + (x - x_0)^T C_x^{-1} (x - x_0),

where x and b are stochastic (they need not be normal), the components of r = b − A x_0 are iid (assume no components are zero), and the matrices C_b = W_b^{-1} and C_x = W_x^{-1} are SPD. Then, for large m, the minimum value of J is a random variable that follows a χ² distribution with m degrees of freedom.


Implications:

The theorem implies

    m - \sqrt{2m}\, z_{\alpha/2} < J(\hat{x}) < m + \sqrt{2m}\, z_{\alpha/2}

for a (1 − α) confidence interval, with x̂ the solution. Equivalently, when D = I,

    m - \sqrt{2m}\, z_{\alpha/2} < r^T (A C_x A^T + C_b)^{-1} r < m + \sqrt{2m}\, z_{\alpha/2}.

Note that no assumptions are made on W_x: it is completely general.

Can we use the result to obtain an efficient algorithm?
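A sketch of the resulting interval test for D = I, with z_{α/2} taken from scipy; chi2_interval_ok is an illustrative name, and the endpoints m ± √(2m) z_{α/2} are as above:

```python
import numpy as np
from scipy.stats import norm

def chi2_interval_ok(r, A, Cx, Cb, alpha=0.05):
    m = r.size
    J = r @ np.linalg.solve(A @ Cx @ A.T + Cb, r)   # J(x_hat) when D = I
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(2.0 * m)
    return (m - half < J < m + half), J
```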


First attempt: a New Algorithm for Estimating Model Covariance

Algorithm (Mead 2007). Given the confidence interval parameter α, the initial residual r = b − A x_0, and an estimate of the data covariance C_b, find L_x which solves the nonlinear optimization:

    Minimize \|L_x L_x^T\|_F^2
    subject to m - \sqrt{2m}\, z_{\alpha/2} < r^T (A L_x L_x^T A^T + C_b)^{-1} r < m + \sqrt{2m}\, z_{\alpha/2},
    with A L_x L_x^T A^T + C_b well-conditioned.

Expensive.


Single Variable Approach: seek an efficient, practical algorithm

Let W_x = σ_x^{-2} I, where the regularization parameter is λ = 1/σ_x. Use the SVD

    U_b \Sigma_b V_b^T = W_b^{1/2} A,

with singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p, and define s = U_b^T W_b^{1/2} r. Find σ_x such that

    m - \sqrt{2m}\, z_{\alpha/2} < s^T \mathrm{diag}\!\left(\frac{1}{\sigma_i^2 \sigma_x^2 + 1}\right) s < m + \sqrt{2m}\, z_{\alpha/2}.

Equivalently, find σ_x such that

    F(\sigma_x) = s^T \mathrm{diag}\!\left(\frac{1}{1 + \sigma_x^2 \sigma_i^2}\right) s - m = 0.

Scalar root finding: Newton's method.
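A sketch of the full single-variable procedure, assuming D = I and a symmetric factor Wb_half of W_b; after one SVD, each Newton step costs only O(m) scalar work:

```python
import numpy as np

def chi2_newton(A, b, x0, Wb_half, sx=1.0, tol=1e-8, max_iter=50):
    # SVD of Wb^{1/2} A and rotated residual s = Ub^T Wb^{1/2} (b - A x0).
    Ub, sv, _ = np.linalg.svd(Wb_half @ A, full_matrices=True)
    s = Ub.T @ (Wb_half @ (b - A @ x0))
    sig = np.zeros(s.size)
    sig[: sv.size] = sv                      # sigma_i = 0 for i > p
    m = s.size
    for _ in range(max_iter):
        w = 1.0 / (1.0 + sx**2 * sig**2)
        F = s @ (w * s) - m                  # F(sigma_x)
        dF = -2.0 * sx * np.sum((sig * w * s) ** 2)
        if abs(F) < tol or dF == 0.0:
            break
        sx = sx - F / dF                     # Newton update
    return 1.0 / sx                          # lambda = 1 / sigma_x
```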


Extension to Generalized Tikhonov

Define

    \hat{x}_{GTik} = \arg\min J_D(x) = \arg\min \{ \|Ax - b\|^2_{W_b} + \|D(x - x_0)\|^2_{W_x} \}.   (3)

Theorem
For large m, the minimum value of J_D is a random variable which follows a χ² distribution with m − n + p degrees of freedom (assuming that no components of r are zero).

Proof.
Use the generalized singular value decomposition of [W_b^{1/2} A; W_x^{1/2} D].

Goal: find W_x such that J_D is χ² with m − n + p degrees of freedom.


Newton Root Finding for W_x = σ_x^{-2} I_p

Let the GSVD of [W_b^{1/2} A; D] be

    W_b^{1/2} A = U \begin{bmatrix} \Upsilon \\ 0_{(m-n) \times n} \end{bmatrix} X^T, \qquad D = V [M, 0_{p \times (n-p)}] X^T,

where the γ_i are the generalized singular values. Define

    \tilde{m} = m - n + p - \sum_{i=1}^{p} s_i^2 \delta_{\gamma_i 0} - \sum_{i=n+1}^{m} s_i^2, \qquad \tilde{s}_i = \frac{s_i}{\gamma_i^2 \sigma_x^2 + 1}, \quad i = 1, \dots, p, \qquad t_i = \tilde{s}_i \gamma_i.

Find the root of

    \sum_{i=1}^{p} \frac{s_i^2}{\gamma_i^2 \sigma_x^2 + 1} + \sum_{i=n+1}^{m} s_i^2 = m - n + p,

i.e. solve F = 0, where

    F(\sigma_x) = s^T \tilde{s} - \tilde{m} \quad \text{and} \quad F'(\sigma_x) = -2 \sigma_x \|t\|_2^2.
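A sketch of one Newton update for this F, taking the GSVD quantities γ and s as given and omitting the δ bookkeeping for zero γ_i:

```python
import numpy as np

def newton_step(sx, s, gamma, m, n, p):
    # F(sx) = sum_i s_i^2/(gamma_i^2 sx^2 + 1) + sum_{i>n} s_i^2 - (m - n + p)
    w = 1.0 / (gamma**2 * sx**2 + 1.0)
    s_tilde = w * s[:p]
    t = gamma * s_tilde                       # t_i = gamma_i * s~_i
    F = s[:p] @ s_tilde + np.sum(s[n:m] ** 2) - (m - n + p)
    dF = -2.0 * sx * (t @ t)                  # F'(sx) = -2 sx ||t||^2
    return sx - F / dF, F
```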

An Illustrative Example: the phillips Fredholm integral equation (Hansen)

Add noise to b, with pointwise standard deviation σ_{b_i} = .01|b_i| + .1 b_max.
Covariance matrix C_b = σ_b² I_m = W_b^{-1}, where σ_b² is the average of the σ_{b_i}².

(Figure: − is the original b and ∗ the noisy data; example with 10% error.)
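A sketch of this noise model, with the constants taken from the slide; add_noise is an illustrative helper, and the phillips problem itself comes from Hansen's Regularization Tools:

```python
import numpy as np

def add_noise(b, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = 0.01 * np.abs(b) + 0.1 * np.max(b)   # pointwise std dev
    b_noisy = b + rng.normal(0.0, sigma)
    sigma2_b = np.mean(sigma**2)                 # Cb = sigma2_b * I = Wb^{-1}
    return b_noisy, sigma2_b
```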

Compare solutions: + is the reference x_0, −− is the exact solution, o is the L-curve solution, and the three other solutions are UPRE, GCV, and the χ² method (blue, magenta, black). Each method gives a different solution, but UPRE, GCV, and χ² are comparable.

(Figure: comparison with the new method.)


Observations: Example F

Initialization: GCV, UPRE, the L-curve, and the χ² method all use the GSVD (or SVD).
The algorithm is cheap compared to GCV, UPRE, and the L-curve.
F is monotonically decreasing and even.
Either the solution exists and is unique for positive σ, or no solution exists, F(0) < 0.
Theoretically, lim_{σ→∞} F > 0 is possible; this is equivalent to λ = 0, i.e. no regularization is needed.


Remark on F(0) < 0

Notice that when F(0) < 0, m̃ is too big relative to J; equivalently, there are insufficient degrees of freedom. Notice that

    J(\hat{x}) = \|P^{1/2} s\|_2^2, \qquad P = \mathrm{diag}\left( \frac{1}{(\gamma_i \sigma)^2 + 1}, 0_{n-p}, I_{m-n} \right).

In particular, J(x̂(0)) = ‖P^{1/2}(0) s‖₂² = y for some y. If y < m̃, set m̃ = floor(y); the theorem is revised to m̃ = min{floor(J(0)), m − n + p}. In the example, J(0) ≈ 39 and F(0) ≈ −461; in the figure on the right, m̃ = 38.


Example: Seismic Signal Restoration

Real data set of 48 signals of length 500.
The point spread function is derived from the signals.
Solve Pf = g, where P is the PSF matrix and g is the signal; restore f.
Calculate the signal variance pointwise over all 48 signals.
Compare the restoration of the S-wave with derivative orders 0, 1, 2.
The weighting matrices are I, σ_g^{-2} I, and diag(σ_{g_i}^{-2}): cases 1, 2, and 3.

Tikhonov Regularization

Observations:
The reduced degrees of freedom are relevant!
The degrees of freedom are found automatically.
Cases 1 and 2 have different solutions.
Case 3 preserves the features of the signal.

(Figure: restored signals for the three weighting cases.)

First and Second Order Derivative Restoration

Observations:
Here derivative smoothing is not desirable.
Case 3 preserves the signal characteristics.
The value given is λ, the regularization parameter.
λ increases with the derivative order: more smoothing.

(Figure: first- and second-order derivative restorations.)

Comparison with L-curve and UPRE Solutions

Observations:
The L-curve underestimates λ severely.
UPRE and χ² are similar when the DOF are limited for χ².
UPRE underestimates λ for case 2 and case 3 weighting.

(Figure: solutions from the L-curve, UPRE, and χ² methods.)

Newton's Method converges in 5-10 Iterations

    l   c_b   k (mean)   k (std)
    0   1     8.23e+00   6.64e-01
    0   2     8.31e+00   9.80e-01
    0   3     8.06e+00   1.06e+00
    1   1     4.92e+00   5.10e-01
    1   2     1.00e+01   1.16e+00
    1   3     1.00e+01   1.19e+00
    2   1     5.01e+00   8.90e-01
    2   2     8.29e+00   1.48e+00
    2   3     8.38e+00   1.50e+00

Table: Convergence characteristics (iterations k) for problem phillips with n = 40 over 500 runs.

    l   c_b   k (mean)   k (std)
    0   1     6.84e+00   1.28e+00
    0   2     8.81e+00   1.36e+00
    0   3     8.72e+00   1.46e+00
    1   1     6.05e+00   1.30e+00
    1   2     7.40e+00   7.68e-01
    1   3     7.17e+00   8.12e-01
    2   1     6.01e+00   1.40e+00
    2   2     7.28e+00   8.22e-01
    2   3     7.33e+00   8.66e-01

Table: Convergence characteristics (iterations k) for problem blur with n = 36 over 500 runs.

Estimating The Error and Predictive Risk

Error (mean over runs):

    l   c_b   χ²         L-curve    GCV        UPRE
    0   2     4.37e-03   4.39e-03   4.21e-03   4.22e-03
    0   3     4.32e-03   4.42e-03   4.21e-03   4.22e-03
    1   2     4.35e-03   5.17e-03   4.30e-03   4.30e-03
    1   3     4.39e-03   5.05e-03   4.38e-03   4.37e-03
    2   2     4.50e-03   6.68e-03   4.39e-03   4.56e-03
    2   3     4.37e-03   6.66e-03   4.43e-03   4.54e-03

Table: Error characteristics for problem phillips with n = 60 over 500 runs with error-contaminated x_0; relative errors larger than .009 removed.

Results are comparable.

Risk (mean over runs):

    l   c_b   χ²         L-curve    GCV        UPRE
    0   2     3.78e-02   5.22e-02   3.15e-02   2.92e-02
    0   3     3.88e-02   5.10e-02   2.97e-02   2.90e-02
    1   2     3.94e-02   5.71e-02   3.02e-02   2.74e-02
    1   3     1.10e-01   5.90e-02   3.27e-02   2.79e-02
    2   2     3.41e-02   6.00e-02   3.35e-02   3.79e-02
    2   3     3.61e-02   5.98e-02   3.35e-02   3.82e-02

Table: Risk characteristics for problem phillips with n = 60 over 500 runs.

The χ² method does not give the best estimate of the risk.

(Figure: error histogram; normal noise on the right-hand side, first-order derivative, C_b = σ² I.)

(Figure: error histogram; exponential noise on the right-hand side, first-order derivative, C_b = σ² I.)

Conclusions

The χ² Newton algorithm is cost effective.
It performs as well as, or better than, GCV and UPRE when statistical information is available.
It should be the method of choice when statistical information is provided.
The method can be adapted to find W_b if W_x is provided.

Future Work

Analyse truncated expansions (TSVD and TGSVD): reduce the degrees of freedom.
Further theoretical analysis and simulations with other noise distributions; compare with the new work of Rust & O'Leary 2007.
Can it be extended to nonlinear regularization terms (TV?)?
Develop the nonlinear least squares approach for general diagonal W_x.
Efficient calculation of uncertainty information, the covariance matrix.
Nonlinear problems?

Some Solutions: with no prior information x_0

Illustrated are solutions and error bars.

(Figures: left, no statistical information, the solution is smoothed; right, with statistical information, C_b = diag(σ_{b_i}²).)

Some Generalized Tikhonov Solutions: First Order Derivative

(Figures: left, no statistical information; right, C_b = diag(σ_{b_i}²).)

Some Generalized Tikhonov Solutions: with prior x_0, the solution is not smoothed

(Figures: left, no statistical information; right, C_b = diag(σ_{b_i}²).)

Some Generalized Tikhonov Solutions: x_0 = 0, exponential noise

(Figures: left, no statistical information; right, C_b = diag(σ_{b_i}²).)

Relationship to the Discrepancy Principle

The discrepancy principle can also be implemented by a Newton method. It finds σ_x such that the regularized residual satisfies

    \sigma_b^2 = \frac{1}{m} \|b - A x(\sigma)\|_2^2.   (4)

Consistent with our notation,

    \sum_{i=1}^{p} \left( \frac{1}{\gamma_i^2 \sigma^2 + 1} \right)^2 s_i^2 + \sum_{i=n+1}^{m} s_i^2 = m.   (5)

The weight in the first sum is squared here; otherwise the functional is the same. But the discrepancy principle often oversmooths. What happens here?
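In this notation the two methods differ by one squared weight, as a sketch of the discrepancy-principle functional makes plain; γ and s are as on the Newton root-finding slide:

```python
import numpy as np

def discrepancy_F(sx, s, gamma, m, n, p):
    # Same structure as the chi-squared F, but with the weight squared.
    w = 1.0 / (gamma**2 * sx**2 + 1.0)
    return s[:p] ** 2 @ w**2 + np.sum(s[n:m] ** 2) - m
```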

Major References

Bennett, A., 2005. Inverse Modeling of the Ocean and Atmosphere. Cambridge University Press.
Hansen, P. C., 1994. Regularization Tools: A Matlab Package for Analysis and Solution of Discrete Ill-posed Problems. Numerical Algorithms 6, 1-35.
Mead, J., 2007. A priori weighting for parameter estimation. J. Inv. Ill-posed Problems, to appear.
Rao, C. R., 1973. Linear Statistical Inference and its Applications. Wiley, New York.
Tarantola, A., 2005. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM.
Vogel, C. R., 2002. Computational Methods for Inverse Problems. SIAM, Frontiers in Applied Mathematics.

blur: Atmospheric blur (Gaussian PSF) (Hansen), again with noise

(Figure: the true solution on the left and the degraded data on the right.)
(Figure: solutions using x_0 = 0, generalized Tikhonov with the second derivative, 5% noise.)
(Figure: solutions using x_0 = 0, generalized Tikhonov with the second derivative, 10% noise.)