Page 1:

How to lasso positively, quickly and correctly

David Bryant

University of Otago

SUQ 2013

Page 2:

The Lasso (a.k.a. L1 regularization)

Consider the linear inverse problem

y = Xβ + ε.

An L1-regularized solution takes a parameter δ and minimizes the penalized residual

‖Xβ − y‖₂² + δ‖β‖₁.

This has the advantage that the solutions are typically sparse, which makes the Lasso useful for variable selection.

When δ = 0 we obtain the least squares solution. As δ → ∞, the solutions approach 0.

We can replace ‖β‖₁ with a weighted version ∑ᵢ wᵢ|βᵢ|.

Introduced into statistics by Tibshirani (1996) under the name of ‘the LASSO’.
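As a concrete illustration, here is a minimal sketch using scikit-learn's Lasso on synthetic data (the solver and data are assumptions of this example, not part of the talk; note that sklearn scales the residual term by 1/(2n), so its alpha matches our δ only up to that factor):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]              # only three non-zero coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Larger penalties give sparser solutions.
for alpha in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    print(alpha, np.count_nonzero(fit.coef_))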

Page 3:

Computing the Lasso

Computational challenge: efficiently compute LASSO solutions for a range of δ values (i.e., all of them?).

First observation (from KKT conditions): the set of LASSO solutions

argmin_β {‖Xβ − y‖₂² + δ‖β‖₁}

for δ ≥ 0 equals the set of solutions to

argmin_β {‖Xβ − y‖₂² such that ‖β‖₁ ≤ λ}

for λ ≥ 0.
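A quick numerical check of this equivalence, sketched here with cvxpy (the solver choice and data are assumptions of the example): solve the penalized form for some δ, set λ to the ℓ1 norm of that solution, and confirm the constrained form recovers the same β.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = X @ np.array([3.0] + [0.0] * 9) + 0.1 * rng.normal(size=40)

b = cp.Variable(10)

# Penalized form with delta = 5.
cp.Problem(cp.Minimize(cp.sum_squares(X @ b - y) + 5.0 * cp.norm1(b))).solve()
beta_pen = b.value.copy()

# Constrained form with lambda = ||beta_pen||_1.
lam = np.abs(beta_pen).sum()
cp.Problem(cp.Minimize(cp.sum_squares(X @ b - y)), [cp.norm1(b) <= lam]).solve()

print(np.max(np.abs(beta_pen - b.value)))     # should be near 0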

Page 4:

The LARS-Lasso algorithm

Let β(λ) denote the optimal solution for a given λ. That is,

β(λ) = argmin_β ‖Xβ − y‖₂² such that ‖β‖₁ ≤ λ.

It is not too hard to show that β(λ) is piecewise linear (as a function of λ).

[Figure: the piecewise-linear path β(λ), running from 0 to the OLS solution.]

Curve starts at 0 and finishes at the un-penalised solution.
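scikit-learn's lars_path traces exactly these knots; between consecutive knots the coefficients move linearly. A sketch on synthetic data (the data and sklearn's alpha convention are assumptions of the example):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X @ np.array([2.0, -1.0, 0, 0, 0, 0, 0, 0]) + 0.05 * rng.normal(size=60)

# alphas are the knot values; coefs[:, k] is the solution at knot k.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("knots:", np.round(alphas, 4))
print("order in which variables enter:", active)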

Page 5:

Lasso computations

Osborne, Presnell and Turlach (2000): 394 citations.

Efron, Hastie, Johnstone and Tibshirani (2004): 2909 citations.

Page 6:

The positive Lasso

In many applications (including ours), we require variable coefficients which are non-negative.

β(λ) = argmin_β ‖Xβ − y‖₂ such that β ≥ 0 and ‖β‖₁ ≤ λ.
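scikit-learn's Lasso exposes this constraint directly through positive=True (again a sketch on synthetic data, not the talk's solver):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
y = X @ np.array([1.5, 0.8] + [0.0] * 8) + 0.1 * rng.normal(size=50)

fit = Lasso(alpha=0.05, positive=True, fit_intercept=False).fit(X, y)
print(fit.coef_)        # every entry is >= 0, and most are exactly 0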

Page 7:

LARS-LASSO

Efron et al. propose a ‘positive Lasso’ LARS algorithm.

Let β = 0 and c = X′(y − Xβ).
while ‖c‖ > 0:
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹1
    Move β in direction w_A until
        an entry of β becomes negative, OR
        new variable(s) join the set of indices with maximum cᵢ
    Update c = X′(y − Xβ)
end

Problem: the algorithm can fail if more than one variable leaves or enters A at any one time.

Page 8:

Multiple entries

Is this ‘one-at-a-time’ restriction a problem?

NO: you can always add random noise to break ties.

YES: with a positivity constraint, ties arise as part of the algorithm itself (not just from degenerate data). We also ran into trouble with our own degenerate models and problems.

Page 9:

Our algorithm for the positive Lasso

Let β = 0 and c = X′y.
while ‖c‖ > 0:
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹1
    Find v minimizing ‖X_A(v_A − w_A)‖₂ such that vᵢ ≥ 0 whenever βᵢ = 0
    Move β in direction v until
        an entry of β goes negative, OR
        new variable(s) join the set A
    Update c = X′(y − Xβ)
end

Page 10:

KKT conditions

For each λ we want to find

min_β ‖Xβ − y‖₂ such that β ≥ 0 and ‖β‖₁ = λ.    (†)

From KKT conditions:

β solves (†) if and only if β is feasible and cᵢ is maximal for all i with βᵢ > 0 (where c = X′(y − Xβ)).
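This characterization can be verified numerically: fit a positive lasso with any solver, compute c = X′(y − Xβ), and check that c attains its maximum exactly on the support of β. A sketch using scikit-learn's coordinate-descent solver (an assumption of the example):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 12))
y = X @ np.array([1.0, 0.5, 0.25] + [0.0] * 9) + 0.05 * rng.normal(size=80)

beta = Lasso(alpha=0.02, positive=True, fit_intercept=False,
             tol=1e-12).fit(X, y).coef_
c = X.T @ (y - X @ beta)

# KKT: c_i is maximal precisely where beta_i > 0.
print(np.round(c[beta > 1e-10], 6))     # all approximately equal to max(c)
print(np.round(np.max(c), 6))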

Page 11:

One step

Consider moving β in direction v. For a step length γ, define

β_γ = β + γv,

so that

c_γ = X′(y − Xβ_γ) = c − γX′Xv.

The conditions that β_γ and c_γ need to satisfy are then:

Feasibility: β_γ ≥ 0.

Optimality: (c_γ)ᵢ maximal for all i such that (β_γ)ᵢ > 0.

Increasing: ∑ᵢ (β_γ)ᵢ > ∑ᵢ βᵢ.

Page 12:

Defining the search direction

Define A = {i : cᵢ maximal}. The LARS-Lasso algorithm considers the (unconstrained) search direction w, where w_A = (X′_A X_A)⁻¹1.

From above, the actual search direction should be a v satisfying:

For all i ∈ A such that βᵢ = 0, vᵢ ≥ 0.

For all i ∈ A such that βᵢ > 0 or vᵢ > 0, (X′Xv)ᵢ = 1.

For all i ∈ A, (X′Xv)ᵢ ≥ 1.

For all i ∉ A, vᵢ = 0.

These are the KKT conditions for the constrained problem

min_v ‖X_A(v_A − w_A)‖₂ such that βᵢ = 0 ⇒ vᵢ ≥ 0

where vᵢ = 0 for all i ∉ A.
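Because the non-negativity applies only to the coordinates with βᵢ = 0, this is a bounded least-squares problem and can be solved with off-the-shelf tools. A sketch using scipy.optimize.lsq_linear, with hypothetical stand-ins for X_A, w_A, and the current β_A:

import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(5)
XA = rng.normal(size=(30, 5))                    # active columns of X
wA = np.linalg.solve(XA.T @ XA, np.ones(5))      # unconstrained LARS direction
betaA = np.array([0.0, 0.3, 0.0, 1.2, 0.7])      # current coefficients on A

# v_i is forced >= 0 only where beta_i = 0; elsewhere it is unbounded.
lower = np.where(betaA == 0.0, 0.0, -np.inf)
res = lsq_linear(XA, XA @ wA, bounds=(lower, np.inf))
print(res.x)                                     # the constrained direction v_A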

Page 13:

An algorithm for the positive Lasso

(The algorithm from page 9 is restated here; the search direction v is exactly the solution of the constrained subproblem derived above.)

Page 14:

Phylogenetics

Page 15:

Application: linear models in phylogenetics

Each edge in the tree corresponds to a split (bipartition) of the objects into two parts. These splits and their weights determine evolutionary distances between the objects:

[Figure: an edge-weighted tree on the taxa A–F.]

y contains the observed distances between objects;

X indicates which splits/branches separate which pairs;

β is the vector of split/branch weights to be inferred.

y = Xβ + ε
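A sketch of how X can be constructed (the split encoding below, as sets of taxa on one side of each edge, and the weights are illustrative, not the talk's actual data):

import numpy as np
from itertools import combinations

taxa = ["A", "B", "C", "D", "E", "F"]
# Each split is the set of taxa on one side of an edge of the tree.
splits = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}, {"F"},
          {"A", "B"}, {"A", "B", "C"}, {"E", "F"}]

pairs = list(combinations(taxa, 2))
X = np.zeros((len(pairs), len(splits)))
for r, (a, b) in enumerate(pairs):
    for s, side in enumerate(splits):
        # A split separates a pair iff exactly one of the two taxa is on this side.
        X[r, s] = float((a in side) != (b in side))

weights = np.array([0.5, 0.2, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.3])  # beta
distances = X @ weights                                            # entries of y
print(dict(zip(pairs, np.round(distances, 2))))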

Page 16:

Application: phylogenetic networks

Most collections of splits do not encode a tree; they can, however, be represented using a split network.

[Figure: an edge-weighted split network on the taxa A–F.]

Useful for data exploration, since we can depict conflicting signals and represent the amount of noise∗.

Page 17:

Phylogenetic networks

Page 18:

From English accents...

Page 19:

...to Swedish worms.

Page 20:

From the origin of modern wheat...

Page 21:

...to the origin of life.

Page 22:

Networks and overfitting

With phylogenetic networks we intentionally over-fit the data.

In practice, many variables (splits) are eliminated using NNLS (non-negative least squares).

A large component of my student Alethea Rea’s Ph.D. thesis was devoted to methods for cleaning up the remainder: the Lasso was an obvious choice.

Page 23:

Numerical issues

Let n be the number of objects. Then

X is (n choose 2) × (n choose 2).

X′X is typically poorly conditioned.

X is not sparse, but it is structured, so there are efficient algorithms for computing Xv and X′v.

(X′X)⁻¹ is sparse.

Page 24:

Numerical issues

(The positive-Lasso algorithm from page 9 is repeated on this slide.)

Page 25:

Algorithm steps with numerical issues

(The algorithm is repeated once more, stepping through its numerically sensitive operations.)

Page 26:

Strategies

The key computation is the choice of search direction:

min_v ‖X_A(v_A − w_A)‖₂ such that βᵢ = 0 ⇒ vᵢ ≥ 0.

Started with PGCG (thanks to John) but had problems with conditioning of (X′_A X_A) and with degeneracy.

‘Regressed’ to an active-set method. Made use of the sparseness of (X′X)⁻¹, PCG, and the Woodbury formula to solve sub-problems.
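For reference, the Woodbury formula lets one reuse a factorization or inverse after the low-rank change each active-set update causes. A minimal numpy check of the identity on generic matrices (not the phylogenetic X):

import numpy as np

rng = np.random.default_rng(6)
n, k = 8, 2
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)                  # symmetric positive definite
U = rng.normal(size=(n, k))
C = np.eye(k)

A_inv = np.linalg.inv(A)
# Woodbury: (A + U C U')^{-1} = A^{-1} - A^{-1} U (C^{-1} + U' A^{-1} U)^{-1} U' A^{-1}
middle = np.linalg.inv(np.linalg.inv(C) + U.T @ A_inv @ U)
updated_inv = A_inv - A_inv @ U @ middle @ U.T @ A_inv

print(np.max(np.abs(updated_inv - np.linalg.inv(A + U @ C @ U.T))))  # ~1e-15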

Page 27:

Simple example: full network

Page 28:

Simple example: lasso networks

(Pages 28–38 show a sequence of network figures under this title.)

Page 39:

Open problems

1. Making a choice of λ.

2. Weights within the penalty function (the adaptive lasso?).

3. More general error distributions.

Page 40:

LASSO sampling

What I like most about the LARS and LARS-Lasso algorithms is that you effectively get an estimate for all possible values of λ: these are the points on the path β(λ).

A Lasso-LARS sampler for π(β | y, λ)

would produce a ‘nice’ function β : ℝ → ℝⁿ such that, for each λ, β(λ) has the conditional marginal distribution

β(λ) ∼ π(β|y, λ).

Goal:

1. Sample β(λ₀) from π(β | y, λ₀).

2. Compute the remainder of β(λ) deterministically from β(λ₀) (e.g., numerically).

Page 41:

Summary

We propose a way to ‘correct’ the LARS positive-Lasso algorithm to account for degeneracies.

The motivation was applications to phylogenetic networks, though we are exploring other applications.
