
How to lasso positively, quickly and correctly

David Bryant

University of Otago

SUQ 2013


The Lasso (a.k.a. L1 regularization)

Consider the linear inverse problem

y = Xβ + ε.

An L1-regularized solution takes a parameter δ and minimizes the penalized residual

‖Xβ − y‖₂² + δ‖β‖₁.

This has the advantage that the solutions are typically sparse, which is why the lasso is used for variable selection.

When δ = 0 we obtain the least squares solution. As δ → ∞, the solutions approach 0.

We can replace ‖β‖₁ with a weighted version ∑ᵢ wᵢ|βᵢ|.

Introduced into statistics by Tibshirani (1996) under the name of ‘the LASSO’.
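To make the objective concrete, here is a minimal proximal-gradient (ISTA) sketch in Python/numpy. This is an illustration only, not the algorithm discussed later in the talk; the names lasso_ista and soft_threshold are invented here.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t·‖·‖₁: shrink every entry toward zero by t.
    # (A weighted penalty ∑ᵢ wᵢ|βᵢ| would just use a per-entry threshold t·wᵢ.)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, delta, n_iter=5000):
    # Minimise ‖Xβ − y‖₂² + δ‖β‖₁ by iterative soft-thresholding.
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * delta)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.r_[3.0, -2.0, np.zeros(8)] + 0.1 * rng.normal(size=50)
for delta in (0.0, 1.0, 10.0, 100.0):
    print(delta, np.count_nonzero(np.abs(lasso_ista(X, y, delta)) > 1e-8))
# The number of non-zero coefficients falls as δ grows, and the
# solution approaches 0 as δ → ∞.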


Computing the Lasso

The computational challenge: efficiently compute LASSO solutions for a range of δ values (ideally, all of them).

First observation (from KKT conditions): the set of LASSO solutions

argmin_β { ‖Xβ − y‖₂² + δ‖β‖₁ }

over δ ≥ 0 equals the set of solutions to

argmin_β { ‖Xβ − y‖₂²  such that ‖β‖₁ ≤ λ }

over λ ≥ 0.


The LARS-Lasso algorithm

Let β(λ) denote the optimal solution for a given λ. That is,

β(λ) = argmin_β ‖Xβ − y‖₂²  such that ‖β‖₁ ≤ λ.

It is not too hard to show that β(λ) is piecewise linear (as a function of λ).

[Figure: the piecewise-linear solution path β(λ).]

The curve starts at 0 and finishes at the unpenalised (OLS) solution.
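Assuming scikit-learn is available, its lars_path function returns exactly these breakpoints of the piecewise-linear path, so the whole curve β(λ) can be recovered by linear interpolation. A sketch with made-up data:

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = X @ np.r_[2.0, -1.5, np.zeros(6)] + 0.1 * rng.normal(size=40)

# method='lasso' gives the lasso path; coefs[:, k] is the solution at
# the k-th breakpoint, and the path is linear between breakpoints.
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)            # (n_features, n_breakpoints)
print(np.abs(coefs).sum(0))   # ‖β‖₁ grows from 0 to its OLS value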


Lasso computations

Osborne, Presnell and Turlach (2000): 394 citations.

Efron, Hastie, Johnstone and Tibshirani (2004): 2909 citations.


The positive Lasso

In many applications (including ours), we require variable coefficients which are non-negative:

β(λ) = argmin_β ‖Xβ − y‖₂²  such that β ≥ 0 and ‖β‖₁ ≤ λ.
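scikit-learn's lars_path also exposes a positive-lasso variant via its positive flag; a quick sketch with invented data:

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
y = X @ np.r_[2.0, 1.5, np.zeros(6)] + 0.1 * rng.normal(size=40)

# positive=True keeps every coefficient non-negative along the path.
alphas, active, coefs = lars_path(X, y, method='lasso', positive=True)
print(bool((coefs >= 0).all()))   # True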


LARS-LASSO

Efron et al. propose a ‘Positive Lasso Lars’ algorithm.

Let β = 0 and c = X′(y − Xβ).
while ‖c‖ > 0
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹ 1
    Move β in direction w until
        an entry of β becomes negative, OR
        new variable(s) join the set of those with maximum cᵢ
    Update c = X′(y − Xβ)
end

Problem: this can fail if more than one variable leaves or enters A at any one time.


Multiple entries

Is this ‘one-at-a-time’ restriction a problem?

NO: you can always add random noise to break ties.

YES: with a positivity constraint, ties appear as part of the algorithm itself (not just from degenerate data). We also ran into these problems with our own degenerate models and problems.


Our algorithm for the positive Lasso

Let β = 0 and c = X′y.
while ‖c‖ > 0
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹ 1
    Find v minimizing ‖X_A(v_A − w_A)‖₂² such that vᵢ ≥ 0 when βᵢ = 0
    Move β in direction v until
        an entry of β goes negative, OR
        new variable(s) join the set A
    Update c = X′(y − Xβ)
end


KKT conditions

For each λ we want to find

min_β ‖Xβ − y‖₂²  such that β ≥ 0 and ‖β‖₁ = λ.  (†)

From the KKT conditions, with c = X′(y − Xβ):

β solves (†) if and only if β is feasible and cᵢ is maximal for all i with βᵢ > 0.
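A sketch of the reasoning (my reconstruction; the slides state only the conclusion). Introduce a multiplier μ for the equality constraint and ν ≥ 0 for β ≥ 0:

\mathcal{L}(\beta,\mu,\nu)
  = \|X\beta - y\|_2^2 + \mu(\mathbf{1}^{\top}\beta - \lambda) - \nu^{\top}\beta,
  \qquad \nu \ge 0.
% Stationarity: 2X^{\top}(X\beta - y) + \mu\mathbf{1} - \nu = 0,
% i.e. with c = X^{\top}(y - X\beta),  2c = \mu\mathbf{1} - \nu.
% Complementary slackness: \nu_i \beta_i = 0, hence
\beta_i > 0 \;\Rightarrow\; \nu_i = 0 \;\Rightarrow\; c_i = \mu/2
  \quad (\text{maximal}), \qquad
\beta_i = 0 \;\Rightarrow\; \nu_i \ge 0 \;\Rightarrow\; c_i \le \mu/2.

So cᵢ equals the common maximum μ/2 exactly on the support of β.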


One step

Consider moving β in direction v. For γ ≥ 0 define

β(γ) = β + γv

so that

c(γ) = X′(y − Xβ(γ)) = c − γX′Xv.

The conditions that β(γ) and c(γ) need to satisfy are then:

Feasibility: β(γ) ≥ 0.

Optimality: c(γ)ᵢ is maximal for all i such that β(γ)ᵢ > 0.

Increasing: ∑ᵢ β(γ)ᵢ > ∑ᵢ βᵢ.


Defining the search direction

Define A = {i : cᵢ maximal}. The LARS-Lasso algorithm considers the (unconstrained) search direction w, where w_A = (X′_A X_A)⁻¹ 1.

From above, the actual search direction should be v satisfying:

For all i ∈ A such that βᵢ = 0, vᵢ ≥ 0.

For all i ∈ A such that βᵢ > 0 or vᵢ > 0, (X′Xv)ᵢ = 1.

For all i ∈ A, (X′Xv)ᵢ ≥ 1.

For all i ∉ A, vᵢ = 0.

These are the KKT conditions for the constrained problem

min_v ‖X_A(v_A − w_A)‖₂²  such that βᵢ = 0 ⇒ vᵢ ≥ 0,

where vᵢ = 0 for all i ∉ A.
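This direction-finding sub-problem is a bound-constrained least squares, so for a sketch it can be handed to scipy.optimize.lsq_linear (the helper name search_direction is invented here; the solver actually used in this work, described later, exploits problem structure instead):

import numpy as np
from scipy.optimize import lsq_linear

def search_direction(X_A, beta_A):
    # Unconstrained LARS direction: w_A = (X_A' X_A)^{-1} 1.
    w = np.linalg.solve(X_A.T @ X_A, np.ones(X_A.shape[1]))
    # Constrained direction: min ‖X_A(v_A − w_A)‖₂²  s.t.  vᵢ ≥ 0 where βᵢ = 0.
    lb = np.where(beta_A == 0.0, 0.0, -np.inf)
    result = lsq_linear(X_A, X_A @ w, bounds=(lb, np.full(len(lb), np.inf)))
    return result.x

rng = np.random.default_rng(3)
X_A = rng.normal(size=(30, 4))
v = search_direction(X_A, beta_A=np.array([0.4, 0.0, 0.2, 0.0]))
print(v)   # the entries with βᵢ = 0 are guaranteed non-negative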


An algorithm for the positive Lasso

(The same algorithm as above, restated now that the search direction has been justified.)

Phylogenetics


Application: linear models in phylogenetics

Each edge in the tree corresponds to a split (bipartition) of the objects into two parts. These splits and their weights determine evolutionary distances between the objects:

[Figure: a tree on taxa A–F with weighted edges; each edge weight is the weight of the corresponding split.]

y contains the observed distances between objects;

X indicates which splits/branches separate which pairs;

β is the vector of split/branch weights to be inferred.

y = Xβ + ε
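A sketch of how this design matrix arises (the names and the tiny example are illustrative): each split is recorded as the set of taxa on one side, and X has a row per pair of taxa and a column per split.

import numpy as np
from itertools import combinations

def split_design_matrix(taxa, splits):
    # X[p, s] = 1 if split s separates the p-th pair of taxa, else 0.
    pairs = list(combinations(taxa, 2))
    X = np.zeros((len(pairs), len(splits)))
    for s, side in enumerate(splits):
        for p, (a, b) in enumerate(pairs):
            X[p, s] = (a in side) != (b in side)
    return X

taxa = ['A', 'B', 'C', 'D']
splits = [{'A'}, {'B'}, {'C'}, {'D'}, {'A', 'B'}]  # four leaf edges + one internal edge
X = split_design_matrix(taxa, splits)
beta = np.array([0.5, 0.2, 0.1, 0.3, 0.2])         # split weights
d = X @ beta   # tree distances between all 6 pairs, in combinations order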


Application: phylogenetic networks

Most collections of splits do not encode a tree; however, they can be represented using a split network.

[Figure: a split network on taxa A–F, with bands of parallel edges representing weighted splits.]

Useful for data exploration, since we can depict conflicting signals and represent the amount of noise.


Phylogenetic networks


From English accents...


...to Swedish worms.


From the origin of modern wheat...


...to the origin of life.


Networks and overfitting

With phylogenetic networks we intentionally over-fit the data.

In practice, many variables (splits) are eliminated using non-negative least squares (NNLS).

A large component of my student Alethea Rea's Ph.D. thesis was devoted to methods for cleaning up the remainder: the Lasso was an obvious choice.
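NNLS is available directly in scipy; a minimal sketch of the elimination step, on invented data:

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
X = rng.uniform(size=(15, 6))
y = X @ np.r_[1.0, 0.5, np.zeros(4)] + 0.01 * rng.normal(size=15)

# min ‖Xβ − y‖₂ subject to β ≥ 0; exact zeros mark eliminated splits.
beta, residual = nnls(X, y)
print(beta)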


Numerical issues

Let n be the number of objects. Then:

X is (n choose 2) × (n choose 2).

X′X is typically poorly conditioned.

X is not sparse, but it is structured, so there are efficient algorithms for the products Xv and X′v.

(X′X)⁻¹ is sparse.
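This is the setting for matrix-free solvers: given fast routines for Xv and X′v, the normal matrix X′X never needs to be formed. A sketch with scipy, in which an explicit random X stands in for the structured fast products:

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
X = rng.normal(size=(45, 10))

# X'X as an implicit operator: only products with X and X' are used.
normal_op = LinearOperator((10, 10), matvec=lambda v: X.T @ (X @ v))

# e.g. solve (X'X) w = 1 for the LARS direction by conjugate gradients.
w, info = cg(normal_op, np.ones(10))
print(info)   # 0 means CG converged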


Algorithm steps with numerical issues

(These slides repeat the positive-lasso algorithm above, now viewed in light of the numerical issues just listed.)

Strategies

The key computation is the choice of search direction:

min_v ‖X_A(v_A − w_A)‖₂²  such that βᵢ = 0 ⇒ vᵢ ≥ 0.

We started with PGCG (thanks to John) but had problems with the conditioning of (X′_A X_A) and with degeneracy.

We ‘regressed’ to an active set method, making use of the sparseness of (X′X)⁻¹, PCG, and the Woodbury formula to solve the sub-problems.
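For the sub-problems, the Woodbury identity lets a previously computed inverse be reused when the active set changes by a column, instead of refactorising. A generic numpy sketch (the function name is invented here):

import numpy as np

def woodbury_solve(A_inv, U, V, b):
    # Solve (A + U V') x = b given A⁻¹, using
    # (A + U V')⁻¹ = A⁻¹ − A⁻¹ U (I + V' A⁻¹ U)⁻¹ V' A⁻¹.
    AinvU = A_inv @ U
    S = np.eye(U.shape[1]) + V.T @ AinvU      # small capacitance matrix
    y = A_inv @ b
    return y - AinvU @ np.linalg.solve(S, V.T @ y)

# Check against a direct solve on a rank-1 update:
rng = np.random.default_rng(6)
A = rng.normal(size=(8, 8)); A = A @ A.T + 8 * np.eye(8)
u = rng.normal(size=(8, 1)); b = rng.normal(size=8)
x = woodbury_solve(np.linalg.inv(A), u, u, b)
print(np.allclose((A + u @ u.T) @ x, b))      # True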


Simple example: full network

[Figure: the full split network, fitted without the lasso.]

Simple example: lasso networks

[Figures: a sequence of lasso networks along the solution path; splits drop out and the network simplifies as the penalty tightens.]

Open problems

1. Making a choice of λ.

2. Weights within the penalty function (adaptive lasso?).

3. More general error distributions.


LASSO sampling

What I like most about the LARS and LARS-Lasso algorithms is that you effectively get an estimate for all possible values of λ: these are the points on the path β(λ).

A Lasso-LARS sampler for π(β | y, λ) would produce a ‘nice’ function β : ℝ → ℝⁿ such that, for each λ, β(λ) has the conditional marginal distribution

β(λ) ∼ π(β | y, λ).

Goal:

1. Sample β(λ₀) from π(β | y, λ₀).

2. Compute the remainder of β(λ) deterministically from β(λ₀) (e.g. numerically).


Summary

We propose a way to ‘correct’ the LARS positive-lasso algorithm to account for degeneracies.

The motivation was applications to phylogenetic networks, though we are exploring other applications.
