
How to lasso positively, quickly and correctly

David Bryant

University of Otago

SUQ 2013


The Lasso (a.k.a. L1 regularization)

Consider the linear inverse problem

y = Xβ + ε.

An L1-regularized solution takes a parameter δ and minimizes the penalized residual

‖Xβ − y‖₂² + δ‖β‖₁.

This has the advantage that the solutions are typically sparse, which is why the lasso is used for variable selection.

When δ = 0 we obtain the least squares solution. As δ → ∞, the solutions approach 0.

We can replace ‖β‖₁ with a weighted version ∑ᵢ wᵢ|βᵢ|.

Introduced into statistics by Tibshirani (1996) under the name of ‘the LASSO’.
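To make the objective concrete, here is a minimal proximal-gradient (ISTA) sketch in Python/numpy. This is an illustration only, not the algorithm discussed later in the talk; the names lasso_ista and soft_threshold are invented here.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t·‖·‖₁: shrink every entry toward zero by t.
    # (A weighted penalty ∑ᵢ wᵢ|βᵢ| would just use a per-entry threshold t·wᵢ.)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, delta, n_iter=5000):
    # Minimise ‖Xβ − y‖₂² + δ‖β‖₁ by iterative soft-thresholding.
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * delta)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.r_[3.0, -2.0, np.zeros(8)] + 0.1 * rng.normal(size=50)
for delta in (0.0, 1.0, 10.0, 100.0):
    print(delta, np.count_nonzero(np.abs(lasso_ista(X, y, delta)) > 1e-8))
# The number of non-zero coefficients falls as δ grows, and the
# solution approaches 0 as δ → ∞.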


Computing the Lasso

The computational challenge: efficiently compute LASSO solutions for a range of δ values (ideally, all of them).

First observation (from KKT conditions): the set of LASSO solutions

argmin_β { ‖Xβ − y‖₂² + δ‖β‖₁ }

over δ ≥ 0 equals the set of solutions to

argmin_β { ‖Xβ − y‖₂²  such that ‖β‖₁ ≤ λ }

over λ ≥ 0.


The LARS-Lasso algorithm

Let β(λ) denote the optimal solution for a given λ. That is,

β(λ) = argmin_β ‖Xβ − y‖₂²  such that ‖β‖₁ ≤ λ.

It is not too hard to show that β(λ) is piecewise linear (as a function of λ).

[Figure: the piecewise-linear solution path β(λ).]

The curve starts at 0 and finishes at the unpenalised (OLS) solution.
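Assuming scikit-learn is available, its lars_path function returns exactly these breakpoints of the piecewise-linear path, so the whole curve β(λ) can be recovered by linear interpolation. A sketch with made-up data:

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = X @ np.r_[2.0, -1.5, np.zeros(6)] + 0.1 * rng.normal(size=40)

# method='lasso' gives the lasso path; coefs[:, k] is the solution at
# the k-th breakpoint, and the path is linear between breakpoints.
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)            # (n_features, n_breakpoints)
print(np.abs(coefs).sum(0))   # ‖β‖₁ grows from 0 to its OLS value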


Lasso computations

Osborne, Presnell and Turlach (2000): 394 citations.

Efron, Hastie, Johnstone and Tibshirani (2004): 2909 citations.


The positive Lasso

In many applications (including ours), we require variable coefficients which are non-negative:

β(λ) = argmin_β ‖Xβ − y‖₂²  such that β ≥ 0 and ‖β‖₁ ≤ λ.
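scikit-learn's lars_path also exposes a positive-lasso variant via its positive flag; a quick sketch with invented data:

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
y = X @ np.r_[2.0, 1.5, np.zeros(6)] + 0.1 * rng.normal(size=40)

# positive=True keeps every coefficient non-negative along the path.
alphas, active, coefs = lars_path(X, y, method='lasso', positive=True)
print(bool((coefs >= 0).all()))   # True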


LARS-LASSO

Efron et al. propose a ‘Positive Lasso Lars’ algorithm.

Let β = 0 and c = X′(y − Xβ).
while ‖c‖ > 0
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹ 1
    Move β in direction w until
        an entry of β becomes negative, OR
        new variable(s) join the set of those with maximum cᵢ
    Update c = X′(y − Xβ)
end

Problem: this can fail if more than one variable leaves or enters A at any one time.


Multiple entries

Is this ‘one-at-a-time’ restriction a problem?

NO: you can always add random noise to break ties.

YES: with a positivity constraint, ties appear as part of the algorithm itself (not just from degenerate data). We also ran into these problems with our own degenerate models and problems.


Our algorithm for the positive Lasso

Let β = 0 and c = X′y.
while ‖c‖ > 0
    A = {i : cᵢ is maximum}
    Choose the search direction w_A = (X′_A X_A)⁻¹ 1
    Find v minimizing ‖X_A(v_A − w_A)‖₂² such that vᵢ ≥ 0 when βᵢ = 0
    Move β in direction v until
        an entry of β goes negative, OR
        new variable(s) join the set A
    Update c = X′(y − Xβ)
end


KKT conditions

For each λ we want to find

min_β ‖Xβ − y‖₂²  such that β ≥ 0 and ‖β‖₁ = λ.  (†)

From the KKT conditions, with c = X′(y − Xβ):

β solves (†) if and only if β is feasible and cᵢ is maximal for all i with βᵢ > 0.
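A sketch of the reasoning (my reconstruction; the slides state only the conclusion). Introduce a multiplier μ for the equality constraint and ν ≥ 0 for β ≥ 0:

\mathcal{L}(\beta,\mu,\nu)
  = \|X\beta - y\|_2^2 + \mu(\mathbf{1}^{\top}\beta - \lambda) - \nu^{\top}\beta,
  \qquad \nu \ge 0.
% Stationarity: 2X^{\top}(X\beta - y) + \mu\mathbf{1} - \nu = 0,
% i.e. with c = X^{\top}(y - X\beta),  2c = \mu\mathbf{1} - \nu.
% Complementary slackness: \nu_i \beta_i = 0, hence
\beta_i > 0 \;\Rightarrow\; \nu_i = 0 \;\Rightarrow\; c_i = \mu/2
  \quad (\text{maximal}), \qquad
\beta_i = 0 \;\Rightarrow\; \nu_i \ge 0 \;\Rightarrow\; c_i \le \mu/2.

So cᵢ equals the common maximum μ/2 exactly on the support of β.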


One step

Consider moving β in direction v. For γ ≥ 0 define

β(γ) = β + γv

so that

c(γ) = X′(y − Xβ(γ)) = c − γX′Xv.

The conditions that β(γ) and c(γ) need to satisfy are then:

Feasibility: β(γ) ≥ 0.

Optimality: c(γ)ᵢ is maximal for all i such that β(γ)ᵢ > 0.

Increasing: ∑ᵢ β(γ)ᵢ > ∑ᵢ βᵢ.


Defining the search direction

Define A = {i : cᵢ maximal}. The LARS-Lasso algorithm considers the (unconstrained) search direction w, where w_A = (X′_A X_A)⁻¹ 1.

From above, the actual search direction should be v satisfying:

For all i ∈ A such that βᵢ = 0, vᵢ ≥ 0.

For all i ∈ A such that βᵢ > 0 or vᵢ > 0, (X′Xv)ᵢ = 1.

For all i ∈ A, (X′Xv)ᵢ ≥ 1.

For all i ∉ A, vᵢ = 0.

These are the KKT conditions for the constrained problem

min_v ‖X_A(v_A − w_A)‖₂²  such that βᵢ = 0 ⇒ vᵢ ≥ 0,

where vᵢ = 0 for all i ∉ A.
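This direction-finding sub-problem is a bound-constrained least squares, so for a sketch it can be handed to scipy.optimize.lsq_linear (the helper name search_direction is invented here; the solver actually used in this work, described later, exploits problem structure instead):

import numpy as np
from scipy.optimize import lsq_linear

def search_direction(X_A, beta_A):
    # Unconstrained LARS direction: w_A = (X_A' X_A)^{-1} 1.
    w = np.linalg.solve(X_A.T @ X_A, np.ones(X_A.shape[1]))
    # Constrained direction: min ‖X_A(v_A − w_A)‖₂²  s.t.  vᵢ ≥ 0 where βᵢ = 0.
    lb = np.where(beta_A == 0.0, 0.0, -np.inf)
    result = lsq_linear(X_A, X_A @ w, bounds=(lb, np.full(len(lb), np.inf)))
    return result.x

rng = np.random.default_rng(3)
X_A = rng.normal(size=(30, 4))
v = search_direction(X_A, beta_A=np.array([0.4, 0.0, 0.2, 0.0]))
print(v)   # the entries with βᵢ = 0 are guaranteed non-negative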


An algorithm for the positive Lasso

(The same algorithm as above, restated now that the search direction has been justified.)

Phylogenetics


Application: linear models in phylogenetics

Each edge in the tree corresponds to a split (bipartition) of the objects into two parts. These splits and their weights determine evolutionary distances between the objects:

[Figure: a tree on taxa A–F with weighted edges; each edge weight is the weight of the corresponding split.]

y contains the observed distances between objects;

X indicates which splits/branches separate which pairs;

β is the vector of split/branch weights to be inferred.

y = Xβ + ε
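A sketch of how this design matrix arises (the names and the tiny example are illustrative): each split is recorded as the set of taxa on one side, and X has a row per pair of taxa and a column per split.

import numpy as np
from itertools import combinations

def split_design_matrix(taxa, splits):
    # X[p, s] = 1 if split s separates the p-th pair of taxa, else 0.
    pairs = list(combinations(taxa, 2))
    X = np.zeros((len(pairs), len(splits)))
    for s, side in enumerate(splits):
        for p, (a, b) in enumerate(pairs):
            X[p, s] = (a in side) != (b in side)
    return X

taxa = ['A', 'B', 'C', 'D']
splits = [{'A'}, {'B'}, {'C'}, {'D'}, {'A', 'B'}]  # four leaf edges + one internal edge
X = split_design_matrix(taxa, splits)
beta = np.array([0.5, 0.2, 0.1, 0.3, 0.2])         # split weights
d = X @ beta   # tree distances between all 6 pairs, in combinations order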


Application: phylogenetic networks

Most collections of splits do not encode a tree; however, they can be represented using a split network.

[Figure: a split network on taxa A–F, with bands of parallel edges representing weighted splits.]

Useful for data exploration, since we can depict conflicting signals and represent the amount of noise.


Phylogenetic networks


From English accents...


...to Swedish worms.


From the origin of modern wheat...


...to the origin of life.


Networks and overfitting

With phylogenetic networks we intentionally over-fit the data.

In practice, many variables (splits) are eliminated using non-negative least squares (NNLS).

A large component of my student Alethea Rea's Ph.D. thesis was devoted to methods for cleaning up the remainder: the Lasso was an obvious choice.
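NNLS is available directly in scipy; a minimal sketch of the elimination step, on invented data:

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
X = rng.uniform(size=(15, 6))
y = X @ np.r_[1.0, 0.5, np.zeros(4)] + 0.01 * rng.normal(size=15)

# min ‖Xβ − y‖₂ subject to β ≥ 0; exact zeros mark eliminated splits.
beta, residual = nnls(X, y)
print(beta)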


Numerical issues

Let n be the number of objects. Then:

X is (n choose 2) × (n choose 2).

X′X is typically poorly conditioned.

X is not sparse, but it is structured, so there are efficient algorithms for the products Xv and X′v.

(X′X)⁻¹ is sparse.
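This is the setting for matrix-free solvers: given fast routines for Xv and X′v, the normal matrix X′X never needs to be formed. A sketch with scipy, in which an explicit random X stands in for the structured fast products:

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
X = rng.normal(size=(45, 10))

# X'X as an implicit operator: only products with X and X' are used.
normal_op = LinearOperator((10, 10), matvec=lambda v: X.T @ (X @ v))

# e.g. solve (X'X) w = 1 for the LARS direction by conjugate gradients.
w, info = cg(normal_op, np.ones(10))
print(info)   # 0 means CG converged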


Algorithm steps with numerical issues

(These slides repeat the positive-lasso algorithm above, now viewed in light of the numerical issues just listed.)

Strategies

The key computation is the choice of search direction:

min_v ‖X_A(v_A − w_A)‖₂²  such that βᵢ = 0 ⇒ vᵢ ≥ 0.

We started with PGCG (thanks to John) but had problems with the conditioning of (X′_A X_A) and with degeneracy.

We ‘regressed’ to an active set method, making use of the sparseness of (X′X)⁻¹, PCG, and the Woodbury formula to solve the sub-problems.
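For the sub-problems, the Woodbury identity lets a previously computed inverse be reused when the active set changes by a column, instead of refactorising. A generic numpy sketch (the function name is invented here):

import numpy as np

def woodbury_solve(A_inv, U, V, b):
    # Solve (A + U V') x = b given A⁻¹, using
    # (A + U V')⁻¹ = A⁻¹ − A⁻¹ U (I + V' A⁻¹ U)⁻¹ V' A⁻¹.
    AinvU = A_inv @ U
    S = np.eye(U.shape[1]) + V.T @ AinvU      # small capacitance matrix
    y = A_inv @ b
    return y - AinvU @ np.linalg.solve(S, V.T @ y)

# Check against a direct solve on a rank-1 update:
rng = np.random.default_rng(6)
A = rng.normal(size=(8, 8)); A = A @ A.T + 8 * np.eye(8)
u = rng.normal(size=(8, 1)); b = rng.normal(size=8)
x = woodbury_solve(np.linalg.inv(A), u, u, b)
print(np.allclose((A + u @ u.T) @ x, b))      # True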


Simple example: full network

[Figure: the full split network, fitted without the lasso.]

Simple example: lasso networks

[Figures: a sequence of lasso networks along the solution path; splits drop out and the network simplifies as the penalty tightens.]

Open problems

1. Making a choice of λ.

2. Weights within the penalty function (adaptive lasso?).

3. More general error distributions.


LASSO sampling

What I like most about the LARS and LARS-Lasso algorithms is that you effectively get an estimate for all possible values of λ: these are the points on the path β(λ).

A Lasso-LARS sampler for π(β | y, λ) would produce a ‘nice’ function β : ℝ → ℝⁿ such that, for each λ, β(λ) has the conditional marginal distribution

β(λ) ∼ π(β | y, λ).

Goal:

1. Sample β(λ₀) from π(β | y, λ₀).

2. Compute the remainder of β(λ) deterministically from β(λ₀) (e.g. numerically).


Summary

We propose a way to ‘correct’ the LARS positive-lasso algorithm to account for degeneracies.

The motivation was applications to phylogenetic networks, though we are exploring other applications.
