Regression using Apache SystemML by Alexandre V Evfimievski


Transcript of Regression using Apache SystemML by Alexandre V Evfimievski

Regression in SystemML

Alexandre Evfimievski

Linear Regression

• INPUT: Records (x1, y1), (x2, y2), …, (xn, yn)
  – Each xi is m-dimensional: xi1, xi2, …, xim
  – Each yi is 1-dimensional

• Want to approximate yi as a linear combination of the xi-entries
  – yi ≈ β1 xi1 + β2 xi2 + … + βm xim
  – Case m = 1: yi ≈ β1 xi1  (note: x = 0 maps to y = 0)

• Intercept: a "free parameter" for the default value of yi
  – yi ≈ β1 xi1 + β2 xi2 + … + βm xim + βm+1
  – Case m = 1: yi ≈ β1 xi1 + β2

• Matrix notation: Y ≈ Xβ, or Y ≈ (X | 1) β with an intercept
  – X is n × m, Y is n × 1, β is m × 1 or (m+1) × 1

Linear Regression: Least Squares

• How to aggregate the errors yi – (β1 xi1 + β2 xi2 + … + βm xim)?
  – What's worse: many small errors, or a few big errors?

• Sum of squares: ∑i≤n (yi – (β1 xi1 + β2 xi2 + … + βm xim))² → min
  – A few big errors are much worse: we square them!

• Matrix notation: (Y – Xβ)ᵀ (Y – Xβ) → min

• Good news: easy to solve and find the β's
• Bad news: too sensitive to outliers!

Linear Regression: Direct Solve

• (Y – Xβ)ᵀ (Y – Xβ) → min
• YᵀY – Yᵀ(Xβ) – (Xβ)ᵀY + (Xβ)ᵀ(Xβ) → min
• ½ βᵀ(XᵀX) β – βᵀ(XᵀY) → min  (dropping the constant YᵀY and dividing by 2)

• Take the gradient and set it to 0: (XᵀX) β – (XᵀY) = 0
• Linear equation: (XᵀX) β = XᵀY;  solution: β = (XᵀX)⁻¹ (XᵀY)

A = t(X) %*% X;
b = t(X) %*% y;
. . .
. . .
beta_unscaled = solve (A, b);

Computation of XᵀX

• The input (n × m)-matrix X is often huge and sparse
  – The rows X[i,] make up n records, often n >> 10⁶
  – The columns X[,j] are the features

• The matrix XᵀX is (m × m) and dense
  – Cells: (XᵀX)[j1, j2] = ∑i≤n X[i, j1] * X[i, j2]
  – Part of the covariance between features #j1 and #j2 across all records
  – m could be small or large

• If m ≤ 1000, XᵀX is small and "direct solve" is efficient…
  – …as long as XᵀX is computed the right way!
  – …and as long as XᵀX is invertible (no linearly dependent features)

Computation of XᵀX

• Naïve computation:
  a) Read X into memory
  b) Copy it and rearrange its cells into the transpose
  c) Multiply two huge matrices, Xᵀ and X

• There is a better way: XᵀX = ∑i≤n X[i,]ᵀ X[i,]  (outer products; see the sketch below)
  – For all i = 1, …, n in parallel:
    a) Read one row X[i,]
    b) Compute the (m × m)-matrix Mi[j1, j2] = X[i, j1] * X[i, j2]
    c) Aggregate: M = M + Mi

• Extends to (XᵀX) v and Xᵀ diag(w) X, used in other scripts:
  – (XᵀX) v = ∑i≤n (∑j≤m X[i, j] v[j]) * X[i,]ᵀ
  – Xᵀ diag(w) X = ∑i≤n wi * X[i,]ᵀ X[i,]
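To make the row-wise view concrete, here is a minimal DML sketch of the outer-product accumulation. It is illustrative only: in the actual scripts one simply writes t(X) %*% X and lets the SystemML optimizer choose this access pattern.

# Illustrative DML sketch: accumulate t(X) %*% X from row outer products.
n = nrow (X);
m = ncol (X);
M = matrix (0, rows = m, cols = m);
for (i in 1:n) {
    xi = X [i, ];              # one row of X, 1 x m
    M = M + t(xi) %*% xi;      # outer product X[i,]' %*% X[i,], an m x m matrix
}
# M now equals t(X) %*% X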

Conjugate Gradient

• What if XᵀX is too large, m >> 1000?
  – A dense XᵀX may take far more memory than the sparse X

• The full XᵀX is not needed to solve (XᵀX) β = XᵀY
  – Use an iterative method
  – Only evaluate (XᵀX) v for certain vectors v

• Example: Gradient Descent for f(β) = ½ βᵀ(XᵀX) β – βᵀ(XᵀY)  (sketch below)
  – Start with any β = β0
  – Take the gradient: r = ∇f(β) = (XᵀX) β – (XᵀY)  (also the residual)
  – Find the number a that minimizes f(β + a · r): a = – (rᵀr) / (rᵀ XᵀX r)
  – Update: βnew ← β + a · r

• But the gradient is too local – and "forgetful"
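A minimal DML sketch of this gradient-descent iteration (for illustration only; the fixed iteration count and the variable names are assumptions, not part of the shipped scripts):

# Illustrative gradient descent for f(beta) = 1/2 beta'(X'X)beta - beta'(X'y)
beta = matrix (0, rows = ncol (X), cols = 1);
for (iter in 1:100) {                          # fixed iteration count, illustration only
    r = t(X) %*% (X %*% beta) - t(X) %*% y;    # gradient = residual of the normal equations
    Xr = X %*% r;
    a = - sum (r ^ 2) / sum (Xr ^ 2);          # step size a = -(r'r)/(r'X'Xr) minimizing f(beta + a*r)
    beta = beta + a * r;                       # descent step
}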

Conjugate Gradient

• PROBLEM: the gradient takes a very similar direction many times
• Enforce orthogonality to the prior directions?
  – Take the gradient: r = (XᵀX) β – (XᵀY)
  – Subtract the prior directions: p(k) = r – λ1 p(1) – … – λk-1 p(k-1)

• Pick λi to ensure (p(k) · p(i)) = 0?
  – Find the number a(k) that minimizes f(β + a(k) · p(k)), etc.

• STILL, PROBLEMS:
  – The value a(k) does NOT minimize f(a(1)· p(1) + … + a(k)· p(k) + … + a(m)· p(m))
  – Keep all prior directions p(1), p(2), …, p(k)? That's a lot!

• SOLUTION: enforce conjugacy
  – Conjugate vectors: uᵀ(XᵀX)v = 0, instead of uᵀv = 0

• The matrix XᵀX acts as the "metric" in a distorted space
  – This does minimize f(a(1)· p(1) + … + a(k)· p(k) + … + a(m)· p(m))

• And only p(k-1) and r(k) are needed to compute p(k)

Conjugate Gradient

• Algorithm, step by step

i = 0;
beta = matrix (0, ...);                       # initially: beta = 0
r = - t(X) %*% y;                             # residual & gradient: r = (X'X)beta - (X'y)
p = - r;                                      # direction for beta: the negative gradient
norm_r2 = sum (r ^ 2);                        # norm of the residual error = r'r
norm_r2_target = norm_r2 * tolerance ^ 2;     # desired norm of the residual error
while (i < mi & norm_r2 > norm_r2_target)
{                                             # WE HAVE: p is the next direction for beta
    q = t(X) %*% (X %*% p) + lambda * p;      # q = (X'X)p, plus the L2-regularization term lambda*p
    a = norm_r2 / sum (p * q);                # a = r'r / p'(X'X)p minimizes f(beta + a*p)
    beta = beta + a * p;                      # update: beta_new <- beta + a*p
    r = r + a * q;                            # r_new <- (X'X)(beta + a*p) - (X'y) = r + a*(X'X)p
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                    # update the norm of the residual error = r'r
    p = - r + (norm_r2 / old_norm_r2) * p;    # update direction: (1) take the negative gradient;
                                              # (2) enforce conjugacy with the previous direction
    i = i + 1;                                # conjugacy to all older directions is automatic!
}

Degeneracy and Regularization

• PROBLEM: what if X has linearly dependent columns?
  – Cause: recoding categorical features, adding composite features
  – Then XᵀX is not a "metric": there exists ǁpǁ > 0 such that pᵀ(XᵀX)p = 0
  – In the CG step a = rᵀr / pᵀ(XᵀX)p: division by zero!

• In fact, Least Squares then has ∞ many solutions
  – Most of them have HUGE β-values

• Regularization: penalize β with larger values
  – L2-regularization: (Y – Xβ)ᵀ (Y – Xβ) + λ · βᵀβ → min
  – Replace XᵀX with XᵀX + λI
  – Pick λ << diag(XᵀX), refine by cross-validation
  – Do NOT regularize the intercept (see the sketch below)

• CG:  q = t(X) %*% (X %*% p) + lambda * p;
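One simple way to leave the intercept unregularized is a per-coefficient lambda vector with a zero in the intercept position. A minimal sketch of the idea (the names lambda_vec, m_ext and the appended intercept column X_ext are illustrative assumptions, not the exact script variables):

# Illustrative: L2-regularize all betas except the intercept (kept in the last position).
m_ext = ncol (X) + 1;                                  # features plus the intercept column
lambda_vec = matrix (lambda, rows = m_ext, cols = 1);
lambda_vec [m_ext, 1] = 0;                             # do NOT penalize the intercept
# In the CG step, the penalty becomes an elementwise product:
# q = t(X_ext) %*% (X_ext %*% p) + lambda_vec * p;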

Shifting and Scaling X

• PROBLEM: features have vastly different ranges
  – Examples: [0, 1]; [2010, 2015]; [$0.01, $1 Billion]

• Each βi in Y ≈ Xβ then has a different size & accuracy?
  – The regularization λ · βᵀβ is also range-dependent?

• SOLUTION: scale & shift the features to mean = 0, variance = 1  (sketch below)
  – Needs an intercept: Y ≈ (X | 1) β
  – Equivalently: (Xnew | 1) = (X | 1) %*% SST, the "Shift-Scale Transform"

• BUT: a sparse X becomes a dense Xnew …

• SOLUTION: (Xnew | 1) %*% M = (X | 1) %*% (SST %*% M)
  – Extends to XᵀX and other X-products
  – Further optimization: SST has a special shape
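A minimal sketch of how per-column shift and scale vectors could be computed in DML. The exact formulas and the names scale_X and shift_X are assumptions made for illustration; they mirror the snippet on the next slide but are not copied from the script:

# Illustrative: per-column shift/scale so that each feature gets mean 0, variance 1.
n = nrow (X);
avg_X = colSums (X) / n;                                # 1 x m vector of column means
var_X = (colSums (X ^ 2) - n * (avg_X ^ 2)) / (n - 1);  # 1 x m vector of column variances
scale_X = t (1.0 / sqrt (var_X));                       # m x 1; assumes no constant columns
shift_X = - t (avg_X) * scale_X;                        # m x 1
# Conceptually: X_new [, j] = (X [, j] - avg_X [1, j]) / sqrt (var_X [1, j])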

Shifting and Scaling X – Linear Regression Direct Solve

Code snippet example:

A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
    A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
    A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
    b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);

if (intercept_status == 2) {
    beta = scale_X * beta_unscaled;
    beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
    beta = beta_unscaled;
}

Regression in Statistics

• Model: Y = Xβ* + ε, where ε is a random vector
  – There exists a "true" β*
  – Each yi is Gaussian with mean μi = Xiβ* and variance σ²

• Likelihood maximization to estimate β*
  – Likelihood: ℓ(Y | X, β, σ) = ∏i≤n C(σ) · exp(– (yi – Xiβ)² / 2σ²)
  – Log ℓ(Y | X, β, σ) = n · c(σ) – ∑i≤n (yi – Xiβ)² / 2σ²
  – Maximum likelihood over β = Least Squares

• Why do we need the statistical view?
  – Confidence intervals for the parameters
  – Goodness-of-fit tests
  – Generalizations: replace the Gaussian with another distribution

Maximum Likelihood Estimator

• In each (xi, yi), let yi have distribution ℓ(yi | xi, β, φ)
  – The records are mutually independent for i = 1, …, n

• An estimator for β is a function f(X, Y)
  – Y is random → f(X, Y) is random
  – Unbiased estimator: for all β, the mean E f(X, Y) = β

• Maximum likelihood estimator
  – MLE(X, Y) = argmaxβ ∏i≤n ℓ(yi | xi, β, φ)
  – Asymptotically unbiased: E MLE(X, Y) → β as n → ∞

• Cramér–Rao bound
  – For unbiased estimators, Var f(X, Y) ≥ FI(X, β, φ)⁻¹
  – Fisher information: FI(X, β, φ) = – EY Hessianβ log ℓ(Y | X, β, φ)
  – For the MLE: Var(MLE(X, Y)) → FI(X, β, φ)⁻¹ as n → ∞

Variance of M.L.E.

• The Cramér–Rao bound is a simple way to estimate the variance of the predicted parameters (for large n):
  1. Maximize log ℓ(Y | X, β, φ) to estimate β
  2. Compute the Hessian (2nd derivatives) of log ℓ(Y | X, β, φ)
  3. Compute the "expected" Hessian: FI = – EY Hessian
  4. Invert FI as a matrix: get FI⁻¹
  5. Use FI⁻¹ as the approximate covariance matrix for the estimated β

• For linear regression:
  – Log ℓ(Y | X, β, σ) = n · c(σ) – ∑i≤n (yi – Xiβ)² / 2σ²
  – Hessian = – (1/σ²) · XᵀX;  FI = (1/σ²) · XᵀX
  – Cov β ≈ σ² · (XᵀX)⁻¹;  Var βj ≈ σ² · diag((XᵀX)⁻¹)j
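The same formulas as a minimal DML sketch. It is illustrative only: it assumes beta has already been estimated, that X carries no intercept column, and it uses a matrix inverse for clarity even though in practice the factorization from the solve step would be reused:

# Illustrative: approximate covariance of the estimated betas in linear regression.
n = nrow (X);
m = ncol (X);
res = y - X %*% beta;                    # residuals
sigma2 = sum (res ^ 2) / (n - m);        # error-variance estimate (use n - m - 1 with an intercept)
cov_beta = sigma2 * inv (t(X) %*% X);    # Cov(beta) ~ sigma^2 * (X'X)^-1
var_beta = diag (cov_beta);              # per-coefficient variances
std_err = sqrt (var_beta);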

Variance of Y given X

• The MLE for the variance of Y = 1/n · ∑i≤n (yi – y_avg)²
  – To make it unbiased, replace 1/n with 1/(n – 1)

• The variance of ε in Y = Xβ* + ε is the residual variance
  – Estimator for Var(ε) = 1/(n – m – 1) · ∑i≤n (yi – Xiβ)²

• A good regression must have: Var(ε) << Var(Y)
  – "Explained" variance = Var(Y) – Var(ε)

• R-squared: estimate 1 – Var(ε)/Var(Y) to test the fitness:
  – R²plain = 1 – (∑i≤n (yi – Xiβ)²) / (∑i≤n (yi – y_avg)²)
  – R²adj. = 1 – (∑i≤n (yi – Xiβ)²) / (∑i≤n (yi – y_avg)²) · (n – 1)/(n – m – 1)

• Pearson residual: ri = (yi – Xiβ) / Var(ε)^(1/2)
  – Should be approximately Gaussian with mean 0 and variance 1
  – Can be used in another fitness test (more on tests later)
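The same statistics as a minimal DML sketch (illustrative; it assumes beta has been fitted and that m = ncol(X) counts the non-intercept features):

# Illustrative: R-squared statistics and Pearson residuals.
n = nrow (X);
m = ncol (X);
y_avg = sum (y) / n;
ss_res = sum ((y - X %*% beta) ^ 2);                # residual sum of squares
ss_tot = sum ((y - y_avg) ^ 2);                     # total sum of squares
R2_plain = 1 - ss_res / ss_tot;
R2_adj = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);
var_eps = ss_res / (n - m - 1);                     # residual-variance estimate
pearson_res = (y - X %*% beta) / sqrt (var_eps);    # should look roughly like N(0, 1)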

LinReg Scripts: Inputs

# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME   TYPE    DEFAULT    MEANING
# --------------------------------------------------------------------------------------------
# X      String  ---        Location (on HDFS) to read the matrix X of feature vectors
# Y      String  ---        Location (on HDFS) to read the 1-column matrix Y of response values
# B      String  ---        Location to store estimated regression parameters (the betas)
# O      String  " "        Location to write the printed statistics; by default is standard output
# Log    String  " "        Location to write per-iteration variables for log/debugging purposes
# icpt   Int     0          Intercept presence, shifting and rescaling the columns of X:
#                           0 = no intercept, no shifting, no rescaling;
#                           1 = add intercept, but neither shift nor rescale X;
#                           2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg    Double  0.000001   Regularization constant (lambda) for L2-regularization; set to nonzero
#                           for highly dependent/sparse/numerous features
# tol    Double  0.000001   Tolerance (epsilon); the conjugate gradient procedure terminates early if
#                           the L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi   Int     0          Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt    String  "text"     Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas); its size depends on the icpt input value:
# OUTPUT SIZE:             OUTPUT CONTENTS:                 HOW TO PREDICT Y FROM X AND B:
# icpt=0: ncol(X) x 1      Betas for X only                 Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1: ncol(X)+1 x 1    Betas for X and intercept        Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2: ncol(X)+1 x 2    Col.1: betas for X & intercept   Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
#                          Col.2: betas for shifted/rescaled X and intercept

LinReg Scripts: Outputs

# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per each line, as follows:
#
# NAME                 MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y            Average of the response value Y
# STDEV_TOT_Y          Standard Deviation of the response value Y
# AVG_RES_Y            Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y          Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION           GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2             Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2          Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS      Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS   Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0      * Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0   * Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME                 MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM     L2-norm of Conj.Grad. residual, which is A %*% beta - t(X) %*% y
#                      where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO    Ratio of current L2-norm of Conj.Grad. residual over the initial
# -------------------------------------------------------------------------------------

Caveats

• Overfitting: the β's reflect individual records in X, not the distribution
  – Typically, too few records (small n) or too many features (large m)
  – To detect it, use cross-validation
  – To mitigate it, select fewer features; regularization may help too

• Outliers: some records in X are highly abnormal
  – They badly violate the distribution, or have very large cell values
  – Check the MIN and MAX of Y, of the X-columns, of Xiβ, and of ri² = (yi – Xiβ)² / Var(ε)
  – To mitigate, remove the outliers, or change the distribution or the link function

• Interpolation vs. extrapolation
  – A model trained on one kind of data may not carry over to another kind of data; the past may not predict the future
  – Great research topic!

Generalized Linear Models

• Linear Regression: Y = Xβ* + ε
  – Each yi is Normal(μi, σ²) where the mean μi = Xiβ*
  – Variance(yi) = σ² = constant

• Logistic Regression:
  – Each yi is Bernoulli(μi) where the mean μi = 1/(1 + exp(– Xiβ*))
  – Prob[yi = 1] = μi, Prob[yi = 0] = 1 – μi, mean = probability of 1
  – Variance(yi) = μi (1 – μi)

• Poisson Regression:
  – Each yi is Poisson(μi) where the mean μi = exp(Xiβ*)
  – Prob[yi = k] = (μi)^k exp(– μi) / k! for k = 0, 1, 2, …
  – Variance(yi) = μi

• Only in Linear Regression do we add an error εi to the mean μi  (sketch below)
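For a concrete side-by-side view of the three mean functions, here is a small DML sketch. It is illustrative only: beta stands for an already-fitted coefficient vector, and only the mean/variance formulas from this slide are used:

# Illustrative: the mean of y_i under each model, given the linear term X_i beta.
eta = X %*% beta;                       # linear predictor, n x 1
mu_linear = eta;                        # Normal:    mu_i = X_i beta,             Var = sigma^2
mu_logistic = 1 / (1 + exp (- eta));    # Bernoulli: mu_i = 1/(1+exp(-X_i beta)), Var = mu (1 - mu)
mu_poisson = exp (eta);                 # Poisson:   mu_i = exp(X_i beta),        Var = mu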

Generalized Linear Models

• GLM Regression:
  – Each yi has distribution = exp{(yi · θi – b(θi)) / a + c(yi, a)}
  – The canonical parameter θi represents the mean: μi = b′(θi)
  – The link function connects μi and Xiβ*: Xiβ* = g(μi), μi = g⁻¹(Xiβ*)
  – Variance(yi) = a · b″(θi)

• Example: Linear Regression as a GLM
  – C(σ) · exp(– (yi – Xiβ)² / 2σ²) = exp{(yi · θi – b(θi)) / a + c(yi, a)}
  – θi = μi = Xiβ;  b(θi) = (Xiβ)² / 2;  a = σ² = variance
  – Link function = identity;  c(yi, a) = – yi²/2σ² + log C(σ)

• Example: Logistic Regression as a GLM
  – (μi)^yi (1 – μi)^(1 – yi) = exp{yi · log(μi) – yi · log(1 – μi) + log(1 – μi)}
      = exp{(yi · θi – b(θi)) / a + c(yi, a)}
  – θi = log(μi / (1 – μi)) = Xiβ;  b(θi) = – log(1 – μi) = log(1 + exp(θi))
  – Link function = log(μ / (1 – μ));  Variance = μ (1 – μ);  a = 1

Generalized Linear Models

• GLM Regression:
  – Each yi has distribution = exp{(yi · θi – b(θi)) / a + c(yi, a)}
  – The canonical parameter θi represents the mean: μi = b′(θi)
  – The link function connects μi and Xiβ*: Xiβ* = g(μi), μi = g⁻¹(Xiβ*)
  – Variance(yi) = a · b″(θi)

• Why θi? What is b(θi)?
  – θi makes the formulas simpler; it stands for μi (no big deal)
  – b(θi) defines which distribution it is: linear, logistic, Poisson, etc.
  – b(θi) connects the mean with the variance: Var(yi) = a · b″(θi), μi = b′(θi)

• What is the link function?
  – You choose it to link μi with your feature sum β1 xi1 + β2 xi2 + … + βm xim
  – Additive effects: μi = Xiβ;  multiplicative effects: μi = exp(Xiβ)
  – Bayes-law effects: μi = 1/(1 + exp(– Xiβ));  inverse: μi = 1/(Xiβ)
  – Xiβ has range (–∞, +∞), but μi may range in [0, 1], [0, +∞), etc.

GLMs We Support

• We specify a GLM by:
  – The mean-to-variance connection
  – The link function (mean-to-feature-sum connection)

• Mean-to-variance for common distributions:
  – Var(yi) = a · (μi)⁰ = σ²: Linear / Gaussian
  – Var(yi) = a · μi (1 – μi): Logistic / Binomial
  – Var(yi) = a · (μi)¹: Poisson
  – Var(yi) = a · (μi)²: Gamma
  – Var(yi) = a · (μi)³: Inverse Gaussian

• We support two types: Power and Binomial
  – Var(yi) = a · (μi)^α: Power, for any α
  – Var(yi) = a · μi (1 – μi): Binomial

GLMs We Support

• We specify a GLM by:
  – The mean-to-variance connection
  – The link function (mean-to-feature-sum connection)

Supported link functions:

• Power: Xiβ = (μi)^s, where s = 0 stands for Xiβ = log(μi)
  – Examples: identity, inverse, log, square root

• Link functions used in binomial / logistic regression:
  – Logit, Probit, Cloglog, Cauchit
  – Link the Xiβ-range (–∞, +∞) with the μi-range (0, 1)
  – Differ in tail behavior

• Canonical link function:
  – Makes Xiβ = the canonical parameter θi, i.e. sets μi = b′(Xiβ)
  – Power link Xiβ = (μi)^(1–α) if Var = a · (μi)^α;  Logit link for Binomial

GLM Script Inputs

# NAME   TYPE    DEFAULT    MEANING
# ---------------------------------------------------------------------------------------------
# X      String  ---        Location to read the matrix X of feature vectors
# Y      String  ---        Location to read response matrix Y with either 1 or 2 columns:
#                           if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B      String  ---        Location to store estimated regression parameters (the betas)
# fmt    String  "text"     The betas matrix output format, such as "text" or "csv"
# O      String  " "        Location to write the printed statistics; by default is standard output
# Log    String  " "        Location to write per-iteration variables for log/debugging purposes
# dfam   Int     1          Distribution family code: 1 = Power, 2 = Binomial
# vpow   Double  0.0        Power for Variance defined as (mean)^power (ignored if dfam != 1):
#                           0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link   Int     0          Link function code: 0 = canonical (depends on distribution),
#                           1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow   Double  1.0        Power for Link function defined as (mean)^power (ignored if link != 1):
#                           -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg   Double  0.0        Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt   Int     0          Intercept presence, X columns shifting and rescaling:
#                           0 = no intercept, no shifting, no rescaling;
#                           1 = add intercept, but neither shift nor rescale X;
#                           2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg    Double  0.0        Regularization parameter (lambda) for L2 regularization
# tol    Double  0.000001   Tolerance (epsilon)
# disp   Double  0.0        (Over-)dispersion value, or 0.0 to estimate it from data
# moi    Int     200        Maximum number of outer (Newton / Fisher Scoring) iterations
# mii    Int     0          Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
# icpt=0: ncol(X) x 1;  icpt=1: (ncol(X) + 1) x 1;  icpt=2: (ncol(X) + 1) x 2

GLM Script Outputs

# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per each line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE    A positive integer indicating success/failure as follows:
#                     1 = Converged successfully; 2 = Maximum number of iterations reached;
#                     3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN            Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX      Column index for the smallest beta value
# BETA_MAX            Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX      Column index for the largest beta value
# INTERCEPT           Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION          Dispersion used to scale deviance, provided as "disp" input parameter
#                     or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST      Dispersion estimated from the dataset
# DEVIANCE_UNSCALED   Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED     Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS        Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED    1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM     L2-norm of iteration step from old point (i.e. "beta") to new point
# OBJECTIVE           The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL       Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED       Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO      Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM       L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN     The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX     The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED    1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA         Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------

GLM Likelihood Maximization

• 1 record: ℓ(yi | θi, a) = exp{(yi · θi – b(θi)) / a + c(yi, a)}
• Log ℓ(Y | Θ, a) = 1/a · ∑i≤n (yi · θi – b(θi)) + const (independent of Θ)

• f(β; X, Y) = – ∑i≤n (yi · θi – b(θi)) + λ/2 · βᵀβ → min
  – Here θi is a function of β: θi = b′⁻¹(g⁻¹(Xiβ))
  – Add regularization with λ/2 to agree with least squares
  – If X has an intercept, do NOT regularize its β-value

• Non-quadratic; how to optimize?
  – Gradient descent: fastest when far from the optimum
  – Newton's method: fastest when close to the optimum

• Trust Region Conjugate Gradient
  – Strikes a good balance between the above two

GLM Likelihood Maximization

• f(β; X, Y) = – ∑i≤n (yi · θi – b(θi)) + λ/2 · βᵀβ → min

• Outer iteration: from β to βnew = β + z
  – ∆f(z; β) := f(β + z; X, Y) – f(β; X, Y)

• Use "Fisher Scoring" to approximate the Hessian and ∆f(z; β):
  – ∆f(z; β) ≈ ½ · zᵀA z + Gᵀz, where:
  – A = Xᵀ diag(w) X + λI  and  G = – Xᵀu + λ · β
  – The vectors u, w depend on β via the mean-to-variance and link functions
  – FI = Xᵀ diag(w) X is the "expected" Hessian

• Trust Region: the area ǁzǁ₂ ≤ δ where we trust the approximation ∆f(z; β) ≈ ½ · zᵀA z + Gᵀz
  – ǁzǁ₂ ≤ δ too small → Gradient Descent step (1 inner iteration)
  – ǁzǁ₂ ≤ δ mid-size → cut-off Conjugate Gradient step (2 or more)
  – ǁzǁ₂ ≤ δ too wide → full Conjugate Gradient step
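Note that A never needs to be materialized: the inner conjugate-gradient loop only needs products A %*% d, which can be formed directly from X. A minimal sketch of that pattern (illustrative; w and u stand for the weight vectors described above):

# Illustrative: apply A = t(X) %*% diag(w) %*% X + lambda * I to a vector d
# without forming the m x m matrix A explicitly.
Ad = lambda * d + t(X) %*% (w * (X %*% d));    # w is an n x 1 weight vector
# Gradient of the quadratic model:
G = - t(X) %*% u + lambda * beta;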

Trust Region Conj. Gradient

• Code snippet for Logistic Regression  (beta, w, N, D, lambda, tolerance, max_i, max_j and the zero vector zeros_D are assumed to be initialized earlier in the script)

g = - 0.5 * t(X) %*% y;                      # gradient at beta = 0 (assuming labels y in {-1, +1})
f_val = - N * log (0.5);                     # objective value at beta = 0
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i)    # outer (trust-region) iterations
{
    i = i + 1;
    r = g;
    r2 = sum (r ^ 2);
    exit_r2 = 0.01 * r2;
    d = - r;
    z = zeros_D;
    j = 0;
    trust_bound_reached = FALSE;
    while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j)    # inner (conjugate gradient) iterations
    {
        j = j + 1;
        Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;          # Hd = A %*% d, A = X'diag(w)X + lambda*I
        c = r2 / sum (d * Hd);
        [c, trust_bound_reached] = ensure_quadratic (c, sum (d ^ 2), 2 * sum (z * d), sum (z ^ 2) - delta ^ 2);
        z = z + c * d;
        r = r + c * Hd;
        r2_new = sum (r ^ 2);
        d = - r + (r2_new / r2) * d;
        r2 = r2_new;
    }
    p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
    f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
    delta = update_trust_region (delta, sqrt (sum (z ^ 2)), f_chg, sum (z * g), 0.5 * sum (z * (r + g)));
    if (f_chg < 0) {                         # accept the step only if the objective decreased
        beta = beta + z;
        f_val = f_val + f_chg;
        w = p * (1 - p);
        g = - t(X) %*% ((1 - p) * y) + lambda * beta;
    }
}

ensure_quadratic = function (double x, double a, double b, double c)
    return (double x_new, boolean test)
{
    # If a*x^2 + b*x + c > 0, the step x would leave the trust region:
    # shrink it to the positive root of the quadratic.
    test = (a * x^2 + b * x + c > 0);
    if (test) {
        rad = sqrt (b ^ 2 - 4 * a * c);
        if (b >= 0) {
            x_new = - (2 * c) / (b + rad);
        } else {
            x_new = - (b - rad) / (2 * a);
        }
    } else {
        x_new = x;
    }
}

Trust Region Conj. Gradient

• Trust-region update in the Logistic Regression snippet

update_trust_region = function (double delta,
                                double z_distance,
                                double f_chg_exact,
                                double f_chg_linear_approx,
                                double f_chg_quadratic_approx)
    return (double delta)
{
    sigma1 = 0.25;
    sigma2 = 0.5;
    sigma3 = 4.0;
    if (f_chg_exact <= f_chg_linear_approx) {
        alpha = sigma3;
    } else {
        alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
    }
    rho = f_chg_exact / f_chg_quadratic_approx;    # actual-to-predicted reduction ratio
    if (rho < 0.0001) {
        delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
    } else { if (rho < 0.25) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
    } else { if (rho < 0.75) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
    } else {
        delta = max (delta, min (alpha * z_distance, sigma3 * delta));
    }}}
}

GLM: Other Statistics

• REMINDER:
  – Each yi has distribution = exp{(yi · θi – b(θi)) / a + c(yi, a)}
  – Variance(yi) = a · b″(θi) = a · V(μi)

• Variance of Y given X
  – Estimating β gives V(μi) = V(g⁻¹(Xiβ))
  – The constant "a" is called the dispersion, an analogue of σ²
  – Estimator: a ≈ 1/(n – m) · ∑i≤n (yi – μi)² / V(μi)

• Variance of the parameters β
  – We use the MLE, hence the Cramér–Rao formula applies (for large n)
  – Fisher information: FI = (1/a) · Xᵀ diag(w) X, with wi = (V(μi) · g′(μi)²)⁻¹
  – Estimator: Cov β ≈ a · (Xᵀ diag(w) X)⁻¹,  Var βj = (Cov β)jj
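A minimal DML sketch of these estimators (illustrative only; mu, V_mu and g_prime_mu are assumed to have been computed already from the fitted betas via the chosen distribution and link):

# Illustrative: dispersion and approximate covariance of the betas in a GLM.
n = nrow (X);
m = ncol (X);
a_disp = sum ((y - mu) ^ 2 / V_mu) / (n - m);         # dispersion estimate
w = 1 / (V_mu * g_prime_mu ^ 2);                      # Fisher-scoring weights, n x 1
cov_beta = a_disp * inv (t(X) %*% diag (w) %*% X);    # Cov(beta) ~ a * (X' diag(w) X)^-1
var_beta = diag (cov_beta);                           # per-coefficient variances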

GLM: Deviance

• Let X have m features, of which k may have no effect on Y
  – Will "no effect" result in βj ≈ 0? (Unlikely.)
  – Estimate βj and Var βj, then test βj / (Var βj)^(1/2) against N(0, 1)?
  – Student's t-test is better

• Likelihood Ratio Test:

  D = 2 · log [ maxβ L_GLM(Y | X, a; β1, …, βk, …, βm) / maxβ L_GLM(Y | X, a; 0, …, 0, βk+1, …, βm) ] > 0

• Null Hypothesis: Y given X follows the GLM with β1 = … = βk = 0
  – If the NH is true, D is asymptotically distributed as χ² with k degrees of freedom
  – If the NH is false, D → +∞ as n → +∞

• P-value % = Prob[χ²k > D] · 100%

GLM: Deviance

• To test many nested models (feature subsets) we need their maximum likelihoods to compute D
  – PROBLEM: the term c(yi, a) in the GLM's exp{(yi · θi – b(θi)) / a + c(yi, a)}

• Instead, compute the deviance:

  D = 2 · log [ maxΘ L(Y | Θ, a; saturated model) / maxβ L_GLM(Y | X, a; β1, …, βk, …, βm) ] > 0

• The "saturated model" has no X, no β, but picks the best θi for each individual yi (not realistic at all, just a convention)
  – The term c(yi, a) is the same in both models!
  – But "a" has to be fixed, e.g. to 1

• The deviance itself is used for goodness-of-fit tests, too

Survival Analysis

• Given:
  – Survival data from individuals, as (time, event) pairs
  – Categorical / continuous features for each individual

• Estimate:
  – The probability of survival to a future time
  – The rate of hazard at a given time

[Figure: example timelines of five patients over time 1–9; † marks death from the specific cancer, ? marks a patient lost to follow-up]

Cox Regression

• Semi-parametric model, "robust"
• Most commonly used
• Handles categorical and continuous data
• Handles (right / left / interval) censored data

  h(t; state) = h0(t) · exp(λᵀ s)   (h0 = baseline hazard, s = covariates, λ = coefficients)


Event Hazard Rate

• Symptom events E follow a Poisson process:

[Figure: a timeline with symptom events E1, E2, E3, E4, followed by death]

• Hazard function = Poisson rate:

  h(t; state) = lim(Δt→0) Prob[ E ∈ [t, t + Δt) | state ] / Δt

• Given the state and the hazard, we could compute the probability of the observed event count:

  Prob[ K events in t1 ≤ t ≤ t2 ] = H^K e^(–H) / K!,   where H = ∫ from t1 to t2 of h(state; t) dt

Cox Proportional Hazards

• Assume that exactly 1 patient gets event E at time t
  (patients #1, …, #n with states s1, …, sn, where si = statei)
• The probability that it is Patient #i is the hazard ratio:

  Prob[ #i gets E ] = h(t; si) / ∑ j=1..n h(t; sj)

• Cox assumption:

  h(t; state) = h0(t) · Λ(state) = h0(t) · exp(λᵀ s)

• The time confounder h0(t) cancels out!

Cox "Partial" Likelihood

• The Cox "partial" likelihood for the dataset is a product over all events E
  (over patients #1, …, #n with time-dependent states s1(t), …, sn(t)):

  L_Cox(λ) = Prob[ all E ] = ∏ t: E  h(t; s_who(t)(t)) / ∑ j=1..n h(t; sj(t))
                           = ∏ t: E  exp(λᵀ s_who(t)(t)) / ∑ j=1..n exp(λᵀ sj(t))

  where who(t) denotes the patient who gets the event at time t.
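As an illustration of how this partial likelihood could be evaluated, here is a minimal DML sketch. It is an assumption-laden sketch, not the actual Cox script: rows are assumed sorted by increasing event time, ties are ignored, states are taken as time-constant feature rows of X, and is_event is a 0/1 column marking uncensored records.

# Illustrative: Cox partial log-likelihood, assuming rows sorted by increasing time.
eta = X %*% beta;                                   # linear predictors lambda' s_i
risk = exp (eta);
# For row i, the risk set is everyone with time_j >= time_i; with sorted rows its
# total risk is sum(risk) - cumsum(risk) + risk  (a "reverse" cumulative sum).
risk_set_sum = sum (risk) - cumsum (risk) + risk;
partial_loglik = sum (is_event * (eta - log (risk_set_sum)));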

Cox Regression

• Semi-parametric model, "robust"
• Most commonly used
• Handles categorical and continuous data
• Handles (right / left / interval) censored data

• Cox regression in DML:
  – Fitting the parameters via the negative partial log-likelihood
  – Method: trust-region Newton with conjugate gradient
  – Inverting the Hessian using block Cholesky for computing the std. errors of the betas
  – Similar features as coxph() in R, e.g., stratification, frequency weights, offsets, goodness-of-fit testing, recurrent-event analysis

  h(t; state) = h0(t) · exp(λᵀ s)   (h0 = baseline hazard, s = covariates, λ = coefficients)

BACK-UP


Kaplan-Meier Estimator

Confidence Intervals

• Definition of a confidence interval; p-value
• Likelihood ratio test
• How to use it for a confidence interval
• Degrees of freedom