Transcript of: Statistics & Experimental Design with R, Barbara Kitchenham, Keele University
Source: crest.cs.ucl.ac.uk/cow/31/slides/Regression.pdf

Page 1:

Statistics & Experimental Design with R

Barbara Kitchenham, Keele University

Page 2:

Correlation and Regression

Page 3:

Correlation

•  The association between two variables
•  The strength of association is usually measured by a correlation coefficient ρ in the range [-1, 1]

•  Most well known – the Pearson product-moment correlation coefficient

•  Arises from the bivariate normal distribution
   –  If both variables are standardized and then plotted:
   –  An ellipse shape indicates an association
      »  The narrower the ellipse, the closer ρ is to 1 (+ve) or -1 (-ve)
   –  A circular shape indicates no association, with ρ ≈ 0

Page 4:

Bivariate Normal Distribution

•  Bivariate normal distribution
•  Standard bivariate normal: z ~ N(0,1)
•  Generalises to n dimensions
•  Pearson's ρ is a parameter of the distribution
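For reference, the density of the standard bivariate normal (standardized marginals), in which Pearson's ρ appears as a parameter, is:

f(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right)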

Page 5:

Pearson's ρ

•  From the bivariate normal distribution
•  Estimated from data as r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² · Σ(yi - ȳ)² )

•  Calculating r does not require normality
   –  But statistical tests of significance do
   –  A test of H0: ρ=0 can be based on T having Student's t distribution with n-2 df, where T = r·sqrt(n-2) / sqrt(1 - r²)

   –  There is also a normalising transformation, Fisher's z = 0.5·ln((1+r)/(1-r))
      •  Which has standard error 1/sqrt(n-3)
   –  Used when correlations from different sources need to be aggregated (such as during meta-analyses)

Page 6:

Small Data Set

[Scatter plot: "Data from ICL" – LoC (5000–35000) against Effort (10–70), with two outlying points labelled A and B]

•  Using cor.test in R: ρ=0.57, T=1.9448, n.s.
•  Delete A and ρ=0.57, T=5.887***
•  Delete B and ρ=0.28, T=0.760, n.s.
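A minimal sketch of the cor.test call behind these numbers, assuming the small data set is held in a data frame named iclbt with columns loc and effort (the names used later in these slides):

# Pearson correlation with its t test on n-2 df
ct <- cor.test(iclbt$loc, iclbt$effort, method = "pearson")
ct$estimate    # sample correlation r
ct$statistic   # T value
ct$p.value     # significance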

Page 7:

Factors Affecting the Magnitude of Pearson's ρ

•  The slope of the line about which the points are clustered
   –  If the slope is 0, ρ=0; the larger the slope, the larger is ρ

•  The magnitude of the deviations from the line
   –  The closer the points are to the notional line, the larger is ρ

•  Outliers
•  Restricting the range of X values
   –  Can increase or decrease ρ

•  Curvature
   –  ρ assumes a linear relationship

Page 8:

Robust Correlation

•  Spearman's ρ
   –  Replace the data values by their ranks
   –  Uses the same calculation as Pearson

•  With the previous data set
   –  All data: r=0.41, p=0.25
   –  With A removed: r=0.67, p=0.059
   –  With B removed: r=0.18, p=0.64

Page 9:

Non-Parametric Correlation

•  Kendall's tau (τ)
   –  Based on counting concordant and discordant pairs of points
   –  (The related idea of taking the median slope over all pairs of points underlies Theil–Sen regression, covered later)

•  With the previous data set
   –  All data: r=0.33, p=0.22
   –  With A removed: r=0.56, p=0.045
   –  With B removed: r=0.17, p=0.61
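Both rank-based coefficients are available through the same base-R function used earlier; a sketch, again assuming the iclbt data frame:

cor.test(iclbt$loc, iclbt$effort, method = "spearman")  # Spearman's rho
cor.test(iclbt$loc, iclbt$effort, method = "kendall")   # Kendall's tau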

Page 10:

RelPlot

•  The relplot function is a bivariate equivalent of the box plot
•  Shows the central ellipsoid part of the bivariate distribution, plus outliers
•  Calculates a robust estimate of r=0.90
•  Does not generalise to more dimensions
•  Assuming a bivariate normal means negative values are expected

[Relplot of the ICL data: x (-10000 to 40000) against y (-20 to 80), showing the central ellipse and flagged outliers]
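A sketch of the call, assuming Wilcox's robust-statistics functions (the sourced Rallfun collection these slides draw on) are loaded:

# relplot() draws the bivariate analogue of a boxplot
# and reports a robust correlation estimate
relplot(iclbt$loc, iclbt$effort)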

Page 11:

MGV Method for Outliers

•  The Minimum Generalised Variance method can be used with many variables

[MGV method plot: X (5000–35000) against Y (10–70); data points shown as *, with one point flagged as an outlier (o)]
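A sketch, again assuming Wilcox's function collection is sourced; outmgv() is the usual name there for MGV outlier detection (the function name is an assumption, as the slide names only the method):

# Flag multivariate outliers by Minimum Generalised Variance
outmgv(cbind(iclbt$loc, iclbt$effort), plotit = TRUE)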

Page 12:

Robust Correlations

•  Winsorized correlation (wincor(x,y))
   –  Replace x and y values at the extremes with the 25th (low) and 75th (high) percentile values
   –  0.407, sig.level=.276

•  Percentage bend correlation
   –  Not an estimate of Pearson's ρ
   –  A new correlation, robust to changes in distribution
   –  Based on trimming univariate outliers
   –  corb(x,y,corfun=pbcor,nboot=599)
   –  rpb=.441, bootstrap CI=(-0.44, 0.97)

•  Skipped correlations (i.e. remove outliers)
   –  Remove outliers based on MGV, then use Pearson (r=0.91)
   –  Need to adjust the test statistic & critical value
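A sketch of these calls; wincor() and pbcor() are packaged in WRS2, while corb() comes from Wilcox's sourced Rallfun collection:

library(WRS2)
wincor(iclbt$loc, iclbt$effort)   # Winsorized correlation
pbcor(iclbt$loc, iclbt$effort)    # percentage bend correlation
# Bootstrap CI, if the Rallfun functions are sourced:
# corb(iclbt$loc, iclbt$effort, corfun = pbcor, nboot = 599)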

Page 13:

Comparison on the Full Data Set

[Two panels for the full ICL data set: a relplot (x -10000 to 40000, y 0–300) and an MGV method plot (X 10000–40000, Y 0–300) with several points flagged as outliers (o)]

Page 14:

Linear Regression

•  Finding the parameters of a model of the form Y = β0 + β1X1 + ... + βpXp + ε
   –  Y is the response/outcome/dependent variable
   –  Xi is the ith of p stimulus/input/independent variables
   –  βi is the ith parameter of the model

•  A linear model is linear w.r.t. the parameters
   –  Polynomial models are linear models of the nth order, where n is the highest power
   –  I.e. a second-order regression model has the form Y = β0 + β1X + β2X² + ε
   –  A non-linear model might have a form such as Y = β0·X^β1 + ε, where a parameter enters non-linearly
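Because a polynomial model is linear in its parameters, lm() fits it directly; a sketch with the assumed iclbt variables:

# Second-order (quadratic) model: still a linear model
fit2 <- lm(effort ~ loc + I(loc^2), data = iclbt)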

Page 15:

Least Squares Principles

•  The basic model for one input variable is Y = β0 + β1X + ε

•  The sum of squares of deviations from the true line is S = Σ(yi - β0 - β1xi)²

•  To estimate by least squares
   –  Differentiate S w.r.t. each parameter in turn
   –  To find the turning point (i.e. the minimum), set each derivative to 0
      •  Solve for each parameter in turn

Page 16:

Parameter Estimation

•  The derivatives are
   –  ∂S/∂β0 = -2·Σ(yi - β0 - β1xi)
   –  ∂S/∂β1 = -2·Σ xi(yi - β0 - β1xi)

•  The solutions after setting each to 0 are
   –  b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
   –  b0 = ȳ - b1·x̄

•  For standardized normal variables
   –  The slope must be less than 1, even if Y=X
   –  The larger the error term, the smaller r and the lower the value of b1

Page 17:

Bivariate Normal Distributions

[Two simulated standard bivariate normal samples with fitted regression lines: the rho=0.5 panel gives b1=0.57441, b0=-0.07613; the rho=0.9 panel gives b1=0.9018, b0=-0.0097; x and y run roughly -3 to 3]

Page 18:

Multivariate Regression

•  Formulate in matrix algebra terms, assuming X and Y have their means removed, i.e. Y = y - μy:
   Y = Xβ + ϵ

•  Y is an (n×1) vector
•  X is an (n×p) matrix of known form
•  β is a (p×1) vector of parameters
•  ϵ is an (n×1) vector of error terms
•  Where E(ϵ)=0, V(ϵ)=Iσ²

•  The solution is b = (XᵀX)⁻¹XᵀY
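A self-contained sketch of the matrix solution on simulated data:

set.seed(1)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 2 * x1 - x2 + rnorm(30)
X  <- cbind(x1 - mean(x1), x2 - mean(x2))  # means removed, as above
Y  <- y - mean(y)
b  <- solve(t(X) %*% X, t(X) %*% Y)        # b = (X'X)^-1 X'Y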

Page 19:

Least Squares Properties

•  Fitted values are obtained from Ŷ = Xb
•  Vector of residuals: e = Y - Ŷ
•  Variance of the parameters: V(b) = (XᵀX)⁻¹σ²

•  Multiple correlation coefficient R² = 1 - Σe² / Σ(y - ȳ)²
•  Adjusted R² (penalised for the number of parameters)
•  Both R² statistics are vulnerable to outliers
•  Many diagnostic tools are available based on the residuals and the Hat matrix

Page 20:

The Hat Matrix

•  The Hat matrix is defined as H = X(XᵀX)⁻¹Xᵀ
•  Called the Hat matrix because Ŷ = HY (it puts the hat on Y)
•  It is important because, if hii is the i-th diagonal element of H and ei the i-th residual, the difference between:
   •  the parameter estimates with and without observation xi is (XᵀX)⁻¹xi·ei / (1 - hii)
   •  the fitted value with and without observation xi is hii·ei / (1 - hii)
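A sketch extracting the hat diagonal from a fitted model (the iclbt data frame assumed, as before):

fit <- lm(effort ~ loc, data = iclbt)
h   <- hatvalues(fit)        # diagonal elements h_ii of H
e   <- residuals(fit)
h * e / (1 - h)              # change in fitted value i if point i is dropped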

Page 21:

Three Types of Residual

•  Residuals: ei = yi - ŷi
•  Standardized residuals: ei / (s·sqrt(1 - hii))

•  Studentized residuals (based on omitting each data point in turn from the variance estimate)

•  Sadly R doesn't automatically provide fitted values based on n-1 points
   –  However, lm provides access to the hat matrix values
      •  Via the fitted model, i.e. hatvalues(fit)
   –  So they can be calculated by writing your own R program
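The three residual types are one call each in base R; a sketch:

residuals(fit)    # raw residuals
rstandard(fit)    # standardized residuals
rstudent(fit)     # studentized residuals (leave-one-out variance)
hatvalues(fit)    # the hat-matrix diagonal used by both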

Page 22:

Fitting Regression Models in R

•  The R command is
   –  lm(y~x1+x2+..+xn,data=mydata)

•  You should save the output of the linear model, e.g.
   –  fit<-lm(effort~loc,data=iclbt)
   –  Effort = 17.22 + 0.00253322×loc

•  From the object "fit" you can access
   –  Residuals
   –  Hat values
   –  Fitted values

Page 23:

Plotting Effort and Loc Showing the Regression Line

[Scatter plot: loc (10000–40000) against effort (0–300) with the fitted least-squares line]
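A sketch of the plotting commands:

plot(effort ~ loc, data = iclbt)
abline(lm(effort ~ loc, data = iclbt))   # add the least-squares line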

Page 24:

Theil-Sen Regression

[Theil-Sen fit: Loc (10000–40000) against Effort (0–300)]
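A sketch, assuming the Theil-Sen fit uses tsreg() from Wilcox's sourced Rallfun collection:

tsfit <- tsreg(iclbt$loc, iclbt$effort)
tsfit$coef    # robust intercept and slope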

Page 25:

Using a Log Transformation

[Log-log plot: Log(Loc) (7.5–10.5) against Log(Effort) (1–5)]
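The corresponding fit on the log scale (sketch):

logfit <- lm(log(effort) ~ log(loc), data = iclbt)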

Page 26:

Diagnostics

•  Many diagnostic facilities assume fitting via the linear model function

•  To evaluate the diagnostics we can use
   –  log(effort) = log(loc) + log(dur) + co

•  "co" is a factor that defines the source of the data

•  It needs to be defined as a factor (with levels "1", "2", "3"):
   –  iclbt$co<-factor(iclbt$co)
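A sketch of the diagnostic model used on the following slides:

iclbt$co <- factor(iclbt$co)   # company code as a factor
fit <- lm(log(effort) ~ log(loc) + log(dur) + co, data = iclbt)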

Page 27:

Diagnostic Aids - 1

•  Q-Q plots
   –  Plot the Studentized residuals against a t distribution with n-p-1 degrees of freedom

•  Histogram of residuals (all types)
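Both aids are one call each; qqPlot() is in the car package (sketch):

library(car)
qqPlot(fit)           # studentized residuals vs t quantiles
hist(rstudent(fit))   # histogram of the studentized residuals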

Page 28:

QQPlot for ICLBT Data

[Q-Q plot: t quantiles (-2 to 2) against studentized residuals of fit (-2 to 2)]

Page 29:

Residual Plot

[Histogram titled "Distribution of Errors": residuals (-3 to 2) against density (0–0.6), with a normal curve and a kernel density curve overlaid]

Page 30:

Diagnostic Aids - 2

•  Component + residual plots
   –  Partial residual plots
   –  For each variable j, plot ϵi + bj·Xij against Xij
      •  where the ϵi are the residuals from the full model

   –  The straight line on the graph is the least squares fit

   –  The other line is the "lowess" line
      •  A nonparametric weighted fit line based on locally weighted polynomial regression
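A sketch; crPlots() in car produces the component + residual plots shown on the next slide:

crPlots(fit)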

Page 31:

CrPlots for ICLBT Data

[Component + Residual Plots: three panels plotting Component+Residual(log(effort)) against log(loc) (7.5–10.5), log(dur) (1.5–3.0), and co (1–3)]

Page 32:

Diagnostic Aids - 3

•  Test for non-constant error variance
   –  ncvTest() function
      •  For the ICLBT data: Chisquare = 4.055072, Df = 1, p = 0.044*

•  Plot of absolute standardized residuals versus fitted values with the best fitting line (Spread-Level Plot)
   –  Can indicate possible non-linearity in the Y variable
      •  Suggests a power transform
      •  A suggestion of 0 identifies a log transform
      •  Here it suggests -0.33

•  Multicollinearity: the vif() function
   –  Only relevant when there are multiple X variables
   –  Measures the extent to which the standard deviation of a parameter is expanded
      •  Relative to a model with independent variables
   –  If the square root of vif > 2 there may be a problem
      •  No problem for this model
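A sketch of the three checks, all from the car package:

ncvTest(fit)            # score test for non-constant error variance
spreadLevelPlot(fit)    # suggests a power transformation of Y
sqrt(vif(fit)) > 2      # rule-of-thumb multicollinearity check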

Page 33:

Spread Level Plot

[Spread-Level Plot for fit: fitted values (5–100, log scale) against absolute studentized residuals (0.02–5.00, log scale)]

Page 34:

Major Diagnostic Concepts

•  Outliers
   –  Observations that are not predicted well by the model
   –  Have large residuals

•  High leverage points
   –  Are outliers with respect to the other predictors
   –  Found using the Hat matrix

•  Influential points
   –  Observations that have a major impact on the parameter values
   –  High leverage points that are also outliers
   –  Identified using:
      •  Added-variable plots
      •  Cook's distance

Page 35:

Cook's Distance

•  Aims to summarize the information in
   –  the leverage
   –  the residual-squared plot

•  into a single-number index

•  Cook's D is unusually large if greater than 4/(N-k)
   –  k is the number of parameters including the constant
   –  N is the number of observations
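A sketch applying the 4/(N-k) rule above in R:

d <- cooks.distance(fit)
k <- length(coef(fit))     # parameters including the constant
N <- nobs(fit)
which(d > 4 / (N - k))     # unusually influential observations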

Page 36:

Aids for Outlier Detection - 1

•  Outlier detection based on Studentized residuals using the outlierTest() function
   –  Reports a Bonferroni-adjusted p-value for the largest absolute residuals
   –  Identifies points 61 & 21 as significant outliers

•  Added-variable plots
   –  For each Xj
      •  Plot the residuals of Y regressed on the other variables against the residuals of Xj regressed on the other variables
   –  Can be used to assess the impact of specific data points

•  Influence plot
   –  Studentized residuals against hat-values, with circles indicating Cook's distance
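Sketches of the three aids, all from the car package:

outlierTest(fit)     # Bonferroni-adjusted test of the largest residuals
avPlots(fit)         # added-variable plots, one per predictor
influencePlot(fit)   # residuals vs hat-values, circle area ~ Cook's D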

Page 37:

Influence Plot for the ICLBT Dataset

[Influence plot: hat-values (0.10–0.25) against studentized residuals (-2 to 2); circle size proportional to Cook's distance; points 7, 16 and 23 stand out]

•  Compare the main outliers (i.e. 1 and 7) with the outlier detection results (slide 12)
•  The effect of removing points is easy to assess, use:
   –  update(fit,subset=-c(7,16))

Page 38:

Impact of Removing Outliers

Coefficients   Original    New
Intercept      -3.1804*    -4.1907**
Log(loc)        0.4895*     0.7089**
Log(dur)        0.7534 .    0.390
Co2            -0.1049     -0.1976
Co3             0.631       0.4219
Adj R2          0.481       0.5876

Page 39:

Influence Plot of the Reduced Model

[Influence plot: hat-values (0.10–0.30) against studentized residuals (-2 to 2); points 4 and 23 stand out]

Page 40:

Models with Dummy Variables

•  Exactly equivalent to Analysis of Covariance (ANCOVA)
•  Uses variables that partition the dataset
   –  E.g. Co (which stands for company) in the ICLBT database

•  Co is coded as an integer and needs to be specified to R as a factor

•  R maps the k different levels of a factor into k-1 dummy variables
   –  The effect of the "missing" dummy variable is included in the intercept
   –  If there is only one dummy variable:
      •  The intercept corresponds to the effect of the missing variable
      •  The parameter value given to each other dummy variable is the difference between:
         –  the effect of that dummy variable, and
         –  the effect of the missing dummy variable
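A sketch showing how R codes the 3-level co factor:

contrasts(iclbt$co)   # k-1 dummy columns; level "1" is absorbed into the intercept
coef(fit)             # "co2"/"co3" are shifts relative to company 1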

Page 41:

Dummy Variables - 2

•  A dummy variable shifts the intercept of the regression line
   –  To give a separate regression line for each data partition

•  If we want to change the slope as well as the intercept, we need to change the model to one with interactions
   –  lm(log(effort)~co*(log(dur)+log(loc)),data=iclbt)

•  Multiple factors in a model with no continuous variables produces a multi-way ANOVA

Page 42:

Interactions with Company

Coefficients   Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    -3.646     2.8032       -1.301    0.2057
co2            -2.722     3.7120       -0.733    0.4705
co3             0.7553    3.7311        0.202    0.8413
log(dur)        1.3364    0.6680        2.000    0.0569 (.)
log(loc)        0.3217    0.2839        1.133    0.2683
co2:log(dur)   -0.7094    0.8775       -0.808    0.4268
co3:log(dur)   -1.1025    0.8165       -1.350    0.1895
co2:log(loc)    0.6292    0.4405        1.428    0.1660
co3:log(loc)    0.2763    0.4095        0.675    0.5063

Page 43:

Removing X-variables

•  May need to select the most plausible model with the least number of X-variables

•  Stepwise regression is available in R
   –  Forward stepwise starts with no variables and adds one at a time
   –  Backward starts with all variables and removes them one at a time
   –  Stepwise goes forward but re-assesses all variables as each new one is added
   –  Based on the Akaike Information Criterion (AIC)

•  Can also inspect all possible regressions
   –  With a limited number of variables
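A sketch; stepAIC() in the MASS package implements all three directions, and leaps::regsubsets() enumerates all subsets:

library(MASS)
stepAIC(fit, direction = "backward")
# all possible regressions, feasible with a limited number of variables:
# leaps::regsubsets(log(effort) ~ log(loc) + log(dur), data = iclbt)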

Page 44:

Akaike Information Criterion (AIC)

•  Used to judge competing models
   –  A function of the log likelihood: AIC = 2k - 2·ln(L)
   –  k = number of parameters in the model
   –  Smaller values are preferable

•  A version adjusted for sample size n is preferable: AICc = AIC + 2k(k+1)/(n-k-1)

•  Assesses the impact of changing the number of parameters (not the functional form of the model)
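A sketch computing both criteria for a fitted lm:

aic <- AIC(fit)
k   <- length(coef(fit)) + 1          # +1 for the error variance
n   <- nobs(fit)
aic + 2 * k * (k + 1) / (n - k - 1)   # AICc, the small-sample version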

Page 45:

Other Capabilities

•  Kabacoff published R functions
   –  For cross-validation
      •  Checking a model by splitting the data into validation and training data sets
      •  Predicting the outcome values for the validation data
      •  Performing k-fold cross-validation
         –  I.e. creating k different training & validation sets at random
         –  Based on changes to the R-squared statistic
   –  And for assessing the relative importance of different variables
      •  The model must not have categorical variables
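A sketch in the spirit of Kabacoff's cross-validation approach (the helper name cv_r2 is hypothetical, not his; crossval() is from the bootstrap package):

library(bootstrap)
cv_r2 <- function(fit, k = 10) {
  x <- model.matrix(fit)[, -1, drop = FALSE]
  y <- model.response(model.frame(fit))
  theta.fit     <- function(x, y) lsfit(x, y)
  theta.predict <- function(f, x) cbind(1, x) %*% f$coef
  cv <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  cor(y, cv$cv.fit)^2   # cross-validated R-squared
}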

Page 46:

Robust Regression

•  Lowess: local polynomial regression
   –  h is the half-width of a window enclosing the observations used for a local regression
   –  At x0, the estimated height of the regression curve is the fitted value of the local regression
   –  It is typical to adjust h so that each local regression includes a fixed proportion s of the data
   –  s is the span of the local-regression smoother
   –  A larger span gives a smoother fit, but needs a larger order of local regression
      •  Requiring a trade-off
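A base-R sketch; the f argument of lowess() is the span s:

plot(effort ~ loc, data = iclbt)
lines(lowess(iclbt$loc, iclbt$effort, f = 2/3))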

Page 47:

Fitted Line for ICLBT Data: Size v. Effort

[Lowess fit: loc (10000–40000) against effort (0–300)]

Page 48:

Duration v. Effort

[Lowess fit: dur (5–25) against effort (0–300)]

Page 49:

Multiple Lowess Regression

[Two panels titled "Multiple regression using Lowess": effort (0–300) against fitted values (0–200), and the same on a log scale: log(effort) (1–5) against log fitted values (2.0–5.0)]

Page 50:

Kernel Regression

•  Kernel estimators estimate some measure of location for y given x

•  wi is a measure of how close xi is to x, i.e. wi = K((x - xi)/h)

•  K(u) is a continuous, bounded and symmetric function
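A base-R sketch of a kernel smoother; ksmooth() uses a normal kernel, and the bandwidth value here is purely illustrative:

plot(effort ~ loc, data = iclbt)
lines(ksmooth(iclbt$loc, iclbt$effort, kernel = "normal", bandwidth = 10000))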

Page 51:

Kernel Regression - Continued

•  m(x) is estimated from the locally weighted fit

•  h is the span
   –  IQR, the interquartile range, is used in computing it

•  Given x,
   –  b0 and b1 are estimated using weighted regression

•  A smooth is created by taking x to be a grid of points and plotting the results

Page 52:

Kernel Regression

[Kernel regression fit: loc (10000–40000) against effort (0–300)]

Page 53:

Non-Parametric Regression

•  Theil-Sen can handle multiple regression
   –  But not with dummy variables
   –  The fitted line follows the mass of the data points
•  5 fitted values were negative

[Scatter plot: effort (0–300) against tsfitted (0–500)]

Page 54:

Conclusions

•  The combination of transforming variables and extensive diagnostic facilities
   –  Seems to reduce the need for robust regression
      •  At least in the case of linear models

•  Non-parametric approaches don't always work well
   –  They don't permit group variables
   –  They are not integrated with the diagnostics in library(car)

•  Lowess is promising
   –  But not yet well integrated with diagnostics