Anja Bråthen Kristoffersen Biomedical Research Group · Coefficient of determination • the...

Regression and Programming in R

Anja Bråthen Kristoffersen

Biomedical Research Group

R Reference Card

http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Simple linear regression

• Describes the relationship between two variables x and y

• The numbers α and β are called parameters, and ϵ is the error term.

2014.01.15 3

Example data

• Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

head(faithful)

eruptions waiting

1 3.600 79

2 1.800 54

3 3.333 74

4 2.283 62

5 4.533 85

6 2.883 55

2014.01.15 4

2014.01.15 5

P-value

Estimated

parameters

Coefficient of Determination

Coefficient of determination

• the quotient of the variances of the fitted values and observed values of the dependent variable.

2014.01.15 6

Prediction

• develop a 95% confidence interval of the mean eruption duration for the waiting time of 80 minutes

newdata <- data.frame(waiting=80)

predict(eruption.lm, newdata, interval="confidence") fit lwr upr 1 4.1762 4.1048 4.2476

2014.01.15 7

Residuals

• The difference between the observed data of the dependent variable y and the fitted values ŷ.

2014.01.15 8

yyResidual ˆ

eruption.res <- resid(eruption.lm)

plot(faithful$waiting, eruption.res, ylab="Residuals",

xlab="Waiting Time", main="Old Faithful Eruptions")

abline(0,0, col = "red", lwd = 2)

2014.01.15 9

Standardized residual

• the residual divided by its standard deviation

eruption.stdres = rstandard(eruption.lm)

plot(faithful$waiting, eruption.stdres,

ylab="Standardized Residuals", xlab="Waiting Time",

main="Old Faithful Eruptions")

abline(0, 0, col ="red", lwd = 2)

2014.01.15 10

residual of deviationstandard

residualresidualedStandardiz

2014.01.15 11

Normal Probability Plot of Residuals

qqnorm(eruption.stdres,

ylab="Standardized Residuals",

xlab="Normal Scores",

main="Old Faithful Eruptions")

qqline(eruption.stdres, col = "red", lwd=3)

2014.01.15 12

Generalized additive model

• Can replace the linear relationship between response and variable, as in linear regression, with a non-linear relationship. A spline.

2014.01.15 13

install.packages("mgcv")

library(mgcv)

This is mgcv 1.5-5 . For overview type

`help("mgcv-package")'.

eruption.gam <- gam(eruptions ~ 1+s(waiting), data=faithful)

plot(eruption.gam)

2014.01.15 14

plot(eruption.gam)

2014.01.15 15

2014.01.15 16

Coefficient of

Determination

p-value

eruption.res.gam <- resid(eruption.gam)

plot(faithful$waiting, eruption.res.gam, ylab="Residuals",

xlab="Waiting Time", main="Old Faithful Eruptions")

abline(0,0, col="red", lwd=2)

2014.01.15 17

eruption.res.gam <- resid(eruption.gam)

qqnorm(eruption.res.gam)

qqline(eruption.res.gam, col="red", lwd=3)

2014.01.15 18

Multiple regresion

• describes a dependent variable y (response) by independent variables x1, x2, ..., xp (p > 1) is expressed by the equation

where the numbers α and βk (k = 1, 2, ..., p) are the parameters, and ϵ is the error term.

2014.01.15 19

Multiple regresion Explore your data

• Explore your data before starting the analysis

– Are the responses correlated?

– Use plot (trellis graphics, boxplot)

– Use correlation (pearson, spearman)

2014.01.15 20

2014.01.15 21

DLBCL patient data

• Response: Germinal.cnter.B.cell.signature

• Explanatory variables – Lymph.node.signature

– Proliferation.signature

– BMP6

– MHC.class.II.signature

2014.01.15 21

pairs(dat[,8:11])

2014.01.15 22

cor(dat[,8:11])

2014.01.15 23

2014.01.15 24

Adjusted Coefficient of Determination

p-vaules

Not significant

p values

Adjusted Coefficient of Determination

2014.01.15 25

Hvor n er antall observasjoner og

p er antall parametere brukt i modellen

Comparing models

• As long as analysis are done on the same responses you can compare your models by information criteria:

• AIC (Akaike's An Information Criterion)

-2*log-likelihood + 2*npar

• BIC (Schwarz's Bayesian criterion)

-2*log-likelihood + log(n)*npar

• Goal: as small AIC or BIC as possible, i.e. explain most with less parameters

2014.01.15 26

The AIC value is less, chose this model

All p values are significant

Adjusted coefficient of

determination hardly

changed

fit2.res <- resid(fit2) fit2.fitted <- fitted(fit2) pairs(fit2.fitted, fit2.res, col = "darkgreen", pch="*")

2014.01.15 28

Logistic regression

• We use the logistic regression equation to predict the probability of a dependent variable that takes on only the values 0 and 1. Suppose x1, x2, ..., xp (p > 1) are the independent variables, α, βk (k = 1, 2, ..., p) are the parameters, and E(y) is the expected value of the dependent variable y

2014.01.15 29

DLBCL patient data

• Response: alive or dead

• Explanatory variables – Subgroup

– IPI.Group

– Germinal.center.B.cell.signature

– Lymph.node.signature

– Proliferation.signature

– BMP6

– MHC.class.II.signature

2014.01.15 30

DLBCL.glm <- glm(Status.at.follow.up ~ Subgroup + IPI.Group + Germinal.center.B.cell.signature + Lymph.node.signature + Proliferation.signature + BMP6 + MHC.class.II.signature, family= "binomial", data = dat)

2014.01.15 31

logistic

regression

is chosen

DLBCL2.glm <- glm(Status.at.follow.up ~ IPI.Group, family= "binomial", data = dat) summary(DLBCL2.glm) High Low Medium missing NA's 32 82 108 1 17

2014.01.15 32

The survival of

those containing

the IPI.group Low

is significantly

different from

those that are in

the group High

The estimate

is negative,

hence pations

in group Low

have less

probability to

die then those

in group High

Backward an forward inclusion of response variables

• Forward

– Start with all response variables, exclude the one that is least significant, compare AIC values.

• Backward

– Start with one regression for each response variable

– Include more and more response variables that are significant in the model, compare AIC values

2014.01.15 33

Programming in R

2014.01.15 34

Function

panel.hist <- function(x){

usrx <- par("usr")

on.exit(par(usrx))

par(usr = c(usrx[1:2], 0, 1.5) ) # indicates the position in the plot

hi <- hist(x, plot = FALSE) # calculate histogram without plotting it

Breaks <- hi$breaks # define breaks used in h

nB <- length(Breaks) # count the number of breaks

y <- hi$counts # find the counts in each interval

y <- y/max(y) # scale y for plotting

rect(Breaks[-nB], 0, Breaks[-1], y) #plots rectangulars in existing plot

pairs(dat[,7:11], col = dat[,5], cex = 0.5, pch = 24, diag.panel = panel.hist,

upper.panel=NULL)

2014.01.15 36

2014.01.15 37

2014.01.15 38

apply()

2014.01.15 39

• Use apply when you want the same thing done on every row or column of a matrix.

• apply(d, 1, functionA) will use functionA on every row of dataset d.

• apply(d, 2, functionA) will use functionA on every column of dataset d.

Example: apply()

2014.01.15 40

d <- matrix(runif(90), ncol = 10, nrow = 9)

head(d)

par(mfrow = c(3,3))

apply(d, 1, hist)

for – loop

• for (name in expr_1) expr_2

– name on variable: i

– expr_1 should be 1:antall

– expr_2 should be

sum(dbinom(1:6, 8, p[i])), but you have to save it somewhere:

beta8[i] <- sum(dbinom(1:6, 8, p[i])) then you have to create beta8 before the loop.

for – loop

p <- seq(0.6, 1, 0.01)

antall <- length(p)

beta8 <- rep(NA, antall) #allocate space for beta8

for(i in 1:antall){

beta8[i] <- sum(dbinom(1:6, 8, p[i]))

power8 <- 1 - beta8

plot(p, power8, type = "l")

2014.01.15 43

if statment

• if (expr_1) expr_2 else expr_3 • expr_1 is a statement that is either true or

false • expr_2 is preformed if expr_1 is true • expr_3 is optional, but preformed if expr_1 is

false and else statement is included

Example: if()

2014.01.15 45

x <- runif(1,0,1)

y <- rnorm(1,0,1)

if(x > y){

print(paste("x =", round(x,3), "is greater

than y =", round(y,3), sep = " "))

} else {

print(paste("x =", round(x,3), "is less than

y =", round(y,3), sep = " "))

[1] "x = 0.783 is greater than y 0.536"

When finished

• You could save your workspace in R, you could then save the workspace in different project folders. I do not do that!

• You could just save your script and run it once again if you want to work further on it.

• You could write your partial results (matrix) that you want to work further on to a .txt file, use write.table(objectname, ”path”, sep = ”\t”)

– objectname is the name of your object in R

– ”path” is your path to where you will save the object including the file name of the object.

– sep = ”\t” ensure that your txt file is tabulate separated and makes it easier to open in excel.

Anja Bråthen Kristoffersen Biomedical Research Group · Coefficient of determination • the...

Documents

Transcript of Anja Bråthen Kristoffersen Biomedical Research Group · Coefficient of determination • the...

Graph Homomorphisms with Complex Values: A Dichotomy …Graph Homomorphisms with Complex Values: A Dichotomy Theorem ... Bulatov and Grohe [2], and especially the recent beautiful

Performance rating of the transmuted exponential ... rating of the transmuted exponential distribution: an analytical approach ... The table of values ... θ = 1 1.05 1.20 1.35 1.50

THE ARITHMETIC OF THE VALUES OF MODULAR FUNCTIONS AND THE

PHYSICAL REVIEW D 123519 (2016) Bouncing universe in the ... · free cosmological model: a cyclic universe with the minimal and maximal values of the scale factor that remains the

Chapter 5 Orthogonality › ~andrzej › teach › mat342 › chap5.pdf · 2009-11-25 · the documents with largest values align best with the search vector. • If the ith document

Defining my values | 4people matters 2015

Learning strategies for neuronal nets - the ... · High values reduce the probabilty of running into a local minimum, risking the method to diverge and not finding the minimum (or

Arithmetic of linear forms involving odd zeta values

The probability of occurrence of specific values in a ...

Introduction to Artiﬁcial Intelligence · Likelihood Weighting •Start by fixing the values for evidence variables. •Sample the non-evidence variables using conditional probabilities

ECE2984 Lab Lecture 4 - AC measurementsLiaB/Equipment/Oscilloscope/AC... · • For Amplitude and DC offset, adjust the levels with the sliders or type the values into the boxes.

Anja Bråthen Kristoffersen Biomedical Research Group · Example data • Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone

Chapter 5 Orthogonality - math.la.asu.eduandrzej/teachold/342Fall09/teach/mat342/c… · the documents with largest values align best with the search vector. • If the ith document

16th C5 Graph Theory Workshop - tu-freiberg.de · 2012. 5. 3. · Colourings 1Minimum order of k-chromatic Kr+1-free graphs (Anja Kohl) Let n r(k) denote the smallest possible number

Sharp differences in the δ13C values of organic matter and ...

PACS numbers: 95.36.+x, 98.80.-k, 04.50.Kd, 98.80 · somewhat higher values for the f(R) best- t values compared to those mentioned in the past in the literature and we argue that

Using Filter Tables - University of Colorado Boulderecee.colorado.edu/~mathys/ecen2420/pdf/UsingFilterTables.pdf · Using Filter Tables ... With the component values shown in the

Dietary Reference Values for Vitamin E as tocopherol

On Approximating Functions of the Singular Values in ... - IBM · On Approximating Functions of the Singular Values in a Stream Yi Li Facebook, Inc. leeyi@umich.edu David P. Woodruff

AПDЯЄÏ ѪAПÏБβΔS Values Education 9 LM