
Statistical Inference - Course Project 1

Chan Chee-Foong

May 14, 2016

Overview

In this assignment, we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution will be simulated in R with rexp(n, λ), where λ is the rate parameter. The mean of the exponential distribution is 1/λ and its standard deviation is also 1/λ.

For all simulations, unless otherwise stated, the following parameters are set:

1. lambda = λ = 0.2
2. noOfSim = number of simulations = 1000
3. n = number of exponential random variables per simulation = 40
4. Random seed = 2016
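The parameters listed above can be set up as in the following minimal sketch. The variable names come from the list; the report does not show where or how often the seed is set, so calling set.seed once before any random draw is an assumption.

```r
# Simulation parameters as listed in the report
lambda  <- 0.2    # rate parameter of the exponential distribution
noOfSim <- 1000   # number of simulations
n       <- 40     # exponential draws per simulation

# Assumption: the seed is set once, before any random draws
set.seed(2016)

# Theoretical mean and standard deviation of Exp(lambda) are both 1/lambda
theoMean <- 1 / lambda
theoSd   <- 1 / lambda
c(mean = theoMean, sd = theoSd)
```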

Exponential Distribution

Let us generate 1000 * 40 random numbers from an exponential distribution with λ = 0.2 and take a look at the distribution and its properties.

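library(ggplot2)

# Draw noOfSim * n exponential random numbers with rate lambda
expSample <- rexp(noOfSim * n, lambda)

dfExpSample <- data.frame(sample = expSample)

# Histogram of the raw exponential sample
ggplot(data = dfExpSample, aes(x = sample)) +
  geom_histogram(stat = "bin", binwidth = 0.2, col = 'blue', fill = 'purple') +
  ylab('Frequency') +
  xlab('') +
  labs(title = 'Histogram\n')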


[Figure: Histogram of the simulated exponential sample. x-axis: sample value (0 to 50); y-axis: Frequency (0 to 1500).]

Properties of the exponential distribution generated are as follows:

1. Mean: 4.98
2. Standard Deviation: 4.99
3. Skewness: 1.97
4. Excess Kurtosis: 5.73
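The report does not show how these four statistics were computed. The sketch below is one way to obtain them in base R, assuming the seed is set immediately before the draw. The helpers skew and excessKurt are hypothetical names introduced here; the moments package listed at the end of the report provides skewness() and kurtosis(), where kurtosis() is not the excess value, so 3 must be subtracted.

```r
# Hypothetical base-R helpers built on central moments
skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}
excessKurt <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

set.seed(2016)                            # assumed seed placement
expSample <- rexp(1000 * 40, rate = 0.2)  # 40,000 exponential draws

props <- c(mean           = mean(expSample),
           sd             = sd(expSample),
           skewness       = skew(expSample),
           excessKurtosis = excessKurt(expSample))
round(props, 2)
```

For an exponential distribution the theoretical skewness is 2 and the theoretical excess kurtosis is 6, so values near those are expected regardless of λ.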

Notice that the mean and standard deviation are both close to 1/λ = 1/0.2 = 5. The distribution is not normal: its skewness and excess kurtosis are far from 0 (the theoretical values for an exponential distribution are 2 and 6). The QQ plot below also shows that the distribution is not normal.


[Figure: QQ Plot of the simulated exponential sample. x-axis label: Sample Quantiles; y-axis label: Theoretical Quantiles.]
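The code behind the QQ plot is not included in the report. As a rough numeric companion, the sketch below uses base R's qqnorm with plot.it = FALSE to get the quantile pairs and summarises their straightness with a correlation. Using a correlation as a summary is a choice made here for illustration, not something the report does, and the seed placement is again an assumption.

```r
set.seed(2016)                            # assumed seed placement
expSample <- rexp(1000 * 40, rate = 0.2)

# qqnorm() pairs each sample value (y) with the matching
# standard normal quantile (x); plot.it = FALSE skips the drawing
qq <- qqnorm(expSample, plot.it = FALSE)

# For normal data the pairs fall on a straight line, so their
# correlation is close to 1; curvature in the QQ plot lowers it
qqCor <- cor(qq$x, qq$y)
qqCor
```

Calling qqnorm(expSample) without plot.it = FALSE draws the plot itself, and qqline(expSample) adds a reference line.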

Studying the distribution of the mean of n exponentially generated random variables

Simulations

Let us instead simulate 1000 samples of 40 exponential random variables each and calculate the mean of each sample. The expected value, variance and standard error of the sample mean are as follows:

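$$E[\bar{X}] = \frac{1}{\lambda} = \frac{1}{0.2} = 5$$

$$\mathrm{Var}[\bar{X}] = \frac{1}{\lambda^2} \cdot \frac{1}{n} = \frac{1}{0.2^2} \cdot \frac{1}{40} = 0.625$$

$$SE[\bar{X}] = \sqrt{\mathrm{Var}[\bar{X}]} = \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} = 0.79057$$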
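The 1/n factor in the variance comes from the independence of the draws. A short derivation, writing X_1, ..., X_n for the n iid exponential variables in one sample, \bar{X} for their mean and σ² = 1/λ² for the variance of a single draw (the symbols X_i and σ are notation introduced here):

```latex
\mathrm{Var}[\bar{X}]
  = \mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i]
  = \frac{n\,\sigma^2}{n^2}
  = \frac{\sigma^2}{n}
  = \frac{1}{\lambda^2 n}
```

With λ = 0.2 and n = 40 this gives 25/40 = 0.625, and taking the square root gives the standard error of about 0.79057.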

# One row per simulation, one column per exponential draw
simSample <- matrix(rexp(n * noOfSim, lambda), noOfSim, n)

expMean <- 1 / lambda               # theoretical mean of the sample mean
stdError <- 1 / lambda / sqrt(n)    # theoretical standard error
sampleMean <- apply(simSample, 1, mean)

dfSampleMean <- data.frame(sample = sampleMean)

# Histogram of the sample means; the red line marks their mean
ggplot(data = dfSampleMean, aes(x = sample)) +
  geom_histogram(stat = "bin", binwidth = 0.2, col = 'blue', fill = 'purple') +
  ylab('Frequency') +
  xlab('') +
  labs(title = 'Histogram\n') +
  geom_vline(xintercept = mean(sampleMean), color = 'red', size = 1.5)

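[Figure: Histogram of the 1000 sample means, with a red vertical line at their mean. x-axis: sample mean (ticks at 4, 6, 8); y-axis: Frequency (0 to 90).]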

Sample Mean vs Theoretical Mean

mean(sampleMean)

## [1] 5.009748

The mean of the sample means is 5.00975, as indicated by the red vertical line on the histogram. This value is close to the theoretical mean of 1/λ = 1/0.2 = 5.

Sample Variance vs Theoretical Variance

var(sampleMean)

## [1] 0.6194703

The variance of the sample means is 0.61947. This value is close to the theoretical variance of 1/λ^2 * 1/n = 1/0.2^2 * 1/40 = 0.625.
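The two comparisons above can be put side by side. The sketch below re-runs the simulation under the assumption that the seed is set right before the draw, so its numbers need not match the report's exactly. It uses rowMeans in place of apply(simSample, 1, mean), which computes the same row means; the comparison data frame and its relError column are names introduced here.

```r
set.seed(2016)                 # assumed seed placement
lambda  <- 0.2
noOfSim <- 1000
n       <- 40

# One row per simulation, one column per exponential draw
simSample  <- matrix(rexp(n * noOfSim, lambda), noOfSim, n)
sampleMean <- rowMeans(simSample)

# Simulated versus theoretical mean and variance of the sample mean
comparison <- data.frame(
  quantity    = c("mean", "variance"),
  simulated   = c(mean(sampleMean), var(sampleMean)),
  theoretical = c(1 / lambda, 1 / (lambda^2 * n))
)
comparison$relError <- abs(comparison$simulated - comparison$theoretical) /
  comparison$theoretical
comparison
```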


Distribution

Let us study the distribution of the sample means to see whether it follows the Central Limit Theorem, which states that the distribution of averages of iid variables (properly normalised) approaches that of a standard normal as the sample size grows.

To standardise the sample means, we subtract the expected mean from each sample mean and divide by the standard error.

stdSampleMean <- (sampleMean - expMean) / stdError
dfStdSampleMean <- data.frame(sample = stdSampleMean)

We plot the standardised sample means in a density plot. Notice that the sample density curve (red) is very close to the standard normal density curve (yellow).

ggplot(data = dfStdSampleMean, aes(x = sample)) +
  geom_histogram(aes(y = ..density..),
                 stat = "bin", binwidth = 0.2, col = 'blue', fill = 'purple') +
  geom_density(col = 'red', size = 1.5) +
  stat_function(fun = dnorm, colour = "yellow", size = 1.5) +
  ylab('Density') +
  xlab('') +
  labs(title = 'Density Plot of the Standardised Sample Mean\n')

[Figure: Density Plot of the Standardised Sample Mean. x-axis: standardised sample mean (−2 to 4); y-axis: Density (0.0 to 0.4).]

Properties of the standardised sample means are as follows:


1. Mean: 0.01
2. Standard Deviation: 1
3. Skewness: 0.28
4. Excess Kurtosis: 0.23

Notice that the mean and standard deviation are close to those of a standard normal distribution, 0 and 1 respectively. The distribution is approximately normal because its skewness and excess kurtosis are close to 0. The QQ plot below also shows that the distribution is close to normal.

[Figure: QQ Plot of the standardised sample means. x-axis label: Sample Quantiles; y-axis label: Theoretical Quantiles.]

Conclusion

We have shown that the standardised sample means of random variables generated from the exponential distribution have a distribution close to that of a standard normal when n is large.

Libraries required for this assignment project: ggplot2, moments
