LECTURE NOTES
FYS 4550/FYS9550 -
EXPERIMENTAL HIGH ENERGY PHYSICS
AUTUMN 2016
STATISTICS
A. STRANDLIE
NTNU AT GJØVIK
AND
UNIVERSITY OF OSLO
Statistics
• Statistics is about making inference about a statistical model, given
a set of data or measurements
– Parameters of a distribution
– Parameters describing the kinematics of a particle after a collision
• Position and momentum at some reference surface
– Parameters describing an interaction vertex (position, refined estimates
of particle momenta)
• Will consider two issues
– Parameter estimation
– Hypothesis tests and confidence intervals
Statistics
• Parameter estimation
• We want to estimate the unknown value of a parameter θ.
• An estimator is a function of the data which aims to estimate the
value of θ as closely as possible.
• General estimator properties
– Consistency
– Bias
– Efficiency
– Robustness
• A consistent estimator is an estimator which converges to the true
value of θ when the amount of data increases (formally, in the limit
of infinite amount of data).
Statistics
• The bias b of an estimator \hat{\theta} is defined as

  b = E[\hat{\theta}] - \theta

• Since the estimator is a function of the data, it is itself a random
variable with its own distribution.
• The expectation value E[\hat{\theta}] can be interpreted as the mean value of
the estimate over a very large number of hypothetical, identical
experiments.
• Obviously, unbiased (i.e. b = 0) estimators are desirable.
Statistics
• The efficiency of an estimator is the inverse of the ratio of its
variance to the minimum possible value.
• The minimum possible value is given by the Rao-Cramér-Fréchet
lower bound

  \sigma^2_{\min} = \frac{(1 + \partial b/\partial\theta)^2}{I(\theta)}

where I(\theta) is the Fisher information:

  I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\sum_i \ln f(x_i;\theta)\right)^2\right]
Statistics
• The sum is over all the data, which are assumed independent and to follow
the pdf f(x; θ).
• The expression of the lower bound is valid for all estimators with the same
bias function b(θ) (for unbiased estimators b(θ) vanishes).
• If the variance of the estimator happens to be equal to the Rao-Cramér-
Fréchet lower bound, it is called a minimum variance bound estimator
or a (fully) efficient estimator.
• Different estimators of the same parameter can also be compared by
looking at the ratios of the efficiencies. One then talks about relative
efficiencies.
• Robustness is the (qualitative) degree of insensitivity of the estimator to
deviations in the assumed pdf of the data
– e.g. noise in the data not properly taken into account
– wrong data
– etc
Statistics
• Common estimators for the mean and variance (often called the
sample mean and the sample variance) are:

  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i , \qquad
  s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2

• The variances of these are:

  V[\bar{x}] = \frac{\sigma^2}{N} , \qquad
  V[s^2] = \frac{1}{N}\left(m_4 - \frac{N-3}{N-1}\,\sigma^4\right)

where m_4 is the fourth central moment of the distribution.
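These estimators are easy to sketch in code. The following Python snippet (not part of the notes; all numbers are illustrative) computes the sample mean and the unbiased sample variance of simulated Gaussian data:

```python
import random

def sample_mean(xs):
    # (1/N) * sum_i x_i
    return sum(xs) / len(xs)

def sample_variance(xs):
    # 1/(N-1) * sum_i (x_i - xbar)^2  (unbiased estimator)
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

random.seed(1)
# Illustrative data: Gaussian with true mean 5.0 and true variance 4.0.
data = [random.gauss(5.0, 2.0) for _ in range(100000)]
print(sample_mean(data), sample_variance(data))  # close to 5.0 and 4.0
```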
Statistics
• For variables which obey the Gaussian distribution, this yields for
large N

  \mathrm{std}(s^2) \approx \sigma^2 \sqrt{2/N}

• For Gaussian variables the sample mean is a fully efficient
estimator.
• If the different measurements used in the calculation of the sample
mean have different variances, a better estimator of the mean is the
weighted sample mean:

  \bar{x}_w = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}
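A minimal sketch of the weighted sample mean, with three hypothetical measurements; the standard deviation of the weighted mean is the standard result 1/sqrt(Σ 1/σ_i²), which the notes do not state explicitly:

```python
def weighted_mean(xs, sigmas):
    # xbar_w = sum_i(x_i / sigma_i^2) / sum_i(1 / sigma_i^2)
    num = sum(x / s ** 2 for x, s in zip(xs, sigmas))
    den = sum(1.0 / s ** 2 for s in sigmas)
    return num / den

def weighted_mean_sigma(sigmas):
    # Standard deviation of the weighted mean: 1 / sqrt(sum_i 1/sigma_i^2)
    # (standard result, not given explicitly in the notes).
    return sum(1.0 / s ** 2 for s in sigmas) ** -0.5

# Three hypothetical measurements of the same quantity:
xs = [10.2, 9.8, 10.5]
sigmas = [0.1, 0.2, 0.5]
print(weighted_mean(xs, sigmas), weighted_mean_sigma(sigmas))
```

Note how the most precise measurement (σ = 0.1) dominates the result.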
Statistics
• The method of maximum likelihood:
• Assume that we have N independent measurements all obeying the
pdf f(x;θ), where θ is a parameter vector consisting of n different
parameters to be estimated.
• The maximum likelihood estimate is the value of the parameter
vector θ which maximizes the likelihood function

  L(\theta) = \prod_{i=1}^{N} f(x_i; \theta)

• Since the natural logarithm is a monotonically increasing function,
ln L and L attain their maximum at the same value of θ.
Statistics
• Therefore the maximum likelihood estimate can be found by solving the likelihood equations

  \frac{\partial \ln L}{\partial \theta_i} = 0

for all i = 1, …, n.
• ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient
– Therefore very popular
• An estimate of the inverse of the covariance matrix of an ML estimate is

  (V^{-1})_{ij} = -\frac{\partial^2 \ln L}{\partial\theta_i\,\partial\theta_j}

evaluated at the estimated value of θ.
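As an illustration of the likelihood equations (the choice of an exponential pdf and all numbers here are assumptions, not from the notes): for f(x; τ) = (1/τ) exp(-x/τ), solving ∂ln L/∂τ = 0 gives τ̂ = sample mean, which can be checked numerically:

```python
import math
import random

def neg_log_likelihood(tau, xs):
    # -ln L(tau) for f(x; tau) = (1/tau) * exp(-x/tau)
    return len(xs) * math.log(tau) + sum(xs) / tau

random.seed(2)
tau_true = 3.0
data = [random.expovariate(1.0 / tau_true) for _ in range(50000)]

# The likelihood equation d(ln L)/d(tau) = -N/tau + sum(x_i)/tau^2 = 0
# is solved analytically by the sample mean:
tau_hat = sum(data) / len(data)

# Numerical check: tau_hat minimizes -ln L.
eps = 1e-3
assert neg_log_likelihood(tau_hat, data) <= neg_log_likelihood(tau_hat + eps, data)
assert neg_log_likelihood(tau_hat, data) <= neg_log_likelihood(tau_hat - eps, data)
print(tau_hat)  # close to 3.0
```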
Statistics
• The method of least squares.
• Simplest possible example: estimating the parameters of a straight
line (intercept and tangent of inclination angle) given a set of
measurements.
[Figure: a set of measurements with error bars and the fitted straight line]
Statistics
• Least-squares approach: minimizing the sum of squared distances S
between the line and the N measurements,

  S = \sum_{i=1}^{N} \frac{(y_i - (a + b x_i))^2}{\sigma_i^2}

(where \sigma_i^2 is the variance of the measurement error), with respect
to the parameters of the line (i.e. a and b).
• This cost function or objective function S can be written in a more
compact way by using matrix notation:

  S = (\mathbf{y} - H\boldsymbol{\theta})^T V^{-1} (\mathbf{y} - H\boldsymbol{\theta})
Statistics
• Here y is a vector of measurements, θ is a vector of the parameters
a and b, V is the (diagonal) covariance matrix of the measurements
(consisting of the individual variances on the main diagonal), and H
is given by

  H = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}

• Taking the derivative of S with respect to θ, setting it to zero and
solving for θ yields the least-squares solution to the problem.
Statistics
• The result is:

  \hat{\boldsymbol{\theta}} = (H^T V^{-1} H)^{-1} H^T V^{-1} \mathbf{y}

• The covariance matrix of the estimated parameters is:

  \mathrm{cov}(\hat{\boldsymbol{\theta}}) = (H^T V^{-1} H)^{-1}

and the covariance matrix of the estimated positions \hat{\mathbf{y}} = H\hat{\boldsymbol{\theta}} is

  \mathrm{cov}(\hat{\mathbf{y}}) = H (H^T V^{-1} H)^{-1} H^T
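The closed-form solution can be sketched for the straight-line case. This Python snippet (illustrative, stdlib only; the data are made up) builds the normal equations (H^T V^{-1} H) θ = H^T V^{-1} y explicitly for θ = (a, b) and inverts the 2×2 matrix by hand:

```python
import random

def fit_line(xs, ys, sigmas):
    # Weighted least-squares fit of y = a + b*x.
    # With H_i = (1, x_i) and V = diag(sigma_i^2), solves
    # (H^T V^-1 H) theta = H^T V^-1 y for theta = (a, b).
    w = [1.0 / s ** 2 for s in sigmas]
    S0 = sum(w)
    S1 = sum(wi * x for wi, x in zip(w, xs))
    S2 = sum(wi * x * x for wi, x in zip(w, xs))
    c0 = sum(wi * y for wi, y in zip(w, ys))
    c1 = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = S0 * S2 - S1 * S1
    a = (S2 * c0 - S1 * c1) / det
    b = (S0 * c1 - S1 * c0) / det
    # cov(theta) = (H^T V^-1 H)^-1, inverted here as a 2x2 matrix
    cov = [[S2 / det, -S1 / det], [-S1 / det, S0 / det]]
    return a, b, cov

random.seed(3)
# Illustrative data: true line y = 1.0 + 2.0*x with Gaussian noise.
sigma = 0.5
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, sigma) for x in xs]
a, b, cov = fit_line(xs, ys, [sigma] * len(xs))
print(a, b)  # close to 1.0 and 2.0
```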
Statistics
[Figure: simulation of 10000 lines; histogram of the estimated intercept.
What is the true value of the intercept?]
Statistics
[Figure: simulation of 10000 lines; histogram of the estimated tangent of
the angle of inclination. What is the true value?]
Statistics
Histograms of normalized residuals of the estimated parameters.
For each fitted line and each estimated parameter, the quantity
(estimated parameter - true parameter)/(standard deviation of the
parameter) is entered into the histogram.
If everything is OK with the fitting procedure, these histograms should
have mean 0 and standard deviation 1.
[Histograms: mean = -0.0189, std = 1.0038; mean = 0.0157, std = 1.0011]
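The normalized-residual (pull) check can be sketched for the simplest estimator, the sample mean of Gaussian data; all numbers below are illustrative:

```python
import math
import random

random.seed(4)
mu_true, sigma, N = 5.0, 2.0, 25
pulls = []
for _ in range(20000):
    xs = [random.gauss(mu_true, sigma) for _ in range(N)]
    mu_hat = sum(xs) / N
    # Normalized residual: (estimate - true value) / std of the estimate
    pulls.append((mu_hat - mu_true) / (sigma / math.sqrt(N)))

mean_pull = sum(pulls) / len(pulls)
std_pull = math.sqrt(sum((p - mean_pull) ** 2 for p in pulls) / (len(pulls) - 1))
print(mean_pull, std_pull)  # close to 0 and 1
```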
Statistics
• Least-squares estimation is for instance used in track fitting in high-
energy physics experiments.
• Track fitting is basically the same task as the line fit example:
estimating a set of parameters describing a particle track through a
tracking detector, given a set of measurements created by the
particle.
• In the general case the track model is not a straight line but rather a
helix (homogeneous magnetic field) or some other trajectory
obeying the equations of motion in an inhomogeneous magnetic
field.
• The principles of the fitting procedure, however, are largely the
same.
Statistics
• As long as there is a linear relationship between the parameters and the
measurements, the least-squares method is linear.
• If this relationship is a non-linear function F(θ), the problem is said to be of a
non-linear least-squares type:

  S = (\mathbf{y} - F(\boldsymbol{\theta}))^T V^{-1} (\mathbf{y} - F(\boldsymbol{\theta}))

• There exists no direct solution to this problem, and one has to resort to an
iterative approach (Gauss-Newton):
– Start out with an initial guess of θ, linearize the function F around the initial
guess by a Taylor expansion, and solve the resulting linear least-squares problem
– Use the estimated value of θ as a new expansion point for F and repeat the step
above
– Iterate until convergence (i.e. until θ changes by less than a specified amount from
one iteration to the next)
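The iteration above can be sketched for a hypothetical one-parameter model F(x; θ) = exp(θx); the model choice and all numbers are assumptions, not from the notes:

```python
import math
import random

def gauss_newton(xs, ys, sigma, theta0, n_iter=50, tol=1e-10):
    # Gauss-Newton for the one-parameter model F(x; theta) = exp(theta*x):
    # linearize F around the current theta, solve the resulting linear
    # least-squares problem for the correction, and repeat until convergence.
    theta = theta0
    for _ in range(n_iter):
        J = [x * math.exp(theta * x) for x in xs]              # dF/dtheta
        r = [y - math.exp(theta * x) for x, y in zip(xs, ys)]  # residuals
        step = (sum(j * ri for j, ri in zip(J, r)) / sigma ** 2) / \
               (sum(j * j for j in J) / sigma ** 2)
        theta += step
        if abs(step) < tol:  # converged
            break
    return theta

random.seed(5)
# Illustrative data: true theta = 0.5, Gaussian measurement noise.
theta_true, sigma = 0.5, 0.05
xs = [0.1 * i for i in range(1, 21)]
ys = [math.exp(theta_true * x) + random.gauss(0.0, sigma) for x in xs]
theta_hat = gauss_newton(xs, ys, sigma, theta0=0.1)
print(theta_hat)  # close to 0.5
```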
Statistics
• Relationship between maximum likelihood and least-squares:
• Consider a set of independent measurements y with mean values
F(x;θ).
• If these measurements follow a Gaussian distribution, twice the negative
log-likelihood function is basically

  -2 \ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \frac{(y_i - F(x_i;\boldsymbol{\theta}))^2}{\sigma_i^2}

plus some terms which do not depend on θ.
• Maximizing the log-likelihood function is in this case equivalent to
minimizing the least-squares objective function.
Statistics
• Confidence intervals and hypothesis tests.
• Confidence intervals:
– Given a set of measurements of a parameter, calculate an interval within which one can be, e.g., 95 % sure that the true value of the parameter lies
– Such an interval is called a 95 % confidence interval for the parameter
• Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value μ and known standard deviation σ. Use the sample mean to calculate a 100(1-α) % confidence interval for μ.
• From earlier: the sample mean is an unbiased estimator for μ with standard deviation σ/sqrt(N).
• For large enough N, the quantity

  Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}

is distributed according to a standard normal distribution
(mean value 0, standard deviation 1)
Statistics
• Therefore:

  P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha
  P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{N}} \le z_{\alpha/2}\right) = 1 - \alpha
  P\left(\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{N} \le \mu \le \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{N}\right) = 1 - \alpha

• In words, there is a probability 1-α that the true mean is in the interval

  \left[\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{N},\; \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{N}\right]

• This interval is therefore a 100(1-α) % confidence interval for μ.
• Such intervals are highly relevant in physics analysis.
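A sketch of the interval and a numerical check of its coverage (all numbers are illustrative; z_{α/2} ≈ 1.96 for α = 0.05):

```python
import math
import random

def confidence_interval(xs, sigma, z):
    # 100(1-alpha)% CI for mu: [xbar - z*sigma/sqrt(N), xbar + z*sigma/sqrt(N)]
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

random.seed(6)
mu_true, sigma = 10.0, 1.5
z95 = 1.96  # z_{alpha/2} for alpha = 0.05

# Coverage check: the true mean should lie inside about 95 % of the intervals.
n_trials, covered = 5000, 0
for _ in range(n_trials):
    xs = [random.gauss(mu_true, sigma) for _ in range(30)]
    lo, hi = confidence_interval(xs, sigma, z95)
    if lo <= mu_true <= hi:
        covered += 1
print(covered / n_trials)  # close to 0.95
```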
Statistics
• Hypothesis tests:
• A hypothesis is a statement about the distribution of a vector x of
data.
• Similar to the previous example:
– given a set of N measurements, test whether the measurements
come from a normal distribution with a certain expectation value μ or
not.
– define a test statistic, i.e. the quantity to be used in the evaluation of the
hypothesis. Here: the sample mean.
– define the significance level of the test, i.e. the probability that the
hypothesis will be discarded even though it is true.
– determine the critical region of the test statistic, i.e. interval(s) of values
of the test statistic which will lead to the rejection of the hypothesis
Statistics
• We then state two competing hypotheses:
– A null hypothesis, stating that the expectation value is equal to a given
value
– An alternative hypothesis, stating that the expectation value is not equal
to the given value
• Mathematically:

  H_0: \mu = \mu_0
  H_1: \mu \ne \mu_0

• Test statistic:

  Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}
Statistics
[Figure: standard normal pdf with the two tails beyond -z_{\alpha/2} and
z_{\alpha/2} shaded. The probability of being in the shaded area is α, so
the shaded area is the critical region of Z for significance level α.]
Obtain a value of the test statistic from test data by calculating the
sample mean and transforming to Z.
Use the actual value of Z to determine whether the null hypothesis is
rejected or not.
Statistics
• Alternatively: perform the test by calculating the so-called p-value of the
test statistic.
• Given the actual value of the test statistic, what is the area below the pdf
for the range of values of the test statistic starting from the actual one and
extending to all values further away from the value defined by the null
hypothesis? This area defines the p-value.
– For the current example this would correspond to adding two integrals
of the pdf of the test statistic (because this is a so-called two-sided test):
• one from minus infinity to minus the absolute value of the actual value of the
test statistic
• another from the absolute value of the actual value of the test statistic to plus
infinity
• For a one-sided test one would use only one integral of the type above.
• If the p-value is less than the significance level, discard the null
hypothesis; if not, don't discard it.
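The two-sided p-value can be computed from the standard normal CDF, which the Python standard library expresses via math.erf (illustrative sketch; the value Z = 1.5 is a made-up example):

```python
import math

def phi(z):
    # Standard normal CDF, written via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p_value(z):
    # Sum of the two tail areas beyond |z| under the standard normal pdf
    return 2.0 * (1.0 - phi(abs(z)))

# For a hypothetical test-statistic value Z = 1.5:
print(two_sided_p_value(1.5))  # about 0.134 > alpha = 0.05, so H0 is kept
```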
Statistics
• p-values can be used in so-called goodness-of-fit tests.
• In such tests one frequently uses a test statistic which is assumed to
be chi-square distributed
– Is a measurement in a tracking detector compatible with belonging to a
particle track defined by a set of other measurements?
– Is a histogram with a set of entries in different bins compatible with an
expected histogram (defined by an underlying assumption about the
distribution)?
– Are the residual distributions of the estimated parameters compatible with
the estimated covariance matrix of the parameters?
• If one can calculate many independent values of the test statistic,
the following procedure is often applied:
– Calculate the p-value of the test statistic each time the test statistic is
calculated
Statistics
– The p-value itself is also a random variable, and it can be shown that it
is distributed according to a uniform distribution if the test statistic
originates from the expected (chi-square) distribution.
– Create a histogram with the various p-values as entries and see
whether it looks reasonably flat
• NB! With only one calculated p-value, the null hypothesis can be
rejected but never confirmed!
• With many calculated p-values (as immediately above) the null
hypothesis can also (to a certain extent) be confirmed!
• Example: line fit (as before)
• For each fitted line, calculate the following chi-square:

  \chi^2 = (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\, \mathrm{cov}(\hat{\boldsymbol{\theta}})^{-1}\, (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})
Statistics
• Here θ is the true value of the parameter vector.
• For each value of the chi-square, calculate the corresponding p-value
– Integral of the chi-square distribution from the value of the chi-square to
infinity
• Given in tables or in standard computer programs (CERNLIB, CLHEP,
MATLAB, ….)
• Fill up a histogram with the p-values and make a plot:

[Figure: histogram of p-values; reasonably flat, seems OK.]

• What we really test here is that the estimated parameters are unbiased
estimates of the true parameters, distributed according to a Gaussian with
a covariance matrix as obtained in the estimate!
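For two degrees of freedom (two fitted parameters) the chi-square p-value has the simple closed form exp(-χ²/2), which makes the flatness check easy to sketch; the simulation below uses made-up, illustrative numbers:

```python
import math
import random

def chi2_p_value_2dof(x):
    # For 2 degrees of freedom the chi-square pdf is (1/2)*exp(-x/2),
    # so the integral from x to infinity is exp(-x/2).
    return math.exp(-x / 2.0)

random.seed(7)
# Simulate chi-square(2) values as sums of two squared standard normals,
# mimicking the chi-square of a 2-parameter fit with Gaussian errors.
pvals = [chi2_p_value_2dof(random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2)
         for _ in range(20000)]

# If the test statistic really is chi-square distributed, the p-values are
# uniform on [0, 1]: a 10-bin histogram should come out roughly flat.
bins = [0] * 10
for p in pvals:
    bins[min(int(p * 10), 9)] += 1
print(bins)  # each bin should hold roughly 2000 entries
```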