A little bit of statistics P( waow | news ) = ?

olivier-teytaud

### Transcript of Statistics 101

A little bit of statistics

P( waow | news ) = ?

Posterior probability

● In case of independent items,

● P( observations | Θ ) = P( obs1 | Θ ) x P( obs2 | Θ ) x … x P( obsZ | Θ )

Bayes theorem

● Bayes :

P( Θ | observations ) x P( observations ) = P( observations | Θ ) x P( Θ )

● So :

P( Θ | observations ) = P( observations | Θ ) x P( Θ ) / P( observations )

So, by independent items + Bayes,

● P( Θ | observations ) is proportional to

P(Θ) x P( obs1 | Θ) x … x P(obsZ | Θ)

● Definitions :

– MAP (maximum a posteriori) : find Θ* such that P(Θ*|observations) is max

– BPE (Bayesian posterior expectation) : find ΘE = expectation of (Θ|observations)

– Maximum likelihood : same as MAP with P(Θ) uniform

– there are other possible tools

– ErrorEstimate = Expectation of (Θ – estimator)^2
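As a rough illustration of the MAP and BPE definitions above, here is a minimal sketch that evaluates prior x likelihood on a grid of Θ values. The coin-flip (Bernoulli) model, the grid, the uniform prior and all function names are assumptions made for the example; they do not come from the slides.

```python
def posterior_on_grid(observations, prior, grid):
    """Normalized posterior weights on `grid`, assuming independent 0/1 observations
    with P(obs = 1 | theta) = theta (hypothetical toy model)."""
    weights = []
    for theta in grid:
        w = prior(theta)                      # P(theta)
        for obs in observations:              # x P(obs_k | theta), independent items
            w *= theta if obs else 1.0 - theta
        weights.append(w)
    total = sum(weights)                      # plays the role of P(observations)
    return [w / total for w in weights]

grid = [k / 1000 for k in range(1, 1000)]         # theta in the open interval (0, 1)
observations = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]     # 7 successes out of 10
posterior = posterior_on_grid(observations, lambda theta: 1.0, grid)  # uniform prior

# MAP: the grid point where the posterior is largest.
theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
# BPE: the posterior expectation of theta.
theta_bpe = sum(theta * w for theta, w in zip(grid, posterior))
```

With the uniform prior used here, the MAP is also the maximum-likelihood estimate, and the BPE is pulled slightly towards the middle of the interval.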

log-likelihood

● Instead of probas, use log-probas.

● Because :

– Products become sums ==> more precise on a computer for very small probabilities (a long product of small probabilities can underflow to 0)
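A small sketch of why the sum of log-probas is safer than the product of probas. The 200 per-observation likelihoods of 0.01 are made-up numbers, chosen only so that the product falls below what a double-precision float can represent.

```python
import math

# Hypothetical per-observation likelihoods P(obs_k | theta).
probabilities = [0.01] * 200

# Direct product: the true value is 1e-400, which underflows to 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Log-likelihood: the same information as a sum, well within float range.
log_likelihood = sum(math.log(p) for p in probabilities)
```

Maximizing the log-likelihood (or log-posterior) gives the same Θ as maximizing the product, since the logarithm is increasing.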

Finding the MAP (or other estimates)

● Dimension 1 :

– Golden Search (unimodal)

– Grid Search (multimodal, slow)

– Robust search (compromise)

– Newton-Raphson (unimodal, precise, expensive computations)

● Dimension large :

– Jacobi algorithm

– Or Gauss-Seidel, or Newton, or NEWUOA, or ...
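For the one-dimensional unimodal case, a golden-section search could look like the sketch below. It is only an illustration: the function name, the tolerance and the test objective are assumptions, and the slides do not say which variant the author used.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Locate the maximum of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0    # 1 / golden ratio, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)                 # interior points, with c < d
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0

# Hypothetical log-posterior with a single peak at theta = 2.
best_theta = golden_section_max(lambda theta: -(theta - 2.0) ** 2, 0.0, 5.0)
```

Each iteration reuses one of the two previous evaluations, so only one new call to `f` is needed per step.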

Jacobi algorithm for maximizing in dimension D>1

● x=clever initialization, if possible

● While ( ||x' – x|| > epsilon )

– x'=current x

– For each parameter x(i), optimize it

● by a 1Dim algorithm

● with just a few iterates

Jacobi = great when the objective function, restricted to 1 parameter, is much faster to compute

Jacobi algorithm for maximizing in dimension D>1 (continued)

● For each parameter x(i), optimize it with :

– one iteration of robust search

– but don't decrease the interval if the optimum is close to the current bounds
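The loop described on these slides can be sketched as follows. Several points are assumptions, not the author's code: the slides do not define "robust search", so the 1D step here is a stand-in that tries 5 candidates around the current value and halves the interval only when the best candidate is not on its edge; coordinates are updated in place, so each one sees the latest values of the others; and the stopping test also requires the intervals to be small, because with this 1D step x can stay still while the intervals are still wide.

```python
def coordinate_maximize(f, x0, width=1.0, epsilon=1e-6, max_sweeps=1000):
    """Maximize f(list_of_floats) one coordinate at a time (hypothetical sketch)."""
    x = list(x0)
    widths = [width] * len(x)                 # one search interval per parameter
    for _ in range(max_sweeps):
        previous = list(x)                    # x' = current x
        for i in range(len(x)):
            # One cheap 1D step: 5 candidates in [x(i) - width, x(i) + width].
            candidates = [x[i] + widths[i] * k / 2.0 for k in (-2, -1, 0, 1, 2)]

            def restricted(value, i=i):
                trial = list(x)
                trial[i] = value
                return f(trial)

            best = max(candidates, key=restricted)
            on_edge = best in (candidates[0], candidates[-1])
            x[i] = best
            if not on_edge:
                widths[i] /= 2.0              # shrink only if the optimum is inside
        moved = max(abs(a - b) for a, b in zip(x, previous))
        if moved <= epsilon and max(widths) <= epsilon:
            break
    return x

# Hypothetical concave objective with its maximum at (1, -2).
def objective(p):
    u, v = p[0] - 1.0, p[1] + 2.0
    return -u * u - v * v - 0.5 * u * v

solution = coordinate_maximize(objective, [0.0, 0.0])
```

In an IRT setting the restricted objective for one student's ability only involves that student's answers, which is where the speed-up mentioned on the slide would come from.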

Possible use

● Computing student's abilities, given item parameters

● Computing item parameters, given student abilities

● Computing both item parameters and student abilities (need plenty of data)

Priors

● How to know P(Θ) ?

● Keep in mind that difficulties and abilities are translation invariant

– ==> so you need a reference

– ==> possibly reference = average Θ = 0

● If you have a big database and trust your model (3PL ?), you can use Jacobi+MAP.

What if you don't like Jacobi's result ?

● Too slow ? (initialization, epsilon larger, better 1D algorithm, better implementation...)

● Epsilon too large ?

● Maybe you use MAP whereas you want BPE ?

==> If you get convergence and don't like the result, it's not because of Jacobi, it's because of the criterion.

● Maybe not enough data ?

Initializing IRT parameters ?

● Rough approximations for IRT parameters :

– Abilities (Θ)

– Item parameters (a,b,c in 3PL models)

● Priors can be very convenient for that.

Find Θ with quantiles ! 1. Rank students per performance.

Find Θ with quantiles ! 2. Cumulative distribution

[Figure: cumulative distribution; axis label: ABILITIES]

Find Θ with quantiles ! 3. Projections

[Figure: axis label: ABILITIES; marked levels: Best = N/(N+1), Medium student, Worst = 1/(N+1)]

Equation version for approximating abilities Θ

If you have a prior (e.g. Gaussian), then a simple solution :

– Rank students per score on the test (i=1 for the best, i=N for the worst)

– For student i over N, Θ is initialized at the prior's quantile 1 – i/(N+1)

E.g. with a Gaussian prior (mu, sigma),

ability(i) = mu + sigma*norminv(1 – i/(N+1))

With norminv e.g. as in http://www.wilmott.com/messageview.cfm?catid=10&threadid=38771
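The quantile initialization can be sketched in Python, where `statistics.NormalDist.inv_cdf` (Python 3.8+) plays the role of norminv. The function name and the sample scores are assumptions for the example, and ties between students are not handled specially.

```python
from statistics import NormalDist

def initial_abilities(scores, mu=0.0, sigma=1.0):
    """ability(i) = mu + sigma * norminv(1 - i/(N+1)), with rank i = 1 for the best score."""
    n = len(scores)
    prior = NormalDist(mu, sigma)             # Gaussian prior on abilities
    # Student indices sorted from best to worst score.
    order = sorted(range(n), key=lambda s: scores[s], reverse=True)
    abilities = [0.0] * n
    for rank, student in enumerate(order, start=1):
        # inv_cdf of NormalDist(mu, sigma) already includes the mu + sigma * ... scaling.
        abilities[student] = prior.inv_cdf(1.0 - rank / (n + 1))
    return abilities

# Hypothetical test scores for three students.
abilities = initial_abilities([12, 30, 21])
```

Here the best student lands on the quantile 3/4, the middle one on 1/2 (ability mu), and the worst on 1/4.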

Equation version for approximating item parameters

Much harder !

There are formulas based on correlation. It's a very rough approximation.

How to estimate b (difficulty) if c=0 ?

Simple solution :

– Assume a=1 (discrimination)

– Use the curve, or approximate

b = 4.8 x (1/2 - proba(success))

– If you know students' abilities, it's much easier

And for difficulty of items ? Use the curve or the approximation...
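The approximation on this slide is easy to apply to a table of answers. In the sketch below, the 0/1 response matrix and the names are invented for the example; only the formula b = 4.8 x (1/2 - proba(success)) comes from the slide, with a=1 and c=0 assumed as stated there.

```python
def approximate_difficulty(success_rate):
    """Slide approximation b = 4.8 * (1/2 - proba(success)), assuming a = 1 and c = 0."""
    return 4.8 * (0.5 - success_rate)

# Hypothetical 0/1 responses: one row per student, one column per item.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
n_students = len(responses)
n_items = len(responses[0])

# proba(success) for each item = fraction of students who answered it correctly.
success_rates = [sum(row[j] for row in responses) / n_students for j in range(n_items)]
difficulties = [approximate_difficulty(p) for p in success_rates]
```

An item answered correctly by half of the students gets b = 0; easier items get a negative b and harder items a positive one.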

Codes

● IRT in R : there are packages, it's free, and R is a widely supported language for statistics.

● IRT in Octave : we started our implementation, but it is still very preliminary :

– No missing data (handling missing data is the main strength of IRT) ==> though this would be easy

– No user-friendly interface to data

● Others ? I did not check

● ==> Cross-validation for comparing ?

How to get the percentile from the ability

● percentile is norm-cdf( (theta* - mu)/sigma ) (some languages have normcdf included)

● Slow/precise implementation of norm-cdf: http://stackoverflow.com/questions/2328258/cumulative-normal-distribution-function-in-c-c

● Fast implementation of norm-cdf: http://finance.bi.no/~bernt/gcc_prog/recipes/recipes/node23.html

● Maybe a fast exp, if you want to save time :-)
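In Python, the percentile formula above can be written with the standard library's `math.erf`, using the identity norm-cdf(z) = (1 + erf(z / sqrt(2))) / 2. The function name and default parameters are assumptions for the example.

```python
import math

def percentile_from_ability(theta, mu=0.0, sigma=1.0):
    """norm-cdf((theta - mu) / sigma) for a Gaussian prior (mu, sigma)."""
    z = (theta - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

median_student = percentile_from_ability(0.0)        # ability equal to mu
strong_student = percentile_from_ability(1.0)        # one sigma above mu
```

A student whose ability equals mu sits at the 50th percentile; one sigma above mu is roughly the 84th.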