1 Conceptual Issues in Observed-Score Equating Wim J. van der Linden CTB/McGraw-Hill.

Post on 03-Jan-2016


1

Conceptual Issues in Observed-Score Equating

Wim J. van der Linden

CTB/McGraw-Hill

2

Outline

• Review of Lord (1980)

• Local equating

• Few examples

• Discussion

3

Review of Lord (1980)

• Notation
– X: old test form with observed score X
– Y: new test form with observed score Y
– θ: common ability measured by X and Y
– x=φ(y): equating transformation

4

Review of Lord (1980) Cont’d

• Case 1: Infallible measures
– X and Y order any population identically
– Equivalence of ranks establishes equating transformation

$G(y) = p$

$F(\varphi(y)) = p \;\Rightarrow\; \varphi(y) = F^{-1}(G(y))$
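The rank-matching rule φ(y) = F⁻¹(G(y)) can be illustrated with a minimal sketch. It assumes empirical distribution functions computed from two hypothetical score samples and uses a generalized inverse (the smallest x with F(x) ≥ p) to sidestep discreteness; the function names and data are illustrative only.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of a sample as a function."""
    data = sorted(sample)
    n = len(data)
    return lambda v: bisect_right(data, v) / n

def quantile(sample, p):
    """Smallest observed value x with F(x) >= p (a generalized inverse of F)."""
    data = sorted(sample)
    F = ecdf(data)
    for x in data:
        if F(x) >= p:
            return x
    return data[-1]

def equipercentile(y, x_scores, y_scores):
    """phi(y) = F^{-1}(G(y)): map y to the X score with the same rank."""
    G = ecdf(y_scores)
    return quantile(x_scores, G(y))

# Hypothetical data: every score on form X is 2 points higher than on form Y.
y_scores = [10, 12, 15, 18, 20]
x_scores = [s + 2 for s in y_scores]
print([equipercentile(y, x_scores, y_scores) for y in y_scores])
```

Because both samples order the test takers identically, each Y score is mapped to the X score with the same rank.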

5

Review of Lord (1980) Cont’d

• Case 1: Infallible measures Cont’d

– Q-Q curve
– Issues related to discreteness, strict monotonicity, and sampling error will be ignored
– Equating is population invariant
– Equating error always equal to zero

$e(x) = \varphi(y) - x$

6

Review of Lord (1980) Cont’d

• Case 2: Fallible measures
– For each test taker, observed scores are random variables
– Realizations of X and Y do not order populations of test takers identically
– Criterion of equity of equating

$f_{\varphi(Y)\mid\theta} = f_{X\mid\theta}$ for all θ

7

Review of Lord (1980) Cont’d

• Case 2: Fallible measures Cont’d

– Lord’s theorem: Under realistic conditions, scores X and Y on two tests cannot be equated unless either (1) both scores are perfectly reliable or (2) the two tests are strictly parallel [in which case φ(y)=y]

8

Review of Lord (1980) Cont’d

• Case 2: Fallible measures Cont’d

– Equating no longer population invariant

$f_X(x) = \int f_{X\mid\theta}(x)\, f(\theta)\, d\theta$

$f_Y(y) = \int f_{Y\mid\theta}(y)\, f(\theta)\, d\theta$

9

Review of Lord (1980) Cont’d

• Two approximate methods
– IRT true-score equating

$\xi = \sum_{i=1}^{n} P_i(\theta)$

$\eta = \sum_{j=1}^{n} P_j(\theta)$

– Use ξ=ξ(η) to equate Y to X
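A minimal sketch of how ξ=ξ(η) can be evaluated: find the θ at which the Y true score equals η, then read off the X true score at that θ. The 2PL response function, the bisection bounds, and the item parameters are assumptions made for illustration; the slides do not specify a model.

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability (the model choice is an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Test characteristic curve: sum of P_i(theta) over the items."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def equate_true_score(eta, items_x, items_y, lo=-10.0, hi=10.0):
    """Map a true score eta on Y to xi on X through the common theta."""
    # The characteristic curve is increasing in theta, so bisection finds
    # the theta at which the Y true score equals eta.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if true_score(mid, items_y) < eta:
            lo = mid
        else:
            hi = mid
    return true_score((lo + hi) / 2.0, items_x)

# Hypothetical (a, b) item parameters; form Y is harder than form X.
items_x = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
items_y = [(1.0, 0.0), (1.2, 0.5), (0.8, 1.0)]
print(round(equate_true_score(1.5, items_x, items_y), 3))
```

Since form Y is the harder form here, a given true score on Y corresponds to a higher true score on X.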

10

Review of Lord (1980) Cont’d

• Two approximate methods Cont’d

– IRT observed-score equating, for a sample of test takers a=1,…,N

$f_X(x) = N^{-1} \sum_{a=1}^{N} f(X = x \mid \theta_a)$

$f_Y(y) = N^{-1} \sum_{a=1}^{N} f(Y = y \mid \theta_a)$

11

Review of Lord (1980) Cont’d

• Lord’s forgotten question:

What is really needed is a criterion for evaluating such approximate procedures, so as to be able to choose from among them. If you can’t be fair (provide equity) to everyone, what is the next best thing? (p. 207)

12

Local Equating

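• New definition of equating error

$e_2(x;\theta) = F_{\varphi(Y)\mid\theta}(x) - F_{X\mid\theta}(x)$

• Equity=no equating error!

• Setting $e_2(x;\theta)$ equal to zero and solving for φ(y) gives

$x = \varphi^*(y;\theta) = F_{X\mid\theta}^{-1}(F_{Y\mid\theta}(y)), \quad \theta \in \mathbb{R}$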

13

Local Equating Cont’d

• Because of monotonicity of x=φ(y), the result is the family of error-free (or true) equating transformations

$\varphi^*(y;\theta) = F_{X\mid\theta}^{-1}(F_{Y\mid\theta}(y)), \quad \theta \in \mathbb{R}$

• Lord’s theorem is based on implicit assumption of a single transformation

14

Local Equating Cont’d

• Theorem: For a population of test takers P for which X and Y measure the same θ, equating with the family of transformations φ*(y;θ) has the following properties:
(i) equity for each p ∈ P
(ii) symmetry in X and Y for each p ∈ P
(iii) population invariance within P

15

Local Equating Cont’d

• Theorem defines population P
– No sampling of test takers required
– Includes future test takers

• Alternative definition of equating error:

$e_3(y;\theta) = \varphi(y) - F_{X\mid\theta}^{-1}(F_{Y\mid\theta}(y))$

16

Local Equating Cont’d

• Definition of bias, MSE, etc., in equating now straightforward

• Lord’s criterion for finding the “next best thing”

17

Local Equating Cont’d

• Alternative motivations of local equating
– Thought experiment
– History of standard error of measurement
– Comparison with
  • true-score equating
  • IRT observed-score equating
– Same score but different equated scores?

18

Local Equating Cont’d

• Alternative motivations Cont’d

– One measurement instrument but different transformations?

19

Few Examples

• It may seem as if local equating replaces Lord’s set of impossible conditions for equating (perfect reliability; parallel test) by another impossible condition (known ability)

• However, post hoc improvement of reliability or parallelness is impossible but we can always approximate an unknown ability

20

Few Examples Cont’d

• Possible approximations
– Estimating ability
– Anchor scores as a proxy of ability
– Y=y as a proxy of ability
– Proxies based on collateral information

21

Discussion

• Criterion of equity involves a different equating transformation for each ability level

• Traditional equating uses a “one-size-fits-all” transformation, which compromises between the transformations for the different ability levels. As a result, the equating is always (i) biased and (ii) population dependent

22

Discussion Cont’d

• Lord’s theorem on the impossibility or unnecessity of observed-score equating was too pessimistic because it assumed the use of a single equating transformation for a population of test takers

23

Equipercentile Method

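[Figure: cumulative distribution functions F(x) for Test X and G(y) for Test Y, plotted as cumulative probability (0.0 to 1.0) against test score (0 to 40). A score y on Test Y with G(y) = p is mapped to x = F^{-1}(G(y)) on Test X.]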

24

Thought Experiment

[Figure: Test Y, with test taker p at score y.]

25

Thought Experiment Cont’d

[Figure: Test Y and Test X, with test taker p at score y on Test Y and score x on Test X.]

26

Thought Experiment Cont’d

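[Figure: Test Y and Test X with test taker p at scores y and x, plus a panel “Transformation Y → X” plotting x = φ(y) against y with the point for p marked.]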

27

Thought Experiment Cont’d

[Figure: Test Y, Test X, and the panel “Transformation Y → X” with test taker p marked; a second test taker q is added on Test Y.]

28

Thought Experiment Cont’d

[Figure: Test Y, Test X, and the panel “Transformation Y → X” plotting x = φ(y) against y, with test takers p and q marked.]

29

Thought Experiment Cont’d

[Figure: Test Y, Test X, and a panel “Transformations Y → X” plotting x = φ(y) against y, with test takers p and q marked in each.]

30

Thought Experiment Cont’d

[Figure: Test Y (Population 1), Test X (Population 2), and a panel “Transformation Y → X” plotting x = φ(y) against y.]

31

Thought Experiment Cont’d

[Figure: two panels plotting x = φ(y) against y: “Transformation Y → X” and “Transformations Y → X”, the latter with p and q marked.]

32

Standard Error of Measurement

• Classical test theory involves one SEM for an entire population of test takers

$\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}$

• Stronger models condition on ability measured; e.g., IRT

$\sigma_{E\mid\theta} = \left[\sum_i P_i(\theta)\,(1 - P_i(\theta))\right]^{1/2}$
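The contrast between the two standard errors can be made concrete with a small sketch. The 2PL response function, the item parameters, and the values for the score standard deviation and reliability are hypothetical and chosen only for illustration.

```python
import math

def sem_classical(sd_x, reliability):
    """Classical SEM: sigma_X * sqrt(1 - rho_XX'), one value per population."""
    return sd_x * math.sqrt(1.0 - reliability)

def sem_conditional(theta, items):
    """Conditional SEM of the number-correct score at ability theta."""
    variance = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # 2PL, an assumption
        variance += p * (1.0 - p)
    return math.sqrt(variance)

# Hypothetical 40-item test with identical items (a = 1, b = 0).
items = [(1.0, 0.0)] * 40
print(sem_classical(sd_x=6.0, reliability=0.91))
print(sem_conditional(0.0, items), sem_conditional(2.0, items))
```

The classical value is a single number, whereas the conditional value changes with θ; for these items it is largest at θ = 0 and smaller at θ = 2.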

33

True-Score Equating

• True-score equating is a degenerate case of local equating

$E(Y \mid \theta) \mapsto E(X \mid \theta), \quad \theta \in \mathbb{R}$

34

Different Equated Scores?

• Why should two test takers, p and q, with the same score of 23 out of 30 items correct on a new test form need different equated scores on the same old form?
– Would this not even be unfair?
– Fallible scores

35

Different Equated Scores? Cont’d

[Figure: two panels, “Observed-score distribution of p” and “Observed-score distribution of q”.]

36

Different Transformations?

• Example of measuring tape

• Number-correct scores are counts of responses, not fundamental measures

• Responses have person and item effects
– Test equating requires “some type of control for differential examinee ability” (von Davier, Holland & Thayer, 2004, p. 2)

37

Different Transformations? Cont’d

• An effective way to disentangle item and person effects is through IRT modeling

• Observed-score equating is an attempt to do the same through a transformation of total scores
– Only possible way is (i) to first condition on the abilities and (ii) then transform the score to adjust for the item effects

38

Estimating Ability

• Assumption: fitting response model

• Calculate family of true equating transformations (Lord-Wingersky’s recursive procedure)

• Use member of family at point estimate of θ

• Bias study for 40-item subtests of LSAT

• Application in adaptive testing
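The first three steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the LSAT study: it assumes a 2PL model with hypothetical item parameters, computes the conditional number-correct distributions with the Lord-Wingersky recursion, and inverts the discrete distribution function by taking the smallest x with F(x | θ) ≥ F(y | θ).

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability (the model choice is an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_distribution(theta, items):
    """Lord-Wingersky recursion: P(number-correct = s | theta), s = 0..n."""
    dist = [1.0]
    for a, b in items:
        p = p_2pl(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for s, prob in enumerate(dist):
            new[s] += prob * (1.0 - p)   # item answered incorrectly
            new[s + 1] += prob * p       # item answered correctly
        dist = new
    return dist

def cdf(dist):
    """Cumulative sums of a discrete distribution."""
    total, out = 0.0, []
    for prob in dist:
        total += prob
        out.append(total)
    return out

def local_equate(y, theta, items_x, items_y):
    """phi*(y; theta): smallest x with F_{X|theta}(x) >= F_{Y|theta}(y)."""
    F_x = cdf(score_distribution(theta, items_x))
    F_y = cdf(score_distribution(theta, items_y))
    target = F_y[y]
    for x, value in enumerate(F_x):
        if value >= target - 1e-12:  # tolerance for rounding error
            return x
    return len(F_x) - 1

# Hypothetical item parameters; form Y is form X made 0.5 harder.
items_x = [(1.0, -1.0), (1.2, -0.5), (0.9, 0.0), (1.1, 0.5), (1.0, 1.0)]
items_y = [(a, b + 0.5) for a, b in items_x]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [local_equate(y, theta, items_x, items_y) for y in range(6)])
```

In practice θ is unknown, so the member of the family would be selected at a point estimate of θ, as the third bullet states.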

39

Bias Study

[Figure: two panels, “Traditional Equating” and “Local Equating at θ”, each plotting bias (−6 to 6) against score (0 to 40).]

40

Family of True Transformations for LSAT Subtest

[Figure: equated score x plotted against y (both axes 0 to 25), with curves labeled θ = −2.0 and θ = 2.0.]

41

Anchor Score as Proxy

• Current methods
– Chain equating
– Poststratification equating
– Linear equating methods: Tucker, Levine, Braun-Holland, linear chain equating

• Use conditional distributions of X and Y given anchor score A=a

$x = F_{X\mid a}^{-1}(F_{Y\mid a}(y)), \quad a \in A$
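A minimal sketch of equating within an anchor-score group, assuming an anchor-test design in which one group supplies (X, A) pairs and another supplies (Y, A) pairs. The data, the function names, and the use of raw empirical conditional distributions (no smoothing) are illustrative assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

def group_by_anchor(pairs):
    """Collect sorted form scores for each anchor score a."""
    groups = defaultdict(list)
    for score, anchor in pairs:
        groups[anchor].append(score)
    return {a: sorted(v) for a, v in groups.items()}

def local_equate_anchor(y, a, x_given_a, y_given_a):
    """x = F_{X|a}^{-1}(F_{Y|a}(y)) from empirical conditional distributions."""
    ys, xs = y_given_a[a], x_given_a[a]
    p = bisect_right(ys, y) / len(ys)       # F_{Y|a}(y)
    for i, x in enumerate(xs, start=1):     # smallest x with F_{X|a}(x) >= p
        if i / len(xs) >= p:
            return x
    return xs[-1]

# Hypothetical (form score, anchor score) pairs for the two groups.
x_pairs = [(12, 3), (14, 3), (17, 3), (20, 5), (22, 5), (25, 5)]
y_pairs = [(10, 3), (12, 3), (15, 3), (18, 5), (20, 5), (23, 5)]
x_given_a = group_by_anchor(x_pairs)
y_given_a = group_by_anchor(y_pairs)
print(local_equate_anchor(12, 3, x_given_a, y_given_a))
print(local_equate_anchor(18, 5, x_given_a, y_given_a))
```

Each anchor score gets its own transformation, so the same y can map to different x values for test takers with different anchor scores.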

42

Anchor Score as Proxy Cont’d

• Empirical bias study for same LSAT subtests

43

Bias Study—Anchor-Test Design

[Figure: three panels, “Chain Equating”, “Poststratification Equating”, and “Local Equating”, each plotting bias (−8 to 8) against score (0 to 40).]

44

Y=y as Proxy of Ability

• Single-group design
– Estimate distributions of X given Y=y directly from bivariate distribution of X and Y
– Model-based estimate of Y given y

45

Y=y as Proxy of Ability

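• Linear local equating

$x = \varphi(y) = \mu_{X\mid y} + \dfrac{\sigma_{X\mid y}}{\sigma_{Y\mid y}}\,(y - \mu_{Y\mid y}), \quad y = 0, 1, \ldots, n.$

• Because $\mu_{Y\mid y} = y$ (classical test theory),

$x = \varphi(y) = \mu_{X\mid y}, \quad y = 0, 1, \ldots, n.$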
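The reduced form x = μ_{X|y} suggests a very simple estimator in a single-group design: the mean X score among test takers with Y = y. The sketch below assumes raw cell means from hypothetical (x, y) pairs, with no smoothing of the bivariate distribution.

```python
from collections import defaultdict
from statistics import mean

def linear_local_equating(pairs):
    """Return {y: estimate of mu_{X|y}} from (x, y) pairs of one group."""
    by_y = defaultdict(list)
    for x, y in pairs:
        by_y[y].append(x)
    return {y: mean(xs) for y, xs in sorted(by_y.items())}

# Hypothetical (x, y) score pairs from a single group taking both forms.
pairs = [(12, 10), (14, 10), (15, 12), (17, 12), (16, 12)]
table = linear_local_equating(pairs)
print(table)
```

With sparse data at some values of y, cell means would be unstable, which is presumably why a model-based estimate is listed as an alternative.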

46

Collateral Information

• Any variables correlating substantially with θ
– Earlier tests
– Battery of subtests
– Response times

• Alternative sources give different equatings; just find the “next best thing”