BANOCOSS, Norrfallsviken, Sweden, June 13-17, 2011
Extended Sampford sampling,
balanced Pareto sampling,
and sampling with prescribed
second-order inclusion probabilities
Lennart Bondesson
Dept of Math. & Math. Statistics, Umeå University, SE-901 87 Umeå, Sweden
0. Introduction
Population U = {1,2, . . . , N}.
How to sample n units with prescribed inclusion probabilities $\pi_i$, $i = 1,2,\ldots,N$, with sum n is a topic with a long history (e.g. Brewer & Hanif 1983, Chaudhuri & Vos 1988, Tillé 2006).
Many researchers have contributed to the theory of πps sampling. There are many solutions. Different samplers have different favourite solutions.
Many samplers just press a button, but maybe not the best one.
I look upon a sample as a binary N-vector x. Thus a sample of size n = 3 from a population of size N = 20 is described as
x = [0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0]
and not as s = {3,12,14}. The first notation simplifies a lot! A design is described by its probability function p(x) = Pr(I = x).
Three important πps sampling designs:
The conditional Poisson (CP) design
The Sampford design
The Pareto design
Conditional Poisson (Hájek 1981, in particular):
$$p(\mathbf{x}) \propto \prod_{i=1}^{N} p_i^{x_i}(1-p_i)^{1-x_i} \propto \prod_{i=1}^{N} r_i^{x_i},\quad |\mathbf{x}|=n \qquad \Bigl(r_i=\frac{p_i}{1-p_i}\Bigr).$$
We sample by repeatedly generating independent $I_i \sim \mathrm{Bin}(1, p_i)$, $i = 1,2,\ldots,N$, until $\sum_{i=1}^{N} I_i = n$. The $p_i$'s must be determined so that we get the given inclusion probabilities $\pi_i$ with sum n. This is a bit complicated. Approximately it suffices to choose the $p_i$'s such that $\sum_{i=1}^{N} p_i = n$ and $p_i = \pi_i$, or (somewhat better)
$$\frac{p_i}{1-p_i} \propto \frac{\pi_i}{1-\pi_i}\,\exp\!\Bigl(\frac{\tfrac12-\pi_i}{d}\Bigr),\qquad \text{where } d=\sum_{i=1}^{N}\pi_i(1-\pi_i).$$
There are also iterative methods for determining suitable $p_i$'s (e.g. Tillé 2006). The CP design has maximum entropy.
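A minimal Python sketch of this Poisson-rejective procedure with the adjusted $p_i$'s above; the function name `cp_sample` and the NumPy setup are my own choices, not part of the talk.

```python
import numpy as np

def cp_sample(pi_target, n, rng=np.random.default_rng()):
    """Conditional Poisson sampling by Poisson rejection: generate independent
    Bernoulli(p_i) indicators until exactly n units are selected."""
    pi_target = np.asarray(pi_target, dtype=float)
    # Adjusted parameters: p_i/(1-p_i) proportional to
    # pi_i/(1-pi_i) * exp((1/2 - pi_i)/d), with d = sum pi_i(1-pi_i).
    # (The CP design only depends on the odds up to a common factor.)
    d = np.sum(pi_target * (1 - pi_target))
    odds = pi_target / (1 - pi_target) * np.exp((0.5 - pi_target) / d)
    p = odds / (1 + odds)
    while True:
        x = (rng.random(len(p)) < p).astype(int)   # independent Bernoulli draws
        if x.sum() == n:
            return x                               # binary N-vector with |x| = n
```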
Sampford sampling (Sampford 1967):
$$p(\mathbf{x}) \propto \prod_{i=1}^{N} \pi_i^{x_i}(1-\pi_i)^{1-x_i}\,\sum_{k=1}^{N}(1-\pi_k)x_k,\quad |\mathbf{x}|=n \qquad \Bigl(\sum_{i=1}^{N}\pi_i=n\Bigr).$$
We can sample by selecting one unit with replacement (WR) according to the probabilities $\pi_i/n$, and then $n-1$ further ones WR according to probabilities $p_i \propto \pi_i/(1-\pi_i)$ with sum 1. The full procedure is repeated until all the n units are distinct.
The inclusion probabilities are as desired. Not easy to prove.
Alternatively, we can sample n further units (instead of $n-1$) and repeat until all n + 1 units are distinct. The n further units constitute the sample.
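A hedged Python sketch of Sampford's rejective procedure as just described (function name and library use are mine):

```python
import numpy as np

def sampford_sample(pi, rng=np.random.default_rng()):
    """Sampford sampling: one WR draw with probabilities pi_i/n, then n-1
    WR draws with probabilities proportional to pi_i/(1-pi_i); restart
    unless all n drawn units are distinct."""
    pi = np.asarray(pi, dtype=float)
    N = len(pi)
    n = int(round(pi.sum()))                 # the pi_i are assumed to sum to n
    q = pi / (1 - pi)
    q = q / q.sum()                          # normalised second-stage probabilities
    while True:
        first = rng.choice(N, p=pi / pi.sum())   # first draw with probs pi_i/n
        rest = rng.choice(N, size=n - 1, p=q)    # n-1 further WR draws
        drawn = np.append(rest, first)
        if len(np.unique(drawn)) == n:           # all distinct: accept
            x = np.zeros(N, dtype=int)
            x[drawn] = 1
            return x
```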
Pareto sampling (Rosén 1997):
Let $U_i$, $i = 1,2,\ldots,N$, be independent random numbers from U(0,1) and let
$$Q_i=\frac{U_i/(1-U_i)}{p_i/(1-p_i)},\quad i=1,2,\ldots,N.$$
Choose as sample those n units that have the smallest ranking variables $Q_i$. The inclusion probabilities approximately equal the $p_i$'s if $\sum_{i=1}^{N} p_i = n$.
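For concreteness, a small Python sketch of this ranking-variable procedure (the names are mine):

```python
import numpy as np

def pareto_sample(p, n, rng=np.random.default_rng()):
    """Pareto sampling: take the n units with the smallest
    Q_i = [U_i/(1-U_i)] / [p_i/(1-p_i)], with U_i independent U(0,1)."""
    p = np.asarray(p, dtype=float)
    u = rng.random(len(p))
    q = (u / (1 - u)) / (p / (1 - p))        # ranking variables Q_i
    x = np.zeros(len(p), dtype=int)
    x[np.argsort(q)[:n]] = 1                 # n smallest Q_i form the sample
    return x
```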
The pf is of the form
$$p(\mathbf{x}) \propto \prod_{i=1}^{N} p_i^{x_i}(1-p_i)^{1-x_i}\,\sum_{k=1}^{N}c_k x_k,\quad |\mathbf{x}|=n.$$
Here $c_k \approx 1 - p_k$ approximately if $\sum_{i=1}^{N} p_i = n$. Hence, for $p_i = \pi_i$, Pareto sampling is approximate Sampford sampling.
1. Extended Sampford sampling, ESS (B & G 2011)
Let $p_i \in (0,1)$ be such that $\sum_{i=1}^{N} p_i = n + a$, where $a \in [0,1)$.
Now draw one unit WR according to the probabilities $p_i/(n+a)$, and then n further ones WR according to the probabilities $\propto p_i/(1-p_i)$. Repeat the full procedure until all n + 1 units are distinct.
Let $I_i = a$ for the first selected unit, $I_i = 1$ for the other n units, and $I_i = 0$ for the non-selected ones. The sample has the form
[0, 0, 1, 1, 0, a, 0, 1, 0, 0, 1, 1].
Remarkably, and not easy to prove, $E(I_i) = p_i$.
This is a generalization of the variant of Sampford's result.
Some other πps sampling methods also easily permit this type of extension, for example systematic πps sampling, pivotal sampling (Deville & Tillé 1998), and Pareto sampling.
For Pareto sampling, we let the unit corresponding to the order statistic $Q_{(n+1)}$ get $I_i = a$.
From [0, 0, 1, 1, 0, a, 0, 1, 0, 0, 1, 1], we can get a sample of traditional form by making an additional random experiment with chance a to succeed. The sample size is then random, n or n + 1.
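A tentative Python sketch of the ESS draw just described, together with the optional reduction to a sample of traditional form; the function names are mine and the routine assumes $\sum p_i = n + a$ with $a \in [0,1)$:

```python
import numpy as np

def extended_sampford(p, rng=np.random.default_rng()):
    """Extended Sampford sampling: returns I with one entry equal to a,
    n entries equal to 1, and the rest 0, where sum(p) = n + a."""
    p = np.asarray(p, dtype=float)
    total = p.sum()
    n = int(np.floor(total))
    a = total - n
    q = p / (1 - p)
    q = q / q.sum()                            # probabilities for the n further draws
    N = len(p)
    while True:
        first = rng.choice(N, p=p / total)     # one draw with probabilities p_i/(n+a)
        rest = rng.choice(N, size=n, p=q)      # n further WR draws
        if len(set(rest)) == n and first not in rest:   # all n+1 units distinct
            I = np.zeros(N)
            I[rest] = 1.0
            I[first] = a                       # the first selected unit gets a
            return I

def to_traditional_sample(I, rng=np.random.default_rng()):
    """Turn the unit with I_i = a into 1 with probability a, else 0;
    the sample size then becomes random, n or n + 1."""
    I = I.copy()
    frac = (I > 0) & (I < 1)
    I[frac] = (rng.random(frac.sum()) < I[frac]).astype(float)
    return I.astype(int)
```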
There are several potential applications of ESS.
a) We could use the estimator
$$\hat{Y}=\sum_{i=1}^{N}\frac{y_i I_i}{p_i},$$
where one $I_i = a$.
Of course, we also need $y_i$ for the unit with $I_i = a$, but if a is small, it could be better to just guess a $y_i$-value. The Sen-Yates-Grundy and the Hájek-Rosén variance estimators are still valid!
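As a small illustration (my own code, not from the talk), the point estimate from an ESS outcome is simply:

```python
import numpy as np

def ess_estimate_total(y, I, p):
    """Horvitz-Thompson-type estimate sum_i y_i I_i / p_i from an ESS outcome I
    (one entry equals a, the others are 0 or 1); y_i is only needed where I_i > 0."""
    y, I, p = (np.asarray(v, dtype=float) for v in (y, I, p))
    return float(np.sum(y * I / p))
```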
b) Sampford sampling with prescribed $\pi_i$'s with sum n is often time-consuming, and it may happen that no sample is provided.
However, we can select M units by SRSWOR from {1, 2, ..., N} and then pick an extended Sampford sample among these M units, of the type [0, 1, 1, a, 0, 1, 1, 0]. Thus for all the M units except one, the outcome (0 or 1) is decided. There remain N − M + 1 units for which the sampling outcome is not decided. We just repeat the procedure on these units. It is repeated until the outcome 0 or 1 is obtained for all units.
For M = 2 this method agrees with Deville & Tillé's (1998) pivotal method, for which the entropy is not so high. For M > 2 (not too small), the sample probabilities are very close to those of ordinary Sampford sampling. The samples are obtained very rapidly if M is not too large.
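A possible Python sketch of this repeated procedure, reusing `extended_sampford` from above. That the one undecided unit carries its value a as its updated working probability into the next round is my reading of the description, so treat this as an illustration rather than the exact B & G (2011) algorithm:

```python
import numpy as np

def repeated_ess(pi, M, rng=np.random.default_rng()):
    """Repeatedly run extended Sampford sampling on groups of (at most) M
    undecided units until every unit has outcome 0 or 1."""
    prob = np.asarray(pi, dtype=float).copy()   # current working probabilities
    N = len(prob)
    x = np.full(N, -1.0)                        # -1 = not yet decided
    undecided = list(range(N))
    while undecided:
        rng.shuffle(undecided)                  # SRSWOR of M units: first M after shuffling
        group = np.array(undecided[:M])
        I = extended_sampford(prob[group], rng)
        for unit, val in zip(group, I):
            if val < 1e-9 or val > 1 - 1e-9:    # decided (0/1 up to rounding noise)
                x[unit] = round(val)
            else:
                prob[unit] = val                # carry the unit on with probability a
        undecided = [u for u in undecided if x[u] < 0]
    return x.astype(int)
```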
c) Sampling in blocks (real-time sampling)
Units may come in blocks to the sampler (or the sampler may visit blocks of units). The inclusion probabilities may be proportional to some auxiliary variable and are determined in advance (common in forestry, for instance). These inclusion probabilities need not sum to integers for units in a block.
We may then use ESS for each block. For each block there remains exactly one unit for which the outcome is not yet decided. Then we may use just systematic πps sampling on these remaining units. Then the outcome can be decided immediately for all units in a block.
d) ESS can also be used to perform stratified πps sampling with several stratifications.
For the cross-strata the inclusion probabilities need not sum to integers. Deville & Tillé's (2004) cube method may replace the systematic πps sampling in (c). We get high entropy.
More details in B & G (2011).
2. Balanced (restricted) Pareto sampling (B 2010)
Pareto reminder: Let $Q_i = \dfrac{U_i/(1-U_i)}{p_i/(1-p_i)}$. Put $I_i = x_i = 1$ if $Q_i \le Q_{(n)}$.
Linear equivalent variant: Find, for $S_n = \{\mathbf{x};\ \sum_{i=1}^{N} x_i = n\}$,
$$\operatorname*{argmin}_{\mathbf{x}\in S_n}\ \sum_{i=1}^{N} x_i\Bigl(\log\frac{U_i}{1-U_i}-\log\frac{p_i}{1-p_i}\Bigr).$$
For $\sum_{i=1}^{N} p_i = n$, we can adjust by replacing $\log\dfrac{p_i}{1-p_i}$ by
$$\log\frac{p_i}{1-p_i}-\frac{p_i(1-p_i)(p_i-\tfrac12)}{d^2},\qquad \text{where } d=\sum_{i=1}^{N}p_i(1-p_i).$$
The inclusion probabilities $\pi_i$ are then very close to the $p_i$'s.
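A small Python sketch of Pareto sampling in this adjusted linear form (my own code; the sign and scaling of the correction term follow the formula above):

```python
import numpy as np

def adjusted_pareto_sample(p, n, rng=np.random.default_rng()):
    """Pareto sampling in linear form with adjusted log-odds: pick the n
    units minimising sum_i x_i (log(U_i/(1-U_i)) - b_i)."""
    p = np.asarray(p, dtype=float)
    u = rng.random(len(p))
    d = np.sum(p * (1 - p))
    # b_i = log(p_i/(1-p_i)) - p_i(1-p_i)(p_i - 1/2)/d^2
    b = np.log(p / (1 - p)) - p * (1 - p) * (p - 0.5) / d**2
    score = np.log(u / (1 - u)) - b
    x = np.zeros(len(p), dtype=int)
    x[np.argsort(score)[:n]] = 1             # n smallest scores minimise the sum
    return x
```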
Remark. There is another way of improving to get inclusion probabilities close to the $p_i$'s.
We just condition the $U_i$'s to have sum N/2. To perform the conditioning, we may just generate $U_1, U_2, \ldots, U_{N-1}$ until
$$\frac{N}{2}-1 \le \sum_{i=1}^{N-1}U_i \le \frac{N}{2}$$
(and then put $U_N = N/2 - \sum_{i=1}^{N-1}U_i$).