1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Transcript

Page 1: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Extended Poisson-Tweedie: properties andregression models for count data

Wagner H. Bonat∗ and Bent Jørgensen† and Celestin C. Kokonendji‡

and John Hinde§ and Clarice G. B. Demetrio¶

Abstract

We propose a new class of discrete generalized linear models based onthe class of Poisson-Tweedie factorial dispersion models with variance of theform µ + φµp, where µ is the mean, φ and p are the dispersion and Tweediepower parameters, respectively. The models are fitted by using an estimatingfunction approach obtained by combining the quasi-score and Pearson esti-mating functions for estimation of the regression and dispersion parameters,respectively. This provides a flexible and efficient regression methodology fora comprehensive family of count models including Hermite, Neyman Type A,Polya-Aeppli, negative binomial and Poisson-inverse Gaussian. The estimat-ing function approach allows us to extend the Poisson-Tweedie distributionsto deal with underdispersed count data by allowing negative values for thedispersion parameter φ. Furthermore, the Poisson-Tweedie family can auto-matically adapt to highly skewed count data with excessive zeros, without theneed to introduce zero-inflated or hurdle components, by the simple estimationof the power parameter. Thus, the proposed models offer a unified frameworkto deal with under, equi, overdispersed, zero-inflated and heavy-tailed countdata. The computational implementation of the proposed models is fast, re-lying only on a simple Newton scoring algorithm. Simulation studies showedthat the estimating function approach provides unbiased and consistent esti-mators for both regression and dispersion parameters. We highlight the ability

∗Department of Mathematics and Computer Science, University of Southern Denmark, Odense,Denmark. Department of Statistics, Parana Federal University, Curitiba, Parana, Brazil. E-mail:[email protected]†Department of Mathematics and Computer Science, University of Southern Denmark, Odense,

Denmark.‡Universite de Franche-Comte, Laboratoire de Mathematiques de Besancon, Besancon, France.§National University of Ireland, School of Mathematics, Galway, Ireland.¶Sao Paulo University, Sao Paulo, Brazil.

arX

iv:1

608.

0688

8v2

[st

at.M

Sep

2016

Page 2: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

of the Poisson-Tweedie distributions to deal with count data through a con-sideration of dispersion, zero-inflated and heavy tail indexes, and illustrate itsapplication with four data analyses. We provide an R implementation and thedata sets as supplementary materials.

1 Introduction

Generalized linear models (GLMs) (Nelder and Wedderburn; 1972) have been themain statistical tool for regression modelling of normal and non-normal data overthe past four decades. The success enjoyed by the GLM framework comes from itsability to deal with a wide range of normal and non-normal data. GLMs are fittedby a simple and efficient Newton score algorithm relying only on second-momentassumptions for estimation and inference. Furthermore, the theoretical backgroundfor GLMs is well established in the class of dispersion models (Jørgensen; 1987,1997) as a generalization of the exponential family of distributions. In particular,the Tweedie family of distributions plays an important role in the context of GLMs,since it encompasses many special cases including the normal, Poisson, non-centralgamma, gamma and inverse Gaussian.

In spite of the flexibility of the Tweedie family, the Poisson distribution is theonly choice for the analysis of count data in the context of GLMs. For this reason,in practice there is probably an over-emphasis on the use of the Poisson distributionfor count data. A well known limitation of the Poisson distribution is its meanand variance relationship, which implies that the variance equals the mean, referredto as equidispersion. In practice, however, count data can present other features,namely underdispersion (mean > variance) and overdispersion (mean < variance)that is often related to zero-inflation or a heavy tail. These departures can makethe Poisson distribution unsuitable, or at least of limited use, for the analysis ofcount data. The use of the Poisson distribution for non-equidispersed data maycause problems, because, in case of overdispersion, standard errors calculated underthe Poisson assumption are too optimistic and associated hypothesis tests will tendto give false positive results by incorrectly rejecting null hypotheses. The oppositesituation will appear in case of underdispersed data. In both cases, the Poissonmodel provides unreliable standard errors for the regression coefficients and hencepotentially misleading inferences.

The analysis of overdispersed count data has received much attention. Hinde andDemetrio (1998) discussed models and estimation algorithms for overdispersed data.Kokonendji et al. (2004, 2007) discussed the theoretical aspects of some discrete ex-ponential models, in particular, the Hinde-Demetrio and Poisson-Tweedie classes.

Page 3: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

El-Shaarawi et al. (2011) applied the Poisson-Tweedie family for modelling speciesabundance. Rigby et al. (2008) presented a general framework for modelling overdis-persed count data, including the Poisson-shifted generalized inverse Gaussian distri-bution. Rigby et al. (2008) also characterized many well known distributions, suchas the negative binomial, Poisson-inverse Gaussian, Sichel, Delaporte and Poisson-Tweedie as Poisson mixtures. In general, these models are computationally slow tofit to large data sets, their probability mass functions cannot be expressed explicitlyand they deal only with overdispersed count data. Further approaches include thenormalized tempered stable distribution (Kolossiatis et al.; 2011) and the tempereddiscrete Linnik distribution (Barabesi et al.; 2016).

The phenomenon of overdispersion is in general manifested through a heavy tailand/or zero-inflation. Zhu and Joe (2009) discussed the analysis of heavy-tailedcount data based on the Generalized Poisson-inverse Gaussian family. The problemof zero-inflation has been well discussed (Ridout et al.; 1998) and solved by includinghurdle or zero-inflation components (Zeileis et al.; 2008). These models are specifiedby two parts. The first part is a binary model for the dichotomous event of havingzero or count values, for which the logistic model is a frequent choice. Conditional ona count value, the second part assumes a discrete distribution, such as the Poissonor negative binomial (Loeys et al.; 2012), or zero-truncated versions for the hurdlemodel. While quite flexible, the two-part approach has the disadvantage of increasingthe model complexity by having an additional linear predictor to describe the excessof zeros.

The phenomenon of underdispersion seems less frequent in practical data anal-ysis, however, recently some authors have given attention towards the underdisper-sion phenomenon. Sellers and Shmueli (2010) presented a flexible regression modelbased on the COM-Poisson distribution that can deal with over and underdisperseddata. The COM-Poisson model has also recently been extended to deal with zero-inflation (Sellers and Raim; 2016). Zeviani et al. (2014) discussed the analysis ofunderdispersed experimental data based on the Gamma-Count distribution. Simi-larly, Kalktawi et al. (2015) proposed a discrete Weibull regression model to dealwith under and overdispersed count data. Although flexible, these approaches sharethe disadvantage that the probability mass function cannot be expressed explicitly,which implies that estimation and inference based on the likelihood function is diffi-cult and time consuming. Furthermore, the expectation is not known in closed-form,which makes these distributions unsuitable for regression modelling, where in gen-eral, we are interested in modelling the effects of covariates on a function of theexpectation of the response variable.

Given the plethora of available approaches to deal with count data in the lit-

Page 4: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

erature, it is difficult to decide, with conviction, which is the best approach for aparticular data set. The orthodox approach seems to be to take a small set of mod-els, such as the Poisson, negative binomial, Poisson-inverse Gaussian, zero-inflatedPoisson, zero-inflated negative binomial, etc, fit all of these models and then choosethe best fit by using some measures of goodness-of-fit, such as the Akaike or Bayesianinformation criteria. A typical example of this approach can be found in Oliveiraet al. (2016), where the authors compared the fit of eight different models for theanalysis of data sets related to ionizing radiation. Although reasonable, such an ap-proach is difficult to use in practical data analysis. The first problem is to define theset of models to be considered. Second, each count model can require specific fittingalgorithms and give its own set of fitting problems, in general due to bad behaviourof the likelihood function. Third, the choice of the best fit may not be obvious, withdifferent information criteria leading to different selected models. Finally, the un-certainty around the choice of distribution is not taken into account when choosingthe best fit. Thus, we claim that it is very useful and attractive to have a unifiedmodel that can automatically adapt to the underlying dispersion and that can beeasily implemented in practice.

The main goal of this article is to propose such a new class of count general-ized linear models based on the class of Poisson-Tweedie factorial dispersion mod-els (Jørgensen and Kokonendji; 2016) with variance of the form µ + φµp, where µis the mean, φ and p are the dispersion and Tweedie power parameters, respec-tively. The proposed class provides a unified framework to deal with over-, equi-,or underdispersed, zero-inflated, and heavy-tailed count data, with many potentialapplications.

As for GLMs, this new class relies only on second-moment assumptions forestimation and inference. The models are fitted by an estimating function ap-proach (Jørgensen and Knudsen; 2004; Bonat and Jørgensen; 2016), where the quasi-score and Pearson estimating functions are adopted for estimation of regression anddispersion parameters, respectively. The estimating function approach allows us toextend the Poisson-Tweedie distributions to deal with underdispersed count databy allowing negative values for the dispersion parameter φ. The Tweedie powerparameter plays an important role in the Poisson-Tweedie family, since it is an in-dex that distinguishes between important distributions, examples include Hermite(p = 0), Neyman Type A (p = 1), Polya-Aeppli (p = 1.5), negative binomial(p = 2) and Poisson-inverse Gaussian (p = 3). Furthermore, through the estimationof the Tweedie power parameter, the Poisson-Tweedie family automatically adaptsto highly skewed count data with excessive zeros, without the need to introducezero-inflated or hurdle components.

Page 5: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

The Poisson-Tweedie family of distributions and its properties are introduced inSection 2. In Section 3 we considered the estimating function approach for parame-ter estimation and inference. Section 4 presents the main results of two simulationstudies conducted to check the properties of the estimating function derived estima-tors and explore the flexibility of the extended Poisson-Tweedie models to deal withunderdispersed count data. The application of extended Poisson-Tweedie regressionmodels is illustrated in Section ??. Finally, discussions and directions for future workare given in Section 6. The R implementation and the data sets are available in thesupplementary material.

2 Poisson-Tweedie: properties and regression mod-

els

In this section, we derive the probability mass function and discuss the main prop-erties of the Poisson-Tweedie distributions. Furthermore, we propose the extendedPoisson-Tweedie regression model. The Poisson-Tweedie distributions are PoissonTweedie mixtures. Thus, our initial point is an exponential dispersion model of theform

fZ(z;µ, φ, p) = a(z, φ, p) exp(zψ − kp(ψ))/φ,

where µ = E(Z) = k′p(ψ) is the mean, φ > 0 is the dispersion parameter, ψ isthe canonical parameter and kp(ψ) is the cumulant function. The variance is givenby Var(Z) = φV (µ) where V (µ) = k′′p(ψ) is called the variance function. Tweediedensities are characterized by power variance functions of the form V (µ) = µp, wherep ∈ (−∞, 0]∪[1,∞) is the index determining the distribution. For a Tweedie randomvariable Z, we write Z ∼ Twp(µ, φ). The support of the distribution depends onthe value of the power parameter. For p ≥ 2, 1 < p < 2 and p = 0 the supportcorresponds to the positive, non-negative and real values, respectively. In thesecases µ ∈ Ω, where Ω is the convex support (i.e. the interior of the closed convexhull of the corresponding distribution support). Finally, for p < 0 the support againcorresponds to the real values, however the expectation µ is positive.

The function a(z, φ, p) cannot be written in a closed form, apart from the spe-cial cases corresponding to the Gaussian (p = 0), Poisson (φ = 1 and p = 1),non-central gamma (p = 3/2), gamma (p = 2) and inverse Gaussian (p = 3) dis-tributions (Jørgensen; 1997; Bonat and Kokonendji; 2016). Another important casecorresponds to the compound Poisson distributions, obtained when 1 < p < 2. Thecompound Poisson distribution is a frequent choice for the modelling of non-negative

Page 6: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

data that has a probability mass at zero and is highly right-skewed (Smyth andJørgensen; 2002; Andersen and Bonat; 2016).

The Poisson-Tweedie family is given by the following hierarchical specification

Y |Z ∼ Poisson(Z)

Z ∼ Twp(µ, φ).

Here, we require p ≥ 1, to ensure that Z is non-negative. In this case, the Poisson-Tweedie is an overdispersed factorial dispersion model (Jørgensen and Kokonendji;2016). The probability mass function for p > 1 is given by

f(y;µ, φ, p) =

∫ ∞0

zy exp−zy!

a(z, φ, p) exp(zψ − kp(ψ))/φdz. (1)

The integral (1) has no closed-form apart of the special case corresponding to thenegative binomial distribution, obtained when p = 2, i.e. a Poisson gamma mixture.For p = 1 the integral (1) is replaced by a sum and we have the Neyman Type Adistribution. Further special cases include the Hermite (p = 0), Poisson compoundPoisson (1 < p < 2), factorial discrete positive stable (p > 2) and Poisson-inverseGaussian (p = 3) distributions (Jørgensen and Kokonendji; 2016; Kokonendji et al.;2004).

Simulation from Poisson-Tweedie distributions is easy because of the availabilityof good simulation procedures for Tweedie distributions (Dunn; 2013). This alsomakes it easy to approximate the integral (1) using Monte Carlo integration, sincethe Tweedie family is a natural proposal distribution. Alternatively, we can evaluatethe integral using the Gauss-Laguerre method. Figure 1 presents the empirical prob-ability mass function for some Poisson-Tweedie distributions computed based on arandom sample of size 100, 000 (gray). Additionally, we display an approximation forthe probability mass function (black line) obtained by Monte Carlo integration. Weconsidered different values of the Tweedie power parameter (p = 1.1, 2, 3) combinedwith different values of the dispersion index (DI = 2, 5, 10, 20), which is defined by

DI = Var(Y )/E(Y ).

In all scenarios the expectation µ was fixed at 10.Figure 1 show that in the small dispersion case (DI = 2) the shape of the proba-

bility mass functions is quite similar for the different values of the power parameter.However, when the dispersion index increases the differences become more marked.For p = 1.1 the overdispersion is clearly attributable to zero-inflation, while for p = 3the overdispersion is due to the heavy tail. The negative binomial case (p = 2) is a

Page 7: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

0.00

0.04

0.08

DI = 2; p = 1.1

Mas

s fu

nctio

0 3 6 9 13 18 23 28 33

0.00

0.02

0.04

0.06

DI = 5; p = 1.1

Mas

s fu

nctio

0 6 13 21 29 37 45 53 62

0.00

0.10

0.20

0.30

DI = 10; p = 1.1

ass

func

tion

0 7 16 26 36 46 56 66 82

0.0

0.1

0.2

0.3

0.4

0.5

DI = 20; p = 1.1

Mas

s fu

nctio

0 12 26 40 54 68 82 96 112

0.00

0.04

0.08

DI = 2; p = 2

Mas

s fu

nctio

0 3 6 9 13 18 23 28 33

0.00

0.02

0.04

0.06

DI = 5; p = 2

Mas

s fu

nctio

0 6 13 21 29 37 45 53 61

0.00

0.02

0.04

0.06

0.08 DI = 10; p = 2

Mas

s fu

nctio

0 9 20 32 44 56 68 80 93

0.00

0.05

0.10

0.15

0.20

DI = 20; p = 2

Mas

s fu

nctio

0 20 44 68 92 119 151 182

0.00

0.04

0.08

DI = 2; p = 3

Mas

s fu

nctio

0 4 8 13 18 23 28 33 38 43

0.00

0.02

0.04

0.06

0.08

DI = 5; p = 3

Mas

s fu

nctio

0 8 17 27 37 47 57 67 77 88

0.00

0.02

0.04

0.06

0.08

DI = 10; p = 3

Mas

s fu

nctio

0 14 31 48 65 82 99 119 144

0.00

0.04

0.08

DI = 20; p = 3

Mas

s fu

nctio

0 29 62 95 133 179 222 291

Figure 1: Empirical (gray) and approximated (black) Poisson-Tweedie probabilitymass function by values of the dispersion index (DI) and Tweedie power parameter.

Page 8: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

critical point, where the distribution changes from zero-inflated to heavy-tailed. Theresults in Figure 1 also show that the Monte Carlo method provides a reasonable ap-proximation for the probability mass function for all Poisson-Tweedie distributions.

In order to further explore the flexibility of the Poisson-Tweedie distributions, weintroduce indices for zero-inflation

ZI = 1 +log P(Y = 0)

E(Y )

and a heavy tail

HT =P(Y = y + 1)

P(Y = y)for y →∞.

These indices are defined in relation to the Poisson distribution. The zero-inflatedindex is easily interpreted, since ZI < 0 indicates zero-deflation, ZI = 0 correspondsto no excess of zeroes, and ZI > 0 indicates zero-inflation. Similarly, HT → 1 wheny →∞ indicates a heavy tail distribution (for a Poisson distribution HT→ 0 wheny → ∞). Figure 2 presents the dispersion and zero-inflation indices as a functionof the expected values µ for different values of the dispersion and Tweedie powerparameters. The expected values are defined by µi = explog(10) + 0.8xi where xiis a sequence of length 100 from −1 to 1. We also present the heavy tail index forsome extreme values of the random variable. The dispersion parameter was fixedin order to have DI = 2, 5, 10 and 20 when the mean equals 10. We refer to thesedifferent cases as simulation scenarios 1 to 4, respectively.

The indices presented in Figure 2 show that for small values of the power parame-ter the Poisson-Tweedie distribution is suitable to deal with zero-inflated count data.In that case, the DI and ZI are almost not dependent on the values of the mean.However, the HT decreases as the mean increases. On the other hand, for largevalues of the power parameter the HT increases with increasing mean, showing thatthe model is specially suitable to deal with heavy-tailed count data. In this case, theDI and ZI increase quickly as the mean increases giving an extremely overdispersedmodel for large values of the mean. In general, the DI and ZI are larger than one andzero, respectively, which, of course, shows that the corresponding Poisson-Tweediedistributions cannot deal with underdispersed and zero-deflated count data.

In spite of the integral (1) having no closed-form, the first two moments (meanand variance) of the Poisson-Tweedie family can easily be obtained. This fact moti-vates us to specify a model by using only second-order moment assumptions. Con-sider a cross-sectional dataset, (yi,xi), i = 1, . . . , n, where yi’s are i.i.d. realizationsof Yi according to Yi ∼ PTwp(µi, φ) and g(µi) = ηi = x>i β, where xi and β are

Page 9: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

5 10 15 20

Scenario 1

510

1520

5 10 15 20

Scenario 2

1020

3040

5 10 15 20

Scenario 3

2040

6080

5 10 15 20

Scenario 4

Power parameter1.1 2 3

0.1

0.2

0.3

0.4

0.5

5 10 15 20

Scenario 1

0.3

0.4

0.5

0.6

0.7

5 10 15 20

Scenario 2

0.4

0.5

0.6

0.7

0.8

0.9

5 10 15 20

Scenario 3

0.5

0.6

0.7

0.8

0.9

5 10 15 20

Scenario 4

0.45

0.50

0.55

0.60

0.65

40 50 60 70 80 90 100

Scenario 1

0.70

0.75

0.80

0.85

40 50 60 70 80 90 100

Scenario 2

0.82

0.84

0.86

0.88

0.90

0.92

40 50 60 70 80 90 100

Scenario 3

0.90

0.91

0.92

0.93

0.94

0.95

0.96

40 50 60 70 80 90 100

Scenario 4

Figure 2: Dispersion (DI) and zero-inflation (ZI) indices as a function of µ by sim-ulation scenarios and Tweedie power parameter values. Heavy tail index (HT) forsome extreme values of the random variable Y by simulation scenarios and Tweediepower parameter values.

Page 10: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

(Q × 1) vectors of known covariates and unknown regression parameters, respec-tively. It is straightforward to show by using the factorial cumulant generatingfunction (Jørgensen and Kokonendji; 2016) that

E(Yi) = µi = g−1(x>i β)

Var(Yi) = Ci = µi + φµpi ,(2)

where g is a standard link function, for which here we adopt the logarithm linkfunction. The Poisson-Tweedie regression model is parametrized by θ = (β>,λ> =(φ, p)>)>. Note that, based on second-order moment assumptions, the only restric-tion to have a proper model is that Var(Yi) > 0, thus

φ > −µ(1−p)i ,

which shows that at least at some extent negative values for the dispersion parameterare allowed. Thus, the Poisson-Tweedie model can be extended to deal with under-dispersed count data, however, in doing so the associated probability mass functionsdo not exist.

5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

p = 1.1

−0.7

−0.5

−0.3

−0.1

5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

p = 2

−0.04

−0.02

−0.01−0.005

5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

p = 3

−0.002

−0.001

−0.0005

−0.0001

Figure 3: Dispersion index as a function of µ by dispersion and Tweedie powerparameter values.

Figure 3 presents the DI as a function of the mean for different values of theTweedie power parameter and negative values for the dispersion parameter. Asexpected for negative values of the dispersion parameter the DI gives values smallerthan 1, indicating underdispersion. We also note that, as the mean increases the DIdecreases slowly for small values of the Tweedie power parameter and faster for largervalues of the Tweedie power parameter. This shows that the range of negative valuesallowed for the dispersion parameter decreases rapidly as the value of the Tweedie

Page 11: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

power parameter increases. Thus, for underdispersed data, we expect small values forthe Tweedie power parameter. Furthermore, the second-order moment assumptionsalso allow us to eliminate the non-trivial restriction on the parameter space of theTweedie power parameter. This makes it possible to estimate values between 0 and1 where the corresponding Tweedie distribution does not exist. Table 1 presentsthe main special cases and the dominant features of the Poisson-Tweedie modelsaccording to the values of the dispersion and power parameters.

Table 1: Reference models and dominant features by dispersion and power parametervalues.Reference Model Dominant features Dispersion PowerPoisson Equi φ = 0 −Hermite Over, Under φ ≶ 0 p = 0Neyman Type A Over, Under, Zero-inflation φ ≶ 0 p = 1Poisson compound Poisson Over, Under, Zero-inflation φ ≶ 0 1 < p < 2Polya-Aeppli Over, Under, Zero-inflation φ ≶ 0 p = 1.5Negative binomial Over, Under φ ≶ 0 p = 2Poisson positive stable Over, heavy tail φ > 0 p > 2Poisson-inverse Gaussian Over, heavy tail φ > 0 p = 3

3 Estimation and Inference

We shall now introduce the estimating function approach using terminology andresults from Jørgensen and Knudsen (2004) and Bonat and Jørgensen (2016). Theestimating function approach adopted in this paper combines the quasi-score andPearson estimating functions for estimation of regression and dispersion parameters,respectively. The quasi-score function for β has the following form,

ψβ(β,λ) =

(n∑i=1

∂µi∂β1

C−1i (yi − µi)>, . . . ,n∑i=1

∂µi∂βQ

C−1i (yi − µi)>)>

where ∂µi/∂βj = µixij for j = 1, . . . , Q. The entry (j, k) of the Q × Q sensitivitymatrix for ψβ is given by

Sβjk = E

(∂

∂βkψβj(β,λ)

)= −

n∑i=1

µixijC−1i xikµi. (3)

Page 12: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

In a similar way, the entry (j, k) of the Q×Q variability matrix for ψβ is given by

Vβjk = Var(ψβ(β,λ)) =n∑i=1

µixijC−1i xikµi.

Following Jørgensen and Knudsen (2004); Bonat and Jørgensen (2016), the Pear-son estimating function for the dispersion parameters has the following form,

ψλ(λ,β) =

(n∑i=1

W iφ

[(yi − µi)2 − Ci

]>,

n∑i=1

W ip

[(yi − µi)2 − Ci

]>)>,

where W iφ = −∂C−1i /∂φ and W ip = −∂C−1i /∂p. The Pearson estimating functionsare unbiased estimating functions for λ based on the squared residuals (yi−µi)2 withexpected value Ci.

The entry (j, k) of the 2 × 2 sensitivity matrix for the dispersion parameters isgiven by

Sλjk= E

(∂

∂λkψλj(λ,β)

)= −

n∑i=1

W iλjCiW iλkCi, (4)

where λ1 and λ2 denote either φ or p.Similarly, the cross entries of the sensitivity matrix are given by

Sβjλk = E

(∂

∂λkψβj(β,λ)

)= 0 (5)

and

Sλjβk = E

(∂

∂βkψλj(λ,β)

)= −

n∑i=1

W iλjCiW iβkCi, (6)

where W iβk = −∂C−1i /∂βk. Finally, the joint sensitivity matrix for the parametervector θ is given by

Sθ =

(Sβ 0Sλβ Sλ

whose entries are defined by equations (3), (4), (5) and (6).We now calculate the asymptotic variance of the estimating function estimators

denoted by θ, as obtained from the inverse Godambe information matrix, whosegeneral form for a vector of parameter θ is J−1θ = S−1θ VθS−>θ , where −> denotesinverse transpose. The variability matrix for θ has the form

Vθ =

(Vβ Vβλ

Vλβ Vλ

), (7)

Page 13: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

where Vλβ = V>βλ and Vλ depend on the third and fourth moments of Yi, respectively.In order to avoid this dependence on higher-order moments, we propose to use theempirical versions of Vλ and Vλβ as given by

Vλjk =n∑i=1

ψλj(λ,β)iψλk(λ,β)i and Vλjβk =n∑i=1

ψλj(λ,β)iψβk(λ,β)i.

Finally, the asymptotic distribution of θ is given by

θ ∼ N(θ, J−1θ ), where J−1θ = S−1θ VθS−>θ .

To solve the system of equations ψβ = 0 and ψλ = 0 Jørgensen and Knudsen(2004) proposed the modified chaser algorithm, defined by

β(i+1) = β(i) − S−1β ψβ(β(i),λ(i))

λ(i+1) = λ(i) − αS−1λ ψλ(β(i+1),λ(i)).

The modified chaser algorithm uses the insensitivity property (5), which allows us touse two separate equations to update β and λ. We introduce the tuning constant,α, to control the step-length. A similar version of this algorithm was used by Bonatand Kokonendji (2016) for estimation and inference in the context of Tweedie regres-sion models. Furthermore, this algorithm is a special case of the flexible algorithmpresented by Bonat and Jørgensen (2016) in the context of multivariate covariancegeneralized linear models. Hence, estimation for the Poisson-Tweedie model is easilyimplemented in R through the mcglm (Bonat; 2016) package.

4 Simulation studies

In this section we present two simulation studies designed to explore the flexibilityof the extended Poisson-Tweedie models to deal with over and underdispersed countdata.

4.1 Fitting extended Poisson-Tweedie models to overdisperseddata

In this first simulation study we designed 12 simulation scenarios to explore theflexibility of the extended Poisson-Tweedie model to deal with overdispersed count

Page 14: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

data. For each setting, we considered four different sample sizes, 100, 250, 500and 1000, generating 1000 datasets in each case. We considered three values ofthe Tweedie power parameter, 1.1, 2 and 3, combined with four different degrees ofdispersion as measured by the dispersion index. In the case of p = 1.1, the dispersionparameter was fixed at φ = 0.8, 3.2, 7.2 and 15. Similarly, for p = 2 and p = 3 thedispersion parameter was fixed at φ = 0.1, 0.4, 0.9, 1.9 and φ = 0.01, 0.04, 0.09, 1.9,respectively. These values were chosen so that when the mean is 10 the dispersionindex takes values of 2, 5, 10 and 20, respectively. The probability mass function ofthe Poisson-Tweedie distribution for each parameter combination is as presented inFigure 1.

In order to have a regression model structure, we specified the mean vector asµi = explog(10) + 0.8x1i − 1x2i, where x1i is a sequence from −1 to 1 with lengthequals to the sample size. Similarly, the covariate x2i is a categorical covariate withtwo levels (0 and 1) and length equals sample size. Figure 4 shows the average biasplus and minus the average standard error for the parameters under each scenario.The scales are standardized for each parameter by dividing the average bias and thelimits of the confidence intervals by the standard error obtained for the sample ofsize 100.

The results in Figure 4 show that for all simulation scenarios both the averagebias and standard errors tend to 0 as the sample size is increased. This shows theconsistency and unbiasedness of the estimating function estimators. Figure 5 presentsthe confidence interval coverage rate by sample size and simulation scenarios.

The results presented in Figure 5 show that for the regression parameters theempirical coverage rates are close to the nominal level of 95% for all sample sizesand simulation scenarios. For the dispersion parameter and a small sample size theempirical coverage rates are slightly lower than the nominal level, however, theybecome closer for large samples. On the other hand, for the power parameter theempirical coverage rates were slightly larger than the nominal level, for all samplesizes and simulation scenarios.

4.2 Fitting extended Poisson-Tweedie models to underdis-persed data

As discussed in Section 2, the extended Poisson-Tweedie model can deal with under-dispersed count data by allowing negative values for the dispersion parameter. How-ever, in that case there is no probability mass function associated with the model.Consequently, it is impossible to use such a model to simulate underdispersed data.Thus, we simulated data sets from the COM-Poisson (Sellers and Shmueli; 2010) and

Page 15: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Standardized scale

β0

β1

β2

−1.0 −0.5 0.0 0.5 1.0

1.1;0.8

1.1;3.2

−1.0 −0.5 0.0 0.5 1.0

1.1;7.2

1.1;15

β0

β1

β2

2;0.1

2;0.4

2;0.9

2;1.9

β0

β1

β2

3;0.01−1.0 −0.5 0.0 0.5 1.0

3;0.04

3;0.09−1.0 −0.5 0.0 0.5 1.0

3;0.19

Sample size100 250 500 1000

Figure 4: Average bias and confidence intervals on a standardized scale by samplesize and simulation scenario.

Page 16: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Sample size

Cov

erag

e ra

0.850.900.951.00

100 250 500 1000

1.1;

0.8

100 250 500 1000

1.1;

3.2

0.850.900.951.00

1.1;

7.2

1.1;

0.850.900.951.00

2;0.

0.850.900.951.00

2;0.

2;1.

0.850.900.951.00

3;0.

0.850.900.951.00

3;0.

β0

3;0.

100 250 500 1000

β1

β2

100 250 500 1000

0.850.900.951.00

Figure 5: Coverage rate for each parameter by sample size and simulation scenarios.

Page 17: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Gamma-Count (Zeviani et al.; 2014) distributions. Such models are well known inthe literature for their ability to model underdispersed data.

Following the parametrization used by Sellers and Shmueli (2010), Y ∼ CP (λ, ν)denotes a COM-Poisson distributed random variable. Similarly, we write Y ∼GC(λ, ν) for a Gamma-Count distributed random variable. For both distributionsthe additional parameter ν controls the dispersion structure, with values larger than1 indicating underdispersed count data. An inconvenience of the COM-Poisson andGamma-Count regression models as proposed by Sellers and Shmueli (2010) and Ze-viani et al. (2014), respectively, is that the regression structure is not linked to afunction of E(Y ) as is usual in the generalized linear models framework. To over-come this limitation and obtain parameters that are interpretable in the usual way,i.e. related directly to a function of E(Y ), we take an alternative approach basedon simulation. The procedure consisted of specifying the λ parameter using a re-gression structure, λi = expλ0 + λ1x1 for i = 1, . . . , n where n denotes the samplesize and x1 is a sequence from −1 to 1 and length n. For each value of λ we sim-ulate 1000 values and compute the empirical mean and variance. We denote these

quantities by E(Y ) and var(Y ). Then, we fitted two non-linear models specified as

E(Y ) = exp(β0 + β1x1) and var(Y ) = E(Y ) +φE(Y )p

. From these fits, we obtainedthe expected values of the regression, dispersion and Tweedie power parameters.

We designed four simulation scenarios by introducing different degrees of under-dispersion in the data sets. The parameter ν was fixed at the values ν = 2, 4, 6 and8 for both distributions. In the COM-Poisson case we took λ0 = 8 and λ1 = 4 andfor the Gamma-Count case we fixed λ0 = 2 and λ1 = 1. It is important to highlightthat for all of these selected values the expected value of the dispersion parameterφ is negative. The particular values depend on λ0, λ1 and ν and are presented forboth distributions in Table 2.

For each setting, we generated 1000 data sets for four different sample sizes 100,250, 500 and 1000. The extended Poisson-Tweedie model was fitted using the esti-mating function approach presented in the Section 3. Figure 6 shows the average biasplus and minus the average standard error for the parameters in each scenario. Foreach parameter the scales are standardized by dividing the average bias and limitsof the confidence intervals by the standard error obtained for the sample of size 100.

The results in Figure 4 show that for all simulation scenarios, both the averagebias and standard errors tend to 0 as the sample size is increased for both dispersionand Tweedie power parameters. It shows the consistency of the estimating functionestimators. Concerning the regression parameters, in general the intercept (β0) isunderestimated, while the slope (β1) is overestimated. The bias is larger for theGamma-Count data with strong underdispersion (ν = 8) case. However, it is still

Page 18: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Standardized scale

β0

β1

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

M−

Poi

sson

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

β0

β1

ν = 2

Gam

ma−

Cou

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

ν = 4

ν = 6−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

ν = 8

Sample size100 250 500 1000

Figure 6: Average bias and confidence interval on a standardized scale by samplesize and simulation scenario.

Page 19: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Table 2: Corresponding values of β0, β1, φ and p depending on the values of λ0, λ1and ν for the COM-Poisson and Gamma-Count distributions.

COM-Poissonν λ0 λ1 β0 β1 φ p2 8 4 3.995 2.004 −0.485 1.0084 8 4 1.941 1.047 −0.714 1.0146 8 4 1.206 0.744 −0.790 1.0208 8 4 0.803 0.602 −0.821 1.036

Gamma-Count2 2 1 1.962 1.028 −0.429 1.0454 2 1 1.943 1.042 −0.682 1.0036 2 1 1.936 1.048 −0.779 1.0198 2 1 1.932 1.051 −0.820 1.020

small in its magnitude.

5 Data analyses

In this section we present four examples to illustrate the application of the ex-tended Poisson-Tweedie models. The data and the R scripts used for their analysiscan be obtainedhttp://www.leg.ufpr.br/doku.php/publications:papercompanions:ptw.

5.1 Data set 1: respiratory disease morbidity among chil-dren in Curitiba, Parana, Brazil

The first example concerns monthly morbidity from respiratory diseases among 0to 4 year old children in Curitiba, Parana State, Brazil. The data were collectedfor the period from January 1995 to December 2005, corresponding to 132 months.The main goal of the investigation was to assess the effect of three environmentalcovariates (precipitation, maximum and minimum temperatures) on the morbidityfrom respiratory diseases. Figure 7 presents a time series plot with fitted values (A)and dispersion diagrams of the monthly morbidity from respiratory diseases againstthe covariates precipitation (B), maximum temperature (C) and minimum temper-ature (D), with a simple linear fit indicated by the straight black lines. These plotsindicate a clear seasonal pattern and the essentially linear effect of all covariates (as

Page 20: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

100

200

300

400

500

Month/Year

Mor

bidi

01/95 06/95 11/95 04/96 09/96 02/97 07/97 12/97 05/98 10/98 03/99 08/99 01/00 06/00 11/00 04/01 09/01 02/02 07/02 12/02 05/03 10/03 03/04 08/04 01/05 06/05 11/05

0 100 200 300 400

100

200

300

400

500

Precipitation

Mor

bidi

18 20 22 24 26 28 30

100

200

300

400

500

Maxima

Mor

bidi

6 8 10 12 14 16 18

100

200

300

400

500

Minima

Mor

bidi

Figure 7: Time series plot with fitted values (A) and dispersion diagrams of themonthly morbidity by respiratory diseases against the covariates precipitation (B),maximum temperature (C) and minimum temperature (D), with a simple linear fitindicated by the straight black lines.

suggested by the simple linear fits superimposed in Figure 7). The linear predictoris expressed in terms of Fourier harmonics (seasonal variation) and the effect of thethree environmental covariates. The logarithm of the population size was used as anoffset. To compare the extended Poisson-Tweedie model with the usual Poisson log-linear model, Table 3 shows the corresponding estimates and standard errors (SE),along with the ratios between the both model estimates and standard errors.

The results presented in Table 3 show that the estimates from the extendedPoisson-Tweedie and Poisson models are similar. However, the standard errors fromthe extended Poisson-Tweedie model are in general 3.5 times larger than the onesfrom the Poisson model. This difference is explained by the dispersion structure. Thedispersion parameter φ > 0 indicates overdispersion, which implies that the standarderrors obtained by the Poisson model are underestimated. The Poisson model givesevidence of a significant effect for all covariates, while the Poisson-Tweedie modelonly gives significant effects for the seasonal variation and the temperature maxima

Page 21: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Table 3: Data set 1: Parameter estimates and standard errors (SE) for Poisson-Tweedie and Poisson models (first and second columns). Ratios between Poisson-Tweedie and Poisson estimates and standard errors (third column).

ParameterEstimates (SE)

Poisson-Tweedie Poisson RatioIntercept 2.277 (0.304)∗ 2.226 (0.084)∗ 1.023 (3.598)cos(2*pi*Month/12) −0.223 (0.056)∗ −0.226 (0.016)∗ 0.985 (3.507)sin(2*pi*Month/12) −0.093 (0.048)∗ −0.073 (0.013)∗ 1.279 (3.562)Maxima −0.083 (0.017)∗ −0.083 (0.005)∗ 1.057 (3.590)Minima 0.039 (0.022) 0.034 (0.006)∗ 1.128 (3.592)Precipitation −0.001 (0.000) −0.001 (0.000)∗ 0.978 (3.337)p 1.652 (0.423) − −φ 0.293 (0.036) − −

covariates. The fitted values and 95% confidence interval are shown in Figure 7(A).The model captures the swing in the data and highlights the seasonal behaviour withhigh and low morbidity numbers around winter and summer months, respectively.The negative effect of the covariate temperature maxima agrees with the seasonaleffects and the exploratory analysis presented in Figure 7(C). The power parameterestimate with its corresponding standard error indicate that all Poisson-Tweediemodels with p ∈ [1, 2] are suitable for this data set. In particular, Neyman Type A,Polya-Aeppli and negative binomial distributions can be good choices.

5.2 Data set 2: cotton bolls greenhouse experiment

The second example relates to cotton boll production and is from a completelyrandomized experiment conducted in a greenhouse. The aim was to assess the effectof five artificial defoliation levels (0%, 25%, 50%, 75% and 100%) and five growthstages (vegetative, flower-bud, blossom, fig and cotton boll) on the number of cottonbolls. There were five replicates of each treatment combination, giving a data setwith 125 observations. This data set was analysed in Zeviani et al. (2014) usingthe Gamma-Count distribution, since there was clear evidence of underdispersion.Following Zeviani et al. (2014), the linear predictor was specified by

g(µij) = β0 + β1jdefi + β2jdef2i ,

where µij is the expected number of cotton bolls for the defoliation (def) level i =1, . . . , 5 and growth stage j = 1, . . . , 5, that is, we have a second order effect of

Page 22: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

defoliation in each growth stage. Table 4 presents the estimates and standard errorsfor the Poisson-Tweedie and standard Poisson models along, with the ratios betweenthe respective estimates and standard errors.

Table 4: Data set 2: Parameter estimates and standard errors (SE) for Poisson-Tweedie and Poisson models (first and second columns). Ratios between Poisson-Tweedie and Poisson estimates and standard errors (third column).

ParameterEstimates (SE)

Poisson-Tweedie Poisson RatioIntercept 2.189 (0.030)∗ 2.190 (0.063)∗ 1.000 (0.471)vegetative:des 0.438 (0.243) 0.437 (0.516) 1.003 (0.471)vegetative:des2 −0.806 (0.274)∗ −0.805 (0.584) 1.001 (0.469)flower bud:des 0.292 (0.239) 0.290 (0.508) 1.007 (0.471)flower bud:des2 −0.490 (0.266) −0.488 (0.566) 1.004 (0.470)blossom:des −1.235 (0.281)∗ −1.242 (0.604)∗ 0.994 (0.465)blossom:des2 0.665 (0.316)∗ 0.673 (0.680) 0.989 (0.465)fig:des 0.380 (0.265) 0.365 (0.566) 1.040 (0.468)fig:des2 −1.330 (0.313)∗ −1.310 (0.673) 1.015 (0.465)boll:des 0.011 (0.237) 0.009 (0.504) 1.181 (0.471)boll:des2 −0.021 (0.260) −0.020 (0.553) 1.059 (0.471)p 0.981 (0.137) − −φ −0.810 (0.223) − −

The results in Table 4 show that the estimates are quite similar, however, thestandard errors obtained by the Poisson-Tweedie model are smaller than those fromthe Poisson model. This is explained by the negative estimate of the dispersion pa-rameter, which indicates underdispersion. The value of the power parameter is closeto 1 and explains the similarity of the regression parameter estimates. Appropriateestimation of the standard error is important for this data set, since the Poisson-Tweedie identifies the effect of the defoliation as significant for three of the fivegrowth stages, while the Poisson model only finds the defoliation effect as significantfor the blossom growth stage. Figure 8 presents the observed values and curves offitted values (Poisson in gray and Poisson-Tweedie in black) and confidence intervals(95%) as functions of the defoliation level for each growth stage and supports theabove conclusions.

The results from the Poisson-Tweedie model are consistent with those from theGamma-Count model, fitted by Zeviani et al. (2014), in that both methods indicateunderdispersion and significant effects of defoliation for the vegetative, blossom and

Page 23: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Artificial defoliation level

Num

ber o

f bol

ls p

rodu

ced

0.0 0.2 0.4 0.6 0.8 1.0

vegetative0.0 0.2 0.4 0.6 0.8 1.0

flower bud

0.0 0.2 0.4 0.6 0.8 1.0

blossom0.0 0.2 0.4 0.6 0.8 1.0

fig

0.0 0.2 0.4 0.6 0.8 1.0

cotton boll

Figure 8: Dispersion diagrams of observed values and curves of fitted values (Poisson-gray and Poisson-Tweedie-black) and confidence intervals (95%) as functions of thedefoliation level for each growth stage.

fig growth stages . However, it is important to note that the estimates obtained bythe Gamma-Count model fitted by Zeviani et al. (2014) are not directly comparablewith the ones obtained from the Poisson-Tweedie model, since the latter is modellingthe expectation, while the Gamma-Count distribution models the distribution of thetime between events.

5.3 Data set 3: radiation-induced chromosome aberrationcounts

In this example, we apply the extended Poisson-Tweedie model to describe the num-ber of chromosome aberrations in biological dosimetry. The dataset considered wasobtained after irradiating blood samples with five different doses between 0.1 and 1Gy of 2.1 MeV neutrons. In this case, the frequencies of dicentrics and centric ringsafter a culture of 72 hours are analysed. The dataset in Table 5 was first presentedby Heimers et al. (2006) and analysed by Oliveira et al. (2016) as an example ofzero-inflated data.

We fitted the extended Poisson-Tweedie and Poisson models with the linear pre-

Page 24: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Table 5: Frequency distributions of the number of dicentrics and centric rings bydose levels.

xiyij

0 1 2 3 4 5 6 70.1 2281 130 21 1 0 0 0 00.3 847 127 19 6 1 0 0 00.5 567 165 49 16 2 0 0 00.7 356 167 62 9 5 1 0 01 169 131 72 18 9 0 0 1

dictor specified as a quadratic dose model, i.e

g(µij) = β0 + β1dosei + β2dose2i .

Table 6 presents the estimates and standard errors for the Poisson-Tweedie andPoisson models, along with the ratios between the respective estimates and standarderrors.

Table 6: Data set 3: Parameter estimates and standard errors (SE) for Poisson-Tweedie and Poisson models (first and second columns). Ratios between Poisson-Tweedie and Poisson estimates and standard errors (third column).

ParameterEstimates (SE)

Poisson-Tweedie Poisson RatioIntercept −3.126 (0.106)∗ −3.125 (0.097)∗ 1.000 (1.098)dose 5.514 (0.408)∗ 5.508 (0.369)∗ 1.001 (1.104)dose2 −2.481 (0.342)∗ −2.476 (0.309)∗ 1.002 (1.107)p 1.085 (0.299) − −φ 0.249 (0.100) − −

Results in Table 6 show evidence of weak overdispersion that can be attributedto zero-inflation, since the estimate of the power parameter was close to 1, which inturn implies that the standard errors obtained from the Poisson-Tweedie model arearound 10% larger than those obtained from the Poisson model.

For this data set it is particularly easy to compute the log-likelihood value, sincewe have only a few unique observed counts and dose values. Thus, we can uselog-likelihood values to compare the fit of the Poisson-Tweedie model with the fitobtained by the zero-inflated Poisson and zero-inflated negative binomial models.The log-likelihood value of the Poisson-Tweedie model was −2950.605, while the

Page 25: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

maximised log-likelihood value of the zero-inflated Poisson and zero-inflated negativebinomial models were −2950.462 and −2950.531, respectively. Furthermore, themaximised log-likelihood value of the Poisson model was −2995.389. These resultsshow that the Poisson-Tweedie model can offer a very competitive fit, even withoutan additional linear predictor to describe the excess of zeroes. Furthermore, it isinteresting to note that in spite of the large difference in the log-likelihood values,the Poisson model provides the same interpretation in terms of the significance ofthe covariates as the Poisson-Tweedie model for this data set.

5.4 Data set 4: customers’ profile

The last example corresponds to a data set collected to investigate the customerprofile of a large company of household supplies. During a representative two-weekperiod, in-store surveys were conducted and addresses of customers were obtained.The addresses were then used to identify the metropolitan area census tracts inwhich the customers resident. At the end of the survey period, the total numberof customers who visited the store from each census tract within a 10-mile radiuswas determined and relevant demographic information for each tract was obtained.The data set was analysed in Neter et al. (1996) as an example of Poisson regressionmodel, since it is a classic example of equidispersed count data. Following Neteret al. (1996) we considered the covariates, number of housing units (nhu), averageincome in dollars (aid), average housing unit age in years (aha), distance to thenearest competitor in miles (dnc) and distance to store in miles (ds) for forming thelinear predictor.

For equidispersed data the estimation of the Tweedie power parameter is in gen-eral a difficult task. In this case, the dispersion parameter φ should be estimatedaround zero. Thus, we do not have enough information to distinguish between dif-ferent values of the Tweedie power parameter. Consequently, we can fix the Tweediepower parameter at any value and the corresponding fitted models should be verysimilar. To illustrate this idea, we fitted the extended Poisson-Tweedie model fixingthe Tweedie power parameter at the values 1, 2 and 3, corresponding to the NeymanType A (NTA), negative binomial (NB) and Poisson-inverse Gaussian (PIG) distri-butions, respectively. We also fitted the standard Poisson model for comparison, theestimates and standard errors (SE) are presented in Table 7.

The results presented in Table 7 show clearly that for all fitted models the disper-sion parameter does not differ from zero, which gives evidence of equidispersion. Theregression coefficients and the associated standard errors do not depend on the modelsand in particular do not depend on the power parameter value. This example shows

Page 26: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Table 7: Data set 4: Estimates and standard errors (SE) from different models.Parameter Poisson NTA NB PIGIntercept 2.942 (0.207)∗ 2.942 (0.194)∗ 2.937 (0.197)∗ 2.933 (0.203)∗

nhu 0.061 (0.014)∗ 0.061 (0.013)∗ 0.060 (0.013)∗ 0.060 (0.014)∗

aid −0.012 (0.002)∗ −0.012 (0.002)∗ −0.012 (0.002)∗ −0.012 (0.002)∗

aha −0.004 (0.002)∗ −0.004 (0.002)∗ −0.004 (0.002)∗ −0.004 (0.002)∗

dnc 0.168 (0.026)∗ 0.168 (0.024)∗ 0.165 (0.025)∗ 0.166 (0.025)∗

ds −0.129 (0.016)∗ −0.129 (0.015)∗ −0.127 (0.015)∗ −0.127 (0.016)∗

φ 0 −0.122(0.123) −0.008 (0.010) 0.000 (0.000)p − 1 2 3

that, although a more careful analysis is required, the extended Poisson-Tweediemodel can deal with equidispersed data. Furthermore, the estimation of the extradispersion parameter does not inflate the standard errors associated with the regres-sion coefficients. Thus, there is no loss of efficiency when using the Poisson-Tweediemodel for equidispersed count data.

6 Discussion

We presented a flexible statistical modelling framework to deal with count data.The models are based on the Poisson-Tweedie family of distributions that auto-matically adapts to overdispersed, zero-inflated and heavy-tailed count data. Fur-thermore, we adopted an estimating function approach for estimation and inferencebased only on second-order moment assumptions. Such a specification allows us toextend the Poisson-Tweedie model to deal with underdispersed count data by al-lowing negative values for the dispersion parameter. The main technical advantageof the second-order moment specification is the simplicity of the fitting algorithm,which amounts to finding the root of a set of non-linear equations. The Poisson-Tweedie family encompasses some of the most popular models for count data, such asthe Hermite, Neyman Type A, Polya-Aeppli, negative binomial and Poisson-inverseGaussian distributions. For this reason, the estimation of the power parameter playsan important role in the context of Poisson-Tweedie regression models, since it is anindex that distinguishes between these important distributions. Thus, the estimationof the power parameter can work as an automatic distribution selection.

We conducted a simulation study on the properties of the estimating function

Page 27: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

estimators. The results showed that in general the estimating function estimatorsare unbiased and consistent. We also evaluated the validity of the standard errorsobtained by the estimating function approach by computing the empirical coveragerate. The results showed that for the regression coefficients our estimators provide thespecified level of coverage for all simulation scenarios and sample sizes. Regarding thedispersion parameter, the results showed that for small samples the standard errorsare underestimated, however, the results improve for larger samples. On the otherhand, the standard errors associated with the power parameter are overestimatedfor all simulation scenarios and sample sizes. However, the coverage rate presentedvalues only slightly larger than the specified nominal level of 95%. It is importantto highlight that the under or overestimation of the dispersion and power parame-ters do not affect the estimates and standard errors associated with the regressioncoefficients. This is due to the insensitivity property, see equation (5). Furthermore,we demonstrated the flexibility of the extended Poisson-Tweedie model to deal withunderdispersed count data as generated by the COM-Poisson and Gamma-Countdistribution. It also shows that the model has a good level of robustness againstmodel misspecification.

Discussion of the efficiency of the estimating function estimators is difficult dueto the lack of a closed form for the Fisher information matrix. Bonat and Kokonendji(2016) showed in the context of Tweedie regression models that the quasi-score func-tion provides asymptotically efficient estimators for the regression parameters, thus asimilar result is expected for the Poisson-Tweedie regression model. Concerning thedispersion and power parameters, the fact that the sensitivity and variability matri-ces do not coincide indicates that the Pearson estimating functions are not optimum.Furthermore, the use of empirical third and fourth moments for the calculation of theGodambe information matrix must imply some efficiency loss. On the other hand,it again makes the model robust against misspecification.

We analysed four real data sets to explore and illustrate the flexibility of theextended Poisson-Tweedie model. Data set 1 presented a classical case of overdis-persion. This data set illustrated the most common problem when using the Poissonmodel for overdispersed count data, i.e. the strong underestimation of the standarderrors associated with the regression coefficients. The Poisson-Tweedie model auto-matically adapts to the dispersion in the data by the estimation of the dispersionparameter, while choosing the appropriate distribution in the Poisson-Tweedie fam-ily through the estimation of the power parameter. Furthermore, the uncertaintyaround the data distribution is taken into account and can be assessed based on thestandard errors associated with the power parameter. In particular, for this applica-tion the model shows that any distribution in the family of the Poisson compound

Page 28: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Poisson distributions (1 < p < 2) provides a suitable fit for the data set. Thus, weavoid the need to fit an array of models and the use of measures of goodness-of-fitto choose between them.

Data set 2 presents the less frequent case of underdispersion. In this case, theproblem is that the Poisson model overestimates the standard errors associated withthe regression coefficients. The negative value of the dispersion parameter obtainedby fitting the Poisson-Tweedie model to this data set indicates underdispersion.Thus, the model automatically corrects the standard errors for the regression coeffi-cients, giving standard errors that are smaller than those obtained from the Poissonmodel. The problem of zero-inflated count data was illustrated by the data set 3.In this example, we showed that, in general, zero-inflation introduces overdisper-sion and that the Poisson-Tweedie model can also adapt to zero-inflation providinga very competitive fit when compared with more orthodox approaches such as thezero-inflated Poisson and zero-inflated negative binomial models. Finally, data set 4illustrated the case of equidispersed count data. This case is particularly challengingfor the Poisson-Tweedie model since the dispersion parameter should be zero, whichimplies that any distribution in the family of Poisson-Tweedie distributions can pro-vide a suitable fit for the data. Thus, the estimation of the Tweedie power parameteris very difficult, because the estimating function associated with the Tweedie powerparameter is flat. In this case, our approach was to fit the model with the Tweediepower parameter fixed at the values 1, 2 and 3. We compared the fit of these threemodels with the fit of the Poisson model and, since we have equidispersed data, allmodels provide quite similar estimates and standard errors. Furthermore, all mod-els indicated that the dispersion parameter is not different from zero, which againindicates equidispersion. It is important to emphasize that the estimation of theadditional dispersion parameter does not inflate the standard errors associated withthe regression parameters.

There are many possible extensions to the basic model discussed in the presentpaper, including incorporating penalized splines and the use of regularization for highdimensional data, with important applications in genetics. There is also a need todevelop methods for model checking, such as residual analysis, leverage and outlierdetection. Finally, we can extend the model to deal with multivariate count data,with many potential applications for the analysis of longitudinal and spatial data.These extensions will form the basis of future work.

Page 29: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Acknowledgements

This paper is dedicated in honour and memory of Professor Bent Jørgensen. Thiswork was done while the first author was visiting the Laboratory of Mathematicsof Besancon, France and School of Mathematics of National University of Ireland,Galway, Ireland. The first author is supported by CAPES (Coordenacao de Aper-feicomento de Pessoal de Nıvel Superior), Brazil. The last author was partiallysupported by CNPq, a Brazilian Science Funding Agency.

References

Andersen, D. A. and Bonat, W. H. (2016). Double generalized linear compoundpoisson models to insurance claims data, ArXiv . to appear.

Barabesi, L., Becatti, C. and Marcheselli, M. (2016). The tempered discrete Linnikdistribution, ArXiv e-prints .

Bonat, W. H. (2016). mcglm: Multivariate covariance generalized linear models,http://git.leg.ufpr.br/wbonat/mcglm. R package version 0.3.0.

Bonat, W. H. and Jørgensen, B. (2016). Multivariate covariance generalized linearmodels, Journal of the Royal Statistical Society: Series C (Applied Statistics) . toappear.

Bonat, W. H. and Kokonendji, C. C. (2016). Flexible Tweedie regression models forcontinuous data, ArXiv . to appear.

Dunn, P. K. (2013). tweedie: Tweedie exponential family models. R package version2.1.7.

El-Shaarawi, A. H., Zhu, R. and Joe, H. (2011). Modelling species abundance usingthe Poisson-Tweedie family, Environmetrics 22(2): 152–164.

Heimers, A., Brede, H. J., Giesen, U. and Hoffmann, W. (2006). Chromosomeaberration analysis and the influence of mitotic delay after simulated partial-bodyexposure with high doses of sparsely and densely ionising radiation, Radiation andEnvironmental Biophysics 45(1): 45–54.

Hinde, J. and Demetrio, C. G. B. (1998). Overdispersion: Models and estimation,Computational Statistics & Data Analysis 27(2): 151–170.

Page 30: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Jørgensen, B. (1987). Exponential dispersion models, Journal of the Royal StatisticalSociety. Series B (Methodological) 49(2): 127–162.

Jørgensen, B. (1997). The Theory of Dispersion Models, Chapman & Hall, London.

Jørgensen, B. and Knudsen, S. J. (2004). Parameter orthogonality and bias adjust-ment for estimating functions, Scandinavian Journal of Statistics 31(1): 93–114.

Jørgensen, B. and Kokonendji, C. C. (2016). Discrete dispersion models and theirTweedie asymptotics, AStA Advances in Statistical Analysis 100(1): 43–78.

Kalktawi, H. S., Vinciotti, V. and Yu, K. (2015). A simple and adaptive dispersionregression model for count data, ArXiv e-prints .

Kokonendji, C. C., Demetrio, C. G. B. and Zocchi, S. S. (2007). On hinde-demetrioregression models for overdispersed count data, Statistical Methodology 4(3): 277–291.

Kokonendji, C. C., Dossou-Gbete, S. and Demetrio, C. G. B. (2004). Some dis-crete exponential dispersion models: Poisson-Tweedie and Hinde-Demetrio classes,Statistics and Operations Research Transactions 28(2): 201–214.

Kolossiatis, M., Griffin, J. and Steel, M. (2011). Modeling overdispersion with thenormalized tempered stable distribution, Computational Statistics & Data Analy-sis 55(7): 2288–2301.

Loeys, T., Moerkerke, B., De Smet, O. and Buysse, A. (2012). The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression., British Journal ofMathematical and Statistical Psychology 65(1): 163–180.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models, Journalof the Royal Statistical Society. Series A 135(3): 370–384.

Neter, J., Kutner, M. H., Nachtsheim, C. J. and Wasserman, W. (1996). AppliedLinear Statistical Models, Irwin, Chicago.

Oliveira, M., Einbeck, J., Higueras, M., Ainsbury, E., Puig, P. and Rothkamm, K.(2016). Zero-inflated regression models for radiation-induced chromosome aberra-tion data: A comparative study, Biometrical Journal 58(2): 259–279.

Ridout, M. S., Demetrio, C. G. B. and Hinde, J. P. (1998). Models for count datawith many zeros, Proceedings of the XIXth International Biometrics Conference,Cape Town, South Africa.

Page 31: Extended Poisson-Tweedie: properties and regression models ... › pdf › 1608.06888.pdf · regression models for count data Wagner H. Bonat and Bent J˝rgensenyand C elestin C.

Rigby, R. A., Stasinopoulos, D. M. and Akantziliotou, C. (2008). A framework formodelling overdispersed count data, including the Poisson-shifted generalized in-verse Gaussian distribution, Computational Statistics & Data Analysis 53(2): 381–393.

Sellers, K. F. and Raim, A. (2016). A flexible zero-inflated model to address datadispersion, Computational Statistics & Data Analysis 99: 68–80.

Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for count data,Ann. Appl. Stat. 4(2): 943–961.

Smyth, G. K. and Jørgensen, B. (2002). Fitting Tweedie’s compound Poisson modelto insurance claims data: Dispersion modelling, ASTIN Bulletin: The Journal ofthe International Actuarial Association 32(1): 143–157.

Zeileis, A., Kleiber, C. and Jackman, S. (2008). Regression models for count data inr, Journal of Statistical Software 27(1): 1–25.

Zeviani, W. M., Ribeiro Jr, P. J., Bonat, W. H., Shimakura, S. E. and Muniz, J. A.(2014). The gamma-count distribution in the analysis of experimental underdis-persed data, Journal of Applied Statistics 41(12): 2616–2626.

Zhu, R. and Joe, H. (2009). Modelling heavy-tailed count data using a generalisedPoisson-inverse gaussian family, Statistics & Probability Letters 79(15): 1695–1703.

Top Related