1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of...

253
1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    224
  • download

    2

Transcript of 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of...

Page 1: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

1

Discrete and Categorical Data

William N. EvansDepartment of

Economics/MPRCUniversity of Maryland

Page 2: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

2

Part I

Introduction

Page 3: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

3

Introduction

• Workhorse statistical model in social sciences is the multivariate regression model

• Ordinary least squares (OLS)• yi = β0 + x1i β1+ x2i β2+… xki βk+ εi

• yi = xi β + εi

Page 4: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

4

Linear model yi = + xi + i

and are “population” values – represent the true relationship between x and y

• Unfortunately – these values are unknown• The job of the researcher is to estimate

these values• Notice that if we differentiate y with

respect to x, we obtain• dy/dx =

Page 5: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

5

represents how much y will change for a fixed change in x– Increase in income for more education– Change in crime or bankruptcy when

slots are legalized– Increase in test score if you study more

Page 6: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

6

Put some concretenesson the problem

• State of Maryland budget problems – Drop in revenues– Expensive k-12 school spending initiatives

• Short-term solution – raise tax on cigarettes by 34 cents/pack

• Problem – a tax hike will reduce consumption of taxable product

• Question for state – as taxes are raised, how much will cigarette consumption fall?

Page 7: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

7

• Simple model: yi = + xi + i

• Suppose y is a state’s per capita consumption of cigarettes

• x represents taxes on cigarettes• Question – how much will y fall if x is

increased by 34 cents/pack?• Problem – many reasons why people

smoke – cost is but one of them –

Page 8: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

8

• Data– (Y) State per capita cigarette consumption for

the years 1980-1997– (X) tax (State + Federal) in real cents per pack– “Scatter plot” of the data– Negative covariance between variables

• When x>, more likely that y<• When x<, more likely that y>

• Goal: pick values of and that “best fit” the data– Define best fit in a moment

Page 9: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

9

Notation

• True model• yi = + xi + i

• We observe data points (yi,xi)• The parameters and are unknown• The actual error (i) is unknown

• Estimated model• (a,b) are estimates for the parameters (,)• ei is an estimate of i where• ei=yi-a-bxi

• How do you estimate a and b?

Page 10: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

10

Objective: Minimize sum of squared errors

• Min iei2 = i(yi – a – bxi)2

• Minimize sum of squared errors (SSE)• Treat (+) and (-) errors equally

– Over or under predict by “5” is the same magnitude of error

– “Quadratic form”– The optimal value for a and b are those

that make the 1st derivative equal zero– Functions reach min or max values when

derivatives are zero

Page 11: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

11

Cigarette Consumption and Taxes

0

50

100

150

200

250

300

0 20 40 60 80 100 120

Tax per pack (cents)

Per

cap

ita

pac

ks/y

ear

Page 12: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

12

Cigarette Consumption and Taxes

0

50

100

150

200

250

300

0 20 40 60 80 100 120

Tax per pack (cents)

Per

cap

ita

pac

ks/y

ear

Page 13: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

13

• The model has a lot of nice features– Statistical properties easy to establish– Optimal estimates easy to obtain– Parameter estimates are easy to

interpret– Model maximizes prediction

• If you minimize SSE you maximize R2

• The model does well as a first order approximation to lots of problems

Page 14: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

14

Discrete and Qualitative Data

• The OLS model work well when y is a continuous variable– Income, wages, test scores, weight, GDP

• Does not has as many nice properties when y is not continuous

• Example: doctor visits• Integer values• Low counts for most people• Mass of observations at zero

Page 15: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

15

Downside of forcing non-standard outcomes into OLS

world?• Can predict outside the allowable

range– e.g., negative MD visits

• Does not describe the data generating process well– e.g., mass of observations at zero

• Violates many properties of OLS– e.g. heteroskedasticity

Page 16: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

16

This talk

• Look at situations when the data generating process does not lend itself well to OLS models

• Mathematically describe the data generating process

• Show how we use different optimization procedure to obtain estimates

• Describe the statistical properties

Page 17: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

17

• Show how to interpret parameters• Illustrate how to estimate the models

with popular program STATA

Page 18: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

18

Types of data generating processes we will consider

• Dichotomous events (yes or no)– 1=yes, 0=no– Graduate high school? work? Are obese?

Smoke?

• Ordinal data– Self reported health (fair, poor, good,

excel)– Strongly disagree, disagree, agree,

strongly agree

Page 19: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

19

• Count data– Doctor visits, lost workdays, fatality

counts

• Duration data– Time to failure, time to death, time to

re-employment

Page 20: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

20

Recommended Textbooks

• Jeffrey Wooldridge, “Econometric analysis of cross sectional and panel data” – Lots of insight and mathematical/statistical

detail– Very good examples

• William Greene, “Econometric Analysis” – more topics– Somewhat dated examples

Page 21: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

21

Course web page

• www.bsos.umd.edu/econ/evans/jpsm.html

• Contains– These notes– All STATA programs and data sets– A couple of “Introduction to STATA” handouts– Links to some useful web sites

Page 22: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

22

STATA Resources Discrete Outcomes

• “Regression Models for Categorical Dependent Variables Using STATA”– J. Scott Long and Jeremy Freese

• Available for sale from STATA website for $52 (www.stata.com)

• Post-estimation subroutines that translate results– Do not need to buy the book to use the

subroutines

Page 23: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

23

• In STATA command line type• net search spost

• Will give you a list of available programs to download

• One is Spostado from http://ww

w.indiana.edu/~jslsoc/stata

• Click on the link and install the files

Page 24: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

24

Part II

A brief introduction to STATA

Page 25: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

25

STATA

• Very fast, convenient, well-documented, cheap and flexible statistical package

• Excellent for cross-section/panel data projects, not as great for time series but getting better

• Not as easy to manipulate large data sets from flat files as SAS

• I usually clean data in SAS, estimate models in STATA

Page 26: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

26

• Key characteristic of STATA– All data must be loaded into RAM– Computations are very fast– But, size of the project is limited by available

memory

• Results can be generated two different ways– Command line– Write a program, (*.do) then submit from the

command line

Page 27: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

27

Sample program to get you started

• cps87_or.do• Program gets you to the point where

can• Load data into memory• Construct new variables • Get simple statistics• Run a basic regression• Store the results on a disk

Page 28: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

28

Data (cps87_do.dta)

• Random sample of data from 1987 Current Population Survey outgoing rotation group

• Sample selection– Males– 21-64– Working 30+hours/week

• 19,906 observations

Page 29: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

29

Major caveat

• Hardest thing to learn/do: get data from some other source and get it into STATA data set

• We skip over that part• All the data sets are loaded into a

STATA data file that can be called by saying:

use data file name

Page 30: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

30

Housekeeping at the top of the program

• * this line defines the semicolon as the ;• * end of line delimiter;• # delimit ;

• * set memork for 10 meg;• set memory 10m;

• * write results to a log file;• * the replace options writes over old;• * log files;• log using cps87_or.log,replace;

• * open stata data set;• use c:\bill\stata\cps87_or;

• * list variables and labels in data set;• desc;

Page 31: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

31

• ------------------------------------------------------------------------------• > -• storage display value• variable name type format label variable label• ------------------------------------------------------------------------------• > -• age float %9.0g age in years• race float %9.0g 1=white, non-hisp, 2=place,• n.h, 3=hisp• educ float %9.0g years of education• unionm float %9.0g 1=union member, 2=otherwise• smsa float %9.0g 1=live in 19 largest smsa,• 2=other smsa, 3=non smsa• region float %9.0g 1=east, 2=midwest, 3=south,• 4=west• earnwke float %9.0g usual weekly earnings• ------------------------------------------------------------------------------

Page 32: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

32

Constructing new variables

• Use ‘gen’ command for generate new variables

• Syntax– gen new variable name=math statement

• Easily construct new variables via– Algebraic operations– Math/trig functions (ln, exp, etc.)– Logical operators (when true, =1, when false,

=0)

Page 33: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

33

From program• * generate new variables;• * lines 1-2 illustrate basic math functoins;• * lines 3-4 line illustrate logical operators;• * line 5 illustrate the OR statement;• * line 6 illustrates the AND statement;• * after you construct new variables, compress the

data again;• gen age2=age*age;• gen earnwkl=ln(earnwke);• gen union=unionm==1;• gen topcode=earnwke==999;• gen nonwhite=((race==2)|(race==3));• gen big_ne=((region==1)&(smsa==1));

Page 34: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

34

Getting basic statistics

• desc -- describes variables in the data set

• sum – gets summary statistics• tab – produces frequencies (tables)

of discrete variables

Page 35: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

35

From program• * get descriptive statistics;• sum;

• * get detailed descriptics for continuous variables;

• sum earnwke, detail;

• * get frequencies of discrete variables;• tabulate unionm;• tabulate race;

• * get two-way table of frequencies;• tabulate region smsa, row column cell;

Page 36: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

36

Results from sum• Variable | Obs Mean Std. Dev. Min Max• -------------+--------------------------------------------------------• age | 19906 37.96619 11.15348 21 64• race | 19906 1.199136 .525493 1 3• educ | 19906 13.16126 2.795234 0 18• unionm | 19906 1.769065 .4214418 1 2• smsa | 19906 1.908369 .7955814 1 3• -------------+--------------------------------------------------------

Page 37: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

37

Detailed summary• usual weekly earnings• -------------------------------------------------------------• Percentiles Smallest• 1% 128 60• 5% 178 60• 10% 210 60 Obs 19906• 25% 300 63 Sum of Wgt. 19906

• 50% 449 Mean 488.264• Largest Std. Dev. 236.4713• 75% 615 999• 90% 865 999 Variance 55918.7• 95% 999 999 Skewness .668646• 99% 999 999 Kurtosis 2.632356

Page 38: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

38

Results for tab

• 1=union |• member, |• 2=otherwise | Freq. Percent Cum.• ------------+-----------------------------------• 1 | 4,597 23.09 23.09• 2 | 15,309 76.91 100.00• ------------+-----------------------------------• Total | 19,906 100.00

Page 39: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

39

2x2 Table• 1=east, |• 2=midwest, | 1=live in 19 largest smsa,• 3=south, | 2=other smsa, 3=non smsa• 4=west | 1 2 3 | Total• -----------+---------------------------------+----------• 1 | 2,806 1,349 842 | 4,997 • | 56.15 27.00 16.85 | 100.00 • | 38.46 18.89 15.39 | 25.10 • | 14.10 6.78 4.23 | 25.10 • -----------+---------------------------------+----------• 2 | 1,501 1,742 1,592 | 4,835 • | 31.04 36.03 32.93 | 100.00 • | 20.58 24.40 29.10 | 24.29 • | 7.54 8.75 8.00 | 24.29 • -----------+---------------------------------+----------• 3 | 1,501 2,542 1,904 | 5,947 • | 25.24 42.74 32.02 | 100.00 • | 20.58 35.60 34.80 | 29.88 • | 7.54 12.77 9.56 | 29.88 • -----------+---------------------------------+----------• 4 | 1,487 1,507 1,133 | 4,127 • | 36.03 36.52 27.45 | 100.00 • | 20.38 21.11 20.71 | 20.73 • | 7.47 7.57 5.69 | 20.73 • -----------+---------------------------------+----------• Total | 7,295 7,140 5,471 | 19,906 • | 36.65 35.87 27.48 | 100.00 • | 100.00 100.00 100.00 | 100.00 • | 36.65 35.87 27.48 | 100.00

Page 40: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

40

Running a regression

• Syntaxreg dependent-variable independent-variables

• Example from program *run simple regression;

reg earnwkl age age2 educ nonwhite union;

Page 41: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

41

• Source | SS df MS Number of obs = 19906• -------------+------------------------------ F( 5, 19900) = 1775.70• Model | 1616.39963 5 323.279927 Prob > F = 0.0000• Residual | 3622.93905 19900 .182057239 R-squared = 0.3085• -------------+------------------------------ Adj R-squared = 0.3083• Total | 5239.33869 19905 .263217216 Root MSE = .42668

• ------------------------------------------------------------------------------• earnwkl | Coef. Std. Err. t P>|t| [95% Conf. Interval]• -------------+----------------------------------------------------------------• age | .0679808 .0020033 33.93 0.000 .0640542 .0719075• age2 | -.0006778 .0000245 -27.69 0.000 -.0007258 -.0006299• educ | .069219 .0011256 61.50 0.000 .0670127 .0714252• nonwhite | -.1716133 .0089118 -19.26 0.000 -.1890812 -.1541453• union | .1301547 .0072923 17.85 0.000 .1158613 .1444481• _cons | 3.630805 .0394126 92.12 0.000 3.553553 3.708057• ------------------------------------------------------------------------------

Page 42: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

42

Analysis of variance

• R2 = .3085– Variables explain 31% of the variation in

log weekly earnings

• F(5,19900)– Tests the hypothesis that all covariates

(except constant) are jointly zero

Page 43: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

43

Interpret results

• Y = β0 + β1Xi + εi

• dY/dX = β1

• But in this case Y=ln(W) where W weekly wages

• dln(W)/dX = (dW/W)/dX = β1

– Percentage change in wages given a change in x

Page 44: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

44

• For each additional year of education, wages increase by 6.9%

• Non whites earn 17.2% less than whites

• Union members earn 13% more than nonunion members

Page 45: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

45

Part III

Some notes about probability distributions

Page 46: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

46

Continuous Distributions

• Random variables with infinite number of possible values

• Examples -- units of measure (time, weight, distance)

• Many discrete outcomes can be treated as continuous, e.g., SAT scores

Page 47: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

47

How to describe a continuous random variable• The Probability Density Function (PDF)

• The PDF for a random variable x is defined as f(x), where

f(x) 0f(x)dx = 1

• Calculus review: The integral of a function gives the “area under the curve”

Page 48: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

48

0

1

2

3

4

5

y

ax

Graph of y=f(x)

y=f(x)

Page 49: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

49

Cumulative Distribution Function (CDF)

• Suppose x is a “measure” like distance or time

• 0 x

• We may be interested in the Pr(xa) ?

Page 50: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

50

CDF

What if we consider all values?

)Pr()()(0

axdxxfaFa

0

1)()Pr( dxxfx

Page 51: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

51

Properties of CDF

• Note that Pr(x b) + Pr(x>b) =1

• Pr(x>b) = 1 – Pr(x b)

• Many times, it is easier to work with compliments

Page 52: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

52

General notation for continuous distributions

• The PDF is described by lower case such as f(x)

• The CDF is defined as upper case such as F(a)

Page 53: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

53

Standard Normal Distribution

• Most frequently used continuous distribution

• Symmetric “bell-shaped” distribution• As we will show, the normal has

useful properties• Many variables we observe in the

real world look normally distributed.• Can translate normal into ‘standard

normal’

Page 54: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

54

Examples of variables that look normally distributed

• IQ scores• SAT scores• Heights of females• Log income• Average gestation (weeks of pregnancy)• As we will show in a few weeks – sample

means are normally distributed!!!

Page 55: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

55

Standard Normal Distribution

• PDF:

• For - z

f z e zz

( ) ( ) 1

2

1

22

Page 56: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

56

Notation

(z) is the standard normal PDF evaluated at z

[a] = Pr(z a)

P r( ) ( ) ( )z a z dz aa

Page 57: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

57

Standard Normal PDF

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

-3 -2 -1 0 1 2 3

Z

f(Z

)

Page 58: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

58

Standard Normal

• Notice that:– Normal is symmetric: (a) = (-a)– Normal is “unimodal”– Median=mean– Area under curve=1– Almost all area is between (-3,3)

• Evaluations of the CDF are done with– Statistical functions (excel, SAS, etc)– Tables

Page 59: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

59

Standard Normal CDF

• Pr(z -0.98) = [-0.98] = 0.1635

Area Under Standard Normal PDF

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

-3 -2 -1 0 1 2 3

Z

f(Z

)

Page 60: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

60

0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 Z0.0559 0.0571 0.0582 0.0594 0.0606 0.0618 0.0630 0.0643 0.0655 0.0668 -1.500.0681 0.0694 0.0708 0.0721 0.0735 0.0749 0.0764 0.0778 0.0793 0.0808 -1.400.0823 0.0838 0.0853 0.0869 0.0885 0.0901 0.0918 0.0934 0.0951 0.0968 -1.300.0985 0.1003 0.1020 0.1038 0.1056 0.1075 0.1093 0.1112 0.1131 0.1151 -1.200.1170 0.1190 0.1210 0.1230 0.1251 0.1271 0.1292 0.1314 0.1335 0.1357 -1.100.1379 0.1401 0.1423 0.1446 0.1469 0.1492 0.1515 0.1539 0.1562 0.1587 -1.000.1611 0.1635 0.1660 0.1685 0.1711 0.1736 0.1762 0.1788 0.1814 0.1841 -0.900.1867 0.1894 0.1922 0.1949 0.1977 0.2005 0.2033 0.2061 0.2090 0.2119 -0.800.2148 0.2177 0.2206 0.2236 0.2266 0.2296 0.2327 0.2358 0.2389 0.2420 -0.700.2451 0.2483 0.2514 0.2546 0.2578 0.2611 0.2643 0.2676 0.2709 0.2743 -0.600.2776 0.2810 0.2843 0.2877 0.2912 0.2946 0.2981 0.3015 0.3050 0.3085 -0.50

Page 61: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

61

• Pr(z 1.41) = [1.41] = 0.9207

Area Under Standard Normal PDF

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

-3 -2 -1 0 1 2 3

Z

f(Z

)

Page 62: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

62

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.72240.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.75490.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.78520.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.81330.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.83891.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.86211.10 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.88301.20 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.90151.30 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.91771.40 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.93191.50 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

Page 63: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

63

• Pr(x>1.17) = 1 – Pr(z 1.17) = 1- [1.17]

• = 1 – 0.8790 = 0.1210Area Under Standard Normal PDF

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

-3 -2 -1 0 1 2 3

Z

f(Z

)

Page 64: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

64

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.72240.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.75490.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.78520.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.81330.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.83891.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.86211.10 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.88301.20 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.90151.30 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.91771.40 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.93191.50 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

Page 65: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

65

• Pr(0.1 z 1.9)

= Pr(z 1.9) – Pr(z 0.1)

= (1.9) - (0.1) = 0.9713 - 0.5398

= 0.4315

Page 66: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

66

Area Under Standard Normal PDF

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

-3 -2 -1 0 1 2 3

Z

f(Z

)

Page 67: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

67

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090.00 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.53590.10 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.57530.20 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.61410.30 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.65170.40 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.68790.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.72240.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.75490.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.78520.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.81330.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.83891.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

Page 68: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

68

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.091.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.86211.10 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.88301.20 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.90151.30 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.91771.40 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.93191.50 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.94411.60 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.95451.70 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.96331.80 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.97061.90 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.97672.00 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

Page 69: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

69

Important Properties of Normal Distribution

• Pr(z A) = [A]• Pr(z > A) = 1 - [A]

• Pr(z - A) = [-A]• Pr(z > -A) = 1 - [-A] = [A]

Page 70: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

70

Section IV

Maximum likelihood estimation

Page 71: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

71

Maximum likelihood estimation

• Observe n independent outcomes, all drawn from the same distribution

• (y1, y2, y3….yn)

• yi is drawn from f(yi; θ) where θ is an unknown parameter for the PDF f

• Recall definition of indepedence. If a and b and independent, Prob(a and b) = Pr(a)Pr(B)

Page 72: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

72

• Because all the draws are independent, the probability these particular n values of Y would be drawn at random is called the ‘likelihood function’ and it equals

• L = Pr(y1)Pr(y2)…Pr(yn)

• L = f(y1; θ)f(y2; θ)…..f(y3; θ)

Page 73: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

73

• MLE: pick a value for θ that best represents the chance these n values of y would have been generated randomly

• To maximize L, maximize a monotonic function of L

• Recall ln(abcd)=ln(a)+ln(b)+ln(c)+ln(d)

Page 74: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

74

• Max L = ln(L) = ln[f(y1; θ)] +ln[f(y2; θ)] +

….. ln[f(yn; θ) = Σi ln[f(yi; θ)]

• Pick θ so that L is maximized

• dL/dθ = 0

Page 75: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

75

L

θθ1 θ2

Page 76: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

76

Example: Poisson

• Suppose y measures ‘counts’ such as doctor visits.

• yi is drawn from a Poisson distribution

• f(yi;λ) =e-λ λyi/yi! For λ>0

• E[yi]= Var[yi] = λ

Page 77: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

77

• Given n observations, (y1, y2, y3….yn)

• Pick value of λ that maximizes L

• Max L = Σi ln[f(yi; θ)] = Σi ln[e-λ λyi/yi!]

= Σi [– λ + yiln(λ) – ln(yi!)] = -n λ + ln(λ) Σi yi – Σi ln(yi!)

Page 78: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

78

• L = -n λ + ln(λ) Σi yi – Σi ln(yi!)

• dL/dθ = -n + (1/ λ )Σi yi = 0

• Solve for λ

• λ = Σi yi /n = = sample mean of y

Page 79: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

79

• In most cases however, cannot find a ‘closed form’ solution for the parameter in ln[f(yi; θ)]

• Must ‘search’ over all possible solutions

• How does the search work?• Start with candidate value of θ.• Calculate dL/dθ

Page 80: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

80

• If dL/dθ > 0, increasing θ will increase L so we increase θ some

• If dL/dθ < 0, decreasing θ will increase L so we decrease θ some

• Keep changing θ until dL/dθ = 0• How far you ‘step’ when you change

θ is determined by a number of different factors

Page 81: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

81

L

θθ1

dL/dθ > 0

Page 82: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

82

L

θθ3

dL/dθ < 0

Page 83: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

83

Properties of MLE estimates

• Sometimes call efficient estimation. Can never generate a smaller variance than one obtained by MLE

• Parameters estimates are distributed as a normal distribution when samples sizes are large

• Therefore, if we divide the parameter by its standard error, should be normally distributed with a mean zero and variance 1 if the null (=0) is correct

Page 84: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

84

Section 5

Dichotomous outcomes

Page 85: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

85

Dichotomous Data

• Suppose data is discrete but there are only 2 outcomes

• Examples– Graduate high school or not– Patient dies or not– Working or not– Smoker or not

• In data, yi=1 if yes, yi =0 if no

Page 86: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

86

How to model the data generating process?

• There are only two outcomes• Research question: What factors

impact whether the event occurs?• To answer, will model the probability

the outcome occurs• Pr(Yi=1) when yi=1 or

• Pr(Yi=0) = 1- Pr(Yi=1) when yi=0

Page 87: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

87

• Think of the problem from a MLE perspective

• Likelihood for i’th observation

• Li= Pr(Yi=1)Yi [1 - Pr(Yi=1)](1-Yi)

• When yi=1, only relevant part is Pr(Yi=1)

• When yi=0, only relevant part is [1 - Pr(Yi=1)]

Page 88: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

88

• L = Σi ln[Li] =

= Σi {yi ln[Pr(yi=1)] + (1-yi)ln[Pr(yi=0)] }

• Notice that up to this point, the model is generic. The log likelihood function will determined by the assumptions concerning how we determine Pr(yi=1)

Page 89: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

89

Modeling the probability

• There is some process (biological, social, decision theoretic, etc) that determines the outcome y

• Some of the variables impacting are observed, some are not

• Requires that we model how these factors impact the probabilities

• Model from a ‘latent variable’ perspective

Page 90: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

90

• Consider a women’s decision to work• yi* = the person’s net benefit to work• Two components of yi*

– Characteristics that we can measure• Education, age, income of spouse, prices of

child care

– Some we cannot measure• How much you like spending time with your

kids• how much you like/hate your job

Page 91: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

91

• We aggregate these two components into one equation• yi* = β0 + x1i β1+ x2i β2+… xki βk+ εi

= xi β + εi

• xi β (measurable characteristics but with uncertain weights)• εi random unmeasured characteristics

• Decision rule: person will work if yi* > 0 (if net benefits are positive)

yi=1 if yi*>0

yi=0 if yi*≤0

Page 92: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

92

• yi=1 if yi*>0• yi* = xi β + εi > 0 only if

• εi > - xi β

• yi=0 if yi*≤0

• yi* = xi β + εi ≤ 0 only if

• εi ≤ - xi β

Page 93: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

93

How to interpret ε?

• When we look at certain people, we have expectations about whether y should equal 1 or 0

• These expectations do not always hold true

• The error ε represents deviations from what we expect

• Go back to the work example, suppose xi β is ‘big.’ We observe a woman with:– High wages– Low husband’s income– Low cost of child care

Page 94: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

94

• We would expect this person to work, UNLESS, there is some unmeasured variable that counteracts this

• For example: – Suppose a mom really likes spending

time with her kids, or she hates her job.– The unmeasured benefit of working is

then a big negative coefficient εi

Page 95: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

95

• If we observe them working, there are a certain range of values that εi must have been in excess of

• yi=1 if εi > - xi β

• If we observe someone not working, then Consider the opposite. Suppose we observe someone NOT working.

• Then εi must not have been big or it was a bigger negative number, since

• yi=0 if εi ≤ - xi β

Page 96: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

96

The Probabilities

• The estimation procedure used is determined by the assumed distribution of ε

• What is the probability we observe someone with y=1?– Use definition of the CDF

– Pr(yi=1) = Pr(yi*>0) = Pr(εi > - xi β)

= 1 – F(-xi β)

Page 97: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

97

• What is the probability we observe someone with y=0?– Use definition of the CDF– Pr(yi=0) = Pr(yi

*≤ 0) = Pr(εi ≤ - xi β) = F(-xi β)

• Two standard models: ε is either – normal or– logistic

Page 98: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

98

Normal (probit) Model

• ε is distributed as a standard normal – Mean zero– Variance 1

• Evaluate probability (y=1)– Pr(yi=1) = Pr(εi > - xi β) = 1 – Ф(-xi β)– Given symmetry: 1 – Ф(-xi β) = Ф(xi β)

• Evaluate probability (y=0)– Pr(yi=0) = Pr(εi ≤ - xi β) = Ф(-xi β)– Given symmetry: Ф(-xi β) = 1 - Ф(xi β)

Page 99: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

99

• Summary– Pr(yi=1) = Ф(xi β)

– Pr(yi=0) = 1 -Ф(xi β)

• Notice that Ф(a) is increasing a. Therefore, is one of the x’s increases the probability of observing y, we would expect the coefficient on that variable to be (+)

Page 100: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

100

• The standard normal assumption (variance=1) is not critical

• In practice, the variance may be not equal t 1, but given the math of the problem, we cannot identify the variance. It is absorbed into parameter estimates

Page 101: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

101

Logit

• CDF: F(a) = exp(a)/(1+exp(a))– Symmetric, unimodal distribution– Looks a lot like the normal– Incredibly easy to evaluate the CDF and PDF– Mean of zero, variance > 1 (more variance

than normal)

• Evaluate probability (y=1)– Pr(yi=1) = Pr(εi > - xi β) = 1 – F(-xi β)

– Given symmetry: 1 – F(-xi β) = F(xi β)

– F(xi β) = exp(xi β)/(1+exp(xi β))

Page 102: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

102

• Evaluate probability (y=0)– Pr(yi=0) = Pr(εi ≤ - xi β) = F(-xi β)

– Given symmetry: F(-xi β) = 1 - F(xi β)

– 1 - F(xi β) = 1 /(1+exp(xi β))

• When εi is a logistic distribution

– Pr(yi =1) = exp(xi β)/(1+exp(xi β))

– Pr(yi=0) = 1/(1+exp(xi β))

Page 103: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

103

Example: Workplace smoking bans

• Smoking supplements to 1991 and 1993 National Health Interview Survey

• Asked all respondents whether they currently smoke

• Asked workers about workplace tobacco policies

• Sample: workers• Key variables: current smoking and

whether they faced by workplace ban

Page 104: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

104

• Data: workplace1.dta• Sample program: workplace1.doc• Results: workplace1.log

Page 105: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

105

Description of variables in data• . desc;

• storage display value• variable name type format label variable label• ------------------------------------------------------------------------• > -• smoker byte %9.0g is current smoking• worka byte %9.0g has workplace smoking bans• age byte %9.0g age in years• male byte %9.0g male• black byte %9.0g black• hispanic byte %9.0g hispanic• incomel float %9.0g log income• hsgrad byte %9.0g is hs graduate• somecol byte %9.0g has some college• college float %9.0g • -----------------------------------------------------------------------

Page 106: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

106

Summary statistics• sum;

• Variable | Obs Mean Std. Dev. Min Max• -------------+--------------------------------------------------------• smoker | 16258 .25163 .433963 0 1• worka | 16258 .6851396 .4644745 0 1• age | 16258 38.54742 11.96189 18 87• male | 16258 .3947595 .488814 0 1• black | 16258 .1119449 .3153083 0 1• -------------+--------------------------------------------------------• hispanic | 16258 .0607086 .2388023 0 1• incomel | 16258 10.42097 .7624525 6.214608 11.22524• hsgrad | 16258 .3355271 .4721889 0 1• somecol | 16258 .2685447 .4432161 0 1• college | 16258 .3293763 .4700012 0 1

Page 107: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

107

Running a probit

• probit smoker age incomel male black hispanic hsgrad somecol college worka;

• The first variable after ‘probit’ is the discrete outcome, the rest of the variables are the independent variables

• Includes a constant as a default

Page 108: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

108

Running a logit

• logit smoker age incomel male black hispanic hsgrad somecol college worka;

• Same as probit, just change the first word

Page 109: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

109

Running linear probability

• reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;

• Simple regression. • Standard errors are incorrect

(heteroskedasticity)• robust option produces standard

errors with arbitrary form of heteroskedasticity

Page 110: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

110

Probit Results• Probit estimates Number of obs = 16258• LR chi2(9) = 819.44• Prob > chi2 = 0.0000• Log likelihood = -8761.7208 Pseudo R2 = 0.0447

• ------------------------------------------------------------------------------• smoker | Coef. Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• age | -.0012684 .0009316 -1.36 0.173 -.0030943 .0005574• incomel | -.092812 .0151496 -6.13 0.000 -.1225047 -.0631193• male | .0533213 .0229297 2.33 0.020 .0083799 .0982627• black | -.1060518 .034918 -3.04 0.002 -.17449 -.0376137• hispanic | -.2281468 .0475128 -4.80 0.000 -.3212701 -.1350235• hsgrad | -.1748765 .0436392 -4.01 0.000 -.2604078 -.0893453• somecol | -.363869 .0451757 -8.05 0.000 -.4524118 -.2753262• college | -.7689528 .0466418 -16.49 0.000 -.860369 -.6775366• worka | -.2093287 .0231425 -9.05 0.000 -.2546873 -.1639702• _cons | .870543 .154056 5.65 0.000 .5685989 1.172487• ------------------------------------------------------------------------------

Page 111: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

111

How to measure fit?

• Regression (OLS) – minimize sum of squared errors– Or, maximize R2

– The model is designed to maximize predictive capacity

• Not the case with Probit/Logit– MLE models pick distribution parameters so as

best describe the data generating process– May or may not ‘predict’ the outcome well

Page 112: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

112

Pseudo R2

• LLk log likelihood with all variables• LL1 log likelihood with only a constant• 0 > LLk > LL1 so | LLk | < |LL1|

• Pseudo R2 = 1 - |LL1/LLk| • Bounded between 0-1• Not anything like an R2 from a

regression

Page 113: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

113

Predicting Y

• Let b be the estimated value of β

• For any candidate vector of xi , we can predict probabilities, Pi

• Pi = Ф(xib)

• Once you have Pi, pick a threshold value, T, so that you predict

• Yp = 1 if Pi > T

• Yp = 0 if Pi ≤ T

• Then compare, fraction correctly predicted

Page 114: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

114

• Question: what value to pick for T?• Can pick .5

– Intuitive. More likely to engage in the activity than to not engage in it

– However, when the is small, this criteria does a poor job of predicting Yi=1

– However, when the is close to 1, this criteria does a poor job of picking Yi=0

Page 115: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

115

• *predict probability of smoking;• predict pred_prob_smoke;• * get detailed descriptive data about predicted

prob;• sum pred_prob, detail;

• * predict binary outcome with 50% cutoff;• gen pred_smoke1=pred_prob_smoke>=.5;• label variable pred_smoke1 "predicted smoking, 50%

cutoff";

• * compare actual values;• tab smoker pred_smoke1, row col cell;

Page 116: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

116

• . sum pred_prob, detail;

• Pr(smoker)• -------------------------------------------------------------• Percentiles Smallest• 1% .0959301 .0615221• 5% .1155022 .0622963• 10% .1237434 .0633929 Obs 16258• 25% .1620851 .0733495 Sum of Wgt. 16258

• 50% .2569962 Mean .2516653• Largest Std. Dev. .0960007• 75% .3187975 .5619798• 90% .3795704 .5655878 Variance .0092161• 95% .4039573 .5684112 Skewness .1520254• 99% .4672697 .6203823 Kurtosis 2.149247

Page 117: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

117

• Notice two things– Sample mean of the predicted

probabilities is close to the sample mean outcome

– 99% of the probabilities are less than .5– Should predict few smokers if use a 50%

cutoff

Page 118: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

118

• | predicted smoking,• is current | 50% cutoff• smoking | 0 1 | Total• -----------+----------------------+----------• 0 | 12,153 14 | 12,167 • | 99.88 0.12 | 100.00 • | 74.93 35.90 | 74.84 • | 74.75 0.09 | 74.84 • -----------+----------------------+----------• 1 | 4,066 25 | 4,091 • | 99.39 0.61 | 100.00 • | 25.07 64.10 | 25.16 • | 25.01 0.15 | 25.16 • -----------+----------------------+----------• Total | 16,219 39 | 16,258 • | 99.76 0.24 | 100.00 • | 100.00 100.00 | 100.00 • | 99.76 0.24 | 100.00

Page 119: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

119

• Check on-diagonal elements. • The last number in each 2x2 element

is the fraction in the cell • The model correctly predicts 74.75 +

0.15 = 74.90% of the obs• It only predicts a small fraction of

smokers

Page 120: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

120

• Do not be amazed by the 75% percent correct prediction

• If you said everyone has a chance of smoking (a case of no covariates), you would be correct Max[(,(1-)] percent of the time

Page 121: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

121

• In this case, 25.16% smoke. • If everyone had the same chance of

smoking, we would assign everyone Pr(y=1) = .2516

• We would be correct for the 1 - .2516 = 0.7484 people who do not smoke

Page 122: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

122

Key points about prediction

• MLE models are not designed to maximize prediction

• Should not be surprised they do not predict well

• In this case, not particularly good measures of predictive capacity

Page 123: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

123

Translating coefficients in probit:

Continuous Covariates• Pr(yi=1) = Φ[β0 + x1i β1+ x2i β2+… xki βk]

• Suppose that x1i is a continuous variable

• d Pr(yi=1) /d x1i = ?

• What is the change in the probability of an event give a change in x1i?

Page 124: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

124

Marginal Effect

• d Pr(yi=1) /d x1i

• = β1 φ[β0 + x1i β1+ x2i β2+… xki βk]

• Notice two things. Marginal effect is a function of the other parameters and the values of x.

Page 125: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

125

Translating Coefficients:Discrete Covariates

• Pr(yi=1) = Φ[β0 + x1i β1+ x2i β2+… xki βk]

• Suppose that x2i is a dummy variable (1 if yes, 0 if no)

• Marginal effect makes no sense, cannot change x2i by a little amount. It is either 1 or 0.

• Redefine the variable of interest. Compare outcomes with and without x2i

Page 126: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

126

• y1 = Pr(yi=1 | x2i=1)

= Φ[β0 + x1iβ1+ β2 + x3iβ3 +… ]

• y0 = Pr(yi=1 | x2i=0)

= Φ[β0 + x1iβ1+ x3iβ3 … ]

Marginal effect = y1 – y0.

Difference in probabilities with and without x2i?

Page 127: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

127

In STATA

• Marginal effects for continuous variables, and Change in probabilities for dichotomous outcomes, STATA picks sample means for X’s

Page 128: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

128

STATA command for Marginal Effects

• mfx compute;

• Must come after the outcome when estimates are still active in program.

Page 129: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

129

• Marginal effects after probit• y = Pr(smoker) (predict)• = .24093439• ------------------------------------------------------------------------------• variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X• ---------+--------------------------------------------------------------------• age | -.0003951 .00029 -1.36 0.173 -.000964 .000174 38.5474• incomel | -.0289139 .00472 -6.13 0.000 -.03816 -.019668 10.421• male*| .0166757 .0072 2.32 0.021 .002568 .030783 .39476• black*| -.0320621 .01023 -3.13 0.002 -.052111 -.012013 .111945• hispanic*| -.0658551 .01259 -5.23 0.000 -.090536 -.041174 .060709• hsgrad*| -.053335 .01302 -4.10 0.000 -.07885 -.02782 .335527• somecol*| -.1062358 .01228 -8.65 0.000 -.130308 -.082164 .268545• college*| -.2149199 .01146 -18.76 0.000 -.237378 -.192462 .329376• worka*| -.0668959 .00756 -8.84 0.000 -.08172 -.052072 .68514• ------------------------------------------------------------------------------• (*) dy/dx is for discrete change of dummy variable from 0 to 1

Page 130: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

130

Interpret results

• 10% increase in income will reduce smoking by 2.9 percentage points

• 10 year increase in age will decrease smoking rates .4 percentage points

• Those with a college degree are 21.5 percentage points less likely to smoke

• Those that face a workplace smoking ban have 6.7 percentage point lower probability of smoking

Page 131: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

131

• Do not confuse percentage point and percent differences– A 6.7 percentage point drop is 29% of

the sample mean of 24 percent.– Blacks have smoking rates that are 3.2

percentage points lower than others, which is 13 percent of the sample mean

Page 132: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

132

Comparing Marginal Effects

Variable LP Probit Logit

age -0.00040 -0.00048 -0.00048

incomel -0.0289 -0.0287 -0.0276

male 0.0167 0.0168 0.0172

Black -0.0321 -0.0357 -0.0342

hispanic -0.0658 -0.0706 -0.0602

hsgrad -0.0533 -0.0661 -0.0514

college -0.2149 -0.2406 -0.2121

worka -0.0669 -0.0661 -0.0658

Page 133: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

133

Marginal effects for specific characteristics

• Can generate marginal effects for a specific x

• prchange, x(age=40 black=0 hispanic=0 hsgrad=0 somecol=0 worka=0);

• If an x is not specified, STATA will use the sample mean (e.g., log income in this case)

• Make sure when you specify a particular dummy variable (=1) you set the rest to zero

Page 134: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

134

• probit: Changes in Predicted Probabilities for smoker

• min->max 0->1 -+1/2 -+sd/2 MargEfct• age -0.0323 -0.0005 -0.0005 -0.0056 -0.0005• incomel -0.1795 -0.0320 -0.0344 -0.0263 -0.0345• male 0.0198 0.0198 0.0198 0.0097 0.0198• black -0.0385 -0.0385 -0.0394 -0.0124 -0.0394• hispanic -0.0804 -0.0804 -0.0845 -0.0202 -0.0847• hsgrad -0.0625 -0.0625 -0.0648 -0.0306 -0.0649• somecol -0.1235 -0.1235 -0.1344 -0.0598 -0.1351• college -0.2644 -0.2644 -0.2795 -0.1335 -0.2854• worka -0.0742 -0.0742 -0.0776 -0.0361 -0.0777

Page 135: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

135

Testing significance of individual parameters

• In large samples, MLE estimates are normally distributed

• Null hypothesis, βj=0• If the null is true and the sample is

larges, βj is distributed as a normal with variance σj

2.• Using notes from before, if we divide

βj by the standard deviation, we get standard normal

Page 136: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

136

• βj/se(βj) should be N(0,1)

• βj/se(βj) = z-score

• 95% of the distribution of a N(0,1) is between -1.96, 1.96

• Reject null of the z-score > |1.96| • Only age is statistically insignificant

(cannot reject null)

Page 137: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

137

When will results differ?

• Normal and logit CDF look:– Similar in the mid point of the distribution– Different in the tails

• You obtain more observations in the tails of the distribution when – Samples sizes are large approaches 1 or 0

• These situations will produce more differences in estimates

Page 138: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

138

Some nice properties of the Logit

• Outcome, y=1 or 0• Treatment, x=1 or 0• Other covariates, x

• Context, – x = whether a baby is born with a low

weight birth– x = whether the mom smoked or not

during pregnancy

Page 139: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

139

• Risk ratio

RR = Prob(y=1|x=1)/Prob(y=1|x=0)

Differences in the probability of an event when x is and is not observed

How much does smoking elevate the chance your child will be a low weight birth

Page 140: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

140

• Let Yyx be the probability y=1 or 0 given x=1 or 0

• Think of the risk ratio the following way

• Y11 is the probability Y=1 when X=1• Y10 is the probability Y=1 when X=0

• Y11 = RR*Y10

Page 141: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

141

• Odds Ratio OR=A/B = [Y11/Y01]/[Y10/Y00]

A = [Pr(Y=1|X=1)/Pr(Y=0|X=1)] = odds of Y occurring if you are a smoker

B = [Pr(Y=1|X=0)/Pr(Y=0|X=0)] = odds of y happening if you are not a

smoker

What are the relative odds of Y happening if you do or do not experience X

Page 142: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

142

• Suppose Pr(Yi =1) = F(βo+ β1Xi + β2Z) and F is the logistic function

• Can show that

• OR = exp(β1) = e β1

• This number is typically reported by most statistical packages

Page 143: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

143

• Details• Y11 = exp(βo+ β1 + β2Z) /(1+ exp(βo+ β1+ β2Z) )

• Y10 = exp(βo+ β2Z)/(1+ exp(βo+β2Z))

• Y01 = 1 /(1+ exp(βo+ β1 + β2Z) )

• Y00 = 1/(1+ exp(βo+β2Z)

• [Y11/Y01] = exp(βo+ β1 + β2Z)

• [Y10/Y00] = exp(βo+ β2Z)

• OR=A/B = [Y11/Y01]/[Y10/Y00]

= exp(βo+ β1 + β2Z)/ exp(βo + β2Z)

= exp(β1)

Page 144: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

144

• Suppose Y is rare, close to 0– Pr(Y=0|X=1) and Pr(Y=0|X=0) are both

close to 1, so they cancel

• Therefore, when is close to 0– Odds Ratio = Risk Ratio

• Why is this nice?

Page 145: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

145

Population attributable risk• Average outcome in the population

= (1-) Y10 + Y11 = (1- )Y10 + (RR)Y10

• Average outcomes are a weighted average of outcomes for X=0 and X=1

• What would the average outcome be in the absence of X (e.g., reduce smoking rates to 0)

• Ya = Y10

Page 146: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

146

Population Attributable Risk

• PAR• Fraction of outcome attributed to X• The difference between the current

rate and the rate that would exist without X, divided by the current rate

• PAR = ( – Ya)/

= (RR – 1)/[(1-) + RR]

Page 147: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

147

Example: Maternal Smoking and Low Weight Births

• 6% births are low weight– < 2500 grams (– Average birth is 3300 grams (5.5 lbs)

• Maternal smoking during pregnancy has been identified as a key cofactor– 13% of mothers smoke – This number was falling about 1

percentage point per year during 1980s/90s

– Doubles chance of low weight birth

Page 148: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

148

Natality detail data

• Census of all births (4 million/year)• Annual files starting in the 60s• Information about

– Baby (birth weight, length, date, sex, plurality, birth injuries)

– Demographics (age, race, marital, educ of mom)

– Birth (who delivered, method of delivery)– Health of mom (smoke/drank during preg,

weight gain)

Page 149: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

149

• Smoking not available from CA or NY• ~3 million usable observations• I pulled .5% random sample from

1995• About 12,500 obs• Variables: birthweight (grams),

smoked, married, 4-level race, 5 level education, mothers age at birth

Page 150: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

150

• ------------------------------------------------------------------------------• > -• storage display value• variable name type format label variable label• ------------------------------------------------------------------------------• > -• birthw int %9.0g birth weight in grams• smoked byte %9.0g =1 if mom smoked during• pregnancy• age byte %9.0g moms age at birth• married byte %9.0g =1 if married• race4 byte %9.0g 1=white,2=black,3=asian,4=other• educ5 byte %9.0g 1=0-8, 2=9-11, 3=12, 4=13-15,• 5=16+• visits byte %9.0g prenatal visits• ------------------------------------------------------------------------------

Page 151: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

151

• dummy |• variable, |• =1 | =1 if mom smoked• ifBW<2500 | during pregnancy• grams | 0 1 | Total• -----------+----------------------+----------• 0 | 11,626 1,745 | 13,371 • | 86.95 13.05 | 100.00 • | 94.64 89.72 | 93.96 • | 81.70 12.26 | 93.96 • -----------+----------------------+----------• 1 | 659 200 | 859 • | 76.72 23.28 | 100.00 • | 5.36 10.28 | 6.04 • | 4.63 1.41 | 6.04 • -----------+----------------------+----------• Total | 12,285 1,945 | 14,230 • | 86.33 13.67 | 100.00 • | 100.00 100.00 | 100.00 • | 86.33 13.67 | 100.00

Page 152: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

152

• Notice a few things– 13.7% of women smoke– 6% have low weight birth

• Pr(LBW | Smoke) =10.28%• Pr(LBW |~ Smoke) = 5.36%• RR = Pr(LBW | Smoke)/ Pr(LBW |~ Smoke) = 0.1028/0.0536 = 1.92

Page 153: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

153

Logit results• Log likelihood = -3136.9912 Pseudo R2 = 0.0330

• ------------------------------------------------------------------------------• lowbw | Coef. Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• smoked | .6740651 .0897869 7.51 0.000 .4980861 .8500441• age | .0080537 .006791 1.19 0.236 -.0052564 .0213638• married | -.3954044 .0882471 -4.48 0.000 -.5683654 -.2224433• _Ieduc5_2 | -.1949335 .1626502 -1.20 0.231 -.5137221 .1238551• _Ieduc5_3 | -.1925099 .1543239 -1.25 0.212 -.4949791 .1099594• _Ieduc5_4 | -.4057382 .1676759 -2.42 0.016 -.7343769 -.0770994• _Ieduc5_5 | -.3569715 .1780322 -2.01 0.045 -.7059081 -.0080349• _Irace4_2 | .7072894 .0875125 8.08 0.000 .5357681 .8788107• _Irace4_3 | .386623 .307062 1.26 0.208 -.2152075 .9884535• _Irace4_4 | .3095536 .2047899 1.51 0.131 -.0918271 .7109344• _cons | -2.755971 .2104916 -13.09 0.000 -3.168527 -2.343415• ------------------------------------------------------------------------------

Page 154: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

154

Odds Ratios

• Smoked– exp(0.674) = 1.96– Smokers are twice as likely to have a

low weight birth

• _Irace4_2 (Blacks)– exp(0.707) = 2.02– Blacks are twice as likely to have a low

weight birth

Page 155: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

155

Asking for odds ratios

• Logistic y x1 x2;

• In this case

• xi: logistic lowbw smoked age married i.educ5 i.race4;

Page 156: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

156

• Log likelihood = -3136.9912 Pseudo R2 = 0.0330

• ------------------------------------------------------------------------------• lowbw | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• smoked | 1.962198 .1761796 7.51 0.000 1.645569 2.33975• age | 1.008086 .0068459 1.19 0.236 .9947574 1.021594• married | .6734077 .0594262 -4.48 0.000 .5664506 .8005604• _Ieduc5_2 | .8228894 .1338431 -1.20 0.231 .5982646 1.131852• _Ieduc5_3 | .8248862 .1272996 -1.25 0.212 .6095837 1.116233• _Ieduc5_4 | .6664847 .1117534 -2.42 0.016 .4798043 .9257979• _Ieduc5_5 | .6997924 .1245856 -2.01 0.045 .4936601 .9919973• _Irace4_2 | 2.028485 .1775178 8.08 0.000 1.70876 2.408034• _Irace4_3 | 1.472001 .4519957 1.26 0.208 .8063741 2.687076• _Irace4_4 | 1.362817 .2790911 1.51 0.131 .9122628 2.035893• ------------------------------------------------------------------------------

Page 157: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

157

PAR

• PAR = (RR – 1)/[(1-) + RR]

= 0.137• RR = 1.96

• PAR = 0.116• 11.6% of low weight births attributed

to maternal smoking

Page 158: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

158

Hypothesis Testing in MLE models

• MLE are asymptotically normally distributed, one of the properties of MLE

• Therefore, standard t-tests of hypothesis will work as long as samples are ‘large’

• What ‘large’ means is open to question• What to do when samples are ‘small’ –

table for a moment

Page 159: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

159

Testing a linear combination of parameters

• Suppose you have a probit model• Φ[β0 + x1iβ1+ x2i β2 + x3iβ3 +… ]

• Test a linear combination or parameters• Simplest example, test a subset are zero• β1= β2 = β3 = β4 =0• To fix the discussion

• N observations• K parameters• J restrictions (count the equals signs, j=4)

Page 160: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

160

Wald Test

• Based on the fact that the parameters are distributed asymptotically normal

• Probability theory review– Suppose you have m draws from a

standard normal distribution (zi)

– M = z12 + z2

2 + …. zm2

– M is distributed as a Chi-square with m degrees of freedom

Page 161: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

161

• Wald test constructs a ‘quadratic form’ suggested by the test you want to perform

• This combination, because it contains squares of the true parameters, should, if the hypothesis is true, be distributed as a Chi square with J degrees of freedom.

• If the test statistic is ‘large’, relative to the degrees of freedom of the test, we reject, because there is a low probability we would have drawn that value at random from the distribution

Page 162: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

162

Reading critical values from a table

• All stats books will report the ‘percentiles’ of a chi-square– Vertical axis (degrees of freedom)– Horizontal axis (percentiles)– Entry is the value where ‘percentile’ of

the distribution falls below

Page 163: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

163

• Example: Suppose 4 restrictions• 95% of a chi-square distribution falls

below 9.488. • So there is only a 5% a number drawn

at random will exceed 9.488• If your test statistic is below, cannot

reject null• If your test statistics is above, reject

null

Page 164: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

164

Chi-square

Percentiles of the Chi-squaredDOF 0.500 0.750 0.800 0.900 0.950 0.990 0.995

1 0.455 1.323 1.642 2.706 3.841 6.635 7.8792 1.386 2.773 3.219 4.605 5.991 9.210 10.5973 2.366 4.108 4.642 6.251 7.815 11.345 12.8384 3.357 5.385 5.989 7.779 9.488 13.277 14.8605 4.351 6.626 7.289 9.236 11.070 15.086 16.7506 5.348 7.841 8.558 10.645 12.592 16.812 18.5487 6.346 9.037 9.803 12.017 14.067 18.475 20.2788 7.344 10.219 11.030 13.362 15.507 20.090 21.9559 8.343 11.389 12.242 14.684 16.919 21.666 23.589

10 9.342 12.549 13.442 15.987 18.307 23.209 25.188

Page 165: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

165

Wald test in STATA

• Default test in MLE models• Easy to do. Look at program

• test hsgrad somecol college

• Does not estimate the ‘restricted’ model

• ‘Lower power’ than other tests, i.e., high chance of false negative

Page 166: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

166

• . test hsgrad somecol college;

• ( 1) hsgrad = 0• ( 2) somecol = 0• ( 3) college = 0

• chi2( 3) = 504.78• Prob > chi2 = 0.0000

Page 167: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

167

• Notice the higher value of the test statistic. There is a low chance that a variable, drawn at random from a ch-square with three degrees of freedom will be this large.

• Reject null

Page 168: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

168

-2 Log likelihood test• * how to run the same tests with a -2 log like test;

• * estimate the unresticted model and save the estimates ;

• * in urmodel;• probit smoker age incomel male black hispanic • hsgrad somecol college worka;• estimates store urmodel;

• * estimate the restricted model. save results in rmodel;

• probit smoker age incomel male black hispanic • worka;• estimates store rmodel;

• lrtest urmodel rmodel;

Page 169: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

169

• I prefer -2 log likelihood test– Estimates the restricted and

unrestricted model– Therefore, has more power than a Wald

test

• In most cases, they give the same ‘decision’ (reject/not reject)

Page 170: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

170

Section VI

Categorical Data

Page 171: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

171

Ordered Probit

• Many discrete outcomes are to questions that have a natural ordering but no quantitative interpretation:

• Examples:– Self reported health status

• (excellent, very good, good, fair, poor)

– Do you agree with the following statement• Strongly agree, agree, disagree, strongly

disagree

Page 172: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

172

• Can use the same type of model as in the previous section to analyze these outcomes

• Another ‘latent variable’ model

• Key to the model: there is a monotonic ordering of the qualitative responses

Page 173: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

173

Self reported health status

• Excellent, very good, good, fair, poor• Coded as 1, 2, 3, 4, 5 on National Health

Interview Survey• We will code as 5,4,3,2,1 (easier to

think of this way)• Asked on every major health survey• Important predictor of health outcomes,

e.g. mortality• Key question: what predicts health

status?

Page 174: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

174

• Important to note – the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest

• The example below is easily adapted to include categorical variables with any number of outcomes

Page 175: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

175

Model

• yi* = latent index of reported health

• The latent index measures your own scale of health. Once yi* crosses a certain value you report poor, then good, then very good, then excellent health

Page 176: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

176

• yi = (1,2,3,4,5) for (fair, poor, VG, G, excel)

• Interval decision rule

• yi=1 if yi* ≤ u1

• yi=2 if u1 < yi* ≤ u2

• yi=3 if u2 < yi* ≤ u3

• yi=4 if u3 < yi* ≤ u4

• yi=5 if yi* > u4

Page 177: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

177

• As with logit and probit models, we will assume yi

* is a function of observed and unobserved variables

• yi* = β0 + x1i β1 + x2i β2 …. xki βk + εi

• yi* = xi β + εi

Page 178: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

178

• The threshold values (u1, u2, u3, u4) are unknown. We do not know the value of the index necessary to push you from very good to excellent.

• In theory, the threshold values are different for everyone

• Computer will not only estimate the β’s, but also the thresholds – average across people

Page 179: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

179

• As with probit and logit, the model will be determined by the assumed distribution of ε

• In practice, most people pick nornal, generating an ‘ordered probit’ (I have no idea why)

• We will generate the math for the probit version

Page 180: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

180

Probabilities

• Lets do the outliers, Pr(yi=1) and Pr(yi=5) first

• Pr(yi=1) • = Pr(yi* ≤ u1) • = Pr(xi β +εi ≤ u1 ) • =Pr(εi ≤ u1 - xi β) • = Φ[u1 - xi β] = 1- Φ[xi β – u1]

Page 181: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

181

• Pr(yi=5)

• = Pr(yi* > u4)

• = Pr(xi β +εi > u4 )

• =Pr(εi > u4 - xi β)

• = 1 - Φ[u4 - xi β] = Φ[xi β – u4]

Page 182: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

182

Sample one for y=3

• Pr(yi=3) = Pr(u2 < yi* ≤ u3)

= Pr(yi* ≤ u3) – Pr(yi* ≤ u2)

= Pr(xi β +εi ≤ u3) – Pr(xi β +εi ≤ u2)

= Pr(εi ≤ u3- xi β) - Pr(εi ≤ u2 - xi β)

= Φ[u3- xi β] - Φ[u2 - xi β]

= 1 - Φ[xi β - u3] – 1 + Φ[xi β - u2]

= Φ[xi β - u2] - Φ[xi β - u3]

Page 183: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

183

Summary

• Pr(yi=1) = 1- Φ[xi β – u1]

• Pr(yi=2) = Φ[xi β – u1] - Φ[xi β – u2]

• Pr(yi=3) = Φ[xi β – u2] - Φ[xi β – u3]

• Pr(yi=4) = Φ[xi β – u3] - Φ[xi β – u4]

• Pr(yi=5) = Φ[xi β – u4]

Page 184: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

184

Likelihood function

• There are 5 possible choices for each person

• Only 1 is observed

• L = Σi ln[Pr(yi=k)] for k

Page 185: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

185

Programming example

• Cancer control supplement to 1994 National Health Interview Survey

• Question: what observed characteristics predict self reported health (1-5 scale)

• 1=poor, 5=excellent• Key covariates: income, education, age,

current and former smoking status• Programs

• sr_health_status.do, .dta, .log

Page 186: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

186

• desc;

• male byte %9.0g =1 if male• age byte %9.0g age in years• educ byte %9.0g years of education• smoke byte %9.0g current smoker• smoke5 byte %9.0g smoked in past 5 years• black float %9.0g =1 if respondent is black• othrace float %9.0g =1 if other race (white is ref)• sr_health float %9.0g 1-5 self reported health,• 5=excel, 1=poor• famincl float %9.0g log family income

Page 187: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

187

• tab sr_health;

• 1-5 self |• reported |• health, |• 5=excel, |• 1=poor | Freq. Percent Cum.• ------------+-----------------------------------• 1 | 342 2.65 2.65• 2 | 991 7.68 10.33• 3 | 3,068 23.78 34.12• 4 | 3,855 29.88 64.00• 5 | 4,644 36.00 100.00• ------------+-----------------------------------• Total | 12,900 100.00

Page 188: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

188

In STATA

• oprobit sr_health male age educ famincl black othrace smoke smoke5;

Page 189: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

189

• Ordered probit estimates Number of obs = 12900• LR chi2(8) = 2379.61• Prob > chi2 = 0.0000• Log likelihood = -16401.987 Pseudo R2 = 0.0676

• ------------------------------------------------------------------------------• sr_health | Coef. Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• male | .1281241 .0195747 6.55 0.000 .0897583 .1664899• age | -.0202308 .0008499 -23.80 0.000 -.0218966 -.018565• educ | .0827086 .0038547 21.46 0.000 .0751535 .0902637• famincl | .2398957 .0112206 21.38 0.000 .2179037 .2618878• black | -.221508 .029528 -7.50 0.000 -.2793818 -.1636341• othrace | -.2425083 .0480047 -5.05 0.000 -.3365958 -.1484208• smoke | -.2086096 .0219779 -9.49 0.000 -.2516855 -.1655337• smoke5 | -.1529619 .0357995 -4.27 0.000 -.2231277 -.0827961• -------------+----------------------------------------------------------------• _cut1 | .4858634 .113179 (Ancillary parameters)• _cut2 | 1.269036 .11282 • _cut3 | 2.247251 .1138171 • _cut4 | 3.094606 .1145781 • ------------------------------------------------------------------------------

Page 190: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

190

Interpret coefficients

• Marginal effects/changes in probabilities are now a function of 2 things– Point of expansion (x’s)– Frame of reference for outcome (y)

• STATA– Picks mean values for x’s– You pick the value of y

Page 191: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

191

Continuous x’s

• Consider y=5

• d Pr(yi=5)/dxi

= d Φ[xi β – u4]/dxi = βφ[xi β – u4]

• Consider y=3

• d Pr(yi=3)/dxi = βφ[xi β – u3] - βφ[xi β – u4]

Page 192: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

192

Discrete X’s

• xi β = β0 + x1i β1 + x2i β2 …. xki βk

– X2i is yes or no (1 or 0)

• ΔPr(yi=5) =

• Φ[β0 + x1i β1 + β2 + x3i β3 +.. xki βk]

- Φ[β0 + x1i β1 + x3i β3 …. xki βk]

• Change in the probabilities when x2i=1 and x2i=0

Page 193: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

193

Ask for marginal effects

• mfx compute, predict(outcome(5));

Page 194: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

194

• mfx compute, predict(outcome(5));

• Marginal effects after oprobit• y = Pr(sr_health==5) (predict, outcome(5))• = .34103717• ------------------------------------------------------------------------------• variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X• ---------+--------------------------------------------------------------------• male*| .0471251 .00722 6.53 0.000 .03298 .06127 .438062• age | -.0074214 .00031 -23.77 0.000 -.008033 -.00681 39.8412• educ | .0303405 .00142 21.42 0.000 .027565 .033116 13.2402• famincl | .0880025 .00412 21.37 0.000 .07993 .096075 10.2131• black*| -.0781411 .00996 -7.84 0.000 -.097665 -.058617 .124264• othrace*| -.0843227 .01567 -5.38 0.000 -.115043 -.053602 .04124• smoke*| -.0749785 .00773 -9.71 0.000 -.09012 -.059837 .289147• smoke5*| -.0545062 .01235 -4.41 0.000 -.078719 -.030294 .081395• ------------------------------------------------------------------------------• (*) dy/dx is for discrete change of dummy variable from 0 to 1

Page 195: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

195

Interpret the results

• Males are 4.7 percentage points more likely to report excellent

• Each year of age decreases chance of reporting excellent by 0.7 percentage points

• Current smokers are 7.5 percentage points less likely to report excellent health

Page 196: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

196

Minor notes about estimation

• Wald tests/-2 log likelihood tests are done the exact same was as in PROBIT and LOGIT

• Tests of individual parameters are done the same way (z-score)

Page 197: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

197

• Use PRCHANGE to calculate marginal effect for a specific person

prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16);

– When a variable is NOT specified (famincl), STATA takes the sample mean.

Page 198: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

198

• PRCHANGE will produce results for all outcomes

• male• Avg|Chg| 1 2 3 4• 0->1 .0203868 -.0020257 -.00886671 -.02677558 -.01329902

• 5• 0->1 .05096698

Page 199: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

199

• age• Avg|Chg| 1 2 3 4• Min->Max .13358317 .0184785 .06797072 .17686112 .07064757• -+1/2 .00321942 .00032518 .00141642 .00424452 .00206241• -+sd/2 .03728014 .00382077 .01648743 .04910323 .0237889• MargEfct .00321947 .00032515 .00141639 .00424462 .00206252

Page 200: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

200

Section VII

Count Data Models

Page 201: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

201

Introduction

• Many outcomes of interest are integer counts– Doctor visits– Low work days– Cigarettes smoked per day– Missed school days

• OLS models can easily handle some integer models

Page 202: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

202

• Example– SAT scores are essentially integer values– Few at ‘tails’– Distribution is fairly continuous– OLS models well

• In contrast, suppose– High fraction of zeros– Small positive values

Page 203: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

203

• OLS models will– Predict negative values– Do a poor job of predicting the mass of

observations at zero

• Example– Dr visits in past year, Medicare patients(65+)– 1987 National Medical Expenditure Survey– Top code (for now) at 10– 17% have no visits

Page 204: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

204

• visits | Freq. Percent Cum.• ------------+-----------------------------------• 0 | 915 17.18 17.18• 1 | 601 11.28 28.46• 2 | 533 10.01 38.46• 3 | 503 9.44 47.91• 4 | 450 8.45 56.35• 5 | 391 7.34 63.69• 6 | 319 5.99 69.68• 7 | 258 4.84 74.53• 8 | 216 4.05 78.58• 9 | 192 3.60 82.19• 10 | 949 17.81 100.00• ------------+-----------------------------------• Total | 5,327 100.00

Page 205: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

205

Poisson Model

• yi is drawn from a Poisson distribution

• Poisson parameter varies across observations

• f(yi;λi) =e-λi λi yi/yi! For λi>0

• E[yi]= Var[yi] = λi = f(xi, β)

Page 206: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

206

• λi must be positive at all times

• Therefore, we CANNOT let λi = xiβ

• Let λi = exp(xiβ)

• ln(λi) = (xiβ)

Page 207: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

207

• d ln(λi)/dxi = β

• Remember that d ln(λi) = dλi/λi

• Interpret β as the percentage change in mean outcomes for a change in x

Page 208: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

208

Problems with Poisson

• Variance grows with the mean– E[yi]= Var[yi] = λi = f(xi, β)

• Most data sets have over dispersion, where the variance grows faster than the mean

• In dr. visits sample, = 5.6, s=6.7• Impose Mean=Var, severe restriction

and you tend to reduce standard errors

Page 209: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

209

Negative Binomial Model

• Where γi = exp(xiβ) and δ ≥ 0

• E[yi] = δγi = δexp(xiβ)

• Var[yi] = δ (1+δ) γi

• Var[yi]/ E[yi] = (1+δ)

ii y

ii

iii y

yy

11

1

)1()(

)()Pr(

Page 210: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

210

• δ must always be ≥ 0• In this case, the variance grows

faster than the mean• If δ=0, the model collapses into the

Poisson• Always estimate negative binomial• If you cannot reject the null that δ=0,

report the Poisson estimates

Page 211: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

211

• Notice that ln(E[yi]) = ln(δ) + ln(γi), so

• d ln(E[yi]) /dxi = β

• Parameters have the same interpretation as in the Poisson model

Page 212: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

212

In STATA

• POISSON estimates a MLE model for poisson– Syntax

POISSON y independent variables

• NBREG estimates MLE negative binomial– Syntax

NBREG y independent variables

Page 213: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

213

Interpret results for Poisson

• Those with CHRONIC condition have 50% more mean MD visits

• Those in EXCELent health have 78% fewer MD visits

• BLACKS have 33% fewer visits than whites

• Income elasticity is 0.021, 10% increase in income generates a 2.1% increase in visits

Page 214: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

214

Negative Binomial

• Interpret results the same was as Poisson• Look at coefficient/standard error on delta• Ho: delta = 0 (Poisson model is correct)• In this case, delta = 5.21 standard error is

0.15, easily reject null.• Var/Mean = 1+delta = 6.21, Poisson is

mis-specificed, should see very small standard errors in the wrong model

Page 215: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

215

Selected Results, Count ModelsParameter (Standard Error)

Variable Poisson Negative Binomial

Age65 0.214 (0.026) 0.103 (0.055)

Age70 0.787 (0.026) 0.204 (0.054)

Chronic 0.500 (0.014) 0.509 (0.029)

Excel -0.784 (0.031) -0.527 (0.059)

Ln(Inc). 0.021 (0.007) 0.038 (0.016)

Page 216: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

216

Section VIII

Duration Data

Page 217: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

217

Introduction

• Sometimes we have data on length of time of a particular event or ‘spells’– Time until death– Time on unemployment– Time to complete a PhD

• Techniques we will discuss were originally used to examine lifespan of objects like light bulbs or machines. These models are often referred to as “time to failure”

Page 218: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

218

Notation

• T is a random variable that indicates duration (time til death, find a new job, etc)

• t is the realization of that variable• f(t) is a PDF that describes the process

that determines the time to failure• CDF is F(t) represents the probability

an event will happen by time t

Page 219: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

219

• F(t) represents the probability that the event happens by ‘t’.

• What is the probability a person will die on or before the 65th birthday?

F t s t f s d st

( ) P r( ) ( ) 0

Page 220: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

220

• Survivor function, what is the chance you live past (t)

• S(t) = 1 – F(t)• If 10% of a cohort dies by their 65th

birthday, 90% will die sometime after their 65th birthday

Page 221: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

221

• Hazard function, h(t)• What is the probability the spell will

end at time t, given that it has already lasted t

• What is the chance you find a new job in month 12 given that you’ve been unemployed for 12 months already( ) lim

P r( | )t

t T t h T t

hh

0

Page 222: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

222

• PDF, CDF (Failure function), survivor function and hazard function are all related

• λ(t) = f(t)/S(t) = f(t)/(1-F(t))

• We focus on the ‘hazard’ rate because its relationship to time indicates ‘duration dependence’

Page 223: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

223

• Example: suppose the longer someone is out of work, the lower the chance they will exit unemployment – ‘damaged goods’

• This is an example of duration dependence, the probability of exiting a state of the world is a function of the length

Page 224: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

224

• Mathematically• d λ(t) /dt = 0 then there is no duration dep.

• d λ(t) /dt > 0 there is + duration dependence

the probability the spell will end

increases with time

• d λ(t) /dt < 0 there is – duration dependence

the probability the spell will end

decreases over time

Page 225: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

225

• Your choice, is to pick values for f(t) that have +, - or no duration dependence

Page 226: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

226

Different Functional Forms

• Exponential– λ(t)= λ– Hazard is the same over time, a ‘memory less’

process

• Weibull– F(t) = 1 – exp(-γtρ) where ρ,γ > 0– λ(t) = ργρ-1

– if ρ>1, increasing hazard– if ρ<1, decreasing hazard– if ρ=1, exponential

Page 227: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

227

• Others: Lognormal, log-logistic, Gompertz.

• Little more difficult – can examine when you get comfortable with Weibull

Page 228: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

228

A note about most data sets

• Most data sets have ‘censored’ spells– Follow people over time – All will eventually die, but some do not

in your period of analysis– Incomplete spells or censored data

• Must build into the log likelihood function

Page 229: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

229

• Let ti be the duration we observe for all people

• Some people die, and their they lived until period ti

• Others are observed for ti periods, but do not

• Let di=1 if data is complete spell

• di=1 if incomplete

Page 230: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

230

• Recall, that f(s) is the PDF for someone who dies at period s

• F(t) is the probability you die by t• 1-F(t) = the probability you die after

(t)

F t s t f s d st

( ) P r( ) ( ) 0

Page 231: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

231

• If di=1 then we observe f(ti), someone who died in period ti

• If di=0 then someone lived past period ti and the probability of that is [1-F(ti)]

• L = Σi {di ln[f(ti)] + (1-di)ln[1-F(ti)]}

Page 232: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

232

Introducing covariates

• Look at exponential• λ(t)= λ• Allow this to vary across people

• λ i(t)= λi

• But like Poisson, λi is always positive

• Let λi = exp(β0 + x1i β1 + x2i β2 …. xki

βk)

Page 233: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

233

• In the Weibull • λ(t) = αγtα-1

• Allow it to vary across people

• λ i(t) = αγi tα-1

• γi = exp(β0 + x1i β1 + x2i β2 …. xki βk)

Page 234: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

234

Interpreting Coefficients

• This is the same for both Weibull and Exponential

• In Weibull, λ(ti ) = αγitα-1

• Suppose x1i is a dummy variable• When xi1=1, then • γi1 = eβ0 + β1 + x2i β2 …. xki βk

• When xi1=0, then • γi0 = eβ0 + x2i β2 …. xki βk

Page 235: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

235

• When you construct the ratio of γi1/ γi0, all the others parameters cancel, so

• (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1

• Percentage change in the hazard when x1i turns from 0 to 1.

• STATA prints out eβ1, just subtract 1

Page 236: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

236

Suppose x2i is continuous

• Suppose we increase x2i by 1 unit• γi1 = exp(β0 + β1x1i + x2i β2 …. xki βk)• γi2 = exp(β0 + β1 (x1i+1) + x2i β2 …. xki βk)

• Can show that• (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1• = exp(β2) – 1• Percentage change in the hazard for 1 unit

increase in x

Page 237: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

237

NHIS Multiple Cause of Death

• NHIS– annual survey of 60K households– Data on individuals– Self-reported healthm DR visits, lost workdays,

etc.

• MCOD– Linked NHIS respondents from 1986-1994 to

National Death Index through Dec 31, 1995– Identified whether respondent died and of what

cause

Page 238: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

238

• Our sample– Males, 50-70, who were married at the

time of the survey– 1987-1989 surveys– Give everyone 5 years (60 months) of

followup

Page 239: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

239

Key Variables

• max_mths maximum months in the survey.

• Diedin5 respondent died during the 5 years of followup

• Note if diedn5=0, the max_mths=60. Diedin5 identifies whether the data is censored or not.

Page 240: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

240

• Variable | Obs Mean Std. Dev. Min Max• -------------+--------------------------------------------------------• age_s_yrs | 26654 59.42586 5.962435 50 70• max_mths | 26654 56.49077 11.15384 0 60• black | 26654 .0928566 .2902368 0 1• hispanic | 26654 .0454716 .20834 0 1• income | 26654 3.592181 1.327325 1 5• -------------+--------------------------------------------------------• educ | 26654 2.766677 .961846 1 4• diedin5 | 26654 .1226082 .3279931 0 1

Page 241: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

241

Duration Data in STATA

• Need to identify which is the duration data

stset length, failure(failvar)

• Length=duration variable• Failvar=1 when durations end in failure, =0

for censored values

• If all data is uncensored, omit failure(failvar)

Page 242: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

242

• In our case

• stset max_mths, failure(diedin5)

Page 243: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

243

Getting Kaplan-Meier Curves

• Tabular presentation of results sts list

• Graphical presentation sts graph

• Results by subgroup sts graph, by(educ)

Page 244: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

244

0.0

00.

25

0.5

00.

75

1.0

0

0 20 40 60analysis time

Kaplan-Meier survival estimate

Page 245: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

245

0.0

00.

25

0.5

00.

75

1.0

0

0 20 40 60analysis time

educ = 1 educ = 2educ = 3 educ = 4

Kaplan-Meier survival estimates, by educ

Page 246: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

246

MLE of duration model with Covariates

• Basic syntax• streg covariates, d(distribution)

• streg age_s_yrs black hispanic _Ie* _Ii*, d(weibull);

• In this model, STATA will print out exp(β)

• If you want the coefficients, add ‘nohr’ option (no hazard ratio)

Page 247: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

247

Weibull coefficients• No. of subjects = 26631 Number of obs = 26631• No. of failures = 3245• Time at risk = 1505705• LR chi2(10) = 595.74• Log likelihood = -12425.055 Prob > chi2 = 0.0000

• ------------------------------------------------------------------------------• _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• age_s_yrs | .0452588 .0031592 14.33 0.000 .0390669 .0514508• black | .4770152 .0511122 9.33 0.000 .3768371 .5771932• hispanic | .1333552 .082156 1.62 0.105 -.0276676 .294378• _Ieduc_2 | .0093353 .0591918 0.16 0.875 -.1066786 .1253492• _Ieduc_3 | -.072163 .0503131 -1.43 0.151 -.1707748 .0264488• _Ieduc_4 | -.1301173 .0657131 -1.98 0.048 -.2589126 -.0013221• _Iincome_2 | -.1867752 .0650604 -2.87 0.004 -.3142914 -.0592591• _Iincome_3 | -.3268927 .0688635 -4.75 0.000 -.4618627 -.1919227• _Iincome_4 | -.5166137 .0769202 -6.72 0.000 -.6673747 -.3658528• _Iincome_5 | -.5425447 .0722025 -7.51 0.000 -.684059 -.4010303• _cons | -9.201724 .2266475 -40.60 0.000 -9.645945 -8.757503• -------------+----------------------------------------------------------------• /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901• -------------+----------------------------------------------------------------• p | 1.171789 .020183 1.132891 1.212022• 1/p | .8533961 .014699 .8250675 .8826974• ------------------------------------------------------------------------------

Page 248: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

248

• The sign of the parameters is informative– Hazard increasing in age– Blacks, hispanics have higher mortality rates– Hazard decreases with income and age

• The parameter ρ = 1.17. – Check 95% confidence interval (1.13, 1.21).

Can reject null p=1 (exponential)– Hazard is increasing over time

Page 249: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

249

Hazard ratios• No. of subjects = 26631 Number of obs = 26631• No. of failures = 3245• Time at risk = 1505705• LR chi2(10) = 595.74• Log likelihood = -12425.055 Prob > chi2 = 0.0000

• ------------------------------------------------------------------------------• _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]• -------------+----------------------------------------------------------------• age_s_yrs | 1.046299 .0033055 14.33 0.000 1.03984 1.052797• black | 1.611258 .082355 9.33 0.000 1.457667 1.781032• hispanic | 1.142656 .093876 1.62 0.105 .9727116 1.342291• _Ieduc_2 | 1.009379 .059747 0.16 0.875 .8988145 1.133544• _Ieduc_3 | .9303792 .0468103 -1.43 0.151 .8430114 1.026802• _Ieduc_4 | .8779924 .0576956 -1.98 0.048 .7718905 .9986788• _Iincome_2 | .8296302 .0539761 -2.87 0.004 .7303062 .9424625• _Iincome_3 | .7211611 .0496617 -4.75 0.000 .6301089 .8253706• _Iincome_4 | .5965372 .0458858 -6.72 0.000 .5130537 .6936049• _Iincome_5 | .5812672 .041969 -7.51 0.000 .5045648 .6696297• -------------+----------------------------------------------------------------• /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901• -------------+----------------------------------------------------------------• p | 1.171789 .020183 1.132891 1.212022• 1/p | .8533961 .014699 .8250675 .8826974• ------------------------------------------------------------------------------

Page 250: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

250

Interpret coefficients

• Age: every year hazard increases by 4.6%

• Black: have 61% greater hazard than whites

• Hispanics: 14% greater hazard than non-hispanics

• Educ 2, 3, 4 are some 9-11, 12-15 and 16+ years of school

Page 251: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

251

• Educ 3: those with 12-15 years of educ have .93 – 1 = -0.07 or a 7% lower hazard than those with <9 years of school

• Educ 4: those with a college degree have 0.88 – 1 = -0.12 or a 12% lower hazard than those with <9 years of school

Page 252: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

252

• Income 2-5 are dummies for people with $10-$20K, $20-$30K, $30-$40K, >$40K

• Income 2: Those with $10-$20K have 0.83 – 1 = -0.17 or a 17% lower hazard than those with income <$10K

• Income 5, those with >$40K in income have a 0.58 – 1 = -0.42 or a 42% lower hazard than those with income <$10K

Page 253: 1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

253

Topics not covered

• Time varying covariates• Competing risk models

– Die from multiple causes

• Cox proportional hazard model– Heterogeneity in baseline hazard