
Structure of the class

1. The linear probability model

2. Maximum likelihood estimations

3. Binary logit models and some other models

4. Multinomial models

The Linear Probability Model

The linear probability model

When the dependent variable is binary (0/1, for example, Y=1 if the firm innovates, 0 otherwise), OLS is called the linear probability model.

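$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$

How should one interpret βj? Provided that E(u|X)=0 holds true, then:

$$E(Y \mid X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

Because Y is binary, E(Y | X) = Pr(Y=1 | X), so β measures the variation of the probability of success for a one-unit variation of X (ΔX=1):

$$\beta = \frac{\Delta E(Y \mid X)}{\Delta X} = \frac{\Delta \Pr(Y = 1 \mid X)}{\Delta X}$$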

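Limits of the linear probability model

1. Non normality of errors: the assumption $u \sim \text{Normal}(0, \sigma^2)$ cannot hold, since for a given X the error takes only two values.

2. Heteroskedastic errors: the assumption $\text{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma^2$ is violated, because the error variance depends on X.

3. Fallacious predictions: nothing guarantees that $0 \le E(Y \mid X) \le 1$.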

Overcoming the limits of the LPM

1. Non normality of errors: increase sample size

2. Heteroskedastic errors: use robust estimators

3. Fallacious predictions: perform non linear or constrained regressions
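The third limit can be illustrated with a small numerical sketch. The data below are hypothetical (they are not the innovation data used in this class); the point is only that OLS fitted values on a binary outcome are not constrained to the unit interval.

```python
import numpy as np

# Hypothetical toy data: one regressor x and a binary outcome y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

# OLS on a binary outcome is the linear probability model.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted "probabilities" and a flag for values outside [0, 1].
fitted = X @ beta
out_of_range = (fitted < 0) | (fitted > 1)

print("intercept, slope:", beta)
print("fitted values:", np.round(fitted, 3))
print("outside [0, 1]:", out_of_range)
```

In this toy sample the fitted values at the two extreme values of x fall outside [0 ; 1], which is what motivates the logit and probit transformations.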

Persistent use of LPM

Although it has limits, the LPM is still used

1. In the process of data exploration (early stages of the research)

2. It is a good indicator of the marginal effect of the representative observation (at the mean)

3. When dealing with very large samples, least squares can overcome the complications imposed by maximum likelihood techniques:
• Time of computation
• Endogeneity and panel data problems

The LOGIT/PROBIT Model

Probability, odds and logit/probit

We need to explain the occurrence of an event: the LHS variable takes two values: y={0;1}.

In fact, we need to explain the probability of occurrence of the event, conditional on X: P(Y=y | X) ∈ [0 ; 1].

OLS estimations are not adequate, because predictions can lie outside the interval [0 ; 1].

We need to transform a real number, say z ∈ ]-∞;+∞[, into P(Y=y | X) ∈ [0 ; 1].

The logit/probit transformation links a real number z ∈ ]-∞;+∞[ to P(Y=y | X) ∈ [0 ; 1]. It is also called the link function.

Binary Response Models: Logit - Probit Link function approach

Maximum likelihood estimations

OLS cannot be of much help here. We will use Maximum Likelihood Estimation (MLE) instead.

MLE is an alternative to OLS. It consists of finding the parameter values that are most consistent with the data we have.

The likelihood is defined as the joint probability of observing a given sample, given the parameters involved in the generating function.

One way to distinguish between OLS and MLE is as follows:

OLS adapts the model to the data you have: you only have one model derived from your data. MLE instead supposes there is an infinity of models, and chooses the model most likely to explain your data.

Let us assume that you have a sample of n random observations. Let f(yi) be the probability that yi = 1 or yi = 0. The joint probability of observing the n values of yi is given by the likelihood function:

$$f(y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} f(y_i)$$

Logit likelihood

Likelihood functions

Knowing p (as the logit), having defined f(.), we come up with the likelihood function:

$$L(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$

$$L(y, z) = \prod_{i=1}^{n} f(y_i, z) = \prod_{i=1}^{n} \left(\frac{e^{z}}{1+e^{z}}\right)^{y_i} \left(\frac{1}{1+e^{z}}\right)^{1-y_i}$$

$$L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, X, \beta) = \prod_{i=1}^{n} \left(\frac{e^{X\beta}}{1+e^{X\beta}}\right)^{y_i} \left(\frac{1}{1+e^{X\beta}}\right)^{1-y_i}$$

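Log likelihood (LL) functions

The log transform of the likelihood function (the log likelihood) is much easier to manipulate, and is written:

$$LL(y, z) = \sum_{i=1}^{n} y_i z - \sum_{i=1}^{n} \ln\left(1 + e^{z}\right)$$

$$LL(y, x, \beta) = \sum_{i=1}^{n} y_i X\beta - \sum_{i=1}^{n} \ln\left(1 + e^{X\beta}\right)$$

$$LL(y, x, \beta) = \sum_{i=1}^{n} \left[ y_i X\beta - \ln\left(1 + e^{X\beta}\right) \right]$$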

The LL function can be evaluated at an infinity of values of the parameters β.

Given the functional form of f(.) and the n observations at hand, which values of parameters β maximize the likelihood of my sample?

In other words, what are the most likely values of my unknown parameters β given the sample I have?

Maximum likelihood estimations

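The first-order condition is:

$$\frac{\partial LL}{\partial \beta} = \sum_{i=1}^{n} \left(y_i - \Lambda_i\right) x_i = 0, \quad \text{where } \Lambda_i = \frac{e^{z}}{1+e^{z}}$$

and the second derivative is:

$$\frac{\partial^2 LL}{\partial \beta \, \partial \beta'} = -\sum_{i=1}^{n} \Lambda_i \left(1 - \Lambda_i\right) x_i x_i'$$

However, there is no analytical solution to this non linear problem. Instead, we rely on an optimization algorithm (Newton-Raphson).

The LL is globally concave and has a maximum. The gradient is used to compute the parameters of interest, and the hessian is used to compute the variance-covariance matrix.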
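As a rough illustration of how the gradient and the hessian are used, here is a minimal Newton-Raphson sketch for the logit. It is not what Stata runs internally; the function name `logit_newton_raphson`, the simulated data and the values in `true_beta` are all assumptions made for the example.

```python
import numpy as np

def logit_newton_raphson(X, y, tol=1e-10, max_iter=50):
    """Maximise the logit log likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))              # Lambda_i
        gradient = X.T @ (y - p)                         # sum (y_i - Lambda_i) x_i
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X  # -sum L_i (1 - L_i) x_i x_i'
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                               # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    # Variance-covariance matrix: inverse of minus the hessian at the optimum.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
    vcov = np.linalg.inv(-hessian)
    return beta, vcov

# Simulated data (hypothetical): a constant and one regressor.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat, vcov = logit_newton_raphson(X, y)
std_err = np.sqrt(np.diag(vcov))
print("estimates:", beta_hat)
print("standard errors:", std_err)
```

Because the LL is globally concave, the iterations started from zero typically converge in a handful of steps on data like these.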

Maximum likelihood estimations

You need to imagine that the computer is going to generate all possible values of β, and is going to compute a likelihood value for each (vector of) values, to then choose the (vector of) β such that the likelihood is highest.

Binary Dependent Variable: Research questions

We want to explore the factors affecting the probability of being a successful innovator (inno = 1): Why?

Logistic Regression with STATA

Stata instruction: logit

logit y x1 x2 x3 … xk [if] [weight] [, options]

Options

noconstant : estimates the model without the constant

robust : estimates robust variances, which remain valid in case of heteroscedasticity

if : selects the observations we want to include in the analysis

weight : weights the different observations

Interpretation of Coefficients

A positive coefficient indicates that the probability of innovation success increases with the corresponding explanatory variable.

A negative coefficient implies that the probability to innovate decreases with the corresponding explanatory variable.

Warning! One of the problems encountered in interpreting probabilities is their non-linearity: the probabilities do not vary in the same way according to the level of regressors.

This is the reason why it is common practice to calculate the probability of the event occurring at the average point of the sample.


Let's run the more complete model:

. logit inno lrdi lassets spe biotech

Iteration 0:   log likelihood = -205.30803
Iteration 1:   log likelihood = -167.71312
Iteration 2:   log likelihood = -163.57746
Iteration 3:   log likelihood = -163.45376
Iteration 4:   log likelihood = -163.45352

Logistic regression                     Number of obs =     431
                                        LR chi2(4)    =   83.71
                                        Prob > chi2   =  0.0000
Log likelihood = -163.45352             Pseudo R2     =  0.2039

    inno |     Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
---------+---------------------------------------------------------------
    lrdi |  .7527497   .2110683     3.57   0.000    .3390634   1.166436
 lassets |   .997085   .1368534     7.29   0.000    .7288574   1.265313
     spe |  .4252844   .4204924     1.01   0.312   -.3988654   1.249434
 biotech |  3.799953    .577509     6.58   0.000    2.668056    4.93185
   _cons | -11.63447   1.937191    -6.01   0.000   -15.43129  -7.837643

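$$P = \frac{e^{-11.63 + 0.75\,\text{lrdi} + 0.99\,\text{lassets} + 0.43\,\text{spe} + 3.79\,\text{biotech}}}{1 + e^{-11.63 + 0.75\,\text{lrdi} + 0.99\,\text{lassets} + 0.43\,\text{spe} + 3.79\,\text{biotech}}}$$

Using the sample mean values of lrdi, lassets, spe and biotech, we compute the conditional probability:

$$P = \frac{e^{-11.63 + 0.75\,\overline{\text{lrdi}} + 0.99\,\overline{\text{lassets}} + 0.43\,\overline{\text{spe}} + 3.79\,\overline{\text{biotech}}}}{1 + e^{-11.63 + 0.75\,\overline{\text{lrdi}} + 0.99\,\overline{\text{lassets}} + 0.43\,\overline{\text{spe}} + 3.79\,\overline{\text{biotech}}}} = \frac{e^{1.953}}{1 + e^{1.953}} = 0.8758$$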
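The last step of this computation can be checked numerically. The sketch below takes the linear index at the sample means (1.953) as given from the slide, since the sample means themselves are not reported; the names `z_at_mean` and `p_at_mean` are illustrative.

```python
import math

# Linear index X*beta evaluated at the sample means, as reported on the slide.
z_at_mean = 1.953

# Logistic transformation: P = e^z / (1 + e^z).
p_at_mean = math.exp(z_at_mean) / (1.0 + math.exp(z_at_mean))

print(round(p_at_mean, 4))
```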


Marginal Effects

It is often useful to know the marginal effect of a regressor on the probability that the event (innovation) occurs.

As the probability is a non-linear function of the explanatory variables, the change in probability due to a change in one of the explanatory variables is not identical if the other variables are at the average, median or first quartile, etc. level.
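For the logit, the derivative of the probability with respect to a regressor is $\partial P / \partial x_j = \beta_j P (1 - P)$. The sketch below evaluates it at the sample mean, using the coefficients from the logit output and the linear index 1.953 reported above. It is a derivative-based approximation; effects computed for a discrete one-unit change will differ somewhat, especially for a dummy such as biotech. The dictionary names are illustrative.

```python
import math

# Logit coefficients taken from the Stata output above.
coef = {"lrdi": 0.7527497, "lassets": 0.997085, "spe": 0.4252844}

# Probability at the sample means (linear index 1.953, as reported on the slide).
z_at_mean = 1.953
p = math.exp(z_at_mean) / (1.0 + math.exp(z_at_mean))

# Derivative-based marginal effect at the mean: beta_j * P * (1 - P).
marginal_effect = {name: b * p * (1.0 - p) for name, b in coef.items()}

for name, value in marginal_effect.items():
    print(name, round(value, 3))
```

The same coefficient therefore translates into a smaller change in probability when P is close to 0 or 1 than when it is close to 0.5.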

Goodness of Fit Measures

In ML estimations, there is no such measure as the R2.

But the log likelihood measure can be used to assess the goodness of fit. Note the following:
• The higher the number of observations, the lower the joint probability, and the more the LL measure goes towards -∞.
• Given the number of observations, the better the fit, the higher the LL measure (since it is always negative, the closer to zero it is).

The philosophy is to compare two models by looking at their LL values. One is meant to be the constrained model, the other one is the unconstrained model.

A model is said to be constrained when the observer sets the parameters associated with some variables to zero.

A model is said to be unconstrained when the observer releases this assumption and allows the parameters associated with some variables to be different from zero.

For example, we can compare two models, one with no explanatory variables, one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be nil.

The likelihood ratio test (LR test)

The most used measure of goodness of fit in ML estimations is the likelihood ratio. The likelihood ratio is twice the difference between the log likelihood of the unconstrained model and that of the constrained model. This statistic is distributed χ², with degrees of freedom equal to the number of constrained parameters:

$$LR = 2\left(\ln L_{unc} - \ln L_{c}\right)$$

If the difference in the LL values is large (small), it is because the set of explanatory variables brings in significant (insignificant) information. The null hypothesis H0 is that the model brings no significant information.

High LR values will lead the observer to reject hypothesis H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.

The McFadden Pseudo R2

We also use the McFadden Pseudo R2 (1973). Its interpretation is analogous to the OLS R2. However, it is biased downward and generally remains low.

The pseudo-R2 also compares the unconstrained model with the constrained model, and lies between 0 and 1:

$$\text{Pseudo } R^2_{MF} = 1 - \frac{\ln L_{unc}}{\ln L_{c}} = \frac{\ln L_{c} - \ln L_{unc}}{\ln L_{c}}$$


Unconstrained model

. logit inno lrdi lassets spe biotech, nolog

Logistic regression                     Number of obs =     431
                                        LR chi2(4)    =   83.71
                                        Prob > chi2   =  0.0000
Log likelihood = -163.45352             Pseudo R2     =  0.2039

    inno |     Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
---------+---------------------------------------------------------------
    lrdi |  .7527497   .2110683     3.57   0.000    .3390634   1.166436
 lassets |   .997085   .1368534     7.29   0.000    .7288574   1.265313
     spe |  .4252844   .4204924     1.01   0.312   -.3988654   1.249434
 biotech |  3.799953    .577509     6.58   0.000    2.668056    4.93185
   _cons | -11.63447   1.937191    -6.01   0.000   -15.43129  -7.837643

Constrained model

. logit inno

Iteration 0:   log likelihood = -205.30803

Logistic regression                     Number of obs =     431
                                        LR chi2(0)    =    0.00
                                        Prob > chi2   =       .
Log likelihood = -205.30803             Pseudo R2     =  0.0000

    inno |     Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
---------+---------------------------------------------------------------
   _cons |  1.494183   .1244955    12.00   0.000    1.250177    1.73819

$$LR = 2\left(\ln L_{unc} - \ln L_{c}\right) = 2 \times \left[-163.45352 - (-205.30803)\right] = 83.71$$

$$\text{Pseudo } R^2_{MF} = 1 - \frac{\ln L_{unc}}{\ln L_{c}} = 1 - \frac{-163.45352}{-205.30803} = 0.2039$$
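Both statistics can be recomputed from the two log likelihood values in the Stata outputs. The sketch below also computes the p-value of the LR test using the closed-form survival function of a χ² with 4 degrees of freedom, $e^{-x/2}(1 + x/2)$, which avoids any external library; the variable names are illustrative.

```python
import math

# Log likelihoods reported by Stata.
ll_unc = -163.45352   # unconstrained model (four regressors)
ll_c = -205.30803     # constrained model (constant only)

# Likelihood ratio statistic and McFadden pseudo R2.
lr = 2.0 * (ll_unc - ll_c)
pseudo_r2 = 1.0 - ll_unc / ll_c

# P(chi2 with 4 d.f. > lr); closed form valid for 4 degrees of freedom only.
p_value = math.exp(-lr / 2.0) * (1.0 + lr / 2.0)

print(round(lr, 2), round(pseudo_r2, 4), p_value)
```

The results agree with the `LR chi2(4)` and `Pseudo R2` lines of the Stata output.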

Other Binary Choice models

The Logit model is only one way of modeling binary choice models.

The Probit model is another way of modeling binary choice models. It is actually more used than logit models and assumes a normal distribution (not a logistic one) for the z values.

The complementary log-log model is used where the occurrence of the event is very rare, with the distribution of z being asymmetric.

Other Binary Choice models

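Probit model

$$\Pr(Y = 1 \mid X) = \int_{-\infty}^{X\beta} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz = \int_{-\infty}^{X\beta} \phi(t) \, dt = \Phi(X\beta)$$

Complementary log-log model

$$\Pr(Y = 1 \mid X) = 1 - \exp\left(-\exp(X\beta)\right)$$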

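Likelihood functions and Stata commands

Logit :

$$L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x, \beta) = \prod_{i=1}^{n} \left(\frac{e^{X\beta}}{1+e^{X\beta}}\right)^{y_i} \left(\frac{1}{1+e^{X\beta}}\right)^{1-y_i}$$

Probit :

$$L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x, \beta) = \prod_{i=1}^{n} \Phi(X\beta)^{y_i} \left[1 - \Phi(X\beta)\right]^{1-y_i}$$

Log-log comp :

$$L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x, \beta) = \prod_{i=1}^{n} \left[1 - \exp\left(-\exp(X\beta)\right)\right]^{y_i} \left[\exp\left(-\exp(X\beta)\right)\right]^{1-y_i}$$

Example

logit inno rdi lassets spe pharma
probit inno rdi lassets spe pharma
cloglog inno rdi lassets spe pharma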
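To see how the three transformations differ, the sketch below evaluates each of them at a few values of z. The function names are illustrative; the normal CDF is written with the standard library's `math.erf`.

```python
import math

def logit_cdf(z):
    """Logistic transformation: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cloglog_cdf(z):
    """Complementary log-log transformation: 1 - exp(-exp(z))."""
    return 1.0 - math.exp(-math.exp(z))

for z in (-2.0, 0.0, 2.0):
    print(z, round(logit_cdf(z), 3), round(probit_cdf(z), 3), round(cloglog_cdf(z), 3))
```

The logit and probit transformations are symmetric around z = 0, where both equal 0.5, whereas the complementary log-log is not.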

Probability Density Functions

[Figure: density (y axis, 0 to .4) against x (-4 to 4) for the Probit, Logit and Complementary log log transformations]

Cumulative Distribution Functions

[Figure: cumulative probability (y axis, 0 to 1) against x (-4 to 4) for the Probit, Logit and Complementary log log transformations]

Comparison of models

                      OLS         Logit       Probit      C log-log
Ln(R&D intensity)     0.110       0.752       0.422       0.354
                      [3.90]***   [3.57]***   [3.46]***   [3.13]***
ln(Assets)            0.125       0.997       0.564       0.493
                      [8.58]***   [7.29]***   [7.53]***   [7.19]***
Spe                   0.056       0.425       0.224       0.151
                      [1.11]      [1.01]      [0.98]      [0.76]
Biotech Dummy         0.442       3.799       2.120       1.817
                      [7.49]***   [6.58]***   [6.77]***   [6.51]***
Constant              -0.843      -11.634     -6.576      -6.086
                      [3.91]**    [6.01]***   [6.12]***   [6.08]***
Observations          431         431         431         431

Absolute t value in brackets (OLS), z value for other models.

* 10%, ** 5%, *** 1%

Comparison of marginal effects

                      OLS      Logit    Probit   C log-log
Ln(R&D intensity)     0.110    0.082    0.090    0.098
ln(Assets)            0.125    0.110    0.121    0.136
Specialisation        0.056    0.046    0.047    0.042
Biotech Dummy         0.442    0.368    0.374    0.379

For the logit, probit and cloglog models, marginal effects have been computed for a one-unit variation (around the mean) of the variable at stake, holding all other variables at their sample mean values.

Multinomial LOGIT Models

Multinomial models

Let us now focus on the case where the dependent variable has several outcomes (or is multinomial). For example, innovative firms may need to collaborate with other organizations. One can code this type of interactions as follows:
• Collaborate with university (modality 1)
• Collaborate with large incumbent firms (modality 2)
• Collaborate with SMEs (modality 3)
• Do it alone (modality 4)

Or, studying firm survival:
• Survival (modality 1)
• Liquidation (modality 2)
• Mergers & acquisition (modality 3)


Multiple alternatives without obvious ordering

Choice of a single alternative out of a number of distinct alternatives, e.g.: which means of transportation do you use to get to work? Bus, car, bicycle, etc.

By contrast, an example of an ordered structure: how do you feel today? Very well, fairly well, not too well, miserably.

Random Utility Model

• RUM underlies the economic interpretation of discrete choice models.
• Developed by Daniel McFadden for econometric applications: see JoEL January 2001 for the Nobel lecture; also Manski (2001), Daniel McFadden and the Econometric Analysis of Discrete Choice, Scandinavian Journal of Economics, 103(2), 217-229.
• Preferences are functions of biological taste templates, experiences and other personal characteristics. Some of these are observed, others unobserved.
• Allows for taste heterogeneity.
• The discussion below is in terms of individual utility (e.g. migration, transport mode choice), but similar reasoning applies to firm choices.

Random Utility Model

Individual i's utility from a choice j can be decomposed into two components:

$$U_{ij} = V_{ij} + \varepsilon_{ij}$$

• Vij is deterministic: common to everyone, given the same characteristics and constraints. It captures the representative tastes of the population, e.g. effects of time and cost on travel mode choice.
• εij is random: it reflects idiosyncratic tastes of i and unobserved attributes of choice j.

Random Utility Model

Vij is a function of attributes of alternative j (e.g. price and time) and observed consumer and choice characteristics:

$$V_{ij} = \alpha t_{ij} + \beta p_{ij} + \gamma z_{ij}$$

• We are interested in finding α, β, γ
• Let's forget about z now for simplicity

RUM and binary choices

Consider two choices, e.g. bus or car. We observe whether an individual uses one or the other. Define:

$$y_i = 1 \text{ if } i \text{ chooses bus}$$

$$y_i = 0 \text{ if } i \text{ chooses car}$$

• What is the probability that we observe an individual choosing to travel by bus?
• Assume utility maximisation
• Individual chooses bus (y=1) rather than car (y=0) if utility of commuting by bus exceeds utility of commuting by car

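RUM and binary choices

So choose bus if:

$$U_{i1} > U_{i0}$$

$$V_{i1} + \varepsilon_{i1} > V_{i0} + \varepsilon_{i0}$$

$$V_{i1} - V_{i0} + \varepsilon_{i1} - \varepsilon_{i0} > 0$$

• So the probability that we observe an individual choosing bus travel is

$$\text{Prob}\left(V_{i1} - V_{i0} + \varepsilon_{i1} - \varepsilon_{i0} > 0\right)$$

$$\text{Prob}\left(\alpha (t_{i1} - t_{i0}) + \beta (p_{i1} - p_{i0}) + \varepsilon_{i1} - \varepsilon_{i0} > 0\right)$$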

The linear probability model

• Assume probability depends linearly on observed characteristics (price and time)

$$\text{Prob}(i \text{ chooses bus}) = \alpha (t_{i1} - t_{i0}) + \beta (p_{i1} - p_{i0})$$

• Then you can estimate by linear regression

$$y_{i1} = \alpha (t_{i1} - t_{i0}) + \beta (p_{i1} - p_{i0}) + \varepsilon_{i1}$$

• Where $y_{i1}$ is the "dummy variable" for mode choice (1 if bus, 0 if car)
• Other consumer and choice characteristics can be included (the zs in the first slide in this section)

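Probits and logits

Common assumptions:
• Cumulative normal distribution function: "Probit"
• Logistic function: "Logit"

$$\text{Prob}(i \text{ chooses bus}) = \frac{\exp(V_i)}{1 + \exp(V_i)}$$

• Estimation by maximum likelihood

$$\text{Prob}(y_i = 1) = F(x_i \beta)$$

$$\text{Prob}(y_i = 0) = 1 - F(x_i \beta)$$

$$\ln L = \sum_{i=1}^{n} \left[ y_i \ln F(x_i \beta) + (1 - y_i) \ln\left(1 - F(x_i \beta)\right) \right]$$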


A discrete choice underpinning

• choice between M alternatives
• decision is determined by the utility level Uij that an individual i derives from choosing alternative j
• Let:

$$U_{ij} = x_i' \beta_j + \varepsilon_{ij} \qquad (1)$$

where i=1,…,N individuals; j=0,…,J alternatives

The alternative providing the highest level of utility will be chosen.


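The probability that alternative j will be chosen is:

$$P(y_i = j) = P\left(U_{ij} > U_{ik} \mid x, \forall k \neq j\right)$$

$$P(y_i = j) = P\left(x_i' \beta_j - x_i' \beta_k > \varepsilon_{ik} - \varepsilon_{ij} \mid x, \forall k \neq j\right)$$

In general, this requires solving multidimensional integrals: analytical solutions do not exist.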


Exception: if the error terms εij in (1) are assumed to be independently and identically standard extreme value distributed, then an analytical solution exists.

In this case, similar to binary logit, it can be shown that the choice probabilities are:

$$P(y_i = j) = \frac{\exp(x_i' \beta_j)}{\sum_{k} \exp(x_i' \beta_k)}$$
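The choice probabilities can be computed directly from this formula. The sketch below is only an illustration: the function name `mnl_probabilities`, the regressors in `X` and the coefficients in `B` are hypothetical, and the coefficients of the base alternative (column 0) are normalised to zero.

```python
import numpy as np

def mnl_probabilities(X, B):
    """Multinomial logit probabilities.

    X: (n, k) matrix of regressors.
    B: (k, J+1) matrix of coefficients, one column per alternative;
       column 0 is the base alternative, set to zero.
    """
    index = X @ B
    # Subtracting the row maximum leaves the ratios unchanged
    # and avoids overflow in the exponential.
    index = index - index.max(axis=1, keepdims=True)
    expo = np.exp(index)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical example: two individuals, a constant and one regressor,
# three alternatives (0 = base).
X = np.array([[1.0, 0.5],
              [1.0, -1.0]])
B = np.array([[0.0, 0.2, -0.3],
              [0.0, 1.0, 0.5]])

P = mnl_probabilities(X, B)
print(P)
print(P.sum(axis=1))
```

Each row of `P` sums to one, and the odds of alternative j against the base depend only on the coefficients of alternative j.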

Let us assume that you have a sample of n random observations. Let f(yi) be the probability that yi = j. The joint probability of observing the n values of yi is given by the likelihood function:

$$f(y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} f(y_i)$$

We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have several outcomes. This is the multinomial distribution. Hence:

$$f(y_i) = p_0^{dY_{i0}} \times p_1^{dY_{i1}} \times \dots \times p_j^{dY_{ij}} \times \dots \times p_k^{dY_{ik}} = \prod_{j \in K} p_j^{dY_{ij}}$$

where $dY_{ij}$ equals 1 if yi = j and 0 otherwise.


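The maximum likelihood function

The maximum likelihood function reads:

$$L(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \prod_{j=0}^{k} p_j^{dY_{ij}}$$

$$L(y, x, \beta) = \prod_{i=1}^{n} f\left(y_i, x_i, \beta^{(j|0)}\right) = \prod_{i=1}^{n} \left(\frac{1}{1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}}\right)^{dY_{i0}} \prod_{j=1}^{k} \left(\frac{e^{x_i \beta^{(j|0)}}}{1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}}\right)^{dY_{ij}}$$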

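The maximum likelihood function

The log transform of the likelihood yields:

$$LL(y, x, \beta) = \sum_{i=1}^{n} \left[ dY_{i0} \ln \frac{1}{1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}} + \sum_{j=1}^{k} dY_{ij} \ln \frac{e^{x_i \beta^{(j|0)}}}{1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}} \right]$$

$$LL(y, x, \beta) = \sum_{i=1}^{n} \left[ -dY_{i0} \ln\left(1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}\right) + \sum_{j=1}^{k} dY_{ij} \, x_i \beta^{(j|0)} - \sum_{j=1}^{k} dY_{ij} \ln\left(1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}\right) \right]$$

Since each observation falls in exactly one category, $\sum_{j=0}^{k} dY_{ij} = 1$, and the expression simplifies to:

$$LL(y, x, \beta) = \sum_{i=1}^{n} \left[ \sum_{j=1}^{k} dY_{ij} \, x_i \beta^{(j|0)} - \ln\left(1 + \sum_{m=1}^{k} e^{x_i \beta^{(m|0)}}\right) \right]$$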

Multinomial logit models

Stata instruction: mlogit

mlogit y x1 x2 x3 … xk [if] [weight] [, options]

Options:

noconstant : omits the constant

robust : controls for heteroskedasticity

if : selects observations

weight : weights observations

Multinomial logit models

use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*

[Stata output not reproduced. The slide annotates three parts of it:]

• Base outcome, chosen by STATA, with the highest empirical frequency
• Goodness of fit
• Parameter estimates, standard errors and z values

Interpretation of coefficients

The interpretation of coefficients always refers to the base category.

Does the probability of being bought-out decrease over time?

No! Relative to survival, the probability of being bought-out decreases over time.

Interpretation of coefficients

The interpretation of coefficients always refers to the base category.

Is the probability of being bought-out lower for spinoffs?

No! Relative to survival, the probability of being bought-out is lower for spinoffs.


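Marginal Effects

$$\frac{\partial P_{ij}}{\partial x_{ik}} = P_{ij} \left[ \beta_{jk} - \sum_{m=1}^{J-1} P_{im} \beta_{mk} \right], \quad j = 1, \dots, J-1$$

Elasticities

$$\frac{\partial P_{ij}}{\partial x_{ik}} \frac{x_{ik}}{P_{ij}} = x_{ik} \left[ \beta_{jk} - \sum_{m=1}^{J-1} P_{im} \beta_{mk} \right], \quad j = 1, \dots, J-1$$

The elasticity is the relative change of pij if x increases by 1 per cent.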

Independence of irrelevant alternatives - IIA

The model assumes that the odds between each pair of outcomes are independent from all other alternatives. In other words, the other alternatives are irrelevant.

From a statistical viewpoint, this is tantamount to assuming independence of the error terms across pairs of alternatives.

A simple way to test the IIA property is to estimate the model taking off one modality (called the restricted model), and to compare the parameters with those of the complete model.

If IIA holds, the parameters should not change significantly.

If IIA does not hold, the parameters should change significantly.

Multinomial logit and "IIA"

• Many applications in economic and geographical journals (and other research areas)
• The multinomial logit model is the workhorse of multiple choice modelling in all disciplines
• Easy to compute
• But it has a drawback

Independence of Irrelevant Alternatives

Consider market shares:
• Red bus 20%
• Blue bus 20%
• Train 60%

IIA assumes that if the red bus company shuts down, the market shares become:
• Blue bus 20% + 5% = 25%
• Train 60% + 15% = 75%

Because the ratio of blue bus trips to train trips must stay at 1:3.
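The arithmetic behind this example is a proportional rescaling of the remaining shares, which is what a multinomial logit implies when one alternative is removed. A minimal sketch, with a hypothetical helper `drop_alternative`:

```python
# Market shares from the example above.
shares = {"red bus": 0.20, "blue bus": 0.20, "train": 0.60}

def drop_alternative(shares, removed):
    """Rescale the remaining shares proportionally, as IIA implies."""
    remaining = {k: v for k, v in shares.items() if k != removed}
    total = sum(remaining.values())
    return {k: v / total for k, v in remaining.items()}

new_shares = drop_alternative(shares, "red bus")
print(new_shares)
print(new_shares["blue bus"] / new_shares["train"])
```

The ratio of blue bus to train shares is the same before and after the red bus is removed.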

Independence of Irrelevant Alternatives

• Model assumes that 'unobserved' attributes of all alternatives are perceived as equally similar
• But will people unable to travel by red bus really switch to travelling by train?
• Most likely outcome is (assuming supply of bus seats is elastic): Blue bus 40%, Train 60%
• This failure of multinomial/conditional logit models is called the Independence of Irrelevant Alternatives assumption (IIA)

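Independence of irrelevant alternatives - IIA

H0: The IIA property is valid

H1: The IIA property is not valid

$$H = \left(\hat{\beta}_R - \hat{\beta}^*_C\right)' \left[\widehat{\text{var}}\left(\hat{\beta}_R\right) - \widehat{\text{var}}\left(\hat{\beta}^*_C\right)\right]^{-1} \left(\hat{\beta}_R - \hat{\beta}^*_C\right)$$

where R denotes the restricted model and C the complete model.

The H statistic (H stands for Hausman) follows a χ² distribution with M degrees of freedom (M being the number of parameters).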

STATA application: the IIA test

mlogtest, hausman

[Stata output not reproduced. Slide annotation: Omitted variable]

Application of IIA

mlogtest, hausman

We compare the parameters of the model "liquidation relative to bought-out" estimated simultaneously with "survival relative to bought-out"

with

the parameters of the model "liquidation relative to bought-out" estimated without "survival relative to bought-out".

Application of IIA

mlogtest, hausman

The conclusion is that the outcome survival significantly alters the choice between liquidation and bought-out.

In fact, for a company, being bought-out must be seen as a way to remain active at the cost of losing control over economic decisions, notably investment.


Cramer-Ridder Test

Often you want to know whether certain alternatives can be merged into one: e.g., do you have to distinguish between employment states such as "unemployment" and "nonemployment"?

The Cramer-Ridder test examines the null hypothesis that the alternatives can be merged. It has the form of a LR test:

2(logLU-logLR)~χ²


Derive the log likelihood value of the restricted model where two alternatives (here, A and N) have been merged:

$$\log L_R = n_A \log n_A + n_N \log n_N - (n_A + n_N) \log(n_A + n_N) + \log L_P$$

where log LR is the log likelihood of the restricted model, log LP is the log likelihood of the pooled model, and nA and nN are the number of times A and N have been chosen.
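The computation is mechanical once the pooled and unrestricted models have been estimated. In the sketch below, the function name `cramer_ridder` and all the numerical inputs (log likelihoods and counts) are hypothetical and serve only to show the order of operations.

```python
import math

def cramer_ridder(ll_unrestricted, ll_pooled, n_a, n_n):
    """Return the restricted log likelihood and the LR statistic.

    ll_unrestricted: log likelihood of the model keeping A and N separate.
    ll_pooled: log likelihood of the model estimated with A and N merged.
    n_a, n_n: number of times A and N have been chosen.
    """
    ll_restricted = (n_a * math.log(n_a) + n_n * math.log(n_n)
                     - (n_a + n_n) * math.log(n_a + n_n) + ll_pooled)
    lr = 2.0 * (ll_unrestricted - ll_restricted)
    return ll_restricted, lr

# Hypothetical values, for illustration only.
ll_restricted, lr = cramer_ridder(ll_unrestricted=-420.0, ll_pooled=-300.0,
                                  n_a=120, n_n=80)
print(round(ll_restricted, 3), round(lr, 3))
```

The LR statistic is then compared with a χ² critical value, as in the LR test form given for the Cramer-Ridder test.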

Exercise

use http://www.stata-press.com/data/r8/sysdsn3

tabulate insure

mlogit insure age male nonwhite site2 site3
