ECON 312 Problem Set 1 – Suggested Solutions Spring Semester 2014
1. a) Show that vi = ui + wi
The original regression model in terms of the true but unobservable Yi is given by:
Yi = β0 + β1Xi + ui
and the regression model we actually estimate using the observable variable Ỹi is given by:
Ỹi = β0 + β1Xi + vi
By assumption, Ỹi = Yi + wi and so Yi = Ỹi − wi. Substituting this into the first regression model gives:
Ỹi − wi = β0 + β1Xi + ui
Ỹi = β0 + β1Xi + ui + wi
and so vi = ui + wi.
b) Show that the regression Ỹi = β0 + β1Xi + vi satisfies the standard LS assumptions in Key Concept 4.3. (Assume that wi is independent of Xi and Yi for all i and has finite 4th moment.)
i) We must show that E(vi | Xi) = 0, which we check here through the implied condition cov(vi, Xi) = 0
• From part (a), vi = ui + wi and so: cov(vi, Xi) = cov(ui + wi, Xi) = cov(ui, Xi) + cov(wi, Xi)
• We are told to assume that wi is independent of Xi, which implies cov(wi, Xi) = 0, and we are also told that the regression model Yi = β0 + β1Xi + ui satisfies the LS assumptions, which implies cov(ui, Xi) = 0. These facts imply that cov(vi, Xi) = 0 and therefore the first assumption is satisfied.
ii) We must show that (Xi, Ỹi) are i.i.d. draws from their joint distribution
• Given that the regression model Yi = β0 + β1Xi + ui satisfies the LS assumptions, (Xi, Yi) must be i.i.d.
• Ỹi = Yi + wi and by assumption the measurement errors wi are i.i.d. and are independent of Yi. This implies that Ỹi must also be i.i.d.
• Finally, Ỹi and Xj must be independent for i ≠ j, since both Yi and wi are independent of Xj
• Therefore, (Xi, Ỹi) must be i.i.d.
iii) We must show that Ỹi and Xi have finite fourth moment (no large outliers)
• Xi and Yi must have finite fourth moment because the regression model Yi = β0 + β1Xi + ui satisfies the LS assumptions
• Again, by definition Ỹi = Yi + wi. Both Yi and wi have finite fourth moments and are mutually independent, so Ỹi has a finite fourth moment.
c) The OLS estimators are consistent, because the standard LS assumptions are satisfied.
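The consistency result can be illustrated with a small simulation. The sketch below is not part of the original solution: the data are simulated, and the parameter values (β0 = 1, β1 = 2, the error variances, the sample size and the seed) are arbitrary assumptions chosen for illustration. It adds classical measurement error first to the dependent variable and then, for contrast with the discussion in part (e), to the regressor.

```stata
* Simulated data only; every numeric value below is an illustrative assumption.
clear
set seed 312
set obs 100000
generate x = rnormal()
generate u = rnormal()
generate y = 1 + 2*x + u              // true model: beta0 = 1, beta1 = 2
generate ytilde = y + rnormal(0, 2)   // classical measurement error in Y
generate xtilde = x + rnormal()       // classical measurement error in X

quietly regress y x, robust           // benchmark: no measurement error
scalar se_clean = _se[x]

quietly regress ytilde x, robust      // error in the dependent variable
scalar b_yerr  = _b[x]
scalar se_yerr = _se[x]

quietly regress y xtilde, robust      // error in the regressor
scalar b_xerr = _b[xtilde]

display "Slope, error in Y: " b_yerr "  (s.e. " se_yerr " vs " se_clean " without error)"
display "Slope, error in X: " b_xerr
```

With these assumed values, the slope from the regression of ytilde on x should be close to the true value of 2 but with a larger standard error than the benchmark, while the slope from the regression of y on xtilde should be pulled towards zero (its probability limit here is 2 × 1/(1 + 1) = 1).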
d) Yes, confidence intervals and hypothesis tests can be performed in the normal way, because the LS assumptions are satisfied.
e) In the case of measurement error that is i.i.d. (i.e. classical measurement error), measurement error in the independent variable makes the OLS estimator inconsistent and biased, but if there is measurement error in the dependent variable the OLS estimator is still unbiased and consistent. In this sense, the statement is partially true. However, the measurement error in the dependent variable still makes the OLS estimator inefficient compared to the case with no measurement error. Furthermore, if the form of measurement error in the dependent variable is different, then this result may no longer hold and OLS may be biased and inconsistent.
2. As we discussed during the classes, the effects of missing data on the OLS estimator depend on why the data are missing. We saw three possible cases: when data are missing at random, when data are missing based on the value of one of the regressors and finally when data are missing based on the value of the dependent variable. In cases 1 and 2, the OLS estimator will remain unbiased and consistent, but will be inefficient (you can explain intuitively why this is the case for some extra credit). In the final case however the OLS estimator will typically be biased and inconsistent – known as sample selection bias. Again you can try to explain why or give some brief examples for additional credit.
3. a) The STATA commands to generate the log wage, perform the regression and display the adjusted R2 are:
generate lwage = ln(wage)
regress lwage exper, robust
display "Adjusted R-squared = " _result(8)
NOTE: we should include the robust option in the regress command, to tell STATA to use the heteroskedasticity robust standard errors.
The positive sign of the estimated coefficient on exper is consistent with economic theory, since on average a positive relationship would be expected between years of working experience and the (log) wage. However, the coefficient estimate above will only be a reliable estimate of the true causal effect if the model is internally valid. There are several possible threats to internal validity. In particular, given the very simple single regressor model used, it is quite possible that there is significant omitted variables bias, which would make the OLS estimator biased and inconsistent.
b) For the log-log specification, the relevant STATA commands are:
generate lexper = ln(exper)
regress lwage lexper, robust
display "Adjusted R-squared = " _result(8)
For the second quadratic functional form:
generate exper2 = exper^2
generate exper3 = exper^3
regress lwage exper exper2, robust
display "Adjusted R-squared = " _result(8)
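To line the three specifications up side by side, one possible approach (a sketch, not part of the original solution) is to store each set of estimates and tabulate the adjusted R2, which regress also saves as e(r2_a). The sketch assumes the problem-set data are in memory and that lwage, lexper and exper2 have been generated as above; the names loglin, loglog and quad are arbitrary labels.

```stata
* Assumes lwage, lexper and exper2 already exist (generated as above).
quietly regress lwage exper, robust
estimates store loglin          // log-linear model from part (a)

quietly regress lwage lexper, robust
estimates store loglog          // log-log model

quietly regress lwage exper exper2, robust
estimates store quad            // quadratic model

* One column per model, with sample size and adjusted R-squared underneath.
estimates table loglin loglog quad, stats(N r2_a) b(%9.4f)
```

Reporting N alongside the adjusted R2 is a deliberate choice: ln(exper) is missing for any observation with exper = 0, so the log-log regression may be estimated on a smaller sample than the other two, in which case the adjusted R2 values are not strictly comparable.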
The adjusted R2 values imply that both the log-log and quadratic specifications provide an improvement in predictive ability compared to the log-linear specification in (a). The quadratic specification for exper however results in a higher adjusted R2 than the log-log specification and so we select the quadratic model. You could also check a cubic specification for exper, but you should find that the simpler quadratic specification is actually preferable.
NOTE: it is typically better to use the adjusted R2 to compare different models and not the standard R2 – this is true generally and not only for this specific question.
c) Extending the quadratic model above by adding years of education, educ:
regress lwage exper exper2 educ, robust
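Because exper enters this model nonlinearly, the predicted effect of an extra year of experience depends on the starting level of experience. As a worked illustration (the starting value of 10 years is an arbitrary assumption, and the β̂ symbols stand for the estimated coefficients on exper and exper2), the predicted change in the log wage from one more year of experience, holding educ constant, is:

```latex
\begin{align*}
\Delta \widehat{\ln(\mathit{wage})}
  &= \hat\beta_1\,[(e+1) - e] + \hat\beta_2\,[(e+1)^2 - e^2] \\
  &= \hat\beta_1 + \hat\beta_2\,(2e + 1), \\
\text{so for } e = 10:\quad
\Delta \widehat{\ln(\mathit{wage})} &= \hat\beta_1 + 21\,\hat\beta_2 .
\end{align*}
```

In STATA, this linear combination and its standard error could be obtained after the regression above with lincom exper + 21*exper2.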
The estimated coefficients on all included regressors are statistically significant. The estimated coefficient on educ is positive (consistent with economic theory) and the value of 0.073 implies that (holding exper constant) an extra year of education leads to a predicted increase of about 7.3% in wages (remember we are using the log wage). We can also calculate the predicted effect of a change in years of working experience (holding educ constant), but because exper enters the regression model nonlinearly we have to use the method discussed in Section 8.1 of the textbook. Note however that the coefficient on the squared value of exper is very small, suggesting that the relationship between working experience and the log wage is close to linear.
Compared to the earlier model in (b), the estimated coefficients on exper and exper squared have the same signs, but their sizes change quite substantially. Combined with the fact that the estimated coefficient on educ is statistically significant, this suggests that the model in (b) probably suffers from some omitted variable bias and so is not internally valid.
d) The minimum you should check is whether any of the binary variables appear to be relevant when included linearly in the regression model as additional regressors. Given that the number of binary variables is small relative to the sample size, the best way to begin would be to include all of the binary variables as regressors and check which are statistically significant:
regress lwage exper exper2 educ female married nonwhite south union, robust
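Individual significance can be read from the regression output, but joint significance of a subset of the binary variables needs a separate test. A minimal sketch, assuming the problem-set data are in memory and that lwage and exper2 were generated as above, uses the test command immediately after the regression:

```stata
* Assumes the problem-set data are in memory with lwage and exper2 defined.
regress lwage exper exper2 educ female married nonwhite south union, robust

* Joint null hypothesis: the coefficients on married and south are both zero.
* Because the regression uses the robust option, the F-statistic reported
* here is the heteroskedasticity-robust version.
test married south
```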
The estimated coefficients on female, nonwhite and union are all statistically significant, suggesting that gender, race and trade union membership all seem to be relevant explanatory variables for the log wage. The estimated coefficients on married and south however are not individually or jointly significant and so we can probably safely remove these variables and re-estimate the model:
regress lwage exper exper2 educ female nonwhite union, robust
When we remove married and south from the regression model, the estimated coefficients on the remaining variables do not change substantially. Combined with the statistically insignificant estimated coefficients we obtained above for married and south, this suggests that we can probably safely remove these variables without introducing any substantial omitted variables bias.
We could also check whether any interaction terms are relevant. One obvious example is to interact educ and exper with female, to allow the expected returns to working experience and education to differ according to gender. To do this we create the interaction terms femeduc and femexper and then include them in the previous regression model:
generate femeduc = female*educ
generate femexper = female*exper
regress lwage exper exper2 educ female femeduc femexper nonwhite union, robust
From these estimation results, there is some evidence that the returns to working experience are lower for women than for men, but the returns to education do not appear to vary significantly with gender.
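In this model the return to education for men is the coefficient on educ, while for women it is the sum of the coefficients on educ and femeduc. The sketch below shows one way to obtain that sum with a standard error, and to test both interaction terms jointly. It assumes the problem-set data are in memory and that lwage, exper2, femeduc and femexper were generated as above.

```stata
* Assumes lwage, exper2, femeduc and femexper already exist.
regress lwage exper exper2 educ female femeduc femexper nonwhite union, robust

* Return to one extra year of education for women: educ + femeduc.
lincom educ + femeduc

* Joint null hypothesis: neither return differs by gender.
test femeduc femexper
```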
A second obvious example is to allow the expected returns to working experience and education to differ according to race. There are of course many other possible interaction terms you could include, some of which may turn out to be statistically significant. However, checking all possible interaction terms is not really practical and any additional interaction terms you do check should be justifiable in some way according to economic or other theory (such as the examples above).
e) Depending on which binary variables and interaction terms you looked at in part (d), you may obtain a different final model for part (e). As a result, there is not really a single ‘correct’ regression model for this part of the question – what is important is whether you can correctly interpret and discuss the estimation results for the specific model you used. For example, what is the expected effect on wages from an additional year of education? Does this vary according to gender or race? If it does, how much does it differ? All other things being equal, what is the difference in expected wage between men and women? What about white and non-white individuals? All of these questions can be answered using the material from ECON311 and so they will not be discussed in detail here.
f) Here you can briefly consider the various threats to internal validity and discuss whether you feel each is likely to be a problem in the current context. We have controlled for several qualitative factors that are typically found to be important determinants of wages (such as gender and race), but it is possible that omitted variables bias is still a problem. If you think this may be true, try to suggest one or two potentially relevant variables that are not included in the data set and why they could satisfy the conditions for omitted variable bias. You are told at the start of the question that the data are from the US Current Population Survey (CPS) and so were collected and processed by the US government. This makes simple recording errors unlikely and the sampling system for the survey has probably been designed to avoid serious sample selection problems. The data are however still survey based and so could possibly suffer from measurement error if survey respondents give inaccurate responses – whether or not this is a problem depends on the form of measurement error.
g) Again, the exact estimation results you obtain will depend on the final model specification you selected in part (e). Compare your estimation results from 1978 and 1985 using identical specifications for the regression function – if the estimated coefficient values and their statistical significance are similar for the two time periods, then this provides some evidence of external validity over time. If there are large differences between the results for 1978 and 1985, then this suggests that the results from 1978 may not be generalisable to other time periods. Of course, obtaining similar estimation results for 1978 and 1985 does not provide any guarantee of external validity more generally, either across time (for example, to other years like 2005 or 1965) or to other countries.
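Beyond comparing the two sets of estimates by eye, one way to formalise the comparison is to pool both years and test whether the slope coefficients differ. This is a sketch of that alternative, not the method used in the solution. The file names cps78 and cps85 and the variable names y85, exper_85, exper2_85 and educ_85 are hypothetical, and the sketch assumes both files contain wage, exper and educ but not lwage or exper2. A simple three-regressor specification is used for illustration; your own final model from part (e) would take its place.

```stata
* Hypothetical file names: replace with the actual 1978 and 1985 data files.
use cps78, clear
generate y85 = 0
append using cps85
replace y85 = 1 if missing(y85)     // appended 1985 rows have no y85 value yet

generate lwage  = ln(wage)
generate exper2 = exper^2

* Interact each regressor with the 1985 indicator.
foreach v of varlist exper exper2 educ {
    generate `v'_85 = `v' * y85
}

regress lwage exper exper2 educ y85 exper_85 exper2_85 educ_85, robust

* Joint null hypothesis: the slope coefficients are the same in both years.
test exper_85 exper2_85 educ_85
```

The coefficient on y85 is left out of the joint test on purpose: wages are measured in nominal terms, so the intercept would be expected to shift between 1978 and 1985 even if the returns to experience and education were unchanged.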