Midterm Project
A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses.
In this project, you will use a dataset named “BWGHT.dta” to analyze the causal effect of smoking on infant health. The variables in the dataset should be self-explanatory. If you have any questions regarding their definitions, feel free to contact me for clarifications.
Before diving into the regressions, you want to first analyze the data in the following ways.
1. Get some summary statistics about the two variables “bwght” and “cigs” including their means and standard deviations. Does the variable “bwght” look like a normal random variable? Hint: a normal random variable is symmetric (skewness=0), and its kurtosis is equal to 3.
2. Get the correlation between the variables “bwght” and “cigs”. Is there a negative relationship between the two variables?
3. Divide the data into two groups. In the first group, mothers don’t smoke during pregnancy, and in the second group, mothers smoke during pregnancy. Conduct a difference-in-means test on these two groups. Investigate whether the mean birth weight of infants in the first group is statistically different from that in the second group.
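The quantities named in questions 1 and 3 can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `moments` and `welch_t` are hypothetical, the numbers are made up and are not taken from BWGHT.dta, and the test statistic shown is the unequal-variance (Welch) version, which is one common choice rather than the required one. Your statistical package may use slightly different small-sample formulas.

```python
import math

def moments(xs):
    """Return (mean, sample standard deviation, skewness, kurtosis)."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments computed with divisor n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample SD uses divisor n - 1
    skew = m3 / m2 ** 1.5             # 0 for a symmetric distribution
    kurt = m4 / m2 ** 2               # 3 for a normal distribution (not excess kurtosis)
    return mean, sd, skew, kurt

def welch_t(a, b):
    """t statistic for H0: mean(a) = mean(b), allowing unequal variances."""
    mean_a, sd_a, _, _ = moments(a)
    mean_b, sd_b, _, _ = moments(b)
    se = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / se

# Made-up birth weights for illustration only (NOT from BWGHT.dta)
nonsmokers = [120, 115, 130, 125, 110, 135]
smokers = [105, 112, 98, 118, 107]

mean, sd, skew, kurt = moments(nonsmokers)
print("mean:", mean, "sd:", sd, "skewness:", skew, "kurtosis:", kurt)
print("t statistic:", welch_t(nonsmokers, smokers))
```

With the real data, the same statistics would be computed on “bwght”, split by whether “cigs” is zero or positive.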
Now you want to run a simple regression to investigate how smoking during pregnancy affects infant health by regressing “bwght” on “cigs”.
4. In this simple regression, what is the estimated effect of smoking on infant health? Is this effect large or small? Is it statistically significant at the 5% level?
5. Provide a plot similar to Figure 4.3 in the textbook, i.e., plot “bwght” against “cigs”, and plot the fitted value of “bwght” against “cigs”. Does the plot suggest homoskedasticity or heteroskedasticity to you?
6. You further investigate how much variation in “bwght” can be explained by “cigs”. The R-squared turns out to be very small. Comment on this statement: since the R-squared from this simple regression is very small, smoking during pregnancy seems to be irrelevant to infant health; in other words, health officials (and others) shouldn’t devote any attention to this issue at all.
7. To double-check your answer in question 3, you decide to run a simple regression of “bwght” on a dummy variable indicating whether mothers smoke or not. Is your regression result consistent with your difference-in-means result from question 3? Relate the constant estimate and the slope estimate in this simple regression to the results from question 3.
8. You suspect that the simple regression suffers from omitted variable bias. Based on the data, what are potential omitted variables? Why do they qualify as omitted variables?
9. Run a multiple regression of “bwght” on “cigs”, “faminc”, “fatheduc”, “motheduc”, “male”, and “white”. What is the new estimated effect of smoking on infant health? What do the coefficients of “male” and “white” tell you?
10. Since “faminc”, “fatheduc”, and “motheduc” are not individually significant at the 5% level, you decide to conduct a test with the null that these three variables jointly have no effect on infant health. What kind of test should you use? Can you reject the null at the 5% significance level?
11. Although “faminc” is not significant, you still decide to include it in the regression to avoid omitted variable bias. Higher income families tend to smoke higher quality cigarettes, which tend to have a less severe impact on the health of an infant. You therefore conjecture that the effect of smoking on infant health should decrease with family income. Run a regression to incorporate this new conjecture, including all variables in question 9. Do you see such a decreasing effect in the regression? Is it statistically significant at the 5% level?
12. You suspect that the effect of smoking on birth weight is stronger for infant boys (male=1) than for infant girls (male=0). Run a multiple regression to incorporate this effect, including all variables in question 9. Does the regression result suggest such an asymmetric effect? Is it statistically significant at the 5% level?
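The link between the dummy-variable regression in question 7 and the group comparison in question 3 can be illustrated with a small plain-Python sketch. The helper `simple_ols` is a hypothetical name, the smoker dummy is constructed here as `cigs > 0` (an assumption about how the indicator is defined), and the data are made up rather than taken from BWGHT.dta.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up observations for illustration only (NOT from BWGHT.dta)
cigs = [0, 0, 0, 10, 20, 5, 0, 15]
bwght = [120, 130, 125, 105, 98, 112, 115, 107]

# Dummy variable: 1 if the mother smoked at all, 0 otherwise (assumed definition)
smoker = [1 if c > 0 else 0 for c in cigs]

intercept, slope = simple_ols(smoker, bwght)

# Group means computed directly, for comparison with the regression output
non = [w for w, d in zip(bwght, smoker) if d == 0]
smk = [w for w, d in zip(bwght, smoker) if d == 1]
mean_non = sum(non) / len(non)
mean_smk = sum(smk) / len(smk)

print("intercept:", intercept, "mean of nonsmokers:", mean_non)
print("slope:", slope, "difference in means:", mean_smk - mean_non)
```

In this toy example the intercept matches the nonsmoker mean and the slope matches the smoker mean minus the nonsmoker mean. The same comparison can be made with the real data using your regression output.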
NOTE: Please answer all relevant questions in this project, and submit a copy of your results before class on Oct 27th.
