* Lab 4, Stat 539 * To accompany the second half of lec3.pdf * Transformations * from the course website download and save the brain weight dataset (brain.dta) use brain // use the brain dataset list // print to screen list in 15 replace brain_wt = . in 15 // Set the Star-nosed mole brain_wt of 1 to missing to predict later list in 15 scatter brain_wt body_wt // scatterplot (extreme outliers in y direction, logarithmic growth) sort brain_wt // look at values by brain_wt list sort body_wt // look at values by body_wt list scatter br bo if(body_wt <=200) // scatterplot (x direction, logarithmic growth) graph box body_wt ,name(bodbox) // boxplots, instead of saving to disk to combine graph box brain_wt ,name(brbox) // first, name them in memory graph combine bodbox brbox // then, combine them by the names you gave them generate lbody_wt = log(body_wt ) // transform x using natural log generate lbrain_wt = log(brain_wt ) // transform y using natural log scatter lbrain_wt lbody_wt // after transformation, data is (incredibly) linear regress lbrain_wt lbody_wt // perform regression on transformed and linearly related variables predict lbrain_wt_hat, xb // predict the fitted values * This set of code we've used before to create the Confidence and Prediction Bands (Intervals) * Make sure you set the df=degrees-of-freedom to be the residual degrees-of-freedom (or dfE, E for error=residual) * in the invttail(df,alpha/2) expression * because we're in the log scale, I put an 'L' in front of the interval variables predict se_line, stdp predict se_pred, stdf generate llci = lbrain_wt_hat - invttail(59,0.025) * se_line generate luci = lbrain_wt_hat + invttail(59,0.025) * se_line generate llpi = lbrain_wt_hat - invttail(59,0.025) * se_pred generate lupi = lbrain_wt_hat + invttail(59,0.025) * se_pred list lbrain_wt_hat, llci, luci, llpi, lupi * Plot the result graph twoway (scatter lbrain_wt lbody_wt) (line lbrain_wt_hat lbody_wt) /// (line llci lbody_wt, sort) (line luci lbody_wt, sort) /// (line llpi lbody_wt, sort) (line lupi lbody_wt, sort) /// , title(Brain Weight Data) subtitle(CI for Line and Prediction Interval for log-log regression) * Create a four-in-one plot predict residual, r quietly qnorm residual, saving(probplot, replace) nodraw /// title(Normal Prob. Plot of Residuals) quietly rvfplot, saving(respredplot, replace) nodraw /// title(Residuals vs. Fitted Values) quietly hist residual, freq saving(hist, replace) nodraw /// title(Histogram of the Residuals) generate obs_order = _n quietly twoway connect residual obs_order, saving(obs_order, replace) /// nodraw title(Residuals vs. Order of the Data) drop obs_order graph combine probplot.gph respredplot.gph hist.gph obs_order.gph, /// title(Residual Plots) swilk residual // print the Shapiro-Wilks normality test results * Cook's distance to examine if any observations have great influence on the regression predict cooksd,cooksd gene obs_order = _n twoway spike cooksd obs_order * To summarize, in this example, all of the model assumptions appear to be satisfied. * Residual plot shows 1. constant variance, no outliers, no patterns. * Residuals appear normal in histogram, and nearly normal in normal probability plot. * The Shapiro-Wilks normality test give a big p-value, so the data appear normal. * Cook's distance does not indicate any observations with large influence (all much less than 1). * We have a good model, but we're in the log scale... * What we really want, are the estimates and CIs and PIs in the original scale (not log) * We take the antilog, or exponentiate, the five values for the line and intervals. generate brain_wt_hat = exp(lbrain_wt_hat) generate lci = exp(llci) generate uci = exp(luci) generate lpi = exp(llpi) generate upi = exp(lupi) * Plot the result on the original scale graph twoway (scatter brain_wt body_wt) (line brain_wt_hat body_wt) /// (line lci body_wt, sort) (line uci body_wt, sort) /// (line lpi body_wt, sort) (line upi body_wt, sort) /// , title(Brain Weight Data) subtitle(CI for Line and Prediction Interval in original scale) * Plot the result on the original scale for body_wt <=200 graph twoway (scatter brain_wt body_wt if(body_wt <=200)) (line brain_wt_hat body_wt if(body_wt <=200)) /// (line lci body_wt if(body_wt <=200), sort) (line uci body_wt if(body_wt <=200), sort) /// (line lpi body_wt if(body_wt <=200), sort) (line upi body_wt if(body_wt <=200), sort) /// , title(Brain Weight Data) subtitle(CI for Line and Prediction Interval in original scale) * How was the estimate of the brain_wt of our dear Star-nosed mole? list if body_wt <.07 & body_wt > .05 // We know it was 1, and it was estimated at 1.02115. Pretty good. * Also, you can get CI and PI values, too, using lci, uci, lpi, upi (on the original scale -- not log) * END