* Lab 9, Stat 539 * To guide the steps to an ANCOVA analysis -- to accompany lec8.pdf * ANCOVA -- ANalysis of COVAriance ******************************************************************************** * chds.dta -- child birth weight data clear * load the dataset use chds.dta * print to screen describe * create the mother smoking groups (0=no smoke, 1=some smoke) generate ms_gp=0 if msmoke == 0 replace ms_gp=1 if msmoke > 0 * create the father smoking groups (0=no smoke, 1=some smoke) generate ps_gp=0 if psmoke == 0 replace ps_gp=1 if psmoke > 0 * table of means, standard deviations, and frequencies tabulate ms_gp ps_gp, summarize(weight) * boxplots of the two factor variables quietly graph box weight, over(ms_gp) name(box1,replace) nodraw b1title("Mother smoke 0=none, 1=some") quietly graph box weight, over(ps_gp) name(box2,replace) nodraw b1title("Father smoke 0=none, 1=some") graph combine box1 box2, title(Box Plots of factor variables) * fit the anova with covariates * I doubly specify which are categorial and which are continuous * if you specify one variable type for some of the variables (eg. cat), * it will assume the other variables are the other (eg. cont) * without any specification, anova assumes variables are categorical, cat. anova weight ps_gp ms_gp ps_gp*ms_gp head length gest mage mheight mweight page ped pheight, cat(ps_gp ms_gp) cont(head length gest mage mheight mweight page ped pheight) * same analysis using xi regress (different parameter constraints) * specify the categorical variables with the prefix "i." xi:regress weight i.ps_gp i.ms_gp i.ps_gp*i.ms_gp head length gest mage mheight mweight page ped pheight * significance of individual parameters (same as in xi:regress parameter estimate table because only two levels) testparm _Ims_gp_1 testparm _Ips_gp_1 testparm _Ips_Xms__1_1 * Variable selection -- removing nonsignificant covariates and factors until all are significant * Because we are interested in ms_gp, retain that variable regardless of significance * Backward selection steps below * Begin with full model * remove page * remove pheight * remove ps_gp*ms_gp * remove ps_gp * remove ped * remove mage * remove mheight anova weight ps_gp ms_gp ps_gp*ms_gp head length gest mage mheight mweight page ped pheight, cat(ps_gp ms_gp) cont(head length gest mage mheight mweight page ped pheight) anova weight ps_gp ms_gp ps_gp*ms_gp head length gest mage mheight mweight ped pheight, cat(ps_gp ms_gp) cont(head length gest mage mheight mweight ped pheight) anova weight ps_gp ms_gp ps_gp*ms_gp head length gest mage mheight mweight ped , cat(ps_gp ms_gp) cont(head length gest mage mheight mweight ped ) anova weight ps_gp ms_gp head length gest mage mheight mweight ped , cat(ps_gp ms_gp) cont(head length gest mage mheight mweight ped ) anova weight ms_gp head length gest mage mheight mweight ped , cat( ms_gp) cont(head length gest mage mheight mweight ped ) anova weight ms_gp head length gest mage mheight mweight , cat( ms_gp) cont(head length gest mage mheight mweight ) anova weight ms_gp head length gest mheight mweight , cat( ms_gp) cont(head length gest mheight mweight ) anova weight ms_gp head length gest mweight , cat( ms_gp) cont(head length gest mweight ) * all remaining factors and covariates significant at the 0.05 level anova weight ms_gp head length gest mweight, cat(ms_gp) cont(head length gest mweight) * same analysis using xi regress (different parameter constraints) xi:regress weight i.ms_gp head length gest mweight * significance of individual parameters (same as in xi:regress parameter estimate table because only two levels) testparm _Ims_gp_1 * need to run anova again for lincom statement below (doesn't like the regress for the ps_gp[1]) anova weight ms_gp head length gest mweight, cat(ms_gp) cont(head length gest mweight) tabstat weight, by(ms_gp) stat(mean semean) adjust head length gest mweight, by(ms_gp) se * reproduce the adjust values tabstat head length gest mweight lincom( _b[_cons] + _b[ms_gp[1]] + _b[head]*13.21912 + _b[length]*20.27941 + _b[gest]*39.77059 + _b[mweight]*126.8956 ) lincom( _b[_cons] + _b[ms_gp[2]] + _b[head]*13.21912 + _b[length]*20.27941 + _b[gest]*39.77059 + _b[mweight]*126.8956 ) * create fitted values, yhat drop yhat predict yhat, xb * create residuals, residual drop residual predict residual,r * fit lines to groups on scatter plots * This plot looks strange (not straight best fit line) because there are many * other covariates in the model other than mweight2. * Because everything depends on everything else, this is not a meaningful plot. twoway (scatter weight mweight, mlabel(ms_gp)) (line yhat mweight if ms_gp==0, sort) (line yhat mweight if ms_gp==1, sort) * check for equal variances by ms_gp robvar(weight), by(ms_gp) * Create a four-in-one plot quietly qnorm residual, name(probplot,replace) nodraw title(Normal Prob. Plot of Residuals) quietly rvfplot, name(respredplot,replace) nodraw title(Residuals vs. Fitted Values) quietly hist residual, freq name(hist,replace) nodraw title(Histogram of the Residuals) generate obs_order = _n quietly twoway connect residual obs_order, name(obs_order,replace) nodraw title(Residuals vs. Order of the Data) drop obs_order graph combine probplot respredplot hist obs_order, title(Residual Plots) * print the Shapiro-Wilks normality test results swilk residual * Cook's distance to examine if any observations have great influence on the regression predict cooksd,cooksd gene obs_order = _n twoway spike cooksd obs_order drop obs_order cooksd * END