############## EVALUATING NORMALITY & VARIANCE ##############
# Data are often not normally distributed, nor do they always have homogeneous variance (i.e.,
# normal curve 1 is wider than normal curve 2). Both problems violate assumptions of parametric
# statistics, the statistics we use this semester. If you analyze data that violate the
# assumptions, you may get the wrong answer to your question!

##### Start up RStudio and import and attach the helicopter data from the course web page:
# http://jenkins.cos.ucf.edu/wordpress/wp-content/uploads/copter-data-F16.csv
# Below we assume you called it "data".

# Remove rows with NAs because calculation of normal curves etc. cannot work with those
data <- na.omit(data)

# Make boxplots to squint at the data FOR EACH Design. Use R code you learned last week.
# Is each boxplot centered (i.e., medians are in the middle of boxes, whiskers of same length
# above and below)? If so, then each data set may be normal.
# Are the sizes of boxplots and whisker lengths about the same among Designs?
# If so, then variances may be homogeneous.

# Generate a subset for EACH helicopter design.
# IMPORTANT: We do this because we need to evaluate normality for EACH treatment.
# Remember, we wish to compare data sets, where the assumption is that each data set
# (e.g., Design) is normally distributed and that variances of Designs are equal. Commands here
# are shown for only one subset; you need to repeat these for each of the subsets. For example:
data4 <- subset(data, data$Design == "4")
# This creates a subset for Design 4 only from the data frame named "data".
# Note: If you had called this "4data" you would get an error, because object names in R
# cannot start with a number.

### Normality ###
# First we evaluate normality, then homogeneity of variance.
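
##### Optional sketch: one set of Times per Design without retyping subset() #####
# A minimal sketch, not part of the original lab steps. It assumes your data frame has
# columns named Design and Time, as in the subset() command above. "toy" is made-up,
# simulated data (NOT the course data set) so that these lines run on their own;
# replace "toy" with your own "data" to use it for real. The name "times_by_design"
# is also just an illustration.
set.seed(1)                                      # makes the simulated numbers repeatable
toy <- data.frame(Design = rep(c(4, 31), each = 20),
                  Time   = c(rnorm(20, mean = 2.0, sd = 0.2),
                             rnorm(20, mean = 2.5, sd = 0.4)))
# split() returns a named list holding one vector of Times for each Design
times_by_design <- split(toy$Time, toy$Design)
names(times_by_design)                           # one entry per Design code
# Mean and SD of every Design in one step
sapply(times_by_design, mean)
sapply(times_by_design, sd)
# A single Design can then be pulled out by its code, e.g. times_by_design[["4"]]
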
# Calculate the mean and SD of each design:
mnT <- mean(data4$Time)
sdT <- sd(data4$Time)

# Calculate and draw the normal curve on the histogram of the data
h <- hist(data4$Time, breaks=10, density=10, col="gray", xlab="Copter Times")
xfit <- seq(min(data4$Time), max(data4$Time), length.out=40)
yfit <- dnorm(xfit, mean=mnT, sd=sdT)
# Rescale the density to counts: bin width times number of observations
yfit <- yfit*diff(h$mids[1:2])*length(data4$Time)
lines(xfit, yfit, col="black", lwd=2)
# Are the helicopter treatments each looking normal? Any that are not so normal-ish?

# Now let's make a QQ plot (aka normality plot) of the data
qqnorm(data4$Time)
qqline(data4$Time)

# Lastly, let's run a Shapiro-Wilk test on the data (null hypothesis = normal):
shapiro.test(data4$Time)

### Homogeneity of Variance ###
# This assumption is even more important for parametric statistics than normality.
# You might often hear that statistics are "robust to violations of assumptions".
# That only goes so far, and you should not push that boundary for homogeneity of variance as
# much as you may for normality.
# Two tests are common for homogeneity of variance: Bartlett's and Levene's. Bartlett's works
# well if data are normal, but not if data are non-normal. Levene's test is more robust to
# non-normality than Bartlett's. Thus:
### Choose the right test based on the normality tests above.

# To run Bartlett's test, simply enter
bartlett.test(data$Time ~ data$Design)

# To run Levene's test, first install (once) and then load the package "car":
# install.packages("car")
library(car)
# Then enter
fDesign <- factor(data$Design)
# This command tells R that our numeric Design codes (31, etc.) are factors, i.e.,
# experimental categories rather than quantitative variables. Then enter
leveneTest(data$Time ~ fDesign)

# So how do our data look?
# Specifically, can we assume Designs are normally distributed?
# And do Designs have homogeneous variance?
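
##### Optional sketch: collect the test results for every Design at once #####
# A minimal sketch, not part of the original lab steps. It assumes columns named Design
# and Time. "toy" is made-up, simulated data (NOT the course data set) so that these lines
# run on their own; replace "toy" with your own "data" to use it for real. The names
# "shapiro_p" and "bartlett_p" are just illustrations.
set.seed(1)                                      # makes the simulated numbers repeatable
toy <- data.frame(Design = rep(c(4, 31), each = 20),
                  Time   = c(rnorm(20, mean = 2.0, sd = 0.2),
                             rnorm(20, mean = 2.5, sd = 0.4)))
toy$fDesign <- factor(toy$Design)                # treat Design codes as categories
# tapply() runs the function on the Times of each Design separately;
# here it keeps only the p-value of each Shapiro-Wilk test
shapiro_p <- tapply(toy$Time, toy$fDesign, function(x) shapiro.test(x)$p.value)
shapiro_p                                        # one p-value per Design
# By the conventional 0.05 cutoff, a small p-value suggests that Design is not normal
# Bartlett's test also returns its p-value in the same way
bartlett_p <- bartlett.test(Time ~ fDesign, data = toy)$p.value
bartlett_p                                       # a small p-value suggests unequal variances
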