############## EVALUATING NORMALITY & VARIANCE ##############
# Data are often not normally distributed, or they lack homogeneous variance
# (i.e., normal curve 1 is wider than normal curve 2). Both problems violate
# assumptions of "frequentist" statistics (i.e., statistics that assume a
# frequency distribution). We are frequentists this semester: we pay attention
# to frequency distributions. If you analyze data that violate the assumptions,
# you may get the wrong answer to your question!

library(tidyverse) # turns on readr, dplyr, ggplot2, and other packages all in one command

# Import the helicopter data from the course web page:
# https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2021/09/helicopter-data.csv
data <- read_csv("https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2021/09/helicopter-data.csv") # stored as "data", the name used in the code below

# Make simple boxplots to squint at the data per Group (because there are only 3).
# Use R code you learned last week.
# Is each boxplot centered (i.e., medians are in the middle of boxes, whiskers of
# same length above and below)? If so, then each data set may be normal.
# Are the sizes of boxplots and whisker lengths about the same among Groups?
# If so, then variances may be homogeneous. But let's run an objective statistical test.

### Normality - this evaluates if distributions are about balanced (i.e., symmetrical bell curves).
# We do this for Each Data Group we wish to compare. ###This is Important###:
# To fairly compare means later, the stats will assume that each group is normally
# distributed and that variances are similar. Here we first work with normality of
# Groups of students who dropped helicopters because there are only 3 (B, W, Y).
# You can do the same thing for other groupings (e.g., ID).
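
# A minimal sketch of the per-Group boxplots described above, for anyone who
# wants a starting point. It ASSUMES the helicopter data are already imported
# as a data frame named "data" with columns Time and Group (the names used in
# the rest of this script); adjust the names if yours differ.
library(ggplot2)
ggplot(data, aes(x = Group, y = Time, fill = Group)) +
  geom_boxplot(alpha = 0.5) + # one box per Group; the thick bar is the median
  theme_classic()
# Look for medians near the middle of each box (a hint of symmetry) and for
# boxes and whiskers of similar size among Groups (a hint of similar variance).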
ggplot(data, aes(x = Time, fill = Group, color = Group)) +
  facet_wrap("Group") +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) + # plots bars
  geom_density(fill = NA, linewidth = 2) + # plots a smoothed curve of the bars
  theme_classic() +
  stat_function(fun = dnorm, color = "black",
                args = with(data, c(mean = mean(Time), sd = sd(Time))))
# stat_function plots an idealized normal curve for the mean and SD of all the data
# (so the same black curve appears in every panel)
# Notice that we kinda did in ggplot what lattice could do?

# Play with the above code by choosing some lines but not all. For example:
ggplot(data, aes(x = Time, fill = Group, color = Group)) +
  facet_wrap("Group") +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) + # plots bars
  stat_function(fun = dnorm, color = "black",
                args = with(data, c(mean = mean(Time), sd = sd(Time)))) # idealized normal curve
# This skips the density plot to overlay a normal curve on the histogram
# Are the Group data each looking normal? Any that are not so normal-ish?

# Now let's make a QQ plot (aka normality plot) of the data for a more careful view
dataB <- filter(data, Group == "B") # makes a data set for only the B group.
##### You should make matching data sets for W and Y groups too

# Now make a quick QQ plot - - for only one group - -
qqnorm(dataB$Time)
qqline(dataB$Time, col = "red")
# You can edit and re-run to see other Groups one at a time

# And you can overlay all three QQ plots in ggplot:
ggplot(data, aes(sample = Time, fill = Group, color = Group)) +
  stat_qq() +
  stat_qq_line()

# Lastly, we run a stats test for normality:
shapiro.test(dataB$Time) # repeat for the other groups
# A small p-value (e.g., p < 0.05) suggests the data depart from a normal distribution.
# Now compare the Shapiro test to your graphs: How well did your visual appraisal match the objective stats test?
# You should graph AND run a Shapiro test to evaluate normality per group: neither one is perfect

### Homogeneity of Variance - this assumption is More Important than normality
# That means you should be more careful about meeting this assumption.
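
# A quick numeric look before any formal test (a sketch, not part of the
# original exercise). It ASSUMES "data" has columns Time and Group. The
# column names n, variance, and sd in the output are just illustrative labels.
library(dplyr)
data %>%
  group_by(Group) %>%
  summarise(n = n(),               # number of drops per Group
            variance = var(Time),  # spread of Time within the Group
            sd = sd(Time))         # same spread, in the units of Time
# Similar variances among Groups hint at homogeneity, but eyeballing numbers
# is no more objective than eyeballing graphs, so a formal test is still needed.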
# Two tests are common: Bartlett's and Levene's.
# Bartlett's test works IF data are normal, but not if data are non-normal.
# Levene's test is more robust to non-normal data than Bartlett's.
# I typically just use Levene's but let's compare them below.

# To run Bartlett's test simply enter
bartlett.test(data$Time ~ data$Group)

# To run Levene's test, first install and then load the package "car"
# (for Companion to Applied Regression - nothing to do with cars data) then enter
install.packages("car") # check first - this may already be installed
library(car)
data$fGroup <- factor(data$Group) # To ensure groups are factors.
leveneTest(data$Time ~ data$fGroup)
# For both tests, a small p-value (e.g., p < 0.05) suggests that variances differ among groups.

# How do these statistics compare to your graphs? Did you think histograms had about the same spreads?
# So overall: How do our data look? Specifically:
# Can we assume Groups are normally distributed?
# And do Groups have homogeneous variance?

# What about another treatment? For example, ID. Notice that you will have to make IDs a factor first.

# You have now evaluated important statistical assumptions graphically and by
# statistical tests. These tools will be important for many analyses hereafter.
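
# A minimal sketch of the ID comparison suggested above. It ASSUMES "data"
# has a column named ID and that the car package is installed. The name fID
# is hypothetical, chosen only to match fGroup.
library(car)
data$fID <- factor(data$ID) # make IDs a factor first
by(data$Time, data$fID, shapiro.test) # one Shapiro test per ID; errors if any ID has fewer than 3 values
leveneTest(Time ~ fID, data = data)    # homogeneity of variance among IDs
bartlett.test(Time ~ fID, data = data) # for comparison; trust it only if the data look normal
# With many IDs and few drops per ID, each test has little power, so look at
# the graphs too before deciding.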