############## EVALUATING NORMALITY & VARIANCE ##############

# Data often are not normally distributed or do not have homogeneous variance (e.g., normal curve 1 is wider than normal curve 2). Both problems violate assumptions of many “frequentist” tests (i.e., tests that assume a particular frequency distribution, such as the normal). We are frequentists this semester – we pay attention to frequency distributions. If you analyze data that violate the assumptions, you may get the wrong answer to your question!

library(tidyverse) # turns on readr, dplyr, ggplot2, and other packages all in one command

# Import the helicopter data from the course web page:  
# https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2021/09/helicopter-data.csv
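
# A minimal sketch of the import step, assuming the CSV at the URL above is still
# available and contains the columns used below (Time, Group, ID). The name
# helicopter_url is only illustrative; `data` matches the object used in the rest
# of this script.
helicopter_url <- "https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2021/09/helicopter-data.csv"
data <- read_csv(helicopter_url) # read_csv() comes from readr, loaded by tidyverse
glimpse(data) # quick check of column names and types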

# Make simple boxplots to squint at the data per Group (because there are only 3). Use R code you learned last week.
# Is each boxplot centered (i.e., medians are in the middle of boxes, whiskers of same length above and below)? If so, then each data set may be normal.
# Are the sizes of boxes and whisker lengths about the same among Groups? If so, then variances may be homogeneous. But let’s run an objective statistical test.
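
# One possible version of those boxplots (a sketch only; your code from last week
# may look different). It assumes `data` holds the imported helicopter data with
# columns Group and Time. The name group_boxplot is illustrative.
group_boxplot <- ggplot(data, aes(x = Group, y = Time, fill = Group)) +
  geom_boxplot(alpha = 0.5) + # one box per Group
  theme_classic()
group_boxplot # prints the plot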

### Normality - this evaluates if distributions are about balanced (i.e., symmetrical bell curves).
# We do this for Each Data Group we wish to compare. ###This is Important###:
# To fairly compare means later, the stats will assume that a group is normally distributed and that variances are similar. Here we first work with normality of Groups of students who dropped helicopters because there are only 3 (B, W, Y). You can do the same thing for other groupings (e.g., ID).

ggplot(data, aes(x = Time, fill = Group, color = Group)) + 
  facet_wrap(~ Group) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) + # plots bars
  geom_density(fill = NA, linewidth = 2) +  # plots a smoothed curve of the bars
  theme_classic() +
  stat_function(fun = dnorm, color = "black", args = with(data, list(mean = mean(Time), sd = sd(Time)))) # plots an idealized normal curve for the overall mean and SD of all Groups combined, so the same curve appears in every panel

# Notice that we kinda did in ggplot what lattice could do?
# Play with the above code by choosing some lines but not all. For example:

ggplot(data, aes(x = Time, fill = Group, color = Group)) + 
  facet_wrap(~ Group) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) + # plots bars
  stat_function(fun = dnorm, color = "black", args = with(data, list(mean = mean(Time), sd = sd(Time)))) # plots an idealized normal curve for the overall mean and SD of all Groups combined

# This skips the density plot to overlay a normal curve on the histogram

# Are the Group data each looking normal? Any that are not so normal-ish?

# Now let's make a QQ plot (aka normality plot) of the data for a more careful view

dataB <- filter(data, Group == "B") # makes a data set for only the B group.
##### You should make matching data sets for W and Y groups too
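# For example, following the same pattern as dataB (the names dataW and dataY are
# only suggestions, and this assumes Group uses the codes "W" and "Y"):
dataW <- filter(data, Group == "W") # only the W group
dataY <- filter(data, Group == "Y") # only the Y group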

# Now make a quick QQ plot  - - for only one group - - 
qqnorm(dataB$Time)
qqline(dataB$Time, col="red")

# You can edit and re-run to see other Groups one at a time

# And you can overlay all three QQ plots in ggplot:
ggplot(data, aes(sample = Time, fill = Group, color=Group)) + 
  stat_qq() + 
  stat_qq_line()

# Lastly, we run a stats test for normality:

shapiro.test(dataB$Time)  # repeat for the other groups
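
# A sketch for running the Shapiro-Wilk test on every Group in one step, rather
# than repeating the line above. It assumes `data` has columns Group and Time; the
# names shapiro_by_group, W, and p_value are illustrative only.
shapiro_by_group <- data %>%
  group_by(Group) %>%
  summarise(
    n = n(),                                      # observations per Group
    W = unname(shapiro.test(Time)$statistic),     # Shapiro-Wilk W statistic
    p_value = shapiro.test(Time)$p.value          # p-value for the null of normality
  )
shapiro_by_group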

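# Now compare the Shapiro test to your graphs – How well did your visual appraisal match the objective stats test? The null hypothesis of the Shapiro test is that the data are normal, so a small p-value (e.g., < 0.05) suggests the data are not normal.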

# You should graph AND run a Shapiro test to evaluate normality per group – neither one is perfect

### Homogeneity of Variance – this assumption is More Important than normality
# That means you should be more careful about meeting this assumption.

# Two tests are common: Bartlett's and Levene's. 
# Bartlett's test works well IF data are normal, but it is sensitive to non-normal data and can give misleading results for them.
# Levene's test is more robust to non-normal data than Bartlett's.
# I typically just use Levene’s but let’s compare them below.

# To run Bartlett's test simply enter

bartlett.test(Time ~ Group, data = data)

# To run Levene's test, first install (only once) and then load the package “car” (for Companion to Applied Regression – nothing to do with cars data), then enter
if (!requireNamespace("car", quietly = TRUE)) install.packages("car") # installs only if car is not already installed
library(car)

data$fGroup <- factor(data$Group) # To ensure groups are factors. 

leveneTest(Time ~ fGroup, data = data) # by default car centers each group on its median, which is more robust than the mean

# How do these statistics compare to your graphs? Did you think histograms had about the same spreads?
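
# To put numbers on the spreads you saw in the graphs, here is a quick sketch that
# summarises each Group. It assumes columns Group and Time; spread_by_group and
# its column names are illustrative.
spread_by_group <- data %>%
  group_by(Group) %>%
  summarise(
    n = n(),                              # observations per Group
    variance = var(Time, na.rm = TRUE),   # variance of Time within the Group
    SD = sd(Time, na.rm = TRUE)           # standard deviation (square root of variance)
  )
spread_by_group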

# So overall: How do our data look? Specifically:
# Can we assume Groups are normally distributed? 
# And do Groups have homogeneous variance?

# What about another treatment? For example, ID. Notice that you will have to make IDs a factor first.
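
# A sketch for ID, assuming the data have a column named ID with at least two
# observations per ID and that car is loaded as above (fID is an illustrative
# name, following fGroup):
data$fID <- factor(data$ID) # make IDs a factor
leveneTest(Time ~ fID, data = data)
bartlett.test(Time ~ fID, data = data)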

# You have now evaluated important statistical assumptions graphically and by statistical tests – these tools will be important for many analyses hereafter.