############## EVALUATING NORMALITY & VARIANCE ##############

# Data are often not normally distributed, and groups often do not have homogeneous variance
# (e.g., the curve for group 1 is wider than the curve for group 2). Both problems violate
# assumptions of parametric statistics, the statistics we use this semester. If you analyze data
# that violate these assumptions, you may get the wrong answer to your question!

##### Start up RStudio and import and attach the helicopter data from the course web page:
# http://jenkins.cos.ucf.edu/wordpress/wp-content/uploads/copter-data-F16.csv
# Below we assume you called it "data"

# Remove rows with NAs, because mean() and sd() return NA when values are missing,
# so the normal curve below could not be drawn
data <- na.omit(data)

# Make boxplots to squint at the data FOR EACH Design. Use R code you learned last week.
# Is each boxplot symmetric (i.e., the median is in the middle of the box and the whiskers are
# about the same length above and below)? If so, then each data set may be normal.

# Are the sizes of the boxes and the whisker lengths about the same among Designs?
# If so, then variances may be homogeneous.
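
# A minimal sketch of such side-by-side boxplots (not necessarily the code from last
# week). It uses simulated times in a hypothetical data frame "demo" with made-up
# Design codes 1 to 3 so that it runs on its own; with the real data, use data = data.
set.seed(1)
demo <- data.frame(Design = rep(c(1, 2, 3), each = 20),
                   Time = rnorm(60, mean = rep(c(2.0, 2.5, 3.0), each = 20), sd = 0.3))
bp <- boxplot(Time ~ Design, data = demo, xlab = "Design", ylab = "Time")
# One column per design: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats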

# Generate a subset for EACH helicopter design. 
# IMPORTANT: We do this because we need to evaluate normality for EACH treatment.
# Remember, we wish to compare data sets, where the assumption is that each data set 
# (e.g., Design) is normally distributed and that variances of Designs are equal. Commands here
# are shown for only one subset; you need to repeat them for each of the subsets. For example:

data4 <- subset(data, Design == 4)
 
# This creates a subset containing only design 4 from the data frame named "data".
# Note: If you had called this "4data" you would get an error, because object names
# cannot start with a number.
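
# As an alternative to typing one subset() command per design, split() can build all
# the subsets at once. This is only a sketch on simulated data: the data frame "demo",
# its Design codes 1 to 3, and the names "by_design" and "demo2" are hypothetical.
set.seed(1)
demo <- data.frame(Design = rep(c(1, 2, 3), each = 20),
                   Time = rnorm(60, mean = rep(c(2.0, 2.5, 3.0), each = 20), sd = 0.3))
by_design <- split(demo, demo$Design)  # named list with one data frame per design
names(by_design)                       # the design codes, as character strings
demo2 <- by_design[["2"]]              # same rows as subset(demo, Design == 2)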

### Normality ###
# First we evaluate normality, then homogeneity of variance.

# Calculate the mean and SD of each design:

mnT <- mean(data4$Time)
sdT <- sd(data4$Time)

# Draw a histogram of the data and overlay the normal curve with the same mean and SD
h <- hist(data4$Time, breaks = 10, density = 10, col = "gray", xlab = "Copter Times")
xfit <- seq(min(data4$Time), max(data4$Time), length.out = 40)
yfit <- dnorm(xfit, mean = mnT, sd = sdT)
# Rescale the density to counts (bin width x sample size) to match the histogram's y-axis
yfit <- yfit * diff(h$mids[1:2]) * length(data4$Time)
lines(xfit, yfit, col = "black", lwd = 2)

# Does each helicopter design look normal? Are any clearly not normal?

# Now let's make a QQ plot (a.k.a. normal probability plot) of the data.
# Points that fall close to the line suggest normality.

qqnorm(data4$Time)
qqline(data4$Time)
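
# To compare designs at a glance, the QQ plots can be drawn side by side. A sketch on
# simulated data; "demo", its Design codes 1 to 3, and "times_by_design" are
# hypothetical names, and the 1 x 3 layout assumes three designs.
set.seed(1)
demo <- data.frame(Design = rep(c(1, 2, 3), each = 20),
                   Time = rnorm(60, mean = rep(c(2.0, 2.5, 3.0), each = 20), sd = 0.3))
times_by_design <- split(demo$Time, demo$Design)
old_par <- par(mfrow = c(1, 3))  # one row, three panels; keep the old settings
for (d in names(times_by_design)) {
  qqnorm(times_by_design[[d]], main = paste("Design", d))
  qqline(times_by_design[[d]])
}
par(old_par)                     # restore the previous plot layout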

# Lastly, let's run a Shapiro-Wilk test on the data (null hypothesis = data are normal,
# so a small p-value is evidence AGAINST normality):

shapiro.test(data4$Time)
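
# The test can also be run on every design in one step. A sketch on simulated data;
# "demo", its Design codes 1 to 3, and "sw_p" are hypothetical names.
set.seed(1)
demo <- data.frame(Design = rep(c(1, 2, 3), each = 20),
                   Time = rnorm(60, mean = rep(c(2.0, 2.5, 3.0), each = 20), sd = 0.3))
# Split Time by Design, run shapiro.test() on each piece, and keep only the p-value
sw_p <- sapply(split(demo$Time, demo$Design),
               function(x) shapiro.test(x)$p.value)
round(sw_p, 3)  # one p-value per design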

### Homogeneity of Variance ###
# This assumption is even more important for parametric statistics than normality.
# You will often hear that statistics are "robust to violations of assumptions", but that
# only goes so far, and you should not push that boundary for homogeneity of variance as
# much as you may for normality.
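
# Before any formal test, it can help to look at the variance of each design directly.
# A sketch on simulated data in which the third design is deliberately more variable;
# "demo", its Design codes 1 to 3, and "vars" are hypothetical names.
set.seed(1)
demo <- data.frame(Design = rep(c(1, 2, 3), each = 20),
                   Time = rnorm(60, mean = 2.5, sd = rep(c(0.2, 0.3, 0.6), each = 20)))
vars <- tapply(demo$Time, demo$Design, var)  # one variance per design
vars
max(vars) / min(vars)  # ratio of largest to smallest; near 1 means similar variances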

# Two tests are common for homogeneity of variance: Bartlett's and Levene's. Bartlett's works
# well if data are normal, but not if data are non-normal. Levene's test is more robust to
# non-normality than Bartlett's. Thus:

### Choose the right test based on normality tests above.

# To run Bartlett's test, simply enter

bartlett.test(Time ~ Design, data = data)

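# To run Levene's test, first install (once) and then load the package "car", then enter

# install.packages("car")  # uncomment and run once if "car" is not yet installed
library(car)
fDesign <- factor(data$Design)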

# The factor() command tells R that our numeric Design codes (31, etc.) are levels of a
# factor (i.e., experimental categories) rather than values of a quantitative variable. Then enter

leveneTest(data$Time ~ fDesign)
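
# For intuition about what Levene's test does: it is a one-way ANOVA on how far each
# observation lies from the center of its own group. The sketch below does this by hand
# in base R using group medians, which (as far as we know) is also the default center in
# leveneTest(). Simulated data; "demo", "absdev", and "lev" are hypothetical names.
set.seed(1)
demo <- data.frame(Design = factor(rep(c(1, 2, 3), each = 20)),
                   Time = rnorm(60, mean = 2.5, sd = rep(c(0.2, 0.3, 0.6), each = 20)))
group_medians <- ave(demo$Time, demo$Design, FUN = median)  # median of each row's own group
demo$absdev <- abs(demo$Time - group_medians)               # absolute deviation from it
lev <- anova(lm(absdev ~ Design, data = demo))              # ANOVA on those deviations
lev  # a small p-value suggests that the variances differ among designs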

# So how do our data look? 
# Specifically, can we assume Designs are normally distributed? 
# And do Designs have homogeneous variance?