# TRANSFORMING DATA TO BETTER FIT ASSUMPTIONS

# Non-normal distributions - and especially heterogeneous variances - cause fundamental problems in frequentist stats.
# This applies to a response variable (Y axis).
# One option to handle this is to transform the data (e.g., log[Y] or sqrt[Y]). Then your results are in log or sqrt units.
# This is similar to using different measures (e.g., meters vs. feet) for the same object.
# If you think about it that way, transformations are no big deal. And common in biology for some good reasons.
# But they are not always best - we will use a second [better?] option later.

# Meanwhile, try this to get used to the idea:
# Import the copter data again, from
# https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2021/09/helicopter-data.csv
# Here we assume you called it "data"

library(dplyr) # for %>% and filter()

# Select a Group - try a few until you find the one whose histogram of Time is most skewed
my.choice <- data %>% filter(Group == "W")
hist(my.choice$Time)

# Now compute two transformations of that same data (my.choice):
my.choice$logTime <- log10(my.choice$Time) # notice that we specify log base-10; "log" is the natural log in R
my.choice$srTime  <- sqrt(my.choice$Time)  # square-root transformation of the data

# And plot them
par(mfrow = c(2, 1)) # stacks the graphs that follow - 2 rows, one column
hist(my.choice$logTime)
hist(my.choice$srTime)

# Which one is more normal? Can you run a stats test to find out? [hint, hint: our prior class]

# Let's make data that are way skewed and then try transformations
set.seed(91929)       # set seed for reproducibility
N <- 10000            # sample size
y_rlnorm <- rlnorm(N) # make a log-normally distributed vector with N values
hist(y_rlnorm, breaks = 100) # and graph it

# Now try some different transformations on those data, where
# we already know a log-transform should be best because the data were made using a log-normal distribution.
# Use code like the two transformation lines above to make:
# 1. square-root transform, e.g., sqrt(X)
# 2. log-base-e transform, e.g., log(X)
# 3. log-base-10 transform, e.g., log10(X)
# 4. log-base-your-choice transform for data that may include zeroes, e.g., log10(X + C), where C is a constant (often 1)

# Now repeat the stacked histograms (as above, but with 4 rows for those 4 transforms), and run normality tests again on those four options.

# Now try your choice of transformations on a funky data set with a skewed but different shape
funkydata <- as.data.frame(rnbinom(N, size = 5.855, mu = 1/exp(-3.689)))
names(funkydata) <- "counts"         # name the single column so we can refer to it
hist(funkydata$counts, breaks = 100) # and graph it
# Which transformation works best?

# REMEMBER: If you transform a variable, you have to do so for all groups to be compared.
# ALSO: Be very careful with back-transforming data to report means, etc.
# The log of a mean is not the same as the mean of a log. What does that mean?
log.of.mean <- log(mean(y_rlnorm))
mean.of.log <- mean(log(y_rlnorm))
log.of.mean
mean.of.log
percent.dif <- 100 * (log.of.mean - mean.of.log) / log.of.mean
percent.dif
# This difference is especially great for very skewed (log-normal) data, but can be substantial for other data too
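# The "stats test" teased above can be sketched with the Shapiro-Wilk test
# (shapiro.test(), in base R) from our prior class. This is a minimal sketch on
# fresh simulated log-normal data (not the copter data), since shapiro.test()
# accepts at most 5000 values:

```r
# Minimal sketch: compare the normality of two transforms with Shapiro-Wilk.
# shapiro.test() is limited to 5000 values, so we draw 5000 here.
set.seed(91929)
y <- rlnorm(5000)                        # log-normal data, so the log-transform should win
p.sqrt <- shapiro.test(sqrt(y))$p.value  # square-root transform: still right-skewed
p.log  <- shapiro.test(log10(y))$p.value # log10 transform: should look close to normal
p.sqrt # tiny p-value = strong evidence of non-normality
p.log  # much larger p-value = consistent with normality
```

# A larger p-value means less evidence against normality, so the transform with
# the larger p-value is doing the better job here.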
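# To see concretely what goes wrong when back-transforming: exponentiating the
# mean of logged data returns the GEOMETRIC mean, not the arithmetic mean.
# A minimal sketch on simulated log-normal data (exp() undoes log(); you would
# use 10^x to undo log10()):

```r
# Back-transforming a mean computed on the log scale gives the geometric mean,
# which is smaller than the arithmetic mean for any non-constant positive data.
set.seed(91929)
y <- rlnorm(10000)
arith.mean <- mean(y)           # arithmetic mean; theory says about exp(0.5), i.e., ~1.65
geo.mean   <- exp(mean(log(y))) # back-transformed log-mean; theory says about exp(0), i.e., ~1
arith.mean
geo.mean
arith.mean > geo.mean           # TRUE: the back-transformed value understates the mean
```

# So if you analyze on the log scale and then report exp(mean), say explicitly
# that you are reporting a geometric mean.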