# TRANSFORMING DATA TO BETTER FIT ASSUMPTIONS

# Based on our last lab time, did our copter designs differ in their fit to a normal distribution? If
# so, this can get awkward. One option is to transform data (e.g., logX, or sqrtX), but you cannot
# transform one set (e.g., log10(Time) for Design 4) while leaving another untransformed.
# Otherwise, you will be comparing very different values (e.g., 10 vs. log10(10) = 1) and find a
# significant difference for designs merely because you converted some.
# IMPORTANT: If you transform a variable, you have to do so for all groups to be compared.
# ALSO: Be very careful with back-transforming data to report means, etc. The log of a mean is
# not the same as the mean of a log.
# But transforms can make distributions normal and variances among sets homogeneous.
# Or nearly so. And so transformations are common in parametric statistics: you may already
# have seen papers reporting log or square-root versions of data. A second option is to use more
# advanced methods (we get to those in a few weeks).

# Meanwhile, try this to get used to the idea:
# import the copter data again, from
# http://jenkins.cos.ucf.edu/wordpress/wp-content/uploads/copter-data-F16.csv
# Here we assume you called it "data"
# Use dplyr to select a design; you choose which one.
# Plot a histogram of Time for that design
# Now compute two transformations of that same data, like this (same as in Excel, etc.)
# Time is a column, so refer to it through its data frame. If you gave your single-design
# subset another name, use that name in place of "data".
logTime <- log10(data$Time) # notice that we specify log base-10. Plain old "log" is the natural log.
srTime <- sqrt(data$Time) # square root transformation of the data
# now make boxplots of those data and compare to the original: More normal? Worse?

# Now look at the skewness plots in the Introduction here:
# http://en.wikipedia.org/wiki/Skewness
# Do some of our copter data look like these? If so, you can often fix it with transformations.
# This is similar to using different measures (e.g., meters vs. feet) for the same object.
# If you think about it that way, transformations are no big deal.

# Here are some guidelines, where we name the new variable newX for any original variable X:
#
# If your data have                    Then use this transformation
# Moderately positive skewness         Square-root, e.g., newX = sqrt(X)
# Substantially positive skewness      Logarithmic, e.g., newX = log10(X)
# Substantially positive skewness      Logarithmic, e.g., newX = log10(X + C),
#   (with zero values)                   where C is a constant, often 1
# Moderately negative skewness         Square-root, e.g., newX = sqrt(K - X),
#                                        where K is a constant, often max(X) + 1
# Substantially negative skewness      Logarithmic, e.g., newX = log10(K - X),
#                                        where K is a constant, often max(X) + 1

# Let's calculate some transforms. Compute a log10 and a square-root transformation of Time for
# our copter data.
lg <- log10(data$Time)
sr <- sqrt(data$Time)
# Both of these variables appear in the Environment window (upper right), but are not yet
# combined with our data. That means our subset command won't work on them yet.
# Combine these transformations with your data file by using cbind to "bind columns":
data <- cbind(data, lg, sr)

# Now repeat histograms, and then run normality and homoscedasticity tests (like you did in the
# last lab) to evaluate plain Time vs. log-transformed Time vs. square-root-transformed Time.
# Did a log- or square-root-transform help make data fit better to our assumptions?
# If so, then you can expect to use that transformation in subsequent analyses.

# You might already imagine that this iterative process, repeated for each variable in a large
# study, can be tedious. That's why more sophisticated analyses that permit other distributions
# (e.g., negative binomial) and relax assumptions of normality and homoscedasticity are verrrrrry
# convenient, much more robust, and often better at detecting the effects you study. But those
# will come later.
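
# A WORKED SKETCH OF THE STEPS ABOVE
# The steps above (transform, bind, re-plot, re-test, back-transform) can be sketched as follows.
# This is one possible way to do it, not the only way, and it rests on these ASSUMPTIONS:
#   - "demo" is a small SIMULATED data set, so its numbers are made up for practice only.
#   - The grouping column is called "Design" here; check names(data) for the real column name.
#   - shapiro.test() and bartlett.test() stand in for whichever normality and
#     homoscedasticity tests you ran in the last lab.
#   - Only base R is used, so the sketch runs without loading dplyr.
set.seed(1)  # make the simulated values repeatable
demo <- data.frame(
  Design = factor(rep(c("D1", "D2", "D3"), each = 30)),
  Time   = c(rlnorm(30, meanlog = 0.5, sdlog = 0.4),  # right-skewed (log-normal) times
             rlnorm(30, meanlog = 0.8, sdlog = 0.4),
             rlnorm(30, meanlog = 1.1, sdlog = 0.4))
)

# Transform Time for ALL designs at once, then bind the new columns to the data
lg <- log10(demo$Time)
sr <- sqrt(demo$Time)
demo <- cbind(demo, lg, sr)

# Histograms for one design (here D1): original vs. transformed
d1 <- subset(demo, Design == "D1")
par(mfrow = c(1, 3))  # three plots side by side
hist(d1$Time, main = "Time")
hist(d1$lg, main = "log10(Time)")
hist(d1$sr, main = "sqrt(Time)")
par(mfrow = c(1, 1))  # back to one plot per page

# Normality: Shapiro-Wilk p-value for each design, for each version of Time
normality_p <- sapply(c("Time", "lg", "sr"), function(v) {
  tapply(demo[[v]], demo$Design, function(x) shapiro.test(x)$p.value)
})
print(round(normality_p, 3))  # rows = designs, columns = versions of Time

# Homoscedasticity: Bartlett's test p-value across designs, for each version of Time
variance_p <- sapply(c("Time", "lg", "sr"), function(v) {
  bartlett.test(demo[[v]], demo$Design)$p.value
})
print(round(variance_p, 3))

# Back-transforming: the back-transformed mean of the logs is the geometric mean of Time,
# which is smaller than the plain (arithmetic) mean whenever the values are not all equal.
plain_mean <- mean(d1$Time)
back_mean  <- 10^mean(d1$lg)
print(c(plain_mean = plain_mean, back_transformed = back_mean))
# Larger p-values in a transformed column suggest that version fits the assumption better,
# but look at the histograms too before settling on a transformation.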