##### Modeling with Generalized Linear Models (GLMs) #####
# Here we:
# a. use several packages to compute GLMs - where we should get the same results
# b. try several different assumed distributions
# c. evaluate assumed distributions to test which one is most appropriate

# For a. we use classic lm, glm, glm.nb (in the MASS package), and gamlss.
# For b. we try Gaussian (normal), Poisson, negative binomial, Gamma, and beaucoup more via gamlss.
# For c. we use the performance and gamlss packages.

# You should already have installed MASS, and glm is part of base R. A new package today is:
install.packages("gamlss") # this runs GLMs and GAMs (GAMs use a smoothing function we skip here)

### Then load needed packages
library(tidyverse)
library(MASS)
library(gamlss)
library(performance)

### Download and attach the beetlemeans.csv file from the class web site: https://is.gd/Hud8Xb
beetlemeans <- read_csv("~/Downloads/beetlemeans.csv")
attach(beetlemeans)
# this file includes 21 sites with:
# mncount = mean beetle count through ~ 1.5 years of sampling
# PC_1 ... PC_4 values. A Principal Components Analysis summarized many landscape variables around each site, where:
# PC_1 represented residential vs. forested wetlands (34% of variation);
# PC_2 separated non-forested wetlands and low-density residential lands from other land uses (24% of variation);
# PC_3 further separated low- and medium-density residential lands from other land (11% more variance);
# PC_4 identified unique combinations of aquatic habitats (11% more variance);
# Combined, the 4 PCs represented 80% of land cover variation among sample sites.

##### THE GOAL: identify the distribution that best meets model assumptions so we can reliably predict beetle abundance based on land cover.

### First we run a classic lm assuming a Gaussian distribution.
lm.modelg <- lm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(lm.modelg)
check_model(lm.modelg)
# What do you see? Do you like the fit to assumptions? Can we do better?
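### ASIDE (a sketch with simulated data, not the beetle data): a quick way to see why a Gaussian
### lm struggles with count-like responses is overdispersion - for real counts the variance often
### grows far beyond the mean. The names y_pois and y_nb below are invented for illustration.

```r
# Simulated counts: Poisson keeps variance ~ mean; negative binomial is overdispersed.
set.seed(1)
y_pois <- rpois(1000, lambda = 5)          # equidispersed counts (variance ~ mean)
y_nb   <- rnbinom(1000, mu = 5, size = 1)  # overdispersed counts (variance = mu + mu^2/size)
c(mean = mean(y_pois), var = var(y_pois))  # variance lands near 5
c(mean = mean(y_nb),   var = var(y_nb))    # variance lands far above 5
```

### If mncount behaves like y_nb, distributions built for skewed data should beat the Gaussian.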
### Now we try a Poisson distribution - built for counts (non-negative integers), right-skewed at low means.
### Caution: Poisson assumes variance = mean, so it cannot absorb overdispersion on its own;
### and mncount is a mean, not a true integer count, so R may warn about non-integer values.
glm.modelp <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=poisson, data=beetlemeans)
summary(glm.modelp)
check_model(glm.modelp)
### Any better? Compare the summary tables: Why are the Estimates so different?

### Now let's try a GLM based on the negative binomial distribution (often works better for overdispersed count data).
glm.modeln <- glm.nb(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans) # notice the .nb?
summary(glm.modeln)
check_model(glm.modeln)
### Are the residual assumptions any better?

### Now let's try a flexible distribution: the Gamma distribution
# One catch: it often works well, but takes some computation to translate back to original units,
# because the default link is an inverse (i.e., 1/Y)
# A good summary is at https://en.wikipedia.org/wiki/Gamma_distribution
glm.modelG <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma, data=beetlemeans)
summary(glm.modelG)
check_model(glm.modelG)
# Now try Gamma with a log link to more closely match the Poisson and negative binomial above
glm.modelGl <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma(link="log"), data=beetlemeans)
summary(glm.modelGl)
check_model(glm.modelGl)
### FYI - a link function transforms the mean *inside* the model, so you still obtain a model for the mean of the raw data.
### If you instead log-transform the data yourself, your linear model describes the log data - very different!
### Each family has a default link, and you can specify others. See ?family for some basic info;
### for more, see https://www.r-bloggers.com/?p=170983

######## Take a look back at the summary outputs: Notice the different Estimates and p values???? ########
######## And which distribution would you use? ########

# The work above uses the classic approach - now we get more inventive, using gamlss.
### A GAM is a Generalized Additive Model, which adds a smoothing function to a GLM. If we don't smooth, we get a GLM.
### The gamlss package provides a blizzard of family options:
?
gamlss.family # together with the "?" just above, this opens the help page listing every family

### First run gamlss to confirm it gives you the same result as plain old lm, assuming a *Normal* error distribution:
gamlss.model <- gamlss(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(gamlss.model)
plot(gamlss.model) # instead of check_model
wp(gamlss.model) # kinda like the Homog. of Variance plot - flat? within bounds?

### Now let's have gamlss find a "best" family in its vast menu, based on penalized deviance (AIC):
alt.fits <- fitDist(mncount, data=beetlemeans)
alt.fits$fits # lists the AIC per distribution - lower is better
alt.fits$failed # shows distributions that did not work

### So it looks like Gamma was "best" - let's ensure we get the same as using glm with Gamma:
gamlss.model <- gamlss(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=GA, data=beetlemeans)
summary(gamlss.model)
Rsq(gamlss.model)
### and compare to the log-link glm using Gamma
summary(glm.modelGl)
# How do the coefficients compare? And are you confident that gamlss is giving you a GLM?

### And let's check assumptions:
plot(gamlss.model) # instead of check_model, but showing roughly the same info
wp(gamlss.model) # flatter than the Normal one? within bounds?

### And let's compare predicted values to observed mncount:
plot(exp(predict(gamlss.model)) ~ beetlemeans$mncount)
# not a tight fit, but predictions generally rise with observed counts,
# and the model seems to break down with counts > ~600; I bet other, new predictors could help
### NOTE: we "antilog" (exp) the predictions to get back to the raw data scale, because predict()
### returns values on the log-link scale by default

##### So what have you learned?
# A big problem with lm is its assumptions, which can make your results wrong.
# glm (and related packages) helps get past those simple assumptions, but has limits.
# Alternative packages get you about the same answer with the same models - that's reassuring!
# gamlss opens the door to many different distribution families for GLMs (and GAMs).
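### ASIDE (a sketch with simulated data, not the beetle models): when the diagnostic plots leave you
### torn between families, AIC offers a numeric tiebreaker - lower is better. Below, simulated
### overdispersed counts are fit with both Poisson and negative binomial GLMs; the names x, y,
### m_pois, and m_nb are invented for illustration.

```r
library(MASS) # for glm.nb

# Simulated overdispersed counts: the negative binomial model should earn the lower AIC.
set.seed(42)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 1)
m_pois <- glm(y ~ x, family = poisson)
m_nb   <- glm.nb(y ~ x)
AIC(m_pois, m_nb) # data.frame of df and AIC; expect m_nb to come out clearly lower
```

### The same call works on the models fitted above, e.g. AIC(glm.modelp, glm.modeln, glm.modelGl),
### provided each model was fit to the same response.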