##### Modeling with Generalized Linear Models (GLMs) #####
# Here we:
# a. use several packages to compute GLMs - where we should get the same results
# b. try several different assumed distributions
# c. evaluate assumed distributions to test which one is most appropriate

# For a. we use classic lm, glm, glm.nb (in the MASS package), and gamlss.
# For b. we try Gaussian (normal), Poisson, negative binomial, Gamma, and beaucoup more via gamlss.
# For c. we use the performance and gamlss packages.

# You should already have installed MASS, and glm is part of base R. A new package today is:
install.packages("gamlss") # this runs GLMs and GAMs (GAMs use a smoothing function we skip here)

### Then load needed packages
library(tidyverse)
library(MASS)
library(gamlss)
library(performance)

### Download and attach the beetlemeans.csv file from the class web site: https://is.gd/Hud8Xb
beetlemeans <- read_csv("~/Downloads/beetlemeans.csv")
attach(beetlemeans)
# this file includes 21 sites with:
# mncount = mean beetle count through ~ 1.5 years of sampling
# PC_1 ... PC_4 values. A Principal Components Analysis summarized many landscape variables around each site, where:
# PC_1 represented residential vs. forested wetlands (34% of variation);
# PC_2 separated non-forested wetlands and low-density residential lands from other land uses (24% of variation);
# PC_3 further separated low- and medium-density residential lands from other land (11% more variance);
# PC_4 identified unique combinations of aquatic habitats (11% more variance);
# Combined, the 4 PCs represented 80% of land cover variation among sample sites.

##### THE GOAL: identify the distribution that best meets model assumptions so we can reliably predict beetle abundance based on land cover.

### First we run a classic lm assuming a Gaussian distribution.
lm.modelg <- lm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(lm.modelg)
check_model(lm.modelg)
# What do you see? Do you like the fit to assumptions? Can we do better?
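### ASIDE (a sketch with simulated data, not the beetle data): a quick way to see why a Gaussian
### lm struggles with count-like responses is overdispersion - for real counts the variance often
### grows far beyond the mean. The names y_pois and y_nb below are invented for illustration.

```r
# Simulated counts: Poisson keeps variance ~ mean; negative binomial is overdispersed.
set.seed(1)
y_pois <- rpois(1000, lambda = 5)          # equidispersed counts (variance ~ mean)
y_nb   <- rnbinom(1000, mu = 5, size = 1)  # overdispersed counts (variance = mu + mu^2/size)
c(mean = mean(y_pois), var = var(y_pois))  # variance lands near 5
c(mean = mean(y_nb),   var = var(y_nb))    # variance lands far above 5
```

### If mncount behaves like y_nb, distributions built for skewed data should beat the Gaussian.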
### Now we try a Poisson distribution - built for counts (non-negative integers), right-skewed at low means.
### Caution: Poisson assumes variance = mean, so it cannot absorb overdispersion on its own;
### and mncount is a mean, not a true integer count, so R may warn about non-integer values.
glm.modelp <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=poisson, data=beetlemeans)
summary(glm.modelp)
check_model(glm.modelp)
### Any better? Compare the summary tables: Why are the Estimates so different?

### Now let's try a GLM based on the negative binomial distribution (often works better for overdispersed count data).
glm.modeln <- glm.nb(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans) # notice the .nb?
summary(glm.modeln)
check_model(glm.modeln)
### Are the residual assumptions any better?

### Now let's try a flexible distribution: the Gamma distribution
# One catch: it often works well, but takes some computation to translate back to original units,
# because the default link is an inverse (i.e., 1/Y)
# A good summary is at https://en.wikipedia.org/wiki/Gamma_distribution
glm.modelG <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma, data=beetlemeans)
summary(glm.modelG)
check_model(glm.modelG)
# Now try Gamma with a log link to more closely match the Poisson and negative binomial above
glm.modelGl <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma(link="log"), data=beetlemeans)
summary(glm.modelGl)
check_model(glm.modelGl)
### FYI - a link function transforms the mean *inside* the model, so you still obtain a model for the mean of the raw data.
### If you instead log-transform the data yourself, your linear model describes the log data - very different!
### Each family has a default link, and you can specify others. See ?family for some basic info;
### for more, see https://www.r-bloggers.com/?p=170983

######## Take a look back at the summary outputs: Notice the different Estimates and p values???? ########
######## And which distribution would you use? ########

# The work above uses the classic approach - now we get more inventive, using gamlss.
### A GAM is a Generalized Additive Model, which adds a smoothing function to a GLM. If we don't smooth, we get a GLM.
### The gamlss package provides a blizzard of family options:
?
gamlss.family # together with the "?" just above, this opens the help page listing every family

### First run gamlss to confirm it gives you the same result as plain old lm, assuming a *Normal* error distribution:
gamlss.model <- gamlss(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(gamlss.model)
plot(gamlss.model) # instead of check_model
wp(gamlss.model) # kinda like the Homog. of Variance plot - flat? within bounds?

### Now let's have gamlss find a "best" family in its vast menu, based on penalized deviance (AIC):
alt.fits <- fitDist(mncount, data=beetlemeans)
alt.fits$fits # lists the AIC per distribution - lower is better
alt.fits$failed # shows distributions that did not work

### So it looks like Gamma was "best" - let's ensure we get the same as using glm with Gamma:
gamlss.model <- gamlss(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=GA, data=beetlemeans)
summary(gamlss.model)
Rsq(gamlss.model)
### and compare to the log-link glm using Gamma
summary(glm.modelGl)
# How do the coefficients compare? And are you confident that gamlss is giving you a GLM?

### And let's check assumptions:
plot(gamlss.model) # instead of check_model, but showing roughly the same info
wp(gamlss.model) # flatter than the Normal one? within bounds?

### And let's compare predicted values to observed mncount:
plot(exp(predict(gamlss.model)) ~ beetlemeans$mncount)
# not a tight fit, but predictions generally rise with observed counts,
# and the model seems to break down with counts > ~600; I bet other, new predictors could help
### NOTE: we "antilog" (exp) the predictions to get back to the raw data scale, because predict()
### returns values on the log-link scale by default

##### So what have you learned?
# A big problem with lm is its assumptions, which can make your results wrong.
# glm (and related packages) helps get past those simple assumptions, but has limits.
# Alternative packages get you about the same answer with the same models - that's reassuring!
# gamlss opens the door to many different distribution families for GLMs (and GAMs).
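### ASIDE (a sketch with simulated data, not the beetle models): when the diagnostic plots leave you
### torn between families, AIC offers a numeric tiebreaker - lower is better. Below, simulated
### overdispersed counts are fit with both Poisson and negative binomial GLMs; the names x, y,
### m_pois, and m_nb are invented for illustration.

```r
library(MASS) # for glm.nb

# Simulated overdispersed counts: the negative binomial model should earn the lower AIC.
set.seed(42)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 1)
m_pois <- glm(y ~ x, family = poisson)
m_nb   <- glm.nb(y ~ x)
AIC(m_pois, m_nb) # data.frame of df and AIC; expect m_nb to come out clearly lower
```

### The same call works on the models fitted above, e.g. AIC(glm.modelp, glm.modeln, glm.modelGl),
### provided each model was fit to the same response.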