##### Modeling with Generalized Linear Models (GLMs)
# Here we do a few things:
### a. We use several packages to compute GLMs - where we Should get the same results
### b. We try several different assumed distributions
### c. We evaluate assumed distributions to test for which one is most appropriate
### All this is done BEFORE we can begin comparing models with AICs

##### THE GOAL: identify the distribution that best enables model assumptions to be satisfied. We defer matters of AIC comparisons and R2 "critiques" of models until next week, because those must wait until assumptions are OK.

# For a. we use classic lm, glm, glm.nb, and glmmTMB. The last one is built for mixed effects but will run here.
# For b. we try gaussian (normal), Poisson, negative binomial, and Gamma distributions.
# For c. we use the DHARMa and performance packages

# You should already have installed some of these packages. New ones today should be:
install.packages("DHARMa") # this tests your data against randomizing simulations based on your data

# Then load needed packages
library(tidyverse)
library(MASS)
library(glmmTMB)
library(DHARMa)
library(performance)

# Download the beetles.csv file from the class web site
# this file includes 21 sites with:
#   mean beetle count through ~ 1.5 years of sampling
#   PC1... PC4 values. Principal Components Analysis summarized many landscape variables around each site, where:
#     PC_1 represented residential vs. forested wetlands (34% of variation);
#     PC_2 separated non-forested wetlands and low-density residential lands from other land uses (24% of variation);
#     PC_3 further separated low- and medium-density residential lands from other land (11% more variance);
#     PC_4 identified unique combinations of aquatic habitats (11% more variance);
#     Combined the 4 PCs represented 80% of land cover variation among sample sites.

attach(beetlemeans)

### First we run a classic lm assuming gaussian distribution. 
lm.modelg <- lm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(lm.modelg)
simulationOutput <- simulateResiduals(fittedModel = lm.modelg)
testDispersion(simulationOutput) # using DHARMa. NOTE - this will not work in some cases below - no problem - we also have performance.
check_overdispersion(lm.modelg)  # using DHARMa
check_model(lm.modelg) # uisng performance

### Notice: DHARMA did not detect overdispersion (i.e., too many points to the right of the normal curve, but performance disagrees:
#   Points in the Linearity & H of V plot are more scattered to the right and curves are not very flat - that suggests a log-based family (like a transform) is needed
# Normality of Residuals does not look to straight and certainly does not fit a normal curve

### Now we try a Poisson distribution - built for integers >0, and for overdispersed (right-skewed) data
glm.modelp <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=poisson, data=beetlemeans)
summary(glm.modelp)
simulationOutput <- simulateResiduals(fittedModel = glm.modelp)
testDispersion(simulationOutput)
check_overdispersion(glm.modelp)
check_model(glm.modelp)

### Any better? 

### How about trying that in glmmTMB - because no one package has a patent on getting this right
glmm.modelp <- glmmTMB(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=poisson, data=beetlemeans)
summary(glmm.modelp)
simulationOutput <- simulateResiduals(fittedModel = glmm.modelp)
testDispersion(simulationOutput)
check_overdispersion(glmm.modelp)
check_model(glmm.modelp)

### Let's try a negative binomial distribution instead - which is supposed to work better for count data. First a glm:
glm.modeln <- glm.nb(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, data=beetlemeans)
summary(glm.modeln)
simulationOutput <- simulateResiduals(fittedModel = glm.modeln)
testDispersion(simulationOutput)
check_overdispersion(glm.modeln)
check_model(glm.modeln)

### Any better? 

### How about trying that in glmmTMB again:
glmm.modeln <- glmmTMB(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=nbinom2, data=beetlemeans)
summary(glmm.modeln)
simulationOutput <- simulateResiduals(fittedModel = glmm.modeln)
testDispersion(simulationOutput)
check_overdispersion(glmm.modeln)
check_model(glmm.modeln)

### Notice any differences between glm.nb and glmmTMB this time?

### Now let's try a flexible distribution. Gamma distributions can fit lots of different data shapes by tweaking parameters for the curve.
# default is an inverse link, which is similar to a transformation. 
# A good summary is at https://en.wikipedia.org/wiki/Gamma_distribution
# A problem - though it often works well, it does not easily translate back to original units  

glm.modelG <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma)
summary(glm.modelG)
simulationOutput <- simulateResiduals(fittedModel = glm.modelG)
testDispersion(simulationOutput)
check_model(glm.modelG)

glmm.modelG <- glmmTMB(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma, data=beetlemeans)
summary(glmm.modelG)
simulationOutput <- simulateResiduals(fittedModel = glmm.modelG)
testDispersion(simulationOutput)
check_overdispersion(glmm.modelG)
check_model(glmm.modelG)

# Now we try gamma with a different link - here log to more clsely match Poisson and negative binomial above
glm.modelGl <- glm(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma(link="log"))
summary(glm.modelGl)
simulationOutput <- simulateResiduals(fittedModel = glm.modelGl)
testDispersion(simulationOutput)
check_model(glm.modelGl)

glmm.modelGl <- glmmTMB(mncount ~ PC_1 + PC_2 + PC_3 + PC_4, family=Gamma(link="log"), data=beetlemeans)
summary(glmm.modelGl)
simulationOutput <- simulateResiduals(fittedModel = glmm.modelGl)
testDispersion(simulationOutput)
check_overdispersion(glmm.modelGl)
check_model(glmm.modelGl)

### What other distribution families and links are available? Check out Help with these: 
? family
? family.glm
? family_glmmTMB

### Try a few!

### So which one would you choose to model these data? 
### Now try a stepAIC on the distribution family and link you choose to obtain a most-parsimonious model
# for example:
stepAIC(glmm.modelGl)

### The End