Multiple Regression

Bio300B Lecture 8

Richard J. Telford (Richard.Telford@uib.no)

Institutt for biovitenskap, UiB

8 October 2024

Outline

Types of models
Interactions
Model selection
- Exploratory data analysis
- Anova for hypothesis testing
Multi-collinearity
Autocorrelation
Reporting statistics

Adding more variables

# Two way anova - two categorical predictors
mod_anova2 <- lm(bill_length_mm ~ sex + species, data = penguins)

# ancova - one categorical one continuous predictor
mod_ancova <- lm(bill_length_mm ~ sex + flipper_length_mm, data = penguins)

# Multiple regression - two continuous
mod_mult <- lm(bill_length_mm ~ body_mass_g + flipper_length_mm, data = penguins)

summary(mod_anova2)


Call:
lm(formula = bill_length_mm ~ sex + species, data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-6.087 -1.377 -0.071  1.225 11.013 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        36.977      0.231   160.2   <2e-16 ***
sexmale             3.694      0.255    14.5   <2e-16 ***
speciesChinstrap   10.010      0.341    29.3   <2e-16 ***
speciesGentoo       8.698      0.287    30.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.32 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.821, Adjusted R-squared:  0.819 
F-statistic:  503 on 3 and 329 DF,  p-value: <2e-16

summary(mod_ancova)


Call:
lm(formula = bill_length_mm ~ sex + flipper_length_mm, data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-9.720 -2.812 -0.812  2.021 19.764 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -4.4669     3.2364   -1.38     0.17    
sexmale             2.0727     0.4568    4.54    8e-06 ***
flipper_length_mm   0.2359     0.0163   14.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.03 on 330 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.46,  Adjusted R-squared:  0.457 
F-statistic:  141 on 2 and 330 DF,  p-value: <2e-16

summary(mod_mult)


Call:
lm(formula = bill_length_mm ~ body_mass_g + flipper_length_mm, 
    data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-8.806 -2.590 -0.705  1.991 18.829 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3.436694   4.580553   -0.75     0.45    
body_mass_g        0.000662   0.000567    1.17     0.24    
flipper_length_mm  0.221865   0.032348    6.86  3.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.12 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.433, Adjusted R-squared:  0.43 
F-statistic:  129 on 2 and 339 DF,  p-value: <2e-16

Different formula

What is an interaction

Effect of one predictor depends on value of another

Interaction between categorical predictors

mod <- lm(flipper_length_mm ~ species * sex, data = penguins)
summary(mod)


Call:
lm(formula = flipper_length_mm ~ species * sex, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.795  -3.411   0.088   3.459  17.589 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               187.795      0.662  283.72  < 2e-16 ***
speciesChinstrap            3.941      1.174    3.36  0.00088 ***
speciesGentoo              24.912      0.995   25.04  < 2e-16 ***
sexmale                     4.616      0.936    4.93  1.3e-06 ***
speciesChinstrap:sexmale    3.560      1.661    2.14  0.03278 *  
speciesGentoo:sexmale       4.218      1.397    3.02  0.00274 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.66 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.84,  Adjusted R-squared:  0.837 
F-statistic:  342 on 5 and 327 DF,  p-value: <2e-16

Understanding coefficients

#| label: coef-explain-app
#| standalone: true
#| viewerHeight: 650

library(shiny)
library(bslib)

ui <- page_sidebar(
  sidebar = sidebar(checkboxGroupInput("pred",
    "Predictor",
    choiceValues = c(
      "Food", "Temperature",
      "Interaction"
    ), choiceNames = c(
      "Food (A vs. B)",
      "Temperature (Low vs. High)", "Interaction"
    )
  )), card(
    textOutput("formula")
  ), card(
    card_header("Coefficients"),
    tableOutput("coef_table")
  ), card(plotOutput("plot")),
  tags$head(tags$style("#coef_table td{\n                     position:relative;\n                     };\n\n                     ")),
)
server <- function(input, output, session) {
  data <- reactive({
    b0 <- 5
    b1 <- 3
    b2 <- 2
    b3 <- 2
    food <- factor(rep(c("A", "B"), times = 24))
    temperature <- factor(rep(c("Low", "High"), each = 24),
      levels = c("Low", "High")
    )
    response <- b0 + b1 * (food == "B") + b2 * (temperature ==
      "High") + b3 * (food == "B" & temperature ==
      "High") + rnorm(48)
    data.frame(
      food = food, temperature = temperature,
      response = response
    )
  })
  form <- "response ~"
  form2 <- reactive({
    if ("Interaction" %in% input$pred) {
      paste(form, "food * temperature")
    } else if ("Food" %in% input$pred & "Temperature" %in%
      input$pred) {
      paste(form, "food + temperature")
    } else if ("Food" %in% input$pred) {
      paste(form, "food")
    } else if ("Temperature" %in% input$pred) {
      paste(form, "temperature")
    } else {
      paste(form, "1")
    }
  })
  observe({
    if ("Interaction" %in% input$pred) {
      updateCheckboxGroupInput(session, "pred", selected = c(
        "Food",
        "Temperature", "Interaction"
      ))
    }
  })
  model <- reactive({
    lm(form2(), data = data())
  })
  coefs <- reactive({
    coef(model())
  })
  coef_colours <- reactive({
    c(
      `(Intercept)` = "skyblue", foodB = "red", temperatureHigh = "green",
      `foodB:temperatureHigh` = "orange"
    )[names(coefs())]
  })
  coef_table <- reactive({
    c1 <- "<div style=\"width: 100%; height: 100%; z-index: 0; background-color: "
    c2 <- "; position:absolute; top: 0; left: 0; padding:5px;\">\n<span>"
    c3 <- "</span></div>"
    tab <- data.frame(
      Beta = paste0(
        c1, coef_colours(),
        c2, "β", seq_along(coefs()), c3
      ), Coefficent = names(coefs()),
      Estimate = coefs()
    )
    tab
  })
  output$formula <- renderText(paste("Model formula:", form2()))
  output$coef_table <- renderTable(coef_table(), sanitize.text.function = function(x) x)
  output$plot <- renderPlot({
    set.seed(1)
    f <- as.formula(form2())
    fc <- as.character(f)
    ylim <- c(0, max(data()$response))
    par(mar = c(1.5, 2.5, .5, .5), mgp = c(1.5, 0.5, 0))
    if (fc[3] == "1") {
      stripchart(data()$response,
        method = "jitter",
        jitter = 0.1, vertical = TRUE, pch = 1, ylim = ylim,
        ylab = "response"
      )
    } else {
      stripchart(f,
        data = data(), method = "jitter",
        jitter = 0.1, vertical = TRUE, pch = 1, ylim = ylim
      )
    }
    cols <- c("food", "temperature")[c(grepl(
      "food",
      fc[3]
    ), grepl("temperature", fc[3]))]
    if (length(cols) == 0) {
      cols <- "food"
    }
    pred <- predict(model(), newdata = unique(data()[,
      cols,
      drop = FALSE
    ]))
    points(seq_along(pred), pred,
      col = "#832424", pch = 16,
      cex = 4
    )
    xs <- seq_along(pred) - 0.2
    arrows(xs, rep(0, length(pred)), xs, rep(
      coefs()[1],
      length(pred)
    ),
    col = coef_colours()[1], lwd = 4,
    length = 0.1
    )
    pos <- 2 - 0.2
    if ("foodB" %in% names(coefs())) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["foodB"],
        col = coef_colours()["foodB"], lwd = 4, length = 0.1
      )
      pos <- pos + 1
    }
    if ("temperatureHigh" %in% names(coefs())) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["temperatureHigh"],
        col = coef_colours()["temperatureHigh"], lwd = 4,
        length = 0.1
      )
      pos <- pos + 1
    }
    if (all(c("foodB", "temperatureHigh") %in% names(coefs()))) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["foodB"],
        col = coef_colours()["foodB"], lwd = 4, length = 0.1
      )
      arrows(pos, coefs()[1] + coefs()["foodB"], pos,
        coefs()[1] + coefs()["foodB"] + coefs()["temperatureHigh"],
        col = coef_colours()["temperatureHigh"], lwd = 4,
        length = 0.1
      )
      if ("foodB:temperatureHigh" %in% names(coefs())) {
        arrows(pos, coefs()[1] + coefs()["foodB"] +
          coefs()["temperatureHigh"], pos, coefs()[1] +
          coefs()["foodB"] + coefs()["temperatureHigh"] +
          coefs()["temperatureHigh"],
        col = coef_colours()["foodB:temperatureHigh"],
        lwd = 4, length = 0.1
        )
      }
    }
  })
}
shinyApp(ui, server)

Interaction between continuous and categorical predictors

mod_interact <- lm(bill_length_mm ~ sex * flipper_length_mm, data = adelie)
summary(mod_interact)


Call:
lm(formula = bill_length_mm ~ sex * flipper_length_mm, data = adelie)

Residuals:
   Min     1Q Median     3Q    Max 
-6.395 -1.329  0.029  1.427  5.438 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                39.7935     8.3528    4.76  4.6e-06 ***
sexmale                   -20.2163    11.0648   -1.83    0.070 .  
flipper_length_mm          -0.0135     0.0445   -0.30    0.762    
sexmale:flipper_length_mm   0.1217     0.0583    2.09    0.039 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.11 on 142 degrees of freedom
Multiple R-squared:  0.385, Adjusted R-squared:  0.372 
F-statistic: 29.6 on 3 and 142 DF,  p-value: 6.43e-15

Interaction between two continuous predictors

mod <- lm(flipper_length_mm ~ body_mass_g * bill_length_mm, data = adelie)
summary(mod)


Call:
lm(formula = flipper_length_mm ~ body_mass_g * bill_length_mm, 
    data = adelie)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.653  -3.101  -0.003   3.369  16.585 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 2.04e+02   5.72e+01    3.58  0.00048 ***
body_mass_g                -6.63e-03   1.52e-02   -0.44  0.66330    
bill_length_mm             -9.19e-01   1.48e+00   -0.62  0.53531    
body_mass_g:bill_length_mm  3.17e-04   3.89e-04    0.82  0.41546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.79 on 142 degrees of freedom
Multiple R-squared:  0.229, Adjusted R-squared:  0.212 
F-statistic:   14 on 3 and 142 DF,  p-value: 4.62e-08

Power needed for interactions

Main effect = \(\bar{y_1}\) - \(\bar{y_2}\)
Interaction = (\(\bar{y_1}\) - \(\bar{y_2}\)) - (\(\bar{y_3}\) - \(\bar{y_4}\))

set.seed(42)
sigma <- 10
n <- 1000
random <- tibble(
  x = sample(c("A", "B"), n, replace = TRUE),
  z = sample(factor(1:2), n, replace = TRUE),
  y = rnorm(n, mean = 0, sd = sigma)
)

lm(y ~ x * z, data = random) |> tidy()

# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.281     0.615     0.457  0.647 
2 xB            -1.90      0.864    -2.20   0.0280
3 z2             0.759     0.880     0.863  0.388 
4 xB:z2          1.06      1.24      0.849  0.396

Formula for interactions

y ~ x + z + x:z

y ~ x * z

y ~ (x + z)^2

Use y ~ x + I(x^2) to get a quadratic. Or y ~ poly(x, 2)

mod1 <- lm(flipper_length_mm ~ species + sex + species:sex, data = penguins)
mod2 <- lm(flipper_length_mm ~ species * sex, data = penguins)
mod3 <- lm(flipper_length_mm ~ (species + sex)^2, data = penguins)

coef(mod1)

             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218

coef(mod2)

             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218

coef(mod3)

             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218

Model Selection

You want to the best model!

The best model for what?

Exploratory data analysis
Inference (hypothesis testing)
Predictions

Exploratory analysis

Consider all plausible models
P-values not meaningful
High type I error rate
Suggest hypotheses for hypothesis testing with independent data

The more biology you include in the model

Automatic model selection

Last resort - many problems and biases

Forward selection
Backwards selection
All possible models

library(MuMIn)
library(conflicted)
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")

penguins2 <- penguins |>
  drop_na() |>
  select(-year)
full <- lm(body_mass_g ~ ., data = penguins2, na.action = "na.fail")

AIC

AIC Akaike information criterion \(2k - 2 \times log(likelihood)\)
(k the number of parameters)

Measure of how well the model fit the data

Penalised for model complexity

AICc correction for small sample sizes

AIC weights - probability model is best of those tested

dredge(full) |> gt::gt()

(Intercept)	bill_depth_mm	bill_length_mm	flipper_length_mm	island	sex	species	df	logLik	AICc	delta	weight
-1461.0	67.22	18.204	15.95	NA	+	+	8	-2354	4724	0.000	7.682e-01
-1500.0	67.58	18.189	16.24	+	+	+	10	-2354	4728	3.543	1.307e-01
-1211.5	74.38	NA	17.54	NA	+	+	7	-2357	4729	4.537	7.948e-02
-1245.4	74.51	NA	17.85	+	+	+	9	-2357	4732	8.027	1.388e-02
-759.1	NA	21.633	17.85	NA	+	+	7	-2360	4734	9.536	6.526e-03
-785.6	NA	21.528	18.14	+	+	+	9	-2359	4738	13.153	1.070e-03
-365.8	NA	NA	20.02	NA	+	+	6	-2364	4741	16.649	1.863e-04
-390.0	NA	NA	20.32	+	+	+	8	-2364	4745	20.118	3.287e-05
844.0	87.93	26.537	NA	NA	+	+	7	-2369	4752	27.263	9.232e-07
829.8	88.33	26.664	NA	+	+	+	9	-2369	4756	31.392	1.172e-07
1577.0	102.04	NA	NA	NA	+	+	6	-2375	4763	38.571	3.236e-09
1576.2	102.15	NA	NA	+	+	+	8	-2375	4767	42.751	4.000e-10
2169.3	NA	32.537	NA	NA	+	+	6	-2378	4768	43.901	2.252e-10
2167.9	NA	32.535	NA	+	+	+	8	-2378	4772	48.086	2.778e-11
-4282.1	141.77	39.718	20.23	NA	NA	+	7	-2385	4784	59.631	8.643e-14
3372.4	NA	NA	NA	NA	+	+	5	-2388	4786	61.394	3.580e-14
-4324.0	141.82	39.599	20.58	+	NA	+	9	-2384	4786	62.049	2.580e-14
3375.8	NA	NA	NA	+	+	+	7	-2388	4790	65.442	4.731e-15
-2219.3	-50.03	NA	35.69	+	+	NA	7	-2397	4807	82.977	7.364e-19
-2115.6	-50.34	6.052	33.94	+	+	NA	8	-2396	4808	83.566	5.488e-19
-4484.7	181.95	NA	25.53	NA	NA	+	6	-2400	4811	86.962	1.004e-19
-4524.1	181.54	NA	25.95	+	NA	+	8	-2399	4814	89.247	3.204e-20
-3720.1	NA	NA	39.30	+	+	NA	6	-2401	4815	90.409	1.792e-20
-3629.6	NA	5.808	37.64	+	+	NA	7	-2401	4816	91.143	1.242e-20
-1709.5	180.78	54.061	NA	NA	NA	+	6	-2405	4822	97.444	5.318e-22
-1735.5	181.30	54.252	NA	+	NA	+	8	-2404	4825	100.777	1.005e-22
-3864.1	NA	60.117	27.54	NA	NA	+	6	-2411	4833	109.078	1.583e-24
-2246.8	-86.95	NA	38.19	NA	+	NA	5	-2412	4834	109.703	1.158e-24
-2288.5	-86.09	-2.329	38.83	NA	+	NA	6	-2412	4836	111.527	4.653e-25
-3902.9	NA	59.868	27.96	+	NA	+	8	-2410	4836	111.692	4.284e-25
-5410.3	NA	NA	46.98	NA	+	NA	4	-2427	4863	138.223	7.424e-31
-5433.5	NA	-5.201	48.21	NA	+	NA	5	-2427	4864	139.128	4.723e-31
-1000.8	256.55	NA	NA	NA	NA	+	5	-2431	4871	146.983	9.299e-33
-1014.6	257.16	NA	NA	+	NA	+	7	-2430	4875	150.703	1.448e-33
-5707.1	51.24	13.748	42.90	+	NA	NA	7	-2438	4891	166.751	4.741e-37
-6114.7	56.67	NA	47.40	+	NA	NA	6	-2441	4895	170.817	6.209e-38
-4013.2	NA	NA	40.61	NA	NA	+	5	-2443	4895	171.067	5.478e-38
-4047.8	NA	NA	41.09	+	NA	+	7	-2442	4897	173.113	1.970e-38
200.5	NA	90.298	NA	NA	NA	+	5	-2444	4899	174.468	1.001e-38
-4355.0	NA	16.883	39.66	+	NA	NA	6	-2445	4902	177.784	1.906e-39
182.0	NA	90.510	NA	+	NA	+	7	-2444	4903	178.295	1.477e-39
-4687.9	NA	NA	44.89	+	NA	NA	5	-2450	4909	184.809	5.685e-41
-6537.6	19.88	NA	51.77	NA	NA	NA	4	-2460	4928	203.674	4.552e-45
-5872.1	NA	NA	50.15	NA	NA	NA	3	-2461	4928	203.836	4.198e-45
-5836.3	NA	4.959	48.89	NA	NA	NA	4	-2461	4929	204.973	2.377e-45
-6445.5	17.84	3.293	50.76	NA	NA	NA	5	-2460	4930	205.355	1.965e-45
4784.5	-155.12	44.769	NA	+	+	NA	7	-2470	4955	230.236	7.767e-51
1747.4	NA	60.616	NA	+	+	NA	6	-2505	5022	297.540	1.886e-65
6551.4	-257.31	36.519	NA	NA	+	NA	5	-2507	5023	299.035	8.929e-66
7580.2	-211.52	NA	NA	+	+	NA	6	-2506	5024	299.241	8.057e-66
3706.2	NA	NA	NA	NA	NA	+	4	-2513	5035	310.285	3.221e-68
3709.7	NA	NA	NA	+	NA	+	6	-2513	5039	314.411	4.093e-69
8779.1	-299.34	NA	NA	NA	+	NA	4	-2529	5066	341.864	4.474e-75
1697.5	-26.23	76.000	NA	+	NA	NA	6	-2537	5086	361.234	2.782e-79
1239.3	NA	76.905	NA	+	NA	NA	5	-2538	5086	361.346	2.632e-79
4375.8	NA	NA	NA	+	+	NA	5	-2562	5133	409.061	1.145e-89
3413.5	-145.51	74.813	NA	NA	NA	NA	4	-2595	5199	474.205	8.187e-104
746.1	NA	74.025	NA	NA	+	NA	4	-2614	5236	512.010	5.056e-112
5590.3	-54.76	NA	NA	+	NA	NA	5	-2615	5240	516.032	6.767e-113
4719.2	NA	NA	NA	+	NA	NA	4	-2618	5244	519.963	9.482e-114
388.8	NA	86.792	NA	NA	NA	NA	3	-2629	5264	539.833	4.594e-118
7520.0	-193.01	NA	NA	NA	NA	NA	3	-2658	5322	598.047	1.050e-130
3862.3	NA	NA	NA	NA	+	NA	3	-2667	5340	615.648	1.582e-134
4207.1	NA	NA	NA	NA	NA	NA	2	-2700	5404	679.945	1.727e-148

Inference

Test small number of a priori hypothesis.

Use anova() to compare nested models

H₀ there is no interaction between sex and species for predicting flipper length

mod1 <- lm(body_mass_g ~ species + sex, data = penguins)
mod2 <- lm(body_mass_g ~ species * sex, data = penguins)
anova(mod1, mod2)

Analysis of Variance Table

Model 1: body_mass_g ~ species + sex
Model 2: body_mass_g ~ species * sex
  Res.Df      RSS Df Sum of Sq    F Pr(>F)    
1    329 32979185                             
2    327 31302628  2   1676557 8.76  2e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::Anova vs anova

mod2 <- lm(body_mass_g ~ species + sex, data = penguins)
mod3 <- lm(body_mass_g ~ sex + species, data = penguins)

anova - tests terms sequentially
car::Anova - marginal test

anova(mod2)

Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value Pr(>F)    
species     2 1.45e+08 72595110     724 <2e-16 ***
sex         1 3.71e+07 37090262     370 <2e-16 ***
Residuals 329 3.30e+07   100241                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mod3)

Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value Pr(>F)    
sex         1 3.89e+07 38878897     388 <2e-16 ***
species     2 1.43e+08 71700792     715 <2e-16 ***
Residuals 329 3.30e+07   100241                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::Anova(mod2)

Anova Table (Type II tests)

Response: body_mass_g
            Sum Sq  Df F value Pr(>F)    
species   1.43e+08   2     715 <2e-16 ***
sex       3.71e+07   1     370 <2e-16 ***
Residuals 3.30e+07 329                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::Anova(mod3)

Anova Table (Type II tests)

Response: body_mass_g
            Sum Sq  Df F value Pr(>F)    
sex       3.71e+07   1     370 <2e-16 ***
species   1.43e+08   2     715 <2e-16 ***
Residuals 3.30e+07 329                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multicollinearity

Two or more predictor variables in a multiple regression model are highly correlated. Example: pH and calcium
Coefficient estimates are unstable
- erratic change in response to small changes in the model or the data.
Solve by having lots of data

col <- performance::check_collinearity(full)
col

# Check for Multicollinearity

Low Correlation

   Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 island 3.73 [ 3.14,  4.49]         1.93      0.27     [0.22, 0.32]
    sex 2.33 [ 2.00,  2.77]         1.53      0.43     [0.36, 0.50]

Moderate Correlation

              Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
    bill_length_mm 6.10 [ 5.06,  7.40]         2.47      0.16     [0.14, 0.20]
     bill_depth_mm 6.10 [ 5.06,  7.41]         2.47      0.16     [0.14, 0.20]
 flipper_length_mm 6.80 [ 5.63,  8.26]         2.61      0.15     [0.12, 0.18]

High Correlation

    Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 species 63.52 [51.73, 78.06]         7.97      0.02     [0.01, 0.02]

plot(col)

Autocorrelation

Linear models assume residuals are independent

If data are spatially or temporally structured, residuals may be correlated.

Positive autocorrelation

confidence intervals are too narrow
increases risk of false positive (Type I error)

Luteinising Hormone concentration

Code

library(ggfortify)

lh <- fortify(lh) |>
  mutate(time = Index * 10) |>
  rename(concentration = Data)

mod_lh <- lm(concentration ~ time, data = lh)

augment(mod_lh, interval = "confidence") |>
  ggplot(aes(x = time, y = concentration)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "#ee5050") +
  geom_line(aes(y = .fitted), colour = "#ee5050") +
  labs(x = "Time, minutes", y = "Hormone concentration")

Detecting autocorrelation

For time series with equally-spaced observations, use autocorrelation function (ACF)

autoplot(acf(mod_lh$residuals, plot = FALSE))

Detecting autocorrelation

Tests for equally-spaced observations

Durbin-Watson test

performance::check_autocorrelation(mod_lh)

Warning: Autocorrelated residuals detected (p < .001).

Other methods for spatial data and non-equally spaced data

Solution - use a model that accounts for autocorrelation

generalised least squares (nlme::gls())

Reporting regression results - tables

Make a table - estimates, confidence intervals, p-values

Can calculate confidence intervals with broom::tidy() and format output

library(gt)
tidy(mod_mult, conf.int = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value) |>
  mutate(p.value = format.pval(p.value, eps = 0.001)) |>
  gt() |>
  fmt_number(
    columns = c(estimate, conf.low, conf.high),
    decimals = 4
  )

term	estimate	conf.low	conf.high	p.value
(Intercept)	−3.4367	−12.4466	5.5732	0.45
body_mass_g	0.0007	−0.0005	0.0018	0.24
flipper_length_mm	0.2219	0.1582	0.2855	<0.001

`gtsummary`

Or make table with gtsummary

library(gtsummary)
tbl <- tbl_regression(mod_mult,
  label = list(
    body_mass_g = "Body mass g",
    flipper_length_mm = "Flipper length mm"
  ),
  estimate_fun = label_style_sigfig(digits = 4)
)
tbl

Characteristic	Beta	95% CI¹	p-value
Body mass g	0.0007	-0.0005, 0.0018	0.2
Flipper length mm	0.2219	0.1582, 0.2855	<0.001
¹ CI = Confidence Interval

Reporting regression results - inline

Can extract values directly from regression .

estimate
confidence intervals
p-value, degrees of freedom, statistic

Flipper length is associated with longer bill length (estimate = `r round(coef(mod_mult)["flipper_length_mm"], 2)` mm/mm, 95% CI = `r round(confint(mod_mult)["flipper_length_mm", ], 2)` p = `r format.pval(tidy(mod_mult) |> filter(term == "flipper_length_mm") |> select("p.value"), eps = 0.001)`).

Flipper length is associated with longer bill length (estimate = 0.22 mm/mm, 95% CI = 0.16, 0.29 p = <0.001).

Also function inline_text() from gtsummary

See APA guide for reporting statistics