Multiple Regression

Bio300B Lecture 8

Richard J. Telford (Richard.Telford@uib.no)

Institutt for biovitenskap, UiB

8 October 2024

Outline

  • Types of models
  • Interactions
  • Model selection
    • Exploratory data analysis
    • Anova for hypothesis testing
  • Multi-collinearity
  • Autocorrelation
  • Reporting statistics

Adding more variables

# Two way anova - two categorical predictors
mod_anova2 <- lm(bill_length_mm ~ sex + species, data = penguins)

# ancova - one categorical one continuous predictor
mod_ancova <- lm(bill_length_mm ~ sex + flipper_length_mm, data = penguins)

# Multiple regression - two continuous
mod_mult <- lm(bill_length_mm ~ body_mass_g + flipper_length_mm, data = penguins)
summary(mod_anova2)

Call:
lm(formula = bill_length_mm ~ sex + species, data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-6.087 -1.377 -0.071  1.225 11.013 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        36.977      0.231   160.2   <2e-16 ***
sexmale             3.694      0.255    14.5   <2e-16 ***
speciesChinstrap   10.010      0.341    29.3   <2e-16 ***
speciesGentoo       8.698      0.287    30.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.32 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.821, Adjusted R-squared:  0.819 
F-statistic:  503 on 3 and 329 DF,  p-value: <2e-16
summary(mod_ancova)

Call:
lm(formula = bill_length_mm ~ sex + flipper_length_mm, data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-9.720 -2.812 -0.812  2.021 19.764 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -4.4669     3.2364   -1.38     0.17    
sexmale             2.0727     0.4568    4.54    8e-06 ***
flipper_length_mm   0.2359     0.0163   14.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.03 on 330 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.46,  Adjusted R-squared:  0.457 
F-statistic:  141 on 2 and 330 DF,  p-value: <2e-16
summary(mod_mult)

Call:
lm(formula = bill_length_mm ~ body_mass_g + flipper_length_mm, 
    data = penguins)

Residuals:
   Min     1Q Median     3Q    Max 
-8.806 -2.590 -0.705  1.991 18.829 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3.436694   4.580553   -0.75     0.45    
body_mass_g        0.000662   0.000567    1.17     0.24    
flipper_length_mm  0.221865   0.032348    6.86  3.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.12 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.433, Adjusted R-squared:  0.43 
F-statistic:  129 on 2 and 339 DF,  p-value: <2e-16

Different formula

What is an interaction

Effect of one predictor depends on value of another

Interaction between categorical predictors

mod <- lm(flipper_length_mm ~ species * sex, data = penguins)
summary(mod)

Call:
lm(formula = flipper_length_mm ~ species * sex, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.795  -3.411   0.088   3.459  17.589 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               187.795      0.662  283.72  < 2e-16 ***
speciesChinstrap            3.941      1.174    3.36  0.00088 ***
speciesGentoo              24.912      0.995   25.04  < 2e-16 ***
sexmale                     4.616      0.936    4.93  1.3e-06 ***
speciesChinstrap:sexmale    3.560      1.661    2.14  0.03278 *  
speciesGentoo:sexmale       4.218      1.397    3.02  0.00274 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.66 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.84,  Adjusted R-squared:  0.837 
F-statistic:  342 on 5 and 327 DF,  p-value: <2e-16

Understanding coefficients

#| label: coef-explain-app
#| standalone: true
#| viewerHeight: 650

library(shiny)
library(bslib)

ui <- page_sidebar(
  sidebar = sidebar(checkboxGroupInput("pred",
    "Predictor",
    choiceValues = c(
      "Food", "Temperature",
      "Interaction"
    ), choiceNames = c(
      "Food (A vs. B)",
      "Temperature (Low vs. High)", "Interaction"
    )
  )), card(
    textOutput("formula")
  ), card(
    card_header("Coefficients"),
    tableOutput("coef_table")
  ), card(plotOutput("plot")),
  tags$head(tags$style("#coef_table td{\n                     position:relative;\n                     };\n\n                     ")),
)
server <- function(input, output, session) {
  data <- reactive({
    b0 <- 5
    b1 <- 3
    b2 <- 2
    b3 <- 2
    food <- factor(rep(c("A", "B"), times = 24))
    temperature <- factor(rep(c("Low", "High"), each = 24),
      levels = c("Low", "High")
    )
    response <- b0 + b1 * (food == "B") + b2 * (temperature ==
      "High") + b3 * (food == "B" & temperature ==
      "High") + rnorm(48)
    data.frame(
      food = food, temperature = temperature,
      response = response
    )
  })
  form <- "response ~"
  form2 <- reactive({
    if ("Interaction" %in% input$pred) {
      paste(form, "food * temperature")
    } else if ("Food" %in% input$pred & "Temperature" %in%
      input$pred) {
      paste(form, "food + temperature")
    } else if ("Food" %in% input$pred) {
      paste(form, "food")
    } else if ("Temperature" %in% input$pred) {
      paste(form, "temperature")
    } else {
      paste(form, "1")
    }
  })
  observe({
    if ("Interaction" %in% input$pred) {
      updateCheckboxGroupInput(session, "pred", selected = c(
        "Food",
        "Temperature", "Interaction"
      ))
    }
  })
  model <- reactive({
    lm(form2(), data = data())
  })
  coefs <- reactive({
    coef(model())
  })
  coef_colours <- reactive({
    c(
      `(Intercept)` = "skyblue", foodB = "red", temperatureHigh = "green",
      `foodB:temperatureHigh` = "orange"
    )[names(coefs())]
  })
  coef_table <- reactive({
    c1 <- "<div style=\"width: 100%; height: 100%; z-index: 0; background-color: "
    c2 <- "; position:absolute; top: 0; left: 0; padding:5px;\">\n<span>"
    c3 <- "</span></div>"
    tab <- data.frame(
      Beta = paste0(
        c1, coef_colours(),
        c2, "β", seq_along(coefs()), c3
      ), Coefficent = names(coefs()),
      Estimate = coefs()
    )
    tab
  })
  output$formula <- renderText(paste("Model formula:", form2()))
  output$coef_table <- renderTable(coef_table(), sanitize.text.function = function(x) x)
  output$plot <- renderPlot({
    set.seed(1)
    f <- as.formula(form2())
    fc <- as.character(f)
    ylim <- c(0, max(data()$response))
    par(mar = c(1.5, 2.5, .5, .5), mgp = c(1.5, 0.5, 0))
    if (fc[3] == "1") {
      stripchart(data()$response,
        method = "jitter",
        jitter = 0.1, vertical = TRUE, pch = 1, ylim = ylim,
        ylab = "response"
      )
    } else {
      stripchart(f,
        data = data(), method = "jitter",
        jitter = 0.1, vertical = TRUE, pch = 1, ylim = ylim
      )
    }
    cols <- c("food", "temperature")[c(grepl(
      "food",
      fc[3]
    ), grepl("temperature", fc[3]))]
    if (length(cols) == 0) {
      cols <- "food"
    }
    pred <- predict(model(), newdata = unique(data()[,
      cols,
      drop = FALSE
    ]))
    points(seq_along(pred), pred,
      col = "#832424", pch = 16,
      cex = 4
    )
    xs <- seq_along(pred) - 0.2
    arrows(xs, rep(0, length(pred)), xs, rep(
      coefs()[1],
      length(pred)
    ),
    col = coef_colours()[1], lwd = 4,
    length = 0.1
    )
    pos <- 2 - 0.2
    if ("foodB" %in% names(coefs())) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["foodB"],
        col = coef_colours()["foodB"], lwd = 4, length = 0.1
      )
      pos <- pos + 1
    }
    if ("temperatureHigh" %in% names(coefs())) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["temperatureHigh"],
        col = coef_colours()["temperatureHigh"], lwd = 4,
        length = 0.1
      )
      pos <- pos + 1
    }
    if (all(c("foodB", "temperatureHigh") %in% names(coefs()))) {
      arrows(pos, coefs()[1], pos, coefs()[1] + coefs()["foodB"],
        col = coef_colours()["foodB"], lwd = 4, length = 0.1
      )
      arrows(pos, coefs()[1] + coefs()["foodB"], pos,
        coefs()[1] + coefs()["foodB"] + coefs()["temperatureHigh"],
        col = coef_colours()["temperatureHigh"], lwd = 4,
        length = 0.1
      )
      if ("foodB:temperatureHigh" %in% names(coefs())) {
        arrows(pos, coefs()[1] + coefs()["foodB"] +
          coefs()["temperatureHigh"], pos, coefs()[1] +
          coefs()["foodB"] + coefs()["temperatureHigh"] +
          coefs()["temperatureHigh"],
        col = coef_colours()["foodB:temperatureHigh"],
        lwd = 4, length = 0.1
        )
      }
    }
  })
}
shinyApp(ui, server)

Interaction between continuous and categorical predictors

mod_interact <- lm(bill_length_mm ~ sex * flipper_length_mm, data = adelie)
summary(mod_interact)

Call:
lm(formula = bill_length_mm ~ sex * flipper_length_mm, data = adelie)

Residuals:
   Min     1Q Median     3Q    Max 
-6.395 -1.329  0.029  1.427  5.438 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                39.7935     8.3528    4.76  4.6e-06 ***
sexmale                   -20.2163    11.0648   -1.83    0.070 .  
flipper_length_mm          -0.0135     0.0445   -0.30    0.762    
sexmale:flipper_length_mm   0.1217     0.0583    2.09    0.039 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.11 on 142 degrees of freedom
Multiple R-squared:  0.385, Adjusted R-squared:  0.372 
F-statistic: 29.6 on 3 and 142 DF,  p-value: 6.43e-15

Interaction between two continuous predictors

mod <- lm(flipper_length_mm ~ body_mass_g * bill_length_mm, data = adelie)
summary(mod)

Call:
lm(formula = flipper_length_mm ~ body_mass_g * bill_length_mm, 
    data = adelie)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.653  -3.101  -0.003   3.369  16.585 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 2.04e+02   5.72e+01    3.58  0.00048 ***
body_mass_g                -6.63e-03   1.52e-02   -0.44  0.66330    
bill_length_mm             -9.19e-01   1.48e+00   -0.62  0.53531    
body_mass_g:bill_length_mm  3.17e-04   3.89e-04    0.82  0.41546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.79 on 142 degrees of freedom
Multiple R-squared:  0.229, Adjusted R-squared:  0.212 
F-statistic:   14 on 3 and 142 DF,  p-value: 4.62e-08

Power needed for interactions

  • Main effect = \(\bar{y_1}\) - \(\bar{y_2}\)
  • Interaction = (\(\bar{y_1}\) - \(\bar{y_2}\)) - (\(\bar{y_3}\) - \(\bar{y_4}\))
set.seed(42)
sigma <- 10
n <- 1000
random <- tibble(
  x = sample(c("A", "B"), n, replace = TRUE),
  z = sample(factor(1:2), n, replace = TRUE),
  y = rnorm(n, mean = 0, sd = sigma)
)

lm(y ~ x * z, data = random) |> tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.281     0.615     0.457  0.647 
2 xB            -1.90      0.864    -2.20   0.0280
3 z2             0.759     0.880     0.863  0.388 
4 xB:z2          1.06      1.24      0.849  0.396 

Formula for interactions

y ~ x + z + x:z

y ~ x * z

y ~ (x + z)^2

Use y ~ x + I(x^2) to get a quadratic. Or y ~ poly(x, 2)

mod1 <- lm(flipper_length_mm ~ species + sex + species:sex, data = penguins)
mod2 <- lm(flipper_length_mm ~ species * sex, data = penguins)
mod3 <- lm(flipper_length_mm ~ (species + sex)^2, data = penguins)

coef(mod1)
             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218 
coef(mod2)
             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218 
coef(mod3)
             (Intercept)         speciesChinstrap            speciesGentoo 
                 187.795                    3.941                   24.912 
                 sexmale speciesChinstrap:sexmale    speciesGentoo:sexmale 
                   4.616                    3.560                    4.218 

Model Selection

You want to the best model!

The best model for what?

  • Exploratory data analysis
  • Inference (hypothesis testing)
  • Predictions

Exploratory analysis

  • Consider all plausible models
  • P-values not meaningful
  • High type I error rate
  • Suggest hypotheses for hypothesis testing with independent data

The more biology you include in the model

Automatic model selection

Last resort - many problems and biases

  • Forward selection

  • Backwards selection

  • All possible models

library(MuMIn)
library(conflicted)
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")

penguins2 <- penguins |>
  drop_na() |>
  select(-year)
full <- lm(body_mass_g ~ ., data = penguins2, na.action = "na.fail")

AIC

AIC Akaike information criterion \(2k - 2 \times log(likelihood)\)
(k the number of parameters)

Measure of how well the model fit the data

Penalised for model complexity

AICc correction for small sample sizes

AIC weights - probability model is best of those tested

dredge(full) |> gt::gt()
(Intercept) bill_depth_mm bill_length_mm flipper_length_mm island sex species df logLik AICc delta weight
-1461.0 67.22 18.204 15.95 NA + + 8 -2354 4724 0.000 7.682e-01
-1500.0 67.58 18.189 16.24 + + + 10 -2354 4728 3.543 1.307e-01
-1211.5 74.38 NA 17.54 NA + + 7 -2357 4729 4.537 7.948e-02
-1245.4 74.51 NA 17.85 + + + 9 -2357 4732 8.027 1.388e-02
-759.1 NA 21.633 17.85 NA + + 7 -2360 4734 9.536 6.526e-03
-785.6 NA 21.528 18.14 + + + 9 -2359 4738 13.153 1.070e-03
-365.8 NA NA 20.02 NA + + 6 -2364 4741 16.649 1.863e-04
-390.0 NA NA 20.32 + + + 8 -2364 4745 20.118 3.287e-05
844.0 87.93 26.537 NA NA + + 7 -2369 4752 27.263 9.232e-07
829.8 88.33 26.664 NA + + + 9 -2369 4756 31.392 1.172e-07
1577.0 102.04 NA NA NA + + 6 -2375 4763 38.571 3.236e-09
1576.2 102.15 NA NA + + + 8 -2375 4767 42.751 4.000e-10
2169.3 NA 32.537 NA NA + + 6 -2378 4768 43.901 2.252e-10
2167.9 NA 32.535 NA + + + 8 -2378 4772 48.086 2.778e-11
-4282.1 141.77 39.718 20.23 NA NA + 7 -2385 4784 59.631 8.643e-14
3372.4 NA NA NA NA + + 5 -2388 4786 61.394 3.580e-14
-4324.0 141.82 39.599 20.58 + NA + 9 -2384 4786 62.049 2.580e-14
3375.8 NA NA NA + + + 7 -2388 4790 65.442 4.731e-15
-2219.3 -50.03 NA 35.69 + + NA 7 -2397 4807 82.977 7.364e-19
-2115.6 -50.34 6.052 33.94 + + NA 8 -2396 4808 83.566 5.488e-19
-4484.7 181.95 NA 25.53 NA NA + 6 -2400 4811 86.962 1.004e-19
-4524.1 181.54 NA 25.95 + NA + 8 -2399 4814 89.247 3.204e-20
-3720.1 NA NA 39.30 + + NA 6 -2401 4815 90.409 1.792e-20
-3629.6 NA 5.808 37.64 + + NA 7 -2401 4816 91.143 1.242e-20
-1709.5 180.78 54.061 NA NA NA + 6 -2405 4822 97.444 5.318e-22
-1735.5 181.30 54.252 NA + NA + 8 -2404 4825 100.777 1.005e-22
-3864.1 NA 60.117 27.54 NA NA + 6 -2411 4833 109.078 1.583e-24
-2246.8 -86.95 NA 38.19 NA + NA 5 -2412 4834 109.703 1.158e-24
-2288.5 -86.09 -2.329 38.83 NA + NA 6 -2412 4836 111.527 4.653e-25
-3902.9 NA 59.868 27.96 + NA + 8 -2410 4836 111.692 4.284e-25
-5410.3 NA NA 46.98 NA + NA 4 -2427 4863 138.223 7.424e-31
-5433.5 NA -5.201 48.21 NA + NA 5 -2427 4864 139.128 4.723e-31
-1000.8 256.55 NA NA NA NA + 5 -2431 4871 146.983 9.299e-33
-1014.6 257.16 NA NA + NA + 7 -2430 4875 150.703 1.448e-33
-5707.1 51.24 13.748 42.90 + NA NA 7 -2438 4891 166.751 4.741e-37
-6114.7 56.67 NA 47.40 + NA NA 6 -2441 4895 170.817 6.209e-38
-4013.2 NA NA 40.61 NA NA + 5 -2443 4895 171.067 5.478e-38
-4047.8 NA NA 41.09 + NA + 7 -2442 4897 173.113 1.970e-38
200.5 NA 90.298 NA NA NA + 5 -2444 4899 174.468 1.001e-38
-4355.0 NA 16.883 39.66 + NA NA 6 -2445 4902 177.784 1.906e-39
182.0 NA 90.510 NA + NA + 7 -2444 4903 178.295 1.477e-39
-4687.9 NA NA 44.89 + NA NA 5 -2450 4909 184.809 5.685e-41
-6537.6 19.88 NA 51.77 NA NA NA 4 -2460 4928 203.674 4.552e-45
-5872.1 NA NA 50.15 NA NA NA 3 -2461 4928 203.836 4.198e-45
-5836.3 NA 4.959 48.89 NA NA NA 4 -2461 4929 204.973 2.377e-45
-6445.5 17.84 3.293 50.76 NA NA NA 5 -2460 4930 205.355 1.965e-45
4784.5 -155.12 44.769 NA + + NA 7 -2470 4955 230.236 7.767e-51
1747.4 NA 60.616 NA + + NA 6 -2505 5022 297.540 1.886e-65
6551.4 -257.31 36.519 NA NA + NA 5 -2507 5023 299.035 8.929e-66
7580.2 -211.52 NA NA + + NA 6 -2506 5024 299.241 8.057e-66
3706.2 NA NA NA NA NA + 4 -2513 5035 310.285 3.221e-68
3709.7 NA NA NA + NA + 6 -2513 5039 314.411 4.093e-69
8779.1 -299.34 NA NA NA + NA 4 -2529 5066 341.864 4.474e-75
1697.5 -26.23 76.000 NA + NA NA 6 -2537 5086 361.234 2.782e-79
1239.3 NA 76.905 NA + NA NA 5 -2538 5086 361.346 2.632e-79
4375.8 NA NA NA + + NA 5 -2562 5133 409.061 1.145e-89
3413.5 -145.51 74.813 NA NA NA NA 4 -2595 5199 474.205 8.187e-104
746.1 NA 74.025 NA NA + NA 4 -2614 5236 512.010 5.056e-112
5590.3 -54.76 NA NA + NA NA 5 -2615 5240 516.032 6.767e-113
4719.2 NA NA NA + NA NA 4 -2618 5244 519.963 9.482e-114
388.8 NA 86.792 NA NA NA NA 3 -2629 5264 539.833 4.594e-118
7520.0 -193.01 NA NA NA NA NA 3 -2658 5322 598.047 1.050e-130
3862.3 NA NA NA NA + NA 3 -2667 5340 615.648 1.582e-134
4207.1 NA NA NA NA NA NA 2 -2700 5404 679.945 1.727e-148

Inference

Test small number of a priori hypothesis.

Use anova() to compare nested models

H0 there is no interaction between sex and species for predicting flipper length

mod1 <- lm(body_mass_g ~ species + sex, data = penguins)
mod2 <- lm(body_mass_g ~ species * sex, data = penguins)
anova(mod1, mod2)
Analysis of Variance Table

Model 1: body_mass_g ~ species + sex
Model 2: body_mass_g ~ species * sex
  Res.Df      RSS Df Sum of Sq    F Pr(>F)    
1    329 32979185                             
2    327 31302628  2   1676557 8.76  2e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::Anova vs anova

mod2 <- lm(body_mass_g ~ species + sex, data = penguins)
mod3 <- lm(body_mass_g ~ sex + species, data = penguins)
  • anova - tests terms sequentially
  • car::Anova - marginal test
anova(mod2)
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value Pr(>F)    
species     2 1.45e+08 72595110     724 <2e-16 ***
sex         1 3.71e+07 37090262     370 <2e-16 ***
Residuals 329 3.30e+07   100241                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod3)
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value Pr(>F)    
sex         1 3.89e+07 38878897     388 <2e-16 ***
species     2 1.43e+08 71700792     715 <2e-16 ***
Residuals 329 3.30e+07   100241                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(mod2)
Anova Table (Type II tests)

Response: body_mass_g
            Sum Sq  Df F value Pr(>F)    
species   1.43e+08   2     715 <2e-16 ***
sex       3.71e+07   1     370 <2e-16 ***
Residuals 3.30e+07 329                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(mod3)
Anova Table (Type II tests)

Response: body_mass_g
            Sum Sq  Df F value Pr(>F)    
sex       3.71e+07   1     370 <2e-16 ***
species   1.43e+08   2     715 <2e-16 ***
Residuals 3.30e+07 329                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multicollinearity

  • Two or more predictor variables in a multiple regression model are highly correlated. Example: pH and calcium
  • Coefficient estimates are unstable
    • erratic change in response to small changes in the model or the data.
  • Solve by having lots of data
col <- performance::check_collinearity(full)
col
# Check for Multicollinearity

Low Correlation

   Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 island 3.73 [ 3.14,  4.49]         1.93      0.27     [0.22, 0.32]
    sex 2.33 [ 2.00,  2.77]         1.53      0.43     [0.36, 0.50]

Moderate Correlation

              Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
    bill_length_mm 6.10 [ 5.06,  7.40]         2.47      0.16     [0.14, 0.20]
     bill_depth_mm 6.10 [ 5.06,  7.41]         2.47      0.16     [0.14, 0.20]
 flipper_length_mm 6.80 [ 5.63,  8.26]         2.61      0.15     [0.12, 0.18]

High Correlation

    Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 species 63.52 [51.73, 78.06]         7.97      0.02     [0.01, 0.02]
plot(col)

Autocorrelation

Linear models assume residuals are independent

If data are spatially or temporally structured, residuals may be correlated.

Positive autocorrelation

  • confidence intervals are too narrow
  • increases risk of false positive (Type I error)

Luteinising Hormone concentration

Code
library(ggfortify)

lh <- fortify(lh) |>
  mutate(time = Index * 10) |>
  rename(concentration = Data)

mod_lh <- lm(concentration ~ time, data = lh)

augment(mod_lh, interval = "confidence") |>
  ggplot(aes(x = time, y = concentration)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "#ee5050") +
  geom_line(aes(y = .fitted), colour = "#ee5050") +
  labs(x = "Time, minutes", y = "Hormone concentration")

Detecting autocorrelation

For time series with equally-spaced observations, use autocorrelation function (ACF)

autoplot(acf(mod_lh$residuals, plot = FALSE))

Detecting autocorrelation

Tests for equally-spaced observations

Durbin-Watson test

performance::check_autocorrelation(mod_lh)
Warning: Autocorrelated residuals detected (p < .001).

Other methods for spatial data and non-equally spaced data

Solution - use a model that accounts for autocorrelation

  • generalised least squares (nlme::gls())

Reporting regression results - tables

Make a table - estimates, confidence intervals, p-values

Can calculate confidence intervals with broom::tidy() and format output

library(gt)
tidy(mod_mult, conf.int = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value) |>
  mutate(p.value = format.pval(p.value, eps = 0.001)) |>
  gt() |>
  fmt_number(
    columns = c(estimate, conf.low, conf.high),
    decimals = 4
  )
term estimate conf.low conf.high p.value
(Intercept) −3.4367 −12.4466 5.5732 0.45
body_mass_g 0.0007 −0.0005 0.0018 0.24
flipper_length_mm 0.2219 0.1582 0.2855 <0.001

gtsummary

Or make table with gtsummary

library(gtsummary)
tbl <- tbl_regression(mod_mult,
  label = list(
    body_mass_g = "Body mass g",
    flipper_length_mm = "Flipper length mm"
  ),
  estimate_fun = label_style_sigfig(digits = 4)
)
tbl
Characteristic Beta 95% CI1 p-value
Body mass g 0.0007 -0.0005, 0.0018 0.2
Flipper length mm 0.2219 0.1582, 0.2855 <0.001
1 CI = Confidence Interval

Reporting regression results - inline

Can extract values directly from regression .

  • estimate
  • confidence intervals
  • p-value, degrees of freedom, statistic
Flipper length is associated with longer bill length (estimate = `r round(coef(mod_mult)["flipper_length_mm"], 2)` mm/mm, 95% CI = `r round(confint(mod_mult)["flipper_length_mm", ], 2)` p = `r format.pval(tidy(mod_mult) |> filter(term == "flipper_length_mm") |> select("p.value"), eps = 0.001)`).

Flipper length is associated with longer bill length (estimate = 0.22 mm/mm, 95% CI = 0.16, 0.29 p = <0.001).

Also function inline_text() from gtsummary

See APA guide for reporting statistics