Survival analysis and Summary

Bio300B Lecture 11

Richard J. Telford (Richard.Telford@uib.no)

Institutt for biovitenskap, UiB

6 November 2023

Survival analysis

How long (or far) until an event happens

  • How long before a seed germinates
  • How long before a patient dies after diagnosis
  • How many jumps can a horse clear at a racecourse

Data are typically right censored.

survival package

Survival data

Response data

  • time
  • status (0/1, 1/2 TRUE/FALSE)
library(survival)
data(lung)
lung |> slice(1:10)
   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1     3  306      2  74   1       1       90       100     1175      NA
2     3  455      2  68   1       0       90        90     1225      15
3     3 1010      1  56   1       0       90        90       NA      15
4     5  210      2  57   1       1       90        60     1150      11
5     1  883      2  60   1       0      100        90       NA       0
6    12 1022      1  74   1       1       50        80      513       0
7     7  310      2  68   2       2       70        60      384      10
8    11  361      2  71   2       2       60        80      538       1
9     1  218      2  53   1       1       70        80      825      16
10    7  166      2  61   1       2       70        70      271      34

Survival response

  • indicates censored data
Surv(lung$time, lung$status)
 [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
[13]  728    71   567   144   613   707    61    88   301    81   624   371 
[25]  394   520   574   118   390    12 

Fitting a model

lung <- lung |> 
  mutate(sex = factor(sex, levels = 1:2, labels = c("male", "female")))

mod_surv <- survfit(Surv(time, status) ~ sex, data = lung)

mod_surv
Call: survfit(formula = Surv(time, status) ~ sex, data = lung)

             n events median 0.95LCL 0.95UCL
sex=male   138    112    270     212     310
sex=female  90     53    426     348     550

plotting a model

  • Kaplan-Meier curve
  • ‘+’ mark censored data
library(ggfortify)
autoplot(mod_surv)

Testing differences in survival time

  • between two or more groups
mod_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
mod_diff
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

             N Observed Expected (O-E)^2/E (O-E)^2/V
sex=male   138      112     91.6      4.55      10.3
sex=female  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001 

Survivorship curve

Shape of the survival curve

Hazard: The instantaneous risk of death

m_ll <- survreg(Surv(time, status) ~ age + sex, 
                data = lung,
                dist = "loglogistic")
summary(m_ll)

Call:
survreg(formula = Surv(time, status) ~ age + sex, data = lung, 
    dist = "loglogistic")
               Value Std. Error     z       p
(Intercept)  6.39982    0.49256 12.99 < 2e-16
age         -0.01401    0.00771 -1.82 0.06946
sexfemale    0.47751    0.14036  3.40 0.00067
Log(scale)  -0.56991    0.06543 -8.71 < 2e-16

Scale= 0.566 

Log logistic distribution
Loglik(model)= -1152.9   Loglik(intercept only)= -1160.9
    Chisq= 16.07 on 2 degrees of freedom, p= 0.00032 
Number of Newton-Raphson Iterations: 4 
n= 228 

Choice of distribution

Exponentiated coefficients show relative increase or decrease in the expected survival times when a covariate is increased one step while others are fixed:

exp(coef(m_ll))
(Intercept)         age   sexfemale 
601.7394821   0.9860925   1.6120541 

Expected survival time decreases by 1.4 % (i.e. multiply by 0.986) for each additional year of age

The expected survival time for females is 61.2 % higher than for males (multiply by 1.612).

scale

m_ll$scale
[1] 0.5655786

The scale parameter < 1, indicates that slope of the hazard decreases with time - Type III

Choice of distribution

Distribution df
exponential 1
Weibull 2
lognormal 2
log logistic 2

Can use AIC to choose best

m_w <- update(m_ll, dist = "weibull")
AIC(m_ll, m_w)
     df      AIC
m_ll  4 2313.794
m_w   4 2302.109

Summary

Most important things

  1. Designing an experiment
  2. R tips and tricks
  3. Entering data into a spreadsheet
  4. Importing data
  5. Cleaning data
  6. Data visualisation
  7. Using quarto
  8. Choosing a model type
  9. Interpreting model output

Method choice - 2 key questions

Are the data clustered/grouped?

What is the distribution of the residuals?

(also other types of model - survival etc)

Clustered data

Mixed effect models

  • Normal distribution: linear mixed effect models
  • Binomial/poisson: generalised linear mixed effect models

Independent data, normal distribution

Linear models - ordinary least squares

  • t-tests
  • two sample t-test
  • one sample t-test
  • paired t-test
  • lm

Independent data, non-normal distribution

generalised linear models

  • glm

Binomial & Poisson distributions

Interpreting model output

data(penguins, package = "palmerpenguins")
mod <- lm(bill_length_mm ~ sex * species, data = penguins)
summary(mod)

Call:
lm(formula = bill_length_mm ~ sex * species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7904 -1.3735 -0.0638  1.2096 11.4265 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               37.2575     0.2710 137.473  < 2e-16 ***
sexmale                    3.1329     0.3833   8.174 6.64e-15 ***
speciesChinstrap           9.3160     0.4808  19.377  < 2e-16 ***
speciesGentoo              8.3063     0.4073  20.393  < 2e-16 ***
sexmale:speciesChinstrap   1.3877     0.6799   2.041   0.0421 *  
sexmale:speciesGentoo      0.7771     0.5721   1.358   0.1753    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.316 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8234,    Adjusted R-squared:  0.8207 
F-statistic:   305 on 5 and 327 DF,  p-value: < 2.2e-16

More statistics at BIO

Bio302 Spring semester

  • more regression methods
  • more R
  • GitHub

Bio303

  • multivariate methods
    • ordinations
    • cluster analysis