Bio300B Lecture 6
Institutt for biovitenskap, UiB
25 September 2024
\[\overline{x} = \frac{\sum_{i=1}^n x_i}{n}\]
mean()
median()
Use na.rm = TRUE
to remove missing values.
min()
max()
Range is difference between smallest and largest.
The average squared distance around the mean var()
\[\sigma^2 = \frac{\sum_{i=1}^n (x_i - \mu)^2}{n}\] Where \(\mu\) is the population mean.
\[s^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1}\] Where \(\overline{x}\) is the sample mean.
Square root of the variance \[s = \sqrt{s^2}\] \[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1}}\] Same units as variable
sqrt(var())
sd()
Right skew
positive skew e1071::skewness()
heavy tails
e1071::kurtosis()
Sample: 123 Gentoo penguins
Population: All Gentoo penguins in the Palmer Archipelago.
What can we infer about the population from the sample?
Population of 1,000,000 indviduals
Large (100 individuals) or small (20 individuals) samples
Histograms of samples
NULL
Histogram of sample means
Standard deviation divided by square root of number of observations
\[SE = \frac{s}{\sqrt{n}}\] Large n - small SE - reliable estimate of mean
Small n - large SE - less reliable estimate of mean
If experiment repeated many times, confidence level represents the proportion of confidence intervals that contain the true value of the unknown population parameter.
95% confidence interval will contain true value in 95% of repeat experiments
Easily misunderstood.
(Bayesian credibility intervals are much more intuitive)
95% confidence interval
Mean \(\pm\) 1.96 * SE
Why 1.96?
Grey area is 95% of the normal distribution
Estimated means with 95% confidence interval for different sample sizes
The statistical hypothesis needs to match your scientific hypothesis
Hypothesis example:
H0: The bill length of Gentoo penguins does not depend on sex.
Need to know so we can code predictor variables and interpret the output of our models.
Statistical model families (lm, lme, glm etc.) differ in assumptions about the underlying distribution of the response variable.
Often assumed for continuous distributions, body mass, egg shell thickness
Count data
Shape varies with mean
lm()
, t.test()
)glm()
)lme()
, lmer()
)glmer()
)Compare means
H0: Bill length does not depend on sex in Gentoo penguins
Welch Two Sample t-test
data: bill_length_mm by sex
t = -8.8798, df = 111.31, p-value = 1.315e-14
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-4.782478 -3.037477
sample estimates:
mean in group female mean in group male
45.56379 49.47377
Often misinterpreted
If there were actually no effect (if the true difference between means were zero) then the probability of observing a value for the difference equal to, or greater than, that actually observed would be p=0.05.
Many assumptions
Confidence intervals are more interpretable
Null hypothesis is | True | False |
---|---|---|
Rejected |
Type I error |
Correct decision |
Not rejected |
Correct decision |
Type II error |
#| label: power-app
#| standalone: true
#| viewerHeight: 600
library(shiny)
ui <- fluidPage(
# Application title
titlePanel("One sided Z-test"),
sidebarLayout(
sidebarPanel(
sliderInput("delta", "True mean", value = 1, min = 0.1, max = 2),
sliderInput("sd", "Standard deviation", value = 1, min = 0.5, max = 2),
sliderInput("n", "Sample size, n", value = 10, min = 10, max = 100),
radioButtons("alpha", "Significance level, alpha", choices = c(0.05, 0.01, 0.001), selected = 0.05)
),
# Show a plot of the generated distribution
mainPanel(
plotOutput("distPlot")
)
))
server <- function(input, output, session) {
output$distPlot <- renderPlot({
n <- input$n
alpha <- as.numeric(input$alpha)
delta <- input$delta
sd <- input$sd
mx <- 2.5
mn <- -1.5
crit <-
qnorm(
alpha,
mean = 0,
sd = sd / sqrt(n),
lower.tail = FALSE
)
H0 <- data.frame(x = seq(mn, mx, length = 201))
H0$y <- dnorm(x = H0$x, mean = 0, sd = sd / sqrt(n))
H0$what <- ifelse(H0$x < crit, yes = "True negative", no = "False positive")
H0$hypothesis <- "H[0]"
H1 <- data.frame(x = seq(mn, mx, length = 201))
H1$y <- dnorm(x = H1$x, mean = delta, sd = sd / sqrt(n))
H1$what <- ifelse(H1$x < crit, yes = "True negative", no = "False positive")
H1$hypothesis <- "H[1]"
par(oma=c(2,2,0,0), mar=c(1.2,2,2,1), mfrow=c(2,1), cex = 1.5, tcl = -0.1, mgp = c(3, 0.2, 0))
plot(H0$x, H0$y, type = "n", main = expression(H[0]~is~true), xlab = "", ylab = "")
polygon(H0$x, H0$y)
polygon(c(H0$x[H0$x > crit][1], H0$x[H0$x > crit]), c(0, H0$y[H0$x > crit]), col = "#832424")
abline(v = crit)
plot(H1$x, H1$y, type = "n", main = expression(H[1]~is~true), xlab = "", ylab = "")
polygon(H1$x, H1$y)
polygon(c(H1$x[H1$x > crit][1], H1$x[H1$x > crit]), c(0, H1$y[H1$x > crit]), col = "#832424")
abline(v = crit)
mtext(text="x",side=1,line=0,outer=TRUE, 1.5)
mtext(text="Density",side=2,line=0,outer=TRUE, cex = 1.5)
})
}
shinyApp(ui, server)
With little power:
Lots of power
Need to do power analysis before experiment.
Can solve for any of these
Typically want to know how many observations needed.
Some power tests in base R.
power.t.test
power.anova.test
power.prop.test
More in pwr
package
Two-sample t test power calculation
n = 63.76561
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
Effect size is Cohen’s d \(d = \frac{\mu_1 - \mu_2}{\sigma}\)