Data visualisation

Bio300B Lecture 4

Richard J. Telford (Richard.Telford@uib.no)

Institutt for biovitenskap, UiB

8 September 2025

Data visualisation

  • A picture is worth a thousand words
  • Tell a story with figures
  • Avoid common mistakes

“reflect the data, tell a story, and look professional” Wilke

ggplot2

  • one of at least three schemes for graphics in R
  • part of tidyverse

A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.

You provide the data, tell ‘ggplot2’ how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.

ggplot in action

plot <- ggplot(data = penguins,     # Data
       mapping = aes(               # Aesthetics
         x = body_mass_g,    
         y = bill_length_mm, 
         colour = species)) +
  geom_point() +                    # Geometries
  scale_colour_brewer(palette = "Set2") + # scales
  labs(x = "Body mass, g",          # labels
       y = "Bill length, mm", 
       colour = "Species") +
  theme_bw()                        # themes
                                    # Also facets
plot

Data

Tibble or data frame with data to be plotted.

Tidy data

Can process data within ggplot but usually best to do it first

Can add data to the whole plot or to individual geoms

penguin_summary <- penguins |>
  group_by(species) |>
  summarise(body_mass_g = mean(body_mass_g, na.rm = TRUE),
            bill_length_mm = mean(bill_length_mm, na.rm = TRUE))
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, colour = species)) +
  geom_point() +
  geom_text(aes(label = species), data = penguin_summary, colour = "black")
plot of penguin bill length against body mass coloured by species. The species name is written in the middle of each species' cluster of points.

Aesthetics

mapping specifies which variables in the data should be mapped onto which aesthetics with aes()

Each geom takes different aesthetics

Common aesthetics

  • x, y
  • fill, colour, alpha
  • shape, size
  • linetype, linewidth
  • group

Setting vs mapping

Mapping in aes()

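# "blue" is treated as data: the fill scale assigns its first default
# colour (not blue) and adds a legend entry labelled "blue"
ggplot(penguins,
       aes(x = flipper_length_mm,
           fill = "blue")) +
  geom_histogram()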

Setting in the geom

# fill is a fixed property of the geom: all bars are blue, no legend
ggplot(penguins,
       aes(x = flipper_length_mm)) +
  geom_histogram(fill = "blue")

geoms

Use different geoms for different plot types

Important geoms

  • geom_point()
  • geom_boxplot()
  • geom_histogram()
  • geom_smooth()
  • geom_line()
  • geom_text()

Many geoms, some in extra packages

Geoms to show distributions

Histogram

Count how many observations in each bin

ggplot(penguins, aes(x = flipper_length_mm)) + geom_histogram()

Critical question - how many bins? Set with bins argument
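
A minimal sketch of the two usual ways to control binning. It assumes penguins comes from the palmerpenguins package (the column names used in these slides match that package); the values 20 and 5 are illustrative, not recommendations.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

base_hist <- ggplot(penguins, aes(x = flipper_length_mm))

# Fix the number of bins ...
p_bins <- base_hist + geom_histogram(bins = 20)

# ... or fix the width of each bin in data units (here 5 mm)
p_width <- base_hist + geom_histogram(binwidth = 5)
```

Try several values: too few bins hide structure, too many show mostly noise.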

Density

Smoothed histograms

ggplot(penguins, aes(x = flipper_length_mm)) + geom_density()

The adjust argument multiplies the default bandwidth to control how smooth the curve is (values above 1 give a smoother curve)
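
As a rough illustration of the effect of adjust, the sketch below draws the same density with half and with double the default bandwidth. It assumes penguins comes from the palmerpenguins package; the values 0.5 and 2 are arbitrary choices for comparison.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

base_den <- ggplot(penguins, aes(x = flipper_length_mm))

# Half the default bandwidth: wigglier curve that follows the data closely
p_wiggly <- base_den + geom_density(adjust = 0.5)

# Double the default bandwidth: smoother curve that may hide real modes
p_smooth <- base_den + geom_density(adjust = 2)
```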

Geoms to show many distributions

base <- ggplot(penguins, aes(x = species, y = flipper_length_mm))

p_prange <- base + stat_summary(fun = "mean", geom = "col")
p_box <- base + geom_boxplot(aes(fill = species))
p_vio <- base + geom_violin(aes(fill = species))
p_jit <- base + geom_jitter(aes(colour = species))
library(ggbeeswarm)
p_quasi <- base + geom_quasirandom(aes(colour = species))
p_quasi2 <- base + geom_violin(aes(fill = species), alpha = 0.3) +
  geom_quasirandom(aes(colour = species))

Boxplots can mislead

p <- datasauRus::box_plots |> 
  pivot_longer(everything()) |> 
  ggplot(aes(x = name, y = value))

p1 <- p + geom_boxplot()
p2 <- p + geom_violin()
Left hand side shows five identical boxplots, right hand side shows very different violin plots for the same datasets.

Show the raw data

Top left panel shows mean + SE only; top right shows mean + SE together with widely spread jittered raw data. Bottom plots show the same with more data, so the SEs are smaller.
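
One possible way to combine a summary with the raw data is sketched below. This is not the code behind the figure described above: it uses the penguins data (assumed to come from the palmerpenguins package) and ggplot2's mean_se() helper to draw mean ± 1 SE over jittered points.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_raw <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  # raw data first, so the summary is drawn on top
  geom_jitter(width = 0.2, height = 0, colour = "grey60", na.rm = TRUE) +
  # mean +/- 1 standard error for each species
  stat_summary(fun.data = mean_se, geom = "pointrange",
               colour = "black", na.rm = TRUE)
p_raw
```

Setting height = 0 keeps the jitter horizontal only, so the plotted body masses are not altered.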

geoms for scatterplots

ggplot(penguins, aes(x = body_mass_g,  y = bill_length_mm, colour = species)) +
  geom_point() +
  geom_smooth(method = "lm")

  • geom_line() - join observations in order of x (left to right)
  • geom_path() - join observations in the order they appear in the data
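
The difference between the two is easiest to see with data where x is not monotonic. The sketch below uses a made-up spiral (the data frame and its column names are invented for illustration).

```r
library(ggplot2)

# Hypothetical data: points along a spiral, stored in drawing order
spiral <- data.frame(t = seq(0, 4 * pi, length.out = 100))
spiral$x <- spiral$t * cos(spiral$t)
spiral$y <- spiral$t * sin(spiral$t)

base_spiral <- ggplot(spiral, aes(x = x, y = y))

p_line <- base_spiral + geom_line()  # sorted by x: a zig-zag
p_path <- base_spiral + geom_path()  # row order: the spiral
```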

Scales

Control

  • how variables are mapped onto the aesthetics
  • axis breaks

All called scale_aesthetic_description

  • scale_x_log10()
  • scale_y_reverse()
  • scale_colour_viridis_c()
  • scale_shape_manual()
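
A short sketch showing several scales on one plot, assuming penguins comes from the palmerpenguins package. The shape codes (16, 17, 15) are arbitrary illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_scaled <- ggplot(penguins,
                   aes(x = body_mass_g,
                       y = bill_length_mm,
                       colour = flipper_length_mm,
                       shape = species)) +
  geom_point(na.rm = TRUE) +
  scale_x_log10() +                 # log10-transformed x axis
  scale_colour_viridis_c() +        # continuous viridis colour scale
  scale_shape_manual(values = c(Adelie = 16, Chinstrap = 17, Gentoo = 15))
p_scaled
```

Each scale function name follows the pattern scale, then the aesthetic, then a description of the scale.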

Labels

  • plot, axis and legend titles

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, colour = species)) +
  geom_point() +
  labs(x = "Body mass, g",
       y = "Bill length, mm",
       colour = "Species",
       title = "Bill length against body mass")

Facets

Split data into separate panels.

plot + facet_wrap(facets = vars(species))

facet_grid() for two dimensional arrays of subplots

diamonds |> ggplot(aes(x = carat, y = price)) +
  geom_point(data = diamonds |> select(carat, price), colour = "grey80") +
  geom_point() +
  facet_grid(rows = vars(color), cols = vars(clarity))

Themes

Change how non-data elements of the plot look

Entire themes
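
A few of the complete themes that ship with ggplot2, applied to the same plot for comparison. This sketch assumes penguins comes from the palmerpenguins package; the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(na.rm = TRUE)

p_minimal <- p + theme_minimal()  # no panel border or background
p_classic <- p + theme_classic()  # axis lines, no grid
p_void    <- p + theme_void()     # data only: no axes, grid or background
```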

Can also change individual elements

plot + theme(legend.position = "top")

Removing elements

plot + theme(panel.grid = element_blank())

Colour & fills

Avoid primary colours

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram() +
  scale_fill_manual(values = c("red", "green", "blue")) +
  labs(x = "Flipper length mm")

Colour deficient vision

den <- ggplot(penguins, aes(x = bill_length_mm, fill = species)) +
  geom_density(alpha = 0.7)
den
colorBlindness::cvdPlot(den)

#End rainbow

Better colour scale

den <- ggplot(penguins, aes(x = bill_length_mm, fill = species)) +
  geom_density(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2")
den
colorBlindness::cvdPlot(den)

Using colour effectively

Choose an appropriate palette.

Qualitative palettes

RColorBrewer::display.brewer.all(type = "qual")

Sequential palettes

RColorBrewer::display.brewer.all(type = "seq")

Diverging palettes

RColorBrewer::display.brewer.all(type = "div")

Viridis

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(colour = flipper_length_mm)) +
  scale_colour_viridis_c()

Highlight

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(colour = "red") +
  gghighlight::gghighlight(species == "Chinstrap")

Redundant encoding

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm,
           colour = species,
           shape = species)) +
  geom_point() 
Plot of penguin data with points distinguished by both colour and shape

Also colour and linetype/linewidth
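
For lines, the same idea might look like the sketch below, where species is mapped to both colour and linetype. It assumes penguins comes from the palmerpenguins package.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_lines <- ggplot(penguins,
                  aes(x = body_mass_g,
                      y = flipper_length_mm,
                      colour = species,
                      linetype = species)) +
  # one regression line per species, without confidence bands
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)
p_lines
```

Because both aesthetics map the same variable, ggplot2 merges them into a single legend.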

Avoiding legends

library(directlabels)
direct.label(plot) 
Plot of penguins data with labels applied directly to the plot instead of using a legend.

Avoiding overplotting

Problem - points plot on top of each other.

Plot of body mass against species for the penguin data. With geom_point() the data are all in one line with lots of overplotting. geom_jitter() spreads the points out to reduce overplotting.
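
The comparison described above could be reproduced along these lines. This is a sketch, assuming penguins comes from the palmerpenguins package; the jitter width of 0.2 is an arbitrary choice.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

base_sp <- ggplot(penguins, aes(x = species, y = body_mass_g))

# All points for a species fall on one vertical line
p_point <- base_sp + geom_point(na.rm = TRUE)

# Random horizontal offsets separate the points; height = 0 leaves
# the body mass values unchanged
p_jitter <- base_sp + geom_jitter(width = 0.2, height = 0, na.rm = TRUE)
```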

More overplotting

Problem - too much data

Plots of diamond price ($) against mass in carats. First plot with default argument to geom_point() is difficult to interpret as points everywhere. Second plot, made by setting alpha to 0.1 is better as rare combinations of mass and price are shown in a paler colour. Plot shows strips with few diamonds sold at just below 1 or 2 carats. Third plot uses hexbinning and highlights the large number of small diamonds sold at a relatively low price.
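
The three variants described above can be sketched as follows, using the diamonds data that ships with ggplot2. Note that geom_hex() needs the hexbin package to be installed when the plot is drawn.

```r
library(ggplot2)

base_d <- ggplot(diamonds, aes(x = carat, y = price))

p_default <- base_d + geom_point()             # heavy overplotting
p_alpha   <- base_d + geom_point(alpha = 0.1)  # transparency shows density
p_hex     <- base_d + geom_hex()               # counts per hexagonal bin
```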

Most common mistake in presentations

plot with very small labels

Solution

theme_bw(base_size = 18)
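
A sketch of two ways to apply the larger base size: to a single plot, or as the session default. It assumes penguins comes from the palmerpenguins package; the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(na.rm = TRUE)

# For a single plot
p_big <- p + theme_bw(base_size = 18)

# Or for every plot made afterwards in the session
old_theme <- theme_set(theme_bw(base_size = 18))
# ... make the plots for the presentation ...
theme_set(old_theme)  # restore the previous default
```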

Summary

  • If you can imagine it, you can plot it
  • Whole ecosystem of packages to help

Further reading