|>
If we want to run several functions in turn, for example to find the mean bill length for each penguin species with group_by and summarise (you will find out more about these functions in Chapter 14), we could nest one function inside the other.
summarise(group_by(penguins, species), mean = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
species mean
<fct> <dbl>
1 Adelie 38.8
2 Chinstrap 48.8
3 Gentoo 47.5
This sort of works, but with larger problems using more functions, this solution becomes almost impossible to read, and it is very easy to make mistakes and forget which brackets belong to which function.
Another strategy is to make and use intermediate objects.
penguins_grouped <- group_by(penguins, species)
summarise(penguins_grouped, mean = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
species mean
<fct> <dbl>
1 Adelie 38.8
2 Chinstrap 48.8
3 Gentoo 47.5
This works better, but can generate many intermediates and it can be difficult to ensure that the correct one is used.
A popular alternative is to use pipes. The R code for a pipe is |>. Pipes pass the results from one function directly into the next function.
penguins |>
  group_by(species) |>
  summarise(mean = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
species mean
<fct> <dbl>
1 Adelie 38.8
2 Chinstrap 48.8
3 Gentoo 47.5
The pipe basically means “and then”, so the above code can be read as “take the penguin data and then group by species and then summarise the mean bill length”.
You never need to use pipes, but they can make code more readable.
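The three styles above are interchangeable. As a small check that needs no extra packages, the following sketch applies the same two steps to a made-up vector (the vector and the object names are only for illustration) and confirms that nesting, an intermediate object and a pipe all give the same answer.

```r
# Made-up example data, not from the penguins dataset
x <- c(1, 4, 9, 16)

# 1. Nested functions: read from the inside out
nested <- sum(sqrt(x))

# 2. Intermediate object
x_root <- sqrt(x)
intermediate <- sum(x_root)

# 3. Pipe: read from top to bottom (needs R 4.1 or newer)
piped <- x |>
  sqrt() |>
  sum()

# All three give the same value
identical(nested, intermediate) && identical(nested, piped)
```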
Here is a recipe for mashed potato using pipes.
# Pseudocode: buy(), peel() and the other functions are not real R functions
buy("potatoes", kg = 1) |>
  peel() |>
  boil(minutes = 15) |>
  drain() |>
  mash(add = list("salt", "milk", "butter")) |>
  serve(decorate = "parsley")
This recipe for mashed potato can be read as “buy 1 kg of potatoes, and then peel them, and then boil them”, and so on.
|>
The pipe passes the result of the code on its left to the function on its right, and puts it in the first available argument.
So
f <- "file.csv"
read_csv(file = f)
can be rewritten as
f |>
  read_csv()
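One way to see this is that R rewrites the pipe before running the code. The sketch below uses base R's quote() to show the rewritten call; x, f and y are placeholder names standing for any object, function and extra argument, and the round() example is only an illustration.

```r
# quote() captures code without running it; x, f and y are placeholders
rewritten <- quote(x |> f(y))
print(rewritten)

# The piped object becomes the first argument of the function on the right
identical(rewritten, quote(f(x, y)))

# A concrete case: the vector is piped into the first argument of round()
piped <- c(3.14159, 2.71828) |> round(digits = 2)
direct <- round(c(3.14159, 2.71828), digits = 2)
identical(piped, direct)
```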
If you want to put the object passed through the pipe into the second argument, you need to name the first, so that it is not available. So if we want to pipe penguins into lm to fit a linear model, we need penguins to be put into the data argument, which is the second argument of lm. We can force this by naming the formula argument, so that data is the first available argument.
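Or we can use the placeholder _ with a named argument.
# named first argument, penguins pipes into second argument
penguins |>
  lm(formula = bill_length_mm ~ species)
# placeholder _ marks where penguins goes
penguins |>
  lm(bill_length_mm ~ species, data = _) # R 4.2 and newer only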
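Both approaches can be tried without installing anything by using the built-in mtcars data in place of penguins (the choice of mtcars and of the variables mpg and wt is only for illustration). This sketch fits the same model both ways and checks that the coefficients agree.

```r
# Option 1: name the formula argument, so the piped data lands in `data`
fit_named <- mtcars |>
  lm(formula = mpg ~ wt)

# Option 2: use the _ placeholder with a named argument (R 4.2 or newer)
fit_placeholder <- mtcars |>
  lm(mpg ~ wt, data = _)

# Same model either way
all.equal(coef(fit_named), coef(fit_placeholder))
```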
More complex arrangements, for example using the piped object multiple times, can be done by writing a function, or using the pipebind package. You probably won’t have to do this very often.
# Using an anonymous function
# The result from rnorm is passed to the anonymous function as x
rnorm(10) |>
  (function(x) {x - mean(x)})()
[1] -1.97555714 -0.48218398 0.02953355 1.11727215 0.64242338 -0.66064742
[7] 1.20709071 0.10471908 2.11306503 -2.09571536
[1] 0.78215898 -0.07098804 -0.06318266 1.20092523 1.34671048 -2.98568019
[7] 0.81707960 -1.01027789 0.75633615 -0.77308166
[1] -0.8370634 0.3015608 -0.2920104 1.2031490 0.3592847 -1.1807594
[7] -0.3339675 -0.1887989 1.9754446 -1.0068395
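Because rnorm gives different numbers on every run, the output above changes each time. A sketch with a fixed vector makes the result easier to check. It uses the \(x) shorthand for function(x), which is also available from R 4.1, and it uses the piped object twice; the vector is made up for illustration.

```r
# Centre a fixed vector: the piped object x is used twice
centred <- c(2, 4, 6) |>
  (\(x) x - mean(x))()
centred

# Same result without the pipe
v <- c(2, 4, 6)
identical(centred, v - mean(v))
```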
%>%
The |> pipe was introduced in R version 4.1. Previously, the magrittr package pipe %>% was widely used, especially with tidyverse functions. You will see the %>% in a lot of code on Stack Overflow and other help sites. In most cases the old and new pipes work in exactly the same way.
Advantages of the |> pipe are that it is
You can make a pipe either by typing it directly, or by using the RStudio keyboard shortcut. You may need to set the RStudio options. Go to Tools > Global Options > Code and tick Use native pipe operator, |>. To make your code readable, put a line break after each pipe.
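The line break has to come after the pipe, not before it: if a line starts with |>, R treats the previous line as already complete. The following sketch checks this with base R's parse(), using code stored as text so that the failing version does not stop the script. The object names are only for illustration.

```r
good <- "c(1, 2, 3) |>\n  sum()"   # pipe at the end of the line
bad  <- "c(1, 2, 3)\n  |> sum()"   # pipe at the start of the next line

# The first version parses and runs
good_result <- eval(parse(text = good))

# The second version is a syntax error, so parse() fails
bad_parses <- tryCatch({
  parse(text = bad)
  TRUE
}, error = function(e) FALSE)

good_result
bad_parses
```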