14  Working with single tables in dplyr

Data frames are usually the most convenient objects for storing, plotting or analysing data in R. We also need to be able to manipulate data in data frames.

This tutorial will show you how to manipulate data frames using the dplyr package, part of tidyverse.

Everything that can be done with dplyr can be done with base R. The dplyr solution is usually easier to write and understand, and can be faster.

There are many function in dplyr, this tutorial focuses on the functions we find most useful.

14.0.1 Load the package

You can load dplyr with

But is is usually more convenient to load tidyverse so that you get ggplot2, dplyr, readr and other useful packages with one command.

14.0.2 More feedback tidylog

We can get more feedback on what dplyr functions have done to the data by loading the tidylog package.

More than once, tidylog has helped identified bugs in my code.

14.1 The penguins dataset

This tutorial will use the penguins dataset that is included with R. The penguins dataset contains morphological data on three species of penguin. You will meet this dataset repeatedly as it provides a convenient set of variables and observations well-suited for illustrating many purposes. To improve printing, we’ve converted the penguins data.frame into a tibble.

# Show the data
penguins
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
 [ reached 'max' / getOption("max.print") -- omitted 341 rows ]
# convert to tibble to improve printing
penguins <- as_tibble(penguins)

14.2 Selecting columns with select()

You can choose which columns of the data frame you want with select().

The first argument is the data, which is supplied by the pipe |>, the next arguments are the names of the columns you want. The names do not need quote marks.

#select species, bill_len & bill_dep
penguins |>
  select(species, bill_len, bill_dep)
select: dropped 5 variables (island, flipper_len, body_mass, sex, year)
# A tibble: 344 × 3
  species bill_len bill_dep
  <fct>      <dbl>    <dbl>
1 Adelie      39.1     18.7
2 Adelie      39.5     17.4
3 Adelie      40.3     18  
# ℹ 341 more rows

This is equivalent to the base R code

#select species, bill_len & bill_dep
penguins[, c("species", "bill_len", "bill_dep")]
# A tibble: 344 × 3
  species bill_len bill_dep
  <fct>      <dbl>    <dbl>
1 Adelie      39.1     18.7
2 Adelie      39.5     17.4
3 Adelie      40.3     18  
# ℹ 341 more rows

Remember that if you want to use the output of this code in a further analysis, you need to assign it to an object name with <-.

NoteExercise

From the penguins data, select

  • species
  • species and bill_len
  • all columns except year
Hint
penguins |>
  select(___, ___)

14.2.1 select() helpers

Sometimes we don’t want to write out the names of all the columns we want to select. We might not even know them all in advance. Fortunately there are some helper functions.

If you want to select() adjacent columns, you can use the notation first:last.

#select species to bill_dep
penguins |> select(species:bill_dep)
select: dropped 4 variables (flipper_len, body_mass, sex, year)
# A tibble: 344 × 4
  species island    bill_len bill_dep
  <fct>   <fct>        <dbl>    <dbl>
1 Adelie  Torgersen     39.1     18.7
2 Adelie  Torgersen     39.5     17.4
3 Adelie  Torgersen     40.3     18  
# ℹ 341 more rows

Sometimes it is easier to remove the columns you don’t want. You can do this by putting a - in front of the column name.

#select everything but year and sex
penguins |> select(-year, -sex)
select: dropped 2 variables (sex, year)
# A tibble: 344 × 6
  species island    bill_len bill_dep flipper_len body_mass
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int>
1 Adelie  Torgersen     39.1     18.7         181      3750
2 Adelie  Torgersen     39.5     17.4         186      3800
3 Adelie  Torgersen     40.3     18           195      3250
# ℹ 341 more rows

If there is a pattern to the column names that we want to select (or remove), there are some helper functions. For example, to select columns that start with “bill”, we can use starts_with().

#select bill_len & bill_dep"
penguins |> select(starts_with("bill"))
select: dropped 6 variables (species, island, flipper_len, body_mass, sex, …)
# A tibble: 344 × 2
  bill_len bill_dep
     <dbl>    <dbl>
1     39.1     18.7
2     39.5     17.4
3     40.3     18  
# ℹ 341 more rows

Conversely, if we want to select all columns that end with “len”, we can use ends_with(). contains() is more flexible and matches() is the most powerful of the helper functions, using regular expressions to identify the columns (see the regular expression tutorial).

Sometimes, you might want to select all the columns of a certain type. For example, to select all the numeric columns we can use the is.numeric function inside select() with the helper where().

penguins |> select(where(is.numeric)) # No brackets on the function
select: dropped 3 variables (species, island, sex)
# A tibble: 344 × 5
  bill_len bill_dep flipper_len body_mass  year
     <dbl>    <dbl>       <int>     <int> <int>
1     39.1     18.7         181      3750  2007
2     39.5     17.4         186      3800  2007
3     40.3     18           195      3250  2007
# ℹ 341 more rows

Other is.* functions exist, for example, is.character for text.

You can also select columns by number (1 being the first column), but this is generally a bad idea because it makes the code difficult to understand and if a new column is added, or the column order is changed, the code will break.

Which of these strategies works best is context dependent.

NoteExercise

From the penguins data, select

  • all columns except year
  • all non-numeric columns
  • species and columns ending in “len”
Hint
# hint 1
penguins |>
  select(-___)

# hint 2
?where

# hint 3
?ends_with

14.3 Renaming columns with rename

You can use rename() to rename columns

penguins |> rename(Species = species)
rename: renamed one variable (Species)
# A tibble: 344 × 8
  Species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
# ℹ 341 more rows

The syntax is new_name = current_name.

You can also rename a column when selecting. This is convenient if you are using select() anyway.

penguins |> select(Species = species)
select: renamed one variable (Species) and dropped 7 variables
# A tibble: 344 × 1
  Species
  <fct>  
1 Adelie 
2 Adelie 
3 Adelie 
# ℹ 341 more rows

14.4 Moving columns with relocate()

Sometimes it is useful to reorder the columns. This is never necessary for data analysis or plotting, but can be needed when making a table for presentation.

penguins |> relocate(island)
relocate: columns reordered (island, species, bill_len, bill_dep, flipper_len,
…)
# A tibble: 344 × 8
  island    species bill_len bill_dep flipper_len body_mass sex     year
  <fct>     <fct>      <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Torgersen Adelie      39.1     18.7         181      3750 male    2007
2 Torgersen Adelie      39.5     17.4         186      3800 female  2007
3 Torgersen Adelie      40.3     18           195      3250 female  2007
# ℹ 341 more rows

The default is to move the named column first, the .before and .after arguments let you move the column into any position.

14.5 Filtering rows with filter()

Filtering rows that meet some condition is a very common task.

For example, to filter rows of penguins that have a bill length greater than 40 mm, we can use

penguins |> filter(bill_len > 40)
filter: removed 102 rows (30%), 242 rows remaining
# A tibble: 242 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     40.3     18           195      3250 female  2007
2 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
3 Adelie  Torgersen     41.1     17.6         182      3200 female  2007
# ℹ 239 more rows

This will filter out each row where the condition is TRUE.

The base R equivalent of this is

penguins[penguins$bill_len > 40, ]
# A tibble: 244 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     40.3     18           195      3250 female  2007
2 <NA>    <NA>          NA       NA            NA        NA <NA>      NA
3 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
# ℹ 241 more rows

Generally, filter makes it easier to understand, especially with more complex criteria.

Other tests include

  • == exactly equals. Often a bad idea to use with numeric data
  • near safe function for testing equality of numeric data as it has a tolerance for rounding errors.
sqrt(2) ^ 2 == 2 # should be true, but rounding errors
[1] FALSE
sqrt(2) ^ 2 - 2 # the difference
[1] 4.440892e-16
near(sqrt(2) ^ 2, 2) # safe alternative
[1] TRUE
  • != not equal to
  • < less than
  • <= less than or equal to
  • > greater than
  • >= greater than or equal to
  • is.na() for filtering by missing values.
  • between() for filtering values with a range
  • %in% is used when you want to test if a value is in a vector
penguins |>
  filter(species %in% c("Adelie", "Chinstrap"))
filter: removed 124 rows (36%), 220 rows remaining
# A tibble: 220 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
# ℹ 217 more rows
#equivalent to
penguins |>
  filter(species == "Adelie" | species == "Chinstrap")
filter: removed 124 rows (36%), 220 rows remaining
# A tibble: 220 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
# ℹ 217 more rows
  # with many alternatives, this gets very long

14.5.1 Filtering on multiple criteria

If we want to filter on multiple criteria, we need to decide whether we want all criteria to be TRUE (AND in Boolean logic), or for one or more to be TRUE (OR in Boolean logic).

If we want all criteria to be TRUE, we can separate them by a comma (or by an & if you want to be explicit).

penguins |>
  filter(bill_len > 40, bill_dep > 18)
filter: removed 263 rows (76%), 81 rows remaining
# A tibble: 81 × 8
  species island    bill_len bill_dep flipper_len body_mass sex    year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct> <int>
1 Adelie  Torgersen     42       20.2         190      4250 <NA>   2007
2 Adelie  Torgersen     42.5     20.7         197      4500 male   2007
3 Adelie  Torgersen     46       21.5         194      4200 male   2007
# ℹ 78 more rows

If we want rows where any of the criteria is TRUE, we can separate them by a |.

penguins |>
  filter(bill_len > 40 | bill_dep > 18)
filter: removed 53 rows (15%), 291 rows remaining
# A tibble: 291 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     40.3     18           195      3250 female  2007
3 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
# ℹ 288 more rows

We can negate a criterion by putting ! in front of it. So to filter rows that do not have bills longer than 40 mm we can use

penguins |> filter(!bill_len > 40)
filter: removed 244 rows (71%), 100 rows remaining
# A tibble: 100 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
# ℹ 97 more rows

Of course, in this example, we could also use <= as the test.

14.5.2 Common errors

The commonest error is to use a single = rather than ==. Only the latter is a test of equality. If you do this, the error message is quite helpful.

penguins |> filter(species = "Chinstrap")
Error in `.fun()` at tidylog/R/filter.R:99:5:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `species == "Chinstrap"`?

Another common error is to forget to quote any strings.

penguins |> filter(species == Chinstrap)
Error in `.fun()` at tidylog/R/filter.R:99:5:
ℹ In argument: `species == Chinstrap`.
Caused by error:
! object 'Chinstrap' not found
NoteExercise

From the penguins data, filter

  • Gentoo penguins
  • Gentoo or Adelie penguins
  • penguins with a mass greater than or equal to 5000g
  • penguins with a bill length between 45 and 50 mm
  • Gentoo penguins not from from Dream Island
Hint
#hint 1
penguins |> filter(___ = ___)

#hint 2
?`%in%`

#hint 3
penguins |> filter(___ >= ___)

#hint 4
?between

#hint 5
?`!`

14.6 Slicing the data with slice()

Sometimes it is useful to extract rows by row number.

penguins |> slice(3:7)
slice: removed 339 rows (99%), 5 rows remaining
# A tibble: 5 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     40.3     18           195      3250 female  2007
2 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
3 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
4 Adelie  Torgersen     39.3     20.6         190      3650 male    2007
5 Adelie  Torgersen     38.9     17.8         181      3625 female  2007

You can use negative numbers to remove rows. Be careful using slice() as if the data change, different rows may be returned.

14.7 Distinct rows with distinct()

If there are duplicates in the data, we can remove these with distinct(). distinct() with no extra arguments will remove duplicate rows.

penguins |> distinct()
distinct: no rows removed
# A tibble: 344 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
# ℹ 341 more rows

If we are only interested in some of the columns, we can supply the names of these columns.

penguins |> distinct(island)
distinct: removed 341 rows (99%), 3 rows remaining
# A tibble: 3 × 1
  island   
  <fct>    
1 Torgersen
2 Biscoe   
3 Dream    

Other columns will be removed unless the argument .keep_all = TRUE is used.

NoteExercise

From the penguins data, find distinct values of

  • species
  • species, island and sex
Hint
penguins |> distinct(___, ___)

14.8 Random rows

Sometimes you want to sample rows at random from a data.frame. This can be done with slice_sample(). This can either sample a constant n rows or constant fraction of the rows depending on whether the n or prop argument is used.

penguins |> slice_sample(n = 10)
slice_sample: removed 334 rows (97%), 10 rows remaining
# A tibble: 10 × 8
   species   island    bill_len bill_dep flipper_len body_mass sex     year
   <fct>     <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
 1 Chinstrap Dream         51.5     18.7         187      3250 male    2009
 2 Adelie    Torgersen     38.8     17.6         191      3275 female  2009
 3 Chinstrap Dream         51.3     19.9         198      3700 male    2007
 4 Gentoo    Biscoe        46.5     14.4         217      4900 female  2008
 5 Gentoo    Biscoe        48.7     15.1         222      5350 male    2007
 6 Adelie    Torgersen     36.7     18.8         187      3800 female  2008
 7 Gentoo    Biscoe        47.7     15           216      4750 female  2008
 8 Adelie    Dream         39.8     19.1         184      4650 male    2007
 9 Adelie    Dream         43.2     18.5         192      4100 male    2008
10 Chinstrap Dream         46.1     18.2         178      3250 female  2007

14.9 Mutating and adding columns with mutate

The function mutate() can add an new column or replace an existing one.

To make a new column called body_mass_kg we can use

penguins |>
  mutate(body_mass_kg = body_mass / 1000)
mutate: new variable 'body_mass_kg' (double) with 95 unique values and 1% NA
# A tibble: 344 × 9
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
# ℹ 341 more rows
# ℹ 1 more variable: body_mass_kg <dbl>

There are lots of functions that are useful to use with mutate. Any function that returns either a single value or as many values as are in the data can be used.

mutate() is very useful when cleaning data.

  • See text manipulation tutorial for cleaning text with the stringr package.
  • See date and time tutorial for cleaning dates and times with the lubridate package.
NoteExercise

With the penguins data,

  • convert flipper length to cm
  • add a column with the ratio of bill width to length
Hint
# hint 1
penguins |>
  mutate(___ = ___)

14.10 Summarising data with summarise()

summarise() lets us summarise data. We can use it if we want to calculate a summary statistic of the data. Remember to separate arguments with a comma.

penguins |> summarise(
  flipper_len_mean = mean(flipper_len, na.rm = TRUE),
  flipper_len_sd = sd(flipper_len, na.rm = TRUE)
  )
summarise: now one row and 2 columns, ungrouped
# A tibble: 1 × 2
  flipper_len_mean flipper_len_sd
             <dbl>          <dbl>
1             201.           14.1

Only the columns created in the summarise() and any grouping columns (see below) will be kept.

NoteExercise

With the penguins data, find

  • the maximum and minimum bill length
Hint
penguins |>
  summarise(___ = ___(___))

14.10.1 Summarising multiple columns

Sometimes you want to summarise multiple columns at the same time. This can be done with the across() helper function. across() needs to be told which columns to process and what function or functions to use.

penguins |>
   summarise(
     across(c(bill_len, bill_dep),
            .fns = \(x)mean(x, na.rm = TRUE)))
summarise: now one row and 2 columns, ungrouped
# A tibble: 1 × 2
  bill_len bill_dep
     <dbl>    <dbl>
1     43.9     17.2
#using a list of functions
penguins |>
  summarise(
    across(.cols = starts_with("bill"),
           .fns = list(sd = \(x)sd(x, na.rm = TRUE),
                       mean = \(x)mean(x, na.rm = TRUE))))
summarise: now one row and 4 columns, ungrouped
# A tibble: 1 × 4
  bill_len_sd bill_len_mean bill_dep_sd bill_dep_mean
        <dbl>         <dbl>       <dbl>         <dbl>
1        5.46          43.9        1.97          17.2

You can also use across() with mutate() to mutate several columns at the same time.

14.11 Grouping data with group_by

group_by() changes the way that many of the dplyr functions work. Instead of working on the entire dataset, they now work on each group in the data

To find the mean flipper length for each species, we need to group_by() species and then summarise().

penguins |>
  group_by(species) |>
  summarise(mean_flipper_length = mean(flipper_len))
group_by: one grouping variable (species)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
  species   mean_flipper_length
  <fct>                   <dbl>
1 Adelie                    NA 
2 Chinstrap                196.
3 Gentoo                    NA 

Grouped data can be ungrouped with ungroup(). This can help prevent surprises!

NoteExercise

With the penguins data, find

  • the maximum and minimum bill depth of each species
Hint
penguins |>
  group_by(___)
  summarise(___ =  ___(___))

14.12 Sorting with arrange()

To sort the data frame by one or more of the variables we can use arrange().

penguins |> arrange(bill_len, bill_dep)
# A tibble: 344 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Dream         32.1     15.5         188      3050 female  2009
2 Adelie  Dream         33.1     16.1         178      2900 female  2008
3 Adelie  Torgersen     33.5     19           190      3600 female  2008
# ℹ 341 more rows

This will sort smallest first. To reverse the sort order, use desc()

penguins |> arrange(desc(bill_len), desc(bill_dep))
# A tibble: 344 × 8
  species   island bill_len bill_dep flipper_len body_mass sex     year
  <fct>     <fct>     <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Gentoo    Biscoe     59.6     17           230      6050 male    2007
2 Chinstrap Dream      58       17.8         181      3700 female  2007
3 Gentoo    Biscoe     55.9     17           228      5600 male    2009
# ℹ 341 more rows
NoteExercise

Sort the penguins data by

  • body mass
  • island and body mass, with the largest birds first
Hint
# hint 1
penguins |>
  arrange(___, ___)

# hint 2
?desc

14.13 Counting rows with count() and n()

The function n can count how many rows there are in the each group (or the entire data frame if it is not grouped). It can be used with either mutate() or summarise().

penguins |>
  group_by(species) |>
  summarise(n = n())
group_by: one grouping variable (species)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Or with count()

penguins |>
  count(species)
count: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

count() is more concise, but group_by() and summarise() is useful when you need to calculate more variables (for example mean and standard deviation).

NoteExercise

From the penguins data, find

  • how many penguins are there of each species
  • how many penguins are there of each species in each island
Hint
penguins |>
  count(___, ___)

# or
penguins |>
  group_by(___) |>
  summarise(___ = n())

14.14 Common problems

14.14.1 Non standard names

Ideally column names should follow the standard rules for naming objects in R - UPPER and lower case letters, numbers, “.” and “_” with the first character being a letter (or a dot if you want an invisible object). Sometimes when you import data, it has non-standard names with spaces or extra characters. If you need to refer to a column name that doesn’t follow the rules, you need to enclose it with back-ticks.

df <- tibble(`Region/Country` = "Norway", value = 42)
df
# A tibble: 1 × 2
  `Region/Country` value
  <chr>            <dbl>
1 Norway              42
df |> rename(region_country = `Region/Country`)
rename: renamed one variable (region_country)
# A tibble: 1 × 2
  region_country value
  <chr>          <dbl>
1 Norway            42

It is sometimes best to rename these columns to make them easier to refer to. janitor::clean_names() is very efficient for making easy-to-use names.

Contributors

  • Richard Telford