Across (dplyr 1.0.0): applying dplyr functions simultaneously across multiple columns

dplyr 1.0.0 introduced several new features, the biggest of which is across(), which supersedes the scoped versions of dplyr functions.
R
tidyverse
dplyr
Author

Rebecca Barter

Published

July 9, 2020

I often find that I want to use a dplyr function on multiple columns at once. For instance, perhaps I want to scale all of the numeric variables at once using a mutate function, or I want to provide the same summary for three of my variables.

While it’s been possible to do such tasks for a while using scoped verbs, it’s now even easier - and more consistent - using dplyr’s new across() function.

To demonstrate across(), I’m going to use the Palmer Penguins dataset. It was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, and has recently been made popular in the R community by Allison Horst as an alternative to the over-used iris dataset.

To start with, let’s load the penguins dataset (via the palmerpenguins package) and the tidyverse package. If you’re new to the tidyverse (primarily to dplyr and piping, %>%), I suggest taking a look at my post on the tidyverse before reading this post.

# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
library(tidyverse)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

There are 344 rows in the penguins dataset, one for each penguin, and 8 columns. The first two columns, species and island, specify the species and island of the penguin; the next four specify numeric traits of the penguin (bill length, bill depth, flipper length, and body mass); and the last two, sex and year, record the penguin’s sex and the study year.

The new across() function turns all dplyr functions into “scoped” versions of themselves, which means you can specify multiple columns that your dplyr function will apply to.

Ordinarily, if we want to summarise a single column, such as species, by calculating the number of distinct entries it contains (using n_distinct()), we would write

penguins %>%
  summarise(distinct_species = n_distinct(species))
# A tibble: 1 × 1
  distinct_species
             <int>
1                3

If we wanted to calculate n_distinct() not only across species, but also across island and sex, we would need to write out the n_distinct function three separate times:

penguins %>%
  summarise(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
# A tibble: 1 × 3
  distinct_species distinct_island distinct_sex
             <int>           <int>        <int>
1                3               3            3

Wouldn’t it be nice if we could just write which columns we want to apply n_distinct() to, and then specify n_distinct() once, rather than having to apply n_distinct to each column separately?

This is where across() comes in. It is used inside your favourite dplyr function and the syntax is across(.cols, .fns), where .cols specifies the columns that you want the dplyr function to act on. When the dplyr function involves an external function that you’re applying to the columns, e.g. n_distinct() in the example above, this external function is passed to the .fns argument. For example, if we wanted to apply n_distinct() to species, island, and sex, we would write across(c(species, island, sex), n_distinct) inside the summarise() parentheses.

Note that we are specifying which variables we want to involve in the summarise using c(), as if we’re listing the variable names in a vector, but because we’re in dplyr-land, we don’t need to put them in quotes:

penguins %>%
  summarise(across(c(species, island, sex), 
                   n_distinct))
# A tibble: 1 × 3
  species island   sex
    <int>  <int> <int>
1       3      3     3
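Notice that the output columns keep the original column names. If you would rather have the distinct_ prefix from the hand-written version, across() also takes a .names argument, which is a glue-style template in which {.col} stands for the column name (this is the placeholder used in current dplyr releases). Here is a minimal sketch; so that it runs without palmerpenguins installed, it uses a small made-up data frame called toy, which is my own stand-in and not part of the penguins data.

```r
library(dplyr)

# Hypothetical stand-in for three of the penguins columns
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  sex     = c("male", "female", NA, "female")
)

# {.col} is replaced by each selected column's name
named <- toy %>%
  summarise(across(c(species, island, sex),
                   n_distinct,
                   .names = "distinct_{.col}"))

names(named)
```

With this template the three output columns should be called distinct_species, distinct_island and distinct_sex.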

Something else that’s really neat is that you can also use !c() to negate a set of variables (i.e. to apply the function to all variables except those that you specified in c()):

penguins %>%
  summarise(across(!c(species, island, sex), 
                   n_distinct))
# A tibble: 1 × 5
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
           <int>         <int>             <int>       <int> <int>
1            165            81                56          95     3

I want to emphasize here that the function n_distinct() is an argument of across(), rather than being an argument of the dplyr function (summarise).
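If you have older code that uses the scoped verbs, it may help to see the two styles side by side. The sketch below compares summarise_at() (one of the superseded scoped verbs) with the across() version. It uses a small made-up data frame, toy, as a hypothetical stand-in for the penguins columns, so it can be run on its own.

```r
library(dplyr)

# Hypothetical stand-in for three of the penguins columns
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  sex     = c("male", "female", NA, "female")
)

# Superseded scoped verb: the function is an argument of summarise_at()
old_way <- toy %>%
  summarise_at(vars(species, island, sex), n_distinct)

# across(): the function is an argument of across(), inside summarise()
new_way <- toy %>%
  summarise(across(c(species, island, sex), n_distinct))

# Compare the two results as plain data frames
all.equal(as.data.frame(old_way), as.data.frame(new_way))
```

In the scoped version the columns are wrapped in vars() and the function is handed to the verb itself; with across() both the columns and the function live inside across().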

Select helpers: selecting columns to apply the function to

So far we’ve seen how to apply a dplyr function to a set of columns using a vector notation c(col1, col2, col3, ...). However, there are many other ways to specify the columns that you want to apply the dplyr function to.

  • everything(): apply the function to all of the columns
penguins %>%
  summarise(across(everything(), n_distinct))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…²   sex  year
    <int>  <int>          <int>         <int>          <int>   <int> <int> <int>
1       3      3            165            81             56      95     3     3
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
  • starts_with(): apply the function to all columns whose name starts with a specific string
penguins %>%
  summarise(across(starts_with("bill"), n_distinct))
# A tibble: 1 × 2
  bill_length_mm bill_depth_mm
           <int>         <int>
1            165            81
  • contains(): apply the function to all columns whose name contains a specific string
penguins %>%
  summarise(across(contains("length"), n_distinct))
# A tibble: 1 × 2
  bill_length_mm flipper_length_mm
           <int>             <int>
1            165                56
  • where(): apply the function to all columns that satisfy a logical condition, such as is.numeric()
penguins %>%
  summarise(across(where(is.numeric), n_distinct))
# A tibble: 1 × 5
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
           <int>         <int>             <int>       <int> <int>
1            165            81                56          95     3

The full list of select helpers can be found here.
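The helpers can also be combined with the boolean operators &, | and !. As a rough sketch (again using a made-up toy data frame rather than penguins, so the column values are purely illustrative), here is how you might select every numeric column except year, or every column whose name ends in one of two suffixes:

```r
library(dplyr)

# Hypothetical stand-in with penguins-like column names
toy <- tibble(
  species        = c("Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, 46.1, 50.0),
  body_mass_g    = c(3750L, 4500L, 5700L),
  year           = c(2007L, 2008L, 2009L)
)

# Numeric columns, but not year
numeric_not_year <- toy %>%
  summarise(across(where(is.numeric) & !year, n_distinct))

# Columns whose name ends in "_mm" or "_g"
measurements <- toy %>%
  summarise(across(ends_with("_mm") | ends_with("_g"), n_distinct))

names(numeric_not_year)
names(measurements)
```

On this toy data both selections should pick out bill_length_mm and body_mass_g.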

Using in-line functions with across

Let’s look at an example of summarizing the columns using a custom function (rather than n_distinct()). I usually do this using the tilde-dot shorthand for inline functions. The notation works by replacing

function(x) {
  x + 10
}

with

~{.x + 10}

~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments that can be anything that you like, the tilde-dot function argument is always .x.

For instance, to identify how many missing values there are in every column, we could specify the inline function ~sum(is.na(.)), which calculates how many NA values are in each column (where the column is represented by .) and adds them up:

penguins %>%
  summarise(across(everything(), 
                   ~sum(is.na(.))))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…²   sex  year
    <int>  <int>          <int>         <int>          <int>   <int> <int> <int>
1       0      0              2             2              2       2    11     0
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

This shows that there are missing values in every column except for species, island, and year.
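To make the shorthand concrete, here is a small sketch showing that a named function, the tilde-dot shorthand, and an ordinary anonymous function all give the same answer inside across(). The toy data frame and the count_na() helper are my own illustrative names, and the \(x) form assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical stand-in with a few missing values
toy <- tibble(
  bill_length_mm = c(39.1, NA, 40.3),
  body_mass_g    = c(3750L, NA, NA)
)

# Illustrative helper: count the missing values in a vector
count_na <- function(x) sum(is.na(x))

# 1. pass a named function
by_name <- toy %>% summarise(across(everything(), count_na))

# 2. tilde-dot shorthand, with the column referred to as .x
by_tilde <- toy %>% summarise(across(everything(), ~ sum(is.na(.x))))

# 3. base R anonymous function (the \(x) syntax needs R >= 4.1)
by_lambda <- toy %>% summarise(across(everything(), \(x) sum(is.na(x))))

by_name
```

All three calls should report one missing value in bill_length_mm and two in body_mass_g, so which form you use is mostly a matter of taste.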

A mutate example

What if we want to replace the missing values in the numeric columns with 0 (clearly a terrible choice)? Without the across() function, we would apply an if_else() function separately to each numeric column, which will replace all NA values with 0 and leave all non-NA values as they are:

replace0 <- function(x) {
  if_else(condition = is.na(x), 
          true = 0, 
          false = as.numeric(x))
}
penguins %>%
  mutate(bill_length_mm = replace0(bill_length_mm),
         bill_depth_mm = replace0(bill_depth_mm),
         flipper_length_mm = replace0(flipper_length_mm),
         body_mass_g = replace0(body_mass_g))
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <dbl>   <dbl> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen            0             0            0       0 <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

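But fortunately, we can do this a lot more efficiently with across(). Note that where(is.numeric) also selects year, so replace0() converts that column to a double as well:

# reuse replace0(), defined above, on every numeric column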
penguins %>%
  mutate(across(where(is.numeric), replace0))
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <dbl>   <dbl> <fct> <dbl>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen            0             0            0       0 <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Obviously 0 isn’t a great choice, so perhaps we can replace the missing values with the mean value of the column instead. This time, rather than define a new function (in place of replace0), we’ll be a bit more concise and use the tilde-dot notation to specify the function we want to apply.

penguins %>%
  mutate(across(where(is.numeric), 
                ~if_else(is.na(.), mean(., na.rm = TRUE), as.numeric(.))))
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <dbl>   <dbl> <fct> <dbl>
 1 Adelie  Torgersen           39.1          18.7       181    3750  male   2007
 2 Adelie  Torgersen           39.5          17.4       186    3800  fema…  2007
 3 Adelie  Torgersen           40.3          18         195    3250  fema…  2007
 4 Adelie  Torgersen           43.9          17.2       201.   4202. <NA>   2007
 5 Adelie  Torgersen           36.7          19.3       193    3450  fema…  2007
 6 Adelie  Torgersen           39.3          20.6       190    3650  male   2007
 7 Adelie  Torgersen           38.9          17.8       181    3625  fema…  2007
 8 Adelie  Torgersen           39.2          19.6       195    4675  male   2007
 9 Adelie  Torgersen           34.1          18.1       193    3475  <NA>   2007
10 Adelie  Torgersen           42            20.2       190    4250  <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Or better yet, perhaps we can replace the missing values with the average value within the relevant species and island.

penguins %>%
  group_by(species, island) %>%
  mutate(across(where(is.numeric), 
                ~if_else(condition = is.na(.), 
                         true = mean(., na.rm = TRUE), 
                         false = as.numeric(.)))) %>%
  ungroup()
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <dbl>   <dbl> <fct> <dbl>
 1 Adelie  Torgersen           39.1          18.7       181    3750  male   2007
 2 Adelie  Torgersen           39.5          17.4       186    3800  fema…  2007
 3 Adelie  Torgersen           40.3          18         195    3250  fema…  2007
 4 Adelie  Torgersen           39.0          18.4       191.   3706. <NA>   2007
 5 Adelie  Torgersen           36.7          19.3       193    3450  fema…  2007
 6 Adelie  Torgersen           39.3          20.6       190    3650  male   2007
 7 Adelie  Torgersen           38.9          17.8       181    3625  fema…  2007
 8 Adelie  Torgersen           39.2          19.6       195    4675  male   2007
 9 Adelie  Torgersen           34.1          18.1       193    3475  <NA>   2007
10 Adelie  Torgersen           42            20.2       190    4250  <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
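Because the numbers in the penguins output are hard to check by eye, here is a tiny sketch that makes the difference between the overall and the grouped imputation easy to verify by hand. Both the toy data frame and the impute_mean() helper are hypothetical names that I am introducing just for this illustration.

```r
library(dplyr)

# Hypothetical stand-in: one missing body mass per species
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3000L, 4000L, NA, 5000L, NA)
)

# Illustrative helper: replace NA with the mean of the non-missing values
impute_mean <- function(x) {
  if_else(is.na(x), mean(x, na.rm = TRUE), as.numeric(x))
}

# Overall mean: (3000 + 4000 + 5000) / 3 = 4000 fills both gaps
overall <- toy %>%
  mutate(across(where(is.numeric), impute_mean))

# Grouped mean: Adelie gaps get 3500, Gentoo gaps get 5000
by_species <- toy %>%
  group_by(species) %>%
  mutate(across(where(is.numeric), impute_mean)) %>%
  ungroup()

overall$body_mass_g
by_species$body_mass_g
```

The only difference between the two pipelines is the group_by() call: across() applies the function to each selected column separately within each group.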

A select example

When you’re using select, you don’t have to include the across() function, because the select helpers have always worked with select(). This means that you can just write

penguins %>%
  select(where(is.numeric))
# A tibble: 344 × 5
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
            <dbl>         <dbl>             <int>       <int> <int>
 1           39.1          18.7               181        3750  2007
 2           39.5          17.4               186        3800  2007
 3           40.3          18                 195        3250  2007
 4           NA            NA                  NA          NA  2007
 5           36.7          19.3               193        3450  2007
 6           39.3          20.6               190        3650  2007
 7           38.9          17.8               181        3625  2007
 8           39.2          19.6               195        4675  2007
 9           34.1          18.1               193        3475  2007
10           42            20.2               190        4250  2007
# … with 334 more rows

rather than

penguins %>%
  select(across(where(is.numeric)))

which will throw an error.

Hopefully across() will make your life easier, as it has mine!