Across (dplyr 1.0.0): applying dplyr functions simultaneously across multiple columns

With the introduction of dplyr 1.0.0, there are a few new features: the biggest of which is across() which supersedes the scoped versions of dplyr functions.

Rebecca Barter

I often find that I want to use a dplyr function on multiple columns at once. For instance, perhaps I want to scale all of the numeric variables at once using a mutate function, or I want to provide the same summary for three of my variables.

While it’s been possible to do such tasks for a while using scoped verbs, it’s now even easier - and more consistent - using dplyr’s new across() function.

To demonstrate across(), I’m going to use Palmer’s Penguin dataset, which was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, but has recently been made popular in the R community by Allison Horst as an alternative to the over-used Iris dataset.

To start with, let’s load the penguins dataset (via the palmerpenguins package) and the tidyverse package. If you’re new to the tidyverse (primarily to dplyr and piping, %>%), I suggest taking a look at my post on the tidyverse before reading this post.

# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
library(tidyverse)
penguins
## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…           NA            NA                 NA          NA
##  5 Adelie  Torge…           36.7          19.3              193        3450
##  6 Adelie  Torge…           39.3          20.6              190        3650
##  7 Adelie  Torge…           38.9          17.8              181        3625
##  8 Adelie  Torge…           39.2          19.6              195        4675
##  9 Adelie  Torge…           34.1          18.1              193        3475
## 10 Adelie  Torge…           42            20.2              190        4250
## # … with 334 more rows, and 1 more variable: sex <fct>

There are 344 rows in the penguins dataset, one for each penguin, and 7 columns. The first two columns, species and island, specify the species and island of the penguin, the next four specify numeric traits about the penguin, including the bill and flipper length, the bill depth and the body mass.

The new across() function turns all dplyr functions into “scoped” versions of themselves, which means you can specify multiple columns that your dplyr function will apply to.

Ordinarily, if we want to summarise a single column, such as species, by calculating the number of distinct entries (using n_distinct()) it contains, we would typically write

penguins %>%
  summarise(distinct_species = n_distinct(species))
## # A tibble: 1 x 1
##   distinct_species
##              <int>
## 1                3

If we wanted to calculate n_distinct() not only across species, but also across island and sex, we would need to write out the n_distinct function three separate times:

penguins %>%
  summarise(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
## # A tibble: 1 x 3
##   distinct_species distinct_island distinct_sex
##              <int>           <int>        <int>
## 1                3               3            3

Wouldn’t it be nice if we could just write which columns we want to apply n_distinct() to, and then specify n_distinct() once, rather than having to apply n_distinct to each column separately?

This is where across() comes in. It is used inside your favourite dplyr function and the syntax is across(.cols, .fnd), where .cols specifies the columns that you want the dplyr function to act on. When dplyr functions involve external functions that you’re applying to columns e.g. n_distinct() in the example above, this external function is placed in the .fnd argument. For example, we would to apply n_distinct() to species, island, and sex, we would write across(c(species, island, sex), n_distinct) in the summarise parentheses.

Note that we are specifying which variables we want to involve in the summarise using c(), as if we’re listing the variable names in a vector, but because we’re in dplyr-land, we don’t need to put them in quotes:

penguins %>%
  summarise(across(c(species, island, sex), 
                   n_distinct))
## # A tibble: 1 x 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     3

Something else that’s really neat is that you can also use !c() to negate a set of variables (i.e. to apply the function to all variables except those that you specified in c()):

penguins %>%
  summarise(across(!c(species, island, sex), 
                   n_distinct))
## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <int>         <int>             <int>       <int>
## 1            165            81                56          95

I want to emphasize here that the function n_distinct() is an argument of across(), rather than being an argument of the dplyr function (summarise).

Select helpers: selecting columns to apply the function to

So far we’ve seen how to apply a dplyr function to a set of columns using a vector notation c(col1, col2, col3, ...). However, there are many other ways to specify the columns that you want to apply the dplyr function to.

  • everything(): apply the function to all of the columns
penguins %>%
  summarise(across(everything(), n_distinct))
## # A tibble: 1 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g   sex
##     <int>  <int>          <int>         <int>            <int>       <int> <int>
## 1       3      3            165            81               56          95     3
  • starts_with(): apply the function to all columns whose name starts with a specific string
penguins %>%
  summarise(across(starts_with("bill"), n_distinct))
## # A tibble: 1 x 2
##   bill_length_mm bill_depth_mm
##            <int>         <int>
## 1            165            81
  • contains(): apply the function to all columns whose name contains a specific string
penguins %>%
  summarise(across(contains("length"), n_distinct))
## # A tibble: 1 x 2
##   bill_length_mm flipper_length_mm
##            <int>             <int>
## 1            165                56
  • where() apply the function to all columns that satisfy a logical condition, such as is.numeric()
penguins %>%
  summarise(across(where(is.numeric), n_distinct))
## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <int>         <int>             <int>       <int>
## 1            165            81                56          95

The full list of select helpers can be found here.

Using in-line functions with across

Let’s look at an example of summarizing the columns using a custom function (rather than n_distinct()). I usually do this using the tilde-dot shorthand for inline functions. The notation works by replacing

function(x) {
  x + 10
}

with

~{.x + 10}

~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments that can be anything that you like, the tilde-dot function argument is always .x.

For instance, to identify how many missing values there are in every column, we could specify the inline function ~sum(is.na(.)), which calculates how many NA values are in each column (where the column is represented by .) and adds them up:

penguins %>%
  summarise(across(everything(), 
                   ~sum(is.na(.))))
## # A tibble: 1 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g   sex
##     <int>  <int>          <int>         <int>            <int>       <int> <int>
## 1       0      0              2             2                2           2    11

This shows that there are missing values in every column except for the first two (species and island).

A mutate example

What if we want to replace the missing values in the numeric columns with 0 (clearly a terrible choice)? Without the across() function, we would apply an if_else() function separately to each numeric column, which will replace all NA values with 0 and leave all non-NA values as they are:

replace0 <- function(x) {
  if_else(condition = is.na(x), 
          true = 0, 
          false = as.numeric(x))
}
penguins %>%
  mutate(bill_length_mm = replace0(bill_length_mm),
         bill_depth_mm = replace0(bill_depth_mm),
         flipper_length_mm = replace0(flipper_length_mm),
         body_mass_g = replace0(body_mass_g))
## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…            0             0                  0           0
##  5 Adelie  Torge…           36.7          19.3              193        3450
##  6 Adelie  Torge…           39.3          20.6              190        3650
##  7 Adelie  Torge…           38.9          17.8              181        3625
##  8 Adelie  Torge…           39.2          19.6              195        4675
##  9 Adelie  Torge…           34.1          18.1              193        3475
## 10 Adelie  Torge…           42            20.2              190        4250
## # … with 334 more rows, and 1 more variable: sex <fct>

But fortunately, we can do this a lot more efficiently with accross().

# define a function to replace NA with 0

penguins %>%
  mutate(across(where(is.numeric), replace0))
## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…            0             0                  0           0
##  5 Adelie  Torge…           36.7          19.3              193        3450
##  6 Adelie  Torge…           39.3          20.6              190        3650
##  7 Adelie  Torge…           38.9          17.8              181        3625
##  8 Adelie  Torge…           39.2          19.6              195        4675
##  9 Adelie  Torge…           34.1          18.1              193        3475
## 10 Adelie  Torge…           42            20.2              190        4250
## # … with 334 more rows, and 1 more variable: sex <fct>

Although obviously 0 isn’t a great choice, so perhaps we can replace the missing values with the mean value of the column. This time, rather than define a new function (in place of replace0), we’ll be a bit more concise and use the tilde-dot notation to specify the function we want to apply.

penguins %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), mean(., na.rm = T), as.numeric(.))))
## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Adelie  Torge…           39.1          18.7             181        3750 
##  2 Adelie  Torge…           39.5          17.4             186        3800 
##  3 Adelie  Torge…           40.3          18               195        3250 
##  4 Adelie  Torge…           43.9          17.2             201.       4202.
##  5 Adelie  Torge…           36.7          19.3             193        3450 
##  6 Adelie  Torge…           39.3          20.6             190        3650 
##  7 Adelie  Torge…           38.9          17.8             181        3625 
##  8 Adelie  Torge…           39.2          19.6             195        4675 
##  9 Adelie  Torge…           34.1          18.1             193        3475 
## 10 Adelie  Torge…           42            20.2             190        4250 
## # … with 334 more rows, and 1 more variable: sex <fct>

Or better yet, perhaps we can replace the missing values with the average value within the relevant species and island.

penguins %>%
  group_by(species, island) %>%
  mutate(across(where(is.numeric), 
                ~if_else(condition = is.na(.), 
                         true = mean(., na.rm = T), 
                         false = as.numeric(.)))) %>%
  ungroup()
## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Adelie  Torge…           39.1          18.7             181        3750 
##  2 Adelie  Torge…           39.5          17.4             186        3800 
##  3 Adelie  Torge…           40.3          18               195        3250 
##  4 Adelie  Torge…           39.0          18.4             191.       3706.
##  5 Adelie  Torge…           36.7          19.3             193        3450 
##  6 Adelie  Torge…           39.3          20.6             190        3650 
##  7 Adelie  Torge…           38.9          17.8             181        3625 
##  8 Adelie  Torge…           39.2          19.6             195        4675 
##  9 Adelie  Torge…           34.1          18.1             193        3475 
## 10 Adelie  Torge…           42            20.2             190        4250 
## # … with 334 more rows, and 1 more variable: sex <fct>

A select example

When you’re using select, you don’t have to include the across() function, because the select helpers have always worked with select(). This means that you can just write

penguins %>%
  select(where(is.numeric))
## # A tibble: 344 x 4
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##             <dbl>         <dbl>             <int>       <int>
##  1           39.1          18.7               181        3750
##  2           39.5          17.4               186        3800
##  3           40.3          18                 195        3250
##  4           NA            NA                  NA          NA
##  5           36.7          19.3               193        3450
##  6           39.3          20.6               190        3650
##  7           38.9          17.8               181        3625
##  8           39.2          19.6               195        4675
##  9           34.1          18.1               193        3475
## 10           42            20.2               190        4250
## # … with 334 more rows

rather than

penguins %>%
  select(across(where(is.numeric)))

which will throw an error.

Hopefully across() will make your life easier, as it has mine!