mutate_all(), select_if(), summarise_at()... what's the deal with scoped verbs?!

What's the deal with these mutate_all(), select_if(), summarise_at() functions? They seem so useful, but there doesn't seem to be a decent explanation of how to use them anywhere on the internet. Turns out, they're called 'scoped verbs', and hopefully this post will become one of many decent explanations of how to use them!

Rebecca Barter

I often find myself wishing that I could apply the same mutate function to several columns in a data frame at once, such as converting all factors to characters, doing something to all columns that have missing values, or selecting all variables whose names end with _important. When I first googled these problems around a year ago, I started to see solutions that use weird extensions of the basic mutate(), select(), rename(), and summarise() dplyr functions that look like summarise_all(), filter_at(), mutate_if(), and so on. I have since learned that these functions are called “scoped verbs” (where “scoped” means that they operate only on a selection of variables).

Unfortunately, despite my extensive googling, I never really found a satisfactory description of how to use these functions in general, I think primarily because the documentation for these functions is not particularly useful (try ?mutate_at()).

Fortunately, I recently attended a series of lightning talks hosted by the RLadies SF chapter where Sara Altman pointed us towards a summary document that Hadley Wickham wrote for the Data Science class he helped create at Stanford in 2017 (this class is now taught by Sara Altman herself).

To summarise what I will demonstrate below, there are three scoped variants of the standard mutate, summarise, rename and select (and transmute) dplyr functions that can be specified by the following suffixes:

  • _if: allows you to pick variables that satisfy a logical criterion such as is.numeric() or is.character() (e.g. summarising only the numeric columns)

  • _at: allows you to perform an operation only on variables specified by name (e.g. mutating only the columns whose name ends with "_date")

  • _all: allows you to perform an operation on all variables at once (e.g. calculating the number of missing values in every column)
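As a quick preview, the three suffixes can be sketched on a tiny made-up data frame. The `toy` data frame and its column names are hypothetical and are not part of the survey data used in the rest of this post.

```r
library(dplyr)

# hypothetical toy data, used only for illustration
toy <- data.frame(
  id = 1:3,
  start_date = c("01/02/2017", "01/03/2017", NA),
  score = c(2.5, NA, 4),
  stringsAsFactors = FALSE
)

# _if: summarise only the numeric columns
toy_if <- toy %>%
  summarise_if(is.numeric, ~mean(.x, na.rm = TRUE))

# _at: mutate only the columns whose name ends with "_date"
toy_at <- toy %>%
  mutate_at(vars(ends_with("_date")), ~as.Date(.x, format = "%m/%d/%Y"))

# _all: count the number of missing values in every column
toy_all <- toy %>%
  summarise_all(~sum(is.na(.x)))
```

Each call names the columns to operate on first (a predicate, a vars() selection, or nothing at all for _all) and the function to apply second.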

To explain how these functions all work, I will use the dataset from a survey of 800 Pittsburgh residents on whether or not they approve of self-driving car companies testing their autonomous vehicles on the streets of Pittsburgh (there have been several articles on this issue in recent times in case you missed them: 1, 2). The data can usually be downloaded from data.gov (but is currently unavailable due to the current Government Shutdown - I will update this with an actual link to the data one day). For now you can download the data from here.

A random sample of 10 rows of this dataset is shown below. To make it easy to see what’s going on, I’ll restrict my analysis below to these 10 rows.

# load in the only library you ever really need
library(tidyverse)
library(lubridate)
# load in survey data
av_survey <- read_csv("../data/bikepghpublic.csv")
set.seed(45679)
av_survey_sample <- av_survey %>% 
  # select just a few columns and give them more intuitive column names
  select(id = `Response ID`,
         start_date = `Start Date`, 
         end_date = `End Date`,
         interacted_with_av_as_pedestrian = InteractPedestrian,
         interacted_with_av_as_cyclist = InteractBicycle,
         circumstanses_of_interaction = CircumstancesCoded, # lol @ typo in data
         approve_av_testing_pgh = FeelingsProvingGround) %>%
  # take a random sample of 10 rows
  sample_n(10) %>%
  # convert to a data frame so that we can view the whole thing
  as.data.frame()
av_survey_sample
##           id                 start_date                   end_date
## 1  260332858  02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  260336847 02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3  260381455  02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4  260342663  02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  260897634 03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  260334235 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7  260382639  02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8  260352393  02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9  260396004  02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10 260343325  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_av_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve

A quick useful aside: Using shorthand for functions

For many of the examples below, I will be using the ~fun(.x) shorthand for writing temporary functions. If you’ve never seen this shorthand before it’s incredibly useful. As an example, here are three ways of counting the number of missing values in each column of a data frame.

The first approach uses the traditional sapply() function and temporary function syntax.

# using sapply and the normal temporary function syntax
sapply(av_survey_sample, function(x) sum(is.na(x)))
##                               id                       start_date 
##                                0                                0 
##                         end_date interacted_with_av_as_pedestrian 
##                                0                                0 
##    interacted_with_av_as_cyclist     circumstanses_of_interaction 
##                                0                                7 
##           approve_av_testing_pgh 
##                                0

The second still uses the temporary function syntax, but is using the map_dbl() function from the purrr package instead of the old-school sapply() function.

# using purrr::map_dbl and the normal temporary function syntax
av_survey_sample %>% map_dbl(function(x) sum(is.na(x)))
##                               id                       start_date 
##                                0                                0 
##                         end_date interacted_with_av_as_pedestrian 
##                                0                                0 
##    interacted_with_av_as_cyclist     circumstanses_of_interaction 
##                                0                                7 
##           approve_av_testing_pgh 
##                                0

The third uses the map_dbl() function with the ~fun(.x) syntax.

# using purrr::map_dbl and the `~fun(.x)` temporary function syntax
av_survey_sample %>% map_dbl(~sum(is.na(.x)))
##                               id                       start_date 
##                                0                                0 
##                         end_date interacted_with_av_as_pedestrian 
##                                0                                0 
##    interacted_with_av_as_cyclist     circumstanses_of_interaction 
##                                0                                7 
##           approve_av_testing_pgh 
##                                0
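To make the equivalence concrete, here is a minimal sketch that uses purrr's as_mapper() to turn the ~fun(.x) shorthand into an ordinary function and compares it with the explicit version. The vector `x` and the two function names are made up for illustration.

```r
library(purrr)

# the shorthand, converted into a regular function
count_na_short <- as_mapper(~sum(is.na(.x)))

# the equivalent explicit temporary function
count_na_long <- function(x) sum(is.na(x))

# hypothetical example vector with two missing values
x <- c(1, NA, 3, NA)

count_na_short(x)
count_na_long(x)
```

Both calls return the same count, which is why the two forms can be used interchangeably wherever the tidyverse accepts the shorthand.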

The _if() scoped variant: perform an operation on variables that satisfy a logical criterion

_if allows you to perform an operation on variables that satisfy a logical criterion such as is.numeric() or is.character().

select_if()

For instance, we can use select_if() to extract only the numeric columns of the data frame.

av_survey_sample %>% select_if(is.numeric)
##           id circumstanses_of_interaction
## 1  260332858                            2
## 2  260336847                            2
## 3  260381455                            2
## 4  260342663                           NA
## 5  260897634                           NA
## 6  260334235                           NA
## 7  260382639                           NA
## 8  260352393                           NA
## 9  260396004                           NA
## 10 260343325                           NA

We could also use more complex logical statements, for example by selecting columns that have at least one missing value.

av_survey_sample %>% 
  # select columns with at least one NA
  # the expression evaluates to TRUE if there is one or more missing values
  select_if(~sum(is.na(.x)) > 0) 
##    circumstanses_of_interaction
## 1                             2
## 2                             2
## 3                             2
## 4                            NA
## 5                            NA
## 6                            NA
## 7                            NA
## 8                            NA
## 9                            NA
## 10                           NA

rename_if()

We could rename columns that satisfy a logical expression using rename_if(). For instance, we can add a num_ prefix to all numeric column names.

av_survey_sample %>%
  # only rename numeric columns by adding a "num_" prefix
  rename_if(is.numeric, ~paste0("num_", .x))
##       num_id                 start_date                   end_date
## 1  260332858  02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  260336847 02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3  260381455  02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4  260342663  02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  260897634 03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  260334235 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7  260382639  02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8  260352393  02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9  260396004  02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10 260343325  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    num_circumstanses_of_interaction approve_av_testing_pgh
## 1                                 2       Somewhat Approve
## 2                                 2                Approve
## 3                                 2             Disapprove
## 4                                NA       Somewhat Approve
## 5                                NA                Approve
## 6                                NA    Somewhat Disapprove
## 7                                NA                Neutral
## 8                                NA                Approve
## 9                                NA                Approve
## 10                               NA                Approve

mutate_if()

We could similarly use mutate_if() to mutate columns that satisfy specified logical conditions. In the example below, we mutate all columns that have at least one missing value by replacing NA with "missing".

av_survey_sample %>% 
  # only mutate columns with at least one NA
  # replace each NA value with the character "missing"
  mutate_if(~sum(is.na(.x)) > 0,
            ~if_else(is.na(.x), "missing", as.character(.x)))
##           id                 start_date                   end_date
## 1  260332858  02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  260336847 02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3  260381455  02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4  260342663  02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  260897634 03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  260334235 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7  260382639  02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8  260352393  02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9  260396004  02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10 260343325  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_av_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                       missing       Somewhat Approve
## 5                       missing                Approve
## 6                       missing    Somewhat Disapprove
## 7                       missing                Neutral
## 8                       missing                Approve
## 9                       missing                Approve
## 10                      missing                Approve
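Another common use of mutate_if() is the "convert all factors to characters" task mentioned at the start of this post. The sketch below uses a small hypothetical data frame (`toy`), since the survey sample has no factor columns.

```r
library(dplyr)

# hypothetical toy data with one factor column
toy <- data.frame(
  id = 1:2,
  answer = factor(c("Yes", "No"))
)

# convert every factor column to character, leaving other columns alone
toy_chr <- toy %>%
  mutate_if(is.factor, as.character)

# inspect the resulting column classes
sapply(toy_chr, class)
```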

summarise_if()

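Similarly, summarise_if() will summarise columns that satisfy the specified logical conditions. Below, we summarise each character column by reporting the most common value (base R's mode() function returns an object's storage mode rather than its most common value, so we need to write our own). Note that every date-time in this sample is unique, so for the date columns the function simply returns the first of the tied values.

# function to calculate the mode (most common) observation
# note: this masks base R's mode() for the rest of the session
mode <- function(x) {
  # sort the counts in decreasing order so the most common value comes first
  names(sort(table(x), decreasing = TRUE))[1]
}
# summarise each character column
av_survey_sample %>% 
  summarise_if(is.character, mode)
##                   start_date                   end_date
## 1 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
##   interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               No                            No
##   approve_av_testing_pgh
## 1                Approve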

The _at() scoped variant: perform an operation only on variables specified by name

_at allows you to perform an operation only on variables specified by name.

To specify which variables you want to operate on, you need to include the variable names inside the vars() function as the first argument. I think of vars() as being like c(): it provides multiple values (in this case variable names) as a single argument. For example, av_survey_sample %>% mutate_at(vars(start_date, end_date), mdy_hms) will only mutate the start_date and end_date variables, converting them to date-times using lubridate's mdy_hms() function.

These variables can be specified explicitly by name within the vars() function, or by using the select helpers within the vars() function.
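As a side note, the dplyr documentation for the _at verbs says that the column selection can also be given as a character vector of column names instead of a vars() call. The sketch below checks that the two forms agree on a hypothetical data frame (`toy` and its columns are made up for illustration).

```r
library(dplyr)

# hypothetical toy data: two columns of numbers stored as text
toy <- data.frame(
  a_chr = c("1", "2"),
  b_chr = c("3", "4"),
  keep = c("x", "y"),
  stringsAsFactors = FALSE
)

# specify the columns with vars()
with_vars <- toy %>% mutate_at(vars(a_chr, b_chr), as.numeric)

# specify the same columns with a character vector
cols <- c("a_chr", "b_chr")
with_names <- toy %>% mutate_at(cols, as.numeric)

# the two results are the same
identical(with_vars, with_names)
```

The character-vector form can be handy when the column names are only known at run time.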

Select helpers

Select helpers are functions that you can use within select() to help specify which variables you want to select. The options are

  • starts_with(): select all variables that start with a specified character string

  • ends_with(): select all variables that end with a specified character string

  • contains(): select all variables that contain a specified character string

  • matches(): select variables whose names match a specified regular expression

  • one_of(): selects variables that match any entries in the specified character vector

  • num_range(): selects variables that are numbered (e.g. columns named V1, V2, V3 would be selected by select(num_range("V", 1:3)))
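A minimal sketch of how some of these helpers differ, using a hypothetical data frame whose column names were chosen to show the edge cases (none of these columns exist in the survey data):

```r
library(dplyr)

# hypothetical toy data, used only for illustration
toy <- data.frame(V1 = 1, V2 = 2, V3 = 3,
                  start_date = "a", end_date = "b", update = "c")

# num_range(): V1 and V2 only
numbered <- names(select(toy, num_range("V", 1:2)))

# ends_with(): only names ending in "_date"
ends <- names(select(toy, ends_with("_date")))

# contains(): also picks up "update", because it contains "date"
has_date <- names(select(toy, contains("date")))

# matches(): the string is treated as a regular expression
by_regex <- names(select(toy, matches("^(start|end)_")))
```

The contains() call is the one to watch: it matches anywhere in the name, so it can select more columns than intended.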

There are several ways that we could select the date variables, for example by name or by using the ends_with() and contains() select helpers:

# selecting the date columns by providing their names
av_survey_sample %>% select(start_date, end_date)
##                    start_date                   end_date
## 1   02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3   02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4   02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7   02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8   02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9   02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
# selecting the columns that end with "_date"
av_survey_sample %>% select(ends_with("_date"))
##                    start_date                   end_date
## 1   02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3   02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4   02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7   02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8   02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9   02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
# selecting the columns that contain "date"
av_survey_sample %>% select(contains("date"))
##                    start_date                   end_date
## 1   02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3   02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4   02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7   02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8   02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9   02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST

If you ever find yourself wanting to provide variable names as characters, the matches() and one_of() select helpers can help you do that.

# provide matches with a single character variable
variable <- "start_date"
av_survey_sample %>% select(matches(variable))
##                    start_date
## 1   02/23/2017 9:31:44 AM PST
## 2  02/23/2017 11:46:17 AM PST
## 3   02/24/2017 3:43:22 AM PST
## 4   02/23/2017 2:10:53 PM PST
## 5  03/06/2017 11:25:59 AM PST
## 6  02/23/2017 10:26:07 AM PST
## 7   02/24/2017 4:54:41 AM PST
## 8   02/23/2017 7:12:21 PM PST
## 9   02/24/2017 1:20:53 PM PST
## 10  02/23/2017 2:27:05 PM PST
# provide one_of with a vector of character variables
variables <- c("start_date", "end_date")
av_survey_sample %>% select(one_of(variables))
##                    start_date                   end_date
## 1   02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3   02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4   02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7   02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8   02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9   02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST

Note that technically there does exist a select_at() function that requires a vars() input, but I can’t really think of a good use of this function…

# this is the same as av_survey_sample %>% select(start_date, end_date)
av_survey_sample %>% 
  select_at(vars(start_date, end_date))
##                    start_date                   end_date
## 1   02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3   02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4   02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7   02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8   02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9   02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST

The syntax of this select_at() example, though, can be useful for understanding how the vars() function can be used in the other _at() functions.

rename_at()

You can rename specified variables using the rename_at() function. For instance, we could replace all column names that contain the character string “av” with the same column name but an uppercase “AV” instead of the original lowercase “av”.

To do this, we use the select helper contains() within the vars() function.

# use a select helper to only apply to columns whose name contains "av"
# then rename these columns with "AV" in place of "av"
av_survey_sample %>% 
  rename_at(vars(contains("av")), 
            ~gsub("av", "AV", .x))
##           id                 start_date                   end_date
## 1  260332858  02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  260336847 02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3  260381455  02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4  260342663  02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  260897634 03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  260334235 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7  260382639  02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8  260352393  02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9  260396004  02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10 260343325  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
##    interacted_with_AV_as_pedestrian interacted_with_AV_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_AV_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve

mutate_at()

To mutate only the date variables, normally we would apply the mdy_hms() transformation to each variable separately as follows:

# use the standard (unscoped) approach
av_survey_sample %>% 
  mutate(start_date = mdy_hms(start_date),
         end_date = mdy_hms(end_date))
##           id          start_date            end_date
## 1  260332858 2017-02-23 09:31:44 2017-02-23 09:41:07
## 2  260336847 2017-02-23 11:46:17 2017-02-23 11:51:50
## 3  260381455 2017-02-24 03:43:22 2017-02-24 03:47:05
## 4  260342663 2017-02-23 14:10:53 2017-02-23 14:12:42
## 5  260897634 2017-03-06 11:25:59 2017-03-06 11:28:04
## 6  260334235 2017-02-23 10:26:07 2017-02-23 10:28:13
## 7  260382639 2017-02-24 04:54:41 2017-02-24 04:57:21
## 8  260352393 2017-02-23 19:12:21 2017-02-23 19:14:30
## 9  260396004 2017-02-24 13:20:53 2017-02-24 13:23:58
## 10 260343325 2017-02-23 14:27:05 2017-02-23 14:32:06
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_av_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve

However, using mutate_at() and supplying these column names as arguments to the vars() function, we could specify the function only once.

# specifying specific variables to apply the same function to
av_survey_sample %>% 
  mutate_at(vars(start_date, end_date), mdy_hms)
##           id          start_date            end_date
## 1  260332858 2017-02-23 09:31:44 2017-02-23 09:41:07
## 2  260336847 2017-02-23 11:46:17 2017-02-23 11:51:50
## 3  260381455 2017-02-24 03:43:22 2017-02-24 03:47:05
## 4  260342663 2017-02-23 14:10:53 2017-02-23 14:12:42
## 5  260897634 2017-03-06 11:25:59 2017-03-06 11:28:04
## 6  260334235 2017-02-23 10:26:07 2017-02-23 10:28:13
## 7  260382639 2017-02-24 04:54:41 2017-02-24 04:57:21
## 8  260352393 2017-02-23 19:12:21 2017-02-23 19:14:30
## 9  260396004 2017-02-24 13:20:53 2017-02-24 13:23:58
## 10 260343325 2017-02-23 14:27:05 2017-02-23 14:32:06
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_av_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve

Moreover, we can use the select helpers to specify which columns we want to mutate, without having to write out the entire column names.

# use a "select helper" to specify the variables that end with "_date"
av_survey_sample %>% 
  mutate_at(vars(ends_with("_date")), mdy_hms)
##           id          start_date            end_date
## 1  260332858 2017-02-23 09:31:44 2017-02-23 09:41:07
## 2  260336847 2017-02-23 11:46:17 2017-02-23 11:51:50
## 3  260381455 2017-02-24 03:43:22 2017-02-24 03:47:05
## 4  260342663 2017-02-23 14:10:53 2017-02-23 14:12:42
## 5  260897634 2017-03-06 11:25:59 2017-03-06 11:28:04
## 6  260334235 2017-02-23 10:26:07 2017-02-23 10:28:13
## 7  260382639 2017-02-24 04:54:41 2017-02-24 04:57:21
## 8  260352393 2017-02-23 19:12:21 2017-02-23 19:14:30
## 9  260396004 2017-02-24 13:20:53 2017-02-24 13:23:58
## 10 260343325 2017-02-23 14:27:05 2017-02-23 14:32:06
##    interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses_of_interaction approve_av_testing_pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve

summarise_at()

The summarise_at() scoped verb behaves very similarly to the mutate_at() scoped verb, in that we can easily specify which variables we want to apply the same summary function to.

For instance, the following example summarises all variables that contain the word “interacted” by counting the number of “Yes” entries.

av_survey_sample %>% 
  summarise_at(vars(contains("interacted")), ~sum(.x == "Yes"))
##   interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
## 1                                3                             2
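If you want to try this without the survey data, here is a minimal, self-contained sketch. The `toy_survey` data frame is a hypothetical stand-in whose two "interacted" columns are copied from the sample printed above. The second call, which passes a named list of functions, is my own illustration and not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for av_survey_sample (values copied from the printed sample)
toy_survey <- data.frame(
  id = 1:10,
  interacted_with_av_as_pedestrian = c("Yes", "Yes", "Yes", "No", "No",
                                       "No", "No", "Not sure", "Not sure", "No"),
  interacted_with_av_as_cyclist = c("Not sure", "Yes", "Yes", "No", "No",
                                    "No", "No", "Not sure", "Not sure", "No"),
  stringsAsFactors = FALSE
)

# Same call as above: count the "Yes" entries in each "interacted" column
yes_counts <- toy_survey %>%
  summarise_at(vars(contains("interacted")), ~sum(.x == "Yes"))

# A named list applies several summaries at once;
# the result columns are named <column>_<function name>
yes_no_counts <- toy_survey %>%
  summarise_at(vars(contains("interacted")),
               list(yes = ~sum(.x == "Yes"), no = ~sum(.x == "No")))
```

Passing a named list is handy when you want more than one summary per column without repeating the variable selection.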

The _all() scoped variant: perform an operation on all variables at once

_all allows you to perform an operation on all variables at once (e.g. calculating the number of missing values in every column).
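The missing-value count mentioned above can be sketched as follows. The `toy` data frame and its column names are hypothetical, made up only so that the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data with a few missing values
toy <- data.frame(
  circumstances = c(2, 2, 2, NA, NA),
  approve = c("Approve", NA, "Neutral", "Approve", "Disapprove"),
  stringsAsFactors = FALSE
)

# Count the NA entries in every column at once
na_counts <- toy %>%
  summarise_all(~sum(is.na(.x)))
```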

rename_all()

The select_all() verb is quite redundant (it would simply return all columns). Its friend rename_all(), however, can be very useful.

For instance, we could rename all variables by replacing underscores _ with dots . (although I would advise against this: underscores are way better than dots!).

av_survey_sample %>% 
  rename_all(~gsub("_", ".", .x))
##           id                 start.date                   end.date
## 1  260332858  02/23/2017 9:31:44 AM PST  02/23/2017 9:41:07 AM PST
## 2  260336847 02/23/2017 11:46:17 AM PST 02/23/2017 11:51:50 AM PST
## 3  260381455  02/24/2017 3:43:22 AM PST  02/24/2017 3:47:05 AM PST
## 4  260342663  02/23/2017 2:10:53 PM PST  02/23/2017 2:12:42 PM PST
## 5  260897634 03/06/2017 11:25:59 AM PST 03/06/2017 11:28:04 AM PST
## 6  260334235 02/23/2017 10:26:07 AM PST 02/23/2017 10:28:13 AM PST
## 7  260382639  02/24/2017 4:54:41 AM PST  02/24/2017 4:57:21 AM PST
## 8  260352393  02/23/2017 7:12:21 PM PST  02/23/2017 7:14:30 PM PST
## 9  260396004  02/24/2017 1:20:53 PM PST  02/24/2017 1:23:58 PM PST
## 10 260343325  02/23/2017 2:27:05 PM PST  02/23/2017 2:32:06 PM PST
##    interacted.with.av.as.pedestrian interacted.with.av.as.cyclist
## 1                               Yes                      Not sure
## 2                               Yes                           Yes
## 3                               Yes                           Yes
## 4                                No                            No
## 5                                No                            No
## 6                                No                            No
## 7                                No                            No
## 8                          Not sure                      Not sure
## 9                          Not sure                      Not sure
## 10                               No                            No
##    circumstanses.of.interaction approve.av.testing.pgh
## 1                             2       Somewhat Approve
## 2                             2                Approve
## 3                             2             Disapprove
## 4                            NA       Somewhat Approve
## 5                            NA                Approve
## 6                            NA    Somewhat Disapprove
## 7                            NA                Neutral
## 8                            NA                Approve
## 9                            NA                Approve
## 10                           NA                Approve
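rename_all() accepts either a formula like the one above or a plain function that takes the column names and returns new ones. A small sketch on a hypothetical two-column data frame (`toy` is not part of the survey data):

```r
library(dplyr)

# Hypothetical toy data frame with underscore-separated names
toy <- data.frame(
  start_date = "02/23/2017",
  end_date = "02/24/2017",
  stringsAsFactors = FALSE
)

# A bare function name is applied to the vector of column names
upper_names <- toy %>%
  rename_all(toupper)

# The formula version from above, for comparison
dotted_names <- toy %>%
  rename_all(~gsub("_", ".", .x))
```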

mutate_all()

We could apply the same mutate function to every column at once using mutate_all(). For instance, the code below converts every column to numeric (although this results in mostly missing values for the character variables, since as.numeric() returns NA for any text that does not look like a number).

av_survey_sample %>%
  mutate_all(as.numeric)
##           id start_date end_date interacted_with_av_as_pedestrian
## 1  260332858         NA       NA                               NA
## 2  260336847         NA       NA                               NA
## 3  260381455         NA       NA                               NA
## 4  260342663         NA       NA                               NA
## 5  260897634         NA       NA                               NA
## 6  260334235         NA       NA                               NA
## 7  260382639         NA       NA                               NA
## 8  260352393         NA       NA                               NA
## 9  260396004         NA       NA                               NA
## 10 260343325         NA       NA                               NA
##    interacted_with_av_as_cyclist circumstanses_of_interaction
## 1                             NA                            2
## 2                             NA                            2
## 3                             NA                            2
## 4                             NA                           NA
## 5                             NA                           NA
## 6                             NA                           NA
## 7                             NA                           NA
## 8                             NA                           NA
## 9                             NA                           NA
## 10                            NA                           NA
##    approve_av_testing_pgh
## 1                      NA
## 2                      NA
## 3                      NA
## 4                      NA
## 5                      NA
## 6                      NA
## 7                      NA
## 8                      NA
## 9                      NA
## 10                     NA
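To see where those NA values come from, here is a small sketch on hypothetical data (`toy` is made up for illustration). Strings that look like numbers are parsed, while everything else becomes NA, and R emits a coercion warning that I silence here with suppressWarnings().

```r
library(dplyr)

# Hypothetical toy data: one number-like character column, one text column
toy <- data.frame(
  id = c("260332858", "260336847"),
  answer = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# as.numeric() parses number-like strings and returns NA for the rest;
# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(
  toy %>%
    mutate_all(as.numeric)
)
```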

summarise_all()

We could also apply the same summary function to every column at once using summarise_all(). For instance, the example below calculates the number of distinct entries in each column.

av_survey_sample %>%
  summarise_all(n_distinct)
##   id start_date end_date interacted_with_av_as_pedestrian
## 1 10         10       10                                3
##   interacted_with_av_as_cyclist circumstanses_of_interaction
## 1                             3                            2
##   approve_av_testing_pgh
## 1                      5

Conclusion

Hopefully this summary is useful to you in your data manipulation adventures!