Learn to purrr

Purrr is the tidyverse’s answer to apply functions for iteration. It’s one of those packages that you might have heard of, but seemed too complicated to sit down and learn. Starting with map functions, and taking you on a journey that will harness the power of the list, this post will have you purrring in no time.
R
purrr
tidyverse
Author

Rebecca Barter

Published

August 19, 2019

“It was on the corner of the street that he noticed the first sign of something peculiar - a cat reading a map” - J.K. Rowling

Purrr is one of those tidyverse packages that you keep hearing about, and you know you should probably learn it, but you just never seem to get around to it.

At its core, purrr is all about iteration. Purrr introduces map functions (the tidyverse’s answer to base R’s apply functions, but more in line with functional programming practices) as well as some new functions for manipulating lists. To get a quick snapshot of any tidyverse package, a nice place to go is the cheatsheet. I find these particularly useful after I’ve already got the basics of a package down, because I inevitably realise that there are a bunch of functionalities I knew nothing about.

Another useful resource for learning about purrr is Jenny Bryan’s tutorial. Jenny’s tutorial is fantastic, but it is a lot longer than this post. My goal here is to get you up and running with purrr very quickly.

While the workhorse of dplyr is the data frame, the workhorse of purrr is the list. If you aren’t familiar with lists, hopefully this will help you understand what they are:

Here is an example of a list that has three elements: a single number, a vector and a data frame

my_first_list <- list(my_number = 5,
                      my_vector = c("a", "b", "c"),
                      my_dataframe = data.frame(a = 1:3, b = c("q", "b", "z"), c = c("bananas", "are", "so very great")))
my_first_list
$my_number
[1] 5

$my_vector
[1] "a" "b" "c"

$my_dataframe
  a b             c
1 1 q       bananas
2 2 b           are
3 3 z so very great

Note that a data frame is actually a special case of a list where each element of the list is a vector of the same length.
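A quick way to convince yourself of this is to poke at a data frame with base R's list tools. The sketch below uses a small made-up data frame (the name df is just for illustration):

```r
# a small data frame, used only for illustration
df <- data.frame(a = 1:3, b = c("q", "b", "z"))

is.list(df)   # a data frame passes the "is this a list?" check
length(df)    # the number of list elements is the number of columns
lengths(df)   # each element (column) has the same length
df[["b"]]     # a column is extracted exactly like a list element
```

This is why anything that iterates over the elements of a list will iterate over the columns of a data frame.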

Map functions: beyond apply

A map function is one that applies the same action/function to every element of an object (e.g. each entry of a list or a vector, or each of the columns of a data frame).

If you’re familiar with the base R apply() functions, then it turns out that you are already familiar with map functions, even if you didn’t know it!

The apply() functions are a set of super useful base-R functions for iteratively performing an action across entries of a vector or list without having to write a for-loop. While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…).
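To make the sapply() ambiguity concrete, here is a small base-R sketch (the object names are made up for illustration). The same function call shape returns a vector, a matrix or a list depending on what the inner function happens to return:

```r
# each call has the same shape, but the return type differs
same_length <- sapply(1:3, function(i) i * 2)         # one value per iteration
two_each    <- sapply(1:3, function(i) c(i, i * 2))   # two values per iteration
ragged      <- sapply(1:3, function(i) seq_len(i))    # a different number each time

class(same_length)  # a plain numeric vector
class(two_each)     # a matrix with one column per iteration
class(ragged)       # a list, because the pieces cannot be simplified
```

With the map functions, the suffix of the function name tells you the output type up front, so there is nothing to guess.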

The naming convention of the map functions is such that the type of the output is specified by the term that follows the underscore in the function name.

  • map(.x, .f) is the main mapping function and returns a list

  • map_df(.x, .f) returns a data frame

  • map_dbl(.x, .f) returns a numeric (double) vector

  • map_chr(.x, .f) returns a character vector

  • map_lgl(.x, .f) returns a logical vector

Consistent with the way of the tidyverse, the first argument of each mapping function is always the data object that you want to map over, and the second argument is always the function that you want to iteratively apply to each element of the input object.

The input object to any map function is always either

  • a vector (of any type), in which case the iteration is done over the entries of the vector,

  • a list, in which case the iteration is performed over the elements of the list,

  • a data frame, in which case the iteration is performed over the columns of the data frame (which, since a data frame is a special kind of list, is technically the same as the previous point).

Since the first argument is always the data, this means that map functions play nicely with pipes (%>%). If you’ve never seen pipes before, they’re really useful (originally from the magrittr package, but also ported with the dplyr package and thus with the tidyverse). Piping allows you to string together many functions by piping an object (which itself might be the output of a function) into the first argument of the next function. If you’d like to learn more about pipes, check out my tidyverse blog posts.
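As a minimal sketch of what that means in practice (the object names here are only for illustration), the two calls below are equivalent: the pipe simply places the vector on its left into the first argument of the map function on its right.

```r
library(purrr)  # purrr also makes the %>% pipe available

# without the pipe: the data is the first argument
without_pipe <- map_dbl(c(1, 4, 7), function(x) x + 10)

# with the pipe: the data is piped into the first argument
with_pipe <- c(1, 4, 7) %>% map_dbl(function(x) x + 10)

identical(without_pipe, with_pipe)
```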

Throughout this post I will demonstrate each of purrr’s functionalities using both a simple numeric example (to explain the concept) and the gapminder data (to show a more complex example).

Simplest usage: repeated looping with map

Fundamentally, maps are for iteration. In the example below I will iterate through the vector c(1, 4, 7) by adding 10 to each entry. This function applied to a single number, which we will call .x, can be defined as

addTen <- function(.x) {
  return(.x + 10)
}

The map() function below iterates addTen() across all entries of the vector, .x = c(1, 4, 7), and returns the output as a list

library(tidyverse)
map(.x = c(1, 4, 7), 
    .f = addTen)
[[1]]
[1] 11

[[2]]
[1] 14

[[3]]
[1] 17

Fortunately, you don’t actually need to specify the argument names

map(c(1, 4, 7), addTen)
[[1]]
[1] 11

[[2]]
[1] 14

[[3]]
[1] 17

Note that

  • the first element of the output is the result of applying the function to the first element of the input (1),

  • the second element of the output is the result of applying the function to the second element of the input (4),

  • and the third element of the output is the result of applying the function to the third element of the input (7).

The following code chunks show that no matter if the input object is a vector, a list, or a data frame, map() always returns a list.

map(list(1, 4, 7), addTen)
[[1]]
[1] 11

[[2]]
[1] 14

[[3]]
[1] 17
map(data.frame(a = 1, b = 4, c = 7), addTen)
$a
[1] 11

$b
[1] 14

$c
[1] 17

If you want the output of the map to be some other object type, you need to use a different function. For instance, to map the input to a numeric (double) vector, you can use the map_dbl() (“map to a double”) function.

map_dbl(c(1, 4, 7), addTen)
[1] 11 14 17

To map to a character vector, you can use the map_chr() (“map to a character”) function.

map_chr(c(1, 4, 7), addTen)
Warning: Automatic coercion from double to character was deprecated in purrr 1.0.0.
ℹ Please use an explicit call to `as.character()` within `map_chr()` instead.
[1] "11.000000" "14.000000" "17.000000"
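The warning above appears because addTen() returns a number, not a character. Following the advice in the warning message, a minimal sketch of the explicit version converts the result with as.character() inside the mapped function (addTen() is redefined here so the chunk runs on its own):

```r
library(purrr)

addTen <- function(.x) {
  .x + 10
}

# convert to character explicitly, as the warning suggests
result <- map_chr(c(1, 4, 7), ~ as.character(addTen(.x)))
result
```

Note that the explicit conversion gives "11", "14" and "17" rather than the padded "11.000000" style strings produced by the deprecated automatic coercion.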

If you want to return a data frame, then you would use the map_df() function. However, you need to make sure that in each iteration you’re returning a data frame which has consistent column names. map_df will automatically bind the rows of each iteration.

For this example, I want to return a data frame whose columns correspond to the original number and the number plus ten.

map_df(c(1, 4, 7), function(.x) {
  return(data.frame(old_number = .x, 
                    new_number = addTen(.x)))
})
  old_number new_number
1          1         11
2          4         14
3          7         17

Note that in this case, I defined an “anonymous” function to produce the output for each iteration. An anonymous function is a temporary, unnamed function that you define directly as the function argument of the map. Here I used the argument name .x, but I could have used anything.
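One caveat if you are running a recent purrr: the deprecation warning shown earlier mentions purrr 1.0.0, and from that version map_df() is marked as superseded in favour of map() followed by list_rbind(). The sketch below assumes purrr 1.0.0 or later is installed and produces the same data frame in two explicit steps (the names rows and combined are just for illustration):

```r
library(purrr)

# step 1: map() returns a list with one small data frame per iteration
rows <- map(c(1, 4, 7), function(.x) {
  data.frame(old_number = .x, new_number = .x + 10)
})

# step 2: bind the data frames together row-wise
combined <- list_rbind(rows)
combined
```

map_df() still works, so the rest of this post keeps using it.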

Another function to be aware of is modify(), which is just like the map functions, but always returns an object of the same type as the input object.

library(tidyverse)
modify(c(1, 4, 7), addTen)
[1] 11 14 17
modify(list(1, 4, 7), addTen)
[[1]]
[1] 11

[[2]]
[1] 14

[[3]]
[1] 17
modify(data.frame(1, 4, 7), addTen)
  X1 X4 X7
1 11 14 17

Modify also has a pretty useful sibling, modify_if(), that only applies the function to elements that satisfy a specific criterion (specified by a “predicate function”, the second argument called .p). For instance, the following example only modifies the third entry since it is the only one greater than 5.

modify_if(.x = list(1, 4, 7), 
          .p = function(x) x > 5,
          .f = addTen)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 17

The tilde-dot shorthand for functions

To make the code more concise you can use the tilde-dot shorthand for anonymous functions (the functions that you create as arguments of other functions).

The notation works by replacing

function(x) {
  x + 10
}

with

~{.x + 10}

~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments that can be anything that you like, the tilde-dot function argument is always .x.

Thus, instead of defining the addTen() function separately, we could use the tilde-dot shorthand

map_dbl(c(1, 4, 7), ~{.x + 10})
[1] 11 14 17
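To see that these are all just different spellings of the same thing, here is a small side-by-side sketch. The last form uses base R's backslash lambda, which assumes R 4.1 or later; the object names are only for illustration.

```r
library(purrr)

x <- c(1, 4, 7)

full_function <- map_dbl(x, function(x) x + 10)  # a full anonymous function
tilde_dot_x   <- map_dbl(x, ~ .x + 10)           # tilde-dot, argument is .x
tilde_dot     <- map_dbl(x, ~ . + 10)            # tilde-dot, argument is .
lambda        <- map_dbl(x, \(x) x + 10)         # base R shorthand (R >= 4.1)

identical(full_function, tilde_dot_x)
```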

Applying map functions in a slightly more interesting context

Throughout this tutorial, we will use the gapminder dataset that can be loaded directly if you’re connected to the internet. Each function will first be demonstrated using a simple numeric example, and then will be demonstrated using a more complex practical example based on the gapminder dataset.

My general workflow involves loading the original data and saving it as an object with a meaningful name and an _orig suffix. I then define a copy of the original dataset without the _orig suffix. Having an original copy of my data in my environment means that it is easy to check that my manipulations do what I expected. I will make direct data cleaning modifications to the gapminder data frame, but will never edit the gapminder_orig data frame.

# to download the data directly:
gapminder_orig <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv")
# define a copy of the original dataset that we will clean and play with 
gapminder <- gapminder_orig

The gapminder dataset has 1704 rows containing information on population, life expectancy and GDP per capita by year and country.

A “tidy” data frame is one where every row is a single observational unit (in this case, indexed by country and year), and every column corresponds to a variable that is measured for each observational unit (in this case, for each country and year, a measurement is made for population, continent, life expectancy and GDP). If you’d like to learn more about “tidy data”, I highly recommend reading Hadley Wickham’s tidy data article.

dim(gapminder)
[1] 1704    6
head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Since gapminder is a data frame, the map_ functions will iterate over each column. An example of simple usage of the map_ functions is to summarize each column. For instance, you can identify the type of each column by applying the class() function to each column. Since the output of the class() function is a character, we will use the map_chr() function:

# apply the class() function to each column
gapminder %>% map_chr(class)
    country        year         pop   continent     lifeExp   gdpPercap 
"character"   "integer"   "numeric" "character"   "numeric"   "numeric" 

I frequently do this to get a quick snapshot of each column type of a new dataset directly in the console. As a habit, I usually pipe in the data using %>%, rather than provide it as an argument. Remember that the pipe places the object to the left of the pipe in the first argument of the function to the right.

Similarly, if you wanted to identify the number of distinct values in each column, you could apply the n_distinct() function from the dplyr package to each column. Since n_distinct() returns a single number (strictly an integer, which map_dbl() happily converts to a double), you can use the map_dbl() function so that the results of each iteration (the application of n_distinct() to each column) are concatenated into a numeric vector:

# apply the n_distinct() function to each column
gapminder %>% map_dbl(n_distinct)
  country      year       pop continent   lifeExp gdpPercap 
      142        12      1704         5      1626      1704 

If you want to do something a little more complicated, such as returning a few different summaries of each column in a data frame, you can use map_df(). When things get a little more complicated, you typically need to define an anonymous function that you want to apply to each column. Using the tilde-dot notation, the anonymous function below calculates the number of distinct entries and the type of the current column (which is accessible as .x), and then combines them into a two-column data frame. Once it has iterated through each of the columns, the map_df function combines the data frames row-wise into a single data frame.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))))
  n_distinct     class
1        142 character
2         12   integer
3       1704   numeric
4          5 character
5       1626   numeric
6       1704   numeric

Note that we’ve lost the variable names! The variable names correspond to the names of the objects over which we are iterating (in this case, the column names), and these are not automatically included as a column in the output data frame. You can tell map_df() to include them using the .id argument of map_df(). This will automatically take the name of the element being iterated over and include it in the column corresponding to whatever you set .id to.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
   variable n_distinct     class
1   country        142 character
2      year         12   integer
3       pop       1704   numeric
4 continent          5 character
5   lifeExp       1626   numeric
6 gdpPercap       1704   numeric

If you’re having trouble thinking through these map actions, I recommend that you first figure out what the code would be to do what you want for a single element, and then paste it into the map_df() function (a nice trick I saw Hadley Wickham use a few years ago when he presented on purrr at RLadies SF).

For instance, since the first element of the gapminder data frame is the first column, let’s define .x in our environment to be this first column.

# take the first element of the gapminder data
.x <- gapminder %>% pluck(1)
# look at the first 6 rows
head(.x)
[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
[6] "Afghanistan"

Then, you can create a data frame for this column that contains the number of distinct entries, and the class of the column.

data.frame(n_distinct = n_distinct(.x),
           class = class(.x))
  n_distinct     class
1        142 character

Since this has done what you wanted for the first column, you can paste this code into the map function using the tilde-dot shorthand.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
   variable n_distinct     class
1   country        142 character
2      year         12   integer
3       pop       1704   numeric
4 continent          5 character
5   lifeExp       1626   numeric
6 gdpPercap       1704   numeric

map_df() is definitely one of the most powerful functions of purrr in my opinion, and is probably the one that I use most.

Maps with multiple input objects

After gaining a basic understanding of purrr’s map functions, you can start to do some fancier stuff. For instance, what if you want to perform a map that iterates through two objects? The code below uses map functions to create a list of plots that compare life expectancy and GDP per capita for each continent/year combination.

The map function that maps over two objects instead of one is called map2(). The first two arguments are the two objects you want to iterate over, and the third is the function (with two arguments, one for each object).

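# schematic only: object1, object2 and plotFunction are placeholders
map2(.x = object1, # the first object to iterate over
     .y = object2, # the second object to iterate over
     .f = plotFunction) # a function of two arguments, called as plotFunction(.x, .y)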

First, you need to define a vector (or list) of continents and a paired vector (or list) of years that you want to iterate through. Note that in our continent/year example

  • the first iteration will correspond to the first continent in the continent vector and the first year in the year vector,

  • the second iteration will correspond to the second continent in the continent vector and the second year in the year vector.

This might seem obvious, but it is a natural instinct to incorrectly assume that map2() will automatically perform the action on all combinations that can be made from the two vectors. For instance if you have a continent vector .x = c("Americas", "Asia") and a year vector .y = c(1952, 2007), then you might assume that map2 will iterate over the Americas for 1952 and for 2007, and then Asia for 1952 and 2007. It won’t though. The iteration will actually be first the Americas for 1952 only, and then Asia for 2007 only.
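Here is a small sketch of that behaviour using the two vectors just described. The second half shows one possible way to get every combination if that is what you actually want: build the combinations first with base R's expand.grid() and then map over the resulting paired columns. The object names are made up for illustration.

```r
library(purrr)

continents_small <- c("Americas", "Asia")
years_small <- c(1952, 2007)

# map2() pairs the inputs up position by position: two iterations
paired <- map2_chr(continents_small, years_small, ~ paste(.x, .y))
paired

# to iterate over all combinations, create them explicitly first
all_combos <- expand.grid(continent = continents_small,
                          year = years_small,
                          stringsAsFactors = FALSE)
crossed <- map2_chr(all_combos$continent, all_combos$year, ~ paste(.x, .y))
crossed
```

The first call performs two iterations and the second performs four, one for each row of all_combos.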

First, let’s get our vectors of continents and years, starting by obtaining all distinct combinations of continents and years that appear in the data.

continent_year <- gapminder %>% distinct(continent, year)
continent_year
   continent year
1       Asia 1952
2       Asia 1957
3       Asia 1962
4       Asia 1967
5       Asia 1972
6       Asia 1977
7       Asia 1982
8       Asia 1987
9       Asia 1992
10      Asia 1997
11      Asia 2002
12      Asia 2007
13    Europe 1952
14    Europe 1957
15    Europe 1962
16    Europe 1967
17    Europe 1972
18    Europe 1977
19    Europe 1982
20    Europe 1987
21    Europe 1992
22    Europe 1997
23    Europe 2002
24    Europe 2007
25    Africa 1952
26    Africa 1957
27    Africa 1962
28    Africa 1967
29    Africa 1972
30    Africa 1977
31    Africa 1982
32    Africa 1987
33    Africa 1992
34    Africa 1997
35    Africa 2002
36    Africa 2007
37  Americas 1952
38  Americas 1957
39  Americas 1962
40  Americas 1967
41  Americas 1972
42  Americas 1977
43  Americas 1982
44  Americas 1987
45  Americas 1992
46  Americas 1997
47  Americas 2002
48  Americas 2007
49   Oceania 1952
50   Oceania 1957
51   Oceania 1962
52   Oceania 1967
53   Oceania 1972
54   Oceania 1977
55   Oceania 1982
56   Oceania 1987
57   Oceania 1992
58   Oceania 1997
59   Oceania 2002
60   Oceania 2007

Then extract the continent and year pairs as separate vectors.

# extract the continent and year pairs as separate vectors
continents <- continent_year %>% pull(continent) %>% as.character
years <- continent_year %>% pull(year)

If you want to use the tilde-dot shorthand, the anonymous arguments will be .x for the first object being iterated over, and .y for the second object being iterated over.

Before jumping straight into the map function, it’s a good idea to first figure out what the code will be for just the first iteration (the first continent and the first year, which happen to be Asia in 1952).

# try to figure out the code for the first example
.x <- continents[1]
.y <- years[1]
# make a scatterplot of GDP vs life expectancy in all Asian countries for 1952
gapminder %>% 
  filter(continent == .x,
         year == .y) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  ggtitle(glue::glue(.x, " ", .y))

This seems to have worked. So you can then copy-and-paste the code into the map2 function

plot_list <- map2(.x = continents, 
                  .y = years, 
                  .f = ~{
                    gapminder %>% 
                      filter(continent == .x,
                             year == .y) %>%
                      ggplot() +
                      geom_point(aes(x = gdpPercap, y = lifeExp)) +
                      ggtitle(glue::glue(.x, " ", .y))
                  })

And you can look at a few of the entries of the list to see that they make sense

plot_list[[1]]

plot_list[[22]]

pmap() allows you to iterate over an arbitrary number of objects (i.e. more than two).
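As a minimal sketch of how that looks, pmap() takes a single list containing all of the objects to iterate over, and if the list is named, the names are matched to the arguments of the function. The inputs below (including the made-up suffix vector) are purely for illustration.

```r
library(purrr)

# three parallel inputs, bundled into one named list
inputs <- list(
  continent = c("Asia", "Europe", "Africa"),
  year      = c(1952, 1977, 2007),
  suffix    = c("a", "b", "c")
)

# the function's argument names match the names of the list
labels <- pmap_chr(inputs, function(continent, year, suffix) {
  paste0(continent, " ", year, " (", suffix, ")")
})
labels
```

Just like map2(), the iteration is position by position: the first iteration uses the first entry of every input, and so on. The same output-type suffixes (such as _chr and _dbl) apply.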

List columns and Nested data frames

Tibbles are tidyverse data frames. Some crazy stuff starts happening when you learn that tibble columns can be lists (as opposed to vectors, which is what they usually are). This is where the difference between tibbles and data frames becomes real.

For instance, a tibble can be “nested” where the tibble is essentially split into separate data frames based on a grouping variable, and these separate data frames are stored as entries of a list (that is then stored in the data column of the data frame).

Below I nest the gapminder data by continent.

gapminder_nested <- gapminder %>% 
  group_by(continent) %>% 
  nest()
gapminder_nested
# A tibble: 5 × 2
# Groups:   continent [5]
  continent data              
  <chr>     <list>            
1 Asia      <tibble [396 × 5]>
2 Europe    <tibble [360 × 5]>
3 Africa    <tibble [624 × 5]>
4 Americas  <tibble [300 × 5]>
5 Oceania   <tibble [24 × 5]> 

The first column is the variable that we grouped by, continent, and the second column is the rest of the data frame corresponding to that group (as if you had filtered the data frame to the specific continent). To see this, the code below shows that the first entry in the data column corresponds to the entire gapminder dataset for Asia.

gapminder_nested$data[[1]]
# A tibble: 396 × 5
   country      year      pop lifeExp gdpPercap
   <chr>       <int>    <dbl>   <dbl>     <dbl>
 1 Afghanistan  1952  8425333    28.8      779.
 2 Afghanistan  1957  9240934    30.3      821.
 3 Afghanistan  1962 10267083    32.0      853.
 4 Afghanistan  1967 11537966    34.0      836.
 5 Afghanistan  1972 13079460    36.1      740.
 6 Afghanistan  1977 14880372    38.4      786.
 7 Afghanistan  1982 12881816    39.9      978.
 8 Afghanistan  1987 13867957    40.8      852.
 9 Afghanistan  1992 16317921    41.7      649.
10 Afghanistan  1997 22227415    41.8      635.
# … with 386 more rows

Using purrr’s pluck() function, this can be written as

gapminder_nested %>% 
  # extract the first entry from the data column
  pluck("data", 1)
# A tibble: 396 × 5
   country      year      pop lifeExp gdpPercap
   <chr>       <int>    <dbl>   <dbl>     <dbl>
 1 Afghanistan  1952  8425333    28.8      779.
 2 Afghanistan  1957  9240934    30.3      821.
 3 Afghanistan  1962 10267083    32.0      853.
 4 Afghanistan  1967 11537966    34.0      836.
 5 Afghanistan  1972 13079460    36.1      740.
 6 Afghanistan  1977 14880372    38.4      786.
 7 Afghanistan  1982 12881816    39.9      978.
 8 Afghanistan  1987 13867957    40.8      852.
 9 Afghanistan  1992 16317921    41.7      649.
10 Afghanistan  1997 22227415    41.8      635.
# … with 386 more rows

Similarly, the 5th entry in the data column corresponds to the entire gapminder dataset for Oceania.

gapminder_nested %>% pluck("data", 5)
# A tibble: 24 × 5
   country    year      pop lifeExp gdpPercap
   <chr>     <int>    <dbl>   <dbl>     <dbl>
 1 Australia  1952  8691212    69.1    10040.
 2 Australia  1957  9712569    70.3    10950.
 3 Australia  1962 10794968    70.9    12217.
 4 Australia  1967 11872264    71.1    14526.
 5 Australia  1972 13177000    71.9    16789.
 6 Australia  1977 14074100    73.5    18334.
 7 Australia  1982 15184200    74.7    19477.
 8 Australia  1987 16257249    76.3    21889.
 9 Australia  1992 17481977    77.6    23425.
10 Australia  1997 18565243    78.8    26998.
# … with 14 more rows
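Nesting doesn't throw anything away, and you can always get back to the flat data frame with tidyr's unnest(). The sketch below uses a tiny tibble with made-up values (not the real gapminder numbers) so that it runs without downloading anything:

```r
library(dplyr)
library(tidyr)

# a tiny stand-in for gapminder, with made-up values
toy <- tibble(
  continent = c("Asia", "Asia", "Oceania"),
  country   = c("Japan", "India", "Australia"),
  lifeExp   = c(80, 65, 81)
)

# nest: one row per continent, with the rest stored in a list-column
toy_nested <- toy %>%
  group_by(continent) %>%
  nest()

# unnest: re-expand the list-column back into ordinary rows
toy_unnested <- toy_nested %>%
  unnest(data) %>%
  ungroup()

toy_unnested
```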

You might be asking at this point why you would ever want to nest your data frame? It just doesn’t seem like that useful a thing to do… until you realise that you now have the power to use dplyr manipulations on more complex objects that can be stored in a list.

However, since actions such as mutate() are applied directly to the entire column (which is usually a vector, so is fine), we run into issues when we try to mutate a list. For instance, since columns are usually vectors, normal vectorized functions work just fine on them

tibble(vec_col = 1:10) %>%
  mutate(vec_sum = sum(vec_col))
# A tibble: 10 × 2
   vec_col vec_sum
     <int>   <int>
 1       1      55
 2       2      55
 3       3      55
 4       4      55
 5       5      55
 6       6      55
 7       7      55
 8       8      55
 9       9      55
10      10      55

but when the column is a list, vectorized functions don’t know what to do with them, and we get an error that says Error in sum(x) : invalid 'type' (list) of argument. Try

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = sum(list_col))

To apply mutate functions to a list-column, you need to wrap the function you want to apply in a map function.

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map(list_col, sum))
# A tibble: 3 × 2
  list_col  list_sum 
  <list>    <list>   
1 <dbl [3]> <dbl [1]>
2 <dbl [1]> <dbl [1]>
3 <dbl [3]> <dbl [1]>

Since map() returns a list itself, the list_sum column is thus itself a list

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map(list_col, sum)) %>% 
  pull(list_sum)
[[1]]
[1] 13

[[2]]
[1] 5

[[3]]
[1] 31

What could we do if we wanted it to be a vector? We could use the map_dbl() function instead!

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map_dbl(list_col, sum))
# A tibble: 3 × 2
  list_col  list_sum
  <list>       <dbl>
1 <dbl [3]>       13
2 <dbl [1]>        5
3 <dbl [3]>       31

Nesting the gapminder data

Let’s return to the nested gapminder dataset. I want to calculate the average life expectancy within each continent and add it as a new column using mutate(). Based on the example above, can you explain why the following code doesn’t work?

gapminder_nested %>% 
  mutate(avg_lifeExp = mean(data$lifeExp))

I was hoping that this code would extract the lifeExp column from each data frame. But I’m applying the mutate to the data column, which itself doesn’t have an entry called lifeExp since it’s a list of data frames. How could I get access to the lifeExp column of the data frames stored in the data list? Using a map function of course!

Think of an individual data frame as .x. Again, I will first figure out the code for calculating the mean life expectancy for the first entry of the column. The following code defines .x to be the first entry of the data column (this is the data frame for Asia).

# the first entry of the "data" column
.x <- gapminder_nested %>% pluck("data", 1)
.x
# A tibble: 396 × 5
   country      year      pop lifeExp gdpPercap
   <chr>       <int>    <dbl>   <dbl>     <dbl>
 1 Afghanistan  1952  8425333    28.8      779.
 2 Afghanistan  1957  9240934    30.3      821.
 3 Afghanistan  1962 10267083    32.0      853.
 4 Afghanistan  1967 11537966    34.0      836.
 5 Afghanistan  1972 13079460    36.1      740.
 6 Afghanistan  1977 14880372    38.4      786.
 7 Afghanistan  1982 12881816    39.9      978.
 8 Afghanistan  1987 13867957    40.8      852.
 9 Afghanistan  1992 16317921    41.7      649.
10 Afghanistan  1997 22227415    41.8      635.
# … with 386 more rows

Then to calculate the average life expectancy for Asia, I could write

mean(.x$lifeExp)
[1] 60.0649

So copy-pasting this into the tilde-dot anonymous function argument of the map_dbl() function within mutate(), I get what I wanted!

gapminder_nested %>% 
  mutate(avg_lifeExp = map_dbl(data, ~{mean(.x$lifeExp)}))
# A tibble: 5 × 3
# Groups:   continent [5]
  continent data               avg_lifeExp
  <chr>     <list>                   <dbl>
1 Asia      <tibble [396 × 5]>        60.1
2 Europe    <tibble [360 × 5]>        71.9
3 Africa    <tibble [624 × 5]>        48.9
4 Americas  <tibble [300 × 5]>        64.7
5 Oceania   <tibble [24 × 5]>         74.3

This code iterates through the data frames stored in the data column, returns the average life expectancy for each data frame, and concatenates the results into a numeric vector (which is then stored as a column called avg_lifeExp).

I hear what you’re saying… this is something that we could have done a lot more easily using standard dplyr commands (such as summarise()). True, but hopefully it helped you understand why you need to wrap mutate functions inside map functions when applying them to list columns.
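For comparison, here is a sketch of the two routes side by side, using a tiny made-up tibble rather than the real gapminder data (so the averages below are not real life expectancies):

```r
library(dplyr)
library(tidyr)
library(purrr)

# a tiny stand-in for gapminder, with made-up values
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(40, 60, 70, 80)
)

# the standard dplyr route: no nesting needed
via_summarise <- toy %>%
  group_by(continent) %>%
  summarise(avg_lifeExp = mean(lifeExp))

# the nested route: map over the data frames in the list-column
via_nest <- toy %>%
  group_by(continent) %>%
  nest() %>%
  mutate(avg_lifeExp = map_dbl(data, ~ mean(.x$lifeExp))) %>%
  ungroup() %>%
  select(continent, avg_lifeExp)

via_summarise
via_nest
```

Both give one average per continent. The nested route only starts to pay off when the thing you want to store for each group is more complicated than a single number.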

Even if this example was less than inspiring, I promise the next example will knock your socks off!

The next example will demonstrate how to fit a model separately for each continent, and evaluate it, all within a single tibble. First, I will fit a linear model for each continent and store it as a list-column. If the data frame for a single continent is .x, then the model I want to fit is lm(lifeExp ~ pop + gdpPercap + year, data = .x) (check for yourself that this does what you expect). So I can copy-paste this command into the map() function within the mutate()

# fit a model separately for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + gdpPercap + year, data = .x)))
gapminder_nested
# A tibble: 5 × 3
# Groups:   continent [5]
  continent data               lm_obj
  <chr>     <list>             <list>
1 Asia      <tibble [396 × 5]> <lm>  
2 Europe    <tibble [360 × 5]> <lm>  
3 Africa    <tibble [624 × 5]> <lm>  
4 Americas  <tibble [300 × 5]> <lm>  
5 Oceania   <tibble [24 × 5]>  <lm>  

Where the first linear model (for Asia) is

gapminder_nested %>% pluck("lm_obj", 1)

Call:
lm(formula = lifeExp ~ pop + gdpPercap + year, data = .x)

Coefficients:
(Intercept)          pop    gdpPercap         year  
 -7.833e+02    4.228e-11    2.510e-04    4.251e-01  

I can then predict the response for the data stored in the data column using the corresponding linear model. So I have two objects I want to iterate over: the data and the linear model object. This means I want to use map2(). When things get a little more complicated I like to have multiple function arguments, so I’m going to use a full anonymous function rather than the tilde-dot shorthand.

# predict the response for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(pred = map2(lm_obj, data, function(.lm, .data) predict(.lm, .data)))
gapminder_nested
# A tibble: 5 × 4
# Groups:   continent [5]
  continent data               lm_obj pred       
  <chr>     <list>             <list> <list>     
1 Asia      <tibble [396 × 5]> <lm>   <dbl [396]>
2 Europe    <tibble [360 × 5]> <lm>   <dbl [360]>
3 Africa    <tibble [624 × 5]> <lm>   <dbl [624]>
4 Americas  <tibble [300 × 5]> <lm>   <dbl [300]>
5 Oceania   <tibble [24 × 5]>  <lm>   <dbl [24]> 

And I can then calculate the correlation between the predicted response and the true response, this time using the map2_dbl() function since I want the output to be a numeric vector rather than a list of single elements.

# calculate the correlation between observed and predicted response for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(cor = map2_dbl(pred, data, function(.pred, .data) cor(.pred, .data$lifeExp)))
gapminder_nested
# A tibble: 5 × 5
# Groups:   continent [5]
  continent data               lm_obj pred          cor
  <chr>     <list>             <list> <list>      <dbl>
1 Asia      <tibble [396 × 5]> <lm>   <dbl [396]> 0.723
2 Europe    <tibble [360 × 5]> <lm>   <dbl [360]> 0.834
3 Africa    <tibble [624 × 5]> <lm>   <dbl [624]> 0.645
4 Americas  <tibble [300 × 5]> <lm>   <dbl [300]> 0.779
5 Oceania   <tibble [24 × 5]>  <lm>   <dbl [24]>  0.987

Holy guacamole, that is so awesome!
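If you'd like to try this nest-then-map2() pattern without the gapminder data, here is a minimal sketch on R's built-in mtcars dataset. The model (mpg ~ wt), the grouping by cyl, and the column names cor_obs_pred and r_squared are choices made purely for illustration and are not part of the analysis above. As a sanity check, it relies on the fact that, for a linear model with an intercept, the squared correlation between the fitted and observed values on the training data equals the model's R-squared.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Illustrative only: nest mtcars by number of cylinders, then fit,
# predict and evaluate one model per group.
mtcars_nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # one linear model per nested data frame
    lm_obj = map(data, ~ lm(mpg ~ wt, data = .x)),
    # iterate over the model and its data in parallel
    pred = map2(lm_obj, data, function(.lm, .df) predict(.lm, .df)),
    # map2_dbl() returns a numeric vector rather than a list
    cor_obs_pred = map2_dbl(pred, data, function(.pred, .df) cor(.pred, .df$mpg)),
    # R-squared of each model, for comparison with cor_obs_pred^2
    r_squared = map_dbl(lm_obj, ~ summary(.x)$r.squared)
  )

mtcars_nested %>% select(cyl, cor_obs_pred, r_squared)
```

If everything worked, cor_obs_pred squared should match r_squared for every row.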

Advanced exercise

The goal of this exercise is to fit a separate linear model for each continent without splitting up the data. Create the following data frame that has the continent, each term in the model for the continent, and its linear model coefficient estimate, standard error, test statistic, and p-value.

# A tibble: 20 × 6
   continent term         estimate std.error statistic  p.value
   <chr>     <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 Asia      (Intercept) -7.83e+ 2   4.83e+1  -16.2    1.22e-45
 2 Asia      pop          4.23e-11   2.04e-9    0.0207 9.83e- 1
 3 Asia      year         4.25e- 1   2.44e-2   17.4    1.13e-50
 4 Asia      gdpPercap    2.51e- 4   3.01e-5    8.34   1.31e-15
 5 Europe    (Intercept) -1.61e+ 2   2.28e+1   -7.09   7.44e-12
 6 Europe    pop         -8.18e- 9   7.80e-9   -1.05   2.95e- 1
 7 Europe    year         1.16e- 1   1.16e-2    9.96   8.88e-21
 8 Europe    gdpPercap    3.25e- 4   2.15e-5   15.2    2.21e-40
 9 Africa    (Intercept) -4.70e+ 2   3.39e+1  -13.9    2.17e-38
10 Africa    pop         -3.68e- 9   1.89e-8   -0.195  8.45e- 1
11 Africa    year         2.61e- 1   1.71e-2   15.2    1.07e-44
12 Africa    gdpPercap    1.12e- 3   1.01e-4   11.1    2.46e-26
13 Americas  (Intercept) -5.33e+ 2   4.10e+1  -13.0    6.40e-31
14 Americas  pop         -2.15e- 8   8.62e-9   -2.49   1.32e- 2
15 Americas  year         3.00e- 1   2.08e-2   14.4    3.79e-36
16 Americas  gdpPercap    6.75e- 4   7.15e-5    9.44   1.13e-18
17 Oceania   (Intercept) -2.10e+ 2   5.12e+1   -4.10   5.61e- 4
18 Oceania   pop          8.37e- 9   3.34e-8    0.251  8.05e- 1
19 Oceania   year         1.42e- 1   2.65e-2    5.34   3.19e- 5
20 Oceania   gdpPercap    2.03e- 4   8.47e-5    2.39   2.66e- 2

Hint: starting from the gapminder dataset, use group_by() and nest() to nest by continent, use a mutate together with map to fit a linear model for each continent, use another mutate with broom::tidy() to get a data frame of model coefficients for each model, and a transmute to get just the columns you want, followed by an unnest() to re-expand the nested tibble.

The solution code is at the end of this post.

If you want to stop here, you will already know more than most purrr users. The remainder of this blog post involves little-used features of purrr for manipulating lists.

Additional purrr functionalities for lists

To demonstrate how to use purrr to manipulate lists, we will split the gapminder dataset into a list of data frames (which is kind of like the converse of a data frame containing a list-column). To make sure it’s easy to follow, we will only keep 5 rows from each continent.

set.seed(23489)
gapminder_list <- gapminder %>% split(gapminder$continent) %>%
  map(~sample_n(., 5))
gapminder_list
$Africa
            country year      pop continent lifeExp gdpPercap
1            Gambia 1967   439593    Africa  35.857  734.7829
2      Sierra Leone 1967  2662190    Africa  34.113 1206.0435
3           Namibia 1997  1774766    Africa  58.909 3899.5243
4 Equatorial Guinea 1992   387838    Africa  47.545 1132.0550
5     Cote d'Ivoire 2002 16252726    Africa  46.832 1648.8008

$Americas
             country year     pop continent lifeExp gdpPercap
1 Dominican Republic 1997 7992357  Americas  69.957  3614.101
2        Puerto Rico 1987 3444468  Americas  74.630 12281.342
3           Honduras 1992 5077347  Americas  66.399  3081.695
4            Uruguay 2007 3447496  Americas  76.384 10611.463
5         Costa Rica 1962 1345187  Americas  62.842  3460.937

$Asia
     country year       pop continent lifeExp gdpPercap
1    Lebanon 1967   2186894      Asia  63.870 6006.9830
2      Nepal 1962  10332057      Asia  39.393  652.3969
3 Yemen Rep. 1992  13367997      Asia  55.599 1879.4967
4      India 1972 567000000      Asia  50.651  724.0325
5   Cambodia 1952   4693836      Asia  39.417  368.4693

$Europe
         country year      pop continent lifeExp gdpPercap
1 United Kingdom 2002 59912431    Europe  78.471  29479.00
2         Greece 1997 10502372    Europe  77.869  18747.70
3        Belgium 2002 10311970    Europe  78.320  30485.88
4        Croatia 2002  4481020    Europe  74.876  11628.39
5    Netherlands 1967 12596822    Europe  73.820  15363.25

$Oceania
      country year      pop continent lifeExp gdpPercap
1   Australia 1982 15184200   Oceania  74.740  19477.01
2 New Zealand 1997  3676187   Oceania  77.550  21050.41
3 New Zealand 2007  4115771   Oceania  80.204  25185.01
4   Australia 2007 20434176   Oceania  81.235  34435.37
5 New Zealand 1952  1994794   Oceania  69.390  10556.58

Keep/Discard: select_if for lists

keep() only keeps elements of a list that satisfy a given condition, much like select_if() selects columns of a data frame that satisfy a given condition.

The following code only keeps the gapminder continent data frames (the elements of the list) that have an average (among the sample of 5 rows) life expectancy greater than 70.

gapminder_list %>%
  keep(~{mean(.x$lifeExp) > 70})
$Americas
             country year     pop continent lifeExp gdpPercap
1 Dominican Republic 1997 7992357  Americas  69.957  3614.101
2        Puerto Rico 1987 3444468  Americas  74.630 12281.342
3           Honduras 1992 5077347  Americas  66.399  3081.695
4            Uruguay 2007 3447496  Americas  76.384 10611.463
5         Costa Rica 1962 1345187  Americas  62.842  3460.937

$Europe
         country year      pop continent lifeExp gdpPercap
1 United Kingdom 2002 59912431    Europe  78.471  29479.00
2         Greece 1997 10502372    Europe  77.869  18747.70
3        Belgium 2002 10311970    Europe  78.320  30485.88
4        Croatia 2002  4481020    Europe  74.876  11628.39
5    Netherlands 1967 12596822    Europe  73.820  15363.25

$Oceania
      country year      pop continent lifeExp gdpPercap
1   Australia 1982 15184200   Oceania  74.740  19477.01
2 New Zealand 1997  3676187   Oceania  77.550  21050.41
3 New Zealand 2007  4115771   Oceania  80.204  25185.01
4   Australia 2007 20434176   Oceania  81.235  34435.37
5 New Zealand 1952  1994794   Oceania  69.390  10556.58

discard() does the opposite of keep(): it discards any elements that satisfy your logical condition.
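To see keep() and discard() side by side, here is a small sketch on a toy list. The list toy_list and its column x are made up for illustration and have nothing to do with the gapminder data.

```r
library(purrr)

# A hypothetical list of three small data frames
toy_list <- list(
  a = data.frame(x = c(1, 2, 3)),     # mean of x is 2
  b = data.frame(x = c(10, 20, 30)),  # mean of x is 20
  c = data.frame(x = c(4, 5, 6))      # mean of x is 5
)

# keep() retains the elements for which the condition is TRUE
kept <- toy_list %>% keep(~ mean(.x$x) > 4)
names(kept)     # "b" "c"

# discard() drops those same elements and returns the rest
dropped <- toy_list %>% discard(~ mean(.x$x) > 4)
names(dropped)  # "a"
```

Between them, the two results always account for every element of the original list.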

Reduce

reduce() is designed to combine (reduce) all of the elements of a list into a single object by iteratively applying a binary function (a function that takes two inputs).

For instance, applying a reduce function to add up all of the elements of the vector c(1, 2, 3) is like doing sum(sum(1, 2), 3): first it applies sum to 1 and 2, then it applies sum again to the output of sum(1, 2) and 3.

reduce(c(1, 2, 3), sum)
[1] 6

accumulate() also returns the intermediate values.

accumulate(c(1, 2, 3), sum)
[1] 1 3 6
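The order in which reduce() combines things is easier to see with a function that records its inputs. The helper show_step() below is a made-up function for illustration only: it pastes its two arguments together inside brackets so the nesting becomes visible.

```r
library(purrr)

# Hypothetical binary function: records how its two inputs were combined
show_step <- function(acc, x) paste0("(", acc, "+", x, ")")

# reduce() returns only the final result
final <- reduce(c("a", "b", "c"), show_step)
final   # "((a+b)+c)"

# accumulate() returns every intermediate result, starting with the first element
steps <- accumulate(c("a", "b", "c"), show_step)
steps   # "a" "(a+b)" "((a+b)+c)"
```

The first argument of the binary function is the running result so far, and the second is the next element of the input.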

An example of when reduce() might come in handy is when you want to perform many left_join()s in a row, or to do repeated rbind()s (e.g. to bind the rows of the list back together into a single data frame).

gapminder_list %>%
  reduce(rbind)
              country year       pop continent lifeExp  gdpPercap
1              Gambia 1967    439593    Africa  35.857   734.7829
2        Sierra Leone 1967   2662190    Africa  34.113  1206.0435
3             Namibia 1997   1774766    Africa  58.909  3899.5243
4   Equatorial Guinea 1992    387838    Africa  47.545  1132.0550
5       Cote d'Ivoire 2002  16252726    Africa  46.832  1648.8008
6  Dominican Republic 1997   7992357  Americas  69.957  3614.1013
7         Puerto Rico 1987   3444468  Americas  74.630 12281.3419
8            Honduras 1992   5077347  Americas  66.399  3081.6946
9             Uruguay 2007   3447496  Americas  76.384 10611.4630
10         Costa Rica 1962   1345187  Americas  62.842  3460.9370
11            Lebanon 1967   2186894      Asia  63.870  6006.9830
12              Nepal 1962  10332057      Asia  39.393   652.3969
13         Yemen Rep. 1992  13367997      Asia  55.599  1879.4967
14              India 1972 567000000      Asia  50.651   724.0325
15           Cambodia 1952   4693836      Asia  39.417   368.4693
16     United Kingdom 2002  59912431    Europe  78.471 29478.9992
17             Greece 1997  10502372    Europe  77.869 18747.6981
18            Belgium 2002  10311970    Europe  78.320 30485.8838
19            Croatia 2002   4481020    Europe  74.876 11628.3890
20        Netherlands 1967  12596822    Europe  73.820 15363.2514
21          Australia 1982  15184200   Oceania  74.740 19477.0093
22        New Zealand 1997   3676187   Oceania  77.550 21050.4138
23        New Zealand 2007   4115771   Oceania  80.204 25185.0091
24          Australia 2007  20434176   Oceania  81.235 34435.3674
25        New Zealand 1952   1994794   Oceania  69.390 10556.5757
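The repeated left_join() case can be sketched in the same way. The three small tables below (and their columns id, height, weight and age) are invented for illustration. The idea is just that reduce() joins the first two tables, then joins the result to the third.

```r
library(dplyr)
library(purrr)

# Hypothetical tables that share an id column
tables <- list(
  data.frame(id = 1:3, height = c(150, 160, 170)),
  data.frame(id = 1:3, weight = c(50, 60, 70)),
  data.frame(id = c(1L, 3L), age = c(30, 40))
)

# Equivalent to left_join(left_join(tables[[1]], tables[[2]]), tables[[3]])
joined <- tables %>%
  reduce(function(x, y) left_join(x, y, by = "id"))
joined
```

Because these are left joins, every id from the first table is kept, and id 2 gets an NA for age since it is missing from the third table.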

Logical statements for lists

Asking logical questions of a list can be done using every() and some(). For instance, to ask whether every continent has average life expectancy greater than 70, you can use every():

gapminder_list %>% every(~{mean(.x$lifeExp) > 70})
[1] FALSE

To ask whether some continents have average life expectancy greater than 70, you can use some():

gapminder_list %>% some(~{mean(.x$lifeExp) > 70})
[1] TRUE

An equivalent of %in% for lists is has_element().

list(1, c(2, 5, 1), "a") %>% has_element("a")
[1] TRUE
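One thing to watch out for: as far as I understand it, has_element() checks whether any element of the list is identical to the object you supply, so it does not look inside the elements and it is strict about types. The sketch below illustrates this with the same list as above (the name my_list is just for illustration).

```r
library(purrr)

my_list <- list(1, c(2, 5, 1), "a")

# The whole vector c(2, 5, 1) is an element of the list
whole_vector <- has_element(my_list, c(2, 5, 1))
whole_vector   # TRUE

# 5 only appears inside an element, so it is not itself an element
inside_vector <- has_element(my_list, 5)
inside_vector  # FALSE

# The integer 1L is not identical to the double 1 stored in the list
wrong_type <- has_element(my_list, 1L)
wrong_type     # FALSE
```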

Most of these functions also work on vectors.
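For example, here is a quick sketch applying several of the functions from this section to a plain numeric vector (the vector x is made up for illustration).

```r
library(purrr)

x <- c(3, 8, 12, 7)

kept      <- keep(x, ~ .x > 5)     # 8 12 7
discarded <- discard(x, ~ .x > 5)  # 3
all_big   <- every(x, ~ .x > 5)    # FALSE
any_big   <- some(x, ~ .x > 5)     # TRUE
total     <- reduce(x, `+`)        # 30
```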

Now go forth and purrr!

Answer to advanced exercise

The following code produces the table from the exercise above:

gapminder %>% 
  group_by(continent) %>% 
  nest() %>%
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + year + gdpPercap, data = .))) %>%
  mutate(lm_tidy = map(lm_obj, broom::tidy)) %>%
  ungroup() %>%
  transmute(continent, lm_tidy) %>%
  unnest(cols = c(lm_tidy))
# A tibble: 20 × 6
   continent term         estimate std.error statistic  p.value
   <chr>     <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 Asia      (Intercept) -7.83e+ 2   4.83e+1  -16.2    1.22e-45
 2 Asia      pop          4.23e-11   2.04e-9    0.0207 9.83e- 1
 3 Asia      year         4.25e- 1   2.44e-2   17.4    1.13e-50
 4 Asia      gdpPercap    2.51e- 4   3.01e-5    8.34   1.31e-15
 5 Europe    (Intercept) -1.61e+ 2   2.28e+1   -7.09   7.44e-12
 6 Europe    pop         -8.18e- 9   7.80e-9   -1.05   2.95e- 1
 7 Europe    year         1.16e- 1   1.16e-2    9.96   8.88e-21
 8 Europe    gdpPercap    3.25e- 4   2.15e-5   15.2    2.21e-40
 9 Africa    (Intercept) -4.70e+ 2   3.39e+1  -13.9    2.17e-38
10 Africa    pop         -3.68e- 9   1.89e-8   -0.195  8.45e- 1
11 Africa    year         2.61e- 1   1.71e-2   15.2    1.07e-44
12 Africa    gdpPercap    1.12e- 3   1.01e-4   11.1    2.46e-26
13 Americas  (Intercept) -5.33e+ 2   4.10e+1  -13.0    6.40e-31
14 Americas  pop         -2.15e- 8   8.62e-9   -2.49   1.32e- 2
15 Americas  year         3.00e- 1   2.08e-2   14.4    3.79e-36
16 Americas  gdpPercap    6.75e- 4   7.15e-5    9.44   1.13e-18
17 Oceania   (Intercept) -2.10e+ 2   5.12e+1   -4.10   5.61e- 4
18 Oceania   pop          8.37e- 9   3.34e-8    0.251  8.05e- 1
19 Oceania   year         1.42e- 1   2.65e-2    5.34   3.19e- 5
20 Oceania   gdpPercap    2.03e- 4   8.47e-5    2.39   2.66e- 2