Learn to purrr

Purrr is the tidyverse's answer to apply functions for iteration. It's one of those packages that you might have heard of, but that seemed too complicated to sit down and learn. Starting with map functions, and taking you on a journey that will harness the power of the list, this post will have you purrring in no time.

Rebecca Barter

“It was on the corner of the street that he noticed the first sign of something peculiar - a cat reading a map” - J.K. Rowling

Purrr is one of those tidyverse packages that you keep hearing about, and you know you should probably learn it, but you just never seem to get around to it.

At its core, purrr is all about iteration. Purrr introduces map functions (the tidyverse’s answer to base R’s apply functions, but more in line with functional programming practices) as well as some new functions for manipulating lists. To get a quick snapshot of any tidyverse package, a nice place to go is the cheatsheet. I find these particularly useful after I’ve already got the basics of a package down, because I inevitably realise that there are a bunch of functionalities I knew nothing about.

Another useful resource for learning about purrr is Jenny Bryan’s tutorial. Jenny’s tutorial is fantastic, but is a lot longer than mine. This post is a lot shorter and my goal is to get you up and running with purrr very quickly.

While the workhorse of dplyr is the data frame, the workhorse of purrr is the list. If you aren’t familiar with lists, hopefully this will help you understand what they are:

  • A vector is a way of storing many individual elements (a single number or a single character or string) of the same type together in a single object,

  • A data frame is a way of storing many vectors of the same length but possibly of different types together in a single object

  • A list is a way of storing many objects of any type (e.g. data frames, plots, vectors) together in a single object

Here is an example of a list that has three elements: a single number, a vector, and a data frame:

my_first_list <- list(my_number = 5,
                      my_vector = c("a", "b", "c"),
                      my_dataframe = data.frame(a = 1:3, b = c("q", "b", "z"), c = c("bananas", "are", "so very great")))
my_first_list
## $my_number
## [1] 5
## 
## $my_vector
## [1] "a" "b" "c"
## 
## $my_dataframe
##   a b             c
## 1 1 q       bananas
## 2 2 b           are
## 3 3 z so very great

Note that a data frame is actually a special case of a list where each element of the list is a vector of the same length.
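To make the point above concrete, here is a minimal sketch using only base R. The object names df and plain_list are made up for this illustration.

```r
# A small data frame with two columns of equal length
df <- data.frame(a = 1:3, b = c("q", "b", "z"))

is.list(df)   # a data frame is built on top of a list
length(df)    # one list element per column
df[["b"]]     # a column can be extracted just like a list element

# Dropping the data frame class leaves a plain named list of vectors
plain_list <- unclass(df)
names(plain_list)
```

This is why anything that iterates over the elements of a list will iterate over the columns of a data frame.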

Map functions: beyond apply

A map function is one that applies the same action/function to every element of an object (e.g. each entry of a list or a vector, or each of the columns of a data frame).

If you’re familiar with the base R apply() functions, then it turns out that you are already familiar with map functions, even if you didn’t know it!

The apply() functions are a set of super useful base-R functions for iteratively performing an action across entries of a vector or list without having to write a for-loop. While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…).
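As a rough illustration of that ambiguity, the sketch below runs sapply() on a few toy inputs (the variable names are made up for this example) and compares it with map_dbl(), whose output type is fixed by its name.

```r
library(purrr)

# sapply() picks its return type based on what the function happens to return
as_vector <- sapply(1:3, function(i) i * 2)         # a numeric vector
as_matrix <- sapply(1:3, function(i) c(i, i * 2))   # a matrix with 2 rows and 3 columns
as_list   <- sapply(1:3, function(i) seq_len(i))    # a list, since the results differ in length
as_empty  <- sapply(integer(0), function(i) i * 2)  # an empty list, not an empty numeric vector

# map_dbl() returns a double vector, even when the input is empty
dbl_result <- map_dbl(1:3, function(i) i * 2)
dbl_empty  <- map_dbl(integer(0), function(i) i * 2)
```

With the map functions, you know the type of the result before the code runs, which makes the output easier to rely on in later steps.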

The naming convention of the map functions is such that the type of the output is specified by the term that follows the underscore in the function name.

  • map(.x, .f) is the main mapping function and returns a list

  • map_df(.x, .f) returns a data frame

  • map_dbl(.x, .f) returns a numeric (double) vector

  • map_chr(.x, .f) returns a character vector

  • map_lgl(.x, .f) returns a logical vector

Consistent with the way of the tidyverse, the first argument of each mapping function is always the data object that you want to map over, and the second argument is always the function that you want to iteratively apply to each element of the input object.
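A quick sketch of this naming convention on a toy vector is below. The functions being mapped are throwaway examples of my own; only the map variants themselves come from purrr.

```r
library(purrr)

x <- c(1, 4, 7)

# data first, function second; the suffix names the output type
as_list <- map(x, function(v) v > 5)                   # a list of three logical values
as_lgl  <- map_lgl(x, function(v) v > 5)               # a logical vector
as_dbl  <- map_dbl(x, function(v) v / 2)               # a double vector
as_chr  <- map_chr(x, function(v) paste0("value_", v)) # a character vector
```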

The input object to any map function is always either

  • a vector (of any type), in which case the iteration is done over the entries of the vector,

  • a list, in which case the iteration is performed over the elements of the list,

  • a data frame, in which case the iteration is performed over the columns of the data frame (which, since a data frame is a special kind of list, is technically the same as the previous point).

Since the first argument is always the data, this means that map functions play nicely with pipes (%>%). If you’ve never seen pipes before, they’re really useful (originally from the magrittr package, but also ported with the dplyr package and thus with the tidyverse). Piping allows you to string together many functions by piping an object (which itself might be the output of a function) into the first argument of the next function. If you’d like to learn more about pipes, check out my tidyverse blog posts.
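If pipes are new to you, the small sketch below shows the same computation written as a nested call and as a pipeline. The variable names are made up, and magrittr is loaded explicitly here so that %>% is available without loading the whole tidyverse.

```r
library(purrr)
library(magrittr)  # provides the %>% pipe

# nested version: read from the inside out
nested_call <- map_dbl(sqrt(c(1, 4, 9)), function(v) v + 10)

# piped version: read from left to right
piped_call <- c(1, 4, 9) %>%
  sqrt() %>%
  map_dbl(function(v) v + 10)

identical(nested_call, piped_call)
```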

Throughout this post I will demonstrate each of purrr’s functionalities using both a simple numeric example (to explain the concept) and the gapminder data (to show a more complex example).

Simplest usage: repeated looping with map

Fundamentally, maps are for iteration. In the example below I will iterate through the vector c(1, 4, 7) by adding 10 to each entry. This function applied to a single number, which we will call .x, can be defined as

addTen <- function(.x) {
  return(.x + 10)
}

The map() function below iterates addTen() across all entries of the vector, .x = c(1, 4, 7), and returns the output as a list

library(tidyverse)
map(.x = c(1, 4, 7), 
    .f = addTen)
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 17

Fortunately, you don’t actually need to specify the argument names

map(c(1, 4, 7), addTen)
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 17

Note that

  • the first element of the output is the result of applying the function to the first element of the input (1),

  • the second element of the output is the result of applying the function to the second element of the input (4),

  • and the third element of the output is the result of applying the function to the third element of the input (7).

The following code chunks show that no matter if the input object is a vector, a list, or a data frame, map() always returns a list.

map(list(1, 4, 7), addTen)
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 17
map(data.frame(a = 1, b = 4, c = 7), addTen)
## $a
## [1] 11
## 
## $b
## [1] 14
## 
## $c
## [1] 17

If you want the output of map to be some other object type, you need to use a different function. For instance, to map the input to a numeric (double) vector, you can use the map_dbl() (“map to a double”) function.

map_dbl(c(1, 4, 7), addTen)
## [1] 11 14 17

To map to a character vector, you can use the map_chr() (“map to a character”) function.

map_chr(c(1, 4, 7), addTen)
## [1] "11.000000" "14.000000" "17.000000"

If you want to return a data frame, then you would use the map_df() function. However, you need to make sure that in each iteration you’re returning a data frame which has consistent column names. map_df will automatically bind the rows of each iteration.

For this example, I want to return a data frame whose columns correspond to the original number and the number plus ten.

map_df(c(1, 4, 7), function(.x) {
  return(data.frame(old_number = .x, 
                    new_number = addTen(.x)))
})
##   old_number new_number
## 1          1         11
## 2          4         14
## 3          7         17

Note that in this case, I defined an “anonymous” function as the function to apply in each iteration. An anonymous function is a temporary function (that you define as the function argument to the map). Here I used the argument name .x, but I could have used anything.
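One way to think about the row-binding step is sketched below: map over the input to get a list of one-row data frames, then stack them with dplyr's bind_rows(). The helper make_row() and its argument name number are made up for this example (which also shows that the argument does not have to be called .x). As far as I know, recent purrr releases (1.0.0 and later) mark map_df() as superseded in favour of this kind of two-step pattern, so it is worth recognising.

```r
library(purrr)
library(dplyr)

# hypothetical helper: returns a one-row data frame for a single number
make_row <- function(number) {
  data.frame(old_number = number,
             new_number = number + 10)
}

# step 1: a list with one data frame per input element
list_of_rows <- map(c(1, 4, 7), make_row)

# step 2: stack the data frames row-wise into a single data frame
combined <- bind_rows(list_of_rows)
combined
```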

Another function to be aware of is modify(), which is just like the map functions, but always returns an object of the same type as the input object.

library(tidyverse)
modify(c(1, 4, 7), addTen)
## [1] 11 14 17
modify(list(1, 4, 7), addTen)
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 17
modify(data.frame(1, 4, 7), addTen)
##   X1 X4 X7
## 1 11 14 17

Modify also has a pretty useful sibling, modify_if(), that only applies the function to elements that satisfy a specific criterion (specified by a “predicate function”, the second argument, called .p). For instance, the following example only modifies the third entry since it is greater than 5.

modify_if(.x = list(1, 4, 7), 
          .p = function(x) x > 5,
          .f = addTen)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 17
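A common use of modify_if() is to change only some columns of a data frame. The sketch below, on a made-up data frame called scores, adds ten to the numeric columns and leaves the others alone; treat it as an illustration rather than the only way to do this.

```r
library(purrr)

# toy data frame with one numeric and one character column
scores <- data.frame(n = c(1, 4, 7),
                     label = c("a", "b", "c"))

# is.numeric() is the predicate: only columns where it returns TRUE are modified
modified <- modify_if(scores, is.numeric, function(column) column + 10)
modified
```

Because modify_if() keeps the type of its input, the result is still a data frame.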

The tilde-dot shorthand for functions

To make the code more concise you can use the tilde-dot shorthand for anonymous functions (the functions that you create as arguments of other functions).

The notation works by replacing

function(x) {
  x + 10
}

with

~{.x + 10}

~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments that can be anything that you like, the tilde-dot function argument is always .x.

Thus, instead of defining the addTen() function separately, we could use the tilde-dot shorthand

map_dbl(c(1, 4, 7), ~{.x + 10})
## [1] 11 14 17
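To check that these are really just different spellings of the same thing, the sketch below runs a named function, a full anonymous function, and both tilde-dot forms on the same input. The result names are made up for the comparison.

```r
library(purrr)

add_ten <- function(x) {
  x + 10
}

res_named     <- map_dbl(c(1, 4, 7), add_ten)               # named function
res_anonymous <- map_dbl(c(1, 4, 7), function(x) x + 10)    # full anonymous function
res_tilde_x   <- map_dbl(c(1, 4, 7), ~ .x + 10)             # tilde-dot shorthand with .x
res_tilde_dot <- map_dbl(c(1, 4, 7), ~ . + 10)              # tilde-dot shorthand with .

identical(res_named, res_anonymous)
identical(res_named, res_tilde_x)
identical(res_named, res_tilde_dot)
```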

Applying map functions in a slightly more interesting context

Throughout this tutorial, we will use the gapminder dataset that can be loaded directly if you’re connected to the internet. Each function will first be demonstrated using a simple numeric example, and then will be demonstrated using a more complex practical example based on the gapminder dataset.

My general workflow involves loading the original data and saving it as an object with a meaningful name and an _orig suffix. I then define a copy of the original dataset without the _orig suffix. Having an original copy of my data in my environment means that it is easy to check that my manipulations do what I expected. I will make direct data cleaning modifications to the gapminder data frame, but will never edit the gapminder_orig data frame.

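# to download the data directly:
# (stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0, where the default is FALSE)
gapminder_orig <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)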
# define a copy of the original dataset that we will clean and play with 
gapminder <- gapminder_orig

The gapminder dataset has 1704 rows containing information on population, life expectancy and GDP per capita by year and country.

A “tidy” data frame is one where every row is a single observational unit (in this case, indexed by country and year), and every column corresponds to a variable that is measured for each observational unit (in this case, for each country and year, a measurement is made for population, continent, life expectancy and GDP). If you’d like to learn more about “tidy data”, I highly recommend reading Hadley Wickham’s tidy data article.

dim(gapminder)
## [1] 1704    6
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Since gapminder is a data frame, the map_ functions will iterate over each column. An example of simple usage of the map_ functions is to summarize each column. For instance, you can identify the type of each column by applying the class() function to each column. Since the output of the class() function is a character, we will use the map_chr() function:

# apply the class() function to each column
gapminder %>% map_chr(class)
##   country      year       pop continent   lifeExp gdpPercap 
##  "factor" "integer" "numeric"  "factor" "numeric" "numeric"

I frequently do this to get a quick snapshot of each column type of a new dataset directly in the console. As a habit, I usually pipe in the data using %>%, rather than provide it as an argument. Remember that the pipe places the object to the left of the pipe in the first argument of the function to the right.

Similarly, if you wanted to identify the number of distinct values in each column, you could apply the n_distinct() function from the dplyr package to each column. Since n_distinct() returns a single number (an integer, which map_dbl() converts to a double), you might want to use the map_dbl() function so that the results of each iteration (the application of n_distinct() to each column) are concatenated into a numeric vector:

# apply the n_distinct() function to each column
gapminder %>% map_dbl(n_distinct)
##   country      year       pop continent   lifeExp gdpPercap 
##       142        12      1704         5      1626      1704

If you want to do something a little more complicated, such as returning a few different summaries of each column in a data frame, you can use map_df(). When things are getting a little bit more complicated, you typically need to define an anonymous function that you want to apply to each column. Using the tilde-dot notation, the anonymous function below calculates the number of distinct entries and the type of the current column (which is accessible as .x), and then combines them into a two-column data frame. Once it has iterated through each of the columns, the map_df function combines the data frames row-wise into a single data frame.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))))
##   n_distinct   class
## 1        142  factor
## 2         12 integer
## 3       1704 numeric
## 4          5  factor
## 5       1626 numeric
## 6       1704 numeric

Note that we’ve lost the variable names! The variable names correspond to the names of the objects over which we are iterating (in this case, the column names), and these are not automatically included as a column in the output data frame. You can tell map_df() to include them using the .id argument of map_df(). This will automatically take the name of the element being iterated over and include it in the column corresponding to whatever you set .id to.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
##    variable n_distinct   class
## 1   country        142  factor
## 2      year         12 integer
## 3       pop       1704 numeric
## 4 continent          5  factor
## 5   lifeExp       1626 numeric
## 6 gdpPercap       1704 numeric

If you’re having trouble thinking through these map actions, I recommend that you first figure out what the code would be to do what you want for a single element, and then paste it into the map_df() function (a nice trick I saw Hadley Wickham use a few years ago when he presented on purrr at RLadies SF).

For instance, since the first element of the gapminder data frame is the first column, let’s define .x in our environment to be this first column.

# take the first element of the gapminder data
.x <- gapminder %>% pluck(1)
# look at the first 6 rows
head(.x)
## [1] Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe

Then, you can create a data frame for this column that contains the number of distinct entries, and the class of the column.

data.frame(n_distinct = n_distinct(.x),
           class = class(.x))
##   n_distinct  class
## 1        142 factor

Since this has done what was expected for the first column, you can paste this code into the map function using the tilde-dot shorthand.

gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
##    variable n_distinct   class
## 1   country        142  factor
## 2      year         12 integer
## 3       pop       1704 numeric
## 4 continent          5  factor
## 5   lifeExp       1626 numeric
## 6 gdpPercap       1704 numeric

map_df() is definitely one of the most powerful functions of purrr in my opinion, and is probably the one that I use most.

Maps with multiple input objects

After gaining a basic understanding of purrr’s map functions, you can start to do some fancier stuff. For instance, what if you want to perform a map that iterates through two objects? The code below uses map functions to create a list of plots that compare life expectancy and GDP per capita for each continent/year combination.

The map function that maps over two objects instead of one is called map2(). The first two arguments are the two objects you want to iterate over, and the third is the function (with two arguments, one for each object).

map2(.x = object1, # the first object to iterate over
     .y = object2, # the second object to iterate over
     .f = ~plotFunction(.x, .y)) # the function to apply to each pair of elements

First, you need to define a vector (or list) of continents and a paired vector (or list) of years that you want to iterate through. Note that in our continent/year example

  • the first iteration will correspond to the first continent in the continent vector and the first year in the year vector,

  • the second iteration will correspond to the second continent in the continent vector and the second year in the year vector.

This might seem obvious, but it is a natural instinct to incorrectly assume that map2() will automatically perform the action on all combinations that can be made from the two vectors. For instance if you have a continent vector .x = c("Americas", "Asia") and a year vector .y = c(1952, 2007), then you might assume that map2 will iterate over the Americas for 1952 and for 2007, and then Asia for 1952 and 2007. It won’t though. The iteration will actually be first the Americas for 1952 only, and then Asia for 2007 only.
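The sketch below shows this pairing behaviour on the two small vectors from the paragraph above, using paste() as a stand-in for a real plotting function. The names ending in _demo are made up for this illustration. If you do want every combination, one option (an assumption on my part, not something this post relies on) is to build the combinations first, for example with base R's expand.grid(), and then map over the resulting columns.

```r
library(purrr)

continents_demo <- c("Americas", "Asia")
years_demo <- c(1952, 2007)

# map2 walks through the two vectors in parallel: 1st with 1st, 2nd with 2nd
paired <- map2_chr(continents_demo, years_demo,
                   function(continent, year) paste(continent, year))
paired

# to cover all combinations, create them explicitly before mapping
all_pairs <- expand.grid(continent = continents_demo,
                         year = years_demo,
                         stringsAsFactors = FALSE)
combined <- map2_chr(all_pairs$continent, all_pairs$year,
                     function(continent, year) paste(continent, year))
combined
```

This is also what the distinct() call in the next step achieves for the gapminder data: it produces one row per continent/year combination, so the two extracted vectors are already correctly paired.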

First, let’s get our vectors of continents and years, starting by obtaining all distinct combinations of continents and years that appear in the data.

continent_year <- gapminder %>% distinct(continent, year)
continent_year
##    continent year
## 1       Asia 1952
## 2       Asia 1957
## 3       Asia 1962
## 4       Asia 1967
## 5       Asia 1972
## 6       Asia 1977
## 7       Asia 1982
## 8       Asia 1987
## 9       Asia 1992
## 10      Asia 1997
## 11      Asia 2002
## 12      Asia 2007
## 13    Europe 1952
## 14    Europe 1957
## 15    Europe 1962
## 16    Europe 1967
## 17    Europe 1972
## 18    Europe 1977
## 19    Europe 1982
## 20    Europe 1987
## 21    Europe 1992
## 22    Europe 1997
## 23    Europe 2002
## 24    Europe 2007
## 25    Africa 1952
## 26    Africa 1957
## 27    Africa 1962
## 28    Africa 1967
## 29    Africa 1972
## 30    Africa 1977
## 31    Africa 1982
## 32    Africa 1987
## 33    Africa 1992
## 34    Africa 1997
## 35    Africa 2002
## 36    Africa 2007
## 37  Americas 1952
## 38  Americas 1957
## 39  Americas 1962
## 40  Americas 1967
## 41  Americas 1972
## 42  Americas 1977
## 43  Americas 1982
## 44  Americas 1987
## 45  Americas 1992
## 46  Americas 1997
## 47  Americas 2002
## 48  Americas 2007
## 49   Oceania 1952
## 50   Oceania 1957
## 51   Oceania 1962
## 52   Oceania 1967
## 53   Oceania 1972
## 54   Oceania 1977
## 55   Oceania 1982
## 56   Oceania 1987
## 57   Oceania 1992
## 58   Oceania 1997
## 59   Oceania 2002
## 60   Oceania 2007

Then extract the continent and year pairs as separate vectors:

# extract the continent and year pairs as separate vectors
continents <- continent_year %>% pull(continent) %>% as.character
years <- continent_year %>% pull(year)

If you want to use tilde-dot shorthand, the anonymous arguments will be .x for the first object being iterated over, and .y for the second object being iterated over.

Before jumping straight into the map function, it’s a good idea to first figure out what the code will be for just the first iteration (the first continent and the first year, which happen to be Asia in 1952).

# try to figure out the code for the first example
.x <- continents[1]
.y <- years[1]
# make a scatterplot of GDP vs life expectancy in all Asian countries for 1952
gapminder %>% 
  filter(continent == .x,
         year == .y) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  ggtitle(glue::glue(.x, " ", .y))

This seems to have worked. So you can then copy-and-paste the code into the map2 function

plot_list <- map2(.x = continents, 
                  .y = years, 
                  .f = ~{
                    gapminder %>% 
                      filter(continent == .x,
                             year == .y) %>%
                      ggplot() +
                      geom_point(aes(x = gdpPercap, y = lifeExp)) +
                      ggtitle(glue::glue(.x, " ", .y))
                  })

And you can look at a few of the entries of the list to see that they make sense

plot_list[[1]]

plot_list[[22]]

pmap() allows you to iterate over an arbitrary number of objects (i.e. more than two).
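Here is a minimal sketch of pmap on a toy example. The inputs are supplied as a single list, and when that list is named, the names are matched to the arguments of the function. The object names inputs and sums are made up for this illustration.

```r
library(purrr)

# three vectors of the same length, bundled into one named list
inputs <- list(x = c(1, 2),
               y = c(10, 20),
               z = c(100, 200))

# the first iteration uses x = 1, y = 10, z = 100; the second uses x = 2, y = 20, z = 200
sums <- pmap_dbl(inputs, function(x, y, z) x + y + z)
sums
```

As with map2(), the iteration is element-wise across the inputs, not over all combinations, and the usual suffixes (such as _dbl here) control the output type.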

List columns and Nested data frames

Tibbles are tidyverse data frames. Some crazy stuff starts happening when you learn that tibble columns can be lists (as opposed to vectors, which is what they usually are). This is where the difference between tibbles and data frames becomes real.

For instance, a tibble can be “nested” where the tibble is essentially split into separate data frames based on a grouping variable, and these separate data frames are stored as entries of a list (that is then stored in the data column of the data frame).

Below I nest the gapminder data by continent.

gapminder_nested <- gapminder %>% 
  group_by(continent) %>% 
  nest()
gapminder_nested
## # A tibble: 5 x 2
## # Groups:   continent [5]
##   continent           data
##   <fct>     <list<df[,5]>>
## 1 Asia           [396 × 5]
## 2 Europe         [360 × 5]
## 3 Africa         [624 × 5]
## 4 Americas       [300 × 5]
## 5 Oceania         [24 × 5]

The first column is the variable that we grouped by, continent, and the second column is the rest of the data frame corresponding to that group (as if you had filtered the data frame to the specific continent). To see this, the code below shows that the first entry in the data column corresponds to the entire gapminder dataset for Asia.

gapminder_nested$data[[1]]
## # A tibble: 396 x 5
##    country      year      pop lifeExp gdpPercap
##    <fct>       <int>    <dbl>   <dbl>     <dbl>
##  1 Afghanistan  1952  8425333    28.8      779.
##  2 Afghanistan  1957  9240934    30.3      821.
##  3 Afghanistan  1962 10267083    32.0      853.
##  4 Afghanistan  1967 11537966    34.0      836.
##  5 Afghanistan  1972 13079460    36.1      740.
##  6 Afghanistan  1977 14880372    38.4      786.
##  7 Afghanistan  1982 12881816    39.9      978.
##  8 Afghanistan  1987 13867957    40.8      852.
##  9 Afghanistan  1992 16317921    41.7      649.
## 10 Afghanistan  1997 22227415    41.8      635.
## # … with 386 more rows

Using purrr’s pluck() function, this can be written as

gapminder_nested %>% 
  # extract the first entry from the data column
  pluck("data", 1)
## # A tibble: 396 x 5
##    country      year      pop lifeExp gdpPercap
##    <fct>       <int>    <dbl>   <dbl>     <dbl>
##  1 Afghanistan  1952  8425333    28.8      779.
##  2 Afghanistan  1957  9240934    30.3      821.
##  3 Afghanistan  1962 10267083    32.0      853.
##  4 Afghanistan  1967 11537966    34.0      836.
##  5 Afghanistan  1972 13079460    36.1      740.
##  6 Afghanistan  1977 14880372    38.4      786.
##  7 Afghanistan  1982 12881816    39.9      978.
##  8 Afghanistan  1987 13867957    40.8      852.
##  9 Afghanistan  1992 16317921    41.7      649.
## 10 Afghanistan  1997 22227415    41.8      635.
## # … with 386 more rows

Similarly, the 5th entry in the data column corresponds to the entire gapminder dataset for Oceania.

gapminder_nested %>% pluck("data", 5)
## # A tibble: 24 x 5
##    country    year      pop lifeExp gdpPercap
##    <fct>     <int>    <dbl>   <dbl>     <dbl>
##  1 Australia  1952  8691212    69.1    10040.
##  2 Australia  1957  9712569    70.3    10950.
##  3 Australia  1962 10794968    70.9    12217.
##  4 Australia  1967 11872264    71.1    14526.
##  5 Australia  1972 13177000    71.9    16789.
##  6 Australia  1977 14074100    73.5    18334.
##  7 Australia  1982 15184200    74.7    19477.
##  8 Australia  1987 16257249    76.3    21889.
##  9 Australia  1992 17481977    77.6    23425.
## 10 Australia  1997 18565243    78.8    26998.
## # … with 14 more rows
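If the nested structure still feels abstract, it can help to try it on a tiny data frame where you can see everything at once. The sketch below uses a made-up tibble called toy and tidyr's unnest() to show that nesting only rearranges the data and loses nothing.

```r
library(dplyr)
library(tidyr)

toy <- tibble(group = c("a", "a", "b"),
              value = c(1, 2, 3))

# one row per group, with the remaining columns stored in a list-column called data
toy_nested <- toy %>%
  group_by(group) %>%
  nest()
toy_nested

# unnest() reverses the operation and returns the original rows
toy_roundtrip <- toy_nested %>%
  unnest(data) %>%
  ungroup()
toy_roundtrip
```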

You might be asking at this point why you would ever want to nest your data frame? It just doesn’t seem like that useful a thing to do… until you realise that you now have the power to use dplyr manipulations on more complex objects that can be stored in a list.

However, since actions such as mutate() are applied directly to the entire column (which is usually a vector, so is fine), we run into issues when we try to mutate a list. For instance, since columns are usually vectors, normal vectorized functions work just fine on them

tibble(vec_col = 1:10) %>%
  mutate(vec_sum = sum(vec_col))
## # A tibble: 10 x 2
##    vec_col vec_sum
##      <int>   <int>
##  1       1      55
##  2       2      55
##  3       3      55
##  4       4      55
##  5       5      55
##  6       6      55
##  7       7      55
##  8       8      55
##  9       9      55
## 10      10      55

but when the column is a list, vectorized functions don’t know what to do with them, and we get an error that says Error in sum(x) : invalid 'type' (list) of argument. Try

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = sum(list_col))

To apply mutate functions to a list-column, you need to wrap the function you want to apply in a map function.

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map(list_col, sum))
## # A tibble: 3 x 2
##   list_col  list_sum 
##   <list>    <list>   
## 1 <dbl [3]> <dbl [1]>
## 2 <dbl [1]> <dbl [1]>
## 3 <dbl [3]> <dbl [1]>

Since map() returns a list itself, the list_sum column is thus itself a list

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map(list_col, sum)) %>% 
  pull(list_sum)
## [[1]]
## [1] 13
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 31

What could we do if we wanted it to be a vector? We could use the map_dbl() function instead!

tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map_dbl(list_col, sum))
## # A tibble: 3 x 2
##   list_col  list_sum
##   <list>       <dbl>
## 1 <dbl [3]>       13
## 2 <dbl [1]>        5
## 3 <dbl [3]>       31

Nesting the gapminder data

Let’s return to the nested gapminder dataset. I want to calculate the average life expectancy within each continent and add it as a new column using mutate(). Based on the example above, can you explain why the following code doesn’t work?

gapminder_nested %>% 
  mutate(avg_lifeExp = mean(data$lifeExp))

I was hoping that this code would extract the lifeExp column from each data frame. But I’m applying the mutate to the data column, which itself doesn’t have an entry called lifeExp since it’s a list of data frames. How could I get access to the lifeExp column of the data frames stored in the data list? Using a map function of course!
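The same issue can be seen outside of a tibble. In the sketch below, list_of_dfs is a made-up plain list standing in for the data column: asking the list itself for lifeExp finds nothing, whereas asking each data frame inside it works.

```r
library(purrr)

# a plain list of two small data frames, standing in for the "data" list-column
list_of_dfs <- list(data.frame(lifeExp = c(50, 60)),
                    data.frame(lifeExp = c(70, 90)))

# the list has no element called lifeExp, so this is NULL
is.null(list_of_dfs$lifeExp)

# each data frame inside the list does have a lifeExp column
means <- map_dbl(list_of_dfs, function(df) mean(df$lifeExp))
means
```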

Think of an individual data frame as .x. Again, I will first figure out the code for calculating the mean life expectancy for the first entry of the column. The following code defines .x to be the first entry of the data column (this is the data frame for Asia).

# the first entry of the "data" column
.x <- gapminder_nested %>% pluck("data", 1)
.x
## # A tibble: 396 x 5
##    country      year      pop lifeExp gdpPercap
##    <fct>       <int>    <dbl>   <dbl>     <dbl>
##  1 Afghanistan  1952  8425333    28.8      779.
##  2 Afghanistan  1957  9240934    30.3      821.
##  3 Afghanistan  1962 10267083    32.0      853.
##  4 Afghanistan  1967 11537966    34.0      836.
##  5 Afghanistan  1972 13079460    36.1      740.
##  6 Afghanistan  1977 14880372    38.4      786.
##  7 Afghanistan  1982 12881816    39.9      978.
##  8 Afghanistan  1987 13867957    40.8      852.
##  9 Afghanistan  1992 16317921    41.7      649.
## 10 Afghanistan  1997 22227415    41.8      635.
## # … with 386 more rows

Then to calculate the average life expectancy for Asia, I could write

mean(.x$lifeExp)
## [1] 60.0649

So copy-pasting this into the tilde-dot anonymous function argument of the map_dbl() function within mutate(), I get what I wanted!

gapminder_nested %>% 
  mutate(avg_lifeExp = map_dbl(data, ~{mean(.x$lifeExp)}))
## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent           data avg_lifeExp
##   <fct>     <list<df[,5]>>       <dbl>
## 1 Asia           [396 × 5]        60.1
## 2 Europe         [360 × 5]        71.9
## 3 Africa         [624 × 5]        48.9
## 4 Americas       [300 × 5]        64.7
## 5 Oceania         [24 × 5]        74.3

This code iterates through the data frames stored in the data column, returns the average life expectancy for each data frame, and concatenates the results into a numeric vector (which is then stored as a column called avg_lifeExp).

I hear what you’re saying… this is something that we could have done a lot more easily using standard dplyr commands (such as summarise()). True, but hopefully it helped you understand why you need to wrap mutate functions inside map functions when applying them to list columns.
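For comparison, here is a sketch of both routes on a tiny made-up data frame (toy, with invented continents "A" and "B"), so that the numbers are easy to check by hand. The nested route mirrors the code above; the summarise() route is the standard dplyr equivalent.

```r
library(dplyr)
library(tidyr)
library(purrr)

toy <- tibble(continent = c("A", "A", "B", "B"),
              lifeExp = c(50, 60, 70, 90))

# route 1: nest, then map over the list-column inside mutate()
via_nest <- toy %>%
  group_by(continent) %>%
  nest() %>%
  mutate(avg_lifeExp = map_dbl(data, function(df) mean(df$lifeExp))) %>%
  ungroup() %>%
  select(continent, avg_lifeExp)

# route 2: plain grouped summarise()
via_summarise <- toy %>%
  group_by(continent) %>%
  summarise(avg_lifeExp = mean(lifeExp))

via_nest
via_summarise
```

Both give one average per continent. The nested route only starts to pay off when the thing you want to store per group is not a single number, as in the next example.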

Even if this example was less than inspiring, I promise the next example will knock your socks off!

The next example will demonstrate how to fit a model separately for each continent, and evaluate it, all within a single tibble. First, I will fit a linear model for each continent and store it as a list-column. If the data frame for a single continent is .x, then the model I want to fit is lm(lifeExp ~ pop + gdpPercap + year, data = .x) (check for yourself that this does what you expect). So I can copy-paste this command into the map() function within the mutate()

# fit a model separately for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + gdpPercap + year, data = .x)))
gapminder_nested
## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent           data lm_obj
##   <fct>     <list<df[,5]>> <list>
## 1 Asia           [396 × 5] <lm>  
## 2 Europe         [360 × 5] <lm>  
## 3 Africa         [624 × 5] <lm>  
## 4 Americas       [300 × 5] <lm>  
## 5 Oceania         [24 × 5] <lm>

Where the first linear model (for Asia) is

gapminder_nested %>% pluck("lm_obj", 1)
## 
## Call:
## lm(formula = lifeExp ~ pop + gdpPercap + year, data = .x)
## 
## Coefficients:
## (Intercept)          pop    gdpPercap         year  
##  -7.833e+02    4.228e-11    2.510e-04    4.251e-01

I can then predict the response for the data stored in the data column using the corresponding linear model. So I have two objects I want to iterate over: the data and the linear model object. This means I want to use map2(). When things get a little more complicated I like to have multiple function arguments, so I’m going to use a full anonymous function rather than the tilde-dot shorthand.

# predict the response for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(pred = map2(lm_obj, data, function(.lm, .data) predict(.lm, .data)))
gapminder_nested
## # A tibble: 5 x 4
## # Groups:   continent [5]
##   continent           data lm_obj pred       
##   <fct>     <list<df[,5]>> <list> <list>     
## 1 Asia           [396 × 5] <lm>   <dbl [396]>
## 2 Europe         [360 × 5] <lm>   <dbl [360]>
## 3 Africa         [624 × 5] <lm>   <dbl [624]>
## 4 Americas       [300 × 5] <lm>   <dbl [300]>
## 5 Oceania         [24 × 5] <lm>   <dbl [24]>
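
If the pairing that map2() does feels a bit magical, it may help to see it outside of a tibble. This is a minimal sketch, again assuming mtcars split by cyl as a stand-in for the nested gapminder data: the i-th model is matched with the i-th data frame.

```r
library(purrr)

# Assumption: mtcars split by cyl stands in for the nested gapminder data
cars_by_cyl <- split(mtcars, mtcars$cyl)
models <- map(cars_by_cyl, ~lm(mpg ~ wt, data = .x))

# map2() iterates over both lists in parallel
preds <- map2(models, cars_by_cyl, function(.lm, .data) predict(.lm, .data))

# One vector of predictions per group, with one prediction per row
identical(unname(lengths(preds)), unname(map_int(cars_by_cyl, nrow)))
## [1] TRUE
```

Because each model is asked to predict on the same data it was fit to, these predictions are just the fitted values of each model.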

And I can then calculate the correlation between the predicted response and the true response, this time using the map2_dbl() function since I want the output to be a numeric vector rather than a list of single elements.

# calculate the correlation between observed and predicted response for each continent
gapminder_nested <- gapminder_nested %>% 
  mutate(cor = map2_dbl(pred, data, function(.pred, .data) cor(.pred, .data$lifeExp)))
gapminder_nested
## # A tibble: 5 x 5
## # Groups:   continent [5]
##   continent           data lm_obj pred          cor
##   <fct>     <list<df[,5]>> <list> <list>      <dbl>
## 1 Asia           [396 × 5] <lm>   <dbl [396]> 0.723
## 2 Europe         [360 × 5] <lm>   <dbl [360]> 0.834
## 3 Africa         [624 × 5] <lm>   <dbl [624]> 0.645
## 4 Americas       [300 × 5] <lm>   <dbl [300]> 0.779
## 5 Oceania         [24 × 5] <lm>   <dbl [24]>  0.987
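
To see the difference between map2() and map2_dbl() directly, here is a rough sketch on the same stand-in data (mtcars split by cyl, which is an assumption of this example and not part of the gapminder analysis). It also includes a sanity check: for a linear model with an intercept, the squared correlation between fitted and observed values equals R-squared.

```r
library(purrr)

# Assumption: mtcars split by cyl stands in for the nested gapminder data
cars_by_cyl <- split(mtcars, mtcars$cyl)
models <- map(cars_by_cyl, ~lm(mpg ~ wt, data = .x))
preds <- map2(models, cars_by_cyl, function(.lm, .data) predict(.lm, .data))

# map2() returns a list with one length-1 element per group ...
cor_list <- map2(preds, cars_by_cyl, function(.pred, .data) cor(.pred, .data$mpg))
# ... while map2_dbl() returns a plain numeric vector
cor_vec <- map2_dbl(preds, cars_by_cyl, function(.pred, .data) cor(.pred, .data$mpg))

is.list(cor_list)
## [1] TRUE
is.numeric(cor_vec)
## [1] TRUE

# Sanity check against the R-squared reported by each model
r_squared <- map_dbl(models, ~summary(.x)$r.squared)
all.equal(unname(cor_vec^2), unname(r_squared))
## [1] TRUE
```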

Holy guacamole, that is so awesome!

Advanced exercise

The goal of this exercise is to fit a separate linear model for each continent without splitting up the data. Create the following data frame that has the continent, each term in the model for the continent, and the term's coefficient estimate, standard error, test statistic and p-value.

## # A tibble: 20 x 6
##    continent term         estimate std.error statistic  p.value
##    <fct>     <chr>           <dbl>     <dbl>     <dbl>    <dbl>
##  1 Asia      (Intercept) -7.83e+ 2   4.83e+1  -16.2    1.22e-45
##  2 Asia      pop          4.23e-11   2.04e-9    0.0207 9.83e- 1
##  3 Asia      year         4.25e- 1   2.44e-2   17.4    1.13e-50
##  4 Asia      gdpPercap    2.51e- 4   3.01e-5    8.34   1.31e-15
##  5 Europe    (Intercept) -1.61e+ 2   2.28e+1   -7.09   7.44e-12
##  6 Europe    pop         -8.18e- 9   7.80e-9   -1.05   2.95e- 1
##  7 Europe    year         1.16e- 1   1.16e-2    9.96   8.88e-21
##  8 Europe    gdpPercap    3.25e- 4   2.15e-5   15.2    2.21e-40
##  9 Africa    (Intercept) -4.70e+ 2   3.39e+1  -13.9    2.17e-38
## 10 Africa    pop         -3.68e- 9   1.89e-8   -0.195  8.45e- 1
## 11 Africa    year         2.61e- 1   1.71e-2   15.2    1.07e-44
## 12 Africa    gdpPercap    1.12e- 3   1.01e-4   11.1    2.46e-26
## 13 Americas  (Intercept) -5.33e+ 2   4.10e+1  -13.0    6.40e-31
## 14 Americas  pop         -2.15e- 8   8.62e-9   -2.49   1.32e- 2
## 15 Americas  year         3.00e- 1   2.08e-2   14.4    3.79e-36
## 16 Americas  gdpPercap    6.75e- 4   7.15e-5    9.44   1.13e-18
## 17 Oceania   (Intercept) -2.10e+ 2   5.12e+1   -4.10   5.61e- 4
## 18 Oceania   pop          8.37e- 9   3.34e-8    0.251  8.05e- 1
## 19 Oceania   year         1.42e- 1   2.65e-2    5.34   3.19e- 5
## 20 Oceania   gdpPercap    2.03e- 4   8.47e-5    2.39   2.66e- 2

Hint: starting from the gapminder dataset, use group_by() and nest() to nest by continent, use a mutate together with map to fit a linear model for each continent, use another mutate with broom::tidy() to get a data frame of model coefficients for each model, and a transmute to get just the columns you want, followed by an unnest() to re-expand the nested tibble.

The solution code is at the end of this post.

If you want to stop here, you will already know more than most purrr users. The remainder of this blog post involves little-used features of purrr for manipulating lists.

Additional purrr functionalities for lists

To demonstrate how to use purrr to manipulate lists, we will split the gapminder dataset into a list of data frames (which is kind of like the converse of a data frame containing a list-column). To make sure it’s easy to follow, we will only keep 5 rows from each continent.

set.seed(23489)
gapminder_list <- gapminder %>% split(gapminder$continent) %>%
  map(~sample_n(., 5))
gapminder_list
## $Africa
##        country year      pop continent lifeExp gdpPercap
## 1    Swaziland 1992   962344    Africa  58.474 3553.0224
## 2     Botswana 1957   474639    Africa  49.618  918.2325
## 3     Cameroon 1962  5793633    Africa  42.643 1399.6074
## 4        Kenya 1972 12044785    Africa  53.559 1222.3600
## 5 South Africa 1962 18356657    Africa  49.951 5768.7297
## 
## $Americas
##     country year      pop continent lifeExp gdpPercap
## 1 Guatemala 1957  3640876  Americas  44.142  2617.156
## 2    Panama 1962  1215725  Americas  61.817  3536.540
## 3      Peru 1952  8025700  Americas  43.902  3758.523
## 4    Brazil 1957 65551171  Americas  53.285  2487.366
## 5  Honduras 1962  2090162  Americas  48.041  2291.157
## 
## $Asia
##              country year      pop continent lifeExp gdpPercap
## 1 West Bank and Gaza 1972  1089572      Asia  56.532 3133.4093
## 2           Thailand 1982 48827160      Asia  64.597 2393.2198
## 3           Cambodia 2007 14131858      Asia  59.723 1713.7787
## 4           Cambodia 1952  4693836      Asia  39.417  368.4693
## 5            Lebanon 2002  3677780      Asia  71.028 9313.9388
## 
## $Europe
##       country year      pop continent lifeExp gdpPercap
## 1     Hungary 1972 10394091    Europe  69.760  10168.66
## 2      Serbia 1997 10336594    Europe  72.232   7914.32
## 3     Croatia 2007  4493312    Europe  75.748  14619.22
## 4 Switzerland 1957  5126000    Europe  70.560  17909.49
## 5     Denmark 1972  4991596    Europe  73.470  18866.21
## 
## $Oceania
##       country year      pop continent lifeExp gdpPercap
## 1 New Zealand 2002  3908037   Oceania   79.11  23189.80
## 2   Australia 1982 15184200   Oceania   74.74  19477.01
## 3   Australia 1997 18565243   Oceania   78.83  26997.94
## 4 New Zealand 1987  3317166   Oceania   74.32  19007.19
## 5   Australia 1962 10794968   Oceania   70.93  12217.23
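
In recent versions of dplyr, sample_n() has been superseded by slice_sample(), so the same split-then-sample pattern can also be written as in the sketch below. It assumes the built-in iris data in place of gapminder, and the seed is arbitrary.

```r
library(purrr)
library(dplyr)

set.seed(1)  # arbitrary seed so the sample is reproducible

# Assumption: iris (built into R) stands in for gapminder
iris_list <- iris %>%
  split(iris$Species) %>%
  map(~slice_sample(.x, n = 5))

# split() names the list by group, and map() keeps those names
names(iris_list)
# every element is now a data frame with 5 rows
map_int(iris_list, nrow)
```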

Keep/Discard: select_if for lists

keep() only keeps elements of a list that satisfy a given condition, much like select_if() selects columns of a data frame that satisfy a given condition.

The following code only keeps the gapminder continent data frames (the elements of the list) that have an average (among the sample of 5 rows) life expectancy greater than 70.

gapminder_list %>%
  keep(~{mean(.x$lifeExp) > 70})
## $Europe
##       country year      pop continent lifeExp gdpPercap
## 1     Hungary 1972 10394091    Europe  69.760  10168.66
## 2      Serbia 1997 10336594    Europe  72.232   7914.32
## 3     Croatia 2007  4493312    Europe  75.748  14619.22
## 4 Switzerland 1957  5126000    Europe  70.560  17909.49
## 5     Denmark 1972  4991596    Europe  73.470  18866.21
## 
## $Oceania
##       country year      pop continent lifeExp gdpPercap
## 1 New Zealand 2002  3908037   Oceania   79.11  23189.80
## 2   Australia 1982 15184200   Oceania   74.74  19477.01
## 3   Australia 1997 18565243   Oceania   78.83  26997.94
## 4 New Zealand 1987  3317166   Oceania   74.32  19007.19
## 5   Australia 1962 10794968   Oceania   70.93  12217.23

discard() does the opposite of keep(): it discards any elements that satisfy your logical condition.
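
Here is a small sketch of keep() and discard() side by side. The scores list and the is_high() helper are hypothetical, made up for this example.

```r
library(purrr)

# Hypothetical list of numeric vectors, for illustration only
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5))

# Hypothetical predicate: is the mean of the vector greater than 4?
is_high <- function(x) mean(x) > 4

names(keep(scores, is_high))
## [1] "b" "c"
names(discard(scores, is_high))
## [1] "a"

# discard(x, p) gives the same result as keep(x, negate(p))
identical(discard(scores, is_high), keep(scores, negate(is_high)))
## [1] TRUE
```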

Reduce

reduce() is designed to combine (reduce) all of the elements of a list into a single object by iteratively applying a binary function (a function that takes two inputs).

For instance, applying a reduce function to add up all of the elements of the vector c(1, 2, 3) is like doing sum(sum(1, 2), 3): first it applies sum to 1 and 2, then it applies sum again to the output of sum(1, 2) and 3.

reduce(c(1, 2, 3), sum)
## [1] 6

accumulate() also returns the intermediate values.

accumulate(c(1, 2, 3), sum)
## [1] 1 3 6
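
To make the order of operations visible, here is a sketch that uses a made-up binary function (combine(), hypothetical) that records how its two inputs were put together instead of adding them.

```r
library(purrr)

# Hypothetical binary function that records how its inputs were combined
combine <- function(acc, nxt) paste0("(", acc, " + ", nxt, ")")

# reduce() works from left to right and returns only the final value
reduce(c("a", "b", "c"), combine)
## [1] "((a + b) + c)"

# accumulate() returns every intermediate value as well:
# "a", then "(a + b)", then "((a + b) + c)"
steps <- accumulate(c("a", "b", "c"), combine)
steps[2]
## [1] "(a + b)"
```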

An example of when reduce() might come in handy is when you want to perform many left_join()s in a row, or to do repeated rbind()s (e.g. to bind the rows of the list back together into a single data frame):

gapminder_list %>%
  reduce(rbind)
##               country year      pop continent lifeExp  gdpPercap
## 1           Swaziland 1992   962344    Africa  58.474  3553.0224
## 2            Botswana 1957   474639    Africa  49.618   918.2325
## 3            Cameroon 1962  5793633    Africa  42.643  1399.6074
## 4               Kenya 1972 12044785    Africa  53.559  1222.3600
## 5        South Africa 1962 18356657    Africa  49.951  5768.7297
## 6           Guatemala 1957  3640876  Americas  44.142  2617.1560
## 7              Panama 1962  1215725  Americas  61.817  3536.5403
## 8                Peru 1952  8025700  Americas  43.902  3758.5234
## 9              Brazil 1957 65551171  Americas  53.285  2487.3660
## 10           Honduras 1962  2090162  Americas  48.041  2291.1568
## 11 West Bank and Gaza 1972  1089572      Asia  56.532  3133.4093
## 12           Thailand 1982 48827160      Asia  64.597  2393.2198
## 13           Cambodia 2007 14131858      Asia  59.723  1713.7787
## 14           Cambodia 1952  4693836      Asia  39.417   368.4693
## 15            Lebanon 2002  3677780      Asia  71.028  9313.9388
## 16            Hungary 1972 10394091    Europe  69.760 10168.6561
## 17             Serbia 1997 10336594    Europe  72.232  7914.3203
## 18            Croatia 2007  4493312    Europe  75.748 14619.2227
## 19        Switzerland 1957  5126000    Europe  70.560 17909.4897
## 20            Denmark 1972  4991596    Europe  73.470 18866.2072
## 21        New Zealand 2002  3908037   Oceania  79.110 23189.8014
## 22          Australia 1982 15184200   Oceania  74.740 19477.0093
## 23          Australia 1997 18565243   Oceania  78.830 26997.9366
## 24        New Zealand 1987  3317166   Oceania  74.320 19007.1913
## 25          Australia 1962 10794968   Oceania  70.930 12217.2269
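
The left_join() case mentioned above could look something like the following sketch. The three small tables and their shared "id" key are hypothetical, made up for this example.

```r
library(purrr)
library(dplyr)

# Hypothetical tables that share an "id" key, for illustration only
tables <- list(
  tibble(id = 1:3, height = c(150, 160, 170)),
  tibble(id = 1:3, weight = c(50, 60, 70)),
  tibble(id = 1:3, age = c(30, 40, 50))
)

# Same as joining the first two tables, then joining the result to the third
joined <- reduce(tables, function(acc, nxt) left_join(acc, nxt, by = "id"))

# The result has one row per id and the columns id, height, weight, age
names(joined)
nrow(joined)
```

For the specific case of stacking rows, dplyr's bind_rows() also accepts a list of data frames directly, so reduce() is not the only option there.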

Logical statements for lists

Asking logical questions of a list can be done using every() and some(). For instance, to ask whether every continent has an average life expectancy greater than 70, you can use every():

gapminder_list %>% every(~{mean(.x$lifeExp) > 70})
## [1] FALSE

To ask whether at least one continent has an average life expectancy greater than 70, you can use some():

gapminder_list %>% some(~{mean(.x$lifeExp) > 70})
## [1] TRUE

An equivalent of %in% for lists is has_element().

list(1, c(2, 5, 1), "a") %>% has_element("a")
## [1] TRUE

Most of these functions also work on vectors.
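
As a quick sketch of what that looks like on plain vectors (the is_even() helper is hypothetical, made up for this example). One thing to watch for, as far as I understand it: has_element() compares elements with identical(), so the types have to match as well as the values.

```r
library(purrr)

# Hypothetical predicate, for illustration only
is_even <- function(x) x %% 2 == 0

every(c(2, 4, 6), is_even)
## [1] TRUE
every(c(1, 3, 4), is_even)
## [1] FALSE
some(c(1, 3, 4), is_even)
## [1] TRUE

has_element(c(1, 2, 3), 2)
## [1] TRUE

# 1:3 is an integer vector while 2 is a double, so there is no identical match
has_element(1:3, 2)
## [1] FALSE
has_element(1:3, 2L)
## [1] TRUE
```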

Now go forth and purrr!

Answer to advanced exercise

The following code produces the table from the exercise above

gapminder %>% 
  group_by(continent) %>% 
  nest() %>%
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + year + gdpPercap, data = .))) %>%
  mutate(lm_tidy = map(lm_obj, broom::tidy)) %>%
  ungroup() %>%
  transmute(continent, lm_tidy) %>%
  unnest(cols = c(lm_tidy))
## # A tibble: 20 x 6
##    continent term         estimate std.error statistic  p.value
##    <fct>     <chr>           <dbl>     <dbl>     <dbl>    <dbl>
##  1 Asia      (Intercept) -7.83e+ 2   4.83e+1  -16.2    1.22e-45
##  2 Asia      pop          4.23e-11   2.04e-9    0.0207 9.83e- 1
##  3 Asia      year         4.25e- 1   2.44e-2   17.4    1.13e-50
##  4 Asia      gdpPercap    2.51e- 4   3.01e-5    8.34   1.31e-15
##  5 Europe    (Intercept) -1.61e+ 2   2.28e+1   -7.09   7.44e-12
##  6 Europe    pop         -8.18e- 9   7.80e-9   -1.05   2.95e- 1
##  7 Europe    year         1.16e- 1   1.16e-2    9.96   8.88e-21
##  8 Europe    gdpPercap    3.25e- 4   2.15e-5   15.2    2.21e-40
##  9 Africa    (Intercept) -4.70e+ 2   3.39e+1  -13.9    2.17e-38
## 10 Africa    pop         -3.68e- 9   1.89e-8   -0.195  8.45e- 1
## 11 Africa    year         2.61e- 1   1.71e-2   15.2    1.07e-44
## 12 Africa    gdpPercap    1.12e- 3   1.01e-4   11.1    2.46e-26
## 13 Americas  (Intercept) -5.33e+ 2   4.10e+1  -13.0    6.40e-31
## 14 Americas  pop         -2.15e- 8   8.62e-9   -2.49   1.32e- 2
## 15 Americas  year         3.00e- 1   2.08e-2   14.4    3.79e-36
## 16 Americas  gdpPercap    6.75e- 4   7.15e-5    9.44   1.13e-18
## 17 Oceania   (Intercept) -2.10e+ 2   5.12e+1   -4.10   5.61e- 4
## 18 Oceania   pop          8.37e- 9   3.34e-8    0.251  8.05e- 1
## 19 Oceania   year         1.42e- 1   2.65e-2    5.34   3.19e- 5
## 20 Oceania   gdpPercap    2.03e- 4   8.47e-5    2.39   2.66e- 2