Using the recipes package for easy pre-processing

Having to apply the same pre-processing steps to training, testing and validation data to do some machine learning can be surprisingly frustrating. But thanks to the recipes R package, it's now super-duper easy. Instead of having five functions and maybe hundreds of lines of code, you can pre-process multiple datasets using a single 'recipe' in fewer than 10 lines of code.

Rebecca Barter

Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often really wasn’t. Often my frustrations stemmed from simple things such as factor variables having different levels in the training data and test data, or a variable having missing values in the test data but not in the training data. I’d write a function that would pre-process the training data, and when I’d try to apply it to the test data, R would cry and yell and just be generally unpleasant.

Thankfully, most of the pain of pre-processing is now in the past thanks to the recipes R package, which is part of the new “tidymodels” package ecosystem (which, I guess, is supposed to be the modeling-focused equivalent of the data-focused “tidyverse” package ecosystem that includes dplyr, tidyr, and other super awesome packages like that). Recipes was developed by Max Kuhn and Hadley Wickham.

Those who have ever seen Hadley Wickham give a talk will know that baking and data are inherently related (see photo below).

Figure 1: A photo I took at an R Ladies SF meetup of Hadley’s cupcake recipes.

So let’s get baking!

The fundamentals of pre-processing your data using recipes

Creating a recipe has four steps:

  1. Get the ingredients (recipe()): specify the response variable and predictor variables

  2. Write the recipe (step_zzz()): define the pre-processing steps, such as imputation, creating dummy variables, scaling, and more

  3. Prepare the recipe (prep()): provide a dataset to base each step on (e.g. if one of the steps is to remove variables that only have one unique value, then you need to give it a dataset so it can decide which variables satisfy this criterion; this ensures that it does the same thing to every dataset you apply it to)

  4. Bake the recipe (bake()): apply the pre-processing steps to your datasets

In this blog post I’ll walk you through these four steps, touching on the wide range of things that recipes can do, while hopefully convincing you that recipes makes life really easy and that you should use it next time you need to do some pre-processing.
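Before we get to the cupcakes, here is a minimal sketch of the four steps on a tiny made-up dataset, just to show the overall shape. The data frames toy_train and toy_new and their columns are hypothetical, and I'm assuming a reasonably recent version of recipes in which the mean-imputation step is called step_impute_mean() and the all_numeric_predictors() selector is available (older versions use step_meanimpute() and all_numeric() instead).

```r
library(recipes)

# made-up data for illustration only (not the cupcake data used below)
toy_train <- data.frame(
  y  = factor(c("a", "b", "a", "b")),
  x1 = c(1, 2, NA, 4),
  x2 = c(10, 20, 30, 40)
)
toy_new <- data.frame(
  y  = factor(c("a", "b")),
  x1 = c(NA, 3),
  x2 = c(15, 25)
)

# 1. get the ingredients: outcome on the left of ~, predictors on the right
toy_recipe <- recipe(y ~ x1 + x2, data = toy_train)

# 2. write the recipe: nothing is computed yet, the steps are only recorded
toy_recipe <- toy_recipe %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors())

# 3. prepare the recipe: means are estimated from the training data
toy_prepped <- prep(toy_recipe, training = toy_train)

# 4. bake the recipe: apply the trained steps to any dataset
toy_train_baked <- bake(toy_prepped, new_data = toy_train)
toy_new_baked <- bake(toy_prepped, new_data = toy_new)
toy_new_baked
```

The missing x1 value in toy_new is filled in with the training mean of x1, not the mean of toy_new, which is exactly the behaviour that used to take me a pile of custom functions to get right.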

A simple example: cupcakes or muffins?

To keep things in the theme, I’m going to use a dataset from Alice Zhao’s git repo that I found when I typed “cupcake dataset” into Google. Our goal will be to classify recipes as either cupcakes or muffins based on the quantities used for each of the ingredients. So perhaps we will learn two things today: (1) how to use the recipes package, and (2) the difference between cupcakes and muffins.

# set up so that all variables of tibbles are printed
options(dplyr.width = Inf)
# load useful libraries
library(tidyverse)
library(recipes) # could also load the tidymodels package
# load in the data
muffin_cupcake_data_orig <- read_csv("https://raw.githubusercontent.com/adashofdata/muffin-cupcake/master/recipes_muffins_cupcakes.csv")
# look at data
muffin_cupcake_data_orig
## # A tibble: 20 x 9
##    Type    Flour  Milk Sugar Butter   Egg `Baking Powder` Vanilla  Salt
##    <chr>   <int> <int> <int>  <int> <int>           <int>   <int> <int>
##  1 Muffin     55    28     3      7     5               2       0     0
##  2 Muffin     47    24    12      6     9               1       0     0
##  3 Muffin     47    23    18      6     4               1       0     0
##  4 Muffin     45    11    17     17     8               1       0     0
##  5 Muffin     50    25    12      6     5               2       1     0
##  6 Muffin     55    27     3      7     5               2       1     0
##  7 Muffin     54    27     7      5     5               2       0     0
##  8 Muffin     47    26    10     10     4               1       0     0
##  9 Muffin     50    17    17      8     6               1       0     0
## 10 Muffin     50    17    17     11     4               1       0     0
## 11 Cupcake    39     0    26     19    14               1       1     0
## 12 Cupcake    42    21    16     10     8               3       0     0
## 13 Cupcake    34    17    20     20     5               2       1     0
## 14 Cupcake    39    13    17     19    10               1       1     0
## 15 Cupcake    38    15    23     15     8               0       1     0
## 16 Cupcake    42    18    25      9     5               1       0     0
## 17 Cupcake    36    14    21     14    11               2       1     0
## 18 Cupcake    38    15    31      8     6               1       1     0
## 19 Cupcake    36    16    24     12     9               1       1     0
## 20 Cupcake    34    17    23     11    13               0       1     0

Since the space in the column name Baking Powder is going to really annoy me, I’m going to do a quick clean where I convert all of the column names to lower case and replace the space with an underscore.

As a side note, I’ve started naming all of my temporary function arguments (lambda functions?) with a period preceding the name. I find it makes it a lot easier to read. As another side note, if you’ve never seen the rename_all() function before, check out my blog post on scoped verbs!

muffin_cupcake_data <- muffin_cupcake_data_orig %>%
  # rename all columns 
  rename_all(function(.name) {
    .name %>% 
      # replace all names with the lowercase versions
      tolower() %>%
      # replace all spaces with underscores
      # (str_replace() would only replace the first space in each name)
      str_replace_all(" ", "_")
    })
# check that this did what I wanted
muffin_cupcake_data
## # A tibble: 20 x 9
##    type    flour  milk sugar butter   egg baking_powder vanilla  salt
##    <chr>   <int> <int> <int>  <int> <int>         <int>   <int> <int>
##  1 Muffin     55    28     3      7     5             2       0     0
##  2 Muffin     47    24    12      6     9             1       0     0
##  3 Muffin     47    23    18      6     4             1       0     0
##  4 Muffin     45    11    17     17     8             1       0     0
##  5 Muffin     50    25    12      6     5             2       1     0
##  6 Muffin     55    27     3      7     5             2       1     0
##  7 Muffin     54    27     7      5     5             2       0     0
##  8 Muffin     47    26    10     10     4             1       0     0
##  9 Muffin     50    17    17      8     6             1       0     0
## 10 Muffin     50    17    17     11     4             1       0     0
## 11 Cupcake    39     0    26     19    14             1       1     0
## 12 Cupcake    42    21    16     10     8             3       0     0
## 13 Cupcake    34    17    20     20     5             2       1     0
## 14 Cupcake    39    13    17     19    10             1       1     0
## 15 Cupcake    38    15    23     15     8             0       1     0
## 16 Cupcake    42    18    25      9     5             1       0     0
## 17 Cupcake    36    14    21     14    11             2       1     0
## 18 Cupcake    38    15    31      8     6             1       1     0
## 19 Cupcake    36    16    24     12     9             1       1     0
## 20 Cupcake    34    17    23     11    13             0       1     0
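If you are on dplyr 1.0 or later, rename_with() does the same job as rename_all(). Here is a small sketch on a made-up one-row tibble (the column `Brown Sugar Mix` is hypothetical, added only so that one name has more than one space). Worth keeping in mind: str_replace() replaces only the first match in each string, whereas str_replace_all() replaces every match, which matters for names with more than one space.

```r
library(dplyr)
library(stringr)

# made-up tibble whose names need cleaning
toy <- tibble(Type = "Muffin", `Baking Powder` = 2L, `Brown Sugar Mix` = 1L)

toy_clean <- toy %>%
  # lowercase every column name, then replace every space with an underscore
  rename_with(~ str_replace_all(tolower(.x), " ", "_"))

names(toy_clean)
```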

Since recipes does a lot of useful stuff for categorical variables as well as with missing values, I’m going to modify the data a little bit so that it’s a bit more interesting (for educational purposes only - don’t ever actually modify your data so it’s more interesting, in science that’s called “fraud”, and fraud is bad).

# add an additional ingredients column that is categorical
muffin_cupcake_data <- muffin_cupcake_data %>%
  mutate(additional_ingredients = c("fruit", 
                                    "fruit", 
                                    "none", 
                                    "nuts", 
                                    "fruit", 
                                    "fruit", 
                                    "nuts", 
                                    "none", 
                                    "none", 
                                    "nuts",
                                    "icing",
                                    "icing",
                                    "fruit",
                                    "none",
                                    "fruit",
                                    "icing",
                                    "none",
                                    "fruit",
                                    "icing",
                                    "icing"))
# add some random missing values here and there just for fun
set.seed(26738)
muffin_cupcake_data <- muffin_cupcake_data %>%
  # only add missing values to numeric columns
  mutate_if(is.numeric,
            function(x) {
              # randomly decide if 0, 1, 2, or 3 values will be missing from each column
              n_missing <- sample(0:3, 1)
              # replace n_missing randomly selected values from each column with NA
              x[sample(1:20, n_missing)] <- NA
              return(x)
              })
muffin_cupcake_data
## # A tibble: 20 x 10
##    type    flour  milk sugar butter   egg baking_powder vanilla  salt
##    <chr>   <int> <int> <int>  <int> <int>         <int>   <int> <int>
##  1 Muffin     55    28     3      7     5             2       0     0
##  2 Muffin     47    24    12      6     9             1       0     0
##  3 Muffin     47    23    18      6    NA             1       0     0
##  4 Muffin     45    11    17     17     8             1      NA     0
##  5 Muffin     50    NA    12      6     5             2       1     0
##  6 Muffin     55    27     3      7     5             2       1     0
##  7 Muffin     54    NA     7      5     5             2       0     0
##  8 Muffin     47    26    10     10    NA             1       0     0
##  9 Muffin     50    17    17      8     6             1       0     0
## 10 Muffin     50    17    17     11     4             1       0     0
## 11 Cupcake    39     0    NA     19    14             1       1     0
## 12 Cupcake    42    21    16     10     8             3       0     0
## 13 Cupcake    34    17    20     20     5             2       1     0
## 14 Cupcake    39    13    17     19    10             1       1     0
## 15 Cupcake    38    15    23     15     8             0       1     0
## 16 Cupcake    42    18    25      9     5             1      NA     0
## 17 Cupcake    36    14    21     14    NA             2       1     0
## 18 Cupcake    38    15    31      8     6             1      NA    NA
## 19 Cupcake    36    16    24     12     9             1       1     0
## 20 Cupcake    34    17    23     11    13             0       1     0
##    additional_ingredients
##    <chr>                 
##  1 fruit                 
##  2 fruit                 
##  3 none                  
##  4 nuts                  
##  5 fruit                 
##  6 fruit                 
##  7 nuts                  
##  8 none                  
##  9 none                  
## 10 nuts                  
## 11 icing                 
## 12 icing                 
## 13 fruit                 
## 14 none                  
## 15 fruit                 
## 16 icing                 
## 17 none                  
## 18 fruit                 
## 19 icing                 
## 20 icing

Finally, I’m going to split my data into training and test sets, so that you can see how nicely our recipe can be applied to multiple data frames.

library(rsample)
muffin_cupcake_split <- initial_split(muffin_cupcake_data)
muffin_cupcake_train <- training(muffin_cupcake_split)
muffin_cupcake_test <- testing(muffin_cupcake_split)
rm(muffin_cupcake_data)
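As an aside, initial_split() picks rows at random, so the split changes from run to run unless you set a seed first. Below is a small sketch on made-up data (toy_data is hypothetical) that also shows the prop and strata arguments. By default prop is 3/4, which is why the cupcake data ends up as 15 training rows and 5 test rows.

```r
library(rsample)

# made-up data: 10 muffins and 10 cupcakes
toy_data <- data.frame(
  type  = rep(c("Muffin", "Cupcake"), each = 10),
  flour = 31:50
)

# fix the seed so that the same rows are selected every time
set.seed(123)
# keep roughly 75% for training, sampling within each level of type
toy_split <- initial_split(toy_data, prop = 0.75, strata = type)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# every row ends up in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)
```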

Our training data is

muffin_cupcake_train
## # A tibble: 15 x 10
##    type    flour  milk sugar butter   egg baking_powder vanilla  salt
##    <chr>   <int> <int> <int>  <int> <int>         <int>   <int> <int>
##  1 Muffin     47    24    12      6     9             1       0     0
##  2 Muffin     45    11    17     17     8             1      NA     0
##  3 Muffin     50    NA    12      6     5             2       1     0
##  4 Muffin     55    27     3      7     5             2       1     0
##  5 Muffin     54    NA     7      5     5             2       0     0
##  6 Muffin     47    26    10     10    NA             1       0     0
##  7 Muffin     50    17    17      8     6             1       0     0
##  8 Muffin     50    17    17     11     4             1       0     0
##  9 Cupcake    42    21    16     10     8             3       0     0
## 10 Cupcake    39    13    17     19    10             1       1     0
## 11 Cupcake    38    15    23     15     8             0       1     0
## 12 Cupcake    36    14    21     14    NA             2       1     0
## 13 Cupcake    38    15    31      8     6             1      NA    NA
## 14 Cupcake    36    16    24     12     9             1       1     0
## 15 Cupcake    34    17    23     11    13             0       1     0
##    additional_ingredients
##    <chr>                 
##  1 fruit                 
##  2 nuts                  
##  3 fruit                 
##  4 fruit                 
##  5 nuts                  
##  6 none                  
##  7 none                  
##  8 nuts                  
##  9 icing                 
## 10 none                  
## 11 fruit                 
## 12 none                  
## 13 fruit                 
## 14 icing                 
## 15 icing

and our testing data is

muffin_cupcake_test
## # A tibble: 5 x 10
##   type    flour  milk sugar butter   egg baking_powder vanilla  salt
##   <chr>   <int> <int> <int>  <int> <int>         <int>   <int> <int>
## 1 Muffin     55    28     3      7     5             2       0     0
## 2 Muffin     47    23    18      6    NA             1       0     0
## 3 Cupcake    39     0    NA     19    14             1       1     0
## 4 Cupcake    34    17    20     20     5             2       1     0
## 5 Cupcake    42    18    25      9     5             1      NA     0
##   additional_ingredients
##   <chr>                 
## 1 fruit                 
## 2 none                  
## 3 icing                 
## 4 fruit                 
## 5 icing

Writing and applying the recipe

Now that we’ve set up our data, we’re ready to write some recipes and do some baking! The first thing we need to do is get the ingredients. We can use formula notation within the recipe() function to do this: the thing we’re trying to predict is the variable to the left of the ~, and the predictor variables are the things to the right of it (since I’m including all of my variables, I could have written type ~ .).

# define the recipe (it looks a lot like applying the lm function)
model_recipe <- recipe(type ~ flour + milk + sugar + butter + egg + 
                         baking_powder + vanilla + salt + additional_ingredients, 
                       data = muffin_cupcake_train)

If we print a summary of the model_recipe object, it just shows us the variables we’ve specified, their type, and whether they’re a predictor or an outcome.

summary(model_recipe)
## # A tibble: 10 x 4
##    variable               type    role      source  
##    <chr>                  <chr>   <chr>     <chr>   
##  1 flour                  numeric predictor original
##  2 milk                   numeric predictor original
##  3 sugar                  numeric predictor original
##  4 butter                 numeric predictor original
##  5 egg                    numeric predictor original
##  6 baking_powder          numeric predictor original
##  7 vanilla                numeric predictor original
##  8 salt                   numeric predictor original
##  9 additional_ingredients nominal predictor original
## 10 type                   nominal outcome   original
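To see that the type ~ . shorthand really does assign the same roles, here is a quick sketch on a three-row made-up data frame (toy and its columns are hypothetical). It builds the recipe both ways and compares the roles reported by summary().

```r
library(recipes)

# made-up data with one outcome and two predictors
toy <- data.frame(
  type  = c("Muffin", "Cupcake", "Muffin"),
  flour = c(55, 39, 47),
  extra = c("fruit", "icing", "none")
)

# predictors listed one by one
rec_explicit <- recipe(type ~ flour + extra, data = toy)
# "." means "every other column in data"
rec_dot <- recipe(type ~ ., data = toy)

roles_explicit <- summary(rec_explicit)
roles_dot <- summary(rec_dot)
roles_dot[, c("variable", "role")]
```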

Writing the recipe steps

Now that we have our ingredients, we are ready to write the recipe (i.e. describe our pre-processing steps). We write the recipe one step at a time. We have many steps to choose from, including:

  • step_dummy(): creating dummy variables from categorical variables.

  • step_zzzimpute(): where instead of “zzz” it is the name of a method, such as step_knnimpute(), step_meanimpute(), step_modeimpute() (more recent releases of recipes rename these to step_impute_knn(), step_impute_mean() and step_impute_mode()). I find that the fancier imputation methods are reeeeally slow for decently large datasets, so I would probably do this step outside of the recipes package unless you just want to do a quick mean or mode impute (which, to be honest, I often do).

  • step_scale(): normalize to have a standard deviation of 1.

  • step_center(): center to have a mean of 0.

  • step_range(): normalize numeric data to be within a pre-defined range of values.

  • step_pca(): create principal component variables from your data.

  • step_nzv(): remove variables that have (or almost have) the same value for every data point.

You can also create your own step (which I’ve never felt the need to do, but the details of which can be found here https://tidymodels.github.io/recipes/articles/Custom_Steps.html).

In each step, you need to specify which variables you want to apply it to. There are many ways to do this:

  1. Specifying the variable name(s) as the first argument

  2. Standard dplyr selectors:

    • everything() applies the step to all columns,

    • contains() allows you to specify column names that contain a specific string,

    • starts_with() allows you to specify column names that start with a specific string,

    • etc

  3. Functions that specify the role of the variables:

    • all_predictors() applies the step to the predictor variables only

    • all_outcomes() applies the step to the outcome variable(s) only

  4. Functions that specify the type of the variables:

    • all_nominal() applies the step to all variables that are nominal (categorical)

    • all_numeric() applies the step to all variables that are numeric

To ignore a specific column, you can specify its name with a negative sign as a variable (just like you would in select()).

# define the steps we want to apply
model_recipe_steps <- model_recipe %>% 
  # mean impute numeric variables
  step_meanimpute(all_numeric()) %>%
  # convert the additional ingredients variable to dummy variables
  step_dummy(additional_ingredients) %>%
  # rescale all numeric variables except for vanilla, salt and baking powder to lie between 0 and 1
  step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) %>%
  # remove predictor variables that are almost the same for every entry
  step_nzv(all_predictors()) 
model_recipe_steps
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Operations:
## 
## Mean Imputation for all_numeric()
## Dummy variables from additional_ingredients
## Range scaling to [0,1] for all_numeric(), -vanilla, ...
## Sparse, unbalanced variable filter on all_predictors()
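Here is a small sketch of how the selectors combine, on made-up data (toy and its columns are hypothetical). The step below centers every numeric column except the outcome and except x2, so only x1 should change.

```r
library(recipes)

# made-up data with a numeric outcome, two numeric predictors and a factor
toy <- data.frame(
  y  = c(1.5, 2.5, 3.5, 4.5),
  x1 = c(1, 2, 3, 4),
  x2 = c(10, 20, 30, 45),
  g  = factor(c("a", "b", "a", "b"))
)

rec <- recipe(y ~ ., data = toy) %>%
  # all numeric columns, minus the outcome, minus x2
  step_center(all_numeric(), -all_outcomes(), -x2)

baked <- bake(prep(rec, training = toy), new_data = toy)
baked
```

In recent versions of recipes, all_numeric_predictors() is a shorter way of writing all_numeric(), -all_outcomes().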

Note that the order in which you apply the steps does matter to some extent. The recommended ordering (taken from here) is

  1. Impute

  2. Individual transformations for skewness and other issues

  3. Discretize (if needed and if you have no other choice)

  4. Create dummy variables

  5. Create interactions

  6. Normalization steps (center, scale, range, etc)

  7. Multivariate transformation (e.g. PCA, spatial sign, etc)
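To make the point about ordering concrete, here is a sketch on made-up data (toy is hypothetical, and I'm assuming a recent recipes version that has all_numeric_predictors()). The two recipes contain the same two steps in a different order. A selector only sees the columns that exist when its step is reached, so the dummy variable is centered in the first recipe and left as plain 0/1 in the second.

```r
library(recipes)

# made-up data with one numeric and one categorical predictor
toy <- data.frame(
  y = factor(c("m", "c", "m", "c")),
  x = c(1, 2, 3, 6),
  g = factor(c("a", "b", "b", "b"))
)

# dummy variables first, then centering: the dummy column is centered too
dummy_then_center <- recipe(y ~ ., data = toy) %>%
  step_dummy(g) %>%
  step_center(all_numeric_predictors())

# centering first, then dummy variables: the dummy column is created afterwards
center_then_dummy <- recipe(y ~ ., data = toy) %>%
  step_center(all_numeric_predictors()) %>%
  step_dummy(g)

baked_a <- bake(prep(dummy_then_center, training = toy), new_data = toy)
baked_b <- bake(prep(center_then_dummy, training = toy), new_data = toy)

baked_a$g_b
baked_b$g_b
```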

Preparing the recipe

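Next, we need to provide a dataset on which to base the pre-processing steps. prep() uses this dataset to estimate whatever each step needs (for example, the means used for imputation and the minimum and maximum used for range scaling), which is what allows exactly the same recipe to be applied to multiple datasets.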

prepped_recipe <- prep(model_recipe_steps, training = muffin_cupcake_train)
prepped_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Training data contained 15 data points and 6 incomplete rows. 
## 
## Operations:
## 
## Mean Imputation for flour, milk, sugar, butter, ... [trained]
## Dummy variables from additional_ingredients [trained]
## Range scaling to [0,1] for flour, milk, sugar, butter, ... [trained]
## Sparse, unbalanced variable filter removed salt [trained]
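If you are curious about what prep() actually learned, the tidy() function that recipes provides can list the steps and, for a given step number, the values estimated from the training data. A sketch on made-up data (toy_train is hypothetical, and the exact columns returned may differ between recipes versions):

```r
library(recipes)

# made-up training data
toy_train <- data.frame(
  y  = factor(c("a", "b", "a", "b")),
  x1 = c(1, 2, 3, 6),
  x2 = c(10, 20, 30, 40)
)

toy_prepped <- recipe(y ~ ., data = toy_train) %>%
  step_center(all_numeric_predictors()) %>%
  step_range(all_numeric_predictors(), min = 0, max = 1) %>%
  prep(training = toy_train)

# one row per step, including whether it has been trained
step_overview <- tidy(toy_prepped)
step_overview

# the means that step 1 (centering) estimated from toy_train
center_values <- tidy(toy_prepped, number = 1)
center_values

# the minimum and maximum that step 2 (range scaling) estimated
range_values <- tidy(toy_prepped, number = 2)
range_values
```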

Bake the recipe

Next, you apply your recipe to your datasets.

So what did our recipe do?

  • step_meanimpute(all_numeric()) imputed all of the missing values with the mean value for that variable

  • step_dummy(additional_ingredients) converted the additional_ingredients variable into three dummy variables corresponding to three of the four levels of the original variable (the remaining level, fruit, is the reference level for which all three dummies are 0)

  • step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) converted the range of all of the numeric variables except for those specified to lie between 0 and 1

  • step_nzv(all_predictors()) removed the salt variable since it was 0 across all rows (except where it was missing)

muffin_cupcake_train_preprocessed <- bake(prepped_recipe, muffin_cupcake_train) 
muffin_cupcake_train_preprocessed
## # A tibble: 15 x 11
##    type     flour  milk sugar butter   egg baking_powder vanilla
##    <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl>         <int>   <dbl>
##  1 Muffin  0.619  0.812 0.321 0.0714 0.556             1   0    
##  2 Muffin  0.524  0     0.5   0.857  0.444             1   0.538
##  3 Muffin  0.762  0.433 0.321 0.0714 0.111             2   1    
##  4 Muffin  1      1     0     0.143  0.111             2   1    
##  5 Muffin  0.952  0.433 0.143 0      0.111             2   0    
##  6 Muffin  0.619  0.938 0.25  0.357  0.376             1   0    
##  7 Muffin  0.762  0.375 0.5   0.214  0.222             1   0    
##  8 Muffin  0.762  0.375 0.5   0.429  0                 1   0    
##  9 Cupcake 0.381  0.625 0.464 0.357  0.444             3   0    
## 10 Cupcake 0.238  0.125 0.5   1      0.667             1   1    
## 11 Cupcake 0.190  0.25  0.714 0.714  0.444             0   1    
## 12 Cupcake 0.0952 0.188 0.643 0.643  0.376             2   1    
## 13 Cupcake 0.190  0.25  1     0.214  0.222             1   0.538
## 14 Cupcake 0.0952 0.312 0.75  0.5    0.556             1   1    
## 15 Cupcake 0      0.375 0.714 0.429  1                 0   1    
##    additional_ingredients_icing additional_ingredients_none
##                           <dbl>                       <dbl>
##  1                            0                           0
##  2                            0                           0
##  3                            0                           0
##  4                            0                           0
##  5                            0                           0
##  6                            0                           1
##  7                            0                           1
##  8                            0                           0
##  9                            1                           0
## 10                            0                           1
## 11                            0                           0
## 12                            0                           1
## 13                            0                           0
## 14                            1                           0
## 15                            1                           0
##    additional_ingredients_nuts
##                          <dbl>
##  1                           0
##  2                           1
##  3                           0
##  4                           0
##  5                           1
##  6                           0
##  7                           0
##  8                           1
##  9                           0
## 10                           0
## 11                           0
## 12                           0
## 13                           0
## 14                           0
## 15                           0
muffin_cupcake_test_preprocessed <- bake(prepped_recipe, muffin_cupcake_test)
muffin_cupcake_test_preprocessed
## # A tibble: 5 x 11
##   type    flour  milk sugar butter   egg baking_powder vanilla
##   <fct>   <dbl> <dbl> <dbl>  <dbl> <dbl>         <int>   <dbl>
## 1 Muffin  1     1     0     0.143  0.111             2   0    
## 2 Muffin  0.619 0.75  0.536 0.0714 0.376             1   0    
## 3 Cupcake 0.238 0     0.488 1      1                 1   1    
## 4 Cupcake 0     0.375 0.607 1      0.111             2   1    
## 5 Cupcake 0.381 0.438 0.786 0.286  0.111             1   0.538
##   additional_ingredients_icing additional_ingredients_none
##                          <dbl>                       <dbl>
## 1                            0                           0
## 2                            0                           1
## 3                            1                           0
## 4                            0                           0
## 5                            1                           0
##   additional_ingredients_nuts
##                         <dbl>
## 1                           0
## 2                           0
## 3                           0
## 4                           0
## 5                           0
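To convince myself that the test set really was processed using values learned from the training set, here is the arithmetic for two of the entries above in plain base R. The helper range_scale() is my own hypothetical stand-in for what step_range() appears to be doing, not the package's code. In particular, the truncation to [0, 1] is my reading of the output: the test recipe with 20 units of butter exceeds the training maximum of 19 but still comes out as 1.

```r
# egg values from the training data printed earlier (two are missing)
egg_train <- c(9, 8, 5, 5, 5, NA, 6, 4, 8, 10, 8, NA, 6, 9, 13)

# mean imputation: the training mean is 96 / 13
egg_mean <- mean(egg_train, na.rm = TRUE)

# rescale using the training minimum and maximum,
# truncating anything that falls outside [0, 1]
range_scale <- function(x, train_min, train_max) {
  scaled <- (x - train_min) / (train_max - train_min)
  pmin(pmax(scaled, 0), 1)
}

# the missing egg value in the test set: imputed, then rescaled
egg_imputed_scaled <- range_scale(
  egg_mean,
  train_min = min(egg_train, na.rm = TRUE),
  train_max = max(egg_train, na.rm = TRUE)
)
round(egg_imputed_scaled, 3)
# 0.376

# butter in the test set, using the training minimum (5) and maximum (19)
butter_test <- c(7, 6, 19, 20, 9)
butter_test_scaled <- range_scale(butter_test, train_min = 5, train_max = 19)
round(butter_test_scaled, 3)
# 0.143 0.071 1.000 1.000 0.286
```

Both results match the egg and butter columns of muffin_cupcake_test_preprocessed.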