A basic tutorial of caret: the machine learning package in R

R has a wide number of packages for machine learning (ML), which is great, but also quite frustrating since each package was designed independently and has very different syntax, inputs and outputs. Caret unifies these packages into a single package with constant syntax, saving everyone a lot of frustration and time!

Rebecca Barter

Note: This tutorial was based on an older version of the abalone data that had a binary old varibale rather than a numeric age variable. It has been modified lightly so that it uses a manual old variable (is the abalone older than 10 or not) and ignores the numeric age variable.

Materials prepared by Rebecca Barter. Package developed by Max Kuhn.

An interactive Jupyter Notebook version of this tutorial can be found at https://github.com/rlbarter/STAT-215A-Fall-2017/tree/master/week11. Feel free to download it and use for your own learning or teaching adventures!

R has a wide number of packages for machine learning (ML), which is great, but also quite frustrating since each package was designed independently and has very different syntax, inputs and outputs.

This means that if you want to do machine learning in R, you have to learn a large number of separate methods.

Recognizing this, Max Kuhn (at the time working in drug discovery at Pfizer, now at RStudio) put together a single package for performing any machine learning method you like. This package is called caret. Caret stands for Classification And Regression Training. Apparently caret has little to do with our orange friend, the carrot.

Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:

  • Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)

  • Data splitting

  • Variable selection

  • Model evaluation

An extensive vignette for caret can be found here: https://topepo.github.io/caret/index.html

A simple view of caret: the default train function

To implement your machine learning model of choice using caret you will use the train function. The types of modeling options available are many and are listed here: https://topepo.github.io/caret/available-models.html. In the example below, we will use the ranger implementation of random forest to predict whether abalone are “old” or not based on a bunch of physical properties of the abalone (sex, height, weight, diameter, etc). The abalone data came from the UCI Machine Learning repository (we split the data into a training and test set).

First we load the data into R:

# load in packages
library(caret)
library(ranger)
library(tidyverse)
library(e1071)
# load in abalone dataset
abalone_data <- read.table("../data/abalone.data", sep = ",")
# load in column names
colnames(abalone_data) <- c("sex", "length", "diameter", "height", 
                            "whole.weight", "shucked.weight", 
                            "viscera.weight", "shell.weight", "age")
# add a logical variable for "old" (age > 10)
abalone_data <- abalone_data %>%
  mutate(old = age > 10) %>%
  # remove the "age" variable
  select(-age)
# split into training and testing
set.seed(23489)
train_index <- sample(1:nrow(abalone_data), 0.9 * nrow(abalone_data))
abalone_train <- abalone_data[train_index, ]
abalone_test <- abalone_data[-train_index, ]
# remove the original dataset
rm(abalone_data)
# view the first 6 rows of the training data
head(abalone_train)
##      sex length diameter height whole.weight shucked.weight viscera.weight
## 3670   I  0.585    0.460  0.140       0.7635         0.3260         0.1530
## 249    I  0.305    0.245  0.075       0.1560         0.0675         0.0380
## 498    F  0.605    0.485  0.165       1.0105         0.4350         0.2090
## 1889   F  0.565    0.445  0.125       0.8305         0.3135         0.1785
## 3488   I  0.510    0.405  0.130       0.5990         0.3065         0.1155
## 1852   I  0.485    0.370  0.115       0.4570         0.1885         0.0965
##      shell.weight   old
## 3670       0.2650 FALSE
## 249        0.0450 FALSE
## 498        0.3000  TRUE
## 1889       0.2300  TRUE
## 3488       0.1485 FALSE
## 1852       0.1500 FALSE

It looks like we have 3,759 abalone:

dim(abalone_train)
## [1] 3759    9

Time to fit a random forest model using caret. Anytime we want to fit a model using train we tell it which model to fit by providing a formula for the first argument (as.factor(old) ~ . means that we want to model old as a function of all of the other variables). Then we need to provide a method (we specify "ranger" to implement randomForest).

# fit a random forest model (using ranger)
rf_fit <- train(as.factor(old) ~ ., 
                data = abalone_train, 
                method = "ranger")

By default, the train function without any arguments re-runs the model over 25 bootstrap samples and across 3 options of the tuning parameter (the tuning parameter for ranger is mtry; the number of randomly selected predictors at each cut in the tree).

rf_fit
## Random Forest 
## 
## 3759 samples
##    8 predictor
##    2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3759, 3759, 3759, 3759, 3759, 3759, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##   2     gini        0.7888300  0.5774737
##   2     extratrees  0.7837155  0.5670365
##   5     gini        0.7822016  0.5642150
##   5     extratrees  0.7826548  0.5650665
##   9     gini        0.7764632  0.5527355
##   9     extratrees  0.7838949  0.5675586
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2 and splitrule = gini.

To test the data on an independent test set is equally as simple using the inbuilt predict function.

# predict the outcome on a test set
abalone_rf_pred <- predict(rf_fit, abalone_test)
# compare predicted outcome and true outcome
confusionMatrix(abalone_rf_pred, as.factor(abalone_test$old))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    81  143
##      TRUE    192    2
##                                           
##                Accuracy : 0.1986          
##                  95% CI : (0.1614, 0.2401)
##     No Information Rate : 0.6531          
##     P-Value [Acc > NIR] : 1.000000        
##                                           
##                   Kappa : -0.6389         
##  Mcnemar's Test P-Value : 0.008728        
##                                           
##             Sensitivity : 0.29670         
##             Specificity : 0.01379         
##          Pos Pred Value : 0.36161         
##          Neg Pred Value : 0.01031         
##              Prevalence : 0.65311         
##          Detection Rate : 0.19378         
##    Detection Prevalence : 0.53589         
##       Balanced Accuracy : 0.15525         
##                                           
##        'Positive' Class : FALSE           
## 

Getting a little fancier with caret

We have now seen how to fit a model along with the default resampling implementation (bootstrapping) and parameter selection. While this is great, there are many more things we could do with caret.

Pre-processing (preProcess)

There are a number of pre-processing steps that are easily implemented by caret. Several stand-alone functions from caret target specific issues that might arise when setting up the model. These include

  • dummyVars: creating dummy variables from categorical variables with multiple categories

  • nearZeroVar: identifying zero- and near zero-variance predictors (these may cause issues when subsampling)

  • findCorrelation: identifying correlated predictors

  • findLinearCombos: identify linear dependencies between predictors

In addition to these individual functions, there also exists the preProcess function which can be used to perform more common tasks such as centering and scaling, imputation and transformation. preProcess takes in a data frame to be processed and a method which can be any of “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”.

# center, scale and perform a YeoJohnson transformation
# identify and remove variables with near zero variance
# perform pca
abalone_no_nzv_pca <- preProcess(select(abalone_train, - old), 
                        method = c("center", "scale", "YeoJohnson", "nzv", "pca"))
abalone_no_nzv_pca
## Created from 3759 samples and 8 variables
## 
## Pre-processing:
##   - centered (7)
##   - ignored (1)
##   - principal component signal extraction (7)
##   - scaled (7)
##   - Yeo-Johnson transformation (5)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -1.83, 0.07, -0.81, -1.82, -1.13
## PCA needed 2 components to capture 95 percent of the variance
# identify which variables were ignored, centered, scaled, etc
abalone_no_nzv_pca$method
## $center
## [1] "length"         "diameter"       "height"         "whole.weight"  
## [5] "shucked.weight" "viscera.weight" "shell.weight"  
## 
## $scale
## [1] "length"         "diameter"       "height"         "whole.weight"  
## [5] "shucked.weight" "viscera.weight" "shell.weight"  
## 
## $YeoJohnson
## [1] "height"         "whole.weight"   "shucked.weight" "viscera.weight"
## [5] "shell.weight"  
## 
## $pca
## [1] "length"         "diameter"       "height"         "whole.weight"  
## [5] "shucked.weight" "viscera.weight" "shell.weight"  
## 
## $ignore
## [1] "sex"
# identify the principal components
abalone_no_nzv_pca$rotation
##                       PC1         PC2
## length         -0.3816405 -0.07870604
## diameter       -0.3819720 -0.03188148
## height         -0.3608405  0.88215353
## whole.weight   -0.3860649 -0.18699204
## shucked.weight -0.3774996 -0.36039743
## viscera.weight -0.3791196 -0.21933075
## shell.weight   -0.3780984  0.04044471

Data splitting (createDataPartition and groupKFold)

Generating subsets of the data is easy with the createDataPartition function. While this function can be used to simply generate training and testing sets, it can also be used to subset the data while respecting important groupings that exist within the data.

First, we show an example of performing general sample splitting to generate 10 different 80% subsamples.

# identify the indices of 10 80% subsamples of the iris data
train_index <- createDataPartition(iris$Species,
                                   p = 0.8,
                                   list = FALSE,
                                   times = 10)
# look at the first 6 indices of each subsample
head(train_index)
##      Resample01 Resample02 Resample03 Resample04 Resample05 Resample06
## [1,]          3          1          1          2          1          1
## [2,]          8          2          2          5          4          2
## [3,]          9          4          3          7          5          3
## [4,]         10          5          5          8          6          4
## [5,]         11          6          6         10          7          6
## [6,]         13          7          7         11          8          7
##      Resample07 Resample08 Resample09 Resample10
## [1,]          2          1          1          1
## [2,]          4          2          2          3
## [3,]          5          3          3          4
## [4,]          6          4          4          6
## [5,]          9          5          5          7
## [6,]         10          6          6          8

While the above is incredibly useful, it is also very easy to do using a for loop. Not so exciting.

Something that IS more exciting is the ability to do K-fold cross validation which respects groupings in the data. The groupKFold function does just that!

As an example, let’s consider the following made-up abalone groups so that each sequential set of 5 abalone that appear in the dataset together are in the same group. For simplicity we will only consider the first 50 abalone.

# add a madeup grouping variable that groupes each subsequent 5 abalone together
# filter to the first 50 abalone for simplicity
abalone_grouped <- cbind(abalone_train[1:50, ], group = rep(1:10, each = 5))
head(abalone_grouped, 10)
##      sex length diameter height whole.weight shucked.weight viscera.weight
## 3670   I  0.585    0.460  0.140       0.7635         0.3260         0.1530
## 249    I  0.305    0.245  0.075       0.1560         0.0675         0.0380
## 498    F  0.605    0.485  0.165       1.0105         0.4350         0.2090
## 1889   F  0.565    0.445  0.125       0.8305         0.3135         0.1785
## 3488   I  0.510    0.405  0.130       0.5990         0.3065         0.1155
## 1852   I  0.485    0.370  0.115       0.4570         0.1885         0.0965
## 2880   I  0.470    0.385  0.130       0.5870         0.2640         0.1170
## 3203   F  0.620    0.485  0.220       1.5110         0.5095         0.2840
## 365    F  0.620    0.500  0.175       1.1860         0.4985         0.3015
## 2230   M  0.370    0.280  0.095       0.2225         0.0805         0.0510
##      shell.weight   old group
## 3670       0.2650 FALSE     1
## 249        0.0450 FALSE     1
## 498        0.3000  TRUE     1
## 1889       0.2300  TRUE     1
## 3488       0.1485 FALSE     1
## 1852       0.1500 FALSE     2
## 2880       0.1740 FALSE     2
## 3203       0.5100  TRUE     2
## 365        0.3500  TRUE     2
## 2230       0.0750 FALSE     2

The following code performs 10-fold cross-validation while respecting the groups in the abalone data. That is, each group of abalone must always appear in the same group together.

# perform grouped K means
group_folds <- groupKFold(abalone_grouped$group, k = 10)
group_folds
## $Fold01
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold02
##  [1]  1  2  3  4  5 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold03
##  [1]  1  2  3  4  5  6  7  8  9 10 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold04
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold05
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold06
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold07
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $Fold08
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 41 42 43 44 45 46 47 48 49 50
## 
## $Fold09
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 46 47 48 49 50
## 
## $Fold10
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

Resampling options (trainControl)

One of the most important part of training ML models is tuning parameters. You can use the trainControl function to specify a number of parameters (including sampling parameters) in your model. The object that is outputted from trainControl will be provided as an argument for train.

set.seed(998)
# create a testing and training set
in_training <- createDataPartition(abalone_train$old, p = .75, list = FALSE)
training <- abalone_train[ in_training,]
testing  <- abalone_train[-in_training,]
# specify that the resampling method is 
fit_control <- trainControl(## 10-fold CV
                           method = "cv",
                           number = 10)
# run a random forest model
set.seed(825)
rf_fit <- train(as.factor(old) ~ ., 
                data = abalone_train, 
                method = "ranger",
                trControl = fit_control)
rf_fit
## Random Forest 
## 
## 3759 samples
##    8 predictor
##    2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3382, 3383, 3384, 3383, 3383, 3382, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##   2     gini        0.7871944  0.5188557
##   2     extratrees  0.7869327  0.5113079
##   5     gini        0.7821376  0.5100375
##   5     extratrees  0.7858653  0.5172792
##   9     gini        0.7762887  0.4969119
##   9     extratrees  0.7855958  0.5175170
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.

We could instead use our grouped folds (rather than random CV folds) by assigning the index argument of trainControl to be grouped_folds.

# specify that the resampling method is 
group_fit_control <- trainControl(## use grouped CV folds
                                  index = group_folds,
                                  method = "cv")
set.seed(825)
rf_fit <- train(as.factor(old) ~ ., 
                data = select(abalone_grouped, - group), 
                method = "ranger",
                trControl = group_fit_control)
rf_fit
## Random Forest 
## 
## 50 samples
##  8 predictor
##  2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 45, 45, 45, 45, 45, 45, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy  Kappa    
##   2     gini        0.70      0.4316517
##   2     extratrees  0.76      0.5305528
##   5     gini        0.68      0.3931901
##   5     extratrees  0.76      0.5305528
##   9     gini        0.70      0.4191642
##   9     extratrees  0.76      0.5305528
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule =
##  extratrees and min.node.size = 1.

You can also pass functions to trainControl that would have otherwise been passed to preProcess.

Model parameter tuning options (tuneGrid =)

You could specify your own tuning grid for model parameters using the tuneGrid argument of the train function. For example, you can define a grid of parameter combinations.

# define a grid of parameter options to try
rf_grid <- expand.grid(mtry = c(2, 3, 4, 5),
                      splitrule = c("gini", "extratrees"),
                      min.node.size = c(1, 3, 5))
rf_grid
##    mtry  splitrule min.node.size
## 1     2       gini             1
## 2     3       gini             1
## 3     4       gini             1
## 4     5       gini             1
## 5     2 extratrees             1
## 6     3 extratrees             1
## 7     4 extratrees             1
## 8     5 extratrees             1
## 9     2       gini             3
## 10    3       gini             3
## 11    4       gini             3
## 12    5       gini             3
## 13    2 extratrees             3
## 14    3 extratrees             3
## 15    4 extratrees             3
## 16    5 extratrees             3
## 17    2       gini             5
## 18    3       gini             5
## 19    4       gini             5
## 20    5       gini             5
## 21    2 extratrees             5
## 22    3 extratrees             5
## 23    4 extratrees             5
## 24    5 extratrees             5
# re-fit the model with the parameter grid
rf_fit <- train(as.factor(old) ~ ., 
                data = select(abalone_grouped, - group), 
                method = "ranger",
                trControl = group_fit_control,
                # provide a grid of parameters
                tuneGrid = rf_grid)
rf_fit
## Random Forest 
## 
## 50 samples
##  8 predictor
##  2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 45, 45, 45, 45, 45, 45, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   min.node.size  Accuracy  Kappa    
##   2     gini        1              0.68      0.3786214
##   2     gini        3              0.68      0.3786214
##   2     gini        5              0.70      0.4316517
##   2     extratrees  1              0.76      0.5305528
##   2     extratrees  3              0.76      0.5305528
##   2     extratrees  5              0.76      0.5305528
##   3     gini        1              0.68      0.3931901
##   3     gini        3              0.68      0.3931901
##   3     gini        5              0.70      0.4316517
##   3     extratrees  1              0.76      0.5305528
##   3     extratrees  3              0.76      0.5305528
##   3     extratrees  5              0.76      0.5305528
##   4     gini        1              0.70      0.4316517
##   4     gini        3              0.68      0.3931901
##   4     gini        5              0.72      0.4765235
##   4     extratrees  1              0.76      0.5305528
##   4     extratrees  3              0.76      0.5305528
##   4     extratrees  5              0.76      0.5305528
##   5     gini        1              0.68      0.3931901
##   5     gini        3              0.70      0.4380619
##   5     gini        5              0.70      0.4191642
##   5     extratrees  1              0.76      0.5305528
##   5     extratrees  3              0.76      0.5305528
##   5     extratrees  5              0.76      0.5305528
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule =
##  extratrees and min.node.size = 1.

Advanced topics

This tutorial has only scratched the surface of all of the options in the caret package. To find out more, see the extensive vignette https://topepo.github.io/caret/index.html.