Tidymodels: tidy machine learning in R

The tidyverse's take on machine learning is finally here. Tidymodels forms the basis of tidy machine learning, and this post provides a whirlwind tour to get you started.

Rebecca Barter

There’s a new modeling pipeline in town: tidymodels. Over the past few years, tidymodels has been gradually emerging as the tidyverse’s machine learning toolkit.

Why tidymodels? Well, it turns out that R has a consistency problem. Since everything was made by different people and using different principles, everything has a slightly different interface, and trying to keep everything in line can be frustrating. Several years ago, Max Kuhn (formerly at Pfizer, now at RStudio) developed the caret R package (see my caret tutorial) aimed at creating a uniform interface for the massive variety of machine learning models that exist in R. Caret was great in a lot of ways, but also limited in others. In my own use, I found it to be quite slow whenever I tried to use it on problems of even modest size.

That said, caret was a great starting point, so RStudio hired Max Kuhn to work on a tidy version of caret, and he and many other people have developed what has become tidymodels. Tidymodels has been in development for a few years, with snippets of it being released as they were developed (see my post on the recipes package). I’ve been holding off writing a post about tidymodels until it seemed as though the different pieces fit together sufficiently for it to all feel cohesive. I feel like they’re finally there - which means it is time for me to learn it! While caret isn’t going anywhere (you can continue to use caret, and your existing caret code isn’t going to stop working), tidymodels will eventually make it redundant.

The main resources I used to learn tidymodels were Alison Hill’s slides from Introduction to Machine Learning with the Tidyverse, the course she prepared with Garrett Grolemund for RStudio::conf(2020), and Edgar Ruiz’s Gentle introduction to tidymodels on the RStudio website.

Note that throughout this post I’ll be assuming basic tidyverse knowledge, primarily of dplyr (e.g. piping %>% and functions such as mutate()). Fortunately, for all you purrr-phobes out there, purrr is not required. If you’d like to brush up on your tidyverse skills, check out my Introduction to the Tidyverse posts. If you’d like to learn purrr (purrr is very handy for working with tidymodels but is no longer a requirement), check out my purrr post.

What is tidymodels

Much like the tidyverse consists of many core packages, such as ggplot2 and dplyr, tidymodels also consists of several core packages, including

  • rsample: for sample splitting (e.g. train/test or cross-validation)

  • recipes: for pre-processing

  • parsnip: for specifying the model

  • yardstick: for evaluating the model

Similarly to how you can load the entire tidyverse suite of packages by typing library(tidyverse), you can load the entire tidymodels suite of packages by typing library(tidymodels).
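
To make the division of labour between these four packages a little more concrete, here is a minimal sketch of where each one shows up in a very small modeling task. It uses R's built-in mtcars data purely as a stand-in (an assumption for illustration, not the dataset used in this post), and the object names (car_split, car_recipe, lm_spec, and so on) are made up for this example.

```r
# A rough sketch of which package does what; mtcars is only a stand-in dataset
library(tidymodels)

set.seed(123)

# rsample: split the data into training and testing sets
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare the pre-processing steps (nothing is computed yet)
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) %>%
  step_normalize(all_predictors())

# parsnip: specify the model type and the engine that will fit it
lm_spec <- linear_reg() %>%
  set_engine("lm")

# estimate the pre-processing on the training data, then fit the model
car_prepped <- prep(car_recipe, training = car_train)
lm_fit <- fit(lm_spec, mpg ~ ., data = bake(car_prepped, new_data = NULL))

# yardstick: evaluate the predictions on the held-out test set
car_preds <- predict(lm_fit, new_data = bake(car_prepped, new_data = car_test)) %>%
  bind_cols(car_test %>% select(mpg))
rmse(car_preds, truth = mpg, estimate = .pred)
```

Note that the normalization is estimated on the training set only and then applied to the test set, so no information from the test set leaks into the pre-processing.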

We will also be using the tune package (for the parameter tuning procedure) and the workflows package (for putting everything together) that I had thought were a part of CRAN’s tidymodels package bundle, but apparently they aren’t. These will need to be loaded separately for now.

Unlike in my tidyverse post, I won’t base this post around the packages themselves, but I will mention the packages in passing.

Getting set up

First we need to load some libraries: tidymodels and tidyverse, along with workflows and tune.

# load the relevant tidymodels libraries
library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)

If you don’t already have the tidymodels library (or any of the other libraries) installed, then you’ll need to install it (once only) using install.packages("tidymodels").
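
If you'd rather not check each package by hand, something like the following sketch installs only the ones that are missing. The pkgs vector is my own listing of the packages loaded in this post, and the code assumes a CRAN mirror is already configured for your R session.

```r
# Install only the packages from this post that aren't already available
# (assumes a CRAN mirror is configured)
pkgs <- c("tidymodels", "tidyverse", "workflows", "tune", "mlbench")

# requireNamespace() returns FALSE for packages that can't be loaded
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running this a second time should do nothing, since missing_pkgs will be empty once everything is installed.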

We will use the Pima Indian Women’s diabetes dataset, which contains information on 768 Pima Indian women’s diabetes status, as well as many predictive features such as the number of pregnancies (pregnant), plasma glucose concentration (glucose), diastolic blood pressure (pressure), triceps skin fold thickness (triceps), 2-hour serum insulin (insulin), BMI (mass), diabetes pedigree function (pedigree), and their age (age). In case you were wondering, the Pima Indians are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The short name, “Pima,” is believed to have come from a phrase meaning “I don’t know,” which they used repeatedly in their initial meetings with Spanish colonists. Thanks Wikipedia!

# load the Pima Indians dataset from the mlbench package
library(mlbench)
data(PimaIndiansDiabetes)
# rename dataset to have shorter name because lazy
diabetes_orig <- PimaIndiansDiabetes
diabetes_orig
##     pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1          6     148       72      35       0 33.6    0.627  50      pos
## 2          1      85       66      29       0 26.6    0.351  31      neg
## 3          8     183       64       0       0 23.3    0.672  32      pos
## 4          1      89       66      23      94 28.1    0.167  21      neg
## 5          0     137       40      35     168 43.1    2.288  33      pos
## 6          5     116       74       0       0 25.6    0.201  30      neg
## 7          3      78       50      32      88 31.0    0.248  26      pos
## 8         10     115        0       0       0 35.3    0.134  29      neg
## 9          2     197       70      45     543 30.5    0.158  53      pos
## 10         8     125       96       0       0  0.0    0.232  54      pos
## 11         4     110       92       0       0 37.6    0.191  30      neg
## 12        10     168       74       0       0 38.0    0.537  34      pos
## 13        10     139       80       0       0 27.1    1.441  57      neg
## 14         1     189       60      23     846 30.1    0.398  59      pos
## 15         5     166       72      19     175 25.8    0.587  51      pos
## 16         7     100        0       0       0 30.0    0.484  32      pos
## 17         0     118       84      47     230 45.8    0.551  31      pos
## 18         7     107       74       0       0 29.6    0.254  31      pos
## 19         1     103       30      38      83 43.3    0.183  33      neg
## 20         1     115       70      30      96 34.6    0.529  32      pos
## 21         3     126       88      41     235 39.3    0.704  27      neg
## 22         8      99       84       0       0 35.4    0.388  50      neg
## 23         7     196       90       0       0 39.8    0.451  41      pos
## 24         9     119       80      35       0 29.0    0.263  29      pos
## 25        11     143       94      33     146 36.6    0.254  51      pos
## 26        10     125       70      26     115 31.1    0.205  41      pos
## 27         7     147       76       0       0 39.4    0.257  43      pos
## 28         1      97       66      15     140 23.2    0.487  22      neg
## 29        13     145       82      19     110 22.2    0.245  57      neg
## 30         5     117       92       0       0 34.1    0.337  38      neg
## 31         5     109       75      26       0 36.0    0.546  60      neg
## 32         3     158       76      36     245 31.6    0.851  28      pos
## 33         3      88       58      11      54 24.8    0.267  22      neg
## 34         6      92       92       0       0 19.9    0.188  28      neg
## 35        10     122       78      31       0 27.6    0.512  45      neg
## 36         4     103       60      33     192 24.0    0.966  33      neg
## 37        11     138       76       0       0 33.2    0.420  35      neg
## 38         9     102       76      37       0 32.9    0.665  46      pos
## 39         2      90       68      42       0 38.2    0.503  27      pos
## 40         4     111       72      47     207 37.1    1.390  56      pos
## 41         3     180       64      25      70 34.0    0.271  26      neg
## 42         7     133       84       0       0 40.2    0.696  37      neg
## 43         7     106       92      18       0 22.7    0.235  48      neg
## 44         9     171      110      24     240 45.4    0.721  54      pos
## 45         7     159       64       0       0 27.4    0.294  40      neg
## 46         0     180       66      39       0 42.0    1.893  25      pos
## 47         1     146       56       0       0 29.7    0.564  29      neg
## 48         2      71       70      27       0 28.0    0.586  22      neg
## 49         7     103       66      32       0 39.1    0.344  31      pos
## 50         7     105        0       0       0  0.0    0.305  24      neg
## 51         1     103       80      11      82 19.4    0.491  22      neg
## 52         1     101       50      15      36 24.2    0.526  26      neg
## 53         5      88       66      21      23 24.4    0.342  30      neg
## 54         8     176       90      34     300 33.7    0.467  58      pos
## 55         7     150       66      42     342 34.7    0.718  42      neg
## 56         1      73       50      10       0 23.0    0.248  21      neg
## 57         7     187       68      39     304 37.7    0.254  41      pos
## 58         0     100       88      60     110 46.8    0.962  31      neg
## 59         0     146       82       0       0 40.5    1.781  44      neg
## 60         0     105       64      41     142 41.5    0.173  22      neg
## 61         2      84        0       0       0  0.0    0.304  21      neg
## 62         8     133       72       0       0 32.9    0.270  39      pos
## 63         5      44       62       0       0 25.0    0.587  36      neg
## 64         2     141       58      34     128 25.4    0.699  24      neg
## 65         7     114       66       0       0 32.8    0.258  42      pos
## 66         5      99       74      27       0 29.0    0.203  32      neg
## 67         0     109       88      30       0 32.5    0.855  38      pos
## 68         2     109       92       0       0 42.7    0.845  54      neg
## 69         1      95       66      13      38 19.6    0.334  25      neg
## 70         4     146       85      27     100 28.9    0.189  27      neg
## 71         2     100       66      20      90 32.9    0.867  28      pos
## 72         5     139       64      35     140 28.6    0.411  26      neg
## 73        13     126       90       0       0 43.4    0.583  42      pos
## 74         4     129       86      20     270 35.1    0.231  23      neg
## 75         1      79       75      30       0 32.0    0.396  22      neg
## 76         1       0       48      20       0 24.7    0.140  22      neg
## 77         7      62       78       0       0 32.6    0.391  41      neg
## 78         5      95       72      33       0 37.7    0.370  27      neg
## 79         0     131        0       0       0 43.2    0.270  26      pos
## 80         2     112       66      22       0 25.0    0.307  24      neg
## 81         3     113       44      13       0 22.4    0.140  22      neg
## 82         2      74        0       0       0  0.0    0.102  22      neg
## 83         7      83       78      26      71 29.3    0.767  36      neg
## 84         0     101       65      28       0 24.6    0.237  22      neg
## 85         5     137      108       0       0 48.8    0.227  37      pos
## 86         2     110       74      29     125 32.4    0.698  27      neg
## 87        13     106       72      54       0 36.6    0.178  45      neg
## 88         2     100       68      25      71 38.5    0.324  26      neg
## 89        15     136       70      32     110 37.1    0.153  43      pos
## 90         1     107       68      19       0 26.5    0.165  24      neg
## 91         1      80       55       0       0 19.1    0.258  21      neg
## 92         4     123       80      15     176 32.0    0.443  34      neg
## 93         7      81       78      40      48 46.7    0.261  42      neg
## 94         4     134       72       0       0 23.8    0.277  60      pos
## 95         2     142       82      18      64 24.7    0.761  21      neg
## 96         6     144       72      27     228 33.9    0.255  40      neg
## 97         2      92       62      28       0 31.6    0.130  24      neg
## 98         1      71       48      18      76 20.4    0.323  22      neg
## 99         6      93       50      30      64 28.7    0.356  23      neg
## 100        1     122       90      51     220 49.7    0.325  31      pos
## 101        1     163       72       0       0 39.0    1.222  33      pos
## 102        1     151       60       0       0 26.1    0.179  22      neg
## 103        0     125       96       0       0 22.5    0.262  21      neg
## 104        1      81       72      18      40 26.6    0.283  24      neg
## 105        2      85       65       0       0 39.6    0.930  27      neg
## 106        1     126       56      29     152 28.7    0.801  21      neg
## 107        1      96      122       0       0 22.4    0.207  27      neg
## 108        4     144       58      28     140 29.5    0.287  37      neg
## 109        3      83       58      31      18 34.3    0.336  25      neg
## 110        0      95       85      25      36 37.4    0.247  24      pos
## 111        3     171       72      33     135 33.3    0.199  24      pos
## 112        8     155       62      26     495 34.0    0.543  46      pos
## 113        1      89       76      34      37 31.2    0.192  23      neg
## 114        4      76       62       0       0 34.0    0.391  25      neg
## 115        7     160       54      32     175 30.5    0.588  39      pos
## 116        4     146       92       0       0 31.2    0.539  61      pos
## 117        5     124       74       0       0 34.0    0.220  38      pos
## 118        5      78       48       0       0 33.7    0.654  25      neg
## 119        4      97       60      23       0 28.2    0.443  22      neg
## 120        4      99       76      15      51 23.2    0.223  21      neg
## 121        0     162       76      56     100 53.2    0.759  25      pos
## 122        6     111       64      39       0 34.2    0.260  24      neg
## 123        2     107       74      30     100 33.6    0.404  23      neg
## 124        5     132       80       0       0 26.8    0.186  69      neg
## 125        0     113       76       0       0 33.3    0.278  23      pos
## 126        1      88       30      42      99 55.0    0.496  26      pos
## 127        3     120       70      30     135 42.9    0.452  30      neg
## 128        1     118       58      36      94 33.3    0.261  23      neg
## 129        1     117       88      24     145 34.5    0.403  40      pos
## 130        0     105       84       0       0 27.9    0.741  62      pos
## 131        4     173       70      14     168 29.7    0.361  33      pos
## 132        9     122       56       0       0 33.3    1.114  33      pos
## 133        3     170       64      37     225 34.5    0.356  30      pos
## 134        8      84       74      31       0 38.3    0.457  39      neg
## 135        2      96       68      13      49 21.1    0.647  26      neg
## 136        2     125       60      20     140 33.8    0.088  31      neg
## 137        0     100       70      26      50 30.8    0.597  21      neg
## 138        0      93       60      25      92 28.7    0.532  22      neg
## 139        0     129       80       0       0 31.2    0.703  29      neg
## 140        5     105       72      29     325 36.9    0.159  28      neg
## 141        3     128       78       0       0 21.1    0.268  55      neg
## 142        5     106       82      30       0 39.5    0.286  38      neg
## 143        2     108       52      26      63 32.5    0.318  22      neg
## 144       10     108       66       0       0 32.4    0.272  42      pos
## 145        4     154       62      31     284 32.8    0.237  23      neg
## 146        0     102       75      23       0  0.0    0.572  21      neg
## 147        9      57       80      37       0 32.8    0.096  41      neg
## 148        2     106       64      35     119 30.5    1.400  34      neg
## 149        5     147       78       0       0 33.7    0.218  65      neg
## 150        2      90       70      17       0 27.3    0.085  22      neg
## 151        1     136       74      50     204 37.4    0.399  24      neg
## 152        4     114       65       0       0 21.9    0.432  37      neg
## 153        9     156       86      28     155 34.3    1.189  42      pos
## 154        1     153       82      42     485 40.6    0.687  23      neg
## 155        8     188       78       0       0 47.9    0.137  43      pos
## 156        7     152       88      44       0 50.0    0.337  36      pos
## 157        2      99       52      15      94 24.6    0.637  21      neg
## 158        1     109       56      21     135 25.2    0.833  23      neg
## 159        2      88       74      19      53 29.0    0.229  22      neg
## 160       17     163       72      41     114 40.9    0.817  47      pos
## 161        4     151       90      38       0 29.7    0.294  36      neg
## 162        7     102       74      40     105 37.2    0.204  45      neg
## 163        0     114       80      34     285 44.2    0.167  27      neg
## 164        2     100       64      23       0 29.7    0.368  21      neg
## 165        0     131       88       0       0 31.6    0.743  32      pos
## 166        6     104       74      18     156 29.9    0.722  41      pos
## 167        3     148       66      25       0 32.5    0.256  22      neg
## 168        4     120       68       0       0 29.6    0.709  34      neg
## 169        4     110       66       0       0 31.9    0.471  29      neg
## 170        3     111       90      12      78 28.4    0.495  29      neg
## 171        6     102       82       0       0 30.8    0.180  36      pos
## 172        6     134       70      23     130 35.4    0.542  29      pos
## 173        2      87        0      23       0 28.9    0.773  25      neg
## 174        1      79       60      42      48 43.5    0.678  23      neg
## 175        2      75       64      24      55 29.7    0.370  33      neg
## 176        8     179       72      42     130 32.7    0.719  36      pos
## 177        6      85       78       0       0 31.2    0.382  42      neg
## 178        0     129      110      46     130 67.1    0.319  26      pos
## 179        5     143       78       0       0 45.0    0.190  47      neg
## 180        5     130       82       0       0 39.1    0.956  37      pos
## 181        6      87       80       0       0 23.2    0.084  32      neg
## 182        0     119       64      18      92 34.9    0.725  23      neg
## 183        1       0       74      20      23 27.7    0.299  21      neg
## 184        5      73       60       0       0 26.8    0.268  27      neg
## 185        4     141       74       0       0 27.6    0.244  40      neg
## 186        7     194       68      28       0 35.9    0.745  41      pos
## 187        8     181       68      36     495 30.1    0.615  60      pos
## 188        1     128       98      41      58 32.0    1.321  33      pos
## 189        8     109       76      39     114 27.9    0.640  31      pos
## 190        5     139       80      35     160 31.6    0.361  25      pos
## 191        3     111       62       0       0 22.6    0.142  21      neg
## 192        9     123       70      44      94 33.1    0.374  40      neg
## 193        7     159       66       0       0 30.4    0.383  36      pos
## 194       11     135        0       0       0 52.3    0.578  40      pos
## 195        8      85       55      20       0 24.4    0.136  42      neg
## 196        5     158       84      41     210 39.4    0.395  29      pos
## 197        1     105       58       0       0 24.3    0.187  21      neg
## 198        3     107       62      13      48 22.9    0.678  23      pos
## 199        4     109       64      44      99 34.8    0.905  26      pos
## 200        4     148       60      27     318 30.9    0.150  29      pos
## 201        0     113       80      16       0 31.0    0.874  21      neg
## 202        1     138       82       0       0 40.1    0.236  28      neg
## 203        0     108       68      20       0 27.3    0.787  32      neg
## 204        2      99       70      16      44 20.4    0.235  27      neg
## 205        6     103       72      32     190 37.7    0.324  55      neg
## 206        5     111       72      28       0 23.9    0.407  27      neg
## 207        8     196       76      29     280 37.5    0.605  57      pos
## 208        5     162      104       0       0 37.7    0.151  52      pos
## 209        1      96       64      27      87 33.2    0.289  21      neg
## 210        7     184       84      33       0 35.5    0.355  41      pos
## 211        2      81       60      22       0 27.7    0.290  25      neg
## 212        0     147       85      54       0 42.8    0.375  24      neg
## 213        7     179       95      31       0 34.2    0.164  60      neg
## 214        0     140       65      26     130 42.6    0.431  24      pos
## 215        9     112       82      32     175 34.2    0.260  36      pos
## 216       12     151       70      40     271 41.8    0.742  38      pos
## 217        5     109       62      41     129 35.8    0.514  25      pos
## 218        6     125       68      30     120 30.0    0.464  32      neg
## 219        5      85       74      22       0 29.0    1.224  32      pos
## 220        5     112       66       0       0 37.8    0.261  41      pos
## 221        0     177       60      29     478 34.6    1.072  21      pos
## 222        2     158       90       0       0 31.6    0.805  66      pos
## 223        7     119        0       0       0 25.2    0.209  37      neg
## 224        7     142       60      33     190 28.8    0.687  61      neg
## 225        1     100       66      15      56 23.6    0.666  26      neg
## 226        1      87       78      27      32 34.6    0.101  22      neg
## 227        0     101       76       0       0 35.7    0.198  26      neg
## 228        3     162       52      38       0 37.2    0.652  24      pos
## 229        4     197       70      39     744 36.7    2.329  31      neg
## 230        0     117       80      31      53 45.2    0.089  24      neg
## 231        4     142       86       0       0 44.0    0.645  22      pos
## 232        6     134       80      37     370 46.2    0.238  46      pos
## 233        1      79       80      25      37 25.4    0.583  22      neg
## 234        4     122       68       0       0 35.0    0.394  29      neg
## 235        3      74       68      28      45 29.7    0.293  23      neg
## 236        4     171       72       0       0 43.6    0.479  26      pos
## 237        7     181       84      21     192 35.9    0.586  51      pos
## 238        0     179       90      27       0 44.1    0.686  23      pos
## 239        9     164       84      21       0 30.8    0.831  32      pos
## 240        0     104       76       0       0 18.4    0.582  27      neg
## 241        1      91       64      24       0 29.2    0.192  21      neg
## 242        4      91       70      32      88 33.1    0.446  22      neg
## 243        3     139       54       0       0 25.6    0.402  22      pos
## 244        6     119       50      22     176 27.1    1.318  33      pos
## 245        2     146       76      35     194 38.2    0.329  29      neg
## 246        9     184       85      15       0 30.0    1.213  49      pos
## 247       10     122       68       0       0 31.2    0.258  41      neg
## 248        0     165       90      33     680 52.3    0.427  23      neg
## 249        9     124       70      33     402 35.4    0.282  34      neg
## 250        1     111       86      19       0 30.1    0.143  23      neg
## 251        9     106       52       0       0 31.2    0.380  42      neg
## 252        2     129       84       0       0 28.0    0.284  27      neg
## 253        2      90       80      14      55 24.4    0.249  24      neg
## 254        0      86       68      32       0 35.8    0.238  25      neg
## 255       12      92       62       7     258 27.6    0.926  44      pos
## 256        1     113       64      35       0 33.6    0.543  21      pos
## 257        3     111       56      39       0 30.1    0.557  30      neg
## 258        2     114       68      22       0 28.7    0.092  25      neg
## 259        1     193       50      16     375 25.9    0.655  24      neg
## 260       11     155       76      28     150 33.3    1.353  51      pos
## 261        3     191       68      15     130 30.9    0.299  34      neg
## 262        3     141        0       0       0 30.0    0.761  27      pos
## 263        4      95       70      32       0 32.1    0.612  24      neg
## 264        3     142       80      15       0 32.4    0.200  63      neg
## 265        4     123       62       0       0 32.0    0.226  35      pos
## 266        5      96       74      18      67 33.6    0.997  43      neg
## 267        0     138        0       0       0 36.3    0.933  25      pos
## 268        2     128       64      42       0 40.0    1.101  24      neg
## 269        0     102       52       0       0 25.1    0.078  21      neg
## 270        2     146        0       0       0 27.5    0.240  28      pos
## 271       10     101       86      37       0 45.6    1.136  38      pos
## 272        2     108       62      32      56 25.2    0.128  21      neg
## 273        3     122       78       0       0 23.0    0.254  40      neg
## 274        1      71       78      50      45 33.2    0.422  21      neg
## 275       13     106       70       0       0 34.2    0.251  52      neg
## 276        2     100       70      52      57 40.5    0.677  25      neg
## 277        7     106       60      24       0 26.5    0.296  29      pos
## 278        0     104       64      23     116 27.8    0.454  23      neg
## 279        5     114       74       0       0 24.9    0.744  57      neg
## 280        2     108       62      10     278 25.3    0.881  22      neg
## 281        0     146       70       0       0 37.9    0.334  28      pos
## 282       10     129       76      28     122 35.9    0.280  39      neg
## 283        7     133       88      15     155 32.4    0.262  37      neg
## 284        7     161       86       0       0 30.4    0.165  47      pos
## 285        2     108       80       0       0 27.0    0.259  52      pos
## 286        7     136       74      26     135 26.0    0.647  51      neg
## 287        5     155       84      44     545 38.7    0.619  34      neg
## 288        1     119       86      39     220 45.6    0.808  29      pos
## 289        4      96       56      17      49 20.8    0.340  26      neg
## 290        5     108       72      43      75 36.1    0.263  33      neg
## 291        0      78       88      29      40 36.9    0.434  21      neg
## 292        0     107       62      30      74 36.6    0.757  25      pos
## 293        2     128       78      37     182 43.3    1.224  31      pos
## 294        1     128       48      45     194 40.5    0.613  24      pos
## 295        0     161       50       0       0 21.9    0.254  65      neg
## 296        6     151       62      31     120 35.5    0.692  28      neg
## 297        2     146       70      38     360 28.0    0.337  29      pos
## 298        0     126       84      29     215 30.7    0.520  24      neg
## 299       14     100       78      25     184 36.6    0.412  46      pos
## 300        8     112       72       0       0 23.6    0.840  58      neg
## 301        0     167        0       0       0 32.3    0.839  30      pos
## 302        2     144       58      33     135 31.6    0.422  25      pos
## 303        5      77       82      41      42 35.8    0.156  35      neg
## 304        5     115       98       0       0 52.9    0.209  28      pos
## 305        3     150       76       0       0 21.0    0.207  37      neg
## 306        2     120       76      37     105 39.7    0.215  29      neg
## 307       10     161       68      23     132 25.5    0.326  47      pos
## 308        0     137       68      14     148 24.8    0.143  21      neg
## 309        0     128       68      19     180 30.5    1.391  25      pos
## 310        2     124       68      28     205 32.9    0.875  30      pos
## 311        6      80       66      30       0 26.2    0.313  41      neg
## 312        0     106       70      37     148 39.4    0.605  22      neg
## 313        2     155       74      17      96 26.6    0.433  27      pos
## 314        3     113       50      10      85 29.5    0.626  25      neg
## 315        7     109       80      31       0 35.9    1.127  43      pos
## 316        2     112       68      22      94 34.1    0.315  26      neg
## 317        3      99       80      11      64 19.3    0.284  30      neg
## 318        3     182       74       0       0 30.5    0.345  29      pos
## 319        3     115       66      39     140 38.1    0.150  28      neg
## 320        6     194       78       0       0 23.5    0.129  59      pos
## 321        4     129       60      12     231 27.5    0.527  31      neg
## 322        3     112       74      30       0 31.6    0.197  25      pos
## 323        0     124       70      20       0 27.4    0.254  36      pos
## 324       13     152       90      33      29 26.8    0.731  43      pos
## 325        2     112       75      32       0 35.7    0.148  21      neg
## 326        1     157       72      21     168 25.6    0.123  24      neg
## 327        1     122       64      32     156 35.1    0.692  30      pos
## 328       10     179       70       0       0 35.1    0.200  37      neg
## 329        2     102       86      36     120 45.5    0.127  23      pos
## 330        6     105       70      32      68 30.8    0.122  37      neg
## 331        8     118       72      19       0 23.1    1.476  46      neg
## 332        2      87       58      16      52 32.7    0.166  25      neg
## 333        1     180        0       0       0 43.3    0.282  41      pos
## 334       12     106       80       0       0 23.6    0.137  44      neg
## 335        1      95       60      18      58 23.9    0.260  22      neg
## 336        0     165       76      43     255 47.9    0.259  26      neg
## 337        0     117        0       0       0 33.8    0.932  44      neg
## 338        5     115       76       0       0 31.2    0.343  44      pos
## 339        9     152       78      34     171 34.2    0.893  33      pos
## 340        7     178       84       0       0 39.9    0.331  41      pos
## 341        1     130       70      13     105 25.9    0.472  22      neg
## 342        1      95       74      21      73 25.9    0.673  36      neg
## 343        1       0       68      35       0 32.0    0.389  22      neg
## 344        5     122       86       0       0 34.7    0.290  33      neg
## 345        8      95       72       0       0 36.8    0.485  57      neg
## 346        8     126       88      36     108 38.5    0.349  49      neg
## 347        1     139       46      19      83 28.7    0.654  22      neg
## 348        3     116        0       0       0 23.5    0.187  23      neg
## 349        3      99       62      19      74 21.8    0.279  26      neg
## 350        5       0       80      32       0 41.0    0.346  37      pos
## 351        4      92       80       0       0 42.2    0.237  29      neg
## 352        4     137       84       0       0 31.2    0.252  30      neg
## 353        3      61       82      28       0 34.4    0.243  46      neg
## 354        1      90       62      12      43 27.2    0.580  24      neg
## 355        3      90       78       0       0 42.7    0.559  21      neg
## 356        9     165       88       0       0 30.4    0.302  49      pos
## 357        1     125       50      40     167 33.3    0.962  28      pos
## 358       13     129        0      30       0 39.9    0.569  44      pos
## 359       12      88       74      40      54 35.3    0.378  48      neg
## 360        1     196       76      36     249 36.5    0.875  29      pos
## 361        5     189       64      33     325 31.2    0.583  29      pos
## 362        5     158       70       0       0 29.8    0.207  63      neg
## 363        5     103      108      37       0 39.2    0.305  65      neg
## 364        4     146       78       0       0 38.5    0.520  67      pos
## 365        4     147       74      25     293 34.9    0.385  30      neg
## 366        5      99       54      28      83 34.0    0.499  30      neg
## 367        6     124       72       0       0 27.6    0.368  29      pos
## 368        0     101       64      17       0 21.0    0.252  21      neg
## 369        3      81       86      16      66 27.5    0.306  22      neg
## 370        1     133      102      28     140 32.8    0.234  45      pos
## 371        3     173       82      48     465 38.4    2.137  25      pos
## 372        0     118       64      23      89  0.0    1.731  21      neg
## 373        0      84       64      22      66 35.8    0.545  21      neg
## 374        2     105       58      40      94 34.9    0.225  25      neg
## 375        2     122       52      43     158 36.2    0.816  28      neg
## 376       12     140       82      43     325 39.2    0.528  58      pos
## 377        0      98       82      15      84 25.2    0.299  22      neg
## 378        1      87       60      37      75 37.2    0.509  22      neg
## 379        4     156       75       0       0 48.3    0.238  32      pos
## 380        0      93      100      39      72 43.4    1.021  35      neg
## 381        1     107       72      30      82 30.8    0.821  24      neg
## 382        0     105       68      22       0 20.0    0.236  22      neg
## 383        1     109       60       8     182 25.4    0.947  21      neg
## 384        1      90       62      18      59 25.1    1.268  25      neg
## 385        1     125       70      24     110 24.3    0.221  25      neg
## 386        1     119       54      13      50 22.3    0.205  24      neg
## 387        5     116       74      29       0 32.3    0.660  35      pos
## 388        8     105      100      36       0 43.3    0.239  45      pos
## 389        5     144       82      26     285 32.0    0.452  58      pos
## 390        3     100       68      23      81 31.6    0.949  28      neg
## 391        1     100       66      29     196 32.0    0.444  42      neg
## 392        5     166       76       0       0 45.7    0.340  27      pos
## 393        1     131       64      14     415 23.7    0.389  21      neg
## 394        4     116       72      12      87 22.1    0.463  37      neg
## 395        4     158       78       0       0 32.9    0.803  31      pos
## 396        2     127       58      24     275 27.7    1.600  25      neg
## 397        3      96       56      34     115 24.7    0.944  39      neg
## 398        0     131       66      40       0 34.3    0.196  22      pos
## 399        3      82       70       0       0 21.1    0.389  25      neg
## 400        3     193       70      31       0 34.9    0.241  25      pos
## 401        4      95       64       0       0 32.0    0.161  31      pos
## 402        6     137       61       0       0 24.2    0.151  55      neg
## 403        5     136       84      41      88 35.0    0.286  35      pos
## 404        9      72       78      25       0 31.6    0.280  38      neg
## 405        5     168       64       0       0 32.9    0.135  41      pos
## 406        2     123       48      32     165 42.1    0.520  26      neg
## 407        4     115       72       0       0 28.9    0.376  46      pos
## 408        0     101       62       0       0 21.9    0.336  25      neg
## 409        8     197       74       0       0 25.9    1.191  39      pos
## 410        1     172       68      49     579 42.4    0.702  28      pos
## 411        6     102       90      39       0 35.7    0.674  28      neg
## 412        1     112       72      30     176 34.4    0.528  25      neg
## 413        1     143       84      23     310 42.4    1.076  22      neg
## 414        1     143       74      22      61 26.2    0.256  21      neg
## 415        0     138       60      35     167 34.6    0.534  21      pos
## 416        3     173       84      33     474 35.7    0.258  22      pos
## 417        1      97       68      21       0 27.2    1.095  22      neg
## 418        4     144       82      32       0 38.5    0.554  37      pos
## 419        1      83       68       0       0 18.2    0.624  27      neg
## 420        3     129       64      29     115 26.4    0.219  28      pos
## 421        1     119       88      41     170 45.3    0.507  26      neg
## 422        2      94       68      18      76 26.0    0.561  21      neg
## 423        0     102       64      46      78 40.6    0.496  21      neg
## 424        2     115       64      22       0 30.8    0.421  21      neg
## 425        8     151       78      32     210 42.9    0.516  36      pos
## 426        4     184       78      39     277 37.0    0.264  31      pos
## 427        0      94        0       0       0  0.0    0.256  25      neg
## 428        1     181       64      30     180 34.1    0.328  38      pos
## 429        0     135       94      46     145 40.6    0.284  26      neg
## 430        1      95       82      25     180 35.0    0.233  43      pos
## 431        2      99        0       0       0 22.2    0.108  23      neg
## 432        3      89       74      16      85 30.4    0.551  38      neg
## 433        1      80       74      11      60 30.0    0.527  22      neg
## 434        2     139       75       0       0 25.6    0.167  29      neg
## 435        1      90       68       8       0 24.5    1.138  36      neg
## 436        0     141        0       0       0 42.4    0.205  29      pos
## 437       12     140       85      33       0 37.4    0.244  41      neg
## 438        5     147       75       0       0 29.9    0.434  28      neg
## 439        1      97       70      15       0 18.2    0.147  21      neg
## 440        6     107       88       0       0 36.8    0.727  31      neg
## 441        0     189      104      25       0 34.3    0.435  41      pos
## 442        2      83       66      23      50 32.2    0.497  22      neg
## 443        4     117       64      27     120 33.2    0.230  24      neg
## 444        8     108       70       0       0 30.5    0.955  33      pos
## 445        4     117       62      12       0 29.7    0.380  30      pos
## 446        0     180       78      63      14 59.4    2.420  25      pos
## 447        1     100       72      12      70 25.3    0.658  28      neg
## 448        0      95       80      45      92 36.5    0.330  26      neg
## 449        0     104       64      37      64 33.6    0.510  22      pos
## 450        0     120       74      18      63 30.5    0.285  26      neg
## 451        1      82       64      13      95 21.2    0.415  23      neg
## 452        2     134       70       0       0 28.9    0.542  23      pos
## 453        0      91       68      32     210 39.9    0.381  25      neg
## 454        2     119        0       0       0 19.6    0.832  72      neg
## 455        2     100       54      28     105 37.8    0.498  24      neg
## 456       14     175       62      30       0 33.6    0.212  38      pos
## 457        1     135       54       0       0 26.7    0.687  62      neg
## 458        5      86       68      28      71 30.2    0.364  24      neg
## 459       10     148       84      48     237 37.6    1.001  51      pos
## 460        9     134       74      33      60 25.9    0.460  81      neg
## 461        9     120       72      22      56 20.8    0.733  48      neg
## 462        1      71       62       0       0 21.8    0.416  26      neg
## 463        8      74       70      40      49 35.3    0.705  39      neg
## 464        5      88       78      30       0 27.6    0.258  37      neg
## 465       10     115       98       0       0 24.0    1.022  34      neg
## 466        0     124       56      13     105 21.8    0.452  21      neg
## 467        0      74       52      10      36 27.8    0.269  22      neg
## 468        0      97       64      36     100 36.8    0.600  25      neg
## 469        8     120        0       0       0 30.0    0.183  38      pos
## 470        6     154       78      41     140 46.1    0.571  27      neg
## 471        1     144       82      40       0 41.3    0.607  28      neg
## 472        0     137       70      38       0 33.2    0.170  22      neg
## 473        0     119       66      27       0 38.8    0.259  22      neg
## 474        7     136       90       0       0 29.9    0.210  50      neg
## 475        4     114       64       0       0 28.9    0.126  24      neg
## 476        0     137       84      27       0 27.3    0.231  59      neg
## 477        2     105       80      45     191 33.7    0.711  29      pos
## 478        7     114       76      17     110 23.8    0.466  31      neg
## 479        8     126       74      38      75 25.9    0.162  39      neg
## 480        4     132       86      31       0 28.0    0.419  63      neg
## 481        3     158       70      30     328 35.5    0.344  35      pos
## 482        0     123       88      37       0 35.2    0.197  29      neg
## 483        4      85       58      22      49 27.8    0.306  28      neg
## 484        0      84       82      31     125 38.2    0.233  23      neg
## 485        0     145        0       0       0 44.2    0.630  31      pos
## 486        0     135       68      42     250 42.3    0.365  24      pos
## 487        1     139       62      41     480 40.7    0.536  21      neg
## 488        0     173       78      32     265 46.5    1.159  58      neg
## 489        4      99       72      17       0 25.6    0.294  28      neg
## 490        8     194       80       0       0 26.1    0.551  67      neg
## 491        2      83       65      28      66 36.8    0.629  24      neg
## 492        2      89       90      30       0 33.5    0.292  42      neg
## 493        4      99       68      38       0 32.8    0.145  33      neg
## 494        4     125       70      18     122 28.9    1.144  45      pos
## 495        3      80        0       0       0  0.0    0.174  22      neg
## 496        6     166       74       0       0 26.6    0.304  66      neg
## 497        5     110       68       0       0 26.0    0.292  30      neg
## 498        2      81       72      15      76 30.1    0.547  25      neg
## 499        7     195       70      33     145 25.1    0.163  55      pos
## 500        6     154       74      32     193 29.3    0.839  39      neg
## 501        2     117       90      19      71 25.2    0.313  21      neg
## 502        3      84       72      32       0 37.2    0.267  28      neg
## 503        6       0       68      41       0 39.0    0.727  41      pos
## 504        7      94       64      25      79 33.3    0.738  41      neg
## 505        3      96       78      39       0 37.3    0.238  40      neg
## 506       10      75       82       0       0 33.3    0.263  38      neg
## 507        0     180       90      26      90 36.5    0.314  35      pos
## 508        1     130       60      23     170 28.6    0.692  21      neg
## 509        2      84       50      23      76 30.4    0.968  21      neg
## 510        8     120       78       0       0 25.0    0.409  64      neg
## 511       12      84       72      31       0 29.7    0.297  46      pos
## 512        0     139       62      17     210 22.1    0.207  21      neg
## 513        9      91       68       0       0 24.2    0.200  58      neg
## 514        2      91       62       0       0 27.3    0.525  22      neg
## 515        3      99       54      19      86 25.6    0.154  24      neg
## 516        3     163       70      18     105 31.6    0.268  28      pos
## 517        9     145       88      34     165 30.3    0.771  53      pos
## 518        7     125       86       0       0 37.6    0.304  51      neg
## 519       13      76       60       0       0 32.8    0.180  41      neg
## 520        6     129       90       7     326 19.6    0.582  60      neg
## 521        2      68       70      32      66 25.0    0.187  25      neg
## 522        3     124       80      33     130 33.2    0.305  26      neg
## 523        6     114        0       0       0  0.0    0.189  26      neg
## 524        9     130       70       0       0 34.2    0.652  45      pos
## 525        3     125       58       0       0 31.6    0.151  24      neg
## 526        3      87       60      18       0 21.8    0.444  21      neg
## 527        1      97       64      19      82 18.2    0.299  21      neg
## 528        3     116       74      15     105 26.3    0.107  24      neg
## 529        0     117       66      31     188 30.8    0.493  22      neg
## 530        0     111       65       0       0 24.6    0.660  31      neg
## 531        2     122       60      18     106 29.8    0.717  22      neg
## 532        0     107       76       0       0 45.3    0.686  24      neg
## 533        1      86       66      52      65 41.3    0.917  29      neg
## 534        6      91        0       0       0 29.8    0.501  31      neg
## 535        1      77       56      30      56 33.3    1.251  24      neg
## 536        4     132        0       0       0 32.9    0.302  23      pos
## 537        0     105       90       0       0 29.6    0.197  46      neg
## 538        0      57       60       0       0 21.7    0.735  67      neg
## 539        0     127       80      37     210 36.3    0.804  23      neg
## 540        3     129       92      49     155 36.4    0.968  32      pos
## 541        8     100       74      40     215 39.4    0.661  43      pos
## 542        3     128       72      25     190 32.4    0.549  27      pos
## 543       10      90       85      32       0 34.9    0.825  56      pos
## 544        4      84       90      23      56 39.5    0.159  25      neg
## 545        1      88       78      29      76 32.0    0.365  29      neg
## 546        8     186       90      35     225 34.5    0.423  37      pos
## 547        5     187       76      27     207 43.6    1.034  53      pos
## 548        4     131       68      21     166 33.1    0.160  28      neg
## 549        1     164       82      43      67 32.8    0.341  50      neg
## 550        4     189      110      31       0 28.5    0.680  37      neg
## 551        1     116       70      28       0 27.4    0.204  21      neg
## 552        3      84       68      30     106 31.9    0.591  25      neg
## 553        6     114       88       0       0 27.8    0.247  66      neg
## 554        1      88       62      24      44 29.9    0.422  23      neg
## 555        1      84       64      23     115 36.9    0.471  28      neg
## 556        7     124       70      33     215 25.5    0.161  37      neg
## 557        1      97       70      40       0 38.1    0.218  30      neg
## 558        8     110       76       0       0 27.8    0.237  58      neg
## 559       11     103       68      40       0 46.2    0.126  42      neg
## 560       11      85       74       0       0 30.1    0.300  35      neg
## 561        6     125       76       0       0 33.8    0.121  54      pos
## 562        0     198       66      32     274 41.3    0.502  28      pos
## 563        1      87       68      34      77 37.6    0.401  24      neg
## 564        6      99       60      19      54 26.9    0.497  32      neg
## 565        0      91       80       0       0 32.4    0.601  27      neg
## 566        2      95       54      14      88 26.1    0.748  22      neg
## 567        1      99       72      30      18 38.6    0.412  21      neg
## 568        6      92       62      32     126 32.0    0.085  46      neg
## 569        4     154       72      29     126 31.3    0.338  37      neg
## 570        0     121       66      30     165 34.3    0.203  33      pos
## 571        3      78       70       0       0 32.5    0.270  39      neg
## 572        2     130       96       0       0 22.6    0.268  21      neg
## 573        3     111       58      31      44 29.5    0.430  22      neg
## 574        2      98       60      17     120 34.7    0.198  22      neg
## 575        1     143       86      30     330 30.1    0.892  23      neg
## 576        1     119       44      47      63 35.5    0.280  25      neg
## 577        6     108       44      20     130 24.0    0.813  35      neg
## 578        2     118       80       0       0 42.9    0.693  21      pos
## 579       10     133       68       0       0 27.0    0.245  36      neg
## 580        2     197       70      99       0 34.7    0.575  62      pos
## 581        0     151       90      46       0 42.1    0.371  21      pos
## 582        6     109       60      27       0 25.0    0.206  27      neg
## 583       12     121       78      17       0 26.5    0.259  62      neg
## 584        8     100       76       0       0 38.7    0.190  42      neg
## 585        8     124       76      24     600 28.7    0.687  52      pos
## 586        1      93       56      11       0 22.5    0.417  22      neg
## 587        8     143       66       0       0 34.9    0.129  41      pos
## 588        6     103       66       0       0 24.3    0.249  29      neg
## 589        3     176       86      27     156 33.3    1.154  52      pos
## 590        0      73        0       0       0 21.1    0.342  25      neg
## 591       11     111       84      40       0 46.8    0.925  45      pos
## 592        2     112       78      50     140 39.4    0.175  24      neg
## 593        3     132       80       0       0 34.4    0.402  44      pos
## 594        2      82       52      22     115 28.5    1.699  25      neg
## 595        6     123       72      45     230 33.6    0.733  34      neg
## 596        0     188       82      14     185 32.0    0.682  22      pos
## 597        0      67       76       0       0 45.3    0.194  46      neg
## 598        1      89       24      19      25 27.8    0.559  21      neg
## 599        1     173       74       0       0 36.8    0.088  38      pos
## 600        1     109       38      18     120 23.1    0.407  26      neg
## 601        1     108       88      19       0 27.1    0.400  24      neg
## 602        6      96        0       0       0 23.7    0.190  28      neg
## 603        1     124       74      36       0 27.8    0.100  30      neg
## 604        7     150       78      29     126 35.2    0.692  54      pos
## 605        4     183        0       0       0 28.4    0.212  36      pos
## 606        1     124       60      32       0 35.8    0.514  21      neg
## 607        1     181       78      42     293 40.0    1.258  22      pos
## 608        1      92       62      25      41 19.5    0.482  25      neg
## 609        0     152       82      39     272 41.5    0.270  27      neg
## 610        1     111       62      13     182 24.0    0.138  23      neg
## 611        3     106       54      21     158 30.9    0.292  24      neg
## 612        3     174       58      22     194 32.9    0.593  36      pos
## 613        7     168       88      42     321 38.2    0.787  40      pos
## 614        6     105       80      28       0 32.5    0.878  26      neg
## 615       11     138       74      26     144 36.1    0.557  50      pos
## 616        3     106       72       0       0 25.8    0.207  27      neg
## 617        6     117       96       0       0 28.7    0.157  30      neg
## 618        2      68       62      13      15 20.1    0.257  23      neg
## 619        9     112       82      24       0 28.2    1.282  50      pos
## 620        0     119        0       0       0 32.4    0.141  24      pos
## 621        2     112       86      42     160 38.4    0.246  28      neg
## 622        2      92       76      20       0 24.2    1.698  28      neg
## 623        6     183       94       0       0 40.8    1.461  45      neg
## 624        0      94       70      27     115 43.5    0.347  21      neg
## 625        2     108       64       0       0 30.8    0.158  21      neg
## 626        4      90       88      47      54 37.7    0.362  29      neg
## 627        0     125       68       0       0 24.7    0.206  21      neg
## 628        0     132       78       0       0 32.4    0.393  21      neg
## 629        5     128       80       0       0 34.6    0.144  45      neg
## 630        4      94       65      22       0 24.7    0.148  21      neg
## 631        7     114       64       0       0 27.4    0.732  34      pos
## 632        0     102       78      40      90 34.5    0.238  24      neg
## 633        2     111       60       0       0 26.2    0.343  23      neg
## 634        1     128       82      17     183 27.5    0.115  22      neg
## 635       10      92       62       0       0 25.9    0.167  31      neg
## 636       13     104       72       0       0 31.2    0.465  38      pos
## 637        5     104       74       0       0 28.8    0.153  48      neg
## 638        2      94       76      18      66 31.6    0.649  23      neg
## 639        7      97       76      32      91 40.9    0.871  32      pos
## 640        1     100       74      12      46 19.5    0.149  28      neg
## 641        0     102       86      17     105 29.3    0.695  27      neg
## 642        4     128       70       0       0 34.3    0.303  24      neg
## 643        6     147       80       0       0 29.5    0.178  50      pos
## 644        4      90        0       0       0 28.0    0.610  31      neg
## 645        3     103       72      30     152 27.6    0.730  27      neg
## 646        2     157       74      35     440 39.4    0.134  30      neg
## 647        1     167       74      17     144 23.4    0.447  33      pos
## 648        0     179       50      36     159 37.8    0.455  22      pos
## 649       11     136       84      35     130 28.3    0.260  42      pos
## 650        0     107       60      25       0 26.4    0.133  23      neg
## 651        1      91       54      25     100 25.2    0.234  23      neg
## 652        1     117       60      23     106 33.8    0.466  27      neg
## 653        5     123       74      40      77 34.1    0.269  28      neg
## 654        2     120       54       0       0 26.8    0.455  27      neg
## 655        1     106       70      28     135 34.2    0.142  22      neg
## 656        2     155       52      27     540 38.7    0.240  25      pos
## 657        2     101       58      35      90 21.8    0.155  22      neg
## 658        1     120       80      48     200 38.9    1.162  41      neg
## 659       11     127      106       0       0 39.0    0.190  51      neg
## 660        3      80       82      31      70 34.2    1.292  27      pos
## 661       10     162       84       0       0 27.7    0.182  54      neg
## 662        1     199       76      43       0 42.9    1.394  22      pos
## 663        8     167      106      46     231 37.6    0.165  43      pos
## 664        9     145       80      46     130 37.9    0.637  40      pos
## 665        6     115       60      39       0 33.7    0.245  40      pos
## 666        1     112       80      45     132 34.8    0.217  24      neg
## 667        4     145       82      18       0 32.5    0.235  70      pos
## 668       10     111       70      27       0 27.5    0.141  40      pos
## 669        6      98       58      33     190 34.0    0.430  43      neg
## 670        9     154       78      30     100 30.9    0.164  45      neg
## 671        6     165       68      26     168 33.6    0.631  49      neg
## 672        1      99       58      10       0 25.4    0.551  21      neg
## 673       10      68      106      23      49 35.5    0.285  47      neg
## 674        3     123      100      35     240 57.3    0.880  22      neg
## 675        8      91       82       0       0 35.6    0.587  68      neg
## 676        6     195       70       0       0 30.9    0.328  31      pos
## 677        9     156       86       0       0 24.8    0.230  53      pos
## 678        0      93       60       0       0 35.3    0.263  25      neg
## 679        3     121       52       0       0 36.0    0.127  25      pos
## 680        2     101       58      17     265 24.2    0.614  23      neg
## 681        2      56       56      28      45 24.2    0.332  22      neg
## 682        0     162       76      36       0 49.6    0.364  26      pos
## 683        0      95       64      39     105 44.6    0.366  22      neg
## 684        4     125       80       0       0 32.3    0.536  27      pos
## 685        5     136       82       0       0  0.0    0.640  69      neg
## 686        2     129       74      26     205 33.2    0.591  25      neg
## 687        3     130       64       0       0 23.1    0.314  22      neg
## 688        1     107       50      19       0 28.3    0.181  29      neg
## 689        1     140       74      26     180 24.1    0.828  23      neg
## 690        1     144       82      46     180 46.1    0.335  46      pos
## 691        8     107       80       0       0 24.6    0.856  34      neg
## 692       13     158      114       0       0 42.3    0.257  44      pos
## 693        2     121       70      32      95 39.1    0.886  23      neg
## 694        7     129       68      49     125 38.5    0.439  43      pos
## 695        2      90       60       0       0 23.5    0.191  25      neg
## 696        7     142       90      24     480 30.4    0.128  43      pos
## 697        3     169       74      19     125 29.9    0.268  31      pos
## 698        0      99        0       0       0 25.0    0.253  22      neg
## 699        4     127       88      11     155 34.5    0.598  28      neg
## 700        4     118       70       0       0 44.5    0.904  26      neg
## 701        2     122       76      27     200 35.9    0.483  26      neg
## 702        6     125       78      31       0 27.6    0.565  49      pos
## 703        1     168       88      29       0 35.0    0.905  52      pos
## 704        2     129        0       0       0 38.5    0.304  41      neg
## 705        4     110       76      20     100 28.4    0.118  27      neg
## 706        6      80       80      36       0 39.8    0.177  28      neg
## 707       10     115        0       0       0  0.0    0.261  30      pos
## 708        2     127       46      21     335 34.4    0.176  22      neg
## 709        9     164       78       0       0 32.8    0.148  45      pos
## 710        2      93       64      32     160 38.0    0.674  23      pos
## 711        3     158       64      13     387 31.2    0.295  24      neg
## 712        5     126       78      27      22 29.6    0.439  40      neg
## 713       10     129       62      36       0 41.2    0.441  38      pos
## 714        0     134       58      20     291 26.4    0.352  21      neg
## 715        3     102       74       0       0 29.5    0.121  32      neg
## 716        7     187       50      33     392 33.9    0.826  34      pos
## 717        3     173       78      39     185 33.8    0.970  31      pos
## 718       10      94       72      18       0 23.1    0.595  56      neg
## 719        1     108       60      46     178 35.5    0.415  24      neg
## 720        5      97       76      27       0 35.6    0.378  52      pos
## 721        4      83       86      19       0 29.3    0.317  34      neg
## 722        1     114       66      36     200 38.1    0.289  21      neg
## 723        1     149       68      29     127 29.3    0.349  42      pos
## 724        5     117       86      30     105 39.1    0.251  42      neg
## 725        1     111       94       0       0 32.8    0.265  45      neg
## 726        4     112       78      40       0 39.4    0.236  38      neg
## 727        1     116       78      29     180 36.1    0.496  25      neg
## 728        0     141       84      26       0 32.4    0.433  22      neg
## 729        2     175       88       0       0 22.9    0.326  22      neg
## 730        2      92       52       0       0 30.1    0.141  22      neg
## 731        3     130       78      23      79 28.4    0.323  34      pos
## 732        8     120       86       0       0 28.4    0.259  22      pos
## 733        2     174       88      37     120 44.5    0.646  24      pos
## 734        2     106       56      27     165 29.0    0.426  22      neg
## 735        2     105       75       0       0 23.3    0.560  53      neg
## 736        4      95       60      32       0 35.4    0.284  28      neg
## 737        0     126       86      27     120 27.4    0.515  21      neg
## 738        8      65       72      23       0 32.0    0.600  42      neg
## 739        2      99       60      17     160 36.6    0.453  21      neg
## 740        1     102       74       0       0 39.5    0.293  42      pos
## 741       11     120       80      37     150 42.3    0.785  48      pos
## 742        3     102       44      20      94 30.8    0.400  26      neg
## 743        1     109       58      18     116 28.5    0.219  22      neg
## 744        9     140       94       0       0 32.7    0.734  45      pos
## 745       13     153       88      37     140 40.6    1.174  39      neg
## 746       12     100       84      33     105 30.0    0.488  46      neg
## 747        1     147       94      41       0 49.3    0.358  27      pos
## 748        1      81       74      41      57 46.3    1.096  32      neg
## 749        3     187       70      22     200 36.4    0.408  36      pos
## 750        6     162       62       0       0 24.3    0.178  50      pos
## 751        4     136       70       0       0 31.2    1.182  22      pos
## 752        1     121       78      39      74 39.0    0.261  28      neg
## 753        3     108       62      24       0 26.0    0.223  25      neg
## 754        0     181       88      44     510 43.3    0.222  26      pos
## 755        8     154       78      32       0 32.4    0.443  45      pos
## 756        1     128       88      39     110 36.5    1.057  37      pos
## 757        7     137       90      41       0 32.0    0.391  39      neg
## 758        0     123       72       0       0 36.3    0.258  52      pos
## 759        1     106       76       0       0 37.5    0.197  26      neg
## 760        6     190       92       0       0 35.5    0.278  66      pos
## 761        2      88       58      26      16 28.4    0.766  22      neg
## 762        9     170       74      31       0 44.0    0.403  43      pos
## 763        9      89       62       0       0 22.5    0.142  33      neg
## 764       10     101       76      48     180 32.9    0.171  63      neg
## 765        2     122       70      27       0 36.8    0.340  27      neg
## 766        5     121       72      23     112 26.2    0.245  30      neg
## 767        1     126       60       0       0 30.1    0.349  47      pos
## 768        1      93       70      31       0 30.4    0.315  23      neg

A quick exploration reveals that there are more zeros in the data than expected (especially since a BMI or triceps skin fold thickness of 0 is impossible), implying that missing values are recorded as zeros. See, for instance, the histogram of the triceps skin fold thickness, which has a number of 0 entries that are set apart from the other entries.

ggplot(diabetes_orig) +
  geom_histogram(aes(x = triceps))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
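If you'd rather see numbers than a histogram, one option is to count the zeros in each column. The snippet below is a minimal sketch of that idea, not the code I actually ran: `toy` is a hypothetical stand-in for diabetes_orig built from four rows of the printout above, and it assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

# hypothetical stand-in for diabetes_orig (rows 333, 337, 350 and 431 above)
toy <- tibble(
  pregnant = c(1, 0, 5, 2),
  glucose  = c(180, 117, 0, 99),
  triceps  = c(0, 0, 32, 0),
  mass     = c(43.3, 33.8, 41.0, 22.2)
)

# count the 0 entries in every numeric column
zero_counts <- toy %>%
  summarise(across(where(is.numeric), ~ sum(.x == 0)))
zero_counts
```

Running the same summarise() call on diabetes_orig should give the zero count for each of the real variables.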

This phenomenon can also be seen in the glucose, pressure, insulin and mass variables. Thus, we convert the 0 entries in these variables and in triceps to NA (a 0 in “pregnant” is a legitimate value, so we leave it alone). To do that, we use the mutate_at() function (which has been superseded by mutate() with across() as of dplyr 1.0.0) to specify which variables we want to apply our mutating function to, and we use the if_else() function to specify what to replace the value with if the condition is true or false.

diabetes_clean <- diabetes_orig %>%
  mutate_at(vars(triceps, glucose, pressure, insulin, mass), 
            function(.var) { 
              if_else(condition = (.var == 0), # if true (i.e. the entry is 0)
                      true = as.numeric(NA),  # replace the value with NA
                      false = .var # otherwise leave it as it is
                      )
            })
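
For readers who prefer the newer idiom, here is a minimal sketch of the same zero-to-NA replacement written with mutate() and across(). The toy_orig tibble and its values are made up for illustration; only the column names mirror the diabetes data.

```r
library(dplyr)

# hypothetical stand-in for diabetes_orig (made-up values)
toy_orig <- tibble(
  pregnant = c(0, 2, 5),
  glucose  = c(148, 0, 183),
  triceps  = c(35, 0, 29)
)

toy_clean <- toy_orig %>%
  # apply the same replacement to each selected column
  mutate(across(c(glucose, triceps),
                ~ if_else(.x == 0, NA_real_, .x)))
toy_clean
```

The zero in pregnant is left untouched because that column is not listed inside across().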

Our data is ready. Hopefully you’ve replenished your cup of tea (or coffee if you’re into that for some reason). Let’s start making some tidy models!

Split into train/test

First, let’s split our dataset into training and testing data. The training data will be used to fit our model and tune its parameters, whereas the testing data will be used to evaluate our final model’s performance.

This split can be done automatically using the initial_split() function (from rsample) which creates a special “split” object.

set.seed(234589)
# split the data into training (75%) and testing (25%)
diabetes_split <- initial_split(diabetes_clean, 
                                prop = 3/4)
diabetes_split
## <Training/Validation/Total>
## <576/192/768>

The printed output of diabetes_split, our split object, tells us how many observations we have in the training set, the testing set, and overall: <train/test/total>.

The training and testing sets can be extracted from the “split” object using the training() and testing() functions. We will rarely use these extracted objects directly, though; the final fitting step in the pipeline works with the diabetes_split object itself.

# extract training and testing sets
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
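
If the outcome classes are imbalanced, it can help to keep their proportions similar in both sets. The following is a minimal sketch, assuming the strata argument of initial_split() and a made-up toy data frame (not the diabetes data):

```r
library(rsample)

set.seed(1)
# hypothetical data: 65 "neg" and 35 "pos" outcomes
toy <- data.frame(
  x = rnorm(100),
  diabetes = factor(rep(c("neg", "pos"), times = c(65, 35)))
)

# sample 75% for training within each outcome class
toy_split <- initial_split(toy, prop = 3/4, strata = diabetes)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# the two pieces partition the original rows
nrow(toy_train) + nrow(toy_test) == nrow(toy)
# class counts in the training set
table(toy_train$diabetes)
```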

At some point we’re going to want to do some parameter tuning, and to do that we’re going to want to use cross-validation. So we can create a cross-validated version of the training set in preparation for that moment using vfold_cv().

# create CV object from training data
diabetes_cv <- vfold_cv(diabetes_train)
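
By default vfold_cv() creates 10 folds. As a small sketch on a made-up data frame, the v argument changes the number of folds, and analysis() and assessment() (both from rsample) pull out the two parts of a single fold:

```r
library(rsample)

set.seed(1)
# hypothetical data, only used to show the shape of the CV object
toy <- data.frame(x = rnorm(50), y = rnorm(50))

toy_cv <- vfold_cv(toy, v = 5)
toy_cv

# each row of toy_cv holds one split; look inside the first one
first_fold <- toy_cv$splits[[1]]
nrow(analysis(first_fold))    # rows used for fitting in this fold
nrow(assessment(first_fold))  # rows held out in this fold
```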

Define a recipe

Recipes allow you to specify the role of each variable as an outcome or predictor variable (using a “formula”), and any pre-processing steps you want to conduct (such as normalization, imputation, PCA, etc).

Creating a recipe has two parts (layered on top of one another using pipes %>%):

  1. Specify the formula (recipe()): specify the outcome variable and predictor variables

  2. Specify pre-processing steps (step_zzz()): define the pre-processing steps, such as imputation, creating dummy variables, scaling, and more

For instance, we can define the following recipe

# define the recipe
diabetes_recipe <- 
  # which consists of the formula (outcome ~ predictors)
  recipe(diabetes ~ pregnant + glucose + pressure + triceps + 
           insulin + mass + pedigree + age, 
         data = diabetes_clean) %>%
  # and some pre-processing steps
  step_normalize(all_numeric()) %>%
  step_knnimpute(all_predictors())
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.

If you’ve ever seen formulas before (e.g. using the lm() function in R), you might have noticed that we could have written our formula much more efficiently using the formula short-hand where . represents all of the variables in the data: outcome ~ . will fit a model that predicts the outcome using all other columns.

The full list of pre-processing steps available can be found here. In the recipe steps above we used the functions all_numeric() and all_predictors() as arguments to the pre-processing steps. These are called “role selections”, and they specify that we want to apply the step to “all numeric” variables or “all predictor variables”. The list of all potential role selectors can be found by typing ?selections into your console.

Note that we used the original diabetes_clean data object (we set recipe(..., data = diabetes_clean)), rather than the diabetes_train object or the diabetes_split object. It turns out we could have used any of these. All that the recipe takes from the data object at this point are the names and roles of the outcome and predictor variables. We will apply this recipe to specific datasets later. This means that for large data sets, the head of the data could be passed to the recipe instead, to save time and memory.
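
Both the formula short-hand and the head-of-the-data trick can be sketched together. The toy data frame below is made up for illustration; summary() on a recipe lists each variable with its type and role.

```r
library(tidymodels)

# hypothetical toy data with two predictors and a factor outcome
toy <- data.frame(
  glucose  = c(148, 85, 183, 89, 137, 116),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  diabetes = factor(c("pos", "neg", "pos", "neg", "pos", "neg"))
)

# `.` stands for every column other than the outcome;
# only the first two rows are passed, since only names and types are needed here
toy_recipe <- recipe(diabetes ~ ., data = head(toy, 2)) %>%
  step_normalize(all_numeric())

toy_roles <- summary(toy_recipe)
toy_roles
```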

Indeed, if we print a summary of the diabetes_recipe object, it just shows us how many predictor variables we’ve specified and the steps we’ve specified (but it doesn’t actually implement them yet!).

diabetes_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Operations:
## 
## Centering and scaling for all_numeric
## K-nearest neighbor imputation for all_predictors

If you want to extract the pre-processed dataset itself, you can first prep() the recipe for a specific dataset and juice() the prepped recipe to extract the pre-processed data. It turns out that extracting the pre-processed data isn’t actually necessary for the pipeline, since this will be done under the hood when the model is fit, but sometimes it’s useful anyway.

diabetes_train_preprocessed <- diabetes_recipe %>%
  # apply the recipe to the training data
  prep(diabetes_train) %>%
  # extract the pre-processed training dataset
  juice()
diabetes_train_preprocessed
## # A tibble: 576 x 9
##    pregnant glucose pressure triceps insulin   mass pedigree     age diabetes
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>   <dbl> <fct>   
##  1   0.673   0.837   -0.0581  0.616   0.328   0.187    0.531  1.37   pos     
##  2  -0.824  -1.21    -0.537   0.0354 -0.770  -0.831   -0.359 -0.205  neg     
##  3   1.27    1.98    -0.697   0.229   1.33   -1.31     0.676 -0.122  pos     
##  4  -1.12    0.478   -2.61    0.616   0.0669  1.57     5.89  -0.0396 pos     
##  5   0.374  -0.205    0.102  -0.700  -0.404  -0.977   -0.843 -0.287  neg     
##  6   1.87   -0.238    0.229   0.229   0.558   0.435   -1.06  -0.370  neg     
##  7  -0.525   2.43    -0.218   1.58    3.15   -0.264   -0.982  1.61   pos     
##  8   1.27    0.0877   1.86   -0.178   0.863   0.318   -0.743  1.70   pos     
##  9   0.0743 -0.401    1.54    0.674   0.139   0.769   -0.875 -0.287  neg     
## 10   1.87    0.544    0.581  -0.448   0.666  -0.758    3.16   1.94   neg     
## # … with 566 more rows
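
The same prepped recipe can also be applied to data it was not prepped on, using bake(). This is a minimal sketch on made-up data (the toy objects are hypothetical), showing that the centering and scaling statistics are estimated from the training rows and then reused on the held-out rows:

```r
library(tidymodels)

set.seed(1)
# hypothetical toy data
toy <- tibble(
  glucose  = rnorm(40, mean = 120, sd = 30),
  mass     = rnorm(40, mean = 32, sd = 7),
  diabetes = factor(sample(c("neg", "pos"), 40, replace = TRUE))
)
toy_train <- toy[1:30, ]
toy_test  <- toy[31:40, ]

toy_prepped <- recipe(diabetes ~ ., data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  # estimate means and standard deviations from the training rows only
  prep(training = toy_train)

# apply those training-set statistics to the held-out rows
toy_test_baked <- bake(toy_prepped, new_data = toy_test)
toy_test_baked
```

The baked test columns will generally not have mean exactly 0, because they are centered with the training means rather than their own.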

I wrote a much longer post on recipes if you’d like to check out more details. However, note that the preparation and bake steps described in that post are no longer necessary in the tidymodels pipeline, since they’re now implemented under the hood by the later model fitting functions in this pipeline.

Specify the model

So far we’ve split our data into training/testing, and we’ve specified our pre-processing steps using a recipe. The next thing we want to specify is our model (using the parsnip package).

Parsnip offers a unified interface for the massive variety of models that exist in R. This means that you only have to learn one way of specifying a model, and you can use this specification and have it generate a linear model, a random forest model, a support vector machine model, and more with a single line of code.

There are a few primary components that you need to provide for the model specification

  1. The model type: what kind of model you want to fit, set using a different function depending on the model, such as rand_forest() for random forest, logistic_reg() for logistic regression, svm_poly() for a polynomial SVM model etc. The full list of models available via parsnip can be found here.

  2. The arguments: the model parameter values (now consistently named across different models), set using set_args().

  3. The engine: the underlying package the model should come from (e.g. “ranger” for the ranger implementation of Random Forest), set using set_engine().

  4. The mode: the type of prediction - since several packages can do both classification (binary/categorical prediction) and regression (continuous prediction), set using set_mode().

For instance, if we want to fit a random forest model as implemented by the ranger package for the purpose of classification and we want to tune the mtry parameter (the number of randomly selected variables to be considered at each split in the trees), then we would define the following model specification:

rf_model <- 
  # specify that the model is a random forest
  rand_forest() %>%
  # specify that the `mtry` parameter needs to be tuned
  set_args(mtry = tune()) %>%
  # select the engine/package that underlies the model
  set_engine("ranger", importance = "impurity") %>%
  # choose either the continuous regression or binary classification mode
  set_mode("classification") 

If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options are "impurity" or "permutation".

As another example, the following code would instead specify a logistic regression model fit with the glm() function (from the stats package that ships with R).

lr_model <- 
  # specify that the model is a logistic regression
  logistic_reg() %>%
  # select the engine/package that underlies the model
  set_engine("glm") %>%
  # choose either the continuous regression or binary classification mode
  set_mode("classification") 

Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model. Moreover, setting a parameter to tune() means that it will be tuned later in the tune stage of the pipeline (i.e. the value of the parameter that yields the best performance will be chosen). You could also just specify a particular value of the parameter if you don’t want to tune it e.g. using set_args(mtry = 4).
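
To make the "description only" point concrete, here is a small sketch. It fixes mtry at a value instead of tuning it, and uses translate() from parsnip to show how a specification maps onto the underlying engine call. The object names rf_fixed and lr_spec are hypothetical, and nothing is fit at any point.

```r
library(tidymodels)

# fix mtry at a chosen value rather than marking it for tuning
rf_fixed <- rand_forest() %>%
  set_args(mtry = 4) %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_fixed  # prints the specification only

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
# show the glm() call template that this specification would generate
translate(lr_spec)
```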

Another thing to note is that nothing about this model specification is specific to the diabetes dataset.

Put it all together in a workflow

We’re now ready to put the model and recipes together into a workflow. You initiate a workflow using workflow() (from the workflows package) and then you can add a recipe and add a model to it.

# set the workflow
rf_workflow <- workflow() %>%
  # add the recipe
  add_recipe(diabetes_recipe) %>%
  # add the model
  add_model(rf_model)

Note that we still haven’t implemented the pre-processing steps in the recipe, nor have we fit the model. We’ve just written the framework. It is only when we tune the parameters or fit the model that the recipe and model frameworks are actually implemented.

Tune the parameters

Since we had a parameter that we designated to be tuned (mtry), we need to tune it (i.e. choose the value that leads to the best performance) before fitting our model. If you don’t have any parameters to tune, you can skip this step.

Note that we will do our tuning using the cross-validation object (diabetes_cv). To do this, we specify the range of mtry values we want to try, and then we add a tuning layer to our workflow using tune_grid() (from the tune package). Note that we focus on two metrics: accuracy and roc_auc (from the yardstick package).

# specify which values we want to try
rf_grid <- expand.grid(mtry = c(3, 4, 5))
# extract results
rf_tune_results <- rf_workflow %>%
  tune_grid(resamples = diabetes_cv, #CV object
            grid = rf_grid, # grid of values to try
            metrics = metric_set(accuracy, roc_auc) # metrics we care about
            )
## i Creating pre-processing data to finalize unknown parameter: mtry

You can tune multiple parameters at once by providing multiple parameters to the expand.grid() function, e.g. expand.grid(mtry = c(3, 4, 5), trees = c(100, 500)).
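
As a quick illustration in base R, expand.grid() returns one row per combination of the supplied values, so the number of candidate models grows multiplicatively (rf_grid_2d is just an illustrative name):

```r
# every combination of the candidate values
rf_grid_2d <- expand.grid(mtry = c(3, 4, 5), trees = c(100, 500))
rf_grid_2d

# 3 values of mtry x 2 values of trees = 6 candidate combinations
nrow(rf_grid_2d)
```

Each of these rows is then evaluated on every cross-validation fold, which is worth keeping in mind before making the grid large.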

It’s always a good idea to explore the results of the cross-validation. collect_metrics() is a really handy function that can be used in a variety of circumstances to extract any metrics that have been calculated within the object it’s being used on. In this case, the metrics come from the cross-validation performance across the different values of the parameters.

# print results
rf_tune_results %>%
  collect_metrics()
## # A tibble: 6 x 6
##    mtry .metric  .estimator  mean     n std_err
##   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
## 1     3 accuracy binary     0.758    10  0.0233
## 2     3 roc_auc  binary     0.829    10  0.0155
## 3     4 accuracy binary     0.753    10  0.0222
## 4     4 roc_auc  binary     0.829    10  0.0161
## 5     5 accuracy binary     0.751    10  0.0220
## 6     5 roc_auc  binary     0.827    10  0.0147

For accuracy, mtry = 3 yields the best performance (just), and its AUC ties with mtry = 4 at 0.829 to three decimal places.
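
One way to read a table like this is to pick out the best row for each metric. The sketch below retypes the rounded means printed above into a tibble (so ties reflect the rounding, not the underlying values) and uses dplyr to find the winner per metric:

```r
library(dplyr)
library(tibble)

# rounded cross-validation means copied from the printed table
cv_metrics <- tribble(
  ~mtry, ~.metric,   ~mean,
  3,     "accuracy", 0.758,
  3,     "roc_auc",  0.829,
  4,     "accuracy", 0.753,
  4,     "roc_auc",  0.829,
  5,     "accuracy", 0.751,
  5,     "roc_auc",  0.827
)

best_per_metric <- cv_metrics %>%
  group_by(.metric) %>%
  slice_max(mean, n = 1) %>%  # rows that tie for the maximum are all kept
  ungroup()
best_per_metric
```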

Finalize the workflow

We want to add a layer to our workflow that corresponds to the tuned parameter, i.e. sets mtry to be the value that yielded the best results. If you didn’t tune any parameters, you can skip this step.

We can extract the best value for the accuracy metric by applying the select_best() function to the tune object.

param_final <- rf_tune_results %>%
  select_best(metric = "accuracy")
param_final
## # A tibble: 1 x 1
##    mtry
##   <dbl>
## 1     3

Then we can add this parameter to the workflow using the finalize_workflow() function.

rf_workflow <- rf_workflow %>%
  finalize_workflow(param_final)

Evaluate the model on the test set

Now that we’ve defined our recipe, our model, and tuned the model’s parameters, we’re ready to actually fit the final model. Since all of this information is contained within the workflow object, we will apply the last_fit() function to our workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.

rf_fit <- rf_workflow %>%
  # fit on the training set and evaluate on test set
  last_fit(diabetes_split)

Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.

rf_fit
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples  
## # A tibble: 1 x 6
##   splits        id           .metrics      .notes      .predictions    .workflow
##   <list>        <chr>        <list>        <list>      <list>          <list>   
## 1 <split [576/… train/test … <tibble [2 ×… <tibble [0… <tibble [192 ×… <workflo…

This is a really nice feature of tidymodels (and is what makes it work so nicely with the tidyverse) since you can do all of your tidyverse operations to the model object. While truly taking advantage of this flexibility requires proficiency with purrr, if you don’t want to deal with purrr and list-columns, there are functions that can extract the relevant information from the fit object that remove the need for purrr as we will see below.

Since we supplied the train/test object when we fit the workflow, the metrics are evaluated on the test set. Now when we use the collect_metrics() function (recall we used this when tuning our parameters), it extracts the performance of the final model (since rf_fit now consists of a single final model) applied to the test set.

test_performance <- rf_fit %>% collect_metrics()
test_performance
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.745
## 2 roc_auc  binary         0.834

Overall the performance is good, with an accuracy of 0.745 and an AUC of 0.834 on the test set, in line with the cross-validated estimates above.

You can also extract the test set predictions themselves using the collect_predictions() function. Note that there are 192 rows in the predictions object below which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).

# generate predictions from the test set
test_predictions <- rf_fit %>% collect_predictions()
test_predictions
## # A tibble: 192 x 6
##    id               .pred_neg .pred_pos  .row .pred_class diabetes
##    <chr>                <dbl>     <dbl> <int> <fct>       <fct>   
##  1 train/test split    0.994    0.00592     4 neg         neg     
##  2 train/test split    0.962    0.0384      7 neg         pos     
##  3 train/test split    0.0648   0.935      12 pos         pos     
##  4 train/test split    0.713    0.287      19 neg         neg     
##  5 train/test split    0.596    0.404      21 neg         neg     
##  6 train/test split    0.535    0.465      25 neg         pos     
##  7 train/test split    0.475    0.525      27 pos         pos     
##  8 train/test split    0.720    0.280      29 neg         neg     
##  9 train/test split    0.939    0.0611     34 neg         neg     
## 10 train/test split    0.373    0.627      35 pos         neg     
## # … with 182 more rows

Since this is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.

# generate a confusion matrix
test_predictions %>% 
  conf_mat(truth = diabetes, estimate = .pred_class)
##           Truth
## Prediction neg pos
##        neg 106  33
##        pos  16  37
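
The counts in this matrix are enough to recompute a few summary metrics by hand. The base R sketch below retypes the four counts and treats "pos" as the class of interest (a choice made here for illustration):

```r
# confusion matrix counts copied from the output above
# rows are predictions, columns are the truth
conf <- matrix(c(106, 16, 33, 37), nrow = 2,
               dimnames = list(Prediction = c("neg", "pos"),
                               Truth = c("neg", "pos")))

# proportion of all test observations classified correctly
accuracy <- sum(diag(conf)) / sum(conf)                      # 143 / 192
# proportion of true "pos" cases predicted as "pos"
sensitivity <- conf["pos", "pos"] / sum(conf[, "pos"])       # 37 / 70
# proportion of true "neg" cases predicted as "neg"
specificity <- conf["neg", "neg"] / sum(conf[, "neg"])       # 106 / 122

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

The accuracy agrees with the value reported by collect_metrics(), while the sensitivity shows that nearly half of the true positive cases are missed.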

We could also plot the distribution of the predicted probabilities for each class.

test_predictions %>%
  ggplot() +
  geom_density(aes(x = .pred_pos, fill = diabetes), 
               alpha = 0.5)

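If you’re familiar with list-columns, you could also extract the .predictions column directly using pull() (a dplyr function). The following code does almost the same thing as collect_predictions(), except that it returns a list containing the predictions tibble and drops the id column; you could then use purrr functions to work with that list. You could similarly have done this with the .metrics column.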

test_predictions <- rf_fit %>% pull(.predictions)
test_predictions
## [[1]]
## # A tibble: 192 x 5
##    .pred_neg .pred_pos  .row .pred_class diabetes
##        <dbl>     <dbl> <int> <fct>       <fct>   
##  1    0.994    0.00592     4 neg         neg     
##  2    0.962    0.0384      7 neg         pos     
##  3    0.0648   0.935      12 pos         pos     
##  4    0.713    0.287      19 neg         neg     
##  5    0.596    0.404      21 neg         neg     
##  6    0.535    0.465      25 neg         pos     
##  7    0.475    0.525      27 pos         pos     
##  8    0.720    0.280      29 neg         neg     
##  9    0.939    0.0611     34 neg         neg     
## 10    0.373    0.627      35 pos         neg     
## # … with 182 more rows

Fitting and using your final model

The previous section evaluated the model trained on the training data using the testing data. But once you’ve determined your final model, you often want to train it on your full dataset and then use it to predict the response for new data.

If you want to use your model to predict the response for new observations, you need to use the fit() function on your workflow and the dataset that you want to fit the final model on (e.g. the complete training + testing dataset).

final_model <- fit(rf_workflow, diabetes_clean)

The final_model object contains a few things including the ranger object trained with the parameters established through the workflow contained in rf_workflow based on the data in diabetes_clean (the combined training and testing data).

final_model
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_normalize()
## ● step_knnimpute()
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~3, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      768 
## Number of independent variables:  8 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1570876

If we wanted to predict the diabetes status of a new woman, we could use the normal predict() function.

For instance, below we define the data for a new woman.

new_woman <- tribble(~pregnant, ~glucose, ~pressure, ~triceps, ~insulin, ~mass, ~pedigree, ~age,
                     2, 95, 70, 31, 102, 28.2, 0.67, 47)
new_woman
## # A tibble: 1 x 8
##   pregnant glucose pressure triceps insulin  mass pedigree   age
##      <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl>
## 1        2      95       70      31     102  28.2     0.67    47

The predicted diabetes status of this new woman is “negative”.

predict(final_model, new_data = new_woman)
## # A tibble: 1 x 1
##   .pred_class
##   <fct>      
## 1 neg
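
Sometimes the predicted probabilities are more useful than the class label. A minimal sketch, assuming predict() on a fitted workflow accepts type = "prob" as parsnip model fits do; it uses a made-up toy dataset and a logistic regression so that it runs without the ranger package (all toy_ names and new_obs are hypothetical):

```r
library(tidymodels)

set.seed(1)
# hypothetical toy data with a noisy relationship between x1 and the outcome
toy <- tibble(
  x1 = rnorm(80),
  x2 = rnorm(80),
  diabetes = factor(ifelse(x1 + rnorm(80) > 0, "pos", "neg"))
)

toy_fit <- workflow() %>%
  add_recipe(recipe(diabetes ~ ., data = toy) %>%
               step_normalize(all_predictors())) %>%
  add_model(logistic_reg() %>%
              set_engine("glm") %>%
              set_mode("classification")) %>%
  fit(data = toy)

new_obs <- tibble(x1 = 0.5, x2 = -1)
# hard class prediction
predict(toy_fit, new_data = new_obs)
# one probability column per class
toy_probs <- predict(toy_fit, new_data = new_obs, type = "prob")
toy_probs
```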

Variable importance

If you want to extract the variable importance scores from your model, as far as I can tell, for now you need to extract the model object from the fit() object (which for us is called final_model). The function that extracts the model is pull_workflow_fit() and then you need to grab the fit object that the output contains.

ranger_obj <- pull_workflow_fit(final_model)$fit
ranger_obj
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~3, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      768 
## Number of independent variables:  8 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1570876

Then you can extract the variable importance from the ranger object itself (variable.importance is a specific object contained within ranger output - this will need to be adapted for the specific object type of other models).

ranger_obj$variable.importance
## pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
## 16.66632 75.52923 17.43526 23.08703 52.40944 41.10770 29.85125 33.08012
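
Since the importance scores come back as a plain named vector, base R is enough to rank them. The sketch below retypes the values printed above and also rescales them to proportions, which is a presentation choice made here rather than something ranger does:

```r
# impurity importance scores copied from the output above
vi <- c(pregnant = 16.66632, glucose = 75.52923, pressure = 17.43526,
        triceps = 23.08703, insulin = 52.40944, mass = 41.10770,
        pedigree = 29.85125, age = 33.08012)

# rank from most to least important
vi_sorted <- sort(vi, decreasing = TRUE)
vi_sorted

# share of the total importance attributed to each variable
round(vi_sorted / sum(vi_sorted), 3)
```

Glucose comes out on top, followed by insulin and mass.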