Tidymodels: tidy machine learning in R
The tidyverse's take on machine learning is finally here. Tidymodels forms the basis of tidy machine learning, and this post provides a whirlwind tour to get you started.
There’s a new modeling pipeline in town: tidymodels. Over the past few years, tidymodels has been gradually emerging as the tidyverse’s machine learning toolkit.
Why tidymodels? Well, it turns out that R has a consistency problem. Since everything was made by different people using different principles, everything has a slightly different interface, and trying to keep it all in line can be frustrating. Several years ago, Max Kuhn (formerly at Pfizer, now at RStudio) developed the caret R package (see my caret tutorial) aimed at creating a uniform interface for the massive variety of machine learning models that exist in R. Caret was great in a lot of ways, but also limited in others. In my own use, I found it to be quite slow whenever I tried to use it on problems of even modest size.
That said, caret was a great starting point, so RStudio hired Max Kuhn to work on a tidy version of caret, and he and many other people have developed what has become tidymodels. Tidymodels has been in development for a few years, with snippets of it being released as they were developed (see my post on the recipes package). I’ve been holding off writing a post about tidymodels until it seemed as though the different pieces fit together sufficiently for it to all feel cohesive. I feel like they’re finally there - which means it is time for me to learn it! While caret isn’t going anywhere (you can continue to use caret, and your existing caret code isn’t going to stop working), tidymodels will eventually make it redundant.
The main resources I used to learn tidymodels were Alison Hill’s slides from Introduction to Machine Learning with the Tidyverse, which contains all the slides for the course she prepared with Garrett Grolemund for RStudio::conf(2020), and Edgar Ruiz’s Gentle introduction to tidymodels on the RStudio website.
Note that throughout this post I’ll be assuming basic tidyverse knowledge, primarily of dplyr (e.g. piping %>% and functions such as mutate()). Fortunately, for all you purrr-phobes out there, purrr is not required. If you’d like to brush up on your tidyverse skills, check out my Introduction to the Tidyverse posts. If you’d like to learn purrr (purrr is very handy for working with tidymodels but is no longer a requirement), check out my purrr post.
What is tidymodels?
Much like the tidyverse consists of many core packages, such as ggplot2 and dplyr, tidymodels also consists of several core packages, including

- rsample: for sample splitting (e.g. train/test or cross-validation)
- recipes: for pre-processing
- parsnip: for specifying the model
- yardstick: for evaluating the model
Similarly to how you can load the entire tidyverse suite of packages by typing library(tidyverse), you can load the entire tidymodels suite of packages by typing library(tidymodels).
We will also be using the tune package (for parameter tuning) and the workflows package (for putting everything together), which I had thought were part of CRAN’s tidymodels package bundle, but apparently they aren’t. These will need to be loaded separately for now.
Unlike in my tidyverse post, I won’t base this post around the packages themselves, but I will mention the packages in passing.
Getting set up
First we need to load some libraries: tidymodels and tidyverse.
# load the relevant tidymodels libraries
library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)
If you don’t already have the tidymodels library (or any of the other libraries) installed, then you’ll need to install it (once only) using install.packages("tidymodels").
We will use the Pima Indian Women’s diabetes dataset which contains information on 768 Pima Indian women’s diabetes status, as well as many predictive features such as the number of pregnancies (pregnant), plasma glucose concentration (glucose), diastolic blood pressure (pressure), triceps skin fold thickness (triceps), 2-hour serum insulin (insulin), BMI (mass), diabetes pedigree function (pedigree), and their age (age). In case you were wondering, the Pima Indians are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The short name, “Pima” is believed to have come from a phrase meaning “I don’t know,” which they used repeatedly in their initial meetings with Spanish colonists. Thanks Wikipedia!
# load the Pima Indians dataset from the mlbench dataset
library(mlbench)
data(PimaIndiansDiabetes)
# rename dataset to have shorter name because lazy
diabetes_orig <- PimaIndiansDiabetes
diabetes_orig
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
## 7 3 78 50 32 88 31.0 0.248 26 pos
## 8 10 115 0 0 0 35.3 0.134 29 neg
## 9 2 197 70 45 543 30.5 0.158 53 pos
## 10 8 125 96 0 0 0.0 0.232 54 pos
## ... with 758 more rows (full 768-row output truncated for display)
A quick exploration reveals that there are more zeros in the data than expected (especially since a BMI or triceps skin fold thickness of 0 is impossible), implying that missing values are recorded as zeros. See for instance the histogram of the triceps skin fold thickness, which has a number of 0 entries that are set apart from the other entries.
ggplot(diabetes_orig) +
geom_histogram(aes(x = triceps))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This phenomenon can also be seen in the glucose, pressure, insulin and mass variables. Thus, we convert the 0 entries in all variables (other than pregnant) to NA. To do that, we use the mutate_at() function (which will soon be superseded by mutate() with across()) to specify which variables we want to apply our mutating function to, and we use the if_else() function to specify what to replace the value with when the condition is true or false.
diabetes_clean <- diabetes_orig %>%
mutate_at(vars(triceps, glucose, pressure, insulin, mass),
function(.var) {
if_else(condition = (.var == 0), # if true (i.e. the entry is 0)
true = as.numeric(NA), # replace the value with NA
false = .var # otherwise leave it as it is
)
})
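For reference, here is a sketch of what the same cleaning step might look like with the newer across() syntax (assuming dplyr 1.0.0 or later; na_if() replaces values equal to 0 with NA):

# equivalent cleaning step using across() and na_if() (a sketch)
diabetes_clean <- diabetes_orig %>%
  mutate(across(c(triceps, glucose, pressure, insulin, mass),
                ~ na_if(., 0)))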
Our data is ready. Hopefully you’ve replenished your cup of tea (or coffee if you’re into that for some reason). Let’s start making some tidy models!
Split into train/test
First, let’s split our dataset into training and testing data. The training data will be used to fit our model and tune its parameters, while the testing data will be used to evaluate our final model’s performance.

This split can be done automatically using the initial_split() function (from rsample), which creates a special “split” object.
set.seed(234589)
# split the data into training (75%) and testing (25%)
diabetes_split <- initial_split(diabetes_clean,
prop = 3/4)
diabetes_split
## <Training/Validation/Total>
## <576/192/768>
The printed output of diabetes_split, our split object, tells us how many observations we have in the training set, the testing set, and overall: <train/test/total>.
The training and testing sets can be extracted from the “split” object using the training() and testing() functions. We won’t use these objects directly when fitting the final model (we will be using the diabetes_split object itself), but the training set will come in handy for cross-validation below.
# extract training and testing sets
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
At some point we’re going to want to do some parameter tuning, and to do that we’re going to want to use cross-validation. So we can create a cross-validated version of the training set in preparation for that moment using vfold_cv().
# create CV object from training data
diabetes_cv <- vfold_cv(diabetes_train)
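By default, vfold_cv() creates 10 folds. If you wanted to be explicit about the number of folds, or to keep the proportion of diabetes outcomes similar across folds, a sketch using rsample’s v and strata arguments might look like:

# e.g. 10-fold CV with folds stratified by the outcome (a sketch)
diabetes_cv <- vfold_cv(diabetes_train, v = 10, strata = diabetes)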
Define a recipe
Recipes allow you to specify the role of each variable as an outcome or predictor variable (using a “formula”), and any pre-processing steps you want to conduct (such as normalization, imputation, PCA, etc).

Creating a recipe has two parts (layered on top of one another using pipes %>%):

1. Specify the formula (recipe()): specify the outcome variable and predictor variables
2. Specify pre-processing steps (step_zzz()): define the pre-processing steps, such as imputation, creating dummy variables, scaling, and more
For instance, we can define the following recipe:
# define the recipe
diabetes_recipe <-
# which consists of the formula (outcome ~ predictors)
recipe(diabetes ~ pregnant + glucose + pressure + triceps +
insulin + mass + pedigree + age,
data = diabetes_clean) %>%
# and some pre-processing steps
step_normalize(all_numeric()) %>%
step_knnimpute(all_predictors())
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
If you’ve ever seen formulas before (e.g. using the lm() function in R), you might have noticed that we could have written our formula much more efficiently using the formula short-hand, where . represents all of the variables in the data: outcome ~ . will fit a model that predicts the outcome using all other columns.
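As a sketch, the same recipe written with the shorthand (valid here because every non-outcome column is a predictor) would be:

# the same recipe using the `.` formula shorthand (a sketch)
diabetes_recipe <- recipe(diabetes ~ ., data = diabetes_clean) %>%
  step_normalize(all_numeric()) %>%
  step_knnimpute(all_predictors())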
The full list of pre-processing steps available can be found here. In the recipe steps above we used the functions all_numeric() and all_predictors() as arguments to the pre-processing steps. These are called “role selections”, and they specify that we want to apply the step to “all numeric” variables or “all predictor variables”. The list of all potential role selectors can be found by typing ?selections into your console.
Note that we used the original diabetes_clean data object (we set recipe(..., data = diabetes_clean)), rather than the diabetes_train object or the diabetes_split object. It turns out we could have used any of these: all the recipe takes from the data object at this point are the names and roles of the outcome and predictor variables. We will apply this recipe to specific datasets later. This means that for large data sets, you could pass just the head of the data to the recipe to save time and memory.
Indeed, if we print a summary of the diabetes_recipe object, it just shows us how many outcome and predictor variables we’ve specified and the steps we’ve specified (but it doesn’t actually implement them yet!).
diabetes_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Operations:
##
## Centering and scaling for all_numeric
## K-nearest neighbor imputation for all_predictors
If you want to extract the pre-processed dataset itself, you can first prep() the recipe for a specific dataset and then juice() the prepped recipe to extract the pre-processed data. It turns out that extracting the pre-processed data isn’t actually necessary for the pipeline, since this is done under the hood when the model is fit, but sometimes it’s useful anyway.
diabetes_train_preprocessed <- diabetes_recipe %>%
# apply the recipe to the training data
prep(diabetes_train) %>%
# extract the pre-processed training dataset
juice()
diabetes_train_preprocessed
## # A tibble: 576 x 9
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.673 0.837 -0.0581 0.616 0.328 0.187 0.531 1.37 pos
## 2 -0.824 -1.21 -0.537 0.0354 -0.770 -0.831 -0.359 -0.205 neg
## 3 1.27 1.98 -0.697 0.229 1.33 -1.31 0.676 -0.122 pos
## 4 -1.12 0.478 -2.61 0.616 0.0669 1.57 5.89 -0.0396 pos
## 5 0.374 -0.205 0.102 -0.700 -0.404 -0.977 -0.843 -0.287 neg
## 6 1.87 -0.238 0.229 0.229 0.558 0.435 -1.06 -0.370 neg
## 7 -0.525 2.43 -0.218 1.58 3.15 -0.264 -0.982 1.61 pos
## 8 1.27 0.0877 1.86 -0.178 0.863 0.318 -0.743 1.70 pos
## 9 0.0743 -0.401 1.54 0.674 0.139 0.769 -0.875 -0.287 neg
## 10 1.87 0.544 0.581 -0.448 0.666 -0.758 3.16 1.94 neg
## # … with 566 more rows
I wrote a much longer post on recipes if you’d like to check out more details. However, note that the preparation and bake steps described in that post are no longer necessary in the tidymodels pipeline, since they’re now implemented under the hood by the later model fitting functions in this pipeline.
Specify the model
So far we’ve split our data into training/testing, and we’ve specified our pre-processing steps using a recipe. The next thing we want to specify is our model (using the parsnip package).
Parsnip offers a unified interface for the massive variety of models that exist in R. This means that you only have to learn one way of specifying a model, and you can use this specification and have it generate a linear model, a random forest model, a support vector machine model, and more with a single line of code.
There are a few primary components that you need to provide for the model specification:

- The model type: what kind of model you want to fit, set using a different function depending on the model, such as rand_forest() for random forest, logistic_reg() for logistic regression, svm_poly() for a polynomial SVM model, etc. The full list of models available via parsnip can be found here.
- The arguments: the model parameter values (now consistently named across different models), set using set_args().
- The engine: the underlying package the model should come from (e.g. “ranger” for the ranger implementation of Random Forest), set using set_engine().
- The mode: the type of prediction - since several packages can do both classification (binary/categorical prediction) and regression (continuous prediction), set using set_mode().
For instance, if we want to fit a random forest model as implemented by the ranger
package for the purpose of classification and we want to tune the mtry
parameter (the number of randomly selected variables to be considered at each split in the trees), then we would define the following model specification:
rf_model <-
# specify that the model is a random forest
rand_forest() %>%
# specify that the `mtry` parameter needs to be tuned
set_args(mtry = tune()) %>%
# select the engine/package that underlies the model
set_engine("ranger", importance = "impurity") %>%
# choose either the continuous regression or binary classification mode
set_mode("classification")
If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options are "impurity" or "permutation".
As another example, the following code would instead specify a logistic regression model, fit via R’s built-in glm() function.
lr_model <-
# specify that the model is a logistic regression
logistic_reg() %>%
# select the engine/package that underlies the model
set_engine("glm") %>%
# choose either the continuous regression or binary classification mode
set_mode("classification")
Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model. Moreover, setting a parameter to tune() means that it will be tuned later in the tune stage of the pipeline (i.e. the value of the parameter that yields the best performance will be chosen). You could also just specify a particular value of the parameter if you don’t want to tune it, e.g. using set_args(mtry = 4).
Another thing to note is that nothing about this model specification is specific to the diabetes dataset.
Put it all together in a workflow
We’re now ready to put the model and recipe together into a workflow. You initiate a workflow using workflow() (from the workflows package), and then you can add a recipe and add a model to it.
# set the workflow
rf_workflow <- workflow() %>%
# add the recipe
add_recipe(diabetes_recipe) %>%
# add the model
add_model(rf_model)
Note that we still haven’t yet implemented the pre-processing steps in the recipe nor have we fit the model. We’ve just written the framework. It is only when we tune the parameters or fit the model that the recipe and model frameworks are actually implemented.
Tune the parameters
Since we had a parameter that we designated to be tuned (mtry), we need to tune it (i.e. choose the value that leads to the best performance) before fitting our model. If you don’t have any parameters to tune, you can skip this step.

Note that we will do our tuning using the cross-validation object (diabetes_cv). To do this, we specify the range of mtry values we want to try, and then we add a tuning layer to our workflow using tune_grid() (from the tune package). Note that we focus on two metrics: accuracy and roc_auc (from the yardstick package).
# specify which values we want to try
rf_grid <- expand.grid(mtry = c(3, 4, 5))
# extract results
rf_tune_results <- rf_workflow %>%
tune_grid(resamples = diabetes_cv, #CV object
grid = rf_grid, # grid of values to try
metrics = metric_set(accuracy, roc_auc) # metrics we care about
)
## i Creating pre-processing data to finalize unknown parameter: mtry
You can tune multiple parameters at once by providing multiple parameters to the expand.grid() function, e.g. expand.grid(mtry = c(3, 4, 5), trees = c(100, 500)).
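Keep in mind that any parameter you put in the grid also needs to be marked with tune() in the model specification. A sketch of what tuning mtry and trees together might look like (the grid values here are just illustrative):

# sketch: mark both parameters for tuning in the model spec
rf_model_2 <- rand_forest() %>%
  set_args(mtry = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
# and provide a grid that covers both parameters
rf_grid_2 <- expand.grid(mtry = c(3, 4, 5), trees = c(100, 500))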
It’s always a good idea to explore the results of the cross-validation. collect_metrics()
is a really handy function that can be used in a variety of circumstances to extract any metrics that have been calculated within the object it’s being used on. In this case, the metrics come from the cross-validation performance across the different values of the parameters.
# print results
rf_tune_results %>%
collect_metrics()
## # A tibble: 6 x 6
## mtry .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 3 accuracy binary 0.758 10 0.0233
## 2 3 roc_auc binary 0.829 10 0.0155
## 3 4 accuracy binary 0.753 10 0.0222
## 4 4 roc_auc binary 0.829 10 0.0161
## 5 5 accuracy binary 0.751 10 0.0220
## 6 5 roc_auc binary 0.827 10 0.0147
Across both accuracy and AUC, mtry = 3 yields the best performance (just).
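Since collect_metrics() returns an ordinary tibble, you can also compare the candidate values visually with ggplot2. A quick sketch (the column names are taken from the output above):

# plot the mean CV performance across the candidate mtry values
rf_tune_results %>%
  collect_metrics() %>%
  ggplot(aes(x = mtry, y = mean)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ .metric, scales = "free_y")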
Finalize the workflow
We want to add a layer to our workflow that corresponds to the tuned parameter, i.e. sets mtry
to be the value that yielded the best results. If you didn’t tune any parameters, you can skip this step.
We can extract the best value for the accuracy metric by applying the select_best()
function to the tune object.
param_final <- rf_tune_results %>%
select_best(metric = "accuracy")
param_final
## # A tibble: 1 x 1
## mtry
## <dbl>
## 1 3
Then we can add this parameter to the workflow using the finalize_workflow() function.
rf_workflow <- rf_workflow %>%
finalize_workflow(param_final)
Evaluate the model on the test set
Now we’ve defined our recipe, our model, and tuned the model’s parameters, we’re ready to actually fit the final model. Since all of this information is contained within the workflow object, we will apply the last_fit()
function to our workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.
rf_fit <- rf_workflow %>%
# fit on the training set and evaluate on test set
last_fit(diabetes_split)
Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.
rf_fit
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [576/… train/test … <tibble [2 ×… <tibble [0… <tibble [192 ×… <workflo…
This is a really nice feature of tidymodels (and is what makes it work so nicely with the tidyverse), since you can apply all of your usual tidyverse operations to the model object. Truly taking advantage of this flexibility requires proficiency with purrr, but if you don’t want to deal with purrr and list-columns, there are functions that extract the relevant information from the fit object for you, as we will see below.
Since we supplied the train/test object when we fit the workflow, the metrics are evaluated on the test set. Now when we use the collect_metrics()
function (recall we used this when tuning our parameters), it extracts the performance of the final model (since rf_fit
now consists of a single final model) applied to the test set.
test_performance <- rf_fit %>% collect_metrics()
test_performance
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.745
## 2 roc_auc binary 0.834
Overall the performance is very good, with an accuracy of 0.745 and an AUC of 0.834.
You can also extract the test set predictions themselves using the collect_predictions()
function. Note that there are 192 rows in the predictions object below which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).
# generate predictions from the test set
test_predictions <- rf_fit %>% collect_predictions()
test_predictions
## # A tibble: 192 x 6
## id .pred_neg .pred_pos .row .pred_class diabetes
## <chr> <dbl> <dbl> <int> <fct> <fct>
## 1 train/test split 0.994 0.00592 4 neg neg
## 2 train/test split 0.962 0.0384 7 neg pos
## 3 train/test split 0.0648 0.935 12 pos pos
## 4 train/test split 0.713 0.287 19 neg neg
## 5 train/test split 0.596 0.404 21 neg neg
## 6 train/test split 0.535 0.465 25 neg pos
## 7 train/test split 0.475 0.525 27 pos pos
## 8 train/test split 0.720 0.280 29 neg neg
## 9 train/test split 0.939 0.0611 34 neg neg
## 10 train/test split 0.373 0.627 35 pos neg
## # … with 182 more rows
Since this is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.
# generate a confusion matrix
test_predictions %>%
conf_mat(truth = diabetes, estimate = .pred_class)
## Truth
## Prediction neg pos
## neg 106 33
## pos 16 37
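Since the predictions object also contains the class probabilities, you can compute and plot an ROC curve with yardstick. A sketch (note that yardstick treats the first factor level, here “neg”, as the event by default, so we pass it the .pred_neg column):

# sketch: ROC curve from the test set predictions
test_predictions %>%
  roc_curve(truth = diabetes, .pred_neg) %>%
  autoplot()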
We could also plot the distribution of the predicted probability of a positive diagnosis, separated by true diabetes status.
test_predictions %>%
ggplot() +
geom_density(aes(x = .pred_pos, fill = diabetes),
alpha = 0.5)
If you’re comfortable with list-columns, you could also extract the predictions column yourself using pull() (from dplyr). The following code does almost the same thing as collect_predictions(). You could similarly have done this with the .metrics column.
test_predictions <- rf_fit %>% pull(.predictions)
test_predictions
## [[1]]
## # A tibble: 192 x 5
## .pred_neg .pred_pos .row .pred_class diabetes
## <dbl> <dbl> <int> <fct> <fct>
## 1 0.994 0.00592 4 neg neg
## 2 0.962 0.0384 7 neg pos
## 3 0.0648 0.935 12 pos pos
## 4 0.713 0.287 19 neg neg
## 5 0.596 0.404 21 neg neg
## 6 0.535 0.465 25 neg pos
## 7 0.475 0.525 27 pos pos
## 8 0.720 0.280 29 neg neg
## 9 0.939 0.0611 34 neg neg
## 10 0.373 0.627 35 pos neg
## # … with 182 more rows
Fitting and using your final model
The previous section evaluated the model trained on the training data using the testing data. But once you’ve determined your final model, you often want to train it on your full dataset and then use it to predict the response for new data.
If you want to use your model to predict the response for new observations, you need to use the fit()
function on your workflow and the dataset that you want to fit the final model on (e.g. the complete training + testing dataset).
final_model <- fit(rf_workflow, diabetes_clean)
The final_model
object contains a few things including the ranger object trained with the parameters established through the workflow contained in rf_workflow
based on the data in diabetes_clean
(the combined training and testing data).
final_model
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## ● step_normalize()
## ● step_knnimpute()
##
## ── Model ──────────────────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, mtry = ~3, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 768
## Number of independent variables: 8
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1570876
If we wanted to predict the diabetes status of a new woman, we could use the normal predict() function.
For instance, below we define the data for a new woman.
new_woman <- tribble(~pregnant, ~glucose, ~pressure, ~triceps, ~insulin, ~mass, ~pedigree, ~age,
2, 95, 70, 31, 102, 28.2, 0.67, 47)
new_woman
## # A tibble: 1 x 8
## pregnant glucose pressure triceps insulin mass pedigree age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 95 70 31 102 28.2 0.67 47
The predicted diabetes status of this new woman is “negative”.
predict(final_model, new_data = new_woman)
## # A tibble: 1 x 1
## .pred_class
## <fct>
## 1 neg
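If you want the underlying class probabilities rather than the hard classification, you can ask predict() for them with the type argument:

# class probabilities instead of the predicted class
predict(final_model, new_data = new_woman, type = "prob")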
Variable importance
If you want to extract the variable importance scores from your model, as far as I can tell, for now you need to extract the model object from the fit() object (which for us is called final_model). The function that extracts the model is pull_workflow_fit(), and then you need to grab the fit object that the output contains.
ranger_obj <- pull_workflow_fit(final_model)$fit
ranger_obj
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, mtry = ~3, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 768
## Number of independent variables: 8
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1570876
Then you can extract the variable importance from the ranger object itself (variable.importance is a specific object contained within ranger output - this will need to be adapted for the specific object type of other models).
ranger_obj$variable.importance
## pregnant glucose pressure triceps insulin mass pedigree age
## 16.66632 75.52923 17.43526 23.08703 52.40944 41.10770 29.85125 33.08012
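Since this is just a named numeric vector, one way to visualize it is to convert it to a tibble (enframe() from the tibble package is handy here) and plot it. A sketch:

# sketch: plot the variable importance scores
ranger_obj$variable.importance %>%
  enframe(name = "variable", value = "importance") %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "importance (impurity)")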