Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often wasn’t. My frustrations usually stemmed from simple things, such as factor variables having different levels in the training and test data, or a variable having missing values in the test data but not in the training data. I’d write a function to pre-process the training data, and when I tried to apply it to the test data, R would cry and yell and just be generally unpleasant.
Note: This tutorial was based on an older version of the abalone data that had a binary old variable rather than a numeric age variable. It has been lightly modified so that it uses a manually created old variable (whether or not the abalone is older than 10) and ignores the numeric age variable. Materials prepared by Rebecca Barter. Package developed by Max Kuhn. An interactive Jupyter Notebook version of this tutorial can be found at https://github.