Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often really wasn’t. Often my frustrations stemmed from simple things such as factor variables having different levels in the training data and test data, or a variable having missing values in the test data but not in the training data. I’d write a function that would pre-process the training data, and when I’d try to apply it to the test data, R would cry and yell and just be generally unpleasant.
When you’re learning to code and perform data analysis, it can be overwhelming to figure out how to structure your projects. To help data scientists develop a reproducible and consistent workflow, I’ve put together a short GitHub-based document with some guiding advice: https://github.com/rlbarter/reproducibility-workflow If you’re interested in contributing or improving this document, please get in touch, or even better, submit a pull request (https://github.com/rlbarter/reproducibility-workflow)! The document as of writing is shown below.
A quick useful aside: Using shorthand for functions The _if() scoped variant: perform an operation on variables that satisfy a logical criteria select_if() rename_if() mutate_if() summarise_if() The _at() scoped variant: perform an operation only on variables specified by name Select helpers rename_at() mutate_at() summarise_at() The _all() scoped variant: perform an operation on all variables at once rename_all() mutate_all() summarise_all() Conclusion I often find myself wishing that I could apply the same mutate function to several columns in a data frame at once, such as convert all factors to characters, or do something to all columns that have missing values, or select all variables whose names end with _important.