Using the recipes package for easy pre-processing

Having to apply the same pre-processing steps to training, testing and validation data to do some machine learning can be surprisingly frustrating. But thanks to the recipes R package, it's now super-duper easy. Instead of having five functions and maybe hundreds of lines of code, you can preprocess multiple datasets using a single 'recipe' in fewer than 10 lines of code.

Rebecca Barter

Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often really wasn’t. Often my frustrations stemmed from simple things such as factor variables having different levels in the training data and test data, or a variable having missing values in the test data but not in the training data. I’d write a function that would pre-process the training data, and when I’d try to apply it to the test data, R would cry and yell and just be generally unpleasant.
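As a rough sketch of the idea (not the post's own code), a recipe can be defined and estimated once on the training data and then applied unchanged to any other dataset. The split of the built-in iris data and the choice of step_normalize() are assumptions made purely for illustration.

```r
library(recipes)

# Illustrative split of the built-in iris data (an assumption, not from the post)
set.seed(1)
test_rows <- sample(nrow(iris), 30)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

# Define the pre-processing once: centre and scale every predictor
rec <- recipe(Species ~ ., data = train) %>%
  step_normalize(all_predictors())

# Estimate the means and standard deviations from the training data only
rec_prepped <- prep(rec, training = train)

# Apply the same estimated steps to both datasets
train_baked <- bake(rec_prepped, new_data = NULL)
test_baked  <- bake(rec_prepped, new_data = test)
```

Because the statistics are learned in prep() and reused in bake(), the test data is transformed with the training data's parameters rather than its own.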

mutate_all(), select_if(), summarise_at()... what's the deal with scoped verbs?!

What's the deal with these mutate_all(), select_if() and summarise_at() functions? They seem so useful, but there doesn't seem to be a decent explanation of how to use them anywhere on the internet. Turns out, they're called 'scoped verbs' and hopefully this post will become one of many decent explanations of how to use them!

Rebecca Barter

A quick useful aside: Using shorthand for functions
The _if() scoped variant: perform an operation on variables that satisfy a logical criterion
    select_if()
    rename_if()
    mutate_if()
    summarise_if()
The _at() scoped variant: perform an operation only on variables specified by name
    Select helpers
    rename_at()
    mutate_at()
    summarise_at()
The _all() scoped variant: perform an operation on all variables at once
    rename_all()
    mutate_all()
    summarise_all()
Conclusion

I often find myself wishing that I could apply the same mutate function to several columns in a data frame at once, such as converting all factors to characters, doing something to all columns that have missing values, or selecting all variables whose names end with _important.
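A minimal sketch of the three scoped variants, using the built-in iris data as a stand-in (the dataset and the particular operations are illustrative choices, not taken from the post):

```r
library(dplyr)

# _if(): act only on columns that satisfy a predicate
# (here: convert every factor column to character)
iris_chr <- iris %>% mutate_if(is.factor, as.character)

# _at(): act only on columns picked by name, here via a select helper
petal_means <- iris %>% summarise_at(vars(starts_with("Petal")), mean)

# _all(): act on every column at once (here: lower-case all column names)
iris_lower <- iris %>% rename_all(tolower)
```

Note that since dplyr 1.0.0 the scoped verbs are superseded by across(), although they remain available.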

Getting fancy with ggplot2: code for alternatives to grouped bar charts

In this post I present the code I wrote to produce the figures in my previous post about alternatives to grouped bar charts.

Rebecca Barter

Here I provide the code I used to create the figures from my previous post on alternatives to grouped bar charts. You are encouraged to play with them yourself! The key to creating unique and creative visualizations using libraries such as ggplot (or even just straight SVG) is (1) to move away from thinking of data visualization only as the default plot types (bar plots, boxplots, scatterplots, etc), and (2) to realise that most visualizations are essentially lines and circles that you can arrange however you desire in space.
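To make the "lines and circles" idea concrete, here is a small sketch that is not one of the post's figures: group means from the built-in mtcars data, drawn as points joined by a line instead of as grouped bars. The dataset, variable choices and styling are all assumptions for illustration.

```r
library(ggplot2)

# Mean miles per gallon for each cylinder count and transmission type
means <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)
means$cyl <- factor(means$cyl)
means$am  <- factor(means$am, labels = c("automatic", "manual"))

# One line per cylinder group, with a circle for each transmission type
p <- ggplot(means, aes(x = mpg, y = cyl)) +
  geom_line(aes(group = cyl), colour = "grey60") +
  geom_point(aes(colour = am), size = 3) +
  labs(x = "Mean miles per gallon", y = "Cylinders", colour = "Transmission")

print(p)
```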