Rebecca Barter

AI Tutorial: Using Text Embeddings to Label Synthetic Doctor’s Notes Generated with ChatGPT

data science

generative AI

ChatGPT

OpenAI

machine learning

I’ve been playing around with OpenAI and ChatGPT in my research, and I thought I’d put together a short tutorial that demonstrates using the ChatGPT API to generate synthetic doctor’s notes, and then using OpenAI’s text embedding models to label the notes according to whether they involve a chronic or acute condition. And yes, I’m fully aware that what I write here will probably be out of date in about 3 hours.

An introduction to Python for R Users

python

I have a confession to make: I am now a Python user. Don’t judge me, join me! In this post, I introduce Python for data analysis from the perspective of an R (tidyverse) user. This post is a must-read if you are an R user hoping to dip your toes in the Python pool.

A list of public data repositories

data

datasets

Sources for lesser-known and better-known public datasets.

Thanks, Quarto, for saving my blog!

blog

My old blog broke so I decided to re-make my entire website using quarto, which was disturbingly easy. Welcome to my new and improved blog!

It’s time for statistics departments to start supporting their applied students

statistics

Statistics departments are failing their applied students. In this post, I have a lot of opinions and give two pieces of advice: statistics departments need to start supporting their applied students, and they need to hire applied faculty.

Across (dplyr 1.0.0): applying dplyr functions simultaneously across multiple columns

R

tidyverse

dplyr

With the introduction of dplyr 1.0.0, there are a few new features, the biggest of which is across(), which supersedes the scoped versions of dplyr functions.

Tidymodels: tidy machine learning in R

R

tidyverse

machine learning

tidymodels

caret

recipes

parsnip

tune

rsample

The tidyverse’s take on machine learning is finally here. Tidymodels forms the basis of tidy machine learning, and this post provides a whirlwind tour to get you started.

5 useful R tips from rstudio::conf(2020) - tidy eval, piping, conflicts, bar charts and colors

R

rstudioconf

tidyverse

ggplot2

tidyeval

visualization

Last week I had the pleasure of attending rstudio::conf(2020) in San Francisco. Throughout the course of the week I met many wonderful people and learnt many things. This post covers some of the little tips and tricks that I learnt throughout the conference.

Becoming an R blogger

R

rstudioconf

blog

In this post I discuss why I became an R blogger, why you should too, and some tips and tricks to get you started. This post is based on my rstudio::conf(2020) talk called ‘Becoming an R blogger’.

Learn to purrr

R

purrr

tidyverse

Purrr is the tidyverse’s answer to apply functions for iteration. It’s one of those packages that you might have heard of, but that seemed too complicated to sit down and learn. Starting with map functions, and taking you on a journey that will harness the power of the list, this post will have you purrring in no time.

Transitioning into the tidyverse (part 1)

R

tidyverse

dplyr

ggplot2

pipes

This post walks through what base R users need to know for their transition into the tidyverse. Part 1 focuses on piping and the ‘base’ packages, dplyr and ggplot2.

Transitioning into the tidyverse (part 2)

R

tidyverse

tidyr

purrr

readr

tibbles

lubridate

forcats

stringr

This post walks through what base R users need to know for their transition into the tidyverse. Part 2 focuses on the more specialized R packages tidyr, purrr, readr, lubridate, forcats, etc.

Using the recipes package for easy pre-processing

R

workflow

machine learning

Having to apply the same pre-processing steps to training, testing and validation data to do some machine learning can be surprisingly frustrating. But thanks to the recipes R package, it’s now super-duper easy. Instead of having five functions and maybe hundreds of lines of code, you can preprocess multiple datasets using a single ‘recipe’ in fewer than 10 lines of code.

A quick guide to developing a reproducible and consistent data science workflow

data science

workflow

reproducibility

When you’re learning to code and perform data analysis, it can be overwhelming to figure out how to structure your projects. To help data scientists develop a reproducible and consistent workflow, I’ve put together a short document with some guiding advice.

mutate_all(), select_if(), summarise_at()… what’s the deal with scoped verbs?!

dplyr

R

tidyverse

What’s the deal with these mutate_all(), select_if(), summarise_at() functions? They seem so useful, but there doesn’t seem to be a decent explanation of how to use them anywhere on the internet. Turns out, they’re called ‘scoped verbs’ and hopefully this post will become one of many decent explanations of how to use them!

Which hypothesis test should I use? A flowchart

statistics

hypothesis testing

A flowchart to decide what hypothesis test to use.

Alternatives to grouped bar charts

R

visualization

One of the most common chart types that is simultaneously the most difficult to read is the grouped bar chart. Fortunately, there exist several substantially more effective alternatives that convey the same information without overwhelming our visual cognition abilities.

Getting fancy with ggplot2: code for alternatives to grouped bar charts

R

visualization

In this post I present the code I wrote to produce the figures in my previous post about alternatives to grouped bar charts.

Understanding Instrumental Variables

causal inference

statistics

Instrumental variables is one of the most mystical concepts in causal inference. For some reason, most of the existing explanations are overly complicated and focus on specific nuanced aspects of generating IV estimates without really providing the intuition for why it makes sense. In this post, you will not find too many technical details, but rather a narrative introducing instruments and why they are useful.

ggplot2: Mastering the basics

R

visualization

Making graphs in R with ggplot2 is easy! This post will cover the basics to get you started on your ggplot2 adventures.

A basic tutorial of caret: the machine learning package in R

R

machine learning

R has a large number of packages for machine learning (ML), which is great, but also quite frustrating since each package was designed independently and has very different syntax, inputs and outputs. Caret unifies these packages into a single package with consistent syntax, saving everyone a lot of frustration and time!

A Basic Data Science Workflow

R

workflow

Developing a clean and easy analysis workflow takes a really, really long time. In this post, I outline the workflow that I have developed over the last few years.

Confounding in causal inference: what is it, and what to do about it?

causal inference

statistics

An introduction to the field of causal inference and the issues surrounding confounding.

The intuition behind inverse probability weighting in causal inference

causal inference

statistics

Removing confounding can be done via a variety of methods, including IP-weighting. This post provides a summary of the intuition behind IP-weighting.

Migrating from GitHub Pages to Netlify: how and why?

blog

Sorry GitHub Pages, but we need to break up. I’ve found a new web hosting service.

Coolors: choosing color schemes

visualization

R

A really cool website for choosing color palettes.

Interactive visualization in R

visualization

R

interactivity

Learn about creating interactive visualizations in R.

Docathon: A Week of Documentation

documentation

R

communication

We’re hosting a week-long docathon over at BIDS.

ANOVA

statistics

anova

A bunch of statisticians met to learn about ANOVA, a method that they’re supposed to already know about. Here is my summary.

superheat 0.1.0

R

superheat

First version of superheat now available on CRAN.

How to Present Good

communication

This week I was asked to give a presentation on presenting. Here lies a summary of my efforts.

Superheat: a simple example

R

superheat

A simple example of using superheat to create beautiful heatmaps.