What’s the fastest way to search and replace strings in a data frame?
Posted on July 22, 2022 by Econometrics and Free Software in R bloggers | 0 Comments
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]
I’ve tweeted this:
Just changed like 100 grepl calls to stringi::stri_detect and my pipeline now runs 4 times faster #RStats
— Bruno Rodrigues (@brodriguesco) July 20, 2022
Much discussion ensued. Some people were surprised because, in their experience, grepl() was faster than the alternatives, especially with grepl()’s perl parameter set to TRUE. My use case was quite simple: I have a relatively large data set (half a million lines) with one column containing several misspellings of city names. So I painstakingly wrote some code to correct the spelling of the major cities (those that came up often enough to matter; minor cities were set to “Other”. Sorry, Wiltz!).
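To make the use case concrete, here is a minimal sketch of that kind of cleaning step, on invented data (the city names and patterns below are hypothetical, not my actual work data). Note that the argument order is swapped between the two functions: grepl(pattern, x) versus stri_detect_regex(str, pattern).

```r
library(stringi)

# Toy column of city names with misspellings (invented data)
cities <- c("Luxembourg", "Luxmbourg", "Esch", "Esh", "Wiltz")

# Correct the major cities, lump everything else into "Other".
# stri_detect_regex(cities, "Lux") plays the role of grepl("Lux", cities).
cleaned <- ifelse(stri_detect_regex(cities, "Lux"), "Luxembourg",
           ifelse(stri_detect_regex(cities, "Es(c)?h"), "Esch",
                  "Other"))

cleaned
```

Replacing each grepl() call with the equivalent stri_detect_regex() call (or stri_detect_fixed() when the pattern is a literal string) is the mechanical change the tweet above refers to.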
So in this short blog post, I benchmark some code to see whether what I did the other day was a fluke. Maybe something weird with my R installation on my work laptop, which runs Windows 10, somehow made stri_detect() run faster than grepl()? I don’t even know if something like that is possible. I’m writing these lines on my Linux machine, unlike the code I run at work, so if I find differences, they could be due to the different operating systems. I don’t want to deal with Windows on my days off (for my blood pressure’s sake), so I’m not running this benchmark on my work laptop. That part we’ll never know.
Anyways, let’s start by getting some data. I’m not commenting on the code below, because that’s not the point of this post.
library(dplyr)
library(stringi)
library(stringr)
library(re2)

adult