★❤✰ Vicki Boykis ★❤✰
Should Spark have an API for R?
Jul 20 2017
Goya, Fire at Night
As a consultant, I’m often asked to make tooling choices in big data projects: Should we use Python or R? What’s the best NoSQL datastore right now? How do we set up a data lake?
While I do keep up with data news, I don’t know everything that’s going on in the big data ecosystem, which now sprawls to over 2,000 companies and products, and neither does anyone else working in the space.
In an effort to make all of these moving parts accessible to different audiences, developers expose their tools through APIs in the languages the majority of data scientists already use, namely R and Python.
A number of products are built on the success of tooling for these two languages. Microsoft, for example, offers R and Jupyter notebooks in Azure Machine Learning. Amazon offers Jupyter notebooks on EMR. And, in my favorite example of how everyone is trying to get on the data science bandwagon, SAS now lets you run R inside SAS.
One project that has been particularly good at staying ahead of the curve is Spark, which began in Scala and quickly added a fully-functioning Python API when data scientists without JVM experience started moving to the platform. It’s currently under the aegis of Databricks, a company started by the original Spark project developers. It provides, of course, its own notebook product.
The motivation behind SparkR is great: if you know R, you don’t have to switch to Python or Scala to create your models, while still benefiting from the processing power of Spark. But SparkR, as it stands now, is not (yet) a fully-featured mirror of R functionality, and I’ve come across some key features of the platform that make me believe SparkR is not a good fit for the Spark programming paradigm.
The SparkR API: DataFrames are confusing
The first problem is data organization. Data frames have become a common element across all languages that deal with data, mimicking the functionality of a table in a SQL database. Python, R, and SAS all have them, and Spark has followed suit.
In general, tables are good. Tables are the way humans read data easiest, and the way we manipulate it. In relational environments, tables are pretty straightforward. But the issues start when imperative programming languages try to mimic declarative data structures, particularly when these data structures are distributed across many machines and linked together by complicated sets of instructions.
In Spark, there is a big difference between SparkDataFrames and R’s data.frames.
Here’s a high-level overview of the differences:
To the naked eye, they look similar, and you can run similar operations on them. There is nothing in the way they are named to tell you what kind of object you’re working with. The difference between them became so confusing that SparkR renamed DataFrames to SparkDataFrames, in part because the old name clashed with that of a third package, S4Vectors.
But, to the person coming from base R, there is a real clash of concepts here. Looking something up in a data.frame is like looking up a single word in a dictionary. With a SparkDataFrame, the definition of your word is spread out across all of the volumes of the Encyclopedia Britannica, and you need to search through each volume, pull out the piece you want, and then assemble it back into the definition. Also, the words are not in alphabetical order.
It helps to examine each structure’s internals. To start with, let’s look at R data.frames.
An R dataframe is an in-memory object created out of two-dimensional lists of vectors of equal lengths, where each column contains values of a single variable and rows contain a single set of observations.
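To make that concrete, here’s a minimal sketch of building a data.frame by hand from equal-length vectors (the toy values are mine, not from the census data used below):

    # two equal-length vectors, one per column
    value <- c(0, 2500, 5000)
    count <- c(2588, 971, 1677)

    # each row holds one observation; the whole object lives in local memory
    df <- data.frame(value = value, count = count)

    class(df)   # "data.frame"
    nrow(df)    # 3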
Working in RStudio, we can import one of the default datasets:
    data()                # display available datasets attached to R
    data(income)          # use US family income from US census 2008
    help(income)          # find out more about the income dataset
    class(income)         # check type
    [1] "data.frame"
    class(income$value)   # check individual column (vector) type
    [1] "integer"
    str(income)           # check values
    'data.frame': 44 obs. of 4 variables:
     $ value: int 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 ...
     $ count: int 2588 971 1677 3141 3684 3163 3600 3116 3967 3117 ...
     $ mean : int 298 3792 6261 8705 11223 13687 16074 18662 21064 23698 ...
     $ prop : num 0.02209 0.00829 0.01431 0.0268 0.03144 ...
When you run operations on an R data.frame, you’re processing everything on the machine that is running the R process, in this case, my MacBook:
    mbp-vboykis:~ vboykis$ ps
      PID TTY           TIME CMD
    28239 ttys000    0:00.18 ~/R
A SparkDataFrame is a very different animal. Databricks describes it as a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood.
First, Spark is distributed, which means processing, instead of happening locally, is spread across multiple nodes, called workers, and, on those nodes, across multiple processes, called executors.
The process that kickstarts all of this is the driver. The driver is the program that starts the Spark session and creates the execution plan to submit to the master, which sends it off to the workers. In simpler terms, the driver is the process that kicks off the main() method.
The driver creates a SparkContext, which is the entry point into a Spark program, and splits up code execution on the executors, which are located on separate logical nodes, often also separate physical servers.
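In SparkR terms, that entry point is what you get when you start a session. A minimal sketch, with a placeholder master URL and app name of my own choosing:

    library(SparkR)

    # starting the session spins up the driver-side JVM and hands R a
    # connection to the SparkSession/SparkContext; "local[*]" is just a stand-in
    sparkR.session(master = "local[*]", appName = "sparkr-dataframe-demo")

    # ... work with SparkDataFrames here ...

    sparkR.session.stop()   # tear the session down when finished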
You write some SparkR code on the edge node of a Hadoop cluster. That code is translated from SparkR into a JVM process on the driver (since Spark is Scala under the hood). The resulting instructions then get pushed to JVM processes listening on each worker node, and in the process the data is serialized and split up across machines.
Only at this point, once Spark has been initialized and there is a network connection, is a SparkDataFrame created. Really, a SparkDataFrame is a view of another Spark object, called a Dataset, which is a collection of serialized JVM objects.
The Dataset, the original set of instructions created by translating the R code to Spark, is then formatted by Spark SQL in such a way as to mimic a table. There is a great, detailed talk about this process by one of the SparkR committers.
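You can see the table mimicry directly: a SparkDataFrame can be registered as a temporary view and queried with SQL. A rough sketch, assuming a running sparkR.session() and the income data.frame from earlier (the view name and query are mine):

    # ship the local data.frame out to the cluster as a SparkDataFrame
    sdf <- createDataFrame(income)

    # register it under a name Spark SQL can resolve like a table
    createOrReplaceTempView(sdf, "income_view")

    # Spark SQL plans this query against the same underlying Dataset
    high_income <- sql("SELECT value, prop FROM income_view WHERE value > 20000")
    head(high_income)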
Additionally, for a SparkDataFrame to compute correctly, Spark has to implement logic across servers. To make this work, the SparkDataFrame is not really the data, but a set of instructions for how to access the data and process it across nodes: the data lineage.
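In code, that means transformations on the SparkDataFrame from the sketch above only extend the lineage; nothing is computed until you ask for results (explain() and collect() are standard SparkR calls, the variable names are mine):

    # these calls only add steps to the lineage; no data moves yet
    filtered <- filter(sdf, sdf$value > 20000)
    trimmed  <- select(filtered, "value", "prop")

    explain(trimmed)   # print the plan Spark has accumulated so far
    collect(trimmed)   # only now does work run on the executors, returning an R data.frame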
Here’s a really good visual, from Databricks, of all the setup that needs to happen when you invoke and make changes to a SparkDataFrame:
There are a couple of major implications of this level of complexity. The first is that you access SparkDataFrames differently than you do data.frames.
Code here.
Although Spark follows a row-based paradigm, you can’t access specific rows of a SparkDataFrame in an ordered manner.
    head(income)
      value count  mean        prop
    1     0  2588   298 0.022085115
    2  2500   971  3792 0.008286185
    3  5000  1677  6261 0.014310950
    4  7500  3141  8705 0.026804229
    5 10000  3684 11223 0.031438007
    6 12500  3163 13687 0.026991970

    class(income)
    [1] "data.frame"

    income[row.names(income) == 2, ]
      value count mean        prop
    2  2500   971 3792 0.008286185
And now the SparkR equivalent:
sdf