Nicer Machine Learning with Spark - RFormula

June 22, 2018 - Spark, AWS, EMR

This is part 3 in a series exploring Spark. Part 1 is Setting up a Spark Cluster on AWS. Machine Learning with Spark is part 2.

Getting Started with RFormula

This tutorial will assume you've been following along with the process here: Machine Learning with Spark

When we last left off, we had just finished running a random forest model on a bunch of data with SparkML. It wasn't a great model, but we'd done quite a bit of cleaning and processing to get there. In one section, we got our data ready to be passed into the model by doing this:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
si = StringIndexer(inputCol='Origin', outputCol='OriginCode')
ohe = OneHotEncoder(inputCol='OriginCode',outputCol='OriginVec')
si2 = StringIndexer(inputCol='UniqueCarrier', outputCol='UniqueCarrierCode')
ohe2 = OneHotEncoder(inputCol='UniqueCarrierCode',outputCol='UniqueCarrierVec')
          
from pyspark.ml.feature import VectorAssembler
va = VectorAssembler(inputCols=['Year','Month','DayofMonth','DayOfWeek','DepTime',\
'OriginVec','UniqueCarrierVec'], outputCol='features')
All of that was necessary to make sure that the data had been properly split into categories, indexed, and then each category had been converted into a stand-alone column so we could work with string data. Today, we're going to focus on a nicer way to handle all that without having to manually dive in and change each column one at a time with labels and one hot encoding.

There's a tool in Spark 2.0 called RFormula. It allows us to specify the relationship we want to find between our columns, and then behind the scenes it does all the encoding, labeling, and cleaning up for us. It's a great tool in the sense that it does a lot of the heavy lifting for us, at very little cost. So let's see how we can modify our code to work with RForumla. We can start by replacing all of the above code with the following lines:
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol='features', labelCol='label', numTrees=10)

transformer = RFormula(formula="DepDelay ~ .", featuresCol='features', labelCol='label')
output = transformer.fit(df).transform(df)
model = rf.fit(output)
Behind the scenes, RFormula says, "oh hey, they want to create a relationship between DepDelay and all of the other variables (the . stands for "everything else" here). Then it says, I should make sure I make a column called features that is the "everything else" and a column called "label" that is "DepDelay". Then it DOES that. We don't have to specify anything, it makes assumptions about how to handle categorical data, does the one hot encoding, everything. At that point, we can just run out data through like normal! The only other change we need to make is to change our prediction code to use the "output" dataframe like so:
predictions = model.transform(output)
Everything else stays the same! We load the data like before, we do regression metrics like before... we just replaced all that manual processing with Spark's automated processing methods under RFormula. It's a great tool for making your code more readable, without sacrificing much (if any) speed. I also get the same answers (within tolerance of randomization) as before for modeling outputs.

IF YOU STARTED A CLUSTER FOR THIS TUTORIAL:
MAKE SURE YOU SHUT DOWN YOUR CLUSTER. THEY ARE EXPENSIVE