Nicer Machine Learning with Spark - RFormula

June 22, 2018 - Spark, AWS, EMR
This is part 3 in a series exploring Spark. Part 1 is Setting up a Spark Cluster on AWS. Machine Learning with Spark is part 2.
Getting Started with RFormula

This tutorial will assume you've been following along with the process here: Machine Learning with Spark.
When we last left off, we had just finished running a random forest model on a bunch of data with SparkML. It wasn't a great model, but we'd done quite a bit of cleaning and processing to get there. In one section, we got our data ready to be passed into the model by doing this:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

si = StringIndexer(inputCol='Origin', outputCol='OriginCode')
ohe = OneHotEncoder(inputCol='OriginCode', outputCol='OriginVec')
si2 = StringIndexer(inputCol='UniqueCarrier', outputCol='UniqueCarrierCode')
ohe2 = OneHotEncoder(inputCol='UniqueCarrierCode', outputCol='UniqueCarrierVec')

from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(inputCols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
                                'OriginVec', 'UniqueCarrierVec'],
                     outputCol='features')
```

All of that was necessary to make sure that the data had been properly split into categories, indexed, and then each category had been converted into a stand-alone column so we could work with string data. Today, we're going to focus on a nicer way to handle all that without having to manually dive in and change each column one at a time with labels and one hot encoding.
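To make the indexing and one hot encoding steps concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's implementation: the helper names `index_by_frequency` and `one_hot` and the sample airport codes are made up for illustration. It assumes the usual Spark defaults, where the most frequent category gets index 0 and the last category is dropped from the encoded vector.

```python
from collections import Counter


def index_by_frequency(values):
    """Map each distinct string to an integer index, most frequent first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Turn a category index into a 0/1 vector.

    With drop_last=True the final category is represented by all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Hypothetical sample of the 'Origin' column.
origins = ["SFO", "JFK", "SFO", "ORD", "SFO", "JFK"]

codes = index_by_frequency(origins)
vectors = [one_hot(codes[o], len(codes)) for o in origins]

for origin, vec in zip(origins, vectors):
    print(origin, codes[origin], vec)
```

Each string column needs both steps, which is why the manual version above repeats the same pair of objects for `Origin` and `UniqueCarrier`.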
There's a tool in Spark 2.0 called `RFormula`. It lets us specify the relationship we want to find between our columns, and then behind the scenes it does all the encoding, labeling, and cleaning up for us. It does a lot of the heavy lifting at very little cost. So let's see how we can modify our code to work with `RFormula`. We can start by replacing all of the above code with the following lines:

```python
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import RandomForestRegressor

# The model reads the two columns that RFormula will create.
rf = RandomForestRegressor(featuresCol='features', labelCol='label', numTrees=10)

# "DepDelay ~ ." means: predict DepDelay from every other column.
transformer = RFormula(formula="DepDelay ~ .", featuresCol='features', labelCol='label')

# Fit the formula to the data, then add the 'features' and 'label' columns.
output = transformer.fit(df).transform(df)
model = rf.fit(output)
```

Behind the scenes, `RFormula` says, "oh hey, they want to create a relationship between `DepDelay` and all of the other variables" (the `.` stands for "everything else" here). Then it says, "I should make sure I make a column called features that is the 'everything else' and a column called label that is `DepDelay`." Then it DOES that. We don't have to specify anything: it makes assumptions about how to handle categorical data, does the one hot encoding, everything. At that point, we can just run our data through like normal! The only other change we need to make is to change our prediction code to use the "output" dataframe like so:
```python
predictions = model.transform(output)
```

Everything else stays the same! We load the data like before, we do regression metrics like before... we just replaced all that manual processing with Spark's automated processing methods under RFormula. It's a great tool for making your code more readable, without sacrificing much (if any) speed. I also get the same answers (within tolerance of randomization) as before for modeling outputs.
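As a rough illustration of how a formula like "DepDelay ~ ." maps onto a label and a set of feature columns, here is a small pure-Python sketch. The function `resolve_formula` is hypothetical and only understands `~`, `+`, and `.`, while the real RFormula supports more operators and also handles the encoding. The column list is just the columns named in this post, not necessarily the full dataframe.

```python
def resolve_formula(formula, columns):
    """Split 'label ~ terms' and expand '.' to every non-label column."""
    label, rhs = (part.strip() for part in formula.split("~", 1))
    terms = [term.strip() for term in rhs.split("+")]

    features = []
    for term in terms:
        # '.' means "all columns except the label"; anything else is a column name.
        expanded = [c for c in columns if c != label] if term == "." else [term]
        for col in expanded:
            if col not in features:  # keep order, skip duplicates
                features.append(col)
    return label, features


# Columns mentioned in this post (an illustrative subset).
columns = ["Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
           "Origin", "UniqueCarrier", "DepDelay"]

label, features = resolve_formula("DepDelay ~ .", columns)
print("label:", label)
print("features:", features)
```

Under the same assumptions, a narrower formula such as "DepDelay ~ Origin + UniqueCarrier" would limit the features to those two columns instead of everything else in the dataframe.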