NLP Pipeline Management - Taking the Pains out of NLP
Aug 18, 2018 - Python

This post will discuss managing natural language processing. However, it assumes you already have some knowledge about what that is and how it works. I plan to write an "intro to NLP" someday, but it is not this day. You can find an intro here.
The most frustrating part of Natural Language Processing (NLP) is dealing with all the various "valid" combinations that can occur. As an example, I might want to try cleaning the text with a stemmer and with a lemmatizer, all while tying each one to a vectorizer that works by counting up words. Well, that's two possible combinations of objects that I need to create, manage, train, and save for later. If I then want to try both of those combinations with a vectorizer that scales by word occurrence, that's now four combinations. If I then add in trying different topic reducers like LDA, LSA, and NMF, I'm up to 12 total valid combinations that I need to try. If I then combine that with 6 different models... 72 combinations. It can become infuriating quite quickly.
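To get a feel for how quickly this grows, here's a tiny sketch that just enumerates the combinations described above - the option names are placeholders, not real objects:

from itertools import product

cleaners = ['stemmer', 'lemmatizer']                 # 2 options
vectorizers = ['count', 'tfidf']                     # 2 options
topic_reducers = ['LDA', 'LSA', 'NMF']               # 3 options
models = ['logreg', 'naive_bayes', 'svm',
          'random_forest', 'knn', 'xgboost']         # 6 options

combos = list(product(cleaners, vectorizers, topic_reducers, models))
print(len(combos))  # 72 pipelines to build, train, and keep track of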
A pipe for cleaning text data
To fight this issue, I've developed a Python tool that manages the pipeline for a user. The user just needs to open a pipeline object, hand it the various tools that are in this specific version of their pipeline, and then watch it go. Let's look at an example, then we'll examine the code.

from nlp_preprocessor import nlp_preprocessor
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']
nlp = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
nlp.fit(corpus)
nlp.transform(corpus).toarray()
---
> array([[1, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 1, 0, 0, 0],
         [0, 0, 1, 0, 0, 0, 1, 1]])
The nlp_preprocessor class allows the user to add a particular vectorizer, cleaning function, tokenizer, or stemmer, and then the user just needs to call fit or transform, like a normal SkLearn model. Let's examine how this works by looking at the code. (Full code available here: GitHub Project.) Let's start by looking at the class definition.
from sklearn.feature_extraction.text import CountVectorizer
from nlpipe import nlpipe

class nlp_preprocessor(nlpipe):
def __init__(self, vectorizer=CountVectorizer(), tokenizer=None,
cleaning_function=None, stemmer=None):
"""
A class for pipelining our data in NLP problems. The user provides a series of
tools, and this class manages all of the training, transforming, and modification
of the text data.
---
Inputs:
vectorizer: the model to use for vectorization of text data
tokenizer: the tokenizer to use; if None, defaults to splitting on spaces
cleaning_function: how to clean the data; if None, defaults to the built-in cleaning method
stemmer: a function that returns a stemmed version of a token. For NLTK, this
means getting a stemmer class, then providing the stemming function underneath it.
"""
if not tokenizer:
tokenizer = self.splitter
if not cleaning_function:
cleaning_function = self.clean_text
self.stemmer = stemmer
self.tokenizer = tokenizer
self.cleaning_function = cleaning_function
self.vectorizer = vectorizer
self._is_fit = False
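One thing worth calling out from the docstring: the stemmer argument is a plain function, not a stemmer object. With NLTK, for example, that means passing the bound stemming method. Here's a quick sketch; the lemmatizer line assumes the WordNet corpus has already been downloaded via nltk.download('wordnet'):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# pass the stemming/lemmatizing function itself, not the class instance
stem_pipe = nlp_preprocessor(stemmer=PorterStemmer().stem)
lemma_pipe = nlp_preprocessor(stemmer=WordNetLemmatizer().lemmatize)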
You may notice that this class inherits from something called nlpipe - we'll come back to that later when we discuss saving models. Next let's look at the default functions for cleaning and tokenizing.

def splitter(self, text):
"""
Default tokenizer that splits on spaces naively
"""
return text.split(' ')
def clean_text(self, text, tokenizer, stemmer):
"""
A naive function to lowercase all words and clean them quickly.
This is the default behavior if no other cleaning function is specified
"""
cleaned_text = []
for post in text:
cleaned_words = []
for word in tokenizer(post):
low_word = word.lower()
if stemmer:
low_word = stemmer(low_word)
cleaned_words.append(low_word)
cleaned_text.append(' '.join(cleaned_words))
return cleaned_text
def fit(self, text):
"""
Cleans the data and then fits the vectorizer with
the user provided text
"""
clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
self.vectorizer.fit(clean_text)
self._is_fit = True
def transform(self, text, return_clean_text=False):
"""
Cleans any provided data and then transforms the data into
a vectorized format based on the fit function.
If return_clean_text is set to True, it returns the cleaned
form of the text. If it's set to False, it returns the
vectorized form of the data.
"""
if not self._is_fit:
raise ValueError("Must fit the models before transforming!")
clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
if return_clean_text:
return clean_text
return self.vectorizer.transform(clean_text)
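Because the cleaning function is just something we hand in, we can swap in our own, as long as it accepts the same (text, tokenizer, stemmer) arguments and returns a list of cleaned documents. Here's a minimal sketch of what that might look like - the stop word list is purely illustrative and not part of the library:

from nltk.stem import PorterStemmer

def stopword_cleaner(text, tokenizer, stemmer):
    """Example custom cleaner: lowercase, drop a few stop words, then stem."""
    stop_words = {'the', 'a', 'is'}
    cleaned_text = []
    for post in text:
        cleaned_words = []
        for word in tokenizer(post):
            low_word = word.lower()
            if low_word in stop_words:
                continue  # skip stop words entirely
            if stemmer:
                low_word = stemmer(low_word)
            cleaned_words.append(low_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text

nlp = nlp_preprocessor(cleaning_function=stopword_cleaner, stemmer=PorterStemmer().stem)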
We still haven't addressed how to save a trained pipeline for later - the part we promised to come back to. To handle that, let's look at our parent class called nlpipe.
import pickle
class nlpipe:
def __init__(self):
"""
Empty parent class for nlp pipelines that contains
shared file i/o that happens in every class.
"""
pass
def save_pipe(self, filename):
"""
Writes the attributes of the pipeline to a file
allowing a pipeline to be loaded later with the
pre-trained pieces in place.
"""
if type(filename) != str:
raise TypeError("filename must be a string")
pickle.dump(self.__dict__, open(filename+".mdl",'wb'))
def load_pipe(self, filename):
"""
Reads the attributes of a pipeline in from a file,
allowing a pipeline to be loaded later with the
pre-trained pieces in place.
"""
if type(filename) != str:
raise TypeError("filename must be a string")
if filename[-4:] != '.mdl':
filename += '.mdl'
self.__dict__ = pickle.load(open(filename,'rb'))
Python objects keep all of their attributes in a dictionary called __dict__. So if we save the __dict__ to disk, we can re-load it at any time in the future. That's what these two methods do, using the nice pickle library from Python to store the attributes as binary files. This means we can store the attributes in their trained form as well.
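As a quick illustration of the round trip (the filename here is arbitrary, and this reuses the corpus and imports from the example above):

nlp = nlp_preprocessor(stemmer=PorterStemmer().stem)
nlp.fit(corpus)
nlp.save_pipe('my_first_pipe')            # writes my_first_pipe.mdl to disk

# later, possibly in a different session
nlp_restored = nlp_preprocessor()
nlp_restored.load_pipe('my_first_pipe')
nlp_restored.transform(corpus).toarray()  # vectorizer arrives already fitted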
All together, that makes up our full data management pipeline. It's great for keeping the code I actually need very simple. No complicated functions, no tuples storing tons of bits and pieces. Just a class that has the whole pipeline hidden in a single API call to fit or predict.
Adding a Model to the Mix
You may, quite fairly, be wondering where the machine learning is. Fear not, we just didn't build it into the cleaning pipe directly, so that we can attach whatever machine learning model we want later. Let's write another class that expects us to provide a pipeline object and then can do supervised learning. Here's a full example of the code for that, which I'll discuss below.

from nlp_preprocessor import nlp_preprocessor
from nlpipe import nlpipe
class supervised_nlp(nlpipe):
def __init__(self, model, preprocessing_pipeline=None):
"""
A pipeline for doing supervised nlp. Expects a model and creates
a preprocessing pipeline if one isn't provided.
"""
self.model = model
self._is_fit = False
if not preprocessing_pipeline:
self.preprocessor = nlp_preprocessor()
else:
self.preprocessor = preprocessing_pipeline
def fit(self, X, y):
"""
Trains the vectorizer and model together using the
user's input training data.
"""
self.preprocessor.fit(X)
train_data = self.preprocessor.transform(X)
self.model.fit(train_data, y)
self._is_fit = True
def predict(self, X):
"""
Makes a prediction on the data provided by the user, using the
preprocessing pipeline and provided model.
"""
if not self._is_fit:
raise ValueError("Must fit the models before transforming!")
test_data = self.preprocessor.transform(X)
preds = self.model.predict(test_data)
return preds
def score(self, X, y):
"""
Returns the accuracy for the model after using the trained
preprocessing pipeline to prepare the data.
"""
test_data = self.preprocessor.transform(X)
return self.model.score(test_data, y)
Note that the class just needs a model and can optionally take a pre-built nlp_preprocessor object as an input; if one isn't provided, it builds a default preprocessor itself.
Once we have that, we just write some quick fit, predict, and scoring methods that wrap around the normal
sklearn model API. Our new methods just make sure we process the data with our pipeline before handing it
off to SkLearn for machine learning. After that, we can just let SkLearn do the heavy lifting for us.

If this sounds like it may be of use to you, you can find the full implementation of a topic modeler and a supervised learning class on my GitHub. You can find installation instructions in the README on GitHub.
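To tie it all together, here's a hedged end-to-end sketch. The module name supervised_nlp, the tiny labeled corpus, and the choice of MultinomialNB are all just illustrative assumptions:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from nlp_preprocessor import nlp_preprocessor
from supervised_nlp import supervised_nlp  # assumed module name for the class above

# a tiny, made-up labeled corpus purely for illustration
X_train = ['the builder fixed the wall', 'that cartoon was strange',
           'he built a new house', 'what a weird cartoon']
y_train = ['construction', 'tv', 'construction', 'tv']

pipe = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
clf = supervised_nlp(MultinomialNB(), preprocessing_pipeline=pipe)

clf.fit(X_train, y_train)
clf.predict(['a very strange cartoon'])
clf.score(X_train, y_train)        # accuracy, here just on the training set
clf.save_pipe('nb_text_model')     # save/load comes along for free from nlpipe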
Hopefully that will help make your NLP more manageable. Good luck!