NLP Pipeline Management - Taking the Pains out of NLP

Aug 18, 2018 - Python
This post will discuss managing natural language processing. However, it assumes you already have some knowledge about what that is and how it works. I plan to write an "intro to NLP" someday, but it is not this day. You can find an intro here.
The most frustrating part of Natural Language Processing (NLP) is dealing with all the various "valid" combinations that can occur. As an example, I might want to try cleaning the text with a stemmer and with a lemmatizer, each paired with a vectorizer that works by counting up words. That's two possible combinations of objects that I need to create, manage, train, and save for later. If I then want to try both of those cleaners with a vectorizer that scales by word occurrence, that's now four combinations. If I then add in trying different topic reducers like LDA, LSA, and NMF, I'm up to 12 total valid combinations that I need to try. If I then combine that with 6 different models... 72 combinations. It can become infuriating quite quickly.
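That multiplication can be sketched with `itertools.product`, which enumerates every valid pairing. The option names below are placeholders for the pieces described above, not real classes:

```python
from itertools import product

# Placeholder option lists standing in for actual pipeline pieces
cleaners = ['stemmer', 'lemmatizer']
vectorizers = ['count', 'tfidf']
reducers = ['LDA', 'LSA', 'NMF']
models = ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']

# Every valid pipeline is one tuple from the Cartesian product
combos = list(product(cleaners, vectorizers, reducers, models))
print(len(combos))  # 2 * 2 * 3 * 6 = 72
```

Each tuple in `combos` is one full pipeline that would need its own training and bookkeeping.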
A pipe for cleaning text data

To fight this issue, I've developed a Python tool that manages the pipeline for a user. The user just needs to open a pipeline object, hand it the various tools that are in this specific version of their pipeline, and then watch it go. Let's look at an example, then we'll examine the code.
```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

from nlp_preprocessor import nlp_preprocessor

corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']

nlp = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
nlp.fit(corpus)
nlp.transform(corpus).toarray()
```

```
array([[1, 1, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 1]])
```
The nlp_preprocessor class allows the user to add a particular vectorizer, cleaning function, tokenizer, or stemmer, and then the user just needs to call fit or transform, like a normal sklearn model. Let's examine how this works by looking at the code. (Full code available here: GitHub Project.) Let's start by looking at the class definition.
```python
class nlp_preprocessor(nlpipe):

    def __init__(self, vectorizer=CountVectorizer(), tokenizer=None,
                 cleaning_function=None, stemmer=None):
        """
        A class for pipelining our data in NLP problems. The user provides
        a series of tools, and this class manages all of the training,
        transforming, and modification of the text data.
        ---
        Inputs:
        vectorizer: the model to use for vectorization of text data
        tokenizer: the tokenizer to use; if None, defaults to split on spaces
        cleaning_function: how to clean the data; if None, defaults to the
            built-in class method
        stemmer: a function that returns a stemmed version of a token.
            For NLTK, this means getting a stemmer class, then providing
            the stemming function underneath it.
        """
        if not tokenizer:
            tokenizer = self.splitter
        if not cleaning_function:
            cleaning_function = self.clean_text
        self.stemmer = stemmer
        self.tokenizer = tokenizer
        self.cleaning_function = cleaning_function
        self.vectorizer = vectorizer
        self._is_fit = False
```
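One subtlety worth knowing about this constructor: a default argument like `vectorizer=CountVectorizer()` is evaluated once, when the class is defined, so every instance created without an explicit vectorizer shares the same object. A minimal sketch of that behavior, using a stand-in class so it runs without sklearn:

```python
class FakeVectorizer:
    """Stand-in for CountVectorizer, just to illustrate the shared default."""
    pass

class pipe_demo:
    def __init__(self, vectorizer=FakeVectorizer()):
        # The default FakeVectorizer() above was built once, at class
        # definition time, and is reused for every no-argument call.
        self.vectorizer = vectorizer

a = pipe_demo()
b = pipe_demo()
print(a.vectorizer is b.vectorizer)  # True: both instances share one object
```

In practice this means fitting one pipeline's default vectorizer also fits the other's; passing a fresh `CountVectorizer()` explicitly per instance (or using the None-default pattern the tokenizer already uses) sidesteps it.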
You may have noticed the class inherits from nlpipe - we'll come back to that later when we discuss saving models. Next, let's look at the default functions for cleaning and tokenizing.
```python
def splitter(self, text):
    """
    Default tokenizer that splits on spaces naively
    """
    return text.split(' ')

def clean_text(self, text, tokenizer, stemmer):
    """
    A naive function to lowercase all words and clean them quickly.
    This is the default behavior if no other cleaning function is
    specified.
    """
    cleaned_text = []
    for post in text:
        cleaned_words = []
        for word in tokenizer(post):
            low_word = word.lower()
            if stemmer:
                low_word = stemmer(low_word)
            cleaned_words.append(low_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text
```
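To see what the default cleaner actually produces, here is the same logic as standalone functions, with a toy suffix-chopping stemmer standing in for NLTK's PorterStemmer so the sketch runs on its own:

```python
def splitter(text):
    """Naive tokenizer: split on spaces."""
    return text.split(' ')

def clean_text(text, tokenizer, stemmer):
    """Lowercase every word and optionally stem it, post by post."""
    cleaned_text = []
    for post in text:
        cleaned_words = []
        for word in tokenizer(post):
            low_word = word.lower()
            if stemmer:
                low_word = stemmer(low_word)
            cleaned_words.append(low_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text

# Toy stemmer: chop a trailing "ing" (illustration only, not Porter stemming)
toy_stem = lambda w: w[:-3] if w.endswith('ing') else w

print(clean_text(['BOB is BUILDING'], splitter, toy_stem))
# ['bob is build']
```

The cleaned output stays a list of strings, which is exactly what a vectorizer's fit method expects downstream.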
Nothing super fancy in those defaults. Next come fit and transform, which grab the pieces we've already built and stack them into a single pipeline for the data to flow through. At this point we can create a bunch of complicated NLP pipelines by just invoking a class and sticking in the pieces we want. The only other behavior that might be handy is the ability to save and load these pipelines without having to re-train.
```python
def fit(self, text):
    """
    Cleans the data and then fits the vectorizer with
    the user-provided text
    """
    clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
    self.vectorizer.fit(clean_text)
    self._is_fit = True

def transform(self, text, return_clean_text=False):
    """
    Cleans any provided data and then transforms the data into a
    vectorized format based on the fit function. If return_clean_text
    is set to True, it returns the cleaned form of the text. If it's
    set to False, it returns the vectorized form of the data.
    """
    if not self._is_fit:
        raise ValueError("Must fit the models before transforming!")
    clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
    if return_clean_text:
        return clean_text
    return self.vectorizer.transform(clean_text)
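Note that fit and transform only assume the vectorizer exposes fit and transform methods, so anything with that interface can be dropped in - not just sklearn's vectorizers. A minimal hand-rolled vectorizer (hypothetical, for illustration) makes the duck typing concrete:

```python
class TinyCountVectorizer:
    """A bare-bones vectorizer exposing the same fit/transform interface."""

    def fit(self, corpus):
        # Vocabulary is every unique token, in sorted order for stable columns
        self.vocab_ = sorted({w for doc in corpus for w in doc.split()})
        return self

    def transform(self, corpus):
        # One row per document, one count column per vocabulary word
        return [[doc.split().count(w) for w in self.vocab_] for doc in corpus]

vec = TinyCountVectorizer()
vec.fit(['bob the builder', 'the thing'])
print(vec.transform(['the the bob']))  # [[1, 0, 2, 0]]
```

Anything shaped like this could be handed to nlp_preprocessor as the vectorizer argument, since the pipeline never calls anything beyond fit and transform on it.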
To handle that, let's look at our parent class, nlpipe.

This class is basically just for file I/O. It has two methods, a save and a load. Both take advantage of a nice behavior of Python classes: all of a class's attributes are stored in a single dictionary, called __dict__.
```python
import pickle

class nlpipe:

    def __init__(self):
        """
        Empty parent class for NLP pipelines that contains the shared
        file I/O that happens in every class.
        """
        pass

    def save_pipe(self, filename):
        """
        Writes the attributes of the pipeline to a file, allowing a
        pipeline to be loaded later with the pre-trained pieces in place.
        """
        if type(filename) != str:
            raise TypeError("filename must be a string")
        pickle.dump(self.__dict__, open(filename + ".mdl", 'wb'))

    def load_pipe(self, filename):
        """
        Reads the attributes of a saved pipeline from a file, restoring
        the pre-trained pieces in place.
        """
        if type(filename) != str:
            raise TypeError("filename must be a string")
        if filename[-4:] != '.mdl':
            filename += '.mdl'
        self.__dict__ = pickle.load(open(filename, 'rb'))
```
Since everything lives in __dict__, saving it to disk means we can re-load the pipeline at any time in the future. That's what these two methods do, using the nice pickle library from Python to store the attributes as binary files. This means we can store the attributes in their trained form as well.
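The round trip can be demonstrated in isolation: dump __dict__ to a file, wipe an attribute on a fresh instance, then restore the whole dictionary. This sketch uses a toy class and a temp file rather than a real pipeline:

```python
import os
import pickle
import tempfile

class Toy:
    def __init__(self):
        self.weights = [1, 2, 3]  # pretend this was learned during training
        self._is_fit = True

path = os.path.join(tempfile.gettempdir(), 'toy_pipe.mdl')

original = Toy()
with open(path, 'wb') as f:
    pickle.dump(original.__dict__, f)   # every attribute saved at once

restored = Toy()
restored.weights = None                 # wipe it to prove the load works
with open(path, 'rb') as f:
    restored.__dict__ = pickle.load(f)  # trained attributes come back

print(restored.weights)  # [1, 2, 3]
```

Because the whole attribute dictionary is replaced in one assignment, a trained vectorizer, stemmer, and fit flag all come back together, which is what lets a loaded pipeline skip re-training.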
All together, that makes up our full data management pipeline. It's great for keeping the code I actually need very simple. No complicated functions, no tuples storing tons of bits and pieces. Just a class that has the whole pipeline hidden in a single API call to fit or predict.
Adding a Model to the Mix

You may, quite fairly, be wondering where the machine learning is. Fear not; we just didn't build it into the cleaning pipe directly, so that we can attach whatever machine learning model we want later. Let's write another class that expects us to provide a pipeline object and can then do supervised learning. Here's a full example of the code for that, which I'll discuss below.
This wraps around the processing pipeline by expecting an nlp_preprocessor object as an input.
```python
from nlp_preprocessor import nlp_preprocessor
from nlpipe import nlpipe

class supervised_nlp(nlpipe):

    def __init__(self, model, preprocessing_pipeline=None):
        """
        A pipeline for doing supervised nlp. Expects a model and creates
        a preprocessing pipeline if one isn't provided.
        """
        self.model = model
        self._is_fit = False
        if not preprocessing_pipeline:
            self.preprocessor = nlp_preprocessor()
        else:
            self.preprocessor = preprocessing_pipeline

    def fit(self, X, y):
        """
        Trains the vectorizer and model together using the user's
        input training data.
        """
        self.preprocessor.fit(X)
        train_data = self.preprocessor.transform(X)
        self.model.fit(train_data, y)
        self._is_fit = True

    def predict(self, X):
        """
        Makes a prediction on the provided data using the preprocessing
        pipeline and the provided model.
        """
        if not self._is_fit:
            raise ValueError("Must fit the models before predicting!")
        test_data = self.preprocessor.transform(X)
        preds = self.model.predict(test_data)
        return preds

    def score(self, X, y):
        """
        Returns the accuracy for the model after using the trained
        preprocessing pipeline to prepare the data.
        """
        test_data = self.preprocessor.transform(X)
        return self.model.score(test_data, y)
```
Once we have that, we just write some quick fit, predict, and score methods that wrap around the normal sklearn model API. Our new methods simply make sure we process the data with our pipeline before handing it off to sklearn for machine learning. After that, we can let sklearn do the heavy lifting for us.
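As with the vectorizer, supervised_nlp only assumes the model exposes fit, predict, and score - which is why any sklearn estimator slots in. A hand-rolled stand-in makes the required interface explicit; this hypothetical majority-class model is just an illustration, not part of the library:

```python
class MajorityModel:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        # Remember whichever label appeared most often in training
        self.label_ = max(set(y), key=list(y).count)

    def predict(self, X):
        # One prediction per input row, always the majority label
        return [self.label_] * len(X)

    def score(self, X, y):
        # Plain accuracy: fraction of predictions matching the truth
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

model = MajorityModel()
model.fit([[0], [1], [2]], ['spam', 'spam', 'ham'])
print(model.predict([[9]]))                       # ['spam']
print(model.score([[0], [1]], ['spam', 'ham']))   # 0.5
```

Anything with this trio of methods could be handed to supervised_nlp as the model argument, since the wrapper never calls anything else on it.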
If this sounds like it may be of use to you, you can find the full implementation, including a topic modeler and a supervised learning class, on my GitHub. Installation instructions are in the README there.
Hopefully that will help make your NLP more manageable. Good luck!