NLP Pipeline Management - Taking the Pains out of NLP

Aug 18, 2018 - Python

This post will discuss managing natural language processing pipelines. It assumes you already have some knowledge about what NLP is and how it works. I plan to write an "intro to NLP" someday, but it is not this day. In the meantime, you can find an intro here.

The most frustrating part of Natural Language Processing (NLP) is dealing with all the various "valid" combinations that can occur. As an example, I might want to try cleaning the text with a stemmer and with a lemmatizer, each feeding into a vectorizer that works by counting up words. That's two possible combinations of objects that I need to create, manage, train, and save for later. If I then want to try both of those cleaners with a vectorizer that scales by word occurrence, that's now four combinations. If I add in trying different topic reducers like LDA, LSA, and NMF, I'm up to 12 total valid combinations that I need to try. If I then combine that with 6 different models... 72 combinations. It can become infuriating quite quickly.
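
Just to make that combinatorial explosion concrete, here's a tiny sketch that counts those options out. The labels are placeholders for illustration, not real objects from any library.
from itertools import product

# Placeholder names standing in for the options described above.
cleaners    = ['stemmer', 'lemmatizer']                 # 2 cleaning options
vectorizers = ['counts', 'tf-idf']                      # 2 vectorizer options
reducers    = ['LDA', 'LSA', 'NMF']                     # 3 topic reducers
models      = ['model_{}'.format(i) for i in range(6)]  # 6 models

# Every valid pairing of the pieces above
print(len(list(product(cleaners, vectorizers, reducers, models))))  # 72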

A Pipe for Cleaning Text Data

To fight this issue, I've developed a Python tool that manages the pipeline for the user. The user just needs to open a pipeline object, hand it the various tools that make up this specific version of their pipeline, and then watch it go. Let's look at an example, then we'll examine the code.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from nlp_preprocessor import nlp_preprocessor

corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']

nlp = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
nlp.fit(corpus)
nlp.transform(corpus).toarray()
---
> array([[1, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 1, 0, 0, 0],
         [0, 0, 1, 0, 0, 0, 1, 1]])
The nlp_preprocessor class allows the user to add a particular vectorizer, cleaning function, tokenizer, or stemmer, and then the user just needs to call fit or transform, like a normal SkLearn transformer. Let's examine how this works by looking at the code (full code available here: GitHub Project), starting with the class definition.
from sklearn.feature_extraction.text import CountVectorizer
from nlpipe import nlpipe

class nlp_preprocessor(nlpipe):

    def __init__(self, vectorizer=CountVectorizer(), tokenizer=None,
                 cleaning_function=None, stemmer=None):
        """
        A class for pipelining our data in NLP problems. The user provides a series of
        tools, and this class manages all of the training, transforming, and modification
        of the text data.
        ---
        Inputs:
        vectorizer: the model to use for vectorization of text data
        tokenizer: the tokenizer to use; if None, defaults to splitting on spaces
        cleaning_function: how to clean the data; if None, defaults to the built-in cleaner
        stemmer: a function that returns a stemmed version of a token. For NLTK, this
        means instantiating a stemmer class and passing its stemming method (e.g. PorterStemmer().stem)
        """
        if not tokenizer:
            tokenizer = self.splitter
        if not cleaning_function:
            cleaning_function = self.clean_text
        self.stemmer = stemmer
        self.tokenizer = tokenizer
        self.cleaning_function = cleaning_function
        self.vectorizer = vectorizer
        self._is_fit = False
This part of the class allows the user to add their pieces upon initialization. It does a few extra things. First, if the user doesn't provide a cleaning function or a tokenizer, it defaults to some simplistic ones that are built into the class. Otherwise, it basically just sets up some attributes for later use. You may have noticed this class also inherits from another class called nlpipe... we'll come back to that later when we discuss saving models. Next, let's look at the default functions for cleaning and tokenizing.
    def splitter(self, text):
        """
        Default tokenizer that splits on spaces naively
        """
        return text.split(' ')

    def clean_text(self, text, tokenizer, stemmer):
        """
        A naive function to lowercase all words and clean them quickly.
        This is the default behavior if no other cleaning function is specified
        """
        cleaned_text = []
        for post in text:
            cleaned_words = []
            for word in tokenizer(post):
                low_word = word.lower()
                if stemmer:
                    low_word = stemmer(low_word)
                cleaned_words.append(low_word)
            cleaned_text.append(' '.join(cleaned_words))
        return cleaned_text
These functions should look pretty normal to anyone with some background in NLP. The cleaner just goes word by word, lowercasing everything and chopping off endings like 'ing' or 'ed' if a stemmer is used. These are pretty naive and inoffensive in terms of "aggressive text cleaning," so a user would do well to provide their own functions (we'll see an example of a custom cleaning function a bit later). For now though, the defaults are there to make the class work even in the simplest cases. Next, we need to figure out how we actually make it do our cleaning and transforming. To do that, we'll write a fit function which will clean the text and then teach the vectorizer how to behave. Then we'll write a transform function that can take new text and convert it into vector space.
    def fit(self, text):
        """
        Cleans the data and then fits the vectorizer with
        the user provided text
        """
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        self.vectorizer.fit(clean_text)
        self._is_fit = True

    def transform(self, text, return_clean_text=False):
        """
        Cleans any provided data and then transforms the data into
        a vectorized format based on the fit function.
        If return_clean_text is set to True, it returns the cleaned
        form of the text. If it's set to False, it returns the
        vectorized form of the data.
        """
        if not self._is_fit:
            raise ValueError("Must fit the models before transforming!")
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        if return_clean_text:
            return clean_text
        return self.vectorizer.transform(clean_text)
Nothing super fancy there, just grabbing the pieces we've already built and stacking them into a single pipeline for the data to flow through. We're in pretty good shape now. We have the ability to create a bunch of complicated NLP pipelines by just invoking a class and sticking in the pieces we want. The only other behavior that might be handy is the ability to save and load these pipelines without having to re-train.
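
As a quick sketch of swapping in your own pieces, here's what a pipeline that lemmatizes instead of stems and scales by word occurrence might look like. The lemma_clean function is just an illustration (it isn't part of the library), and it assumes NLTK's WordNet data has been downloaded.
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nlp_preprocessor import nlp_preprocessor

corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']
lemmatizer = WordNetLemmatizer()

def lemma_clean(text, tokenizer, stemmer):
    # A custom cleaner must accept the (text, tokenizer, stemmer) signature,
    # even if (as here) it ignores the stemmer and lemmatizes instead.
    cleaned_text = []
    for post in text:
        cleaned_text.append(' '.join(lemmatizer.lemmatize(word.lower())
                                     for word in tokenizer(post)))
    return cleaned_text

nlp = nlp_preprocessor(vectorizer=TfidfVectorizer(), cleaning_function=lemma_clean)
nlp.fit(corpus)
nlp.transform(corpus, return_clean_text=True)  # peek at the cleaned strings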

To handle that, let's look at our parent class called nlpipe.
import pickle

class nlpipe:

    def __init__(self):
        """
        Empty parent class for nlp pipelines that contains
        shared file i/o that happens in every class.
        """
        pass

    def save_pipe(self, filename):
        """
        Writes the attributes of the pipeline to a file
        allowing a pipeline to be loaded later with the
        pre-trained pieces in place.
        """
        if type(filename) != str:
            raise TypeError("filename must be a string")
        pickle.dump(self.__dict__, open(filename+".mdl",'wb'))

    def load_pipe(self, filename):
        """
        Reads the attributes of the pipeline from a file,
        restoring a previously saved pipeline with the
        pre-trained pieces in place.
        """
        if type(filename) != str:
            raise TypeError("filename must be a string")
        if filename[-4:] != '.mdl':
            filename += '.mdl'
        self.__dict__ = pickle.load(open(filename, 'rb'))
This class is basically just for file I/O. It has two methods, a save and a load. Both rely on a nice behavior of Python classes: every attribute of an instance is stored in a single dictionary called __dict__. So if we save the __dict__ to disk, we can re-load it at any time in the future. That's what these two methods do, using Python's pickle library to store the attributes as binary files. This means we can store the attributes in their trained form as well.
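
In practice, saving and reloading a trained pipeline looks something like the sketch below. It reuses the corpus from the first example, and 'my_pipeline' is just an example filename.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from nlp_preprocessor import nlp_preprocessor

corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']

nlp = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
nlp.fit(corpus)
nlp.save_pipe('my_pipeline')      # writes my_pipeline.mdl to disk

nlp2 = nlp_preprocessor()
nlp2.load_pipe('my_pipeline')     # restores the trained vectorizer and settings
nlp2.transform(corpus).toarray()  # same matrix as the original pipeline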

All together, that makes up our full data management pipeline. It's great for keeping the code I actually need very simple. No complicated functions, no tuples storing tons of bits and pieces. Just a class that hides the whole pipeline behind a single API call to fit or transform.

Adding a Model to the Mix

You may, quite fairly, be wondering where the machine learning is. Fear not: we deliberately left it out of the cleaning pipe so that we can attach whatever machine learning model we want later. Let's write another class that expects us to provide a pipeline object and can then do supervised learning. Here's the full code for that, which I'll discuss below.
from nlp_preprocessor import nlp_preprocessor
from nlpipe import nlpipe

class supervised_nlp(nlpipe):

    def __init__(self, model, preprocessing_pipeline=None):
        """
        A pipeline for doing supervised nlp. Expects a model and creates
        a preprocessing pipeline if one isn't provided.
        """
        self.model = model
        self._is_fit = False
        if not preprocessing_pipeline:
            self.preprocessor = nlp_preprocessor()
        else:
            self.preprocessor = preprocessing_pipeline

    def fit(self, X, y):
        """
        Trains the vectorizer and model together using the
        user's training data.
        """
        self.preprocessor.fit(X)
        train_data = self.preprocessor.transform(X)
        self.model.fit(train_data, y)
        self._is_fit = True

    def predict(self, X):
        """
        Makes predictions on the data provided by the user, using the
        preprocessing pipeline and the provided model.
        """
        if not self._is_fit:
            raise ValueError("Must fit the models before transforming!")
        test_data = self.preprocessor.transform(X)
        preds = self.model.predict(test_data)
        return preds

    def score(self, X, y):
        """
        Returns the accuracy for the model after using the trained
        preprocessing pipeline to prepare the data.
        """
        test_data = self.preprocessor.transform(X)
        return self.model.score(test_data, y)
This wraps around the processing pipeline by expecting an nlp_preprocessor object as an input. Once we have that, we just write some quick fit, predict, and score methods that wrap around the normal SkLearn model API. Our new methods simply make sure we process the data with our pipeline before handing it off to SkLearn for machine learning. After that, we can let SkLearn do the heavy lifting for us.
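
For example, a quick end-to-end run might look like the sketch below. The toy documents, the made-up labels, and the MultinomialNB model are purely for illustration, and the supervised_nlp import path is an assumption about the project's module layout.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from nlp_preprocessor import nlp_preprocessor
from supervised_nlp import supervised_nlp  # module name assumed from the project layout

X = ['BOB the builder', 'is a strange', 'caRtoon type thing', 'the builder of things']
y = [1, 0, 0, 1]  # made-up labels: 1 = "about builders", 0 = not

pipeline = nlp_preprocessor(vectorizer=CountVectorizer(), stemmer=PorterStemmer().stem)
nlp_model = supervised_nlp(MultinomialNB(), preprocessing_pipeline=pipeline)
nlp_model.fit(X, y)                     # cleans, vectorizes, and trains in one call
nlp_model.predict(['BOB is a builder'])
nlp_model.score(X, y)                   # accuracy on the (tiny) training set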

If this sounds like it may be of use to you, you can find the full implementation, including a topic modeler and a supervised learning class, on my GitHub; installation instructions are in the README.

Hopefully that will help make your NLP more manageable. Good luck!