Text Preprocessing Methods for Deep Learning

Deep Learning, particularly Natural Language Processing (NLP), has been gathering a huge interest nowadays. Some time ago, there was an NLP competition on Kaggle called Quora Question insincerity challenge. The competition is a text classification problem and it becomes easier to understand after working through the competition, as well as by going through the invaluable kernels put up by the Kaggle experts.

First, let’s start by explaining a little more about the text classification problem in the competition. 

Text classification is a common task in natural language processing, which transforms a sequence of a text of indefinite length into a category of text. How could you use that?

Now, each of these problems has something in common. From a Machine Learning perspective, these are essentially the same problem with just the target labels changing and nothing else. With that said, the addition of business knowledge can help make these models more robust and that is what we want to incorporate while preprocessing the data for test classification. 

While the preprocessing pipeline we are focusing on in this post is mainly centered around Deep Learning, most of it will also be applicable to conventional machine learning models too.

First, let's go through the flow of a deep learning pipeline for text data before going through all the steps in order to get a higher-level perspective about the whole process.

We normally start with cleaning up the text data and performing basic EDA. Here we try to improve our data quality by cleaning up the data. We also try to improve the quality of our Word2Vec embeddings by removing OOV (Out-of-Vocabulary) words. These first two steps normally don’t have much order between them and we generally go back and forth between these two steps. 

Next, we create a representation for text that could be fed into a deep learning model. We then start with creating our models and training them. Finally, we evaluate the models using appropriate metrics and get approval from respective shareholders to deploy our models. Don’t worry if these terms don’t make much sense now. We will try to explain them through the course of this article.

Here at this junction, let us take a little detour to talk a little about word embeddings. We will have to think about them while preprocessing data for our deep learning models.

A Primer on Word2Vec Embeddings

We need to have a way to represent words in a vocab. One way to do that could be to use One hot encoding of word vectors but that is not really a good choice. One of the major reasons is that the one-hot word vectors cannot accurately express the similarity between different words, such as the cosine similarity.

Given the structure of One hot encoded vectors, the similarity is always going to come as 0 between different words. Another reason is that as the size of vocabulary increases these One hot encoded vectors become very large.

Word2Vec overcomes the above difficulties by providing us with a fixed-length vector representation of words and by capturing the similarity and analogy relationships between different words.

Word2vec vectors of words are learned in such a way that they allow us to learn different analogies. It enables us to do algebraic manipulations on words that were not possible before. For example: What is king — man + woman? It comes out to be Queen.

Word2Vec vectors also help us to find out the similarity between words. If we try to find similar words to “good”, we will find awesome, great, etc. It is this property of word2vec that makes it invaluable for text classification. Now our deep learning network understands that “good” and “great” are essentially words with similar meanings.

Thus, in very simple terms, word2vec creates vectors for words. Thus we have a d dimensional vector for every word (common bigrams too) in a dictionary. We normally use pre-trained word vectors which are provided to us by others after training on large corpora of texts like Wikipedia, Twitter, etc. The most commonly used pre-trained word vectors are Glove and Fasttext with 300-dimensional word vectors. We are going to use Glove in this post.

Basic Preprocessing Techniques for Text Data

In most cases, we observe that text data is not entirely clean. Data coming from different sources have different characteristics and that makes Text Preprocessing one of the most important steps in the classification pipeline. 

For example, Text data from Twitter is totally different from text data on Quora, or some news/blogging platform, and thus would need to be treated differently. Helpfully, the techniques we are going to talk about in this post are generic enough for any kind of data you might encounter in the jungles of NLP.

Cleaning Special Characters and Removing Punctuations

Our preprocessing pipeline depends a lot on the word2vec embeddings we are going to use for our classification task. In principle, our preprocessing should match the preprocessing that was used before training the word embedding. Since most of the embeddings don’t provide vector values for punctuations and other special chars, the first thing you want to do is to get rid of the special characters in your text data. These are some of the special chars that were there in the Quora Question data and we use replace function to get rid of these special chars.

# Some preprocessing that will be common to all the text classification methods you will see.

Python
 
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',  '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '', '←', '×', '§', '″', '′', ' ', '█', '½', 'à', '…',  '“', '', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─',  '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '', '', 'é', '¯', '', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞',  '∙', ')', '↓', '、', '│', '(', '»', ',', '', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '', 'ï', 'Ø', '¹', '≤', '‡', '√', ]


Python
 
def clean_text(x):    x = str(x)    for punct in puncts:        if punct in x:            x = x.replace(punct, '')    return x


This could also have been done with the help of a simple regex. But we normally like the above way of doing things as it helps to understand the sort of characters we are removing from our data.

Python
 
def clean_text(x):    pattern = r'[^a-zA-z0-9\s]'    text = re.sub(pattern, '', x)    return x


Cleaning Numbers

Why do we want to replace numbers with #s? Because most embeddings have preprocessed their text like this.

Small Python Trick: We use an if statement in the code below to check beforehand if a number exists in a text. It is as an if is always fast than a re.sub command and most of our text doesn't contain numbers.

Python
 
def clean_numbers(x):    if bool(re.search(r'\d', x)):        x = re.sub('[0-9]{5,}', '#####', x)        x = re.sub('[0-9]{4}', '####', x)        x = re.sub('[0-9]{3}', '###', x)        x = re.sub('[0-9]{2}', '##', x)    return x


Removing Misspells

It always helps to find out misspells in the data. As those word embeddings are not present in the word2vec, we should replace words with their correct spellings to get better embedding coverage. 

The following code artifact is an adaptation of Peter Norvig’s spell checker. It uses word2vec ordering of words to approximate word probabilities, as Google word2vec apparently orders words in decreasing order of frequency in the training corpus. You can use this to find out some misspelled words in the data you have.

This comes from CPMP script in the Quora questions similarity challenge.

Python
 
import re from collections import Counter import gensim import heapq from operator import itemgetter from multiprocessing import Pool 
model = gensim.models.KeyedVectors.load_word2vec_format('../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin',                                                         binary=True) words = model.index2word 
w_rank = {} for i,word in enumerate(words):    w_rank[word] = i 
WORDS = w_rank 
def words(text): return re.findall(r'\w+', text.lower()) 
def P(word):     "Probability of `word`."    # use inverse of rank as proxy    # returns 0 if the word isn't in the dictionary    return - WORDS.get(word, 0) 
def correction(word):     "Most probable spelling correction for word."    return max(candidates(word), key=P) 
def candidates(word):     "Generate possible spelling corrections for word."    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word]) 
def known(words):     "The subset of `words` that appear in the dictionary of WORDS."    return set(w for w in words if w in WORDS) 
def edits1(word):    "All edits that are one edit away from `word`."    letters    = 'abcdefghijklmnopqrstuvwxyz'    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]    deletes    = [L + R[1:]               for L, R in splits if R]    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]    inserts    = [L + c + R               for L, R in splits for c in letters]    return set(deletes + transposes + replaces + inserts) 
def edits2(word):     "All edits that are two edits away from `word`."    return (e2 for e1 in edits1(word) for e2 in edits1(e1)) 
def build_vocab(texts):    sentences = texts.apply(lambda x: x.split()).values    vocab = {}    for sentence in sentences:        for word in sentence:            try:                vocab[word] += 1            except KeyError:                vocab[word] = 1    return vocab 
vocab = build_vocab(train.question_text) 
top_90k_words = dict(heapq.nlargest(90000, vocab.items(), key=itemgetter(1))) 
pool = Pool(4) corrected_words = pool.map(correction,list(top_90k_words.keys())) 
for word,corrected_word in zip(top_90k_words,corrected_words):    if word!=corrected_word:        print(word,":",corrected_word)


Once we are through with finding misspelled data, the next thing remains to replace them using misspell mapping and regex functions.

Python
 
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}


Python
 
def _get_mispell(mispell_dict):    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))    return mispell_dict, mispell_re 
mispellings, mispellings_re = _get_mispell(mispell_dict) def replace_typical_misspell(text):    def replace(match):       return mispellings[match.group(0)]    return mispellings_re.sub(replace, text) 
# Usage replace_typical_misspell("Whta is demonitisation") 


Removing Contractions

Contractions are words that we write with an apostrophe. Examples of contractions are words like “ain’t” or “aren’t”. Since we want to standardize our text, it makes sense to expand these contractions. Below we have done this using contraction mapping and regex functions.

Python
 
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}


Python
 
def _get_contractions(contraction_dict):    contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))    return contraction_dict, contraction_re 
contractions, contractions_re = _get_contractions(contraction_dict) 
def replace_contractions(text):    def replace(match):        return contractions[match.group(0)]    return contractions_re.sub(replace, text) 
# Usage replace_contractions("this's a text with contraction")


Apart from the above techniques, there are other preprocessing techniques of text like Stemming, Lemmatization, and Stopword Removal. Since these techniques are not used along with Deep Learning NLP models, we won’t talk about them here.

Representation: Sequence Creation

One of the things that have made Deep Learning the "go-to" choice for NLP is the fact that we don’t really have to hand-engineer features from the text data. The deep learning algorithms take as input a sequence of text to learn the structure of text just like a human does. Since machines cannot understand words they expect their data in numerical form. So we would like to represent text data as a series of numbers. 

To understand how this is done we need to understand a little about the Keras Tokenizer function. One can use any other tokenizer, but Keras tokenizer is a popular choice.

Tokenizer:

In simple words, a tokenizer is a utility function to split a sentence into words. keras.preprocessing.text.Tokenizer tokenizes(splits) the texts into tokens(words) while keeping only the most occurring words in the text corpus.

Python
 
#Signature: Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0, **kwargs)


The num_words parameter keeps a pre-specified number of words in the text only. This is helpful as we don’t want our models to get a lot of noise by considering words that occur very infrequently. In real-world data, most of the words we leave using num_words param are normally misspells. The tokenizer also filters some non-wanted tokens by default and converts the text into lowercase.

The tokenizer once fitted to the data also keeps an index of words (dictionary of words which we can use to assign a unique number to a word) which can be accessed by:

 tokenizer.word_index

The words in the indexed dictionary are ranked in order of frequencies.

So the whole code to use tokenizer is as follows:

Python
 
from keras.preprocessing.text import Tokenizer ## Tokenize the sentences tokenizer = Tokenizer(num_words=max_features) tokenizer.fit_on_texts(list(train_X)+list(test_X)) train_X = tokenizer.texts_to_sequences(train_X) test_X = tokenizer.texts_to_sequences(test_X)


where train_X and test_X are lists of documents in the corpus.

Pad Sequence

Normally our model expects that each sequence (each training example) will be of the same length (same number of words/tokens). We can control this using the maxlen parameter.

For example:

Python
 
train_X = pad_sequences(train_X, maxlen=maxlen) test_X = pad_sequences(test_X, maxlen=maxlen)


Now our train data contains a list of numbers. Each list has the same length. And we also have the word_index which is a dictionary of the words occurring most in the text corpus.

Embedding Enrichment

As mentioned above, we will be using GLoVE Word2Vec embeddings to explain the enrichment. GLoVE pretrained vectors are trained on the Wikipedia corpus. 

That means some of the words that might be present in your data might not be present in the embeddings. How could we deal with that? Let’s first load the Glove Embeddings first.

Python
 
def load_glove_index():    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')[:300]    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))    return embeddings_index 
glove_embedding_index = load_glove_index()


Be sure to put the path of the folder where you download these GLoVE vectors. 

What does this glove_embedding_index contain? It is just a dictionary in which the key is the word and the value is the word vector, a np.array of length 300. The length of this dictionary is somewhere around a billion. Since we only want the embeddings of words that are in our word_index, we will create a matrix that just contains the required embeddings.

tokenizer.word Index for Text Preprocessing

Python
 
def create_glove(word_index,embeddings_index):    emb_mean,emb_std = -0.005838499,0.48782197    all_embs = np.stack(embeddings_index.values())    embed_size = all_embs.shape[1]    nb_words = min(max_features, len(word_index))    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))    count_found = nb_words    for word, i in tqdm(word_index.items()):       if i >= max_features: continue       embedding_vector = embeddings_index.get(word)       if embedding_vector is not None:             embedding_matrix[i] =  embedding_vector       else:                count_found-=1    print("Got embedding for ",count_found," words.")    return embedding_matrix


The above code works fine but is there a way that we can use the preprocessing in GLoVE to our advantage? 

Yes. When preprocessing was done for glove, the creators didn’t convert the words to lowercase. That means that it contains multiple variations of a word like 'USA’, ‘usa’ and ‘Usa’. That also means that in some cases while a word like ‘Word’ is present, its analog in lowercase i.e. ‘word’ is not present. 

We can get through this situation by using the below code.

Python
 
def create_glove(word_index,embeddings_index):    emb_mean,emb_std = -0.005838499,0.48782197    all_embs = np.stack(embeddings_index.values())    embed_size = all_embs.shape[1]    nb_words = min(max_features, len(word_index))    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))     count_found = nb_words    for word, i in tqdm(word_index.items()):        if i >= max_features: continue        embedding_vector = embeddings_index.get(word)        if embedding_vector is not None:             embedding_matrix[i] =  embedding_vector        else:            if word.islower():                # try to get the embedding of word in titlecase if lowercase is not present                embedding_vector = embeddings_index.get(word.capitalize())                if embedding_vector is not None:                     embedding_matrix[i] = embedding_vector                else:                    count_found-=1            else:                count_found-=1    print("Got embedding for ",count_found," words.")    return embedding_matrix


The above was just an example of how we can use our knowledge of embedding to get better coverage. Sometimes, depending on the problem, one might also derive value by adding extra information to the embeddings using some domain knowledge and NLP skills. 

For example, we can add external knowledge to the embeddings themselves by adding polarity and subjectivity of a word from the TextBlob package in Python.

Python
 
from textblob import TextBlob word_sent = TextBlob("good").sentiment print(word_sent.polarity,word_sent.subjectivity) # 0.7 0.6


We can get the polarity and subjectivity of any word using TextBlob. Pretty neat! So, let us try to add this extra information to our embeddings.

Python
 
def create_glove(word_index,embeddings_index):    emb_mean,emb_std = -0.005838499,0.48782197    all_embs = np.stack(embeddings_index.values())    embed_size = all_embs.shape[1]    nb_words = min(max_features, len(word_index))    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size+4))        count_found = nb_words    for word, i in tqdm(word_index.items()):        if i >= max_features: continue        embedding_vector = embeddings_index.get(word)        word_sent = TextBlob(word).sentiment        # Extra information we are passing to our embeddings        extra_embed = [word_sent.polarity,word_sent.subjectivity]        if embedding_vector is not None:             embedding_matrix[i] =  np.append(embedding_vector,extra_embed)        else:            if word.islower():                embedding_vector = embeddings_index.get(word.capitalize())                if embedding_vector is not None:                     embedding_matrix[i] = np.append(embedding_vector,extra_embed)                else:                    embedding_matrix[i,300:] = extra_embed                    count_found-=1            else:                embedding_matrix[i,300:] = extra_embed                count_found-=1    print("Got embedding for ",count_found," words.")    return embedding_matrix


Engineering embeddings is an essential part of getting better performance from the deep learning models at a later stage. Generally, we will revisit this part of code multiple times during the stage of a project while trying to improve our models even further. You can show up a lot of creativity here to improve coverage over your word_index and to include extra features in your embedding.

More Engineered Features

Embedding_matrix Text Preprocessing Methods

One can always add sentence-specific features like sentence length, number of unique words, etc. as another input layer to give extra information to the Deep Neural Network. 

For example, we created these extra features as part of a feature engineering pipeline for the Quora Insincerity Classification Challenge.

Python
 
def add_features(df):    df['question_text'] = df['question_text'].progress_apply(lambda x:str(x))    df["lower_question_text"] = df["question_text"].apply(lambda x: x.lower())    df['total_length'] = df['question_text'].progress_apply(len)    df['capitals'] = df['question_text'].progress_apply(lambda comment: sum(1 for c in comment if c.isupper()))    df['caps_vs_length'] = df.progress_apply(lambda row: float(row['capitals'])/float(row['total_length']),                                axis=1)    df['num_words'] = df.question_text.str.count('\S+')    df['num_unique_words'] = df['question_text'].progress_apply(lambda comment: len(set(w for w in comment.split())))    df['words_vs_unique'] = df['num_unique_words'] / df['num_words']     return df


Conclusion

NLP is still a very interesting problem in the deep learning space, so we would encourage you to do a lot of experimentation to see what works and what doesn’t. We have tried to provide a wholesome perspective of the preprocessing steps for a deep learning neural network for any NLP problem, but that doesn’t mean it is definitive.

 

 

 

 

Top