Text Checker Tutorial (Part 3)

by ryan | Dec. 16, 2023

Part 3 of an NLP classification tutorial; Language Models, Sentence Transformers, Random Forests; find the GitHub repo here

This tutorial is divided into the following sections:

  • The Goal (and a summary of Parts 1 and 2)
  • What are Embeddings?
  • Types of Embeddings
  • How to Generate Embeddings
  • The Code
  • Closing Thoughts

The Goal (and a summary of Parts 1 and 2)

Have you ever sent a text or email message to the wrong person 😳? The goal of this tutorial series is to develop an algorithm that can prevent this, and to learn a lot of cool NLP stuff along the way.

Just to recap, we have 2 classes:

  • 0 = the negative class, meaning the text message is most likely safe to send,
  • 1 = the positive class, meaning the message could be harmful to send.

In Part 1 and Part 2, we created two classifiers to do this. The first was a Naive Bayes classifier and the second was a Random Forest classifier (using TF-IDF as the features). Both of these methods were relatively simple because, at their core, both were counting word frequencies. Although they performed pretty well, simple counting means we are potentially losing a lot of information, because we were NOT taking into account the order of the words or the context in which they appeared.

i.e. Because we are just counting, the models from Part 1 and Part 2 would treat the following in the exact same way:

  • ‘bank’ as in the place where you put money 🏦
  • ‘bank’ as in the bank of a river 🏞️

In this tutorial we address this issue in hopes of developing an even better classifier!

To do this we are going to use sentence embeddings. In the next section, I will give a brief background to embeddings.

What are Embeddings?

In the field of Natural Language Processing, embeddings are a method for turning text into numbers.

Pretty simple right? But why do this?

Well, computers don’t handle human language very well, but they are pretty good at working with numbers. So by embedding, we can turn natural languages (i.e. the ones humans speak) into numbers so that computers can process them.

Types of Embeddings

There are many types of embeddings, and in fact, TF-IDF from Part 2 is actually a type of embedding. So you are already familiar with one type of text embedding.
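Just as a quick, hand-made illustration (not part of the tutorial code), here is roughly what a TF-IDF embedding looks like as numbers, using scikit-learn’s TfidfVectorizer and a couple of made-up sentences:

# illustration only: TF-IDF turns text into a (sparse) vector of numbers
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I sent the money to the bank",
        "We walked along the bank of the river"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(tfidf.shape)         # (2, number of unique words in the two sentences)
print(tfidf.toarray()[0])  # the first sentence as a vector of (mostly 0) numbers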

Embeddings can be done at various levels of abstraction. For example, we could get embeddings at the following levels:

  • the character level
  • the subword level (also called tokens)
  • the word level
  • the sentence/phrase level
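To make these levels a bit more concrete, here is a small, hand-made illustration (the subword split shown is only an example of what a WordPiece-style tokenizer might produce, not the output of any specific model):

text = "sending messages"

# character level: every character is its own unit
chars = list(text)    # ['s', 'e', 'n', 'd', 'i', 'n', 'g', ' ', 'm', ...]

# word level: split on whitespace
words = text.split()  # ['sending', 'messages']

# subword/token level: depends on the tokenizer's vocabulary;
# a WordPiece-style tokenizer might produce something like
# ['send', '##ing', 'message', '##s']   (illustrative only)

# sentence/phrase level: the whole string gets embedded as one unit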

💡 Note: To my knowledge, character-level embeddings aren’t used in practice for English, but in Andrej Karpathy’s tutorial he works at the character level. It’s more for educational purposes, but he does a great job teaching about creating embeddings from text.

How to Generate Embeddings

How to generate embeddings is an active field of research that merits its own tutorial (if not multiple). So for the purposes of this tutorial, I will just give a brief overview of methods for creating embeddings.

  • Simple statistical/counting methods:
    • Examples: One-hot encoding, TF-IDF
    • Issues:
      • The embeddings are really sparse (i.e. a lot of 0s),
      • order and context not taken into account
    • Solution: Use neural methods to create more dense vectors
  • Basic Neural Methods:
    • Examples: word2vec
    • Issues:
      • Context is not taken into account
      • i.e. the bank where you put money 🏦 and bank on the side of a river🏞️ would be represented with the same vector
    • Solution: Attention Mechanisms
  • Neural Methods with attention:
    • Examples: transformer models such as BERT and Sentence-BERT (the approach we use in this tutorial)

For this tutorial, we are going to use Sentence Transformers, which started with the initial Sentence-BERT paper and has its own library that makes it super easy to generate sentence embeddings with a variety of models.

Basically, the models the library offers are transformer models that have been trained for semantic similarity tasks. In other words, if 2 sentences/phrases are close together in meaning, they should be close together in some n-dimensional vector space (often measured by cosine similarity).
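To get a feel for what “close together in meaning” looks like in practice, here is a small sketch using the same ‘all-MiniLM-L6-v2’ model we use below and the library’s util.cos_sim helper (the example sentences are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

emb = model.encode([
    "I deposited money at the bank",      # money sense of 'bank'
    "I put my paycheck into my account",  # also about money
    "We sat on the bank of the river",    # river sense of 'bank'
], convert_to_tensor=True)

# pairwise cosine similarities between the three sentences
print(util.cos_sim(emb, emb))
# the first two sentences should score higher with each other
# than either does with the third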

The Code

So now that we have some background on what embeddings are, let’s dive into the code. We start off with the code we had in Parts 1 and 2, where we did some data cleaning and split our data into training and validation splits.
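For reference, the setup from the earlier parts looks roughly like this; the file name below is just a placeholder, but the column names text and target match what we use later (the real cleaning code is in the previous parts and in the repo):

# rough sketch of the setup from Parts 1 and 2 (see the repo for the real code)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('messages.csv')   # placeholder file name; columns: 'text', 'target'
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)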

Language models take a lot of time, computing power, and data to train, so we download a pre-trained model for sentence embeddings called ‘all-MiniLM-L6-v2’ and use the `encode` method to convert text into embeddings. If you want to see other pre-trained models that are available, take a look at the Sentence Transformers website.

# import library
from sentence_transformers import SentenceTransformer, util
# download model
minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# get embedding for training and validation data
embeddings_train = minilm.encode(train_df.text, convert_to_tensor=True)
embeddings_val = minilm.encode(val_df.text, convert_to_tensor=True)

We can take a quick look at the size of the embeddings.

embeddings_train.shape

# should output something like:
# torch.Size([177, 384])
# because we have 177 samples in our training
# data, and each body of text got converted into an
# embedding of size 384

We can think of these embeddings as somehow capturing the meaning of a given text with a list of 384 numbers.

There are a variety of ways we can use these embeddings for classification (including logistic regression or even training a simple neural network to make the classification). But since we used a Random Forest last time, I thought it would be neat to use the same algorithm for classification, so we can compare how different types of embeddings affect the performance of a simple Random Forest classifier.

# train a Random Forest on the sentence embeddings
from sklearn.ensemble import RandomForestClassifier

rf_emb = RandomForestClassifier(max_depth=10, n_estimators=7, n_jobs=-1)
rf_emb.fit(embeddings_train.cpu().numpy(), train_df.target)

y_pred_train = rf_emb.predict(embeddings_train.cpu().numpy())
print('Train accuracy: ', (train_df.target == y_pred_train).mean().round(3))

y_pred_val = rf_emb.predict(embeddings_val.cpu().numpy())
print('Validation accuracy: ', (val_df.target == y_pred_val).mean().round(3))

# Should output something like:
# Train accuracy:  0.994
# Validation accuracy:  0.913
# NOTE: Random Forests are indeed random, which means that
# these results might change every time you train it.

Not too bad for only 7 trees (n_estimators=7) with a max depth of 10. In practice, hundreds of trees are sometimes used.
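If you want to experiment, using more trees is a one-line change (this is just a suggestion, not something from the original parts, and it will train more slowly):

# same idea, just with more trees; slower to train but often more stable
rf_big = RandomForestClassifier(max_depth=10, n_estimators=200, n_jobs=-1)
rf_big.fit(embeddings_train.cpu().numpy(), train_df.target)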

Now, to test it out on other pieces of text, we have to take two steps:

  • embed the text
  • pass the embedded text into the Random Forest model to generate a prediction

Here is the code to do that:

test_y = minilm.encode("Hi I'm running into some issues. I hate you", convert_to_tensor=True)

rf_emb.predict(test_y.cpu().numpy().reshape(1, -1))
# Should output something like:
# array([1])
# 1 => potentially harmful text

In this case, since the output was 1 we would want to display a message to the user saying something like “are you sure you want to send this message?”
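In an application, that check might look something like this sketch (the exact wording of the warning is of course up to you):

# sketch of how the prediction could be used in an app
prediction = rf_emb.predict(test_y.cpu().numpy().reshape(1, -1))[0]

if prediction == 1:
    print("Are you sure you want to send this message?")
else:
    print("Message sent!")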

Closing Thoughts

  • These models aren’t perfect; this is just a tutorial using dummy data.
  • In the real world, what you would or wouldn’t want to send to someone would be highly dependent on who you are contacting. Because of this, it would probably be smart to include the recipient as a feature in the classification task (a rough sketch of one way to do this follows this list).
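One rough, untested way to do that would be to append a simple encoding of the recipient to each sentence embedding before training; the recipient column here is hypothetical (our dummy data doesn’t have one):

# rough sketch: append a one-hot encoding of the recipient to each embedding
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# assumes a hypothetical 'recipient' column in the data
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # use sparse=False on older scikit-learn
recipient_train = enc.fit_transform(train_df[['recipient']])

X_train = np.hstack([embeddings_train.cpu().numpy(), recipient_train])
# then fit a RandomForestClassifier on X_train just like before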

Congratulations! We’ve made it to the end! Just to summarize here’s what we’ve accomplished in each part:

  • Part 1 → created Naive Bayes classifier from scratch
  • Part 2 → used TF-IDF as features in a Random Forest classifier
  • Part 3 → created a Random Forest classifier using sentence embeddings as features

On the GitHub repository, you’ll find some code for visualizing these embeddings using PCA. I thought this tutorial was already getting to be too long, so I’ll leave it up to the reader to look into that.

nlp tutorial cs project