by ryan | Dec. 16, 2023
Part 3 of an NLP classification tutorial; Language Models, Sentence Transformers, and Random Forests; find the GitHub repo here
This tutorial is divided into the following sections:
Have you ever sent a text or email message to the wrong person? The goal of this tutorial series is to develop an algorithm that can prevent this, and to learn a lot of cool NLP stuff along the way.
Just to recap, we have 2 classes: 0 (not potentially harmful) and 1 (potentially harmful text).
In Part 1 and Part 2, we created two classifiers to do this. The first was a Naive Bayes classifier and the second was a Random Forest classifier (using TF-IDF as the features). Both of these methods were relatively simple because both essentially boil down to counting word frequencies. Although they performed pretty well, simple counting means we are potentially losing a lot of information, because we are NOT taking into account the order of the words or the context in which they appear.
For example, because we are just counting, the models from Part 1 and Part 2 would treat a pair of sentences like "I am happy, not sad" and "I am sad, not happy" in exactly the same way, since both contain exactly the same words.
In this tutorial we address this issue in hopes of developing an even better classifier!
To do this, we are going to use sentence embeddings. In the next section, I will give a brief background on embeddings.
In the field of Natural Language Processing, embeddings are a method for turning text into numbers.
Pretty simple right? But why do this?
Well, computers don’t handle human language very well, but they are pretty good at working with numbers. So by embedding, we can turn natural languages (i.e. the ones humans speak) into numbers so that computers can process them.
There are many types of embeddings, and in fact, TF-IDF from Part 2 is actually a type of embedding. So you are already familiar with one type of text embedding.
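As a quick refresher on what "turning text into numbers" looks like in practice, here is a tiny TF-IDF sketch using scikit-learn. The sentences are made up for illustration; this is not the code from Part 2.

from sklearn.feature_extraction.text import TfidfVectorizer

# two made-up example sentences
docs = ["I love NLP", "I love pizza"]

# fit the vectorizer and turn each sentence into a vector of numbers
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                               # (2, number of unique words)
print(vectorizer.get_feature_names_out())    # which word each column refers to
print(X.toarray())                           # each row is the "embedding" of one sentence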
Embeddings can be done at various levels of abstraction. For example, we could get embeddings at the character, word, sentence, or document level.
Note: To my knowledge, character-level embeddings aren’t used much in practice for English, but in Andrej Karpathy’s tutorial, he works at the character level. It’s more for educational purposes, but he does a great job teaching about creating embeddings from text.
How to generate embeddings is an active field of research that merits its own tutorial (if not multiple). So for the purposes of this tutorial, I will just give a brief overview of methods for creating embeddings.
For this tutorial we are going to use Sentence Transformers, which started with the initial Sentence-BERT paper and has its own library that makes it super easy to generate sentence embeddings with a variety of models.
Basically, the models they offer are transformer models that have been trained for semantic similarity tasks. In other words, if 2 sentences/phrases are close together in meaning, they should be close together in some n-dimensional vector space (often measured by cosine similarity).
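To make that concrete, here is a small sketch of what "close in meaning means close in vector space" looks like with the Sentence Transformers library. The sentences are made up; the model is the same one we download in the next section.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# two sentences with similar meaning and one unrelated sentence
emb = model.encode([
    "I am running late to the meeting",
    "I will arrive at the meeting a bit late",
    "The cat is sleeping on the couch",
], convert_to_tensor=True)

# cosine similarity between pairs of sentence embeddings
print(util.cos_sim(emb[0], emb[1]))  # relatively high similarity
print(util.cos_sim(emb[0], emb[2]))  # relatively low similarity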
So now that we have some background on what embeddings are, let’s dive into the code. We start off with the code we had in Parts 1 and 2, where we did some data cleaning and split our data into training and validation sets.
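If you are jumping in at Part 3, here is a rough sketch of that setup. The file name and cleaning step are placeholders (see Parts 1 and 2 or the GitHub repo for the real thing); the important bit is that we end up with train_df and val_df, each with a text column and a target column.

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical file name; use the dataset from Parts 1 and 2
df = pd.read_csv('messages.csv')              # columns: text, target

# basic cleaning (placeholder for whatever was done in Part 1)
df = df.dropna(subset=['text', 'target'])

# split into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)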
Language models take a lot of time, power, and data to train, so we download a pre-trained model for sentence embeddings called ‘all-MiniLM-L6-v2’ and use the encode method to convert text into embeddings. If you want to see other pre-trained models that are available, take a look at the Sentence Transformers website.
# import library
from sentence_transformers import SentenceTransformer, util
# download model
minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# get embeddings for the training and validation data
# (convert the text columns to plain Python lists before encoding)
embeddings_train = minilm.encode(train_df.text.tolist(), convert_to_tensor=True)
embeddings_val = minilm.encode(val_df.text.tolist(), convert_to_tensor=True)
We can take a quick look at the size of the embeddings.
embeddings_train.shape
# should output something like:
# torch.Size([177, 384])
# because we have 177 samples in our training
# data; each body of text got converted into an
# embedding of size 384
We can think of these embeddings as somehow capturing the meaning of a given text with a list of 384 numbers.
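If you are curious what those numbers actually look like, you can peek at the first few values of one embedding (the exact values will vary from run to run and model version to model version):

# first five of the 384 values in the first training embedding
print(embeddings_train[0][:5])   # a small tensor of floating point numbers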
There are a variety of ways we can use these embeddings for classification (including logistic regression or even training a simple neural network to make the classification). But since we used a Random Forest last time, I thought it would be neat to use the same algorithm for classification, so we can compare how the different types of embeddings affect the performance of a simple Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier

# train a Random Forest on the sentence embeddings
rf_emb = RandomForestClassifier(max_depth=10, n_estimators=7, n_jobs=-1)
rf_emb.fit(embeddings_train.cpu().numpy(), train_df.target)

y_pred_train = rf_emb.predict(embeddings_train.cpu().numpy())
print('Train accuracy: ', (train_df.target == y_pred_train).mean().round(3))
y_pred_val = rf_emb.predict(embeddings_val.cpu().numpy())
print('Validation accuracy: ', (val_df.target == y_pred_val).mean().round(3))
# Should output something like:
# Train accuracy: 0.994
# Validation accuracy: 0.913
# NOTE: Random Forests are indeed random, which means that
# these results might change every time you train the model.
Not too bad for only 7 trees (n_estimators=7) with a max depth of 10. In practice, sometimes hundreds of trees are used.
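As mentioned above, these embeddings also work with other classifiers. Just as an aside (not part of the original tutorial code), a logistic regression on the exact same embeddings would only take a couple of lines, and its results will of course vary:

from sklearn.linear_model import LogisticRegression

# train a logistic regression on the same sentence embeddings
logreg = LogisticRegression(max_iter=1000)
logreg.fit(embeddings_train.cpu().numpy(), train_df.target)

# validation accuracy, for comparison with the Random Forest
y_pred_val_lr = logreg.predict(embeddings_val.cpu().numpy())
print('Validation accuracy: ', (val_df.target == y_pred_val_lr).mean().round(3))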
Now, to test the classifier on other pieces of text, we have to take two steps: (1) encode the text with the sentence transformer model, and (2) pass the resulting embedding to the Random Forest classifier. Here is the code to do that:
test_y = minilm.encode("Hi I'm running into some issues. I hate you", convert_to_tensor=True)
rf_emb.predict(test_y.cpu().numpy().reshape(1, -1))
# Should output something like:
# array([1])
# 1 => potentially harmful text
In this case, since the output was 1, we would want to display a message to the user saying something like “Are you sure you want to send this message?”
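To tie the two steps together, you could wrap them in a small helper function. This is just a sketch; the function name and structure are my own, not from the repo.

def should_warn(message):
    """Return True if the message looks potentially harmful (class 1)."""
    # step 1: turn the message into a sentence embedding
    emb = minilm.encode(message, convert_to_tensor=True)
    # step 2: let the Random Forest classify the embedding
    pred = rf_emb.predict(emb.cpu().numpy().reshape(1, -1))[0]
    return pred == 1

if should_warn("Hi I'm running into some issues. I hate you"):
    print("Are you sure you want to send this message?")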
Congratulations! We’ve made it to the end! Just to summarize, here’s what we’ve accomplished in each part: in Part 1 we built a Naive Bayes classifier, in Part 2 we built a Random Forest classifier on TF-IDF features, and in this part we built a Random Forest classifier on sentence embeddings.
On the GitHub repository, you’ll find some code for visualizing these embeddings using PCA. This tutorial is already getting to be too long, so I’ll leave it up to the reader to look into that.
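If you do want to try it yourself before checking the repo, a very rough sketch (not the repo’s code) could look something like this:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# project the 384-dimensional embeddings down to 2 dimensions
pca = PCA(n_components=2)
points = pca.fit_transform(embeddings_train.cpu().numpy())

# color each point by its class to see whether the two classes separate
plt.scatter(points[:, 0], points[:, 1], c=train_df.target, cmap='coolwarm', alpha=0.7)
plt.title('Sentence embeddings projected with PCA')
plt.show()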
nlp tutorial cs project