by ryan | April 25, 2023
Part 2 of an NLP classification tutorial; TD-IDF, Random Forests; Find GitHub repo here
In Part 1, we created a Naive Bayes Classifier from scratch to detect messages that contain potentially harmful, personal, or sensitive messages. And it worked pretty well! But can we do better? Can we do something a bit less naive?
In this tutorial we explore a bit less naive methods of classification.
This tutorial is divided into the following sections:
This is a continuation from Part 1, in which our goal is to goal is to develop an algorithm/model that can detect messages that a user might NOT want to send to someone. Just to recap, we have 2 classes. 0, the neagtive class, means that the text message is most likely safe to send, and 1, the positive class, means that the message could be harmful to send.
Below is an example of each class.
Positive Example:
input: 'I hate this job.'
Model's expected output: 1.
Decision: DISPLAY confirmation message to user 'Are you sure you want to send this message?'
Negative Example:
input: 'Good luck with the presentation today!'
Model's expected output: 0.
Decision: DO NOT DISPLAY confirmation message to user
Disclaimer: The dataset we use is a toy dataset I created using ChatGPT (see this section from Part 1 for more info on the data).
Know that we have reviewed the task at hand let’s dive into the nitty-gritty.
With the Naive Bayes Classifier, we used the word frequencies (how often a word occurred) in the training data and multiplied them together to see how likely a phrase belongs to each class. We then selected the class with the highest probability as our prediction.
For building our next classifier we are going to use something very similar to word frequencies called term frequency–inverse document frequency (or TF-IDF).
Although it’s an “old school” method according to one of my advisors, it’s a pretty widely used. According to the Wikipedia page, TD-IDF is used in information retrieval, text-mining, and user modeling, and that variations of it are used often by search engines (think Google or Bing) to rank how relevant websites are to a user’s search. But what is TD-IDF exactly?
As it states in it’s name it has to deal with the term frequency and the inverse document frequency so let’s define what these are.
Term frequency is simply how often a work occurs in a document (or any body of text). If we listed out our documents with the word count it would look something like this.
Word | the | pizza | dog | cat | … |
---|---|---|---|---|---|
Document 1 | 10 | 2 | 0 | 0 | … |
Document 2 | 5 | 0 | 3 | 1 | … |
Document 3 | 7 | 0 | 0 | 5 | … |
In a search engine you could find the documents with the highest frequencies of the words used in a Google/Bing/Yahoo search (in the example table above, if a user searched ‘cat’ the ordering would be document 3, followed by 2 than 1).
The inverse document frequency (IDF) is a bit more complicated, but basically the IDF is a way to measure for how rare a term is in a corpus of documents (in our case the corpus of documents is our training data).
For a given word, to calculate the IDF, we count the number of documents we have (i.e. the number of messages we have in our training set) and divide this by the number of documents that contain that word. We then take the log of it (in scikit learn, a popular python library for Machine Learning algorithms, they use np.log which I believe is the natural log).
$$ idf(t, D) = log(\frac{N}{|d\in D: t\in d|}) $$
For example, in the table above, the word “the” appears in 3 out of 3 documents.
$$ idf(dog, D) = log(\frac{3}{3}) = 0 $$
So we see that words that appear across all documents (i.e. not rare words), will have a low value. Lets more example with a more rare word. The word “dog” appears in 1 out of 3 documents thus:
$$ idf(dog, D) = log(\frac{3}{1}) \approx 1.10 $$
💡 Side note: There are other ways to define both TF and IDF, but this is the way I believe it is implemented in Scikit learn (see the Wiki page for other ways).
So to put these together we just multiply them. So for the tf-idf for a word in a document is:
$$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $$
💡 Side note: Since this is multiplication, every word that appears in the document will have and IDF of 0 and thus a TF-IDF of 0. Thus, it is common to +1 to the IDF like in here.
We will use SciKit Learn to for calculating the TF-IDF. We can do this pretty easily with the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words= 'english')#remove stopwords (the, and, ...)
X = vectorizer.fit_transform(train_df.text)
The fit_transform
function, as the name suggests, does two things: 1) it fits (calculates the TF and IDF from the training data) and transforms (returns a matrix with the documents as rows and words as columns with the TF-IDF in each cell).
For the validation data we do not fit (that would be cheating). Instead we just use the transform
function which uses the vocabulary and document frequencies from from the training data, but uses the word frequencies found in the validation data. Here's the code to do that:
X_tfidf_val = vectorizer.transform(val_df.text)
We can now use the TF-IDF as an input to a machine learning algorithm. Each document's meaning will be represented by its TF-IDF, and the algorithm will try to learn based off of the TF-IDF whether the message is safe (and should output 0) or potentially harmful (should output 1).
We could use a variety of machine learning algorithms, but I've decided to use a Random Forest Classifier (by far one of the most popular → the paper that introduced this method has over 100,000 citations according to google scholar).
I won't go into all the details of the algorithm, but I will give an overview and show the code I use.
At the core of a Random Forest model we have something called decision trees.
In a decision tree for classification the leaf nodes are classes and the non-leaf nodes have two decisions to make:
which feature/column in that dataset to use (in our case the features/columns are words) and
in with feature where is the best split (which TD-IDF value to use to separate the data).
Generally these are decided by using a measure of impurity (i.e. How well did that feature separate the data? The better the job the more pure the leaf nodes are).
If the above doesn’t make to much sense take a look at this video from StatQuest which explains in depth this process of creating a decision tree.
The issue with using just a decision tree is that they tend to over fit to the training data, and thus, don’t have the ability to predict accurately data it has never seen before. The idea of Random Forests was created to solve this issue.
Overview of Random Forest Algorithm:
Bootstrap data (sample from the training data with replacement)
Create a decision tree, but choose the best features/columns (that one results in the lowest impurity) from a random subset of features/columns instead of using all the features.
Repeat steps 1 and 2 to build many trees
For prediction time we run new data through all the tree and take the majority output as the winner (as the class we assign to that data).
💡 Example: In our scenario let’s the message “I am in a meeting” run through a random forest of 7 tress results in 6 trees classifying the message with 0 and the last tree classifies the message with a 1. Because the majority of the trees outputted a 0, we would classify this message with class 0, meaning a non-sensitive message.
Due to the bootstrapping and the random selection of features at each step, we end up with wide variety of trees which prevents overfitting (i.e. helps to predict more accurately when it comes to new data). For more details on creating random forest see this video or this site from the creator of Random Forests
Scikit Learn has an implementation of Random Forests that I use. I use 7 trees (n_estimators=7) and limit the trees so that they won't go deeper than 10 steps (max_depth=10).
rf = RandomForestClassifier(max_depth=10, n_estimators=7, n_jobs=-1)
rf.fit(X, train_df.target)
y_pred_train = rf.predict(X)
print('Train accuracy: ', (train_df.target == y_pred_train).mean().round(3))
y_pred_val = rf.predict(X_tfidf_val)
print('Validation accuracy: ', (val_df.target == y_pred_val).mean().round(3))
# I got this but this will change each time you run it due to
# randomness in the Random Forest algorithm
# Train accuracy: 0.966
# Validation accuracy: 0.913
We get a pretty high accuracy! Try experimenting with more trees (you should be able to get better performance with more trees).
One of the big advantages of using Random Forest over Neural Networks is that it is easily interpretable. With Neural Networks, sometimes it is hard to tell why the network made a certain decision (although making Neural Networks more interpretable is an on going area of research). However, with Random Forest we can look at something we call feature importance.
The feature importance, as the name suggests, is a measure of how important a feature/column is (in our case, the features/columns are words). There are a variety of ways to measure this, but the way scikit learn does involves looking at the total reduction of impurity across all the tress.
Here is some code to look at the most important features.
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
top_word_indicies = []
for f in range(10):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
top_word_indicies.append(indices[f])
print("\\n")
# Print top features
for ind in top_word_indicies:
print(vectorizer.get_feature_names()[ind])
# What I got as the output
# Due to randomness these results might vary
Feature ranking:
1. feature 175 (0.164085)
2. feature 437 (0.102590)
3. feature 215 (0.066480)
4. feature 300 (0.043398)
5. feature 132 (0.041178)
6. feature 414 (0.039718)
7. feature 154 (0.031463)
8. feature 227 (0.029563)
9. feature 168 (0.022290)
10. feature 8 (0.022071)
hi
wanted
let
professor
feedback
trouble
good
make
having
afternoon
In this tutorial, we learned about TF-IDF and Random Forests, and used them to build a classifier that can tell whether or not a message contains potentially harmful or sensitive information.
The one issue with TF-IDF is that it is still a bit naive. Just to recap, with the Naive Bayes Classifier (from part 1) the naive part was assuming that a word in a message is not dependent on the words around it (which is naive knowing that we generally follow grammatical/linguistical patterns).
In TF-IDF, we are assuming basically the same thing by using what can be called a Bag-of-words model. In this model, we don’t take into account at all the ordering of words, we basically just treat the words as if we had thrown them in a bag, mixed them up, and only use word frequencies to predict whether or not a message is safe to send.
Yet despite this, we still get pretty good accuracy use these old-school methods. But can we do something less naive? With real data (instead of this toy data set), we could image things getting a lot messier so might be important to have a non-bag-of-words method for handling messier or more complicated data.
Thank you for taking the time to look at this tutorial! And hope to see you in the next one!
nlp tutorial cs project