Classifying Sentiment using Deep Learning: A BERT Project
Sentiment analysis, aka ‘opinion mining’, is a natural language processing (NLP) technique used to classify the emotional tone of text.
Some popular use cases include evaluating customer/product reviews, social reactions to global news or markets, and of course, monitoring social media data, which, lucky for us, is widely available and produced on a massive scale. Sounds like a recipe for disaster, doesn’t it?
With large language models (LLMs) becoming wildly popular, I’ve grown quite curious about NLP. While taking my first dive into the field, I’ve prepared a compact yet comprehensive guide on how to build a BERT sentiment classifier using Twitter data.
Model Background
Transformers have become the state-of-the-art architecture for many use cases. They have especially thrived in language tasks, and they are the secret sauce behind most LLMs. A few years back, researchers at Google published a new NLP model called BERT:
- Developed as a technique for training general purpose models on massive amounts of unannotated text on the web (aka pre-training).
- Has an advanced ability to understand linguistic notions, and is highly contextual because it learns representations from both sides of a word in a sentence. Basically, it learns some of the most fundamental properties of language.
- Importantly, BERT is a general purpose model that can be adapted, and fine-tuned for many tasks: summarization, question answering, classification, etc.
- And the best part about BERT - it can be downloaded and used for FREE!
At the heart of BERT is the attention mechanism of the transformer. The transformer consists of an encoder (which reads the input text) and a decoder (which makes predictions). The transformer encoder is ‘bidirectional’, meaning it learns the surroundings of each word. This characteristic separates transformers from traditional directional models (e.g. vanilla RNNs).
BERT Input
The input to the encoder is a sentence converted to a sequence of tokens, where each token is mapped to a feature vector. Prior to processing, a few more manipulations are applied to the input (a quick tokenizer preview follows this list):
- Token embeddings: include the special tokens [CLS] and [SEP], which mark the start and end of a single input, respectively
- Segment embeddings: distinguish the sentences within an input
- Positional embeddings: indicate the position of each token in the sequence
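Here’s a quick preview of what tokenization with special tokens looks like (a standalone sketch; we load this tokenizer properly later in the guide):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids = tokenizer.encode('i love opinion mining', add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'i', 'love', 'opinion', 'mining', '[SEP]']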
For this project, we leverage the massive scale of BERT’s language knowledge to build a sentiment classifier. We’ll begin with a pre-trained BERT sequence classifier available in the transformers library (by Hugging Face, one of the most popular transformer libraries in Python), which can be combined directly with PyTorch!
Data
To teach our model sentiment patterns, we will use an open-source dataset of labeled tweets. The dataset contains 1,600,000 tweets, with binary labels (positive/negative). You can find the dataset itself and much more information here.
The authors also detailed their methods of automated data labeling. Common problems can arise when labeling the emotional tone of data. In our case, neutral tweets are not accounted for, and emoticons are unfortunately stripped from the training data, even though they can carry a lot of information. Moreover, our tweets are not filtered by topic and are not domain-specific. If you have specific goals in mind, filtering by topic (e.g. sports) can likely improve model accuracy.
First, we prepare our dataset:
import pandas as pd

df = pd.read_csv('training_1600000_noemoticon.csv',
                 index_col=False,
                 names=['label', 'id', 'date', 'query', 'user_id', 'tweet'])
df.set_index('id', inplace=True)
df.head()
Here’s a glimpse of the data. You will likely spot some vile text in here; this is Twitter, after all. We have the user_id and text of 1,600,000 tweets, along with a label column (0: negative, 4: positive) and an even 50/50 label split. To deal with limited computing resources, I shuffled the data and took the first 100,000 samples for this example (you can try with less/more). We also split our data into training/validation sets:
# adjust labels to [0, 1], this is just the way the model likes it
label_dict = {}
for index, label in enumerate(df.label.unique()):
    label_dict[label] = index
df['label'] = df.label.replace(label_dict)

# shuffle rows and take only 100,000 samples
# or feel free to use the full set
df = df.sample(frac=1)[0:100000]
# train / validation split
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,   # hold out 15% for validation, train on the other 85%
    random_state=60
)
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.groupby(['label', 'data_type'])['tweet'].count()
The last line of code lets us quickly scan the label distribution after the split. With small datasets or imbalanced class data, you run an increased risk of under-representing certain classes when splitting the data. For these cases, you can try a stratified approach to make sure each class is accounted for across the data split.
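As a minimal sketch of that stratified approach (my own variant of the split above; sklearn’s stratify argument keeps the class proportions equal in both subsets):
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=60,
    stratify=df.label.values  # preserve the label ratio in train and val
)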
Tokenizing & Encoding
Next, we encode our samples into tokens using the BERT tokenizer, and specify that we don’t care about case sensitivity by using ‘bert-base-uncased’:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)
And feed our train/val data into the tokenizer:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].tweet.values,
    add_special_tokens=True,     # BERT's way of knowing sentence start/end
    return_attention_mask=True,  # marks which positions of the fixed-length (256) input are real tokens vs. padding
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

# validation
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].tweet.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)
Then, we split our encoded data into input IDs, attention masks, and labels before wrapping them in TensorDatasets:
import torch  # needed for torch.tensor below

input_ids_train = encoded_data_train['input_ids']
attention_mask_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

dataset_train = TensorDataset(input_ids_train,
                              attention_mask_train,
                              labels_train)

input_ids_val = encoded_data_val['input_ids']
attention_mask_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_val = TensorDataset(input_ids_val,
                            attention_mask_val,
                            labels_val)
Input IDs are simply the mappings between tokens and their respective IDs. The attention mask prevents the model from attending to padded tokens. Special tokens simply let BERT know where a sentence starts and ends (the [CLS] and [SEP] tokens described earlier).
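As a small standalone illustration (not part of the pipeline), here is what the tokenizer produces for a short input padded to length 8, using the same flags as above:
demo = tokenizer.batch_encode_plus(
    ['great movie'],
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=8,
    return_tensors='pt'
)
print(demo['input_ids'])       # IDs for [CLS] great movie [SEP], then padding (0s)
print(demo['attention_mask'])  # tensor([[1, 1, 1, 1, 0, 0, 0, 0]])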
Pre-Trained BERT Model
Let’s fetch our BERT model (~450 MB) and declare a few parameters. This pre-trained model comes fully loaded with weights and all necessary configs. Note that we use the base version (‘bert-base-uncased’), which is much lighter for both training and inference.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict),
    output_attentions=False,    # don't return the attention weights
    output_hidden_states=False  # don't return all hidden states
)
Transfer learning was first popularized in the field of computer vision. For example, ResNet models are pre-trained on ImageNet, a broad image-classification task through which generic image patterns are deeply learned. Then, fine-tuning is performed to specialize the model in a more specific task, which is associated with a new output layer. For example, a general-purpose CNN can be fine-tuned to flag medical images with unusual signs and automate disease screening!
With enough access to resources, there is merit in training a model from scratch. Recently, Bloomberg developed a massive NLP model from scratch, specialized for financial tasks, called BloombergGPT. This model outperforms other large language models on financial NLP tasks (paper here). And of course, this model is not public :(
In our dataloader, I’ve used a batch size of 32; for this step, ensure your machine does not have memory limitations when loading batches. You can always afford a larger batch size for validation, since there are fewer computational steps (i.e. no backward pass). RandomSampler() shuffles the training data, which prevents our model from learning patterns based on sample order.
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # no shuffling needed for evaluation
    batch_size=batch_size
)
Later, during training, BERT can cause memory errors. This usually indicates the need for more powerful hardware: a GPU with more on-board RAM, or a TPU. A simple workaround if you don’t have such hardware on hand is to decrease your batch_size to free up memory, which will in turn decrease your training speed.
Optimizer and Scheduler
We have a pretty standard setup here. Remember, the type of optimizer, number of epochs, and learning rate can all be tuned; feel free to play around with these parameters. From what I’ve seen, BERT fine-tuning does not always benefit from a large number of epochs (~5+).
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    eps=1e-8
)

epochs = 5

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train)*epochs
)
Performance Metrics
Now, we need to define some metrics to track performance. At times, accuracy alone may not be sufficient. We use the weighted F1-score, which combines the precision and recall scores and weights each class by its frequency.
The accuracy_per_class function will simply assess performance on our labels (positive/negative) individually.
import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    labels_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {labels_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])} / {len(y_true)}\n')
Training (fine-tuning)
As a general summary, model fine-tuning includes:
- Loading the pre-trained BERT model
- Adding a new classification layer, where the number of outputs equals the number of labels
- Optionally freezing the pre-trained weights so only the new classification layer trains (in this guide we fine-tune all layers; see the sketch after this list)
- Training & Evaluating
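If you do want the freezing variant, here is a minimal sketch (my own assumption of how you would do it in PyTorch, not what the training loop below does):
# Freeze the pre-trained encoder so only the classification head
# (model.classifier) receives gradient updates during training.
for param in model.bert.parameters():
    param.requires_grad = False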
Before running the training loop, we set random seeds for reproducibility, and tell our model which device to run on.
import random
seed_val = 6
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)
Next, we write our evaluation function, which we will use to measure loss and collect predictions on our validation dataset:
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]
                  }
        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals
Finally, it’s time for some fine-tuning magic. In our training loop, each batch will gradually update the weights of our base BERT model. A few notes on the following code block:
- progress_bar gives feedback on progress
- model.train() moves the model into training mode
- model.zero_grad() resets accumulated gradients to zero (PyTorch accumulates gradients by default, a behavior mainly useful for certain RNN training schemes)
- outputs = model(**inputs) gives BERT our inputs, a simple way to unpack our dictionary into the model
- torch.nn.utils.clip_grad_norm_ rescales the gradients so their norm is at most 1.0; this technique helps prevent odd behavior such as ‘exploding gradients’
- model.save_pretrained() saves a model checkpoint every epoch
- after each epoch, our evaluate function will compute our validation loss
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)

    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        outputs = model(**inputs)  # feed BERT our inputs
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()  # backpropagation step

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # step optimizer & scheduler
        optimizer.step()
        scheduler.step()

        # update progress bar with the current batch loss
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    model.save_pretrained(f'Models/BERT_ft_epoch{epoch}/')  # save model every epoch

    tqdm.write(f'\nEpoch {epoch}')
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')
Over consecutive training epochs, we hope to see a decrease in loss. When training loss keeps decreasing but validation loss starts to go up, you are likely overfitting to the training set. This indicates a sacrifice of generalizability, where the model is memorizing the data instead of learning the general problem at hand.
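One common safeguard is early stopping. Here is a minimal sketch (my own addition, not part of the original loop) of how you might wrap the training pass above:
best_val_loss = float('inf')
patience, bad_epochs = 2, 0

for epoch in range(1, epochs + 1):
    # ... run one epoch of the training pass shown above ...
    val_loss, _, _ = evaluate(dataloader_val)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        model.save_pretrained('Models/BERT_best/')  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss is rising: likely overfitting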
Evaluation
Finally, we have a model. Let’s throw some stuff at it and run some inferences. Since we saved checkpoints with save_pretrained(), we can load the fine-tuned weights straight from a checkpoint directory and move the model to the CPU/GPU:
model = BertForSequenceClassification.from_pretrained(
    'Models/BERT_ft_epochX/',  # replace X with the epoch you want to load
    num_labels=len(label_dict),
    output_attentions=False,
    output_hidden_states=False
)
model.to(device)
And check model accuracy by class:
_, predictions, true_vals = evaluate(dataloader_val)
accuracy_per_class(predictions, true_vals)
For running inferences on custom text, I recommend constructing a classifier with pipeline() from the transformers library, by simply providing it with the model directory and tokenizer:
from transformers import pipeline

tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)

model_path = './Models/BERT_ft_epoch1/'
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)
Provide it with some text, and print the prediction/confidence:
label_map = {'LABEL_0': 'Negative', 'LABEL_1': 'Positive'}
result = classifier("This is the best BERT guide")
print("Predicted label:", label_map[result[0]['label']])
print("Score:", round(result[0]['score'],3))
That’s a pretty simple input. Let’s see how our model handles a trickier input:
result = classifier("I'm disapointed that it has taken so long to discover
this fantastic BERT guide")
print("Predicted label:", label_map[result[0]['label']])
print("Score:", round(result[0]['score'],3))
This one is a bit subjective; the semantics can easily have alternate interpretations. This is a known problem in sentiment analysis, where models have difficulty detecting a change in context, as in this case, where we have both negative and positive tones in one input.
Conclusion
In summary, we:
- Introduced BERT + some basic model background
- Used the Transformers & PyTorch Python libraries with our own data
- Discussed how to train/evaluate a pre-trained model for a new task, and hopefully, I highlighted the importance of transfer learning
- Worked through a practical example of how to use BERT from start to finish!
There are still many opportunities and problems to solve in sentiment analysis. Many shortcomings result from data quality and data labeling pipelines. Data quality is always key.
Sentiment analysis concepts and techniques can provide great utility in real-world applications, especially when combined with web-scraping. I intend to explore these in the future, for example, sentiment analysis on the current news cycle to predict market or stock behavior.
I hope you were able to follow along, and learn a thing or two, or build your own project. Happy NLP modeling and opinion mining! 🙂
You can find my full GitHub repo, along with the data I used, here:
https://github.com/kulasbart/BERT_deeplearning_sentiment
I noticed other datasets online, some with more diverse emotion labels (angry/sad/neutral/etc.); a lot can be found on Kaggle as well. Please leave a comment if you have any questions or discussion points!
If you enjoyed the post, please consider following me on Medium 🙂