Comparative analysis of sentiment analysis algorithms

nlp
python
transformers
sentiment-analysis
vader
textblob
Author

wovago

Published

February 22, 2023

Abstract

In this notebook we will perform a comparative analysis between various sentiment analysis approaches.

Introduction

For this project we will perform a comparative analysis between various sentiment analysis approaches. The scope of this study is to compare out-of-the-box performance of various classifiers in order to establish some classifier baselines for future studies. To perform the comparison we will run various algorithms on a drug reviews data set and compute several metrics per classifier.

More specifically, we will compare the following approaches to sentiment analysis:

  • Rule-based sentiment analysis: For this approach we will compare two rule- and lexicon-based sentiment classifiers, i.e. VADER and TextBlob.

  • Feature-based sentiment analysis: For this approach we will convert the reviews into features using TF-IDF scores and then train a standard ML classifier to perform sentiment analysis. The classifiers we will use are Naive Bayes, Logistic Regression and Support Vector Machines (SVMs).

  • Embedding-based sentiment analysis: For this approach we will embed the review words using a pretrained language model and then use a transformer classifier to perform sentiment analysis. The language model we will use for this analysis is a DistilBERT model that was fine-tuned for sentiment analysis.

As evaluation metrics we will use Precision, Recall, f1-score, AUC and Jaccard Index, which we will compute for each classifier using the test data set.

Data Preprocessing

code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import string
import nltk
nltk.download('punkt')

from collections import defaultdict
from flair.models import TextClassifier
from flair.data import Sentence
from imblearn.under_sampling import RandomUnderSampler
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import jaccard_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from textblob import TextBlob
from tqdm import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from yellowbrick.target import ClassBalance
from yellowbrick.classifier import ClassPredictionError, ClassificationReport
from yellowbrick.classifier import ConfusionMatrix, ROCAUC
from yellowbrick.style import set_palette
from yellowbrick.style.palettes import PALETTES, SEQUENCES, color_palette
set_palette('yellowbrick')

pd.set_option('display.max_rows', 100)

%matplotlib inline
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Data Import

We will start our analysis by importing the drug reviews from json files.

code
def load_reviews(stage, file):
    # Loading data from multi-line jsonl files, so set 'lines=True'
    df = pd.read_json(file, lines=True).convert_dtypes()
    df = df.astype({col: 'int32' for col in df.select_dtypes('int64').columns})
    df['drugName'] = df['drugName'].astype('category')
    df['condition'] = df['condition'].astype('category')
    df['condition'] = df['condition'].str.replace(r'disorde\b', 'disorder', regex=True) # Correct typo ('disorde') in data set
    df.rename(columns={'usefulCount': 'useful_count', 'drugName': 'drug_name'}, inplace=True)
    df.set_index('patient_id',inplace=True)
    df.drop_duplicates(inplace=True, subset=['review'])
    print(f'Number of unique reviews in {stage} set:', df.shape[0])
    return(df)
code
train_df = load_reviews("training", "../data/drug_review_train.jsonl")
val_df = load_reviews("validation", "../data/drug_review_validation.jsonl")
test_df = load_reviews("test", "../data/drug_review_test.jsonl")

print(
    f"\n{train_df.shape[0] + val_df.shape[0] + test_df.shape[0]} unique drug reviews imported"
)
Number of unique reviews in training set: 84138
Number of unique reviews in validation set: 26054
Number of unique reviews in test set: 41467

151659 unique drug reviews imported

Let’s quickly check that the data does not contain any NA values.

code
if sum(df.isna().sum().sum() for df in [train_df, val_df, test_df]) == 0:
    print("No missing values found!")
No missing values found!

Label creation

We will first preprocess our data and put it in a tidy format for subsequent analyses. To create training labels we will use the rating provided by the reviewers themselves. We will assign the label POSITIVE to ratings greater than or equal to 7 and NEGATIVE to ratings smaller than 7.

code
# Unicode strings for emoticons
HAPPY = "\U0001F642"
SAD = "\U0001F621"
STAR = "\U00002B50"

def preprocess_df(df):
    cols = [
        "drug_name",
        "condition",
        "review",
        "stars",
        "rating",
        "actual_sentiment",
        "actual_label",
        "date",
        "useful_count",
        "review_length",
    ]
    df["actual_label"] = ["POSITIVE" if x >=7 else "NEGATIVE" for x in df['rating'].tolist()]
    df["actual_label"] = df["actual_label"].astype("category")
    sentiment_labels = ["NEGATIVE", "POSITIVE"]
    df["actual_label"] = df["actual_label"].cat.reorder_categories(sentiment_labels)
    df["actual_sentiment"] = [HAPPY if x >=7 else SAD for x in df['rating'].tolist()]
    df["stars"] = [STAR * x for x in df["rating"].tolist()]
    df = df[cols]
    return df


train_df = preprocess_df(train_df)
val_df = preprocess_df(val_df)
test_df = preprocess_df(test_df)
code
# Restrict the test set to the first 1000 reviews to keep the evaluation runs manageable
test_df = test_df.head(1000)

After preprocessing we now have a column “actual_label”, a binary variable containing the sentiment labels, i.e. “POSITIVE” or “NEGATIVE”. We will use this column to train some of the classifier models, and we will also use these labels during validation and testing to assess classifier performance.

To provide a quick visual overview, we have also added some emoticons for the sentiment labels and ratings. After preprocessing, our data table looks as follows:

code
train_df.head(5)
drug_name condition review stars rating actual_sentiment actual_label date useful_count review_length
patient_id
89879 Cyclosporine keratoconjunctivitis sicca "i have used restasis for about a year now and... ⭐⭐ 2 😡 NEGATIVE 2013-04-20 69 147
143975 Etonogestrel birth control "my experience has been somewhat mixed. i have... ⭐⭐⭐⭐⭐⭐⭐ 7 🙂 POSITIVE 2016-08-07 4 136
106473 Implanon birth control "this is my second implanon would not recommen... 1 😡 NEGATIVE 2016-05-11 6 140
184526 Hydroxyzine anxiety "i recommend taking as prescribed, and the bot... ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ 10 🙂 POSITIVE 2012-03-19 124 104
91587 Dalfampridine multiple sclerosis "i have been on ampyra for 5 days and have bee... ⭐⭐⭐⭐⭐⭐⭐⭐⭐ 9 🙂 POSITIVE 2010-08-01 101 74

Create training, validation and test sets

To facilitate model building and evaluation, we will extract features and labels for the train, validation and test sets.

code
X_train = train_df['review']
y_train = train_df['actual_label']

X_val = val_df['review']
y_val = val_df['actual_label']

X_test = test_df['review']
y_test = test_df['actual_label']

Now that we have created labels for the training, validation and test sets, we can immediately have a look at the class proportions to see whether the data set is balanced with respect to the two classes.
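
Before plotting, we can get a quick numeric view of the label proportions with pandas. This is a minimal sketch using the y_train, y_val and y_test series defined above; it is independent of the yellowbrick plot that follows.

code
# Share of NEGATIVE/POSITIVE labels in each split
for name, y in {"train": y_train, "validation": y_val, "test": y_test}.items():
    print(name, y.value_counts(normalize=True).round(3).to_dict())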

code
def plot_class_balances(labels):
    assert isinstance(labels, dict)
    fig, axis = plt.subplots(1, 3, figsize=(8,3))
    for i, (k,v) in enumerate(labels.items()):
        plt.subplot(1,3,i+1)
        # Use the actual class names so the bars are labelled correctly
        visualizer = ClassBalance(labels=["NEGATIVE", "POSITIVE"])
        visualizer.fit(v)
        visualizer.finalize()
        plt.title(k.capitalize() + " labels", fontweight="bold")
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
code
plot_class_balances({"train": y_train, "validation": y_val, "test": y_test})

Rule-based sentiment analysis

Rule-based approaches to sentiment analysis typically rely on traditional NLP techniques such as parsing, stemming, tokenization, part-of-speech tagging and lexical analysis.

Two well-known tools for rule-based sentiment analysis are VADER [Hutto & Gilbert, 2014] and TextBlob [Loria, 2018], which we will present and evaluate in the next two sections.

Because rule-based sentiment analysis algorithms do not need to be trained on a training data set first, we will run both VADER and TextBlob directly on the drug reviews in the test set.

Vader

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

(taken from https://github.com/cjhutto/vaderSentiment)

After running VADER on the test data set, we will get a list with polarity scores and sentiments for all drug reviews in the test set. We will store those results in a dataframe, which will be used for further analysis later.

code
analyzer = SentimentIntensityAnalyzer()
vs_result_list = []
for sentence in tqdm(test_df['review'].tolist()):
    vs = analyzer.polarity_scores(sentence)
    vs_result_list.append(vs)
100%|██████████| 1000/1000 [00:01<00:00, 865.65it/s]
code
vs_dict = defaultdict(list)

for vs in vs_result_list:
    label = "POSITIVE" if vs["compound"] > 0 else "NEGATIVE"
    vs_dict["vader_neg"].append(vs["neg"])
    vs_dict["vader_neu"].append(vs["neu"])
    vs_dict["vader_pos"].append(vs["pos"])
    vs_dict["vader_polarity"].append(vs["compound"])
    vs_dict["vader_label"].append(label)
    emoji = HAPPY if vs["compound"] > 0 else SAD  # same threshold as the label above
    vs_dict["vader_sentiment"].append(emoji)

vader_df = pd.DataFrame(vs_dict, index=test_df.index)
vader_df.head()
vader_neg vader_neu vader_pos vader_polarity vader_label vader_sentiment
patient_id
163740 0.204 0.629 0.167 -0.5267 NEGATIVE 😡
206473 0.040 0.802 0.158 0.7539 POSITIVE 🙂
39293 0.036 0.884 0.080 0.6810 POSITIVE 🙂
97768 0.036 0.825 0.139 0.9559 POSITIVE 🙂
208087 0.065 0.802 0.133 0.6924 POSITIVE 🙂

Textblob

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

(taken from https://textblob.readthedocs.io/en/dev/)

Although TextBlob is a general NLP library, it also provides functionality to perform sentiment analysis and to compute polarity scores from text. Similar to VADER, we will run the TextBlob sentiment analysis algorithm to compute polarity scores for all drug reviews in the test set, and we will store the results in a dataframe as well.

code
textblob_scores = []
for doc in tqdm(test_df["review"].tolist()):
    # Remove sentence delimiters and newlines so that TextBlob treats each review as one sentence
    doc = (
        doc.replace(".", "")
        .replace("\n", " ")
        .replace("!", "")
        .replace("?", "")
    )
    blob = TextBlob(doc)
    
    for sentence in blob.sentences:
        textblob_scores.append(sentence.sentiment.polarity)
100%|██████████| 1000/1000 [00:01<00:00, 927.12it/s]
code
tb_dict = defaultdict(list)

for tb in textblob_scores:
    label = "POSITIVE" if tb >= 0 else "NEGATIVE"
    tb_dict["textblob_polarity"].append(tb)
    tb_dict["textblob_label"].append(label)
    emoji = HAPPY if tb >= 0 else SAD
    tb_dict["textblob_sentiment"].append(emoji)

textblob_df = pd.DataFrame(tb_dict, index=test_df.index)
textblob_df.head()
textblob_polarity textblob_label textblob_sentiment
patient_id
163740 0.000000 POSITIVE 🙂
206473 0.566667 POSITIVE 🙂
39293 0.139063 POSITIVE 🙂
97768 0.234537 POSITIVE 🙂
208087 0.341667 POSITIVE 🙂

Interlude: Vader vs. Textblob, a polarity comparison

As a small interlude, let’s have a look at how the polarity scores from VADER and TextBlob compare to each other. First we will take the VADER and TextBlob vectors with polarity scores and compute the Pearson correlation coefficient between the two vectors.

code
pcc = np.corrcoef(vader_df["vader_polarity"], textblob_df["textblob_polarity"])[1,0]
print(f"Pearson correlation coefficient: {pcc}")
Pearson correlation coefficient: 0.4983856937129934

The Pearson correlation coefficient is about 0.5, indicating only a moderate positive linear relationship between the two scores. Let’s also visualize the polarity scores for both classifiers and how they relate to each other. From the graph we can observe that the VADER polarity scores tend to be more negative, whereas the TextBlob scores tend to have a more positive polarity.

code
pd.concat(
    [vader_df["vader_polarity"], textblob_df["textblob_polarity"]], axis=1
).plot.scatter(x="textblob_polarity", y="vader_polarity", c='steelblue');

Feature-based sentiment analysis

Besides rule-based sentiment analysis, we can also perform feature-based sentiment analysis. To do so, we will transform the text of the drug reviews into a numerical representation. We can then use this numerical representation as a feature vector and train classifiers to predict the sentiment labels.

To convert the drug reviews into numerical vectors we will use term frequency–inverse document frequency (TF-IDF) scores. For the sentiment classification task, we will use some well-known classifiers such as Naive Bayes, Logistic Regression and Support Vector Machines (SVMs).

So let’s get started with the feature extraction and compute the TF-IDF scores.

Feature extraction (TF-IDF)

The TF-IDF score for a word \(i\) in a document \(j\) can be computed as follows:

\(\text{TF-IDF}_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)\)

Where

  • \(tf_{i,j}\) = number of occurrences of \(i\) in document \(j\)
  • \(df_i\) = number of documents containing \(i\)
  • \(N\) = total number of documents

Although we could use a simple bag-of-words model or raw term frequencies, we will use TF-IDF because it conveys information about the importance of a word in the corpus (and hence its importance as a feature).
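
To make the formula concrete, here is a small, self-contained sketch using a hypothetical three-document toy corpus (not the drug review data). It computes the textbook TF-IDF score for one word by hand and compares it with scikit-learn. Note that TfidfVectorizer uses the natural logarithm and adds 1 to the idf term (and by default also applies smoothing and L2 normalisation), so its scores differ slightly from the plain formula above.

code
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with three tiny "documents"
toy_corpus = [
    "the drug worked well",
    "the drug caused headaches",
    "no side effects at all",
]

# Textbook TF-IDF for the word "drug" in document 0
tf = toy_corpus[0].split().count("drug")                # occurrences of "drug" in document 0
df = sum("drug" in doc.split() for doc in toy_corpus)   # number of documents containing "drug"
N = len(toy_corpus)                                      # total number of documents
print("Textbook TF-IDF:", tf * np.log(N / df))

# The same corpus through scikit-learn (smoothing and normalisation disabled for comparison);
# scikit-learn computes tf * (ln(N / df) + 1), hence the small difference
vec = TfidfVectorizer(smooth_idf=False, norm=None)
X = vec.fit_transform(toy_corpus)
print("scikit-learn TF-IDF:", X[0, vec.vocabulary_["drug"]])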

To compute the TF-IDF scores for the drug reviews, we will use scikit-learn’s TfidfVectorizer.

code
tf_idf = TfidfVectorizer(ngram_range=(1, 3), binary=True, smooth_idf=False)
X_train_tfidf = tf_idf.fit_transform(X_train)
X_val_tfidf = tf_idf.transform(X_val)
X_test_tfidf = tf_idf.transform(X_test)

Now that we have computed the TF-IDF scores, we can train the different classifier models using the TF-IDF scores from the training set and the associated labels we created earlier.

Before training the selected classifiers, we will define some auxiliary functions that will compute and visualize model performance.

code
def fit_model(model, prefix):

    model.fit(X_train_tfidf, y_train)
    pred = model.predict(X_test_tfidf)

    sentiment = [HAPPY if x=="POSITIVE" else SAD for x in pred]

    df = pd.DataFrame(
        {f"{prefix}_label": pred, f"{prefix}_sentiment": sentiment}, index=test_df.index
    )
    
    return df
code
y_train_bin = np.array([1 if x=="POSITIVE" else 0 for x in y_train])
y_test_bin = np.array([1 if x=="POSITIVE" else 0 for x in y_test])


def plot_performance(model):
    # Create a single 2x2 grid of axes for the four yellowbrick visualizers
    fig, axes = plt.subplots(2, 2, figsize=(14, 14))

    cl = ["NEGATIVE", "POSITIVE"]
    visualgrid = [
        ClassPredictionError(model, classes=cl, ax=axes[0][0]),
        ConfusionMatrix(model, classes=cl, ax=axes[0][1]),
        ClassificationReport(model, classes=cl, ax=axes[1][0]),
        ROCAUC(model, classes=cl, ax=axes[1][1], binary=True),
    ]

    for viz in visualgrid:
        viz.fit(X_train_tfidf, y_train_bin)
        viz.score(X_test_tfidf, y_test_bin)
        viz.finalize()

    plt.tight_layout()
    plt.show()

Naive Bayes Classifier

As a first classifier, we will train a Naive Bayes classifier. Note that we set weighted class priors because the class labels are imbalanced.

code
nb_model = MultinomialNB(class_prior=[0.9, 0.1])
plot_performance(nb_model)

code
nb_df = fit_model(nb_model, prefix='nb')
nb_df.head(10)
nb_label nb_sentiment
patient_id
163740 POSITIVE 🙂
206473 POSITIVE 🙂
39293 POSITIVE 🙂
97768 POSITIVE 🙂
208087 NEGATIVE 😡
215892 POSITIVE 🙂
169852 POSITIVE 🙂
23295 POSITIVE 🙂
71428 NEGATIVE 😡
196802 NEGATIVE 😡

Logistic regression

Similarly, we will train a logistic regression model.

code
lr_model = LogisticRegression(solver='liblinear', multi_class='auto')
plot_performance(lr_model)

code
lr_df = fit_model(lr_model, prefix='lr')
lr_df.head(10)
lr_label lr_sentiment
patient_id
163740 POSITIVE 🙂
206473 POSITIVE 🙂
39293 POSITIVE 🙂
97768 POSITIVE 🙂
208087 POSITIVE 🙂
215892 NEGATIVE 😡
169852 POSITIVE 🙂
23295 POSITIVE 🙂
71428 NEGATIVE 😡
196802 NEGATIVE 😡

Support Vector Machines

As a final machine learning model, we will fit an SVM classifier. Because SVMs do not scale well with the number of training examples (training time grows roughly quadratically with the number of samples), we will only train the SVM on the first 10,000 drug reviews, using a small custom pipeline that recomputes the TF-IDF features on this subset.

code
svm_pipeline = Pipeline(
    [
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SVC(kernel='linear', C=1, class_weight='balanced')) 
    ]
)
code
NR_EXAMPLES = 10000
svm_clf = svm_pipeline.fit(
    train_df.head(NR_EXAMPLES)["review"], train_df.head(NR_EXAMPLES)["actual_label"]
)
code
svm_pred = svm_clf.predict(test_df["review"])
svm_sentiment = [HAPPY if x == "POSITIVE" else SAD for x in svm_pred]
svm_df = pd.DataFrame(
    {"svm_label": svm_pred, "svm_sentiment": svm_sentiment}, index=test_df.index
)
svm_df.head(10)
svm_label svm_sentiment
patient_id
163740 POSITIVE 🙂
206473 POSITIVE 🙂
39293 POSITIVE 🙂
97768 POSITIVE 🙂
208087 POSITIVE 🙂
215892 NEGATIVE 😡
169852 POSITIVE 🙂
23295 POSITIVE 🙂
71428 NEGATIVE 😡
196802 NEGATIVE 😡

Embedding-based sentiment analysis

Another approach to sentiment analysis is to use word embeddings. We can embed the words in the drug reviews using a pretrained language model and then use a transformer model to predict the sentiment labels from the word embeddings. Compared to the previous TF-IDF + classifier approach, we not only get more context from the language model embeddings, but we also get more sentence context because BERT-style models are bidirectional. So this embedding-based approach could possibly improve sentiment predictions compared to the previous approaches.

To create the word embeddings we will use the Flair NLP library, which downloads the pretrained model and then creates the embeddings and the transformer classifier. Flair uses sentiment-en-mix-distillbert_4 as its pretrained sentiment model; it is based on the DistilBERT language model and was further fine-tuned for sentiment analysis.

code
classifier = TextClassifier.load('en-sentiment')
2023-02-24 02:50:21,026 loading file /home/ubuntu/.flair/models/sentiment-en-mix-distillbert_4.pt
code
fl_pred = []
for doc in tqdm(test_df["review"].tolist()):
    
    # Remove sentence delimiters and newlines before passing the review to the classifier
    doc = (
        doc.replace(".", "")
        .replace("\n", " ")
        .replace("!", "")
        .replace("?", "")
    )
    
    sentence = Sentence(doc)
    classifier.predict(sentence)
    fl_pred.append(sentence.labels[0].to_dict()['value'])
    
fl_sentiment = [HAPPY if x == "POSITIVE" else SAD for x in fl_pred]
    
fl_df = pd.DataFrame(
    {"db_label": fl_pred, "db_sentiment": fl_sentiment}, index=test_df.index
)

fl_df.head(10)
    
100%|██████████| 1000/1000 [05:12<00:00,  3.20it/s]
fl_label fl_sentiment
patient_id
163740 POSITIVE 🙂
206473 POSITIVE 🙂
39293 POSITIVE 🙂
97768 POSITIVE 🙂
208087 NEGATIVE 😡
215892 NEGATIVE 😡
169852 POSITIVE 🙂
23295 NEGATIVE 😡
71428 NEGATIVE 😡
196802 NEGATIVE 😡

Model comparison

Now that we have run all our models, we are finally ready to compare and evaluate model performance of the various classifiers.

Merge results

We already have the model scores computed on the test set for all classifiers, so let’s put everything together in one data frame to facilitate further comparisons.

code
results_df = pd.concat(
    [
        test_df[["review", "stars", "rating", "actual_label", "actual_sentiment"]],
        vader_df[["vader_label", "vader_polarity", "vader_sentiment"]],
        textblob_df[["textblob_sentiment", "textblob_label", "textblob_polarity"]],
        lr_df[["lr_sentiment", "lr_label"]],
        nb_df[["nb_sentiment", "nb_label"]],
        fl_df[["fl_sentiment", "fl_label"]],
        svm_df[["svm_sentiment", "svm_label"]],        
    ],
    axis=1,
)

We will also add additional columns with binary labels, as it will make it easier to score the model predictions later on.

code
label_cols = results_df.filter(regex=("_label")).columns.to_list()

for l in label_cols:
    new = l.replace('label', 'binlabel')
    results_df[new] = [
        1 if x == "POSITIVE" else 0 for x in results_df[l]
    ]

results_df.filter(regex=("_binlabel")).head()
actual_binlabel vader_binlabel textblob_binlabel lr_binlabel nb_binlabel fl_binlabel svm_binlabel
patient_id
163740 1 0 1 1 1 1 1
206473 1 1 1 1 1 1 1
39293 1 1 1 1 1 1 1
97768 1 1 1 1 1 1 1
208087 0 1 1 1 0 0 1

After merging all results, we can have a look at the predictions made by all classifiers.

code
selected_cols = results_df.filter(regex=("_sentiment")).columns.to_list()
display(results_df[selected_cols].head(10).style.hide(axis="index"))
actual_sentiment vader_sentiment textblob_sentiment lr_sentiment nb_sentiment fl_sentiment svm_sentiment
🙂 😡 🙂 🙂 🙂 🙂 🙂
🙂 🙂 🙂 🙂 🙂 🙂 🙂
🙂 🙂 🙂 🙂 🙂 🙂 🙂
🙂 🙂 🙂 🙂 🙂 🙂 🙂
😡 🙂 🙂 🙂 😡 😡 🙂
😡 😡 😡 😡 🙂 😡 😡
🙂 🙂 🙂 🙂 🙂 🙂 🙂
🙂 🙂 🙂 🙂 🙂 😡 🙂
😡 😡 😡 😡 😡 😡 😡
😡 😡 😡 😡 😡 😡 😡

Create confusion matrices

To make it easier to compare model performance for all classifiers we will create a large image that combines all confusion matrices for the individual classifiers into a single graph.

code
labels = [0, 1]

clf_dict = {
    'vader': 'Vader Sentiment',
    "textblob": 'TextBlob',
    "lr": "Logistic Regression",
    'nb': "Naive Bayes",
    'svm': "SVM",
    'fl': 'DistilBERT'
}

cm_dict = {}
for k, v in clf_dict.items():
    new = k + "_binlabel"
    cm_dict[k] = confusion_matrix(
        results_df["actual_binlabel"], results_df[new], labels=labels
    )
code
def plot_confusion_matrix(
    cf_matrix,
    cmap="RdPu",
    font_color="steelblue",
    edge_color="violet",
    title="Confusion Matrix",
    xlab="Predicted",
    ylab="Actual",
):
    group_names = [
        "True Negatives",
        "False Positives",
        "False Negatives",
        "True Positives",
    ]
    group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
    group_percentages = [
        "{0:.2%}".format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)
    ]
    labels = [
        f"{v1}\n\n{v2} ({v3})"
        for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
    ]
    labels = np.asarray(labels).reshape(2, 2)
    axis_labels = ["Negative", "Positive"]
    fig = sns.heatmap(
        cf_matrix,
        annot=labels,
        xticklabels=axis_labels,
        yticklabels=axis_labels,
        fmt="",
        cmap=cmap,
        linewidths=0.5,
        linecolor=edge_color,
        clip_on=False,
    )
    for tick_label in fig.axes.get_xticklabels():
        tick_label.set_color(font_color)
        tick_label.set_fontsize("12")
    for tick_label in fig.axes.get_yticklabels():
        tick_label.set_color(font_color)
        tick_label.set_fontsize("12")
    fig.collections[0].colorbar.set_label(
        "Counts", fontweight="bold", color=font_color, labelpad=-30, y=1.06, rotation=0
    )
    plt.setp(
        plt.getp(fig.collections[0].colorbar.ax.axes, "yticklabels"), color=font_color
    )
    plt.xlabel(xlab, fontweight="bold", fontsize=16, color=font_color)
    plt.ylabel(ylab, fontweight="bold", fontsize=16, color=font_color)
    plt.title(title, fontweight="bold", fontsize=20, color=font_color)
code
fig = plt.figure(figsize=(10,12))
fig.subplots_adjust(hspace=0.4, wspace=1.4)
fig.suptitle("Confusion Matrices", fontsize=24, color="steelblue")

for i, (k,v) in enumerate(cm_dict.items()):
    
    ax = fig.add_subplot(3, 2, (i+1))
    x_lab = "Predicted sentiment" if (i+1) in (5,6) else ""
    y_lab = "Actual sentiment" if (i+1) % 2 == 1 else ""
    plot_confusion_matrix(
        v,
        title=clf_dict[k],
        xlab=x_lab,
        ylab=y_lab,
    )
plt.tight_layout()


Compute performance metrics

We will also compute the following metrics to get a more global view of the performance of each individual classifier (a small numeric sanity check on a toy example follows the list):

  • Precision: \[{\displaystyle \frac{|TP|}{|TP| + |FP|}}\]

  • Recall: \[{\displaystyle \frac{|TP|}{|TP| + |FN|}}\]

  • F1 score: \[{\displaystyle \frac{2\,|TP|}{2\,|TP| + |FP| + |FN|} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}}\]

  • Jaccard index: \[{\displaystyle J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}}}\]
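
As a quick sanity check of these formulas (on made-up toy labels, not the drug review data), the sketch below computes each metric by hand from the confusion matrix counts and compares it against scikit-learn. Note that in the evaluation below we use weighted averages over both classes, whereas this check only looks at the positive class.

code
from sklearn.metrics import confusion_matrix, jaccard_score, precision_recall_fscore_support  # already imported above

# Toy ground truth and predictions (1 = POSITIVE, 0 = NEGATIVE)
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)  # positive-class Jaccard: |intersection| / |union|

prf_sklearn = precision_recall_fscore_support(y_true_toy, y_pred_toy, average="binary")
print("manual  :", precision, recall, f1, jaccard)
print("sklearn :", *prf_sklearn[:3], jaccard_score(y_true_toy, y_pred_toy))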

code
scores_df = pd.DataFrame(columns=("Classifier", "Precision", "Recall", "F1", "AUC", "Jaccard"))

for i, (k, v) in enumerate(clf_dict.items()):
    bin_label = k + "_binlabel"
    prf = precision_recall_fscore_support(
        results_df["actual_binlabel"], results_df[bin_label], average="weighted"
    )
    precision, recall, f1, _ = prf

    auc = roc_auc_score(
        results_df["actual_binlabel"], results_df[bin_label], average="weighted"
    )
    
    jac = jaccard_score(
        results_df["actual_binlabel"], results_df[bin_label], average="weighted"
    )

    scores_df.loc[i] = [v, precision, recall, f1, auc, jac]

#scores_df.set_index('Classifier', inplace=True)
scores_df.style.hide(axis='index')
Classifier Precision Recall F1 AUC Jaccard
Vader Sentiment 0.702843 0.642000 0.652326 0.663859 0.486413
TextBlob 0.680990 0.678000 0.679399 0.641566 0.524408
Logistic Regression 0.883598 0.881000 0.876879 0.838964 0.784839
Naive Bayes 0.854229 0.845000 0.834908 0.781717 0.724865
SVM 0.797173 0.786000 0.789438 0.778461 0.656351
DistilBERT 0.800855 0.627000 0.625036 0.712476 0.454605

Finally, we can plot the above metrics for all classifiers.

code
scores_long = scores_df.melt(id_vars=["Classifier"], 
                var_name='metric',
                value_vars=['Precision', 'Recall', 'F1', 'AUC', 'Jaccard'],
                value_name='score')

sns.barplot(data=scores_long, x='metric', y="score", hue="Classifier")
plt.legend(bbox_to_anchor=(1.02, 0.55), loc='upper left', borderaxespad=0);
plt.title("Performance meterics for all classifiers");

Conclusion

We can observe from the comparison graph that the logistic regression classifier has the best performance across all metrics. Overall, the feature-based methods have the best out-of-the-box performance, outperforming both the rule-based and the embedding-based methods. Among the rule-based methods, VADER and TextBlob exhibited similar performance. Somewhat surprisingly, the embedding-based model did not outperform the feature-based classifiers, but performed similarly to the rule-based methods. It should be noted, though, that the sentiment analysis model we used was trained on the IMDB movie reviews data set, so its training data might not be entirely representative of drug reviews. Because we were only interested in out-of-the-box performance for the scope of this study, we did not fine-tune the model further. However, it is expected that further fine-tuning the model on our training set of drug reviews would result in performance gains.

References

  • Hutto, C., & Gilbert, E. (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).

  • Loria, S. (2018). textblob Documentation. Release 0.15, 2(8).