Sentiment analysis & topic modeling of antidepressant drug reviews

nlp

python

transformers

huggingface

bertopic

topic-modeling

sentiment-analysis

Author

wovago

Published

February 24, 2023

Abstract

This analysis describes how to perform transformer-based sentiment analysis and topic modeling using BERTopic to analyze drug reviews.

Introduction

In this notebook we will analyze drug reviews using transformer models. More specifically, we will perform sentiment analysis on drug reviews for antidepressants using the DistilBERT language model in the first part of our analysis. In the second part we will use BERTopic to perform transformer- and c-TF-IDF- based topic modeling of drug reviews for antidepressants. For this part we will use the distilRoBERTa language model to embed the drug reviews.

Data Preprocessing

code

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import glob
import pickle

import textwrap
wrapper = textwrap.TextWrapper(width=80)

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoTokenizer, pipeline, DistilBertForSequenceClassification
from umap import UMAP
from hdbscan import HDBSCAN

import plotly.io as pio
# This ensures Plotly output works in multiple places:
# plotly_mimetype: VS Code notebook UI
# notebook: "Jupyter: Export to HTML" command in VS Code
# See https://plotly.com/python/renderers/#multiple-renderers
pio.renderers.default = "plotly_mimetype+notebook"

pd.set_option('display.max_rows', 100)

%matplotlib inline

code

### Some variables and conigurations ###

# Color map for graphs
CMAP = "Set2"

# Choose whether to rerun sentiment analysis or to load prebuilt model
COMPUTE_SENTIMENT_SCORES = False

# Pre-trained language model used for performing sentiment analysis
SENTIMENT_MODEL = 'distilbert-base-uncased-finetuned-sst-2-english'

# Choose whether to rerun topic analysis or to load prebuilt model
COMPUTE_EMBEDDINGS = False

#  Pre-train language model for embedding documents when performing topic analysis
EMBEDDING_MODEL = 'all-distilroberta-v1'

Data Import

For this analysis we will download a data set with drug reviews from Kaggle: https://www.kaggle.com/datasets/mohamedabdelwahabali/drugreview

After downloading and extracting we can import the drug reviews as follows:

code

def load_reviews(file):
    # Loading data from multi-line jsonl files, so set 'lines=True'
    df = pd.read_json(file, lines=True).convert_dtypes()
    df = df.astype({col: "int32" for col in df.select_dtypes("int64").columns})
    df["drugName"] = df["drugName"].astype("category")
    df["condition"] = df["condition"].astype("category")
    df["condition"] = df["condition"].str.replace(
        "disorde", "disorder" # Correct typo in data set
    )
    df.rename(
        columns={"usefulCount": "useful_count", "drugName": "drug_name"}, inplace=True
    )
    df.set_index("patient_id", inplace=True)
    print(f"Number of reviews in {file}:", df.shape[0])
    return df

code

data_df = pd.concat([load_reviews(f) for f in glob.glob("*.jsonl")])
print(f"\n{data_df.shape[0]} drug reviews imported")

Number of reviews in drug_review_test.jsonl: 46108
Number of reviews in drug_review_validation.jsonl: 27703
Number of reviews in drug_review_train.jsonl: 110811

184622 drug reviews imported

It seems the raw data sets contain duplicate reviews, because some reviews are included twice with both brand and compound name. See for example the following reviews, retrieved with a randomly selected review snippet.

code

!grep "i have been on brintellix for 6 months\." drug_review_train.jsonl | jq '.'

{
  "patient_id": 94112,
  "drugName": "Trintellix",
  "condition": "depression",
  "review": "\"i have been on brintellix for 6 months. started on 10mg. initially like everyone else thought it was great with the exception of nausea. it also helped with the quality of my sleep. after about a month it plataued so i went up to 20 mg but it made me extremely tired so i went back down to 10 mg. 5 months into taking this ridiculously expensive drug i felt my personality is completely changing and not for the better.  i became quick to anger, on the verge of tears, (not typical for me), with an explosive and unpredictable personality.  if that was not enough i was also very tired all the time. the personality changes cost me my job and at home well, it hurt people that i love.  going off of it i was in pain and tears for days...\"",
  "rating": 3,
  "date": "March 1, 2016",
  "usefulCount": 58,
  "review_length": 142
}
{
  "patient_id": 93133,
  "drugName": "Vortioxetine",
  "condition": "depression",
  "review": "\"i have been on brintellix for 6 months. started on 10mg. initially like everyone else thought it was great with the exception of nausea. it also helped with the quality of my sleep. after about a month it plataued so i went up to 20 mg but it made me extremely tired so i went back down to 10 mg. 5 months into taking this ridiculously expensive drug i felt my personality is completely changing and not for the better.  i became quick to anger, on the verge of tears, (not typical for me), with an explosive and unpredictable personality.  if that was not enough i was also very tired all the time. the personality changes cost me my job and at home well, it hurt people that i love.  going off of it i was in pain and tears for days...\"",
  "rating": 3,
  "date": "March 1, 2016",
  "usefulCount": 58,
  "review_length": 142
}

We can see that both reviews are exactly the same, except for drug name, which is the brand name “Trintellix” in one case and compound name “Vortioxetine” in the other case. So, we will remove those duplicate reviews. We will still normalize brand and compound names later, so for the moment it does not matter which reviews of duplicate pairs are getting removed.

code

total_reviews = data_df.shape[0]
data_df.drop_duplicates(inplace=True, subset=['review'])
percentage = (data_df.shape[0] / total_reviews ) * 100
print(f"\n{data_df.shape[0]} unique reviews found ({percentage:.2f}% of total reviews)")


110903 unique reviews found (60.07% of total reviews)

After duplicate removal, 110903 unique reviews of 184622 total reviews are kept. This amounts to 60% percentage of reviews that are being preserved after duplciate removal.

We will also check whether there are any missing values in the data set. This seems not to be the case.

code

print('Number of missing values in data frame:\n', data_df.isna().sum(), sep='')

Number of missing values in data frame:
drug_name        0
condition        0
review           0
rating           0
date             0
useful_count     0
review_length    0
dtype: int64

After we have succesfully imported the data set, the resulting data frame looks as follows:

code

data_df.head()

	drug_name	condition	review	rating	date	useful_count	review_length
patient_id
163740	Mirtazapine	depression	"i've tried a few antidepressants over the yea...	10	2012-02-28	22	68
206473	Mesalamine	crohn's disease, maintenance	"my son has crohn's disease and has done very ...	8	2009-05-17	17	48
39293	Contrave	weight loss	"contrave combines drugs that were used for al...	9	2017-03-05	35	143
97768	Cyclafem 1 / 35	birth control	"i have been on this birth control for one cyc...	9	2015-10-22	4	149
208087	Zyclara	keratosis	"4 days in on first 2 weeks. using on arms an...	4	2014-07-03	13	60

Extract reviews for antidepressant drugs

Because the aim of this analysis is to perform sentiment analysis and topic modeling on antidepressant drug reviews, we will start by filtering reviews about antidepressants. After filtering, we still have 10888 reviews about antidepressants (5,9% from the total number of reviews)

code

ad_df = data_df[data_df.condition.str.contains("depression", regex=True, na=False)]
ad_df.shape

print(
      "%s reviews found for antidepressants (%.2f%% of total) "
      % (ad_df.shape[0], (ad_df.shape[0] / data_df.shape[0]) * 100)
)

6370 reviews found for antidepressants (5.74% of total)

Because topic models work best when there are enough documents per topic (i.e. drug), we will further narrow down the selected used antidepressants to include only those antidepressants having more than 100 reviews. So we will start by retrieving a list with the most commonly used antidepressant drugs.

Note though that the data set still contains both antidepressant brand names and active compounds. So we will still need to normalize drug names. We will therefore create dictionaries to map brands to compounds and vice versa, which we can then use to normalize drug names.

code

brand2compound_dict = {
    "Abilify": "Aripiprazole",
    "Celexa": "Citalopram",
    "Cymbalta": "Duloxetine",
    "Effexor": "Venlafaxine",
    "Effexor XR": "Venlafaxine",
    "Lexapro": "Escitalopram",
    "Mirtazapine": "Mirtazapine",
    "Paxil": "Paroxetine",
    "Pristiq": "Desvenlafaxine",
    "Prozac": "Fluoxetine",
    "Remeron": "Mirtazapine",
    "Trintellix": "Vortioxetine",
    "Viibryd": "Vilazodone",
    "Wellbutrin": "Bupropion",
    "Wellbutrin XL": "Bupropion",
    "Zoloft": "Sertraline",
}

Since reviews are written by customers, those reviews are probaby more likely to contain brand names rather than compond names, so we will also create a dictionary to map compound names back to brand names, which we will use to normalize brand names for antidepressants.

code

compound2brand_dict = dict((v,k) for k,v in brand2compound_dict.items())
compound2brand_dict["Venlafaxine"] = "Effexor"
compound2brand_dict["Effexor XR"] = "Effexor"
compound2brand_dict["Mirtazapine"] = "Remeron"
compound2brand_dict["Bupropion"] = "Wellbutrin"
compound2brand_dict["Wellbutrin XL"] = "Wellbutrin"

code

ad_df.replace({"drug_name": compound2brand_dict}, inplace=True)

antidepressants = (
    ad_df.drug_name.value_counts()
    .reset_index(name="counts")
    .query("counts >= 100")
    .sort_values(by="index", ascending=True)["index"]
    .values.tolist()
)

(
    pd.DataFrame(antidepressants)
    .rename(columns={0: "Normalized brand names"})
    .style.hide(axis="index")
)

Normalized brand names
Abilify
Celexa
Cymbalta
Effexor
Lexapro
Paxil
Pristiq
Prozac
Remeron
Trintellix
Viibryd
Wellbutrin
Zoloft

We can see that all compound names are mapped to brand names. Now that we have normalized nanmes for the most commonly used antidepressants, let’s have a look at the number of retrieved reviews per drug.

code

ad_df = ad_df[ad_df['drug_name'].isin(antidepressants)]
perc = (ad_df.shape[0] / total_reviews) * 100
print(f"{ad_df.shape[0]} reviews for commonly used antidepressants found ({perc:.2f}% of total).")

5331 reviews for commonly used antidepressants found (2.89% of total).

We can also have a look at the number of reviews per antidepressant. We can observe that all drugs indeed have more than 100 reviews.

code

print("\nReviews per antidepressant:")
pd.DataFrame(ad_df['drug_name'].value_counts()).rename(columns={'drug_name': 'Nr of Reviews'})


Reviews per antidepressant:

	Nr of Reviews
Wellbutrin	645
Zoloft	623
Pristiq	528
Effexor	512
Celexa	477
Lexapro	462
Trintellix	426
Cymbalta	380
Viibryd	368
Prozac	343
Remeron	243
Abilify	163
Paxil	161

Let’s also visualize this with a simple bar chart.

code

ax = ad_df["drug_name"].value_counts().rename().iloc[::-1].plot.barh(cmap=CMAP)

ax.set_title("Nr of reviews per antidepressant", fontsize=14)
ax.set_xlabel("Nr of reviews")
ax.set_ylabel("Antidepressants");

Now that we have performed some data preprocessing, we can proceed with the actual analyses.

Sentiment Analysis

For this part of the analysis, we will perform sentiment analysis on the retained antidepressant reviews. We will use a pretrained transformer model downloaded from Huggingface. For this sentiment analysis we will use the DistilBERT language model [Sanh. et al., 2019]. DistilBERT is is a smaller and faster language model than the original, large-scale BERT language model. DistilBERT has 40% less parameters, runs 60% faster, while keeping 95% of BERT’s performance. More specifically we have used distilbert-base-uncased-finetuned-sst-2-english, which is a pretrained distilBERT model that was fine-tuned on English, uncased text data and can be used for classification tasks and sentiment analysis.

To get started with the DistilBERT language model, we will first need to initialize a tokenizer, which will prepare the inputs for model. Note that unlike traditional NLP pipelines, no additional cleaning steps such as stop word removal, lemmatization, etc need to be performed. After initalizing the tokenizer, we will load the fine-tuned DistilBERT model from a model checkpoint. Once we have initialized both the tokenizer we can create the classifier pipeline. Because we will perform sentiment analysis we will set the parameter task='sentiment-analysis'.

Also note that the reviews have different lengths, while the language model will only accept a fixed length input. Therefore we will add parameter padding=True which instructs the pipeline fill reviews with shorter length to a fixed size. Similarly, adding parameter truncation=True will instruct the pipeline to clip off reviews that have a length that is longer than the fixed size.

code

tokenizer = AutoTokenizer.from_pretrained(SENTIMENT_MODEL)
model = DistilBertForSequenceClassification.from_pretrained(SENTIMENT_MODEL)
classifier = pipeline(
    task="sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    padding=True,
    truncation=True,
)

We will also define some auxiliary functions to parse the results after running the pipeline and merge them with the previously created dataframe containing all data and annotations.

code

def parse_sentiment_results(results):
    labels = []
    scores_pos = []
    scores_neg = []
    for i, _ in enumerate(results):
        label = results[i]["label"]

        labels.append(label)
        if label == "POSITIVE":
            scores_pos.append(results[i]["score"])
            scores_neg.append(1 - results[i]["score"])
        else:
            scores_pos.append(1 - results[i]["score"])
            scores_neg.append(results[i]["score"])

    df = pd.DataFrame.from_dict(
        {"sentiment": labels, "score_pos": scores_pos, "score_neg": scores_neg}
    )
    return df

Once we have set up the pipeline and defined the auxiliary functions we are ready to run our sentiment analysis. Because this is a computationally expensive step, the output will be written to a parquet data frame after running the sentiment analysis pipeline, which subsequently can be read in for further analyses.

code

if COMPUTE_SENTIMENT_SCORES:
    results = classifier(ad_df['review'].tolist())
    sentiment_df = parse_sentiment_results(results)
    sentiment_df.index = ad_df.index.tolist()
    sentiment_df.to_parquet('sentiment.df.parquet.gzip', compression='gzip')
else:
    sentiment_df = pd.read_parquet("./sentiment.df.parquet.gzip")

Once we have the results, we can merge them with our original data frame.

code

merged_df = pd.concat([ad_df, sentiment_df], axis=1)
merged_df["compound"] = merged_df["drug_name"].replace(brand2compound_dict)
cols_reordered = merged_df.columns.insert(1, "compound")
merged_df = merged_df[cols_reordered[:-1]]
merged_df.shape

(5331, 11)

For visualization purposes, let’s add some emoji symbols as well to the data frame containing all our reviews and sentiment scores.

code

HAPPY = "\U0001F642"
SAD = "\U0001F641"
STAR = "\U00002B50"

merged_df['mood'] = [HAPPY if x=="POSITIVE" else SAD for x in merged_df['sentiment'].tolist()]
merged_df['stars'] = [STAR * x for x in merged_df['rating'].tolist()]

So, after merging and adding some emoji, the merged data frame looks as follows:

code

merged_df.head(5)

	drug_name	compound	condition	review	rating	date	useful_count	review_length	sentiment	score_pos	score_neg	mood	stars
163740	Remeron	Mirtazapine	depression	"i've tried a few antidepressants over the yea...	10	2012-02-28	22	68	NEGATIVE	0.155829	0.844171	🙁	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
141462	Lexapro	Escitalopram	depression	"i am a 22 year old female college student. i ...	9	2014-04-29	32	141	POSITIVE	0.979956	0.020044	🙂	⭐⭐⭐⭐⭐⭐⭐⭐⭐
201582	Zoloft	Sertraline	depression	"zoloft did not help me at all. i was on it f...	1	2013-01-14	51	45	NEGATIVE	0.001271	0.998729	🙁	⭐
131683	Effexor	Venlafaxine	depression	"sadly only lasted 5 days on effexor xr. the s...	1	2016-04-24	18	130	NEGATIVE	0.000938	0.999062	🙁	⭐
122089	Effexor	Venlafaxine	depression	"i was first prescribed effexor 13 years ago a...	8	2010-12-13	36	145	POSITIVE	0.969563	0.030437	🙂	⭐⭐⭐⭐⭐⭐⭐⭐

Now that we have annotated all reviews with sentiment label and scores, let’s print a few reviews with associated sentiment values to visually inspect whether the sentiment classifier works well. We will print both the 10 most positive as well as the 10 most negative reviews.

code

def display_top_reviews(n=3, sentiment="positive"):
    cols = ["review", "stars", "rating", "mood", "sentiment", "score_pos"]
    if sentiment == "positive":
        top_df = merged_df[cols].sort_values("score_pos", ascending=False).head(n)
    else:
        top_df = merged_df[cols].sort_values("score_pos", ascending=True).head(n)

    print(f"Top {n} most {sentiment} reviews:")
    with pd.option_context(
        "display.max_columns", None,
        "display.expand_frame_repr", False,
        "max_colwidth", -1,
    ):
        display(top_df[cols].style.hide_index())

code

display_top_reviews(n=10, sentiment="positive")

Top 10 most positive reviews:

review	stars	rating	mood	sentiment	score_pos
"i have been on brintellix for 6 months and feel amazing. i have energy and desire to be active and involved. i had noticed an improvement in my ability to concentrate and remember tasks. i love it and am so glad i feel whole."	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐	10	🙂	POSITIVE	0.999882
"after years of trying to find something to work on various levels regarding my depression and anxiety, it seems that i am now being brought back to life. i feel alive again. it's amazing."	⭐⭐⭐⭐⭐⭐⭐⭐⭐	9	🙂	POSITIVE	0.999876
"very very good. helps a ton with energy, motivation, and joy. i was depressed for 4 years and this was just a miracle for me. highly recommend it. also, helps you last longer in bed"	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐	10	🙂	POSITIVE	0.999871
"the first week it wasn't too good but after it was in my system..brilliant! i am happy, in control of my emotions and back studying and being the best mum i can be ...wonderful."	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐	10	🙂	POSITIVE	0.999868
"i have been taking this medicine for 6 years and am very happy with it. it is the first medication that has helped me to feel what i would guess is normal. i am able to enjoy things in my life that are good. my anxiety is less and my paranoid obsessive thoughts are much more in control. i feel like it gets my brain in the right place to where i can think clearly."	⭐⭐⭐⭐⭐⭐⭐⭐⭐	9	🙂	POSITIVE	0.999862
"i have taken pristiq for 2 weeks and it is starting to help my mood and energy level. i am so happy as this gives me hope that i can get back to normal."	⭐⭐⭐⭐⭐⭐⭐⭐⭐	9	🙂	POSITIVE	0.999861
"i am taking this medicine now and i am happy with it. i felt like a zombie the first few days but today is the start of week two and i am feeling great."	⭐⭐⭐⭐⭐⭐⭐⭐	8	🙂	POSITIVE	0.999851
"this medicine worked miracles for me! i absolutely love it! it is very effective and works in days, instead of weeks! when i first started taking it 40 mg i was really out of it and sleepy but i got used to it and now it gives me energy and puts me in a great mood! best anti-depressant/ anxiety medication ever and always remember there are side effects to every medicine."	⭐⭐⭐⭐⭐⭐⭐⭐⭐	9	🙂	POSITIVE	0.999834
"i love it i never felt this happy i will tell anybody to get on it i am calmer then ever and talkative but that's ok! i'm over the throwing up part thank god. but i love love love it!"	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐	10	🙂	POSITIVE	0.999824
"celexa is the best thing that has ever happened to me. i'm 16 and i have had a hard life and ever since i started taking celexa it has done nothing but the best."	⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐	10	🙂	POSITIVE	0.999818

code

display_top_reviews(n=10, sentiment="negative")

Top 10 most negative reviews:

review	stars	rating	mood	sentiment	score_pos
"worst medication i've ever taken. experienced every side effect except sudden death and compulsive gambling. started to feel like a zombie devoid of anything real. if you're not psychotic, don't take this!!!"	⭐	1	🙁	NEGATIVE	0.000188
"do not take!!! this medication ruined my life in the space of a month! it made me more depressed, agitated, irritable, lose my appetite, anxious beyond belief.... and that's only on the lowest dose! i cannot believe, in all my experience with depression, that an 'antidepressant' can make someone feel so depressed."	⭐	1	🙁	NEGATIVE	0.000190
"i think this ad needs to be reformulated. the early upset stomach and headaches are awful there does not seem to be any real lift from depression but a muting of your emotions. i just feel flat and listless."	⭐⭐	2	🙁	NEGATIVE	0.000191
"i'm on day 3 of zoloft i think i'm going to stop i feel angry more depressed no motivation to do anything, feel like i can't handle my kids. super dizzy and over it like 10x more suicidal then i've ever been. also kind of feel numb and like i can't think properly. not a good feeling and i don't think its worth waiting it out..."	⭐⭐	2	🙁	NEGATIVE	0.000192
"i was not happy about this drug at all. it made me feel so lazy and tired. zero effect on my mood; in fact, i was in a worse mood since i had zero energy."	⭐⭐	2	🙁	NEGATIVE	0.000196
"this drug is dangerous. the side effects are horrible and the withdrawal process to get off of the drug is arguably worse, if that is even possible. this drug should be taken off the market."	⭐	1	🙁	NEGATIVE	0.000200
"i've been taking abilify on and off for about eight years. i feel so flat. i use to be a go getter not anymore. i feel lazy with the abilify and lamictal combo. what should i do. i feel emotionally flat and people say i look lifeless."	⭐⭐⭐⭐⭐⭐	6	🙁	NEGATIVE	0.000203
"i was given the generic after taking pristiq for 5 years, this does not work for me at all, depression, anxiety and thoughts are all over the place. bennett taking the generic 2 months. have to try something else now because this is not good."	⭐⭐	2	🙁	NEGATIVE	0.000203
"my doctor switched me from being on zoloft for 2+years for anxiety/ depression, cause it stopped working. though it was rough to come off 150 mg dose of zoloft onto 5mg trintillix to current dose of 10 mg, my depression symptoms have mostly gone away. but as soon as i started 10 mg dose, i felt nauseated every time after taking it. after 1 week on 10mg, told my doctor about the nausea, she said the benefits outweighed the side effects, to give it more time. 2 months later, nausea is so bad,i feel like trash! it doesn't matter if i take it with/ without food, or the time of day i take it, it makes me nauseous all the time and vomit on several occasions. i also get severe acid reflux, so i'm about to quit this med, it's not worth it!"	⭐⭐	2	🙁	NEGATIVE	0.000208
"i have been on cymbalta for about 6 months. in this time i lost my libido completely, suffered night sweats and couldn't work out if i was hot or cold during the day. i constantly felt sick both in my head and my stomach and suffered a general loss of energy, probably due to the crazy dreams and loss of sleep. i decided to quit cymbalta cold turkey, and suffered major brain zaps and a general feeling of unwellness. the worst thing was, i didn't realise this is why i felt so sick and it was impossible to find a doctor who would take my symptoms seriously. none of them knew anything about the drug either. i had to work it out for myself. i've lost several months trying to work out why i always feel sick."	⭐	1	🙁	NEGATIVE	0.000221

The sentiment classifier seems to work well for both positive and negative reviews, so let’s tabulate and plot the percentage of positive and negative reviews per antidepressant.

code

counts_df = (
    merged_df.groupby(["drug_name", "sentiment"]).size().reset_index(name="counts")
)

counts_df["group_percentage"] = 100 * (
    counts_df["counts"] / counts_df.groupby("drug_name")["counts"].transform("sum")
)
counts_df.style.hide_index()

drug_name	sentiment	counts	group_percentage
Abilify	NEGATIVE	99	60.736196
Abilify	POSITIVE	64	39.263804
Celexa	NEGATIVE	270	56.603774
Celexa	POSITIVE	207	43.396226
Cymbalta	NEGATIVE	250	65.789474
Cymbalta	POSITIVE	130	34.210526
Effexor	NEGATIVE	356	69.531250
Effexor	POSITIVE	156	30.468750
Lexapro	NEGATIVE	253	54.761905
Lexapro	POSITIVE	209	45.238095
Paxil	NEGATIVE	107	66.459627
Paxil	POSITIVE	54	33.540373
Pristiq	NEGATIVE	314	59.469697
Pristiq	POSITIVE	214	40.530303
Prozac	NEGATIVE	186	54.227405
Prozac	POSITIVE	157	45.772595
Remeron	NEGATIVE	159	65.432099
Remeron	POSITIVE	84	34.567901
Trintellix	NEGATIVE	307	72.065728
Trintellix	POSITIVE	119	27.934272
Viibryd	NEGATIVE	238	64.673913
Viibryd	POSITIVE	130	35.326087
Wellbutrin	NEGATIVE	375	58.139535
Wellbutrin	POSITIVE	270	41.860465
Zoloft	NEGATIVE	380	60.995185
Zoloft	POSITIVE	243	39.004815

code

ax = (
    counts_df.pivot(index="drug_name", columns="sentiment", values="group_percentage")
    .sort_values("NEGATIVE", ascending=True)
    .plot.barh(stacked=True, cmap=CMAP)
)

ax.set_title("Sentiment towards antidepressants", fontsize=14)
ax.set_xlabel("Percentage negative and positive reviews")
ax.set_ylabel("Antidepressants")

plt.legend(loc=(1.04, 0.5));

Topic modeling of drug reviews using BERTopic

For the second part of this analysis we will perform topic modeling using BERTopic [Grootenhorst, 2022]. More specifically, we will apply BERTopic on reviews of antidepressant drugs to see whether we can extract side efects for each antidepressant drug. The idea is that each antidepressant will get modeled as a separate topic. Drug reviews then are documents that will get attributed to the different topics. In case there are side effects that are frequently mentioned, we expect those frequently mentioned side effects should get picked up by the topic model. So, let us start the topic modeling analysis to see whether we can pick up those side effects.

Introducing BERTopic

So what is BERTopic actually?

The following quote taken from the BERTopic homepage provides a good starting point:

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions

Furthermore, the BERTopic website also provides a nice, dedicated page that outlines the underlying algorithm that BERtopic uses to perform topic modeling.

On a high level, the BERTopic algorithm comprises a sequence of processing steps to perform topic modeling. Those steps depicted in the image underneath and are also outlined below.

A schematic overview of the BERTopic workflow (image reproduced from https://www.youtube.com/watch?v=uZxQz87lb84/)

Overview of the BERTopic workflow

Embedding Extraction: In the first step, words are converted into numerical representations by using word embeddings. Although there are several embedding options, the default option is to use BERT embeddings.
Dimensionality Reduction: In the second step, some dimensionality reduction technique is applied to map the high-dimensional word embeddings to a lower-dimnsional structure. This lower-dimensional output subsequently can then be used as input for the clustering algorithm. Although several dimensionality reduction algorithms are available, the default algorithm used by BERTopic is Uniform Manifold Approximation and Projection (UMAP).[McInnes, 2018]
Clustering: In the next step documents with similar topics are clustered together such that we can find the topics within these clusters.
Count vectorizer: All documents whithin each cluster found in the previous step are seen as a separate category and are joined together. Subsequently, BERTopic computes term frequencies.
c-TF-IDF: To compare the importance of words between documents, BERTopic will compute a class-based term frequency–inverse document frequency (TF-IDF) score using the term frequences found in the previous stage. This class-based TF-IDF is coined c-TF-IDF and can be be computed with the following formula

\(c-IF-IDF = \Large \frac{t_i}{w_i} * log \frac{m} {\Sigma_j^n t_j}\)

Where \(w_i\) is the word frequency extracted for each class and divided by the total number of words \(w\). The next term denotes the unjoined, number of documents m divided by the total frequency of word t across all classes n.

For more information about c-TF-IDF, also see https://maartengr.github.io/BERTopic/api/ctfidf.html

Applying BERTopic in practice

Embedding extraction

Although several options are available for embedding the drug reviews, we will stick to the default option, which is to use the SentenceTransformers library [Reimers et al., 2019] for generating the word embeddings. The SentenceTransformers library is based on Pytorch and provides various transformers and pretrained models.

We will use the all-distilroberta-v1 language model from HuggingFace, which will embed all drug reviews into 768-dimensional vectors. This pretrained language model was finetuned starting from the distilroberta-base language model, which in turn is based on the larger RoBERTa language model [Liu et al. 2019].

The all-distilroberta-v1 model was fine-tuned using a contrastive learning objective: given a sentence from the pair, the model has to predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset. This custom learning objectivve makes the fine-tuned all-distilroberta-v1 model particulary well-suited for clustering and semantic search and therefore is a good model choice for topic modeling.

code

docs = None
embeddings = None

if COMPUTE_EMBEDDINGS:
    docs = merged_df["review"].tolist()
    sentence_model = SentenceTransformer(EMBEDDING_MODEL)
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    with open(f"{EMBEDDING_MODEL}_embeddings.pkl", "wb") as fout:
        pickle.dump(
            {"sentences": docs, "embeddings": embeddings},
            fout,
            protocol=pickle.HIGHEST_PROTOCOL,
        )
else:
    # Load sentences & embeddings from disc
    # (if available from previous run)
    with open(f"{EMBEDDING_MODEL}_embeddings.pkl", "rb") as fin:
        stored_data = pickle.load(fin)
        docs = stored_data["sentences"]
        embeddings = stored_data["embeddings"]

Dimensionality reduction

Although several dimensionality reductions options are available (e.g. t-SNE or PCA), we will use the default UMAP algorithm for dimensionality reduction. UMAP stands for Uniform Manifold Approximation and Projection [McInnes, 2018]. UMAP is a good choice for dimensionality reduction because it keeps a significant fraction of the high-dimensional local structure in lower dimensionality. As a similarity metric we will use cosine similarity.

code

umap_model = UMAP(
    n_neighbors=6, n_components=6, min_dist=0.0, metric="cosine", random_state=42
)

Clustering

For clustering we will use the HDBSCAN clusteringalgorithm.

HDBSCAN is a density-based algorithm that works quite well with UMAP, because it preserves a lot of local structure in lower-dimensional space. Additionally, HDBSCAN will not force data outliers into clusters.

code

hdbscan_model = HDBSCAN(
    min_cluster_size=6,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

Tokenize topics

Subsequently, we will use Scikit-learn’s Countvectorizer to create a sparse representation of term frequencies, whereby we will ignore stop words.

code

vectorizer_model = CountVectorizer(stop_words="english")

c-TF-IDF

Using the term frequencies we can compute the c-IF-IDF scores using BERTopic’s ClassTfidfTransformer() method.

code

ctf_idf_model = ClassTfidfTransformer()

Create topic model

Now that we have initialized all individual modules, we can put everything together and run the BERToopic pipeline to create a topic model for the antidepressant drug revies.

code

topic_model = BERTopic(

    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctf_idf_model,
    diversity=0.5,
    nr_topics='auto',
    min_topic_size=6
)
topics, probs = topic_model.fit_transform(docs, embeddings)

Now that we have fitted our topic model, let’s have a look at the topics that the model has created.

code

with pd.option_context('display.max_rows', None):
    display(topic_model.get_topic_info())

	Topic	Count	Name
0	-1	102	-1_citalopram_anxiety_depression_years
1	0	1561	0_medication_feel_like_depression
2	1	400	1_wellbutrin_xl_depression_day
3	2	383	2_pristiq_effects_feel_taking
4	3	362	3_lexapro_anxiety_depression_feel
5	4	324	4_zoloft_anxiety_feel_depression
6	5	318	5_effexor_xr_years_withdrawal
7	6	258	6_cymbalta_depression_pain_taking
8	7	225	7_viibryd_diarrhea_effects_week
9	8	216	8_celexa_feel_depression_years
10	9	206	9_prozac_feel_im_depression
11	10	189	10_brintellix_trintellix_nausea_taking
12	11	142	11_sertraline_50mg_depression_started
13	12	105	12_abilify_depression_weight_2mg
14	13	102	13_paxil_years_depression_life
15	14	80	14_mirtazapine_sleep_30mg_15mg
16	15	74	15_bupropion_day_xl_150mg
17	16	43	16_citalopram_anxiety_taking_week
18	17	32	17_remeron_sleep_weight_gain
19	18	32	18_citalopram_medication_life_depression
20	19	23	19_escitalopram_drops_anxiety_im
21	20	22	20_citalopram_life_feel_years
22	21	18	21_ssris_ssri_granules_recommend
23	22	17	22_remeron_taste_sleep_helped
24	23	13	23_suicide_nerve_pain_anxiety
25	24	13	24_ssri_life_ssris_couch
26	25	13	25_citalopram_really_bad_depression
27	26	11	26_ssris_dopamine_norepinephrine_wb
28	27	9	27_paroxetine_daily_instead_zaps
29	28	9	28_cipralex_night_day_aripiprazole
30	29	8	29_ssris_lexepro_xl_burping
31	30	8	30_cbt_bit_cipralex_just
32	31	7	31_ssris_someones_day_diarrhea
33	32	6	32_seroquel_psychiatrist_celexa_drug

We will create some custom annotations which we can add to our graphs when visualizing the topcis.. These custom annotations will be shown when hovering over individual data points. As extra annotations we will include the drug name, review and the sentiment score as computed in the first part of the analysis. We will use some simple HTML markup to format these annotations, such that we have the drug name in the header and the drug review in the body.

code

annotated_docs = [
    "<b>"
    + f"{drug_name} ({compound})"
    + "</b><br>"
    + f"{sentiment.upper()} review: "
    + f"(pos score={score_pos * 100:.2f}, neg score={score_neg * 100:.2f})<br><br>"
    + "<br>".join(wrapper.wrap(text=review))
    for drug_name, compound, sentiment, score_pos, score_neg, review in zip(
        merged_df["drug_name"].tolist(),
        merged_df["compound"].tolist(),
        merged_df["sentiment"].tolist(),
        merged_df["score_pos"].tolist(),
        merged_df["score_neg"].tolist(),
        merged_df["review"].tolist(),
    )
]

Once we visualize the topics, we can readily see that BERTopic has created topics for the antidepressant drugs. We can see drug clusters that are clearly separated. So BERTopic has succesfully managed to identify drug topics in an unsupervised fashion.

code

# Visualize documents
topic_model.visualize_documents(
    annotated_docs,
    embeddings=embeddings,
    hide_annotations=False,
    custom_labels=True
)

Furthermore we can also plot some bar chars, which will present the topics along with their most important words. After plotting we can see that BERTopic created topics about the antidepressant drugs.

Some topics indeed seem to contain side effects, e.g.

Viibryd seems to be associated with diarrhea
Brintellix seems to be associated with nausea
Escitalopram seems to be associated with headache
Remeron, Abilify and Paxil seem to associated with weight gain

Besides side efects, we can also identify some treatment indications, e.g.

Lexapro, Zoloft, Citalopram seem to be taken to treat anxiety
Cymbalta seems to be taken to treat pain
Remeron seems to be taken to improve sleep

We can also visualize the terms associated with each topic. The following graph presents some topics together with the most important terms for each topic. The bars represent \(c-TD-IDF\) scores for the corresponding terms.

code

topic_labels = topic_model.generate_topic_labels(
    nr_words=1, topic_prefix=False, word_length=20, separator=", "
)
topic_model.set_topic_labels(topic_labels)
topic_model.visualize_barchart(top_n_topics=24, custom_labels=True, 
                                        title="Antidepressant topics")

Finally, we can annote the original documents with the created topics, which we will store in a new data frame. We will also add the drug name to the new table.

code

annotated_docs_df = topic_model.get_document_info(docs)
annotated_docs_df['drug_name'] = merged_df['drug_name'].tolist()
annotated_docs_df.head()

	Document	Topic	Name	CustomName	Top_n_words	Probability	Representative_document	drug_name
0	"i've tried a few antidepressants over the yea...	14	14_mirtazapine_sleep_30mg_15mg	mirtazapine	mirtazapine - sleep - 30mg - 15mg - 45mg - nig...	1.000000	False	Remeron
1	"i am a 22 year old female college student. i ...	0	0_medication_feel_like_depression	medication	medication - feel - like - depression - im - d...	1.000000	False	Lexapro
2	"zoloft did not help me at all. i was on it f...	4	4_zoloft_anxiety_feel_depression	zoloft	zoloft - anxiety - feel - depression - 50mg - ...	0.987644	False	Zoloft
3	"sadly only lasted 5 days on effexor xr. the s...	5	5_effexor_xr_years_withdrawal	effexor	effexor - xr - years - withdrawal - depression...	1.000000	False	Effexor
4	"i was first prescribed effexor 13 years ago a...	5	5_effexor_xr_years_withdrawal	effexor	effexor - xr - years - withdrawal - depression...	1.000000	False	Effexor

References

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.