study notes: LDA Topic Modelling

Cheryl
5 min read · May 27, 2020


“Work nourishes noble minds.” — SENECA

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique used for extracting:

  • a set of topics from the data (the number of topics is chosen by the user)
  • the distribution of topics in each text
  • the distribution of words in each topic

A topic is a set of salient keywords in a certain proportion, and a good topic model will be able to assign each text to a topic based on the words used.

Data

import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import pyLDAvis.gensim
import re
from wordcloud import WordCloud
# Download required NLTK resources (only needed once)
# nltk.download('stopwords')
# nltk.download('wordnet')
# Data of reviews by customers
df = pd.read_csv('Reviews.csv')
print('Dimension of df: ' + str(df.shape) + '\n')
print(df.loc[0, 'Text'])

>> Dimension of df: (568454, 10)

>> I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

Basic data cleaning for textual data:

  • Remove punctuation
  • Remove stopwords
  • Lemmatize
# Data processing
# Update with additional stopwords
stop = set(stopwords.words('english'))
stop.update(('br', 'food'))
# Remove punctuation
exclude = set(string.punctuation)
# Lemmatize
lemma = WordNetLemmatizer()
# Define function for data cleaning
def data_proc(x):
    x = x.rstrip()
    punc_free = ''.join(i for i in x if i not in exclude)
    stop_free = ' '.join(i for i in punc_free.lower().split() if (i not in stop) and (not i.isdigit()))
    normalized = ' '.join(lemma.lemmatize(i) for i in stop_free.split())
    return normalized
# Apply function to text
txt_proc = [data_proc(i).split() for i in df['Text']]
print('\nBefore Data Processing\n')
print(df.loc[0, 'Text'])
print('\nAfter Data Processing\n')
print(' '.join(txt_proc[0]))

>> Before Data Processing

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

>> After Data Processing

bought several vitality canned dog product found good quality product look like stew processed meat smell better labrador finicky appreciates product better

# Combine all words in df, separating reviews with a space
words = ' '.join(' '.join(i) for i in txt_proc)
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(words)
wordcloud.to_image()
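
wordcloud.to_image() renders the image inline in a Jupyter notebook. In a plain script, matplotlib can display the same figure (a minimal sketch; the figure size is an arbitrary choice):

import matplotlib.pyplot as plt
# Display the generated word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()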

An ID is generated for every unique word in the dataset. The corpus (bag-of-words) provides a mapping from each word's ID to its frequency in each document.

e.g. (0, 1) — Word ID 0 occurs once in the first review (output below)

The corpus is used as the input for the LDA model.

# Create a dictionary where each unique term is assigned an index
dictionary = corpora.Dictionary(txt_proc)
# Convert corpus into document term matrix using dictionary
corpus = [dictionary.doc2bow(i) for i in txt_proc]
# View corpus for first review
print(corpus[0])

>> [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 3), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]

For a more human-readable format:

indx = []
for i, j in corpus[0]:
    indx.append(str(dictionary[i]) + ', ' + str(j))
print(indx)

>> ['appreciates, 1', 'better, 2', 'bought, 1', 'canned, 1', 'dog, 1', 'finicky, 1', 'found, 1', 'good, 1', 'labrador, 1', 'like, 1', 'look, 1', 'meat, 1', 'processed, 1', 'product, 3', 'quality, 1', 'several, 1', 'smell, 1', 'stew, 1', 'vitality, 1']

Word ID 0 represents the word appreciates and it occurs once in the first review.
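
As an optional step (not applied in this walkthrough), gensim's filter_extremes can prune very rare and very common tokens from the dictionary before training; the thresholds below are arbitrary examples, not values from this article:

# Drop tokens appearing in fewer than 20 reviews or in more than half of them,
# keeping at most the 100,000 most frequent of the remaining terms
dictionary.filter_extremes(no_below=20, no_above=0.5, keep_n=100000)
# Rebuild the document-term matrix with the pruned dictionary
corpus = [dictionary.doc2bow(i) for i in txt_proc]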

Training the LDA model on the corpus and dictionary:

# Train LDA model
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=5, id2word=dictionary, passes=15)
# Generate 5 topics from the entire dataset, showing the top 5 words in each topic with their weights
topic = ldamodel.print_topics(num_topics=5, num_words=5)
for i in topic:
    print(i)

>> (0, '0.021*"dog" + 0.014*"treat" + 0.010*"cat" + 0.010*"love" + 0.010*"like"')
(1, '0.018*"taste" + 0.017*"like" + 0.013*"good" + 0.012*"flavor" + 0.011*"chocolate"')
(2, '0.048*"tea" + 0.021*"taste" + 0.017*"drink" + 0.017*"like" + 0.016*"flavor"')
(3, '0.044*"coffee" + 0.017*"flavor" + 0.016*"like" + 0.014*"cup" + 0.014*"taste"')
(4, '0.017*"product" + 0.015*"amazon" + 0.014*"price" + 0.012*"store" + 0.010*"bag"')

Topic 0 consists of keywords such as dog, treat, cat, love and like, so it is most likely about pet food (mainly dog/cat).

Overall, the entire dataset can be categorised into 5 main topics:

  • Product type — animal food, chocolate, tea, coffee
  • Purchase location — Amazon

The numbers represent the weight of each word, i.e. its importance to the topic. LDA can then assign each review to a topic, which is useful for further analysis.

To view the classification and probabilities of topics for each review:

for i in range(0, 3):
    print(ldamodel[corpus[i]])

>> [(0, 0.7747076), (4, 0.19869387)]
[(0, 0.105494626), (1, 0.26752147), (2, 0.011341258), (3, 0.01122728), (4, 0.6044154)]
[(0, 0.10473981), (1, 0.53593975), (2, 0.060168535), (3, 0.04605796), (4, 0.25309402)]

  • Review 1: 77% in Topic 0, 20% in Topic 4
  • Review 2: 60% in Topic 4, 27% in Topic 1
  • Review 3: 54% in Topic 1, 25% in Topic 4
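
Each review can also be tagged with its single most likely topic for downstream analysis (a minimal sketch; the dominant_topic column name is my own):

# Assign each review its most probable topic
dominant = []
for bow in corpus:
    doc_topics = ldamodel.get_document_topics(bow)
    # Keep the topic id with the highest probability
    dominant.append(max(doc_topics, key=lambda t: t[1])[0])
df['dominant_topic'] = dominant
print(df['dominant_topic'].value_counts())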

To measure the usefulness of the topic model:

# Compute perplexity
print('\nPerplexity: ', ldamodel.log_perplexity(corpus))

# Compute coherence score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=txt_proc,
                                     dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

>> Perplexity: -7.806956254733788

>> Coherence Score: 0.42355554276366475

A lower perplexity indicates better generalisation performance. Note that gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself, which is why the value above is negative. Perplexity can also deviate from human judgement: a model may score well while clustering words with vastly different meanings, producing ambiguous, non-interpretable topics.

The coherence score evaluates the semantic similarity among the top keywords within each topic, helping to determine whether topics are genuinely interpretable or merely artefacts of statistical inference.
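
Coherence also offers a practical way to choose num_topics: train a model for several candidate counts and keep the one with the highest score. A rough sketch (the candidate values are arbitrary choices, and each fit is slow on the full dataset):

# Compare coherence across candidate topic counts
for k in (3, 5, 7, 10):
    m = Lda(corpus, num_topics=k, id2word=dictionary, passes=15)
    cm = CoherenceModel(model=m, texts=txt_proc, dictionary=dictionary, coherence='c_v')
    print(str(k) + ' topics -> coherence: ' + str(cm.get_coherence()))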

Visualising the topic model:

lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

Each bubble represents a topic:

  • Size: prevalence and importance of topic within dataset
  • Proximity: clustering/overlapping of bubbles suggests topics are not clearly differentiated

The bar chart on the right shows the top 30 most salient keywords in the dataset by count.

Selecting a bubble updates the chart to show only the salient keywords for that topic, highlighted in red.
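
The interactive chart can also be exported as a standalone HTML file for sharing outside the notebook (the filename is an arbitrary choice):

# Save the visualisation to a self-contained HTML file
pyLDAvis.save_html(lda_display, 'lda_topics.html')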
