
Analyzing Text Data with Topic Modeling

March 27, 2024

What is Topic Modeling?

  • Manually reading hundreds of thousands of texts to gather insights takes time (a lot of time)
  • Useful method for automatically finding themes in large uncategorized groups of text
  • Applications include:
    • Analyzing customer reviews or feedback
    • Categorizing news articles
    • Social media analysis, such as analyzing political campaign ads
    • Gene expression analysis and other applications in bioinformatics
    • Categorizing bills
    • Analyzing themes in free form surveys


Dataset

The original dataset contains roughly 38,000 CNN news articles published between 2011 and 2022.
The dataset used in this tutorial is a sample containing 45% of the articles from the original dataset.

See more details here
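If you would like to build a similar sample yourself from the full dataset, something like the following would do it (a rough sketch: the input file name and random seed are assumptions, not the code actually used to create this sample):

import pandas as pd

# load the full CNN articles dataset (file name is hypothetical)
full_df = pd.read_csv('./data/CNN_articles_full.csv')

# draw a reproducible 45% sample and save it
sample_df = full_df.sample(frac=0.45, random_state=42)
sample_df.to_csv('./data/CNN_articles_sample.csv', index=False)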

import warnings
warnings.filterwarnings('ignore')
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import re
# load dataset and preview
df = pd.read_csv('./data/CNN_articles_sample.csv')
df.head()
Unnamed: 0.1 Unnamed: 0 text label
0 16381 16381 Story highlightsJenson Button says there's a "... 5
1 24435 24435 Story highlightsF1's drivers have expressed th... 5
2 12801 12801 Story highlightsThe Georgia Dome is scheduled ... 5
3 12482 12482 (CNN)There are few places on Earth that human... 3
4 3 3 Story highlightsLondon's Metropolitan Police s... 3
  • 0 - Business
  • 1 - Entertainment
  • 2 - Health
  • 3 - News
  • 4 - Politics
  • 5 - Sport
# add a column for section name 
section_map = {0: 'business',
               1: 'entertainment',
               2: 'health',
               3: 'news',
               4: 'politics',
               5: 'sport'}

df['section_name'] = df['label'].map(section_map)
df.head()
Unnamed: 0.1 Unnamed: 0 text label section_name
0 16381 16381 Story highlightsJenson Button says there's a "... 5 sport
1 24435 24435 Story highlightsF1's drivers have expressed th... 5 sport
2 12801 12801 Story highlightsThe Georgia Dome is scheduled ... 5 sport
3 12482 12482 (CNN)There are few places on Earth that human... 3 news
4 3 3 Story highlightsLondon's Metropolitan Police s... 3 news
plt.style.use('fivethirtyeight')
# let's look at the distribution of section labels
fig, ax = plt.subplots()
counts = df['section_name'].value_counts()
# use the value_counts index as labels so they always match the slice order
ax.pie(counts, labels=counts.index, textprops={'size': 'smaller'}, 
       autopct='%1.1f%%', pctdistance=1.15, labeldistance=.2)
plt.show()

[Figure: pie chart of the article distribution across sections]

# find any empty articles 
empty_rows = df[df['text'].isnull()]
empty_rows
Unnamed: 0.1 Unnamed: 0 text label section_name
4714 27580 27580 NaN 3 news
14711 14100 14100 NaN 3 news
15938 36067 3849 NaN 3 news
# remove empty rows from the dataframe (copy to avoid chained-assignment warnings)
df_cleaned = df.dropna().copy()
df_cleaned.head()
Unnamed: 0.1 Unnamed: 0 text label section_name
0 16381 16381 Story highlightsJenson Button says there's a "... 5 sport
1 24435 24435 Story highlightsF1's drivers have expressed th... 5 sport
2 12801 12801 Story highlightsThe Georgia Dome is scheduled ... 5 sport
3 12482 12482 (CNN)There are few places on Earth that human... 3 news
4 3 3 Story highlightsLondon's Metropolitan Police s... 3 news
# lowercase the text and remove leading and ending spaces
df_cleaned['text'] = df_cleaned['text'].apply(
                        lambda x: x.lower().strip())
# remove potential noise: "(cnn)" or "(cnn <section>)" markers
# and the leading "story highlights" text
cnn = r'\(cnn(?:\s\w+)?\)'
story = re.escape('story highlights')
df_cleaned['cleaned_text'] = df_cleaned['text'].apply(
                                lambda x: re.sub(cnn, '', x))
df_cleaned['cleaned_text'] = df_cleaned['cleaned_text'].apply(
                                lambda x: re.sub(story, '', x))
df_cleaned.head(2)
Unnamed: 0.1 Unnamed: 0 text label section_name cleaned_text
0 16381 16381 story highlightsjenson button says there's a "... 5 sport jenson button says there's a "good chance" he ...
1 24435 24435 story highlightsf1's drivers have expressed th... 5 sport f1's drivers have expressed their support for ...
# to understand our data better, let's look at the distribution of article lengths
# add a character-length column and show its descriptive statistics
df_cleaned['text_length'] = df_cleaned['cleaned_text'].apply(lambda x: len(x))
df_cleaned['text_length'].describe()
count     17044.000000
mean       5747.173257
std        5949.797299
min          35.000000
25%        2718.750000
50%        4251.500000
75%        6792.000000
max      111427.000000
Name: text_length, dtype: float64
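The mean length is well above the median and the maximum is over 100,000 characters, so the distribution is heavily skewed. A quick histogram makes this easier to see (a small sketch; the bin count and log scale are arbitrary choices, not part of the original notebook):

# plot the distribution of article lengths; a log-scaled y-axis handles the long tail
fig, ax = plt.subplots()
ax.hist(df_cleaned['text_length'], bins=50)
ax.set_yscale('log')
ax.set_xlabel('article length (characters)')
ax.set_ylabel('number of articles')
plt.show()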
# after cleaning, remove any rows whose cleaned text ended up empty
df_cleaned = df_cleaned[df_cleaned['text_length'] != 0]
df_cleaned
Unnamed: 0.1 Unnamed: 0 text label section_name cleaned_text text_length
0 16381 16381 story highlightsjenson button says there's a "... 5 sport jenson button says there's a "good chance" he ... 5104
1 24435 24435 story highlightsf1's drivers have expressed th... 5 sport f1's drivers have expressed their support for ... 3018
2 12801 12801 story highlightsthe georgia dome is scheduled ... 5 sport the georgia dome is scheduled to be imploded m... 4891
3 12482 12482 (cnn)there are few places on earth that humans... 3 news there are few places on earth that humans have... 5416
4 3 3 story highlightslondon's metropolitan police s... 3 news london's metropolitan police says it is droppi... 2413
... ... ... ... ... ... ... ...
17052 19975 19975 (cnn)teenager coco gauff stunned defending cha... 5 sport teenager coco gauff stunned defending champion... 3200
17053 19503 19503 (cnn)gianni infantino, the president of footba... 5 sport gianni infantino, the president of football's ... 4007
17054 29008 29008 (cnn)the price the world has already paid for ... 3 news the price the world has already paid for the c... 7402
17055 21777 21777 (cnn)the nba trade moratorium had been lifted ... 5 sport the nba trade moratorium had been lifted for b... 2979
17056 25139 25139 arapoema, brazil (cnn) -- in this small town o... 3 news arapoema, brazil -- in this small town on the... 9012

17044 rows × 7 columns

df_cleaned.to_csv('./data/CNN_articles_sample_clean.csv')
df_cleaned = pd.read_csv('./data/CNN_articles_sample_clean.csv')

Topic Modeling with BERTopic

BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create interpretable topic representations.

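Under the hood, BERTopic chains together an embedding model, dimensionality reduction (UMAP), clustering (HDBSCAN), and a class-based TF-IDF step. This tutorial mostly relies on the defaults, but the components can also be passed in explicitly; here is a sketch (the parameter values are illustrative assumptions, not the settings used below):

from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# each stage of the BERTopic pipeline, configured explicitly
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")          # document embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)    # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True) # clustering
vectorizer_model = CountVectorizer(stop_words="english")           # tokenization for topic words
ctfidf_model = ClassTfidfTransformer()                             # class-based TF-IDF weighting

explicit_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)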

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
# article text
docs = list(df_cleaned.cleaned_text.values)

# train the model 
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(low_memory=True, verbose=True, 
                       embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)
Batches: 100%|██████████| 533/533 [09:12<00:00,  1.04s/it]
2024-03-26 21:30:37,512 - BERTopic - Transformed documents to Embeddings
2024-03-26 21:30:59,428 - BERTopic - Reduced dimensionality
2024-03-26 21:31:01,271 - BERTopic - Clustered reduced embeddings
# depending on your resources, training can take a while,
# so here is how to save your model and reload it later
topic_model.save("simple_topic_model")
temp = BERTopic.load("simple_topic_model")
# let's take a look at the generated topics 
topic_model.get_topic_info()
Topic Count Name
0 -1 5368 -1_and_to_the_of
1 0 733 0_golf_woods_masters_pga
2 1 450 1_ukraine_ukrainecrisis_ukrainian_russian
3 2 418 2_f1_hamilton_vettel_prix
4 3 317 3_2012_euro_photoseuro_370
... ... ... ...
240 239 10 239_syrian_civil_syria_isis
241 240 10 240_concordia_costa_ship_disasterthe
242 241 10 241_women_march_statue_female
243 242 10 242_water_drinking_navy_hawaii
244 243 10 243_simone_tybre_nina_kamau

245 rows × 3 columns

Notes on the topic representations

  • The largest group is topic -1, which corresponds to outliers
  • By default BERTopic uses HDBSCAN for clustering, which does not force every document into a cluster
  • A topic representation is the set of words that are most specific to that topic relative to the other topics
  • Let’s take a look at the main terms of each topic
  • BERTopic ranks these words with a class-based TF-IDF (c-TF-IDF) score; a toy example follows this list
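As a toy illustration of the class-based TF-IDF idea (a simplified version of the weighting that BERTopic's ClassTfidfTransformer applies; the counts below are made up):

import numpy as np

# toy term-frequency matrix: rows = topics (classes), columns = terms
tf = np.array([[10., 2., 0., 1.],
               [ 1., 8., 3., 0.],
               [ 0., 1., 9., 2.]])

A = tf.sum(axis=1).mean()   # average number of words per class
f = tf.sum(axis=0)          # frequency of each term across all classes

# c-TF-IDF: frequency of a term within a class, down-weighted by how
# common the term is across all classes
c_tf_idf = tf * np.log(1 + A / f)
print(np.round(c_tf_idf, 2))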
# let's take a look at the top terms for the first 16 topics
topic_model.visualize_barchart(top_n_topics=16, n_words=10)
# let's take a look at the topics, corresponding size, and words
topic_model.visualize_topics()
# document and topic assignment
topic_model.get_document_info(docs)
Document Topic Name Top_n_words Probability Representative_document
0 jenson button says there's a "good chance" he ... 2 2_f1_hamilton_vettel_prix f1 - hamilton - vettel - prix - race - formula... 0.702920 False
1 f1's drivers have expressed their support for ... 2 2_f1_hamilton_vettel_prix f1 - hamilton - vettel - prix - race - formula... 1.000000 False
2 the georgia dome is scheduled to be imploded m... -1 -1_and_to_the_of and - to - the - of - in - that - for - is - i... 0.000000 False
3 there are few places on earth that humans have... 15 15_climate_change_energy_warming climate - change - energy - warming - emission... 1.000000 False
4 london's metropolitan police says it is droppi... 62 62_assange_wikileaks_julian_embassy assange - wikileaks - julian - embassy - extra... 1.000000 False
... ... ... ... ... ... ...
17039 teenager coco gauff stunned defending champion... 192 192_gauff_coco_wimbledon_she gauff - coco - wimbledon - she - her - william... 1.000000 False
17040 gianni infantino, the president of football's ... 118 118_fifa_qatar_2022_cup fifa - qatar - 2022 - cup - corruption - world... 0.579062 False
17041 the price the world has already paid for the c... -1 -1_and_to_the_of and - to - the - of - in - that - for - is - i... 0.000000 False
17042 the nba trade moratorium had been lifted for b... 5 5_nba_game_james_basketball nba - game - james - basketball - warriors - l... 0.728845 False
17043 arapoema, brazil -- in this small town on the... 32 32_goal_brazil_world_cup goal - brazil - world - cup - cupgoooal - 171 ... 1.000000 False

17044 rows × 6 columns

Reducing Outliers

BERTopic offers several strategies for reassigning outlier documents to topics:

  • Using the topic-document probabilities to assign topics
  • Using the topic-document distributions to assign topics
  • Using c-TF-IDF representations to assign topics
  • Using document and topic embeddings to assign topics

The default method calculates the c-TF-IDF representations of the outlier documents and assigns each one to the non-outlier topic with the best matching c-TF-IDF representation. The other strategies can be selected explicitly, as sketched after the updated topic table below.

# reduce outliers after getting topic representations
new_topics = topic_model.reduce_outliers(docs, topics)
100%|██████████| 6/6 [00:41<00:00,  6.86s/it]
# updating outliers with new topics assignments, no more topic -1
topic_model.update_topics(docs, topics=new_topics)
topic_model.get_topic_info()
Topic Count Name
0 0 746 0_golf_woods_masters_pga
1 1 478 1_ukraine_ukrainecrisis_ukrainian_crisis
2 2 457 2_f1_hamilton_vettel_prix
3 3 327 3_2012_euro_photoseuro_370
4 4 321 4_brexit_eu_uk_deal
... ... ... ...
239 239 34 239_syrian_isis_syria_civil
240 240 10 240_concordia_costa_ship_disasterthe
241 241 38 241_women_gender_equality_men
242 242 22 242_water_drinking_residents_navy
243 243 11 243_native_simone_tybre_nina

244 rows × 3 columns
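The other strategies can be selected through the strategy argument of reduce_outliers. A sketch, using the strategy names from the BERTopic documentation (the threshold value here is an arbitrary choice):

# distribution-based assignment: use each outlier document's topic distribution
dist_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")

# embedding-based assignment: compare document embeddings to topic embeddings,
# keeping a document as an outlier if no topic is similar enough
emb_topics = topic_model.reduce_outliers(docs, topics,
                                         strategy="embeddings", threshold=0.3)

# probability-based assignment would use strategy="probabilities" together with
# probabilities=probs, but that requires fitting the model with
# calculate_probabilities=True (the simple model above was not)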

Redo with automatic topic reduction and other improvements

# set nr_topics to 'auto' to merge topics that cluster closely together 

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance

# vectorizer model will help with removing stop words 
# that don't add to topic representation
vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')

main_representation_model = KeyBERTInspired()
aspect_representation_model1 = MaximalMarginalRelevance(diversity=.5)
representation_model = {"Main": main_representation_model,
                        "Aspect1": aspect_representation_model1}

ar_topic_model = BERTopic(nr_topics='auto',
                          min_topic_size=20,
                          calculate_probabilities=True,
                          vectorizer_model=vectorizer_model,
                          embedding_model=embedding_model,
                          representation_model=representation_model)
ar_topics, ar_probs = ar_topic_model.fit_transform(docs)
2024-03-26 22:06:44,854 - BERTopic - Transformed documents to Embeddings
2024-03-26 22:06:50,740 - BERTopic - Reduced dimensionality
2024-03-26 22:06:58,788 - BERTopic - Clustered reduced embeddings
2024-03-26 22:07:30,368 - BERTopic - Reduced number of topics from 119 to 77
# save the topic model
ar_topic_model.save("custom_topic_model")
# display topic representations
ar_topic_model.get_topic_info()
Topic Count Name
0 -1 5006 -1_said_photos_people_caption
1 0 1865 0_police_brexit_said_trump
2 1 1050 1_world_cup_photos_best
3 2 1039 2_open_wimbledon_tennis_djokovic
4 3 733 3_golf_woods_masters_pga
... ... ... ...
72 71 25 71_kaepernick_anthem_nfl_colin
73 72 23 72_meth_drug_drugs_cocaine
74 73 22 73_eta_spain_spanish_madrid
75 74 21 74_marriage_women_girls_hussein
76 75 21 75_dinosaur_dinosaurs_species_fossil

77 rows × 3 columns
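Since the model was given two representation models ("Main" and "Aspect1"), each topic carries both keyword lists. Assuming a recent BERTopic version where get_topic accepts full=True, they can be inspected per topic, for example:

# inspect every keyword representation (main + aspect) computed for a single topic
ar_topic_model.get_topic(3, full=True)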

# visualizing topic representations
ar_topic_model.visualize_topics()
# visualizing documents and their topics in 2D
ar_topic_model.visualize_documents(docs)
# visualizing topics and their keywords
ar_topic_model.visualize_barchart(top_n_topics=16, n_words=10)
# documents and topic assignments
ar_topic_model.get_document_info(docs).head(10)
Document Topic Name Top_n_words Probability Representative_document
0 jenson button says there's a "good chance" he ... -1 -1_said_photos_people_caption said - photos - people - caption - hide - cnn ... 0.641343 False
1 f1's drivers have expressed their support for ... 7 7_f1_hamilton_vettel_prix f1 - hamilton - vettel - prix - race - formula... 0.315125 False
2 the georgia dome is scheduled to be imploded m... -1 -1_said_photos_people_caption said - photos - people - caption - hide - cnn ... 0.225028 False
3 there are few places on earth that humans have... 16 16_climate_change_energy_warming climate - change - energy - warming - emission... 0.042562 False
4 london's metropolitan police says it is droppi... 42 42_assange_wikileaks_julian_embassy assange - wikileaks - julian - embassy - extra... 1.000000 False
5 here is a look at the life of pope francis, th... 11 11_pope_francis_vatican_church pope - francis - vatican - church - catholic -... 0.837195 False
6 the super bowl, aka your best yearly excuse to... -1 -1_said_photos_people_caption said - photos - people - caption - hide - cnn ... 0.165729 False
7 usa wins historic sevens world series titlesec... 65 65_sevens_rugby_fiji_zealand sevens - rugby - fiji - zealand - series - rya... 0.313852 False
8 diamond jubilee of britain's queen elizabeth i... 8 8_prince_royal_queen_harry prince - royal - queen - harry - philip - duch... 0.715299 False
9 the french open got off to a wet and chilly st... 2 2_open_wimbledon_tennis_djokovic open - wimbledon - tennis - djokovic - match -... 0.227162 False
ar_topic_model.get_document_info(docs).head(10).loc[8]['Document']
'diamond jubilee of britain\'s queen elizabeth ii marked with river pageanttens of thousands celebrate along the banks of the thamesroyal family led flotilla of a thousand boats up river to tower bridgemusic and fireworks, street parties and festivals to mark 60th anniversarythe thames became a sea of red, white and blue sunday, as tens of thousands celebrated the diamond jubilee of queen elizabeth ii -- so perhaps it was only fitting that alongside all the flags, another great british tradition was very much in evidence: gray skies and rain.some 20,000 people took to the water aboard 1,000 vessels for a river pageant featuring dragon boats, a floating belfry and the royal barge. the event -- inspired by regal riverside celebrations of the past -- was the largest such celebration on the thames for hundreds of years. around a million people were expected to line the route to cheer on the queen, at the head of a seven-mile long flotilla.  but bad weather meant a planned fly-past was canceled. follow cnn\'s live blogisabella hales and her family staked out their claim to a spot near tower bridge -- where the festivities reached a climax on sunday evening."it\'s cold, but i don\'t mind," the 10 year old, wearing a cardboard duchess of cambridge mask that was rapidly dissolving in the drizzle, told cnn. "it was raining for the queen\'s coronation too. i\'m just really excited, i can\'t wait."just watchedthe queen departs the pageantreplaymore videos ...must watchthe queen departs the pageant 01:29just watchedserenading queen elizabeth iireplaymore videos ...must watchserenading queen elizabeth ii 01:22just watchedqueen elizabeth ii arrives at the thamesreplaymore videos ...must watchqueen elizabeth ii arrives at the thames 01:21just watchedthe queen begins the flotillareplaymore videos ...must watchthe queen begins the flotilla 03:05"it\'s only the second time someone has reigned for 60 years," her aunt laura hales added. "it\'s a big accomplishment, and we wanted to celebrate that."there are about 20 of us -- we\'ve come well prepared," she said, pointing out picnic supplies, party masks of the royal family -- including a corgi -- and pink champagne, "and we don\'t care what the weatherman says."here\'s to liz!" she toasted, raising her glass.ireporters celebrate queen elizabeth iimargaretta soulsby, from dorset, was the first to arrive at tower bridge on saturday. she had planned to camp out, but when it began raining, stewards persuaded her to spend the night in a tent nearby.soulsby told cnn it was "well worth it -- i\'m in the perfect position," and said such events made her very proud to be british."in 1935, when i was 10, my father took the family to the mall to watch the silver jubilee celebrations for king george v and queen mary, and i\'ve been privileged to be present at all of the major royal events since then."after gathering upriver in west london, the flotilla made its way from battersea bridge to tower bridge, passing through the heart of britain\'s capital city over the course of several hours.at the front were 300 man-powered boats, with thousands of volunteers propelling them down river, flags and streamers fluttering around them. 
a barge carrying the eight royal jubilee bells -- the largest of which, at nearly half a ton, is named for the monarch -- led the way, with peals of bells ringing out from church towers along the river.bumming a smoke from the queen: when the security bubble burstsnext came passenger boats, pleasure boats, historic wooden vessels -- the oldest built in 1740 -- and boats carrying members of the armed forces, police and fire services. one of the boats taking part, the amazon, also took part in the 1897 diamond jubilee celebrations for queen victoria, britain\'s longest-serving monarch and the only other to reach the landmark 60 years on the throne.the biggest cheers were reserved for the present queen, who was carried aboard a specially-converted royal barge, opulently draped in red and gold.sailing boats that were too tall to pass under the 14 bridges along the river pageant route lined the river from london bridge to wapping, in the east, creating an avenue of sails set against the tower of london and the city\'s financial center.the queen disembarked at tower bridge and looked on as the remainder of the river pageant passed by in a riot of color and noise. excitement grew as a gun salute rang out from the tower of london.  nearby, those not lucky enough to get a riverside spot before the area was locked down, watched the pageant on a big screen. cheers, whistles and the odd chorus of "god save the queen" rang out, and the crowd stayed jolly despite the rain.helen mckee, from kent, said she bought her family along to enjoy a once-in-a-lifetime spectacle. "we\'re never going to see this sort of thing again. i\'ve got a little boy and i thought it was important for him to see it. i still remember the silver jubilee in 1977. it\'s a great atmosphere here, everyone is so friendly."to jamie newell, from london, events on the river were just a forerunner of the main attraction of the day - the after party he was planning at home. newell, decked out from head to toe in union flags, and sporting red, white and blue contact lenses, said simply: "i love the queen."not everyone was of the same opinion.  in a street behind the london assembly building, scores of pro-republican campaigners had gathered, waving placards reading: "make monarchy history" and "don\'t jubilee-ve it" and chanting "monarchy out, republic in." today though, they seemed resigned to the fact that they were well and truly in the minority.as the queen\'s barge approached and tower bridge lifted in salute, red white and blue streamers were tossed from the crowd.  and then, as if on cue, the heavens opened, rain lashing those gathered on the riverbanks below.hoods and umbrellas went up, coats and ponchos went on, quickly followed by shouts of "brolleys down" from those behind.some of those who had gathered fled to shelter, but others remained determined to see out the whole seven-mile flotilla, even in torrential rain.patrick gunning had been waiting for the flotilla since 11 a.m. on sunday.  it was well worth the wait, he said.  "i\'ve had my son saul, who\'s 8, on my shoulders so we saw the whole thing, and we\'ll be staying a little longer."london\'s metropolitan police said as many as 6,000 extra officers were on patrol during jubilee events. the huge security operation comes as london prepares to host the 2012 olympic games, which open in late july.outside the capital, britons gathered for thousands of jubilee-themed street parties and barbecues sunday. 
stores have been filled for weeks with an array of patriotic paraphernalia, from flag-adorned teapots to aprons to picnic sets, to help hosts set the scene for what is billed as a national celebration.the celebrations continue on monday and tuesday, which have been declared public holidays to mark the diamond jubilee.an afternoon garden party at buckingham palace will be followed monday evening by a televised pop concert outside the palace grounds.at the end of the concert, the queen will take to the stage to light the "national beacon," which will be on the mall. she will use a diamond made from crystal glass, which has been on display at the tower of london from the beginning of may, to light the flame.more than 4,000 beacons will then be lit in communities throughout the united kingdom, along with the commonwealth and uk overseas territories.tuesday will be a day of pomp and ceremony, as the queen attends a service of thanksgiving at st. paul\'s cathedral, followed after lunch in westminster by a carriage procession back to buckingham palace, where she will appear on the balcony, flanked by members of the royal family.'
# try assigning topic for unseen article 
# https://www.cnn.com/2024/03/26/business/byd-profit-soar-after-beats-tesla/index.html 

new_docs = ["BYD reported a jump of more than 80% in profit in its first set of annual earnings since it stole Tesla’s crown as the world’s top seller of electric vehicles. Net profit almost doubled to 30 billion yuan ($4.2 billion) last year, from 16.6 billion yuan ($2.3 billion) in 2022, the Shenzhen-based company said Tuesday. That’s despite BYD operating in a “complex external environment,” it noted, citing high levels of inflation globally, and a slowdown in growth in most major economies.  BYD overtook Tesla (TSLA) as the top seller of EVs worldwide in the last three months of last year, capping an extraordinary rise for the Warren Buffett-backed Chinese carmaker. BYD sold 525,409 battery electric vehicles (BEVs) during that period, compared with Tesla’s 484,507. In 2023 as a whole, BYD sold a record 3.02 million vehicles globally, up 62% from 2022. That figure includes 1.44 million plug-in hybrids, which Tesla does not sell. Elon Musk’s carmaker still sold more BEVs last year: 1.8 million to BYD’s 1.57 million.  Price war Compared with Tesla, BYD’s cars are more affordable, which has helped it attract a wider range of buyers. Its entry-level model sells in China for the equivalent of just over $10,000; the cheapest Tesla car, a Model 3, costs almost $39,000. But intensifying competition and a brutal price war last year have impacted the profit margins of many Chinese car makers, including BYD. The country’s car industry recorded a profit margin of 5% for the first 11 months of 2023, compared with 5.7% in 2022 and 6.1% in 2021, according to figures from the Chinese Passenger Car Association. Despite slim margins, the price war doesn’t appear to be abating. Earlier this month, BYD lowered the starting price of its most affordable EV, the Seagull hatchback, by 5% to 69,800 yuan ($9,670.) Other Chinese carmakers have also announced price reductions in the past few weeks, including Geely, Chery, and XPeng Motors."]

ar_topic_model.transform(new_docs)
2024-03-26 22:22:47,729 - BERTopic - Reduced dimensionality
2024-03-26 22:22:47,751 - BERTopic - Calculated probabilities with HDBSCAN
2024-03-26 22:22:47,752 - BERTopic - Predicted clusters
([27],
 array([[7.80879683e-03, 9.59296673e-03, 6.11203202e-04, 6.51194139e-04,
         6.26102653e-03, 1.84719664e-03, 6.21557657e-04, 5.64956914e-03,
         5.95798482e-04, 3.15880177e-03, 6.91040699e-04, 5.10253379e-04,
         1.00103089e-03, 6.52811160e-04, 6.43198324e-04, 1.45520594e-03,
         7.14182051e-04, 8.43490876e-04, 6.39631340e-04, 1.17409669e-03,
         2.67453254e-03, 6.63897534e-04, 6.06292973e-04, 6.60373345e-04,
         8.53775081e-04, 6.44738552e-04, 6.50694265e-04, 6.41241327e-01,
         6.75064118e-04, 5.89717115e-04, 6.00330784e-04, 7.43313477e-04,
         7.00047024e-04, 7.13112402e-04, 6.77553065e-04, 5.75267321e-04,
         7.16695048e-04, 5.89293652e-04, 1.32167727e-03, 5.91050222e-04,
         6.12620629e-04, 6.31253890e-04, 5.01172291e-04, 5.82837607e-04,
         6.55265103e-04, 5.91955399e-04, 6.33812229e-04, 6.00439943e-04,
         7.14137953e-04, 5.88626399e-04, 6.06904543e-04, 6.46155982e-04,
         6.86555877e-04, 8.62724982e-04, 5.97451033e-04, 6.39716600e-04,
         6.14094332e-04, 5.97666758e-04, 8.26303259e-04, 5.97191938e-04,
         6.60868355e-04, 1.70982675e-03, 6.02680294e-04, 6.13727455e-04,
         6.37300778e-04, 7.06349631e-04, 7.00122037e-04, 5.88634639e-04,
         5.89154242e-04, 6.56070178e-04, 6.31469176e-04, 6.71274430e-04,
         6.43720325e-04, 6.16350785e-04, 6.25302450e-04, 7.16005155e-04]]))
# get keywords and weights for topic 27
ar_topic_model.get_topic(27)
[('formula', 0.08117331061665939),
 ('electric', 0.06220142857582872),
 ('car', 0.03606534019125351),
 ('race', 0.03577615233247999),
 ('cars', 0.029986057349402014),
 ('racing', 0.024924522446742713),
 ('di', 0.01908949108743161),
 ('driving', 0.01809070261863316),
 ('season', 0.017823810183823893),
 ('driver', 0.01762892203551633)]


Other Variations

  • Guided Topic Modeling - starting from known seed topics
  • Dynamic Topic Modeling - analyzing how topics change over time (both are sketched after this list)
  • Online Topic Modeling - updating the topic model as new data comes in
  • and more!
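As a rough sketch of the first two variations (the seed word lists are illustrative, and the dynamic example assumes a per-article date column that this sample dataset does not actually include):

# guided topic modeling: nudge BERTopic toward known themes with seed keyword lists
seed_topic_list = [["election", "senate", "campaign"],   # politics-style seed (illustrative)
                   ["vaccine", "hospital", "virus"]]     # health-style seed (illustrative)
guided_model = BERTopic(seed_topic_list=seed_topic_list,
                        embedding_model=embedding_model)
guided_topics, guided_probs = guided_model.fit_transform(docs)

# dynamic topic modeling: track how topics evolve over time
# (requires one timestamp per document, e.g. a publication-date column,
# which this sample dataset does not include)
# timestamps = df_cleaned['date'].tolist()               # hypothetical column
# topics_over_time = guided_model.topics_over_time(docs, timestamps)
# guided_model.visualize_topics_over_time(topics_over_time)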

Resources

  • Download this notebook here
  • Other tech resources: https://buspark.io/
  • More information on BERTopic:
    • https://maartengr.github.io/BERTopic/index.html
    • https://txt.cohere.com/topic-modeling-with-bertopic/