Semantic clustering of words with MajorClust and word2vec
In a recent project, I needed to arrange around a thousand words into semantic clusters. For this problem, the Vector Space Model seems wellsuited. Previous research has released word vectors already trained on large text collections using methods like word2vec or GloVe. These vectors can be used to produce a similarity matrix, which can then be used for word clustering.
The first step is to obtain the word vectors, e.g. using the gensim library's downloader tool:
import pandas as pd
import numpy as np
import gensim.downloader as api
vectors = api.load("glovewikigigaword100")
Assuming the words one wants to cluster are in words, keep only relevant word vectors:
vectors = np.array([vectors[x] for x in words])
The next step is the choice of a clustering algorithm. While there are a few popular ones like Kmeans and DBSCAN, Majorclust is somewhat less known, but has an important advantage: it does not require stopping criteria like the number of clusters, the number of elements in a cluster, or the maximum distance between elements within a cluster, to be preset in advance: in many NLP tasks such as word clustering these are very hard to guess, even through lots of experimentation.
The algorithm is described in (Stein and Meyer zu Essen, 2002) in application to document clustering (see p.4 for pseudocode and explanation). Below is its Python implementation taken from here (cosine_matrix is a precomputed matrix of document similarities):
# first assign each document to its own cluster
indices = np.arange(num_of_samples)
t = False
while not t:
t = True
# iterate over the documents
for index in np.arange(num_of_samples):
# find the cluster with the biggest similarity to the document
new_index = np.argmax(np.bincount(indices, weights=cosine_matrix[index]))
# if the cluster of the document has changed, iterate over the documents again
if indices[new_index] != indices[index]:
indices[index] = indices[new_index]
t = False
In brief, the algorithm begins by assigning every document to its own cluster. At each iteration, each document gets assigned to the cluster, to which it has the biggest similarity. A document's similarity to a cluster is calculated as the sum of the document's cosine similarities to each of the cluster's members. Thus, a document can be reclustered to other clusters at later iterations. Clustering stops when no document changes its cluster.
However I found that two modifications to this algorithm are necessary in my case. First, if many elements in the matrix have nonzero similarities to many other elements (e.g., word vectors have lots of nonzero values in the same columns, as is the case with word2vec vectors), then MajorClust may end up putting all elements into one single cluster. Therefore, it was necessary to introduce a threshold below which similarities get cancelled out, i.e. set to 0. After some experimentation I found that good results can be achieved by removing up to 99% of the smallest similarity pairs.
Second, after cancelling out low similarities, there will be many elements with no similarity to any other element. I.e., each of them should either constitute a cluster of its own or be treated as noise. MajorClust will merge two elements together even if the similarity between them is 0, thus producing one large catchall cluster with lots of unrelated words. So the other modification is to prevent assigning an element to a cluster if its similarity equals zero.
The implementation below requires that all similarity values are equal or greater than 0, so first the features in the vectors need to be scaled to a range, e.g. (0, 1):
from sklearn.preprocessing import minmax_scale
vectors = minmax_scale(word_vectors, feature_range=(0, 1))
A similarity matrix can now be produced:
from sklearn.metrics.pairwise import cosine_similarity
def get_sim_matrix(X, threshold=0.9):
"""Output pairwise cosine similarities in X. Cancel out the bottom
`threshold` percent of the pairs sorted by similarity.
"""
sim_matrix = cosine_similarity(X)
np.fill_diagonal(sim_matrix, 0.0)
sorted_sims = np.sort(sim_matrix.flatten())
thr_val = sorted_sims[int(sorted_sims.shape[0]*threshold)]
sim_matrix[sim_matrix < thr_val] = 0.0
return sim_matrix
sim_matrix = get_sim_matrix(vectors, 0.99)
The threshold parameter in the function represents the bottom percent of pairs to be cancelled out, after sorting the pairs by the cosine measure.
Now run Majorclust on the similarity matrix:
def majorclust(sim_matrix):
"""Actual MajorClust algorithm
"""
t = False
indices = np.arange(sim_matrix.shape[0])
while not t:
t = True
for index in np.arange(sim_matrix.shape[0]):
# check if all the sims of the word are not zeros
weights = sim_matrix[index]
if weights[weights > 0.0].shape[0] == 0:
continue
# aggregating edge weights
new_index = np.argmax(np.bincount(indices, weights=weights))
if indices[new_index] != indices[index]:
indices[index] = indices[new_index]
t = False
return indices
labels = majorclust(sim_matrix)
word2clusters = dict(zip(words, labels))
The clustering result is word2clusters, a dictionary with words and their cluster ids.
With the modifications above, MajorClust came up with quite reasonable clusters like:

car, vehicle, bike, truck, drive, driver, bus, helmet, cycle, motorcycle, pickup

game, xbox, player, warfare, cod, dlc, battlefield, playstation, wii, nintendo, fifa, controller, sims, console, play, titanfall, ball, skyrim, gta, halo, gaming, ops, code, madden

ford, toyota, bmw, audi, honda, tesla, nissan, jeep, kia, suv, ferrari, range, mustang, chevy, subaru, skoda, lexus, porsche, lamborghini, hyundai, moto, mercedes, gmc, benz, rover, chevrolet, mazda, mower, golf, merc, jaguar, fords, turbo, tractor, duster, diesel, camry, gtr, volvo, auto, corvette, renault, dodge, datsun, volkswagen

carrier, boat, flight, baggage, plane, drone, air, jet, back, trip, train, team, aircraft, balloon, yacht, space, cabin, fleet

gun, datum, weapon, van, mag, ben, rifle, firearm, ammo, shotgun

influence, supply, gas, power, petrol, fuel, electricity, boost, focus, booster, energy
Examples of words that ended up in singleton clusters: thing, brand, note, coin, item, box, case, pack, set, time, switch, machine. As one can see, these are words with many different meanings, so that it is hard to assign them to any particular semantic cluster.
Below is a tSNE visualization of the clustered words (click the image for a zoomable graph).
The implementation of MajorClust used to derive these clusters, along with the scikitlearn API, can be found here.