Text Document Clustering Implementation
Text clustering groups documents by semantic proximity without prior labels. Applications: topic discovery in a corpus, customer-request segmentation, organizing document archives.
Clustering Pipeline
[Documents]
→ [Cleaning, normalization]
→ [Embeddings (Sentence-BERT)]
→ [Dimensionality reduction (UMAP)]
→ [Clustering (HDBSCAN / K-Means)]
→ [Cluster interpretation (keywords)]
→ [Visualization]
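The pipeline above can be sketched end to end. To keep the example dependency-light, TF-IDF plus TruncatedSVD stand in for Sentence-BERT plus UMAP, and K-Means stands in for HDBSCAN; this is an illustrative substitution, not the recommended stack.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy corpus; in practice these would be real cleaned documents.
documents = [
    "refund my payment please", "payment failed twice",
    "how to reset my password", "password reset link broken",
    "refund status for my order", "cannot log in after reset",
]

# 1) Vectorize (stand-in for Sentence-BERT embeddings).
X = TfidfVectorizer().fit_transform(documents)

# 2) Reduce dimensionality (stand-in for UMAP).
X_reduced = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)

# 3) Cluster (stand-in for HDBSCAN).
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_reduced)
print(labels)
```

Each stage is swappable: once real embeddings and UMAP output are available, only steps 1 and 2 change.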
Embeddings for Russian Text
Clustering quality depends heavily on embedding quality:
- cointegrated/rubert-tiny2: 312-dimensional embeddings, ~312 ms per 1000 texts; good for short texts
- sbert-base-ru-mean-tokens: better for long documents
- text-embedding-3-small (OpenAI API): best quality, but paid
Clustering Algorithms
K-Means: requires the number of clusters up front. Fast and scales well, but sensitive to outliers.
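Since K-Means needs k in advance, a common workaround is to scan a range of values and pick the one with the best silhouette score. A sketch on synthetic data (the blob "embeddings" are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic "embeddings" with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick k with the highest silhouette.
best_k = max(scores, key=scores.get)
print(best_k)
```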
HDBSCAN: doesn't require cluster count, automatically identifies outliers (noise). Best choice for exploratory analysis:
import hdbscan
# min_cluster_size sets the smallest group treated as a cluster;
# points that fit nowhere receive the noise label -1.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='euclidean')
labels = clusterer.fit_predict(embeddings_umap)
BERTopic: end-to-end pipeline from texts to named topics:
from bertopic import BERTopic
topic_model = BERTopic(language="russian", calculate_probabilities=True)
topics, probs = topic_model.fit_transform(documents)
topic_model.visualize_topics()
UMAP for Dimensionality Reduction
Before clustering, UMAP reduces 768-dimensional embeddings to 10–50 dimensions. This speeds up clustering and improves quality by mitigating the curse of dimensionality.
import umap
reducer = umap.UMAP(n_components=10, metric="cosine", random_state=42)
embeddings_reduced = reducer.fit_transform(embeddings)
Cluster Interpretation
After clustering, each cluster needs naming. Methods:
- TF-IDF top words: most characteristic words of cluster vs others
- LLM interpretation: pass 10 random documents from the cluster and ask an LLM to formulate their common topic
- BERTopic built-in: automatically builds c-TF-IDF representation
Unsupervised Quality Assessment
- Silhouette Score: range [-1, 1]; higher means better separation. Target: > 0.3
- Davies-Bouldin Index: lower is better; compares within-cluster scatter to between-cluster separation
- Coherence: how semantically related a cluster's top words are
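The first two metrics are available directly in scikit-learn; a quick sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with 4 compact groups as a stand-in for real embeddings.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)       # higher is better, target > 0.3
dbi = davies_bouldin_score(X, labels)   # lower is better
print(round(sil, 2), round(dbi, 2))
```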