Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects the topics present in a collection of documents and generates jointly embedded topic, document, and word vectors. Once you train the Top2Vec model, you can:
Get number of detected topics.
Get topics.
Search topics by keywords.
Search documents by topic.
Search documents by keywords.
Find similar words.
Find similar documents.
pip install top2vec
from top2vec import Top2Vec
model = Top2Vec(documents)
model.save("filename")
model = Top2Vec.load("filename")
model.get_num_topics()
topic_words, word_scores, topic_nums = model.get_topics(77)
Algorithm:
1. Create jointly embedded document and word vectors using Doc2Vec.
2. Create lower dimensional embedding of document vectors using UMAP.
3. Find dense areas of documents using HDBSCAN.
4. For each dense area, calculate the centroid of its document vectors in the original dimension; this centroid is the topic vector.
5. Find the n closest word vectors to the resulting topic vector; these words describe the topic.
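Steps 4 and 5 can be illustrated with a minimal pure-Python sketch. It assumes steps 1 to 3 have already produced the document vectors of one dense area and a table of word vectors; the toy vectors, the helper names (centroid, cosine_similarity, closest_words), and the choice of cosine similarity as the closeness measure are assumptions made for illustration, not the library's internals.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_words(topic_vector, word_vectors, n):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]

# Hypothetical 2-dimensional vectors; real embeddings have many more dimensions.
cluster_doc_vectors = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
word_vectors = {
    "galaxy": [1.0, 0.1],
    "planet": [0.9, 0.3],
    "recipe": [0.0, 1.0],
    "oven": [0.1, 0.9],
}

# Step 4: the topic vector is the centroid of the dense area's document vectors.
topic_vector = centroid(cluster_doc_vectors)

# Step 5: the words closest to the topic vector describe the topic.
top_words = closest_words(topic_vector, word_vectors, n=2)
print(top_words)
```

Because the document and word vectors share one embedding space, the same similarity measure can rank words against a topic vector directly, which is why the centroid is taken in the original dimension rather than in the reduced UMAP space.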
