Hello,
I am currently using OCTIS in my Bachelor's thesis to compare various topic models, including LSI.
When performing hyperparameter tuning on LSI, the creation of the topic-word matrix is extremely slow (I waited several hours).
When checking the implementation of the wrapper, I found the code snippets below.
- Why does the current implementation not simply use get_topics() from gensim's LsiModel, similar to the implementation of the LDA model in OCTIS, where the output is used directly?
- Afterwards the matrix is normalized, and I am wondering whether this is necessary. The LDA implementation does not do this.
EDIT: Of course (correct me if I am wrong), in contrast to LDA, whose output consists of true probabilities, LSI does not output normalized values for the matrix by default.
However, if I understand the implementation of LSI in gensim correctly, it already applies a normalization in its get_topics() function:
Gensim LSI Code:
def get_topics(self):
    """Get the topic vectors.

    Notes
    -----
    The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
    in the matrix (real rank of input matrix smaller than `self.num_topics`).

    Returns
    -------
    np.ndarray
        The term topic matrix with shape (`num_topics`, `vocabulary_size`)
    """
    projections = self.projection.u.T
    num_topics = len(projections)
    topics = []
    for i in range(num_topics):
        c = np.asarray(projections[i, :]).flatten()
        norm = np.sqrt(np.sum(np.dot(c, c)))
        topics.append(1.0 * c / norm)
    return np.array(topics)
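The per-row loop above is equivalent to dividing each row of the projections matrix by its L2 norm, which NumPy can do in one vectorized step. A minimal sketch of that equivalence, using a small random matrix as a stand-in for `projection.u.T` (this is not gensim's code, just an illustration):

```python
import numpy as np

# Stand-in for the projections matrix (projection.u.T in gensim);
# shape: (num_topics, vocabulary_size).
rng = np.random.default_rng(0)
projections = rng.normal(size=(3, 5))

# Loop version, mirroring gensim's get_topics().
topics_loop = []
for i in range(len(projections)):
    c = np.asarray(projections[i, :]).flatten()
    norm = np.sqrt(np.sum(np.dot(c, c)))
    topics_loop.append(1.0 * c / norm)
topics_loop = np.array(topics_loop)

# Vectorized equivalent: divide each row by its L2 norm.
topics_vec = projections / np.linalg.norm(projections, axis=1, keepdims=True)

assert np.allclose(topics_loop, topics_vec)
# Every returned topic vector has unit L2 norm.
assert np.allclose(np.linalg.norm(topics_vec, axis=1), 1.0)
```

So the rows returned by get_topics() are already unit-length, just not shifted to be non-negative.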
- If this normalization is necessary, I would advise using NumPy for it to improve performance, since while debugging I noticed that this is the part where my LSI model is stuck for a long time.
def _get_topic_word_matrix(self):
    """
    Return the topic representation of the words
    """
    topic_word_matrix = self.trained_model.get_topics()
    normalized = []
    for words_w in topic_word_matrix:
        minimum = min(words_w)
        words = words_w - minimum
        normalized.append([float(i) / sum(words) for i in words])
    topic_word_matrix = np.array(normalized)
    return topic_word_matrix
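If the min-shift normalization is kept, the Python-level list comprehension above can be replaced by a vectorized NumPy version. A sketch of a drop-in candidate (the function name is hypothetical, not part of OCTIS):

```python
import numpy as np

def normalize_topic_word_matrix(topic_word_matrix):
    """Shift each row so its minimum is 0, then scale each row to sum to 1.

    Vectorized equivalent of the per-row loop in _get_topic_word_matrix.
    """
    m = np.asarray(topic_word_matrix, dtype=float)
    shifted = m - m.min(axis=1, keepdims=True)
    return shifted / shifted.sum(axis=1, keepdims=True)

# Example: two toy "topic" rows.
m = np.array([[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]])
normalized = normalize_topic_word_matrix(m)
# Each row sums to 1 after the shift-and-scale step.
assert np.allclose(normalized.sum(axis=1), 1.0)
```

This performs the whole normalization in two array operations instead of one Python loop iteration per topic and one list comprehension per vocabulary entry.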
def _get_topics_words(self, topics):
    """
    Return the most significative words for each topic.
    """
    topic_terms = []
    for i in range(self.hyperparameters["num_topics"]):
        topic_words_list = []
        for word_tuple in self.trained_model.show_topic(i, topics):
            topic_words_list.append(word_tuple[0])
        topic_terms.append(topic_words_list)
    return topic_terms
def _get_topic_document_matrix(self):
    """
    Return the topic representation of the
    corpus
    """
    topic_weights = self.trained_model[self.id_corpus]
    topic_document = []
    for document_topic_weights in topic_weights:
        # Find min and max topic_weights values
        minimum = document_topic_weights[0][1]
        maximum = document_topic_weights[0][1]
        for topic in document_topic_weights:
            if topic[1] > maximum:
                maximum = topic[1]
            if topic[1] < minimum:
                minimum = topic[1]
        # For each topic compute normalized weight
        # in the form (value-min)/(max-min)
        topic_w = []
        for topic in document_topic_weights:
            topic_w.append((topic[1] - minimum) / (maximum - minimum))
        topic_document.append(topic_w)
    return np.array(topic_document).transpose()
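Likewise, the per-document min/max search here could be vectorized once the sparse (topic_id, weight) pairs are converted to a dense array (gensim provides gensim.matutils.corpus2dense for that). A sketch under that assumption (the helper name is hypothetical):

```python
import numpy as np

def minmax_normalize_rows(weights):
    """Min-max normalize each row to [0, 1]: (x - min) / (max - min).

    `weights` is assumed to be a dense (num_documents, num_topics)
    array, i.e. the sparse gensim output already densified.
    """
    w = np.asarray(weights, dtype=float)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    return (w - lo) / (hi - lo)

# Example: two documents, three topic weights each.
w = np.array([[0.2, 0.8, 0.5], [1.0, 3.0, 2.0]])
norm = minmax_normalize_rows(w)
# Each row now spans exactly [0, 1].
assert np.allclose(norm.min(axis=1), 0.0)
assert np.allclose(norm.max(axis=1), 1.0)
```

Transposing the result afterwards would reproduce the (num_topics, num_documents) shape the wrapper returns.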
Thanks in advance for your response!