LSI Wrapper Implementation Performance #135

@schanikk


Hello,

I am currently using OCTIS in my Bachelor's thesis to compare various topic models, including LSI.
When performing hyperparameter tuning on LSI, building the topic-word matrix is extremely slow (I waited several hours).

While checking the implementation of the wrapper, I found the code snippets below.

  1. Why does the current implementation not simply use get_topics() from gensim.LsiModel, similar to the implementation of the LDA model in OCTIS, which uses the output directly?
  2. Afterwards the matrix is normalized again, and I am wondering whether this is necessary? The LDA implementation does not do this.

EDIT: Of course, by default (correct me if I am wrong), in contrast to LDA, whose output consists of true probabilities, LSI does not output normalized values for the matrix.
However, if I understand gensim's LSI implementation correctly, it already applies a normalization in its get_topics() function:

Gensim LSI Code:

def get_topics(self):
        """Get the topic vectors.

        Notes
        -----
        The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
        in the matrix (real rank of input matrix smaller than `self.num_topics`).

        Returns
        -------
        np.ndarray
            The term topic matrix with shape (`num_topics`, `vocabulary_size`)

        """
        projections = self.projection.u.T
        num_topics = len(projections)
        topics = []
        for i in range(num_topics):
            c = np.asarray(projections[i, :]).flatten()
            norm = np.sqrt(np.sum(np.dot(c, c)))
            topics.append(1.0 * c / norm)
        return np.array(topics)
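To illustrate what that loop does: each row of `projection.u.T` is scaled to unit L2 norm. The same effect can be reproduced in pure NumPy (using a toy matrix in place of the actual projection, which is my own stand-in for demonstration):

```python
import numpy as np

# Toy stand-in for self.projection.u.T, shape (num_topics, vocabulary_size)
projections = np.array([[3.0, 4.0],
                        [1.0, 0.0],
                        [2.0, 2.0]])

# Row-wise L2 normalization, equivalent to gensim's per-topic loop
norms = np.linalg.norm(projections, axis=1, keepdims=True)
topics = projections / norms

# Every row now has unit L2 norm
print(np.linalg.norm(topics, axis=1))  # → [1. 1. 1.]
```

So the rows returned by get_topics() are already length-normalized, just not shifted into probability-like values.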
  3. If this extra normalization is indeed necessary, I would advise using NumPy for it to improve performance: while debugging I noticed that this is the part where my LSI model is stuck for a long time.
def _get_topic_word_matrix(self):
        """
        Return the topic representation of the words
        """
        topic_word_matrix = self.trained_model.get_topics()
        normalized = []
        for words_w in topic_word_matrix:
            minimum = min(words_w)
            words = words_w - minimum
            normalized.append([float(i)/sum(words) for i in words])
        topic_word_matrix = np.array(normalized)
        return topic_word_matrix

    def _get_topics_words(self, topics):
        """
        Return the most significative words for each topic.
        """
        topic_terms = []
        for i in range(self.hyperparameters["num_topics"]):
            topic_words_list = []
            for word_tuple in self.trained_model.show_topic(i, topics):
                topic_words_list.append(word_tuple[0])
            topic_terms.append(topic_words_list)
        return topic_terms

    def _get_topic_document_matrix(self):
        """
        Return the topic representation of the
        corpus
        """
        topic_weights = self.trained_model[self.id_corpus]

        topic_document = []

        for document_topic_weights in topic_weights:

            # Find min and max topic_weights values
            minimum = document_topic_weights[0][1]
            maximum = document_topic_weights[0][1]
            for topic in document_topic_weights:
                if topic[1] > maximum:
                    maximum = topic[1]
                if topic[1] < minimum:
                    minimum = topic[1]

            # For each topic compute normalized weight
            # in the form (value-min)/(max-min)
            topic_w = []
            for topic in document_topic_weights:
                topic_w.append((topic[1]-minimum)/(maximum-minimum))
            topic_document.append(topic_w)

        return np.array(topic_document).transpose()
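For reference, if the shift-and-rescale step must stay, both Python loops above could be vectorized with NumPy. A rough sketch (the function names are mine, not OCTIS APIs; note that `document_topic_weights` in the wrapper is a list of (topic_id, weight) tuples, so it would need to be converted to a dense array first):

```python
import numpy as np

def normalize_topic_word(topic_word_matrix):
    # Shift each row so its minimum becomes 0, then scale it to sum to 1
    # (vectorized version of the per-topic loop in _get_topic_word_matrix).
    shifted = topic_word_matrix - topic_word_matrix.min(axis=1, keepdims=True)
    return shifted / shifted.sum(axis=1, keepdims=True)

def min_max_rows(matrix):
    # Min-max normalize each row to [0, 1], mirroring the per-document
    # loop in _get_topic_document_matrix (assumes a dense 2-D array).
    mins = matrix.min(axis=1, keepdims=True)
    maxs = matrix.max(axis=1, keepdims=True)
    return (matrix - mins) / (maxs - mins)

m = np.array([[1.0, 2.0, 3.0],
              [0.0, 5.0, 10.0]])
print(normalize_topic_word(m))
print(min_max_rows(m))
```

In my debugging, the per-element Python list comprehension was the bottleneck, and the vectorized forms avoid it entirely.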

Thanks in advance for your response!
