Monday, April 16, 2012

LDA explained

LDA (latent Dirichlet allocation) is a popular method used frequently in natural language processing. It was originally proposed by David Blei, Andrew Ng, and Michael Jordan in their 2003 paper.
A common LDA application takes a document/word matrix as input and clusters the documents and words into a set of topics.

The output of Blei's algorithm is two matrices: Beta and Gamma. Beta is a matrix of size
words x topics, and Gamma is a matrix of size documents x topics.

I got the following question about LDA from trowind, a graduate student working on NLP in Seoul.

I'm trying to recognize the topics of each given document.
To do this I built topic distributions of word by using LDA algorithm with "glcluster" of GraphLab.

The first output is a "Word X Topic" matrix which contained the probabilities of P(topic_k|word_i). Anyway, I want to know the probabilities of topic for a given document : P(topic_k | doc_j)

How can I get these probabilities?

I contacted Khaled El-Arini, a CMU graduate student and the best LDA expert I could get hold of, and asked him to shed some light on this rather confusing dilemma. And this is what I got from Khaled:

P(word|topic) is Beta (by definition).
Gamma holds the posterior Dirichlet parameters for a particular document [see page 1004: here].
the gamma file should contain a T-dimensional vector for each document (where T is the number of topics), and that represents the variational posterior Dirichlet parameters for a particular document. In other words, if T=3, and gamma = [1 2 2], it means that the posterior distribution over theta for this document is distributed Dirichlet(1, 2, 2), and thus if you just want to take the posterior mean topic distribution for this particular document, it would be [0.2 0.4 0.4], based on the properties of a Dirichlet distribution.
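Khaled's example is easy to check numerically. Here is a minimal sketch (using NumPy; the gamma values are the ones from his example, not real output from any model) that recovers the posterior mean topic distribution from a document's gamma vector, using the fact that the mean of a Dirichlet(a_1, ..., a_T) distribution is a_k / sum(a):

```python
import numpy as np

# Posterior Dirichlet parameters (gamma) for one document with T=3 topics,
# taken from Khaled's example above.
gamma = np.array([1.0, 2.0, 2.0])

# Mean of Dirichlet(a_1, ..., a_T) is a_k / sum(a), so normalizing gamma
# gives the posterior mean topic distribution for this document.
posterior_mean = gamma / gamma.sum()
print(posterior_mean)  # [0.2 0.4 0.4]
```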

Additional insights I got from Liang Xiong, another CMU graduate student and my LDA expert number 2:
Gamma gives the distribution of p(topic | doc). So if you normalize the
Gamma so that it sums to one, you get the posterior mean of p(topic | doc). Unfortunately I don't see a direct way to convert this to p(doc | topic), since LDA is doing a different decomposition than PLSA. But personally I think that p(topic | doc) is more useful than p(doc | topic).
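Liang's normalization trick applies to the whole Gamma matrix at once. As a sketch (the Gamma values below are made up for illustration, not actual GraphLab or lda-c output), row-normalizing a documents x topics matrix gives p(topic | doc) for every document:

```python
import numpy as np

# Hypothetical Gamma matrix (documents x topics); each row holds the
# posterior Dirichlet parameters for one document.
Gamma = np.array([[1.0, 2.0, 2.0],
                  [5.0, 1.0, 4.0]])

# Normalize each row so it sums to 1, yielding the posterior mean of
# p(topic | doc) for every document simultaneously.
p_topic_given_doc = Gamma / Gamma.sum(axis=1, keepdims=True)
```

The most likely topic for each document is then just `p_topic_given_doc.argmax(axis=1)`.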

Anyway, I hope it is now clearer how to use Blei's LDA output.

1 comment:

  1. But is there a way to get the posterior dirichlet parameters without using the approximate graph (i.e. the right hand side one on p.1003)?

    Likewise, is there a way to get the posterior Dirichlet parameters if you are using a package (the lda R package) that doesn't do variational inference but fits with the collapsed Gibbs sampler?