Tuesday, February 19, 2013

Co-EM algorithm in GraphChi

Following the previous post about label propagation, as well as some request from US based startup to implement this method in GraphChi, I have decided to write a quick tutorial for Co-EM algorithm.

Co-EM is a very simple algorithm, extensively utilized by Rosie Jones in her PhD thesis. Originally by Nigam and Ghani (2000). The algorithm is used for clustering test entities into categories. Here is an example dataset (NPIC500) which explains the input format. The algorithm constructs a bipartite graph:
Where the left nodes are noun phrases, the right node are the sentence context, and edge weight is the number of times a certain noun phrase was within the context. The algorithm is very simple (described in page 43 of John's PhD thesis:

As seen above, the noun labels simply compute a weighted some of the edge values. The context nodes compute the same weighted sum (if they are not seed nodes). Seed nodes are the initial graph nodes we have ground truth labels about them.

The output of the probability for each noun phrase to be in a different categories.

Here are some more concrete example of the input file:

Additionally, ground truth is given about the negative and positive seeds. For example, assume we have two categories (city / not city). The seed lists classify certain nouns to their matching categories.

$ head city-seeds.txt 
^New York$
^Los Angeles$

$ head city-neg-seeds.txt 
^the world$
^a number$

And here is how to try it out in GraphChi
0) Install graphchi as explained here, and compile using "make ta"
1) Download the file http://graphlab.org/downloads/datasets/coem.zip and unzip it in your root graphchi folder
2) In the root graphchi folder run:

$ ./toolkits/text_analysis/coem --training=matrix.txt --nouns=nps.txt --contexts=contexts.txt --pos_seeds=city-seeds.txt --neg_seeds=city-neg-seeds.txt --D=1 

The output is generation in the file: matrix.txt_U.mm:
$ cat matrix.txt_U.mm
%%MatrixMarket matrix array real general
%This file contains COEM output matrix U. In each row D probabilities for the Y labels
88322 1

The first three noun phrases all have a prob of around 0.4 of being a city.

1 comment:

  1. Will you please check the download link given in step 1) ?