Co-EM is a very simple algorithm, originally proposed by Nigam and Ghani (2000) and used extensively by Rosie Jones in her PhD thesis. The algorithm is used for clustering text entities into categories. Here is an example dataset (NPIC500) which illustrates the input format. The algorithm constructs a bipartite graph linking noun phrases to the contexts in which they appear.
The output is the probability of each noun phrase belonging to each of the categories.
Here are some more concrete examples of the input files:
Additionally, ground truth is given in the form of positive and negative seed lists. For example, assume we have two categories (city / not city). The seed lists assign certain noun phrases to their matching categories.
$ head city-seeds.txt
$ head city-neg-seeds.txt
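To make the idea concrete, here is a minimal Python sketch of one common way to run Co-EM on a noun phrase / context graph with positive and negative seeds. It is not the GraphChi implementation: the function name `coem`, the weighted-average update rule, the starting score of 0.5 for unlabeled noun phrases, and the toy data in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


def coem(edges, pos_seeds, neg_seeds, iterations=10):
    """Toy Co-EM for a single category (e.g. city / not city).

    edges: iterable of (noun_phrase, context, count) with count > 0.
    Returns a dict mapping each noun phrase to a score in [0, 1].
    """
    noun_nbrs = defaultdict(list)  # noun phrase -> [(context, count)]
    ctx_nbrs = defaultdict(list)   # context -> [(noun phrase, count)]
    for noun, ctx, count in edges:
        noun_nbrs[noun].append((ctx, count))
        ctx_nbrs[ctx].append((noun, count))

    def weighted_avg(nbrs, scores):
        # Average of the neighbours' scores, weighted by co-occurrence count.
        total = sum(w for _, w in nbrs)
        return sum(w * scores[k] for k, w in nbrs) / total

    def clamp(scores):
        # Seeds are ground truth, so they are reset after every update.
        for n in pos_seeds:
            if n in scores:
                scores[n] = 1.0
        for n in neg_seeds:
            if n in scores:
                scores[n] = 0.0

    # Assumption: unlabeled noun phrases start at 0.5 (no evidence either way).
    noun_score = {n: 0.5 for n in noun_nbrs}
    clamp(noun_score)

    for _ in range(iterations):
        # Contexts take their score from the noun phrases they occur with ...
        ctx_score = {c: weighted_avg(nbrs, noun_score)
                     for c, nbrs in ctx_nbrs.items()}
        # ... and noun phrases take theirs from the contexts they occur in.
        noun_score = {n: weighted_avg(nbrs, ctx_score)
                      for n, nbrs in noun_nbrs.items()}
        clamp(noun_score)
    return noun_score
```

For example, with the made-up edges `[("paris", "lives in _", 3), ("london", "lives in _", 2), ("banana", "ate a _", 4), ("apple", "ate a _", 1)]`, positive seed `{"paris"}` and negative seed `{"banana"}`, the score of "london" drifts towards 1 and the score of "apple" towards 0, because each shares its only context with a seed.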
And here is how to try it out in GraphChi:
0) Install graphchi as explained here, and compile using "make ta"
1) Download the file http://graphlab.org/downloads/datasets/coem.zip and unzip it in your root graphchi folder
2) In the root graphchi folder run:
$ ./toolkits/text_analysis/coem --training=matrix.txt --nouns=nps.txt --contexts=contexts.txt --pos_seeds=city-seeds.txt --neg_seeds=city-neg-seeds.txt --D=1
The output is written to the file matrix.txt_U.mm:
$ cat matrix.txt_U.mm
%%MatrixMarket matrix array real general
%This file contains COEM output matrix U. In each row D probabilities for the Y labels
The first three noun phrases all have a probability of around 0.4 of being a city.
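If you want to post-process the output, the following Python sketch reads a dense MatrixMarket file into a list of rows. It assumes the file follows the standard MatrixMarket "array" layout (comment lines starting with %, then a size line "rows cols", then one value per line in column-major order); the helper names `parse_mm_array` and `load_coem_output` are hypothetical and not part of GraphChi.

```python
def parse_mm_array(text):
    """Parse dense MatrixMarket 'array' text into a list of rows."""
    # Drop blank lines and the header/comment lines, which all start with '%'.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("%")]

    # The first remaining line holds the matrix dimensions.
    rows, cols = (int(tok) for tok in lines[0].split()[:2])

    # All following tokens are matrix entries.
    values = [float(tok) for ln in lines[1:] for tok in ln.split()]
    if len(values) != rows * cols:
        raise ValueError(
            "expected %d values, found %d" % (rows * cols, len(values)))

    # The array format stores entries column by column (column-major).
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]


def load_coem_output(path):
    """Read the output file (e.g. matrix.txt_U.mm) and return its rows."""
    with open(path) as f:
        return parse_mm_array(f.read())
```

With --D=1 each returned row holds a single probability, so `load_coem_output("matrix.txt_U.mm")[0][0]` would be the probability for the first row of the output.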