In GraphLab v2, I have added a parsers library that will hopefully make this task easier.
Let's start with an example. Suppose I have a collection of documents, and each document contains
a bag of words. The input to the GraphLab parser looks like this:
1::the boy with big hat was here
2::no one would have believe in the last years of the nineteenth century
where 1 and 2 are the numeric document ids, '::' is the separator, and the rest of the line contains the keywords that appear in that document.
Assuming your data is in this format, it is very easy to convert it for use in GraphLab: simply run
the texttokenparser application.
Preliminaries: you will need to install GraphLab v2 (see the explanation in the installation section here).
And here is an example run of the parser:
./texttokenparser --dir=./ --outdir=./ --data=document_corpus.txt --gzip=false --debug=true
WARNING: texttokenparser.cpp(main:209): Eigen detected. (This is actually good news!)
INFO: texttokenparser.cpp(main:211): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to firstname.lastname@example.org
Currently implemented parsers are: Call data records, document tokens
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(gzip_in_file:698): Opening input file: ./document_corpus.txt
INFO: io.hpp(gzip_out_file:729): Opening output file ./document_corpus.txt.out
Read line: 1 From: 1 To: 1 string: the
Read line: 1 From: 1 To: 2 string: boy
Read line: 4 From: 1 To: 3 string: with
INFO: texttokenparser.cpp(operator():159): Parsed line: 50000 map size is: 30219
INFO: texttokenparser.cpp(operator():159): Parsed line: 100000 map size is: 39510
INFO: texttokenparser.cpp(operator():159): Parsed line: 150000 map size is: 45200
INFO: texttokenparser.cpp(operator():159): Parsed line: 200000 map size is: 50310
INFO: texttokenparser.cpp(operator():164): Finished parsing total of 230114 lines in file document_corpus.txt
total map size: 52655
Finished in 17.0022
Total number of edges: 0
INFO: io.hpp(save_map_to_file:813): Save map to file: ./.map map size: 52655
INFO: io.hpp(save_map_to_file:813): Save map to file: ./.reverse.map map size: 52655
The output of the parser consists of:
1) Text file containing consecutive integers in sparse matrix market format. In other words, each string is assigned an id, and a sparse matrix is formed where the rows are the document numbers and the non-zero columns are the strings.
NOTE: currently you will need to manually create the two header lines as explained here. The header lines specify the number of rows, columns and non-zero entries in the matrix. In the future I will automate this process.
2) A mapping from each text keyword to its matching integer id.
3) A mapping from each integer id back to its matching keyword.
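As a sketch of the manual header step, here is one way to prepend the two header lines with standard shell tools. The row and column counts (230114 documents, 52655 distinct tokens) are taken from the example run above; the stand-in entries and file names are my own assumptions, not parser output:

```shell
# Hypothetical sketch: prepend the two Matrix Market header lines to the
# parser output. Rows = number of documents, columns = number of distinct
# tokens (values from the example run above); the non-zero count is simply
# the number of entry lines, which we take from the file itself.
out=document_corpus.txt.out
printf '1 5 1\n1 12 1\n2 7 1\n' > "$out"   # tiny stand-in for real parser output
nnz=$(( $(wc -l < "$out") ))
{
  printf '%%%%MatrixMarket matrix coordinate real general\n'
  printf '230114 52655 %d\n' "$nnz"
  cat "$out"
} > document_corpus.mm
```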
Some additional features:
1) It is possible to parse multiple files in parallel (on a multicore machine) and still have the ids assigned consistently. Use the --filter= command line argument to select all files starting with a certain prefix; do not use the --data= command line argument in that case.
2) Support for gzipped input, using the --gzip=true command line option.
3) Save the mapping into a readable text file using the --save_in_text=true command line argument.
4) Incrementally add more documents to an existing map by using the --load=true command line flag.
5) Limit the number of parsed lines using --lines=XX command line flag (useful for debugging!)
6) Enable verbose mode using --debug=true command line flag.
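To illustrate feature 1), here is a hedged sketch of preparing several gzipped input chunks that share a common prefix. The chunk naming comes from split's defaults, and the final texttokenparser invocation (left as a comment) is my own combination of the flags listed above:

```shell
# Hypothetical sketch: split a corpus into chunks with a common prefix and
# gzip them, so they can later be parsed in parallel with --filter=/--gzip=true.
printf '1::the boy with big hat was here\n' >  document_corpus.txt
printf '2::no one would have believe in the last years of the nineteenth century\n' >> document_corpus.txt
split -l 1 document_corpus.txt corpus_part_   # -> corpus_part_aa, corpus_part_ab
gzip corpus_part_*
# Then, with GraphLab v2 built, select all chunks by prefix instead of --data=:
# ./texttokenparser --dir=./ --outdir=./ --filter=corpus_part_ --gzip=true
ls corpus_part_*.gz
```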