In GraphLab v2, I have added a parsers library that will hopefully make this task easier.
Let's start with an example. Suppose I have a collection of documents, and each document contains
a bag of words. The input to the GraphLab parser looks like this:
1::the boy with big hat was here
2::no one would have believe in the last years of the nineteenth century
where 1 and 2 are the numeric document ids, '::' is the separator, and the rest of the line contains the keywords that appear in that document.
Assuming your data is in this format, it is very easy to convert it for use in GraphLab: simply run
the texttokenparser application.
Preliminaries: you will need to install GraphLab v2 (see the explanation in the installation section here).
And here is an example run of the parser:
./texttokenparser --dir=./ --outdir=./ --data=document_corpus.txt --gzip=false --debug=true
WARNING: texttokenparser.cpp(main:209): Eigen detected. (This is actually good news!)
INFO: texttokenparser.cpp(main:211): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to firstname.lastname@example.org
Currently implemented parsers are: Call data records, document tokens
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(gzip_in_file:698): Opening input file: ./document_corpus.txt
INFO: io.hpp(gzip_out_file:729): Opening output file ./document_corpus.txt.out
Read line: 1 From: 1 To: 1 string: the
Read line: 1 From: 1 To: 2 string: boy
Read line: 4 From: 1 To: 3 string: with
INFO: texttokenparser.cpp(operator():159): Parsed line: 50000 map size is: 30219
INFO: texttokenparser.cpp(operator():159): Parsed line: 100000 map size is: 39510
INFO: texttokenparser.cpp(operator():159): Parsed line: 150000 map size is: 45200
INFO: texttokenparser.cpp(operator():159): Parsed line: 200000 map size is: 50310
INFO: texttokenparser.cpp(operator():164): Finished parsing total of 230114 lines in file document_corpus.txt
total map size: 52655
Finished in 17.0022
Total number of edges: 0
INFO: io.hpp(save_map_to_file:813): Save map to file: ./.map map size: 52655
INFO: io.hpp(save_map_to_file:813): Save map to file: ./.reverse.map map size: 52655
The output of the parser consists of:
1) Text file containing consecutive integers in sparse matrix market format. In other words, each string is assigned an id, and a sparse matrix is formed where the rows are the document numbers and the non-zero columns are the strings.
NOTE: currently you will need to manually create the two header lines as explained here. The header lines specify the number of rows, columns and non-zero entries in the matrix. In the future I will automate this process.
2) A mapping from each text keyword to its matching integer id.
3) A mapping from each integer id back to its matching keyword.
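As a sketch of the manual header step, here is one way to prepend the two header lines with standard shell tools. The row and column counts (230114 documents, 52655 distinct tokens) are taken from the example run above; the stand-in entries and file names are my own assumptions, not parser output:

```shell
# Hypothetical sketch: prepend the two Matrix Market header lines to the
# parser output. Rows = number of documents, columns = number of distinct
# tokens (values from the example run above); the non-zero count is simply
# the number of entry lines, which we take from the file itself.
out=document_corpus.txt.out
printf '1 5 1\n1 12 1\n2 7 1\n' > "$out"   # tiny stand-in for real parser output
nnz=$(( $(wc -l < "$out") ))
{
  printf '%%%%MatrixMarket matrix coordinate real general\n'
  printf '230114 52655 %d\n' "$nnz"
  cat "$out"
} > document_corpus.mm
```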
Some additional features:
1) It is possible to parse multiple files in parallel (on a multicore machine) and still have the ids assigned consistently. Use the --filter= command line argument to select all files starting with a certain prefix; do not use the --data= command line argument in that case.
2) Support for gzipped input, using the --gzip=true command line option.
3) Save the mapping into a readable text file using the --save_in_text=true command line argument.
4) Incrementally add more documents to an existing map by using the --load=true command line flag.
5) Limit the number of parsed lines using --lines=XX command line flag (useful for debugging!)
6) Enable verbose mode using --debug=true command line flag.
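To illustrate feature 1), here is a hedged sketch of preparing several gzipped input chunks that share a common prefix. The chunk naming comes from split's defaults, and the final texttokenparser invocation (left as a comment) is my own combination of the flags listed above:

```shell
# Hypothetical sketch: split a corpus into chunks with a common prefix and
# gzip them, so they can later be parsed in parallel with --filter=/--gzip=true.
printf '1::the boy with big hat was here\n' >  document_corpus.txt
printf '2::no one would have believe in the last years of the nineteenth century\n' >> document_corpus.txt
split -l 1 document_corpus.txt corpus_part_   # -> corpus_part_aa, corpus_part_ab
gzip corpus_part_*
# Then, with GraphLab v2 built, select all chunks by prefix instead of --data=:
# ./texttokenparser --dir=./ --outdir=./ --filter=corpus_part_ --gzip=true
ls corpus_part_*.gz
```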