Sunday, December 25, 2011

How to write a multicore parser

When dealing with machine learning, one usually ignores the (usually boring!) task of preparing the data to be used in any of the machine learning algorithms. Most of the algorithms have either linear algebra or statistical foundation and thus the data has to be converted to a numeric form.

In the last couple of weeks I am working on the efficient design of multicore parser, that allows converting raw string data into a format usable by many machine learning algorithms. Specifically, I am using CDR (call data records) from a large European country. However, the dataset has several typical properties, so I believe my experience is useful for other domains.

The raw CDR data I am using looks like this:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
xtGBWvpYgYK jdJBsGbXUwu 050803 235959 242
ZpYQLFFKyTa atslZokWZRL 050803 235959 504
WMjbYygLglR BqFnCfuNgio 050803 235959 51
hcLiEoumskU RcNSJSEmidT 050803 235959 7
qBvUQlMABPv atslBvPNusB 050803 235959 3609
jdSqVjxPlBn BqFnHyiRwam 050803 235959 23
VCWyivVLzRr atslSbOOWXz 050803 235959 8411
PnLaFqLJrEV atslZokWZRL 050803 235959 8806
PnLaCsEnqei atslBvPNusB 050803 235959 590
The first column is an anonymized caller ID, the second column is an anonymized receiver ID, the third column is the date, the fourth is the time, and the last column is the duration of the call.

Now to the data magnitude. If your dataset is small, no need for any fancy parsing, you can write a python/perl/matlab script to convert it to numeric form and avoid reading further... However, this dataset is rather big: every day there are about 300M unique phone calls. So depending on how many days you aggregate together, you can get to quite a large magnitude. For a month there are about 9 billion phone calls logged.

To make the CDR useful, we need to convert the hashed string ID into a number, hopefully a consecutive increasing number. That way we can express the phone call information as a matrix.
Then we can use any of the fundamental machine learning algorithms like: SVM, Lasso, Sparse logistic regression, matrix factorization, etc. etc.

One possible approach for converting strings to integer, is taken in Vowpal Wabbit, where strings are hashed into numeric IDs during the run. However, there is a risk that two different string IDs will be mapped into the same integer. So depending on the application this may be acceptable. I have chosen to take a different approach - which is to simply assign a growing consecutive ID to each string.

I have implemented the code in GraphLab, where GraphLab is not the intuitive tool to be used for this task (although it was convenient to use). In a multicore machine, several GraphLab threads are running in parallel and parsing different portions of the input files concurrently. We have to be careful that node IDs will remain consecutive across the different files. Since stl/boost data structures are typically not thread safe, I had to use a mutex for defending against concurrent insertions to the map. (Concurrent reads from a stl/boost map are perfectly fine).

void assign_id(uint & outval, const string &name){

  //find if the string is already in the map.
  //this part is thread safe since find() is thread safe
  boost::unordered_map<string,uint>::iterator it = hash2nodeid.find(name);
  if (it != hash2nodeid.end()){
     outval = it->second;

  //if not, we need to insert it to the map
  //now, we must lock since operator[] is not thread safe
  outval = hash2nodeid[name];
  if (outval == 0){
      hash2nodeid[name] = ++conseq_id;
      outval = conseq_id;

One should be careful here, since as I verified using gprof profiler, about 95% of the running time is wasted on this critical section of assigning strings to ints.

Initially I used std::map<string,int> but I found it to be rather slow. It seems that std::map is implemented using an underlying tree and so insertions are costly: log(N). I switched to boost::unordered_map which is actually a hash table implementation with O(1) insertions. This gave x2 speedup in runtime.

Second, since each day of input file amount to about 5GB of gzipped file, I used boost gzipped stream for avoiding the intermediate extraction of the input files. Here is an example:
char linebuf[128];
    std::ifstream in_file(filename).c_str(), std::ios::binary);
    boost::iostreams::filtering_stream<boost::iostreams::input> fin;

      fin.getline(linebuf, 128);
      if (fin.eof())
      //parse the line
Overall, I believe the result is quite efficient: for parsing 115GB of compressed CDR data (total of 6.4 billion phone calls) it takes 75 minutes on a 4 core machine. There where about 182M unique IDs assigned. (Quad core AMD Opteron 2.6Ghz). Total of 12.8 billion map lookups (about 3M lookups a second).

Some performance graph:

Summary of lessons learned:

  1. C parsing is way more efficient than perl/python/matlab.
  2. Opening gzipped files is a waste of time and space - better work directly on the gzipped version.
  3. Parallel parsing has a good speedup up to a few (3) cores. More cores do not improve (due to heavy IO..).
  4. Better use hash table than sorted tree: boost::unordered_map is twice is fast than std::map


  1. Hi Danny, nice and interesting post. Do you have some comparison data for C++ vs. Perl/Python? In my experience the speed difference was not too big for I/O bound tasks, although in your case the multi-core speed-up indicates I/O is not the bottleneck.

  2. Hi Zeno! Thanks for your great feedback. I did not compare python/perl directly for this problem, but for previous problems like KDD CUP data, it tooks hundreds of seconds to parse the data in python while C code can do it in a couple of seconds. Anyway this is my intuition.