Thursday, December 29, 2011

Multicore parser - part 2 - parallel perl tutorial

In the first part of this post, I described how to program a multicore parser whose task is to translate string IDs into consecutive integers, which are used for formatting the input to many machine learning algorithms. The output of part 1 is a map between strings and unsigned ints, built in a single pass over the entire dataset.

Now an additional task remains, namely translating the records (in my case phone call records) into a graph to be used in Graphlab. This is an embarrassingly parallel task: since the map is read-only, multiple threads can read it in parallel and translate the record names into graph edges. For example, the following records:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
YXVaVQJfYZp BZurhasRwat 050803 235959 6
BqFnHyiRwam jdJBsGbXUwu 050803 235959 242
are translated into undirected edges:
1 2 
1 3
2 4
etc. etc.
The code is part of Graphlab v2 and can be downloaded from our download page.
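
To make the translation step concrete, here is a minimal single-threaded Perl sketch of the idea. It is not the Graphlab v2 code (that is C++ and reads the prebuilt read-only map from part 1); here the map %id is filled on the fly just to keep the example self-contained:

#!/usr/bin/perl -w
use strict;

my %id;            # string ID -> consecutive integer
my $next_id = 1;

while (<STDIN>) {
  my ($caller, $receiver) = split;                     # first two columns of a record
  $id{$caller}   = $next_id++ unless exists $id{$caller};
  $id{$receiver} = $next_id++ unless exists $id{$receiver};
  print "$id{$caller} $id{$receiver}\n";               # one undirected edge per record
}

Running it on the three records above prints exactly the three edges listed.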

In the current post, I will quickly explain how to continue setting up the parser.
The task now is to merge multiple phone calls between the same pair of people into a single edge. It is also useful to sort the edges by their node IDs. I chose to program this part in Perl since, as you are about to see, it is a very easy task.

INPUT: Gzipped files with phone call records, where each row has two columns: the caller and the receiver, each of them an unsigned integer. The same row may repeat multiple times in a file (in case multiple phone calls between the same pair of people were logged at different times).
OUTPUT: A gzipped output file with sorted, unique phone call records. Each caller/receiver pair appears only once.
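
For example (with hypothetical integer pairs), an input containing repeated rows such as:
1 2
2 4
1 2
1 3
is reduced to the sorted, unique list:
1 2
1 3
2 4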

Tutorial - how to execute a parallel task in Perl.
1) Download and extract Parallel Fork Manager
wget http://search.cpan.org/CPAN/authors/id/D/DL/DLUX/Parallel-ForkManager-0.7.5.tar.gz
tar xvzf Parallel-ForkManager-0.7.5.tar.gz
mv Parallel-ForkManager-0.7.5 Parallel
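
(Alternatively, if you can install CPAN modules system-wide, running "cpan Parallel::ForkManager" installs the module and the extract-and-rename step above is not needed.)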

2) Create a file named parser.pl with the following lines in it:
#!/usr/bin/perl -w
use strict;
use Parallel::ForkManager;

my $THREADS_NUM = 8;                 # number of parallel worker processes
my $pm = new Parallel::ForkManager($THREADS_NUM);

my $dir = "your/path/to/files";
opendir(DIR, $dir) or die "cannot open $dir: $!";
my @FILES = grep { /\.gz$/ } readdir(DIR);   # process only the gzipped input files
closedir(DIR);

foreach my $file (@FILES) {

  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  print "working on file $file\n";
  system("gunzip -c $dir/$file | sort -u -T . -n -k 1,1 -k 2,2 -S 4G | gzip -c > $dir/$file.sorted.gz");

  $pm->finish; # Terminates the child process
}

$pm->wait_all_children; # Wait until all the children are done
Explanation
1) ForkManager($THREADS_NUM) sets the number of parallel workers (ForkManager forks child processes rather than threads) - in this case 8.
2) Each file is unzipped using "gunzip -c", sorted uniquely (the -u command line flag), and the sorted result is re-compressed with gzip. The -T flag is an optional argument in case your temp drive does not have enough space. -n requests numeric sorting, -k 1,1 sets the sorting key to be the first column, and -k 2,2 sets column 2 as the secondary key in case of a tie on the first column. The -S flag sets the sort buffer size so that the full input file fits into memory; 4G is 4 gigabytes of memory.
3) The system() command runs any shell command line, so you can change the parallel loop execution to perform your own task easily (see the example below).
4) wait_all_children makes the parent process wait until all the forked children have finished.
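
For example (a hypothetical variation, reusing the $dir and $file variables from the script above), the body of the loop could count the unique pairs in each file instead of writing them out:

  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  # backticks capture the output of the shell pipeline
  chomp(my $count = `gunzip -c $dir/$file | sort -u -n -k 1,1 -k 2,2 | wc -l`);
  print "file $file has $count unique pairs\n";

  $pm->finish; # Terminates the child process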

Overall, in a few lines of code we get a parallel execution environment that would be much harder to set up otherwise.
