It compares Mahout vs. Distributed GraphLab on the popular task of matrix factorization using ALS algorithm (alternating least squares) on Netflix data. The bottom line is that GraphLab is about x20 faster than Mahout.
And here is the exact experiment setup, I got from Nezih:
- N is the number of ALS iterations, D is the number of latent factors. The experiments have been conducted on a 16 node cluster.
- We start GL as mpirun -hostfile ~/hostfile -x CLASSPATH ./als –ncpus=16 --matrix hdfs://host001:19000/user/netflix --D=$LATENT_FACTOR_COUNT --max_iter=$ITER_COUNT --lambda=0.065 --minval=0 --maxval=5
- To run mahout ALS, we use the factorize-netflix.sh script under the examples directory. It should be run as ./factorize-netflix.sh /path/to/training_set/ /path/to/qualifying.txt /path/to/judging.txt
- In our test cluster we have 16 machines each with 64GB of memory, 2 CPUs (Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz [8 cores each]) and 4 x 1 TB HDDs. The machines communicate over a 10Gb Ethernet interconnect.
- The Netflix dataset has been splitted into 32 equally sized chunks and then put into HDFS.