Large Scale Machine Learning and Other Animals: Tuning Hadoop configuration for high performance

In this post I will share some of the insights I got when tuning Hadoop/Mahout on Amazon EC regular and high performance nodes. I was using two algorithms.
1) Mahout's Alternating least squares application (See MAHOUT-542) with Netflix data. (Sparse matrix with 100,000,000 non zeros). Test was done with up to 64 HPC nodes (512 cores).
2) CoEM algorithm - NLP algorithm (R. Jones, 2005) with data graph of around 200,000,000 edges.

Below are running time results for running one iteration of alternating least squares (implemented by Sebastian Schelter) on Netflix data. Runtime is in seconds.
X-axis are the participating machines - from 4 to 64 machines.

My conclusion from this experiment, is that 16 HPC nodes (256 cores) are enough for computing matrix factorization/CoEM of this scale. Beyond 16 nodes there is no benefit in further parallism.

Below I explain how I fine-tuned performance.
Preliminaries: I assume you followed the instruction on part 1 of this tutorial to setup Hadoop on EC2.

1) The hdfs-site.xml file

dfs.replication

- I set dfs replication to 1. Replication determines the number of copies the hdfs data is saved on. When working with a relative low number of nodes (several) higher replication delays performance.

hadoop.tmp.dir
hadoop.data.dir
dfs.name.dir

You should set all those directories to point to DIFFERENT paths which have ENOUGH DISK SPACE.
Default hadoop configuration points to either /tmp or /usr/local/hadoop-0.20.2/ and in Amazon
EC2 there is a 10Gb disk space limit for the root partition. To increase available storage,
on regular nodes I set the above fields to /mnt/tmp1, /mnt/tmp2/ and /mnt/tmp3
On HPC nodes, I first mounted /dev/sdb using the command:

mkdir -p /home/data
mount -t ext3 /dev/sdb/ /home/data/

And then created /home/data/tmp1 /home/data/tmp2 /home/data/tmp3 and pointed the above fields to there.

dfs.block.size

The default is 64MB. For CoEM set it to 4MB, so there will be enough mappers for all cores. For Netflix data I set it to 16MB. When the block size is too small, there are too manny mappers, resulting in loading the system, having many task failures, and some of the job trackers gets black-listed. Having too few mappers does not exploit well parallism. Unfortunately it seems that block size should be tuned separately for each algorithm.

2) The file core-site.xml should be configured as explained in the first part of this post.

3) The file mapred-site.xml

mapred.map.task

empirically setting them to the number of cores -1 seemed to work the best. (On HPC nodes, 15 cores). Note that this number is per machine.

mapred.reduce.task

Common practice says to set it to 0.95 * number of machines * (number of cores-1).
For me that did not work well, especially with 64 machines - reduce phase becomes terribly slow with very slow copying phase (in Kb instead of MB). Finally I set it to 64 for all experiments.

mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum

set them to the values above. Note that it seems that reduce tasks maximum is a global maximum and not a limit per single machines. So in this case 64 was a global limit of 64 reduce tasks.

mapred.task.timeout, mapred.tasktracker.expiry.interval

default is 600000 milliseconds which was too low for ALS. If the interval is too low, task will be killed prematurely. I set it to 7200000

mapred.task.tracker.expiry.interval

don't ask me what is the difference to previous field - probably a bug. Anyway I set it as well.

mapred.compress.map.output, mapred.output.compress

again I set those fields to true. It reduced
significantly the disk writes to about 1/3 the size.

mapred.child.java.opts

set it to -Xmx2500Mb , the default is 500, which results in out of memory errors, java heap errors and GC errors.

4) The file hadoop-env.sh
On HPC nodes, set

JAVA_HOME=/usr/lib/jvm/jre-openjdk

On regular nodes, set

JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Heap size parameter controls the heap size. When it is too small you get
out of memory error and out of heap size erros.

HADOOP_HEAPSIZE=4000

5) Avoiding string parsing as much as possible
Java string parsing is rather slow. Avoid reading string input files as possible and write the data in binary format whenever possible. For the CoEM algorithm, avoiding string parsing resulted in x4 faster code, since the inputs files where read on each iteration.

Some tips I got from Julio Lopez, OpenCloud project @ CMU:
Block size and controlling the number of mappers. I believe someone already commented on this. In general, you want to have the block sizes relatively large in order to induce your job to perform sequential instead of random I/O. You can use the "InputFormat" to control how the work is split and how many tasks are created.

I've found that the first instincts users have is to match the number of mappers or reducers per node to the number of cores. For many Hadoop applications, this does not work. Properly setting these parameters is application dependent (module the available resources). In Hadoop these are framework-wide parameters. In my experience, how memory is allocated to tasks has a much larger impact on application performance. However, it is not clear how these memory parameters should be set, and there are all sorts of complex interactions among tasks.

For reference, in the cloud cluster, there are 8 cores per node, we allow 10 simultaneous tasks to execute per node and in general we see better throughput that way. As I mentioned earlier, most jobs experience contention for memory.

Interesting related projects/ papers:
1) http://www.cs.duke.edu/~shivnath/amr.html
2) Kai Ren, Julio López and Garth Gibson. Otus: Resource Attribution in Data-Intensive Clusters. MapReduce: The Second International Workshop on MapReduce and its Applications. San Jose, CA, June 2011. (bib, pdf)

Other useful tips:

When stopping and starting Hadoop you should be very careful since Hadoop generates a zillion of temp file, that if found on the next run makes a mess.

1) I always run from script

echo Y | hadoop namenode -format

Since if the file system was formatted the script will get stuck without getting the "Y" input.

2) Remove all /tmp/*.pid files, or else Hadoop will think some old processes are running.

3) Remove all files in the directories hadoop.tmp.dir, hadoop.data.dir, dfs.name.dir
especially VERSION files. Old VERSION files lead to namespaceID collisions.

4) Delete old logs from /usr/local/hadoop-0.20.2/logs/

Large Scale Machine Learning and Other Animals

Friday, March 4, 2011

Tuning Hadoop configuration for high performance - Mahut on Amazon EC2

6 comments:

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax