Today I learned from Prof. Garth Gibson from CMU, about a graph algorithm benchmark called http://graph500.org . It is a benchmarking suite design for comparing performance of graph algorithms.
Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.
Backed by a steering committee of over 50 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set).
They have very interesting and challenging data sizes:
There are five problem classes defined by their input size:
toy 17GB or around 1010 bytes, which we also call level 10,
mini 140GB (1011 bytes, level 11),
small 1TB (1012 bytes, level 12),
medium 17TB (1013 bytes, level 13),
large 140TB (1014 bytes, level 14), and
huge 1.1PB (1015 bytes, level 15).
The benchmark performs the following steps:
- Generate the edge list.
- Construct a graph from the edge list (timed, kernel 1).
- Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
- For each search key:
- Compute the parent array (timed, kernel 2).
- Validate that the parent array is a correct BFS search tree for the given search tree.
- Compute and output performance information.
In comparison to our machine learning techniques, the Graph500 uses a completely useless and synthetic data:
The graph generator is a Kronecker generator similar to the Recursive MATrix (R-MAT) scale-free graph generation algorithm [Chakrabarti, et al., 2004].
Overall, it is a great initiative in the right direction, although we as ML researchers would love to see real data and real applications instead synthetic data. From the other hand, the magnitudes of data the parallel computing guys are handling are several order of magnitude larger than the ones we can fit in state of the art systems like GraphLab.
No comments:
Post a Comment