Friday, March 30, 2012

Interesting twitter dataset and other big datasets

I got this from Aapo Kyrola, who got it from Guy Blelloch:

An interesting paper which explores the twitter social network is:

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.

The twitter graph is available for download from here. The format is very simple:
user follower\n

The graph has 41M nodes and 1.4 billion edges. What is nice about it is that you can view the profile of each node id using the twitter web API. For example, for user 12 you can do:
Some statistics about the graph are found here.
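
To make the `user follower` format concrete, here is a minimal Python sketch that streams such lines and counts distinct node ids, the largest id, and the number of edges. The function names `parse_edges` and `graph_stats` are hypothetical and not part of any dataset tooling, and I am assuming the two ids on each line are separated by whitespace.

```python
from typing import Iterable, Iterator, Tuple


def parse_edges(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (user, follower) pairs from whitespace-separated lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1])


def graph_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (distinct node ids, largest node id, edge count)."""
    nodes = set()
    max_id = 0
    edges = 0
    for user, follower in parse_edges(lines):
        nodes.add(user)
        nodes.add(follower)
        max_id = max(max_id, user, follower)
        edges += 1
    return len(nodes), max_id, edges


# Tiny made-up sample, not taken from the real dataset.
sample = ["12 13\n", "12 14\n", "13 12\n", "20 12\n"]
print(graph_stats(sample))  # (4, 20, 4)
```

For the real file you would pass an open file handle instead of the sample list, for example `with open("user_follower.txt") as f: print(graph_stats(f))`. Keeping every node id in a Python set takes a lot of memory at this scale, so treat this as an illustration rather than a tuned tool.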

If you would like to use it in GraphLab v2, you need to do the following:
1) Assuming the graph file name is user_follower.txt, sort the graph using:
   sort -u -n -k 1,1 -k 2,2 -T . user_follower.txt > user_follower.sorted
2) Add the following matrix market format header to the file:
   %%MatrixMarket matrix coordinate real general
   61578414 61578414 1468365182
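
Step 2 can be done in any editor or shell, but for a file of this size a streaming copy is safer. The following is a minimal Python sketch, assuming the sorted file from step 1; the function name `write_matrix_market` and the output file name `user_follower.mm` are hypothetical. The header text is the one given in step 2.

```python
import shutil


def write_matrix_market(sorted_path: str, out_path: str,
                        rows: int, cols: int, nnz: int) -> None:
    """Write the matrix market header, then append the sorted edge list."""
    header = (
        "%%MatrixMarket matrix coordinate real general\n"
        f"{rows} {cols} {nnz}\n"
    )
    with open(out_path, "wb") as out, open(sorted_path, "rb") as src:
        out.write(header.encode("ascii"))
        # copyfileobj streams in chunks, so the file is never fully in memory.
        shutil.copyfileobj(src, out)


# Example call with the numbers from step 2 (output name is an assumption):
# write_matrix_market("user_follower.sorted", "user_follower.mm",
#                     61578414, 61578414, 1468365182)
```

The edge lines themselves are copied unchanged, so the output is simply the two header lines followed by the sorted file.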

I am using the k-cores algorithm to reveal the structure of this graph. I will add some results soon.
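
For readers who have not seen k-cores before: the k-core of a graph is the largest subgraph in which every node has at least k neighbors, and the core number of a node is the largest k for which it belongs to the k-core. Below is a small Python sketch of the textbook peeling approach on a toy graph. It is not the GraphLab implementation, it treats every edge as undirected (an assumption on my part), and the function name `core_numbers` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def core_numbers(edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Core number of each node, treating every edge as undirected."""
    neighbors: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    degree = {node: len(adj) for node, adj in neighbors.items()}
    remaining = set(degree)
    core: Dict[int, int] = {}
    k = 0
    while remaining:
        # Collect every remaining node whose current degree is at most k.
        peel = [n for n in remaining if degree[n] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level
            continue
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already removed via another path
            remaining.remove(node)
            core[node] = k
            for other in neighbors[node]:
                if other in remaining:
                    degree[other] -= 1
                    if degree[other] <= k:
                        peel.append(other)
    return core


# Toy graph: a triangle (1, 2, 3) with one extra node 4 hanging off node 3.
toy = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(sorted(core_numbers(toy).items()))  # [(1, 2), (2, 2), (3, 2), (4, 1)]
```

This version rescans the remaining nodes at every level, which is fine for a toy example but far too slow and memory hungry for 1.4 billion edges; it is only meant to show what the algorithm computes.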

And here is a library of webgraphs and other big graphs I got from Kanat Tangwongsan.


  1. Can you check the download link? It is not working.

    1. I just checked and it is working. If not, you can contact the authors...

    2. It didn't work for me. It says forbidden. Maybe it is because you are authenticated to access that directory?

      Anyway, you can always access the data here:

  2. Is this the "twitter-2010" graph used in the "GraphChi" paper? The vertex/edge counts are slightly different from the numbers in the paper (42M nodes, 1.5B edges). Confused...

    1. I may have rounded the number of nodes and edges in my blog description, since only the magnitude matters. I suggest you download the original dataset and check exactly how many nodes and edges there are if this is important to you.

  3. Is it possible to load the sparse matrix twitter .mm file in matlab?

    1. After step 2, you can use the script:
      to load the dataset into matlab or octave. However, matlab will likely run out of memory since the dataset is big.