Large Scale Machine Learning and Other Animals: February 2012

Wednesday, February 29, 2012

The world's coolest machine learning internships - updated

Note: Openings for summer 2014 are here.

I believe that every graduate student should try to do at least two internships in industry. It is a great experience. Below you can find a list I compiled by aggregating information from some of the companies I am in touch with as a part of our GraphLab project. This list is an academic resource - I am not involved in any of the companies below. I also got some angry comments about some company or another missing - this is a personal list. I will be happy to add more companies, provided they are doing some interesting research.

Openings in the US - summer 2012

Rosie Jones, a fellow Tartan, sent me the following: the Computational Advertising team at Akamai Technologies invites applications for summer 2012 internships. Unfortunately, with the help of this blog, all positions are now filled. We will have to wait till next year.

Srinivasan Soundar from Bosch Research sent me the following: The Bosch Research and Technology Center, with labs in Palo Alto, CA, Pittsburgh, PA, and Cambridge, MA focuses on innovative research and development for the next generation of Bosch products. The data mining team is developing advanced statistical and machine learning methods for application to patient health and electronic medical records. We are looking for highly qualified, motivated, and innovative individuals to join our team. Internships are expected to be at least 10-12 weeks long during the summer months. Previous internships in our group have led to successful publications and/or patents. Topics include Latent Variable Models, Unsupervised Clustering, Privacy Preserving Data Mining and Association Rule Mining.

This is what I got from Grant Ingersoll, a well known Mahout contributor: Lucid Imagination, the leading commercial company for Apache Lucene and Solr, is looking for interns to work on building next generation search, analytics and machine learning technologies based on Apache Solr, Mahout, Hadoop and other cutting edge capabilities. This internship will be practically focused on working on real problems in search and machine learning as they relate to Lucid products and technologies as well as open source. Interested students should send their resume/profile, course work and evidence of open source activity (github account, ASF patches or other, etc.) to careers@lucidimagination.com. Note: position requires eligibility to work in the US.

At the NIPS big learning workshop I had the pleasure of meeting Vaclav Petricek, who is a senior researcher working on matching at eHarmony. eHarmony is an online dating startup, with around 33M users around the world, based in Santa Monica, LA.

The first time I heard about eHarmony was in John Langford's talk on Vowpal Wabbit at the same workshop. John mentioned that out of the many companies that are using his software, he is most proud that Vowpal Wabbit is being used by eHarmony, thus promoting love in the world.

Here is an excerpt from their website, which I was not aware of:
"Nearly 5% of all marriages in the U.S. are created by eHarmony. That’s 271 marriages per day."
This is absolutely amazing!

So if you would like to promote love, and you are a graduate student at a top US university in an area related to machine learning, you are welcome to apply here for an internship. Relevant previous internships and involvement in an open source project are a plus. And tell them I sent you!

There is no need to introduce LinkedIn, one of the most successful social and professional communities. Ron Bekkerman, a senior researcher at LinkedIn, is looking for interns for the coming summer.

With hundreds of millions of users, there is an enormous amount of data and exciting new applications to explore.

Another cool company is RocketFuel, a company specializing in display advertising. I got the following from Abhinav Gupta, Founder and VP Engineering:

We’re hiring interns to work on machine learning/ optimization problems as well as our core platform (ad-serving, bidding, modeling and data infrastructure) built using a mix of proprietary and open-source technologies. We’re looking for those excited about working on tough problems related to scalable/ reliable/ available algorithms, machine learning, data mining and optimization. We are building a platform to do automatic targeting and optimization of ads. Our pitch to advertisers is very simple - If you can measure metrics of success of your campaign, we can optimize. We buy most of our inventory through real time auctions on exchanges such as Google Doubleclick. We’re integrated with real time exchanges processing requests @100k qps. We have over 1PB of data and growing fast.

You can apply to RocketFuel here.

Another hot company is Cloudera. Josh Wills gave an excellent talk at the NIPS big learning workshop where he identified some of the coming challenges in large scale machine learning. And here is what I got from him:

We're hiring data science interns to work on developing new (and not necessarily MapReduce-based) optimization and model fitting algorithms that can be used on data stored in a Hadoop cluster. Specifically, we're interested in ways to more closely integrate open-source projects like Spark, GraphLab, and modifications to MapReduce (such as AllReduce) with the rest of the components of CDH in order to optimize every step of the model building process, from feature extraction to model deployment to evaluation. At Cloudera, the work you do doesn't just impact our company, it impacts the entire Hadoop community.
If that sounds like fun, and you are a graduate student at a top US university in CS/math/operations research, email me your resume at jwills+intern@cloudera.com.

An additional opening at Cloudera is with Josh Patterson: building ML / NLP tools on Hadoop, HBase, and openNLP. Email him at josh@cloudera.com

Shon Burton is the founder of Wildcog, a company specializing in placing technical dudes at top Bay Area companies. Currently they are working with Twitter, Tumblr, Palantir, and Yahoo!. And guess what? They are looking for interns! You are welcome to email Shon at: mlinterns@wildcog.com

The wet dream for any big data lover. Who can have more data than Walmart - ranked no. 1 in the Fortune 500 list? Patrick Harrington is looking for both interns and big data engineers:

@WalmartLabs is seeking outstanding engineers and scientists to build our next generation multi-dimensional targeting system to help revolutionize eCommerce. This targeting system aggregates a variety of user based signals, e.g., click stream, social, web, geo-location, etc., and outputs a portfolio of relevant products on a user specific basis. As a senior engineer, you will be joining a team devoted to increasing the percent of sales attributable to targeting via developing a portfolio of diverse data-driven algorithms and the underlying batch-oriented and real-time systems.

For more details about this opening, contact Patrick Harrington at: pharrington@walmartlabs.com

And here is a note I got from Mike Spreitzer from IBM. He asks not to forget that IBM is very interested in big data, as the whole "smarter planet" thing is about big data. IBM has internships in both product divisions and in Research.

Additional internship positions are available in the data mining and business analytics dept at IBM. And here is what I got from Priya Nagpurkar, a research staff member:
Data mining for business analytics is one of the primary areas of focus in our department this year. More specifically our focus is on systems support (software and hardware) for high performance analytics, with the goal of designing next generation systems. Potential topics include, performance analysis for hardware-software co-design, acceleration (e.g. GPU), optimization of storage systems. For more details contact Priya.
Other internships can be found using IBM's general job search.

Well, as a former IBMer I have a soft spot for IBM. So it definitely gets a place in my list!

This is what I got from Hassan Chafi, Oracle Labs: Oracle Labs is investing a lot in the area of domain-specific languages. One particular domain of interest is large graph-data analysis. We are developing a DSL that simplifies implementing such algorithms and we are interested in all aspects, from applications all the way down to the hardware architecture. If you are interested in a great internship program in the SF Bay Area contact hassan.chafi@oracle.com

Anyone who has ever used Mahout (and there are thousands of users, if not more) knows Ted Dunning. He knows the answer to any question ever asked in the area of applied machine learning. After founding several successful startups, Ted has a new initiative for improving Hadoop infrastructure. He is looking for interns. His email is: tdunning@maprtech.com

And here is what I got from Jesse St. Charles from Knewton, a cool online education company:
Knewton is revolutionizing the practice of education with the world’s most powerful adaptive learning engine. We are a recognized leader in the education and technology space by the World Economic Forum in Davos, and one of the top 25 best places to work by Crain’s New York Business. We're looking for Machine Learning interns with the know-how to help build an innovative online education system that adapts to each individual student. Interns will join a world-class team of data scientists and engineers who are pushing the boundaries of machine learning in both scalability and complexity. You'll get to work with a mountain of data and an exciting array of projects. If you have a passion for building scalable systems that analyze huge data sets and have coursework in machine learning, statistics, and advanced mathematics get in touch with us here.

My friend Udi Weinsberg from Technicolor brought to my attention that Technicolor is also looking for interns. The Technicolor Palo Alto research lab studies personalized computing, data privacy and recommendation systems. You can apply here.

Openings in Europe

I got this from Julien Nioche: DigitalPebble (Bristol, UK) is looking for a graduate / post-graduate student for this summer, ideally with the following interests or expertise:
* NLP / text engineering / IE
* statistical approaches and machine learning
* web crawling and IR
* large scale computing with Hadoop
* good Java skills
The internship would start in July for a duration of 2 or 3 months and will be based in Bristol. This should be a good opportunity to gain expertise in leading open source projects such as GATE, Mahout or Nutch and get directly involved in work with our clients. Note that the internship will be remunerated. To apply, email: jobs@digitalpebble.com

Now how about spending a summer in Madrid? Telefonica Research is looking for interns all year long. I heard a very impressive talk by Nuria Oliver at our big learning workshop at NIPS about research done at Telefonica Research. You can take a look at the slides here. In a nutshell, once you have mobile phone call data combined with geographical data, you can make very interesting observations.

My avid reader alter0de sent me a link to internships in Xerox research center in Europe: http://www.xrce.xerox.com/About-XRCE/Internships. Thanks!

Monday, February 27, 2012

Matrix Market Format

Matrix Market is a very simple format devised by NIST to store different types of matrices.

For the GraphLab matrix libraries (linear solvers, matrix factorization and clustering) we recommend using this format for the input file. Once this format is used for the input, the same format will be used for the output files.

Sparse matrices
The Matrix Market format is very simple: a header line, optional comment lines starting with %, a size line, and then one line per entry. Here is an example 3x4 matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0

And here is the matching matrix market file:

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
3 4 9
1 1  0.8147236863931789
2 1  0.9057919370756192
3 1  0.1269868162935061
2 2  0.6323592462254095
3 2  0.09754040499940952
1 3  0.2784982188670484
2 3  0.5468815192049838
3 3  0.9575068354342976
2 4  0.1576130816775483

The first line is the header; coordinate means this is a sparse matrix. The second line is a comment.
The third line holds the matrix size: 3 4 9 means 3 rows, 4 columns and 9 non-zero entries.
The remaining lines hold the edges (the non-zero entries). For example, the 4th line is
1 1 0.8147236863931789, meaning row 1, column 1, followed by the value.

TIP: Sparse matrices should NOT include zero entries!
For example, the line 1 1 0 is not allowed!
TIP: The first two numbers in each non-zero entry line should be integers and not in floating point notation. For example 1e+2 is not a valid row/col number. It should be 100 instead.
TIP: Row/column numbers always start from one (and not from zero as in C!)
TIP: It is not legal to include the same non-zero entry twice. In GraphLab it will result in a duplicate edge error. Note that numbering in GraphLab starts from zero, so you have to add one to the source and target values reported in the error to find the entry in the matrix market file.
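The tips above can be sketched in a few lines of Java. This is only an illustration, not GraphLab code: the class name CoordinateSketch and the method toCoordinate are made up for this example. It skips zero entries, writes one-based integer indices, and counts the non-zeros for the size line.

```java
/** Hypothetical helper (not part of GraphLab): formats a matrix in Matrix Market coordinate format. */
public class CoordinateSketch {

    /** Returns the whole file content as a string. Assumes a rectangular array. */
    public static String toCoordinate(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        StringBuilder entries = new StringBuilder();
        int nnz = 0;
        // Walk column by column, like the example file above (the order is a choice made here).
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < rows; i++) {
                if (a[i][j] != 0.0) {            // zero entries are never written
                    entries.append(i + 1)        // row index, starting from one
                           .append(' ')
                           .append(j + 1)        // column index, starting from one
                           .append(' ')
                           .append(a[i][j])
                           .append('\n');
                    nnz++;
                }
            }
        }
        return "%%MatrixMarket matrix coordinate real general\n"
                + rows + " " + cols + " " + nnz + "\n"
                + entries;
    }

    public static void main(String[] args) {
        // The 3x4 example matrix from this post, with rounded values.
        double[][] a = {
            {0.8147, 0.0,    0.2785, 0.0},
            {0.9058, 0.6324, 0.5469, 0.1576},
            {0.1270, 0.0975, 0.9575, 0.0}
        };
        System.out.print(toCoordinate(a));
    }
}
```

Since each (row, column) pair of the array is visited exactly once, the same entry can never be written twice.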

Dense matrices:
Here is an example of how to save the same matrix as a dense matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0

%%MatrixMarket matrix array real general
3 4

0.8147

0.1576
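The dense listing above appears to be abbreviated. In the array format defined by NIST, every entry is written on its own line, zeros included, in column-major order (the first column top to bottom, then the second column, and so on). A minimal Java sketch of such a writer follows; the class name ArraySketch and the method toArray are hypothetical names used only for this example.

```java
/** Hypothetical helper (not part of GraphLab): formats a matrix in Matrix Market array format. */
public class ArraySketch {

    /** Returns the whole file content as a string. Assumes a rectangular array. */
    public static String toArray(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        StringBuilder sb = new StringBuilder();
        sb.append("%%MatrixMarket matrix array real general\n");
        sb.append(rows).append(' ').append(cols).append('\n');   // no non-zero count for dense files
        // Column-major order: all of column 1, then all of column 2, ...
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < rows; i++) {
                sb.append(a[i][j]).append('\n');                 // zeros are written too
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 3x4 example matrix from this post, with rounded values.
        double[][] a = {
            {0.8147, 0.0,    0.2785, 0.0},
            {0.9058, 0.6324, 0.5469, 0.1576},
            {0.1270, 0.0975, 0.9575, 0.0}
        };
        System.out.print(toArray(a));
    }
}
```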

Symmetric sparse matrices:
Here is an example of a sparse symmetric matrix:

B =

    1.5652         0    1.4488
         0    2.0551    2.1969
    1.4488    2.1969    2.7814

And here is the matching matrix market file:

%%MatrixMarket matrix coordinate real symmetric
% Generated 27-Feb-2012
3 3 5
1 1  1.5652
3 1  1.4488
2 2  2.0551
3 2  2.1969
3 3  2.7813

Note that each off-diagonal entry is written only once: only the lower triangle (row >= column) is stored.
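As a sketch of how a symmetric file can be produced, the Java snippet below writes only the diagonal and the entries below it. The names SymmetricSketch and toSymmetricCoordinate are made up for this example, and the code assumes the input is square and really symmetric (it does not check).

```java
/** Hypothetical helper (not part of GraphLab): writes a symmetric matrix in coordinate format. */
public class SymmetricSketch {

    /** Returns the file content; only the lower triangle (row >= column) is written. */
    public static String toSymmetricCoordinate(double[][] b) {
        int n = b.length;
        StringBuilder entries = new StringBuilder();
        int nnz = 0;
        for (int j = 0; j < n; j++) {
            for (int i = j; i < n; i++) {        // start at the diagonal, go down the column
                if (b[i][j] != 0.0) {
                    entries.append(i + 1).append(' ')
                           .append(j + 1).append(' ')
                           .append(b[i][j]).append('\n');
                    nnz++;                        // counts stored entries, not all non-zeros of B
                }
            }
        }
        return "%%MatrixMarket matrix coordinate real symmetric\n"
                + n + " " + n + " " + nnz + "\n"
                + entries;
    }

    public static void main(String[] args) {
        // The 3x3 symmetric example matrix B from this post.
        double[][] b = {
            {1.5652, 0.0,    1.4488},
            {0.0,    2.0551, 2.1969},
            {1.4488, 2.1969, 2.7814}
        };
        System.out.print(toSymmetricCoordinate(b));
    }
}
```

The count on the size line is the number of stored entries (5 for B), not the 7 non-zeros of the full matrix.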

Sparse vectors:
Here is an example of a sparse vector:

v =

     1     0     0     1

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
1 4 2
1 1 1
1 4 1

Working with Matlab:
Download the files http://graphlab.org/mmwrite.m and http://graphlab.org/mmread.m
In Matlab you can save a dense matrix using:

>> mmwrite('filename', full(matrixname));

And save a sparse matrix using:

>> mmwrite('filename', sparse(matrixname));

To read a sparse or dense matrix, use:

>> A = mmread('filename');

Writing a conversion function in Java
This section explains how to convert Mahout 0.4 sequence vectors into matrix market format.

Create a file named Vec2mm.java with the following content:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * Code for converting Hadoop seq vectors to matrix market format
 * @author Danny Bickson, CMU
 *
 */

public class Vec2mm{
 
 
 public static int Cardinality;
 
 /**
  * 
  * @param args[0] - input sequence vector file
  * @param args[1] - output matrix market file
  * @param args[2] - number of rows
  * @param args[3] - number of columns
  * @param args[4] - number of non zeros
  * @param args[5] - transpose the output (true/false)
  */
 public static void main(String[] args){
 
   try {
     if (args.length != 6){
        System.err.println("Usage: java Vec2mm <input seq vec file> <output matrix market file> <number of rows> <number of cols> <number of non zeros> <transpose output>");
           System.exit(1);
        }

        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
        int rows = Integer.parseInt(args[2]);
        int cols = Integer.parseInt(args[3]);
        int nnz = Integer.parseInt(args[4]);
        boolean transpose = Boolean.parseBoolean(args[5]);
        IntWritable key = new IntWritable();
        VectorWritable vec = new VectorWritable();
        br.write("%%MatrixMarket matrix coordinate real general\n");    
        if (!transpose)
          br.write(rows + " " +cols + " " + nnz + "\n");
        else br.write(cols + " " + rows + " " +  nnz + "\n");
        while (reader.next(key, vec)) {
          //System.out.println("key " + key);
          SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();
          //System.out.println("key " + key + " value: " + vect);
          Iterator<Vector.Element> iter = vect.iterateNonZero();

          while(iter.hasNext()){
            Vector.Element element = iter.next();
           //note: matrix market offsets starts from one and not from zero
           if (!transpose)
             br.write((element.index()+1) + " " + (key.get()+1)+ " " + vect.getQuick(element.index())+"\n");
           else 
             br.write((key.get()+1) + " " + (element.index()+1) + " " + vect.getQuick(element.index())+"\n");
           }
       }
  
       reader.close();
       br.close();
   } catch(Exception ex){
     ex.printStackTrace();
   }
 }
}

Compile this code using the command:

javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java

Now run it using the command:

java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/google-collections-1.0-rc2.jar Vec2mm A.seq A.mm 25 100 2500 false

Here A.seq is the input sequence file, whose values are stored as SequentialAccessSparseVector. A.mm will be the resulting output file in matrix market format. 25 is the number of rows, 100 is the number of columns, and 2500 is the number of non-zeros. false means the resulting matrix is not transposed.

Depending on the saved vector type, you may want to change my code from SequentialAccessSparseVector to the specific type you need to convert.

Sunday, February 26, 2012

SVD strikes again!

As you may know, SVD is a topic which constantly comes up in my blog. This is a chicken and egg problem: the more I work on SVD, the more requests I get for help with large scale SVD solutions, and then I work more on SVD.

Here are some updates from the last couple of weeks.
Vincent Bor, an independent consultant from the Dallas/Fort Worth area, sent me the following. The problem he was trying is a document-term matrix of 207,710 terms by 153,328 documents, with 22,518,138 non-zero elements. It seems he has an older SVD implementation that he is trying to compare to GraphLab.

I do bidiagonalization so I use original matrix A not A^T A or A A^T, maybe because of this Lanczos better processes ill-conditioned matrices (?), I have restarting but I haven't seen yet that restarting was ever used (I do not see all logs, only from time to time). The algorithm is based on the paper "Lanczos bidiagonalization with partial reorthogonalization" by Rasmus Munk Larsen, but I did full reorthogonalization. I use The Compressed Row Storage (CRS) format to keep matrix in memory and for vector matrix multiplication. During the iterations I keep V, U vectors on the disk, only matrix is in memory. I use parallel calculations but only in two places: vector-matrix multiplication and creating final singular vectors.

It seems that our first version of the SVD solver has numerical issues with his matrix. In a few days I am going to release a fix that includes restarts, as a part of GraphLab v2.

Brian Murphy, a postdoc at CMU, sent me the following:

By the way, the SVD task I have is not enormous - it's a 50m (web documents) x 40k (words of vocabulary) matrix of real values, and very sparse (3.8bn entries or sparsity of 0.2%). Until now we've been using the sparse SVD libraries in python's scipy (which uses ARPACK). But with that the largest matrix I can comfortably manage is about 10-20m.

Even if I could refactor the code to fit 50m, we would still like to be able to run this with much larger collections of web documents.

Another solution I have tried is the GenSim package, which uses a couple of approximate SVD solutions, but so far the results using the decompositions it produces are not so encouraging. This paper describes the methods used in Gensim: Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms

Now, this sounds like a challenging scale - 3.8 billion non-zeros. I am definitely going to help Brian use GraphLab v2 for cracking his problem!

Andrew McGregor Olney, assistant prof at the dept. of psychology at the University of Memphis, has sent me the following:

You might be interested in this paper that describes my partial SVD with no reorthogonalization. It's not highly original, but attempts to synthesize some of the previous work into a simple, practical, and accurate low-memory solution for SVD of term-doc matrices.

I've got some open-source C# code to share (it's not really pretty, but at least it's OO) and some of the original octave scripts I cobbled together in 2005 (using B instead of AtA) if you are interested.

Andrew further sent me a sample matrix of size 4364328 rows, 3333798 columns, 513883084 non zeros to try out. He only cares about the left vectors and singular values.

Edin Muharemagic is working on LexisNexis's new ML implementation, and he is currently implementing SVD on top of it. You can read some more about it in my blog here.

The best large scale SVD implementation I am aware of is Slepc. Now, if this problem is solved, you may ask, why is everyone emailing me? Unfortunately, using Slepc requires an understanding of MPI, C programming ability and a non-negligible learning curve. To understand why a proper SVD implementation is not trivial, I asked Joel Tropp, who gave the following answer:

In short, the Lanczos method is very unstable because it tries to "cheat" when it computes an orthogonal basis for the Krylov subspace. This procedure only works well in exact arithmetic; it breaks very quickly in the presence of floating-point errors. Drastic remedies are required to fix this problem. See the book by Golub and Van Loan for a discussion and references. A more introductory presentation appears in Trefethen and Bau.

Another possible approach to computing a large-scale SVD is to use column-sampling approaches, perhaps followed by a randomized SVD as advocated by Kwok and Li. See Chiu and Demanet (arXiv) or Gittens (arXiv) for some related methods.
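The instability Joel describes can be observed with a small experiment. The Java sketch below is my own toy illustration under simplifying assumptions, not the GraphLab solver: it runs the plain symmetric Lanczos iteration (not the bidiagonalization variant used for SVD) on a diagonal matrix, and the class and method names are made up. It measures how far the computed basis vectors are from being orthogonal, with and without full reorthogonalization, which is one of the "drastic remedies".

```java
/** Toy illustration (hypothetical names): loss of orthogonality in the Lanczos iteration. */
public class LanczosSketch {

    static double dot(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    /**
     * Builds k Lanczos vectors for the diagonal matrix diag(d), with 1 <= k <= d.length,
     * and returns the largest |q_a . q_b| over all pairs a != b (zero in exact arithmetic).
     */
    public static double orthogonalityLoss(double[] d, int k, boolean reorthogonalize) {
        int n = d.length;
        double[][] q = new double[k][n];
        for (int i = 0; i < n; i++) q[0][i] = 1.0 / Math.sqrt(n);   // normalized start vector
        double beta = 0.0;
        for (int j = 0; j + 1 < k; j++) {
            double[] w = new double[n];
            for (int i = 0; i < n; i++) w[i] = d[i] * q[j][i];      // w = A * q_j
            double alpha = dot(w, q[j]);
            // The three-term recurrence: the "cheat" that only subtracts the last two vectors.
            for (int i = 0; i < n; i++) {
                w[i] -= alpha * q[j][i];
                if (j > 0) w[i] -= beta * q[j - 1][i];
            }
            if (reorthogonalize) {
                // Full reorthogonalization: subtract the projection on every previous vector, twice.
                for (int pass = 0; pass < 2; pass++) {
                    for (int p = 0; p <= j; p++) {
                        double c = dot(w, q[p]);
                        for (int i = 0; i < n; i++) w[i] -= c * q[p][i];
                    }
                }
            }
            beta = Math.sqrt(dot(w, w));
            for (int i = 0; i < n; i++) q[j + 1][i] = w[i] / beta;
        }
        double worst = 0.0;
        for (int a = 0; a < k; a++)
            for (int b = a + 1; b < k; b++)
                worst = Math.max(worst, Math.abs(dot(q[a], q[b])));
        return worst;
    }

    public static void main(String[] args) {
        int n = 200;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) d[i] = i + 1;
        d[n - 1] = 1e4;   // one isolated large eigenvalue, chosen here to provoke the problem
        System.out.println("three-term recurrence only: " + orthogonalityLoss(d, 40, false));
        System.out.println("full reorthogonalization:   " + orthogonalityLoss(d, 40, true));
    }
}
```

The price of full reorthogonalization is that all previous vectors have to be kept and touched at every step, which is exactly what becomes expensive at large scale.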

Anyway, stay tuned, since in the coming week or so I am going to release an improved SVD solver in GraphLab version 2. I hope it will be able to address all the above requests for help. And if you are working on large scale SVD, or need such an implementation, please let me know!

Thursday, February 9, 2012

New Java GraphLab tutorial

I am glad to report that we have a new addition to our GraphLab team: Jim Lim, a CMU sophomore in information systems (computer science).

Jim will be working on the GraphLab Java interface: making it more usable, adapting it to GraphLab v2 and adding new features. As a first step, Jim wrote a tutorial of the Java GraphLab interface.

Monday, February 6, 2012

Installing GraphLab on Ubuntu - 32 bit

Warning: this tutorial is deprecated. We no longer support GraphLab for Ubuntu 32 bit. You should use Ubuntu 64 bit with the instructions here: http://bickson.blogspot.co.il/2011/10/installing-graphlabboost-libboost-on.html

1) Computer setup
Log in to your Ubuntu 10.10 machine, or log in to the Amazon AWS console and select a 32-bit instance to run from here. I chose ami-71dc0b18.

2) Update your system
sudo apt-get update
sudo apt-get install build-essential

3) Install Boost

sudo apt-get install libboost-dev libboost-iostreams1.40-dev libboost-program-options1.40.0 libboost-filesystem1.40.0 libboost-system1.40-dev

Fix links:

sudo ln -s /usr/lib/libboost_program_options.so.1.40.0 /usr/lib/libboost_program_options.so
sudo ln -s /usr/lib/libboost_filesystem.so.1.40.0 /usr/lib/libboost_filesystem.so
sudo apt-get install libitpp-dev

4) Install Mercurial
sudo apt-get install mercurial

5) Install CMake
sudo apt-get install cmake

6a) Checkout graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string: "hg clone..." and execute it in your ubuntu shell.

or 6b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

7) configure and compile
cd graphlabapi
export BOOST_ROOT=/usr/
./configure --bootstrap --yes
cd release/
make

8) Test your build
cd tests
./runtests.sh

Possible errors:

libs/iostreams/src/zlib.cpp:20:76: error: zlib.h: No such file or directory
libs/iostreams/src/zlib.cpp:31: error: ‘Z_NO_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:32: error: ‘Z_BEST_SPEED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:33: error: ‘Z_BEST_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:34: error: ‘Z_DEFAULT_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:38: error: ‘Z_DEFLATED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:42: error: ‘Z_DEFAULT_STRATEGY’ was not declared in this scope
libs/iostreams/src/zlib.cpp:43: error: ‘Z_FILTERED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:44: error: ‘Z_HUFFMAN_ONLY’ was not declared in this scope
libs/iostreams/src/zlib.cpp:48: error: ‘Z_OK’ was not declared in this scope
libs/iostreams/src/zlib.cpp:49: error: ‘Z_STREAM_END’ was not declared in this scope
libs/iostreams/src/zlib.cpp:50: error: ‘Z_STREAM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:51: error: ‘Z_VERSION_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:52: error: ‘Z_DATA_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:53: error: ‘Z_MEM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:54: error: ‘Z_BUF_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:58: error: ‘Z_FINISH’ was not declared in this scope
libs/iostreams/src/zlib.cpp:59: error: ‘Z_NO_FLUSH’ was not declared in this scope
libs/iostreams/src/zlib.cpp:60: error: ‘Z_SYNC_FLUSH’ was not declared in this scope
libs/iostreams/src/zlib.cpp: In static member function ‘static void boost::iostreams::zlib_error::check(int)’:
libs/iostreams/src/zlib.cpp:77: error: ‘Z_OK’ was not declared in this scope
libs/iostreams/src/zlib.cpp:78: error: ‘Z_STREAM_END’ was not declared in this scope
libs/iostreams/src/zlib.cpp:81: error: ‘Z_MEM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp: In constructor ‘boost::iostreams::detail::zlib_base::zlib_base()’:
libs/iostreams/src/zlib.cpp:94: error: expected type-specifier before ‘z_stream’
libs/iostreams/src/zlib.cpp:94: error: expected ‘)’ before ‘z_stream’
libs/iostreams/src/zlib.cpp: In destructor ‘boost::iostreams::detail::zlib_base::~zlib_base()’:
libs/iostreams/src/zlib.cpp:97: error: expected type-specifier before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: expected ‘>’ before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: expected ‘(’ before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: ‘z_stream’ was not declared in this scope
libs/iostreams/src/zlib.cpp:97: error: expected primary-expression before ‘>’ token
libs/iostreams/src/zlib.cpp:97: error: expected ‘)’ before ‘;’ token
libs/iostreams/src/zlib.cpp: In member function ‘void boost::iostreams::detail::zlib_base::before(const char*&, const char*, char*&, char*)’:
libs/iostreams/src/zlib.cpp:102: error: ‘z_stream’ was not declared in this scope

Solution:
Either install zlib, or use the above commands for installing libboost1.40 from apt-get (recommended).

Error:

In file included from /home/ubuntu/graphlabapi/demoapps/gabp/gamp.cpp:26:
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h: In function ‘void mul(double*, double)’:
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h:175: warning: dereferencing type-punned pointer will break strict-aliasing rules
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h:175: warning: dereferencing type-punned pointer will break strict-aliasing rules
c++: Internal error: Killed (program cc1plus)
Please submit a full bug report.
See  for instructions.

Solution: It is a known compiler error in Ubuntu 32 bit. I simply tried again and it worked.
Alternatively, you may need to upgrade your compiler.

Error:

Probing for boost...
Probing in /usr/
/usr/bin/ld: cannot find -lboost_system
collect2: ld returned 1 exit status
Probing in /home/ubuntu/graphlabapi/deps
/usr/bin/ld: cannot find -lboost_system

Solution:
If the package libboost_system is not installed, install it. You can view all available
packages by using the command:
apt-cache pkgnames | grep boost | grep 1.40
You can view all installed packages by
dpkg -l | grep boost

If the package is installed, create a symbolic link as explained in section 3 (subsection fix links).

Error:

Mon Feb  6 13:25:48 UTC 2012
~/graphlabapi ~/graphlabapi
cmake detected in /usr/bin/cmake. Skipping cmake installation.
~/graphlabapi
~/graphlabapi ~/graphlabapi
Probing for boost...
BOOST_ROOT not defined. Probing in usual directories...
/usr/bin/ld: cannot find -lboost_iostreams
collect2: ld returned 1 exit status
Probing in /home/ubuntu/graphlabapi/deps
/usr/bin/ld: cannot find -lboost_iostreams
collect2: ld returned 1 exit status

Solution:
export BOOST_ROOT as described above.

Wednesday, February 1, 2012

Community detection using GraphLab

I got this from Timmy Wilson, our man in Ohio and a GraphLab power user:

you need an html5 compliant browser to see/interact w/ the visualization.

I'm using an agent based dimension reduction method to reduce a
directed social graph to two dimensions -- here's the library i'm
currently using.

The social map represents Twitter links and clusters together users who frequently interact.
I was interested to learn which algorithm is used for performing the clustering, and here is what I got from Andreas Noack, the creator of the linloglayout package:

The algorithm resembles force simulation algorithms from physics, though not exactly spring relaxation. I have not published a description, but it is similar to the FADE algorithm, which again is based on the Barnes-Hut algorithm.

While the social map activity does not currently use GraphLab, Timmy previously used GraphLab for performing community detection on the Twitter social graph. He also contributed some datasets to play with. Thanks Timmy for the updates!
