Friday, February 4, 2011

Mahout - SVD matrix factorization - reading output

Converting Mahout's SVD Distributed Matrix Factorization Solver Output Format into CSV format

Purpose
The code below, shows how to convert a matrix from Mahout's SVD output format
into a matrix format.


This code is based on code by Danny Leshem, ContextIn.

Command line arguments:
args[0] - path to svd output file
args[1] - path to output csv file

Compilation:
Copy the java code below into an java file named SVD2CSV.java
Add to the project path both Mahout and Hadoop jars.


#
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;


public class SVD2CSV {
   
   
    public static int Cardinality;
   
    /**
     *
     * @param args[0] - input csv file
     * @param args[1] - cardinality (length of vector)
     * @param args[2] - output file for svd
     */
    public static void main(String[] args){
   
try {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
    IntWritable key = new IntWritable();
    VectorWritable vec = new VectorWritable();
         
          while (reader.next(key, vec)) {
              //System.out.println("key " + key);
              SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();
               System.out.println("key " + key + " value: " + vect);
              Iterator<Vector.Element> iter = vect.iterateNonZero();

               while(iter.hasNext()){
                  Vector.Element element = iter.next();
                  br.write(key + "," + element.index() + "," + vect.getQuick(element.index())+"\n");
            }
          }
     
          reader.close();
          br.close();

       
    } catch(Exception ex){
        ex.printStackTrace();
    }
    }
}


When parsing the output please look here about a discussion regarding validity of computed results.

Further reading: Yahoo! KDD Cup 2011 - large scale matrix factorization.

18 comments:

  1. I'm getting ...

    java.lang.ClassCastException: org.apache.mahout.math.DenseVector cannot be cast to org.apache.mahout.math.SequentialAccessSparseVector
    at SVD2CSV.main(SVD2CSV.java:38)

    from SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();

    Also btw the args in the comments seem to be from the other utility.

    ReplyDelete
  2. in recentish Mahout 0.5/ trunk, I should have said...

    ReplyDelete
  3. SequentialAccessSparseVector vect = new SequentialAccessSparseVector(vec.get() );
    ...seems to fix it (though perhaps inefficent for large results?)

    ReplyDelete
  4. Strange.. for me it worked.. I worked on mahout-0.4
    and hadoop 0.20.2. Maybe you used other version?

    ReplyDelete
  5. Yes, I'm running from recent trunk builds (0.5 as mentioned above).

    So, I have a very basic question. My understanding of SVD is that it will decompose my input matrix into three others whose product gets me back to the input. However the 'rawEigenvectors' file I'm getting back from my hadoop job, and which your utility parses, is only a single square matrix x by x, where x is the number of columns in the original input.

    My process was to run:

    hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver
    --input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 5

    ...where svdoutput.mht was the output of your Convert2SVD utility. I'd RTFM but I can't find one :)

    ReplyDelete
  6. Hi!
    Actually Mahout's SVD is implementing the Laczos algorithm, see http://en.wikipedia.org/wiki/Lanczos_algorithm
    Given a matrix A, it implements an iterative method for computing a matrix T s.t. T~V'AV
    Then the matrices T and V are used for computing the eigenvectors and eigenvalues pairs. So the output is SVD is actually the eignevectors and eigenvalues.

    By the way - I have notice you used numRows to be zero? numRows should be the number of rows in your original matrix. So it seems that you are doing something wrong.

    ReplyDelete
    Replies
    1. But in fact the actual output is the premultiplied right vectors no? I don't see where the eigenvalues are output.

      Delete
  7. Thanks. The numRows zero thing came from the original documentation in JIRA which said https://issues.apache.org/jira/browse/MAHOUT-180

    --numRows 0 (currently ignored, not needed)

    ...however I started providing correct values there, just in case. And the info: output from the Laczos code shows it does notice this parameter now (and presumably uses it).

    Ted also replied on list explaining that I should expect the eigenvector/values rather than 3 matrices. So now it's back to linear algebra school for me, to figure out how exactly those are used, compared to the matrices I had been expecting. Once I get Mahout working properly I'd love to test-drive your tools too...

    ReplyDelete
  8. Hi, Anybody would be interested in sanity testing my mahout-593 stochastic SVD for stochastic errors etc.? I have my hands full at the moment so i only tested it with reuters dataset so far but i suspect you guys might have a better /alternative ways of assessing results.

    The patch itself is out of sync with Mahout trunk at the moment, but i have that work frozen here in github : https://github.com/dlyubimov/ssvd-lsi, the most stable and efficient branch is believed to be ssvd-preprocessing at the moment. I also have command line doc here. This code can produce either of U, U*Sigma^0.5, V, V*Sigma^0.5 based on request. The eigenvalues are output as single-row distributed row matrix.

    ReplyDelete
  9. PS. the ssvd-preprocessing branch dependencies are actually changed to use Cloudera's hadoop distribution (CDH3b3), standard Mahout dependencies are kept in the MAHOUT-593 branch but it is about 30% slower because of lack of preprocessing in DRM underpinnings of standard Mahout (and also requires more memory, depending on how wide your input is, but ssvd-preprocessing branch is unbound for n of A.)

    ReplyDelete
  10. Hi Dmitry!
    Sounds interesting. Can you send me a link to the paper you are implementing? I would like to read it first to understand your method better.
    Can you send me a jar file of the algorithm? I am working with mahout-0.4 and hadoop-0.20.2
    will your patch work there?
    Is the input file to your algorithm in the same format of the distributed Lanczos?

    Thanks,

    Danny

    ReplyDelete
  11. The paper should be referenced in Mahout jira 376 (Halko, et.al: Finding Structure with Randomness: Stochastic algorithms for constructing aproximate matrix decompositions). You should be able to find it in arxiv.org but I have a copy if you need it. Also my blog contains concrete modification of the original family of methods they show there on a high level. I added two minor modifications to enable it running on Hadoop MR -- MR-based QR decomposition. More details on algorithms is in my working notes -- but they are way to detailed and lack a structure.

    The input is Distributed Row Matrix (whatever WritableComparable for key and VectorWritable for the row). Outputs are also all DRMs. See command line usage here https://github.com/dlyubimov/ssvd-lsi/blob/doc/SSVD-scaling%20notes.pdf?raw=true and also attached to Mahout-593.

    The distinguishing factors from Lanczos in terms of use are
    -- you can specify whatever labels in the keys of the rows. they will be copied into keys of U. Convenient for LSI processing of output of seq2sparse which produces file names (i believe) as labels.
    -- you can ask for outputs of U, V or U*sigma^0.5, or V*sigma^0.5 which will not require running over initial input of matrix A.
    -- You don't have to specify number of columns or number of rows in the command line. It's whatever it gets to be. However, i do rely on fact that rows specify correct matrix width (i think).

    I think Mahout 593 is going to 0.5 release (at least it seems we have consensus we want to commit it) so i will commit it at some point as soon as i address all the style issues brought up by Owen.

    ReplyDelete
  12. PS in terms of jar, since it is a Mahout patch, it is just the whole Mahout snapshot. Do you think you can check out and build Mahout snapshot from this git branch of mine -- https://github.com/dlyubimov/ssvd-lsi/tree/MAHOUT-593 ? or you think you better have me to build it for you?

    Also like i said, there's more performant snapshot there that i actually use, (ssvd-preprocessing branch), runs about 10 to 30% faster depending on the width of your input like i mentioned before but right now it will use Cloudera's CDH3b3 jars for hadoop (is not Ec2 elastic MR compliant if that's what you normally use).

    ReplyDelete
  13. In terms of paper, it's Halko, et.al "Finding structure with randomness: stochastic algorithms for matrix decompositions". It comes up in search easily but if you need it, i have a copy. The concrete modification of the algorithm i use is in my blog and QR part that i implemented for MR is also in my blog (high level overview). More details are in my working notes in github but they are rather unstructured and because of that are somewhat difficult read, i think.

    ReplyDelete
  14. in terms of input, the input is a regular Mahout DRM (distributed row matrix: one or more sequence files with whatever writable as key and VectorWritable as value). the CLI usage is attached to Mahout-593 jira. Distinguishing factors from Lanczos solver are:
    -- you don't have to specify width or height, it's whatever it gets to be.
    -- your keys don't have to be IntWritables, they only need to be ? extends Writable and are copied to correspondent rows of U
    -- you may optionally request either of U, V, U*Sigma*0.5, V*Sigma*0.5, and those jobs will not have to do a pass over input of A (unlike you'd need to do with Lanczos), the processing there is quite faster.

    ReplyDelete
  15. yes, I found the paper, let me first read it to understand what is going on...

    ReplyDelete
  16. PS as a method itself, there is also a presumably incredibly fast implementation of this method in google code (look for redsvd project there) -- but i did not have a chance to verify it. for many sizes chances are this would be a perfect solution, but hadoop-based one of course can take on problems of billions in width or height dense data (CPU flops permitting).

    ReplyDelete
  17. John Stewart has left a new comment on your post "Mahout - SVD matrix factorization - reading outpu...":

    But in fact the actual output is the premultiplied right vectors no? I don't see where the eigenvalues are output.

    Answer: the eigenvalues are output into the stdout - check your program output.

    ReplyDelete