Comments on "Large Scale Machine Learning and Other Animals: Mahout - SVD matrix factorization - formatting input matrix" (22 comments)

Sriram.R (2014-05-30):
+1 for the concerns Marco has expressed. I was also worried that Mahout takes vectors of columns, and only when I read the comments was my misconception cleared up.

Thanks for the code though :)

Moorthy (2014-03-21):
Thank you, Bickson. For my work on SVD dimension reduction I converted a CSV to a sequence file using seqdirectory and then converted it into vectors with seq2sparse. I don't know how to identify the number of rows and columns in the vectors; can you help me out?

Danny Bickson (2014-03-14):
Hi, I suggest a much easier way to start: http://graphlab.com/learn/notebooks/index.html

Moorthy (2014-03-14):
Hi Bickson, my project is on malware analysis in big data. Can you please provide some input on how to use SVD for it? I am a little bit confused about whether to use SVD as a recommender or for dimension reduction in Mahout. The posts on SVD are about a recommendation engine, and the factorizers in Mahout are based on numerical inputs, so how do they work with my data? I am totally out of my depth; could you advise me on this?

Danny Bickson (2012-09-26):
Sorry for the confusion, you are right: the matrix is transposed, so I treat the key as the column number. However, it is very easy to reverse.
Marco (2012-09-25):
Hi Danny, thanks for providing this, but I am sorry to say that your explanation is very misleading. As ehtsham points out, it is common practice to have:

ROWS = documents (users in ehtsham's case)
COLUMNS = tokens (movies in ehtsham's case)

You are reversing this common assumption, which will confuse many people, including me. I was puzzled that Mahout takes vectors of columns as input instead of vectors of rows, until I figured out that you had inverted these two concepts. I highly suggest you revise your (otherwise very useful!) blog entry, so that newbies won't get too confused.

Thanks for sharing your code,
Marco

Anonymous (2012-07-29):
Okay, many thanks for your reply. Sorry for missing it and publishing the following comment instead; I did not think I would get such a fast reply. You are a great blogger. :-)

Anonymous (2012-07-29):
Got it. I changed the code as follows to fix the error mentioned above:

from = Integer.parseInt(st.nextToken());
to = Integer.parseInt(st.nextToken());
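The fix above can be sketched as a small parser that takes the index base as a parameter, so both 1-based files (which the original converter assumed, hence the "-1") and the commenter's 0-based file work without editing the code. The class and method names here are illustrative, not taken from the original Convert2SVD source.

```java
import java.util.StringTokenizer;

public class TripletParser {
    /** Parses the row/column ids of one "row,col,value" CSV line into 0-based indices. */
    static int[] parseIds(String line, int indexBase) {
        StringTokenizer st = new StringTokenizer(line, ",");
        // Subtract the index base: 1 for 1-based files, 0 for 0-based files.
        int from = Integer.parseInt(st.nextToken()) - indexBase;
        int to = Integer.parseInt(st.nextToken()) - indexBase;
        if (from < 0 || to < 0) {
            throw new NumberFormatException("negative index: from=" + from + " to=" + to);
        }
        return new int[] { from, to };
    }

    public static void main(String[] args) {
        // 1-based input, as the original code assumed:
        int[] a = parseIds("1,2,3.0", 1);
        // 0-based input, as the commenter's file used (no subtraction):
        int[] b = parseIds("0,1,3.0", 0);
        System.out.println(a[0] + "," + a[1] + " " + b[0] + "," + b[1]);
    }
}
```

Both lines parse to the same internal (0,1) position, which is exactly why the 0-based file tripped the "-1" in the original code.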
Danny Bickson (2012-07-29):
Hi! I assume the row and column ids are 1,2,3,..., and it seems that in your case they start from zero. So you need to remove the "-1", namely change the code to:

from = Integer.parseInt(st.nextToken());
to = Integer.parseInt(st.nextToken());

Best,

DB

Anonymous (2012-07-29):
I used the code and the given example CSV file, and got the following error:

java.lang.NumberFormatException: wrong data0 to: -1 val: 3.0
    at Convert2SVD.main(Convert2SVD.java:69)

Danny Bickson (2011-10-23):
In my case, 'to' (the 2nd column) is the user, the first column is the movie, and the cardinality is 17770 in your example. Sorry for the confusion. Maybe I should rename from,to => movie,user?

ehtsham (2011-10-23):
Thanks Danny for the prompt reply, but I still don't understand; I apologize for my ignorance. Let me clarify what I understand: the Netflix data is 480189 users and 17770 movies, which I assume is arranged as a number-of-users (rows) by number-of-movies (columns) matrix. You are saying that the 2nd column is the 'from' field, but your tokenizer code reads the first column as 'from', the 2nd column as 'to', and the third column as the rating ('val'). So your code would produce a sequence file with the key being the column number and the row number as the index into the value vector. Have I understood it correctly? Thanks,
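The transposed layout the thread converges on can be sketched by grouping the input triples by column id, so each output record is keyed by the COLUMN (the movie) and its value is a sparse vector indexed by ROW (the user). A real converter writes these into a Hadoop SequenceFile of IntWritable/VectorWritable pairs; here a nested map stands in for that, since the point is only the grouping.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnGrouping {
    public static void main(String[] args) {
        // (movie, user) id pairs with ratings, sorted by movie as the post assumes.
        int[][] ids = { {0, 0}, {0, 2}, {1, 1} };
        double[] vals = { 3.0, 5.0, 4.0 };

        // Outer key = column (movie) id; inner map = {row (user) id -> rating}.
        Map<Integer, Map<Integer, Double>> columns = new TreeMap<>();
        for (int i = 0; i < ids.length; i++) {
            columns.computeIfAbsent(ids[i][0], k -> new TreeMap<>())
                   .put(ids[i][1], vals[i]);
        }
        // Each entry corresponds to one sequence-file record, matching
        // "key is the column number, row number is the index of the value".
        System.out.println(columns);
    }
}
```

Swapping which id goes into the outer key versus the inner map is all it takes to produce row-keyed vectors instead, which is the "easy to reverse" part Danny mentions.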
Danny Bickson (2011-10-23):
Hi Ehtsham! I assume the matrix is sorted by the columns; the 2nd column is the 'from' field (the user) and the 1st column is the 'to' field (the item). Cardinality is the total number of items. I apologize if this confused you. It is quite easy to swap this ordering: if your dataset is sorted by the 1st column, which is the users, then you can simply swap 'from' and 'to'. I have fixed the example. Thanks!

ehtsham (2011-10-23):
Hi Danny, thank you for the code, it is a great help, but there are a couple of things that I don't really understand.
1) Cardinality is the number of columns in the input matrix, so why is the failure condition "from > cardinality", where 'from' is the row index, as evident from your 3x3 matrix at the start of this blog? (Or is it because the Netflix dataset is organized as movies (columns in the input matrix) appearing with all the users who rated them (rows in the input matrix)?)
2) A minor thing: your System.out.print statement outputs "Columns", whereas the example output you show at the end says "Rows".

Ehtsham

Danny Bickson (2011-09-28):
Definitely. A GraphLab SVD iteration takes 10 seconds on a single 8-core machine with 100,000,000 non-zeros. We have tried it with up to 100,000,000,000 non-zeros, and we are now working to scale to even larger settings. Email me if you would like to know more.

Best,

DB
phercha (2011-09-28):
Hi Danny, we are working on a couple of text mining applications with corpora that can have more than 1 million documents and 200k terms (I can't provide more details...). Is GraphLab prepared for large-scale factorization? In fact we are very surprised that we didn't find GraphLab until now, since we have been evaluating several options.
Danny Bickson (2011-09-28):
Hi Phercha! You are definitely right; send Ana my thanks, I have fixed the code. Would you mind sharing what you are working on? Maybe you would be interested in testing my own SVD implementation in GraphLab?

Best,

DB

phercha (2011-09-28):
Hi Dan, I think there is a small error that will cause problems if the matrix is not square. At line 67 we force the input matrix to have at most "cardinality" different column values; I think this condition should apply to the row ids instead, because at line 80 we initialize the vector with dimension = "cardinality". So at line 67 we should change "to > Cardinality" to "from > Cardinality". Otherwise, if the input matrix has size m x n, the program will load it as an m x m matrix, producing garbage in the results.

Thank you very much for your code, it has been very helpful, and thanks to my colleague Ana for reporting this error to me.

Cheers,
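The fix phercha describes can be sketched as a validator that checks row and column ids against separate bounds instead of a single "cardinality", so an m x n (non-square) matrix is not silently loaded as m x m. Variable names follow the thread ('from' = row id, 'to' = column id); the method itself is illustrative, not from the original converter.

```java
public class BoundsCheck {
    /** Validates one (row, column, value) triple against an m x n matrix shape. */
    static void check(int from, int to, double val, int numRows, int numCols) {
        if (from < 0 || from >= numRows)
            throw new NumberFormatException("row out of range: " + from);
        if (to < 0 || to >= numCols)
            throw new NumberFormatException("column out of range: " + to);
        if (val == 0.0)
            throw new NumberFormatException("zero values are treated as missing");
    }

    public static void main(String[] args) {
        check(2, 5, 1.5, 3, 6);     // fine for a 3 x 6 matrix
        boolean caught = false;
        try {
            check(5, 2, 1.5, 3, 6); // row id 5 is out of range for 3 rows
        } catch (NumberFormatException e) {
            caught = true;
        }
        System.out.println(caught);
    }
}
```

With a single shared cardinality, the second call above would pass for a square matrix and the error would only surface later as garbage in the factorization, which is exactly the failure mode described.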
Danny Bickson (2011-02-24):
Hi Dan, your research sounds interesting! Is this an academic venue, a hobby, or some kind of a startup? I would consider using GraphLab PMF, which factorizes the matrix W ~= UV into two smaller matrices. Let me know if you need help; I would love to help you test it. I have experimented with matrices with several billion non-zeros.
Dan (2011-02-24):
Hey Danny, re the zero index, I forgot to mention I was talking about your example input in the blog post, rather than the logic of the source code:

"For example,
The 3x3 matrix:
0 1.0 2.1
3.0 4.0 5.0
-5.0 6.2 0"

Also, by the way, I found it useful to break out the failure conditions:

if (from < 0)
    throw new NumberFormatException("from < 0 fail: " + from + " to: " + to + " val: " + val);

if (to < 0)
    throw new NumberFormatException("to < 0 fail: " + from + " to: " + to + " val: " + val);

if (to > Cardinality)
    throw new NumberFormatException("Exceeded cardinality fail: " + from + " to: " + to + " val: " + val);

if (val == 0.0)
    throw new NumberFormatException("Zero value fail: " + from + " to: " + to + " val: " + val);

...as I didn't initially realise that zero values were a no-go. I started digging into SVD a while ago, and after finding that the Ruby setup in http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/ was a bit fragile (the underlying libraries are a pain on OS X), I found my way to Mahout, where I got distracted by other pieces (the recommender etc.) until finally today resolving to take a serious look at SVD.

So I started testing by taking the toy example from the above blog post and putting it into your format. That zeros are the same as missing data wasn't something I'd completely absorbed before, but if that's the deal, fine.

My actual usage plans (nothing running yet) are from the NoTube project (http://www.notube.tv/), which is about TV and the Web. We have a pile of data about TV content (video archives) including subject classification codes. Not Web-scale scary big, but chunky enough. And I'm also (with Pig/Hadoop) mining Twitter crawls for info about users (urls they post, celebs they follow), cross-linked to dbpedia/wikipedia, which gives somewhat bulkier data. My initial thought was just to get my hands dirty with SVD by taking a table of content items (videos) and features (subject classification codes), then explore from there. Since we also have some explicit metadata about how the content items relate to each other, and how the subject codes relate too, there's plenty to investigate in terms of making a sensible workflow and SVD representation.

But as I say, early days. Having your tools to get things in and out of the Mahout format removes one of the obstacles that was putting me off. Next stop is a little bit of RTFM'ing to get Mahout talking to the real Hadoop cluster properly...

cheers,

Dan
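The "zeros are missing data" point above can be made concrete by converting the toy 3x3 matrix from the blog post into the triplet format: the two zero entries are simply not emitted. Indices here are 0-based for illustration; a 1-based file would add 1 to each id.

```java
public class DenseToTriplets {
    public static void main(String[] args) {
        // The 3x3 example matrix quoted in the comment above.
        double[][] m = {
            {  0.0, 1.0, 2.1 },
            {  3.0, 4.0, 5.0 },
            { -5.0, 6.2, 0.0 }
        };
        StringBuilder out = new StringBuilder();
        for (int row = 0; row < m.length; row++) {
            for (int col = 0; col < m[row].length; col++) {
                if (m[row][col] != 0.0) {   // zero == missing, so not stored
                    out.append(row).append(',').append(col)
                       .append(',').append(m[row][col]).append('\n');
                }
            }
        }
        System.out.print(out);  // 7 lines: the two zeros are dropped
    }
}
```

If a genuine zero rating must be distinguishable from "not rated", the value has to be encoded some other way (e.g. shifting the rating scale), since the sparse format cannot represent stored zeros.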
Danny Bickson (2011-02-24):
Hi Dan! Rows and columns should start from the same zero index. Let me know if I have a bug!

Would you mind sharing what you are working on and what the size of the factorized matrix is? The reason I am asking is that I am writing a very efficient matrix factorization in GraphLab which, assuming the matrix fits into memory, is about 50 times faster. I also have some comments about the way eigenvalues are computed in the SVD; see my comment in https://issues.apache.org/jira/browse/MAHOUT-369

Best,

Danny

Dan (2011-02-24):
Handy :)

Am I misreading, or are you counting rows from 0 and columns from 1?