Monday, December 17, 2012

Some interesting feedback on the 3rd generation CF solver discussed in this blog

You can find below two emails I got from industry regarding my 3rd generation solver.
I can definitely understand some of the frustration: I can claim anything in my blog,
while the industry guys have a harder time proving their claims, since they charge money for them. My solution costs nothing, so if it is worth more than nothing, everyone is happy.

The first feedback I got is from Dinesh Vadhia, founder of Xyggy:

Hi Danny
A solution to the scalable CF problem with additional information has been available for a while with a Bayesian machine learning method developed by Professor Zoubin Ghahramani and Dr Katherine Heller (see reference below). Together with Zoubin, we at Xyggy are focused on making it simpler and faster for developers and data scientists to deploy intelligent services online with scalable Bayesian machine learning.  Briefly, the key aspects for machine learning practitioners are:
i) Feature engineering of any data type
Feature vectors can be created from any data type.  Features can also be combined, for example: bag-of-words + votes (events) + preference ratings + custom features, and so on.  If data is remotely of value then include it as a feature.  Feature vector lengths can range from the tens to tens of millions.  The method is equally suitable for items x items, items x users and users x items problem types.
ii) No training
No training is required. Once the feature engineering is completed, the sparse binarized data is fed to the Xyggy engine.  The method automatically learns and generalizes.
iii) Multiple items per query
A query consists of one or more items.  Typically, the more items per query the better the results.
iv) Dynamic predictions
Predictions are calculated dynamically in near real-time.  If the query items change, the predictions are re-calculated.  Think of it as dynamic clustering. 
v) Scalability and parallelism
The method can be viewed as IR+RecSys and will scale to any required size.  The inherent parallelism offers scalability and performance routes to deliver services at web-scale.
vi) Relevance feedback, engineered serendipity and novelty detection
The Android showcase app demonstrates personalization with autonomous discovery utilizing both positive and negative relevance feedback, as well as engineered serendipity.  The source code for the Android app will be released soon to show how these capabilities can be built into applications with the Xyggy api.  If there is interest, the feature engineering process for this app can be explained as it is instructive.
More information can be found here.  We welcome the opportunity to work with organizations who want a simpler and faster way to deploy scalable intelligent services. 
I'll be happy to continue the discussion.
Best ...
Dinesh Vadhia

Though written a while back, Information Retrieval using a Bayesian Model of Learning and Generalization provides a good overview. However, the post doesn't cover relevance feedback, engineered serendipity or RecSys in any detail.  Also, note that the demo has been replaced with the Android showcase app.
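Xyggy's engine itself is proprietary, but the feature engineering described in points (i) and (ii) above, folding heterogeneous data into one sparse binarized feature vector, can be sketched in a few lines of Python. The feature names and binning here are illustrative assumptions, not Xyggy's actual API:

```python
# Illustrative sketch only: combining heterogeneous data (bag-of-words,
# events, ratings) into one set of active binary features. The feature
# naming scheme and the rating bins are assumptions for this example.

def binarize(item):
    """Map an item's raw attributes to a set of active binary features."""
    features = set()
    # bag-of-words features from free text
    for word in item.get("text", "").lower().split():
        features.add("word:" + word)
    # event features (e.g. votes)
    for event in item.get("events", []):
        features.add("event:" + event)
    # preference rating, quantized into coarse integer bins
    if "rating" in item:
        features.add("rating_bin:%d" % int(item["rating"]))
    return features

item = {"text": "Delayed flight to Boston",
        "events": ["upvote"],
        "rating": 3.7}
print(sorted(binarize(item)))
```

Each active feature becomes one "on" bit in a sparse binary vector, which is why the vector length can grow to tens of millions without hurting storage.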

The second feedback I got is from Nick Vasiloglou of ISMION:

I don't think you expected so much feedback for your blog post, but I guess this is good. Let me add my comments to your post:
  1. You will be surprised how many things happen in the industry that nobody ever bothers to publish. I can easily believe that somebody else has tried this approach. The truth is that I have been trying to convince some of my clients to use tensor factorization (PARAFAC) instead of linear regression, but they are not convinced. One of the reasons is that traditional industry prefers linear regression because of the confidence intervals and statistical significance of the factors. In your approach you aggregate several factors into the time variable. You could instead have used a multidimensional tensor x[i,j,k,l]=a[i]*b[j]*c[k]*d[l] with L1 regularization, and that would have given you some measure of the importance of the variables. I am not sure how you can match the linear regression metrics, but in the worst case you use the bootstrap.
My answer: yes, I have been there, done that. Typically in industry you write patents on issues that would hardly be accepted at a decent conference, and there is not enough time to pursue publications. By the way, in my approach I am not aggregating several factors into time variables (I only noted this can be done using the traditional matrix factorization approach); I am using separate factors for each variable. The tensor case can hardly be scaled beyond the 3rd dimension because of all the interactions between the latent feature vectors.
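For readers who want to see what Nick's suggestion looks like concretely, here is a toy sketch of the rank-1 4-way tensor model x[i,j,k,l]=a[i]*b[j]*c[k]*d[l], fit by stochastic gradient descent with a small L1 penalty. The synthetic data, learning rate and penalty strength are illustrative assumptions, not anyone's production setup:

```python
import random

# Toy sketch of a 4-way tensor model x[i,j,k,l] ~ a[i]*b[j]*c[k]*d[l],
# fit by SGD with a small L1 penalty that shrinks unimportant entries.
random.seed(0)
n = 4
a, b, c, d = ([random.uniform(0.5, 1.5) for _ in range(n)] for _ in range(4))

# synthetic observations: every tensor cell equals 1.0
obs = [(i, j, k, l, 1.0) for i in range(n) for j in range(n)
       for k in range(n) for l in range(n)]

def mse():
    return sum((x - a[i]*b[j]*c[k]*d[l]) ** 2
               for i, j, k, l, x in obs) / len(obs)

lr, lam = 0.001, 1e-4   # learning rate and L1 strength (illustrative)
before = mse()
for _ in range(30):
    for i, j, k, l, x in obs:
        err = x - a[i]*b[j]*c[k]*d[l]
        # gradient step on each factor, plus soft L1 shrinkage
        # (factors stay positive here, so the subgradient is -lam)
        a[i] += lr * (err * b[j]*c[k]*d[l] - lam)
        b[j] += lr * (err * a[i]*c[k]*d[l] - lam)
        c[k] += lr * (err * a[i]*b[j]*d[l] - lam)
        d[l] += lr * (err * a[i]*b[j]*c[k] - lam)
after = mse()
print(after < before)  # the fit improves on the synthetic tensor
```

Note how each factor update touches all three other factor vectors; this is the interaction cost that makes scaling beyond three or four dimensions painful.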
  2. Another problem that I see with your approach is handling continuous variables. You are trying to predict delays, and you might have continuous variables like weather temperature or humidity. I believe you implicitly quantize them through hashing. This might be suboptimal, since values that are very close might fall into separate bins. Even worse, if, say, your original dataset didn't have any categorical/ordinal variables but only d continuous factors, then by quantizing them with hashing there is a high probability that true nearest neighbors will be assigned to different bins. A different approach would be to build m random trees with l leaves each. Now your original d-dimensional space has been transformed into an m*l-dimensional one, and each point is mapped to m different leaves. I think this is a better approach than quantizing each variable separately, since it preserves the Euclidean distances up to a small error.
My answer: I agree there is a delicate point here. For the target variable (like flight delay) I support continuous values, but the other features I quantize, so the natural ordering of increasing values is lost. For most problems I tried, this works well in practice. Regarding the lost relations: whenever a single feature appears in two different samples the samples are connected, and the gradient is computed with respect to both, so data dependencies are maintained very well.
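The binning problem Nick raises is easy to demonstrate. The sketch below quantizes a hypothetical temperature feature into fixed-width bins, shows two nearly identical values landing in different bins, and then uses several randomly shifted quantizers (a crude one-dimensional stand-in for his random-trees idea) so that nearby values share most of their bins. The feature names and bin widths are illustrative assumptions:

```python
import random

# Fixed-width quantization: values 0.2 degrees apart can still land in
# different bins, so a factorization sees them as unrelated features.
def temp_bin(temp_celsius, width=5.0):
    """Quantize a temperature into a single categorical bin id."""
    return "temp_bin:%d" % int(temp_celsius // width)

print(temp_bin(24.9))   # temp_bin:4
print(temp_bin(25.1))   # temp_bin:5

# Mitigation in the spirit of m random trees: m randomly shifted
# quantizers; nearby values now agree on most of their m bin ids.
random.seed(1)
offsets = [random.uniform(0, 5.0) for _ in range(3)]

def multi_bins(t):
    return {"t%d_bin:%d" % (m, int((t + off) // 5.0))
            for m, off in enumerate(offsets)}

shared = multi_bins(24.9) & multi_bins(25.1)
print(len(shared))  # nearby values now share bins
```

With a single quantizer the two temperatures never co-occur; with m shifted quantizers they disagree only when a bin boundary happens to fall between them, which becomes rare as m grows.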
  3. There is also one detail that is not mentioned in any of the matrix factorization works and is very critical in practice. If the matrix has block diagonal components, in other words the corresponding graph has disconnected components, you can get very bad results in your recommender system when you are looking for similar items. So good advice is to use the GraphLab method for identifying them first and then use your favorite factorization.
My answer: Most of the graphs I work on are connected. So you may be right, but I am not concerned with this issue.
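For anyone who does want to check for disconnected components before factorizing, a simple union-find over the rating pairs suffices (GraphLab has a scalable version; this is just a minimal sketch, with made-up ratings for illustration):

```python
# Minimal union-find sketch for finding disconnected components of a
# user-item rating graph before running a factorization.

def components(edges):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)  # union the two endpoints
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# two disconnected blocks: users u1,u2 never rate items i3,i4
ratings = [("u1", "i1"), ("u2", "i1"), ("u2", "i2"),
           ("u3", "i3"), ("u4", "i3"), ("u4", "i4")]
print(len(components(ratings)))  # 2
```

If this reports more than one component, similarity queries across components are meaningless, and each block should be factorized separately.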
  4. One of the reasons why companies don't broadcast their solutions is that they are afraid of patents. This is also my advice to my clients, especially if they are in the medical/healthcare industry: never disclose your exact methods. Patent trolls are watching.
My answer: good life in academia!! Earn a low salary and broadcast as much as you like!!
  5. At last, I would like to say that I am very glad you posted this approach. I happen to teach an introductory course to practitioners with the title "Learning Machine Learning by Example", and we use the airline dataset.  Your post gives me an opportunity to introduce the students to matrix factorization coming straight from linear regression!
My answer: very interesting. I hope you will give the students an opportunity to actually play with my code; I think it would be beneficial for their understanding of how different features contribute to the overall solution quality.

Nick has kindly added GraphChi CF toolkit to his machine learning meetup course.
