Tuesday, May 29, 2012

San Francisco - here we come!

I am glad to report we already have 135 registrations for the GraphLab workshop. We are quickly running out of space! Register today or you may be left out. Here is our current who's who of registrants:


    • Senior Data Scientist, LinkedIn 

    • TargetChoice 

    • Senior Machine Learning Researcher, GE Global Research 

    • Graduate Student, University of Rochester 

    • Research Staff Member, Samsung Advanced Institute of Technology 

    • The Harker School 

    • Professor, University of Washington 

    • Principal Data Scientist, PocketGems 

    • VMware 

    • Research Manager, Oracle Labs 

    • Chief Technology Officer, Cetas Cloud & Big Data A, VMware 

    • Vice President, Cetas Cloud & Big Data Analytics, VMware 

    • Carnegie Mellon University 

    • Netflix 

    • Postdoctoral Fellow, Stanford University 

    • Research Scientist, Google 

    • Graduate Student, Carnegie Mellon University 

    • Research Scientist, Akamai Technologies 

    • Team Leader, MeraLabs LLC 

    • Carnegie Mellon University 

    • Grad Student, University of Washington CSE 

    • University of Washington 

    • Ph.D. Student, University of Rochester 

    • pnnl/uw 

    • Applications Systems Engineer, Wells Fargo Bank 

    • Google Inc. 

    • Grad student, University of Washington 

    • R & D staff member, VMware Inc. 

    • Research Scientist, Intel Labs 

    • Research Scientist, Intel Labs 

    • Research Scientist, Intel Labs 

    • Research Scientist, Intel Labs 

    • Research Scientist, Intel Labs 

    • Rotation Engineer, Intel Labs 

    • Principal Engineer, Intel Labs 

    • Research Assistant, University of Washington 

    • Instructor, Yale University 

    • Google 

    • Principal Scientist, Technicolor Labs 

    • PhD student, UC Berkeley 

    • Research Scientist, Yahoo! Research 

    • Assistant Professor, VSB-Technical University of Ostrava 

    • Assistant Professor, VSB-Technical University of Ostrava 

    • Graduate Student Researcher, University of California, San Francisco 

    • Software Developer, Kaggle 

    • Kaggle 

    • Product Manager, Kaggle 

    • Data Scientist, Kaggle 

    • Research Staff Member, IBM Research 

    • Joanna Inc. 

    • Founder, Snapwiz, NA 

    • Principal Engineer, Docomo Innovations Inc./Data Mining Group 

    • Professor, VSB-Technical University of Ostrava 

    • Assistant Professor, VSB-Technical University of Ostrava 

    • Assistant Professor, VSB-Technical University of Ostrava 

    • Assistant Professor, VSB-Technical University of Ostrava 

    • Technicolor 

    • Mr, qlink resources 

    • Sr. Product Analyst, Apollo Group 

    • Computer Scientist, Frank Olken Consulting 

    • Timefire 

    • CTO, Yahoo! 

    • Graduate student, Carnegie Mellon University 

    • Machine Learning Dept, CMU 

    • Aurelius 

    • Tagschema 

    • MedHelp 

    • MedHelp 

    • Pandora 

    • Consultant, GLG 

    • Architect, Reltio 

    • CEO, Reltio 

    • Student, Lynbrook High School 

    • Engineer, self-employed 

    • Owner, Blackcloud BSG 

    • Carnegie Mellon University 

    • Carnegie Mellon University 

    • Data mining specialist, Shopify 

    • Stanford 

    • Associate Research Professor, Carnegie Mellon University 

    • Sales Rep, Oracle 

    • Founder, BIG DatalytiX, Inc. 

    • Researcher, Technicolor 

    • Senior Software Engineer, C3 

    • UCB/UCSF 

    • Research Scientist, Intel 

    • BDM, NetApp 

    • Senior Engineer, SAP Labs 

    • diegobasch.com 

    • Director of Playlist Engineering, Pandora 

    • Carnegie Mellon University 

    • Graduate Student, Carnegie Mellon University 

    • CTO, Joyent 

    • Carnegie Mellon University 

    • VP, Blackrock 

    • Director, Blackrock 

    • AdMobius 

    • Assistant Research Professor, TTI-Chicago 

    • NYU-Poly 

    • Data Scientist, Intuit 

    • University of Texas at Austin 

    • Co-Founder, Mailvest 

    • Director of Software Architecture, HauteLook, Inc. 

    • Software Engineer, Sift Science 

    • Software Engineer, Sift Science 

    • Software Engineer, Sift Science 

    • Development Director, Kabam Inc 

    • AppDynamics 

    • CEO, Ismion Inc 

    • VP, Corp Dev, Joyent, Inc 

    • Strategy Architect, undisclosed 

    • Chief Research Officer, ZVOOQ 

    • Professor, UC Berkeley 

    • Twitter Inc. 

    • Student, UPenn 

    • BigDataR Linux 

    • Student, Carnegie Mellon University 

    • American Express 

    • VP Research & Chief Data Scientist, madvertise Mobile Advertising GmbH 

    • VP Product & Prof Svcs, The Filter, Inc. 

    • Project Scientist, CMU 

    • Scientist, Yahoo 

    • Product Manager, One Kings Lane 

    • Sr. Software Engineer, One Kings Lane 

    • Founder, EigenDog 

    • Senior Researcher, Toyota ITC 

    • Postdoc, UC Santa Cruz 

    • Williams-Sonoma 

    • Senior Research Engineer, Cambridge Semantics 

    • Assistant Research Scientist, Dept. of Applied Math & Stats at Johns Hopkins 

Tuesday, May 15, 2012

GraphLab News!

There are a lot of exciting developments around GraphLab that I am glad to report here.
First of all, due to the tremendous support of Intel Labs, and especially Ted Willke, Intel is the platinum sponsor of our GraphLab workshop. We are going to have one or two lectures from Intel about their ongoing collaboration with us.

Additionally, with the great help of another Dr. Ted, this time Ted Dunning from MapR Technologies, MapR is going to be a gold sponsor of our workshop. Ted will also give a lecture at our workshop. Ted is also known for initiating the Apache Mahout project.

Amazon has extended our EC2 computing grant for an additional 6 months. Thanks to Carly Buwalda and James Hamilton.

And now to some more workshop news: Alex Smola from Yahoo! Research will give a talk about his large scale machine learning work. Joe Hellerstein from Berkeley will describe his Bloom work. Tao Ye and Eric Bieschke from Pandora Internet Radio will give a talk about music rating at Pandora. Xavier Amatriain from Netflix will describe machine learning activity at Netflix. Pankaj Gupta from Twitter will talk about Cassovary, their new graph processing system. Amol Ghoting from IBM Watson will describe his large scale ML work. Mohit Singh, one of our earliest adopters, will talk about GraphLab deployment at One Kings Lane.

We are also expecting some cool demos. Sungpack Hong from Oracle Labs will give a demo of Green-Marl, their graph processing system. Alex Gray from Georgia Tech will also give a demo.

Email me for a discount registration code! And don't forget to mention you are reading my blog!
:-)

As registrations pile up, we have people from the following companies: One Kings Lane, Discovix, Cambridge Semantics, Williams-Sonoma, Toyota ITC, EigenDog, Yahoo! Labs, The Filter Inc., Madvertise Mobile Advertising, BigDataR Linux, American Express, ZVOOQ, Twitter, Ismion, AppDynamics, Kabam, Sift Science, HauteLook, Mailvest, Intuit, AdMobius, Blackbox, Joyent, Pandora Internet Radio, DiegoBasch, SAP Labs, NetApp, Intel Labs, C3, Technicolor Labs, BIG DatalytiX, Blackcloud BSG, Shopify, Reltio, and Toyota Technical Institute Chicago.

We also have academic presence from the following universities: Carnegie Mellon University, Stanford, UC Berkeley, UC Santa Cruz, Georgia Tech, UPenn, Polytechnic NY, and Johns Hopkins.

Thanks to Shon Burton from GeekSessions, who is responsible for organizing the workshop.

We got an interesting email from Abhinav Vishnu, a senior researcher at the Pacific Northwest National Laboratory:
I am a research scientist at PNNL, and working on scalable execution modes, programming models and communication subsystems for some of the largest supercomputers today (InfiniBand, Cray’s, Blue Gene’s). Recently, we have started a project on large scale data analytics where we are looking at different algorithms for clustering, classification and ARM. I have been following GraphLab’s work and I think that there is a lot of synergy here. 
We are now looking for ways to collaborate with PNNL on extending GraphLab's applicability to supercomputers.

Monday, May 14, 2012

ELF (ensemble learning framework)

ELF is ensemble learning software recommended by JustinYan. Using it, you can combine several rating predictions into a single higher-quality prediction. It was written by Michael Jahrer, one of the winners of the Netflix Prize. We used it for KDD CUP 2011.

Disclaimer: this software is very rough and not for the faint-hearted. Installation is rather complicated, usage is rather complicated, and I have experienced many crashes. However, it is a very comprehensive effort towards creating a proper ensemble library.
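To give a feel for what an ensemble blend actually computes, here is a minimal Python sketch (my own illustration, not ELF code; the function name, the ridge regularization value, and the toy data are all assumptions): it fits regularized linear blending weights on held-out predictions from several base models and applies them to new predictions.

import numpy as np

def blend_predictions(holdout_preds, holdout_target, test_preds, reg=1e-4):
    # holdout_preds: (n_holdout, n_models) base-model predictions on a holdout set
    # holdout_target: (n_holdout,) true targets
    # test_preds: (n_test, n_models) base-model predictions to combine
    # add a constant column so the blend can learn a bias term
    X = np.column_stack([np.ones(len(holdout_preds)), holdout_preds])
    # ridge-regularized least squares: (X^T X + reg*I) w = X^T y
    A = X.T @ X + reg * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ holdout_target)
    X_test = np.column_stack([np.ones(len(test_preds)), test_preds])
    return X_test @ w

# toy usage: blend two noisy base models
rng = np.random.default_rng(0)
y = rng.normal(size=100)
preds = np.column_stack([y + rng.normal(scale=0.5, size=100),
                         y + rng.normal(scale=0.7, size=100)])
print(blend_predictions(preds, y, preds)[:5])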

Installation

Run Ubuntu 11.10 on an Intel platform (on Amazon EC2, use image ami-6743ae0e) and connect to the Ubuntu instance:
ssh -i graphlabkey.pem ubuntu@ec2-184-73-45-88.compute-1.amazonaws.com

sudo apt-get update
sudo apt-get install build-essential ia32-libs rpm gcc-multilib curl libcurl4-openssl-dev


Download the Intel C++ compiler from here:
You should select: Intel® C++ Composer XE 2011 for Linux, which includes the Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, and Intel® Parallel Building Blocks.

Register using the form; you will get an email with the license number.

tar xvzf l_ccompxe_intel64_2011.10.319.tgz
cd l_ccompxe_intel64_2011.10.319
./install.sh
>>select option 2
Follow the instructions using the default options until completion. Then add the following lines to /etc/ld.so.conf:
/opt/intel/composer_xe_2011_sp1.10.319/compiler/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/compiler/mkl/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/compiler/ipp/lib/intel64/

Run the command:
sudo ldconfig

For bash:
source /opt/intel/composer_xe_2011_sp1.10.319/bin/compilervars.sh intel64

Edit Makefile to have:
INTEL_PATH = /opt/intel/composer_xe_2011_sp1.10.319/

And also:
INCLUDE = -I$(INTEL_PATH)/compiler/include -I$(INTEL_PATH)/mkl/include -I$(INTEL_PATH)/ipp/include
LIB = -L$(INTEL_PATH)/mkl/lib/intel64/ -L$(INTEL_PATH)/ipp/lib/intel64/ -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcore -lipps -openmp -lpthread

Now run make. If all goes well, you will get an executable named ELF.

Common errors:
1) YahooFinance.h(6): catastrophic error: cannot open source file "curl/curl.h"
Solution: install libcurl4-openssl-dev as instructed above.
 2) AlgorithmExploration.o InputFeatureSelector.o KernelRidgeRegression.o NeuralNetworkRBMauto.o nnrbm.o Autoencoder.o GBDT.o LogisticRegression.o YahooFinance.o -L/opt/intel/composer_xe_2011_sp1.10.319//mkl/lib/em64t -L/opt/intel/composer_xe_2011_sp1.10.319//ipp/em64t/sharedlib -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcoreem64t -lippsem64t -openmp -lpthread 
ld: cannot find -lippcoreem64t
ld: cannot find -lippsem64t
make: *** [main] Error 1
Solution: edit the Makefile as instructed above.

Setting up the software

Prepare your training data in CSV format, where the last column is the target. Prepare your test data in CSV format as well. Create a directory named CSV, and inside it a file named Master.dsc with the following configuration:
dataset=CSV
isClassificationDataset=1

maxThreads=2
maxThreadsInCross=2
nCrossValidation=6
validationType=Retraining
positiveTarget=1.0
negativeTarget=-1.0
randomSeed=124391994
nMixDataset=20
nMixTrainList=100
standardDeviationMin=0.01
blendingRegularization=1e-4
blendingEnableCrossValidation=0
blendingAlgorithm=LinearRegression
enablePostNNBlending=0
enableCascadeLearning=0
enableGlobalMeanStdEstimate=0
enableSaveMemory=1
addOutputNoise=0
enablePostBlendClipping=0
enableFeatureSelection=0
featureSelectionWriteBinaryDataset=0
enableGlobalBlendingWeights=0
errorFunction=RMSE
disableWriteDscFile=0
enableStaticNormalization=0
#staticMeanNormalization=7.5
#staticStdNormalization=10
enableProbablisticNormalization=0
dimensionalityReduction=no
subsampleTrainSet=1.0
subsampleFeatures=1.0
globalTrainingLoops=1

[ALGORITHMS]
LinearModel_1.dsc
#KNearestNeighbor_1.dsc
#NeuralNetwork_1.dsc
#KernelRidgeRegression_1.dsc
#PolynomialRegression_1.dsc
#NeuralNetwork_1.dsc
#GBDT_1.dsc
Then create a LinearModel_1.dsc file with the following configuration:
ALGORITHM=LinearModel
ID=1
#TRAIN_ON_FULLPREDICTOR=
DISABLE=0

[int]
maxTuninigEpochs=10

[double]
initMaxSwing=1.0
initReg=0.01

[bool]
tuneRigeModifiers=0
enableClipping=0
enableTuneSwing=0

minimzeProbe=0
minimzeProbeClassificationError=0
minimzeBlend=1
minimzeBlendClassificationError=0

[string]
weightFile=LinearModel_1_weights.dat
fullPrediction=LinearModel_1.dat

Now create a subfolder called CSV/DataFiles, and inside it a file called settings.txt with the following:
delimiter=,
train=train.csv
trainTargetColumn=19
test=test.csv
Here train.csv and test.csv are your train and test filenames, and trainTargetColumn is the index of the target (last) column of your data (column numbers start from zero).

Note: the train and test files should have the same number of columns. If the test set does not have labels, add a column of zeros.
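To illustrate the expected input layout, here is a small Python sketch (my own helper, not part of ELF; the toy data and helper name are assumptions) that writes a train CSV with the target as the last column and a test CSV padded with a zero label column so both files have the same number of columns. It follows the CSV/DataFiles layout described above; note that for this toy data trainTargetColumn in settings.txt would be 3, not 19.

import csv
import os

def write_elf_csv(rows, targets, path):
    # write feature rows with the target appended as the last column
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for features, target in zip(rows, targets):
            writer.writerow(list(features) + [target])

os.makedirs("CSV/DataFiles", exist_ok=True)

# toy data: 3 features per row, so the target sits in column 3 (zero-based)
train_rows = [[0.1, 2.5, -1.0], [0.3, 1.1, 0.7]]
train_targets = [1.0, -1.0]
test_rows = [[0.2, 2.0, 0.0]]

write_elf_csv(train_rows, train_targets, "CSV/DataFiles/train.csv")
# the test set has no labels, so pad with zeros to keep the column count equal
write_elf_csv(test_rows, [0.0] * len(test_rows), "CSV/DataFiles/test.csv")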

Running ELF

For training do:

ubuntu@domU-12-31-35-00-21-42:~$ ./ELF CSV/ t
maxThreads(OPENMP): 4
Scheduler
Constructor Data
Open master .dsc file:CSV//Master.dsc
isClassificationDataset: 1
Set max. threads in MKL and IPP: 2
maxThreads(OPENMP): 2
Train 6-fold cross validation
ValidationType: Retraining
Set random seed to: 124391994
randomSeed: 124391994
frameworkMode: 0
Start scheduled training
Fill data
gradientBoostingLoops:1
DatasetReader
Read CSV from: CSV//DataFiles
#feat:5
Target values: [0]-1 [1]1 
descructor DatasetReader
reduce training set (current size:6162863) to 100% of its original size  [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features)  [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features)  [nothing to do]
Randomize the train dataset: 123257260 line swaps [..........] mixInd[0]:467808  mixInd[6162862]:3154542
Enable bagging:0
Set algorithm list (nTrained:0)
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
Alloc mem for cross validation data sets (doOnlyNormalization:0)
Cross-validation settings: 6 sets
Calculating mean and std per input
f:3lim f:4lim 

StdMin:0.01
Normalization:[Min|Max mean: -2.72612|-0.940528  Min|Max std: 0.01|0.687338]  Features: RawInputs[Min|Max value: -5.7863|0.64705]  AfterNormalization[Min|Max value:-4.45221|10.8926] on 5 features
Targets: min|max|mean [Nr0:-1|1|0.803235] [Nr1:-1|1|-0.803235] 
Save mean and std: CSV//TempFiles/normalization.dat.algo1.add0
Random seed:124391994
nFeatures:5
nClass:2
nDomain:1
nTrain:6162863 nValid:0 nTest:0
Make 616286300 index swaps (randomize sample index list)
partition size: 1.02714e+06
slot: TRAIN | PROBE
===================
0: 5135719 | 1027144
1: 5135719 | 1027144
2: 5135719 | 1027144
3: 5135720 | 1027143
4: 5135719 | 1027144
5: 5135719 | 1027144
6: 6162863 | 0
probe sum:6162863
Train algorithm:CSV//LinearModel_1.dsc
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
AlgoTemplate:CSV//LinearModel_1.dsc  Algo:CSV//DscFiles/LinearModel_1.dsc
Output File for cout redirect is set now to CSV//DscFiles/LinearModel_1.dsc
Floating point precision: 4 Bytes
Partition dataset to cross validation sets
Can not open effect file:CSV//FullPredictorFiles/
Init residuals
Write first 1000 lines of the trainset(Atrain.txt) and targets(AtrainTarget.txt)
Apply mean and std correction to train input features
Min/Max feature values after apply mean/std: -4.45221/10.8926
Min/Max target: -1/1
Mean target: 0.803235 -0.803235 

Constructor Data
Algorithm
StandardAlgorithm
LinearModel
Set data pointers
Start train StandardAlgorithm
Init standard algorithm
Read dsc maps (standard values)
Constructor BlendStopping
Number of predictors for blendStopping: 2 (+1 const, +1 new)


Blending regularization: 0.0001
 [CalcBlend] lambda:0.0001  [classErr:9.83825%] 
ERR Blend:0.59568

============================ START TRAIN (param tuning) =============================

Parameters to tune:
[REAL] name:reg   initValue:0.01
(min|max. epochs: 0|10)


==================== auto-optimize ====================

(epoch=0) reg=0.01 ...... [classErr:38.0955%]  [probe:0.992891]  [CalcBlend] lambda:0.0001  [classErr:9.83952%] ERR=0.583664 11[s][saveBest][SB]
(epoch=1) reg=0.008 ...... [classErr:38.1632%]  [probe:0.992889]  [CalcBlend] lambda:0.0001  [classErr:9.83963%] ERR=0.583661 11[s] !min! [saveBest][SB]
(epoch=2) reg=0.0064 ...... [classErr:38.2209%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83973%] ERR=0.58366 11[s] !min! [saveBest][SB] accelerate 
(epoch=3) reg=0.0048422 ...... [classErr:38.2776%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83976%] ERR=0.583661 11[s]
(epoch=4) reg=0.008 ...... [classErr:38.1632%]  [probe:0.992889]  [CalcBlend] lambda:0.0001  [classErr:9.83963%] ERR=0.583661 11[s]
(epoch=5) reg=0.00535367 ...... [classErr:38.2585%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83979%] ERR=0.583661 12[s]
(epoch=6) reg=0.00738248 ...... [classErr:38.1849%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83968%] ERR=0.583661 11[s]
(epoch=7) reg=0.00570903 ...... [classErr:38.2454%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=8) reg=0.00701252 ...... [classErr:38.1978%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83968%] ERR=0.58366 11[s]
(epoch=9) reg=0.00594873 ...... [classErr:38.2369%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=10) reg=0.00678554 max. epochs reached.
expSearchErrorBest:0.58366  error:0.58366

============================ END auto-optimize =============================


Calculate FullPrediction (write the prediction of the trainingset with cross validation)

Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
Save blending weights: CSV//TempFiles/blendingWeights_02.dat

Write full prediction: CSV//FullPredictorFiles/LinearModel_1.dat (RMSE:0.992888)
Validation type: Retraining
Update model on whole training set

Save:CSV//TempFiles/LinearModel_1_weights.dat.006
Calculate retrain RMSE (on trainset)
Train of this algorithm (RMSE after retraining): 0.992894
Total retrain time:3[s]

===========================================================================
Constructor BlendStopping
ADD:CSV//FullPredictorFiles/LinearModel_1.dat Number of predictors for blendStopping: 2 (+1 const)

File:CSV//FullPredictorFiles/LinearModel_1.dat  RMSE:0.992888

Blending regularization: 0.0001
 [CalcBlend] lambda:0.0001 Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
[Write train prediction:CSV//TempFiles/trainPrediction.data] nSamples:6162863
 [classErr:9.83973%] Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
Save blending weights: CSV//TempFiles/blendingWeights_02.dat

BLEND RMSE OF ACTUAL FULLPREDICTION PATH:0.58366
===========================================================================

destructor BlendStopping
delete algo
descructor LinearModel
descructor StandardAlgorithm
destructor BlendStopping
descructor Algorithm
destructor Data
Finished train algorithm:CSV//LinearModel_1.dsc
Finished in 275[s]
Clear output file for cout
Delete internal memory
Total training time:399[s]
descructor Scheduler
destructor Data

Friday, May 11, 2012

RBM (Restricted Boltzmann Machines) in GraphLab

I am glad to announce I have added an efficient multicore implementation of the restricted Boltzmann machines (RBM) algorithm. The algorithm is described in Hinton's paper. The code is based on excellent C code by my collaborator JustinYan, who, by the way, is still looking for a US-based internship!

Some explanation of the algorithm parameters:

1) The run mode should be set to 16.

 2) RBM assumes the rating is discrete (binned). For Netflix data, ratings are between 1 and 5, so we have 6 bins (0,1,2,3,4,5). For KDD CUP data, ratings are between 0 and 100; to save memory, we can scale them by 10 to get 11 bins (see the sketch below). --rbm_scaling tells the program how much to scale the ratings into bins.
--rbm_bins tells the program how many bins there are.

 3) RBM is a gradient descent type algorithm. --rbm_alpha is the step size, and --rbm_beta is the regularization parameter. --rbm_mult_step_dec tells the program by how much to decrease the step size at each iteration.
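As a small illustration of how --rbm_scaling and --rbm_bins relate (my own sketch, not GraphLab code; the exact rounding rule is an assumption), the bin index of a rating is roughly the rating divided by the scaling factor, and the number of bins is the maximum scaled rating plus one:

def rating_to_bin(rating, scaling):
    # assumed mapping: bin index = rating / scaling, rounded to an integer
    return int(round(rating / scaling))

def num_bins(max_rating, scaling):
    # bins run from 0 up to max_rating/scaling inclusive
    return int(round(max_rating / scaling)) + 1

# Netflix: ratings 1..5, no scaling -> 6 bins (0,1,2,3,4,5)
print(num_bins(5, 1), rating_to_bin(4, 1))       # 6 4
# KDD CUP: ratings 0..100, scaled by 10 -> 11 bins (0..10)
print(num_bins(100, 10), rating_to_bin(80, 10))  # 11 8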

 Example run:
./pmf smallnetflix_mm 16 --matrixmarket=true --scheduler="round_robin(max_iterations=10,block_size=1)" --rbm_scaling=1 --rbm_bins=6 --rbm_alpha=0.06 --rbm_beta=.1 --ncpus=8 --minval=1 --maxval=5 --rbm_mult_step_dec=0.8

INFO:     pmf.cpp(do_main:430): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(do_main:434): Program compiled with Eigen Support
Setting run mode RBM (Restriced Bolzman Machines)
INFO:     pmf.cpp(start:306): RBM (Restriced Bolzman Machines) starting

loading data file smallnetflix_mm
Loading Matrix Market file smallnetflix_mm TRAINING
Loading smallnetflix_mm TRAINING
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:131): Loaded total edges: 3298163
loading data file smallnetflix_mme
Loading Matrix Market file smallnetflix_mme VALIDATION
Loading smallnetflix_mme VALIDATION
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:131): Loaded total edges: 545177
loading data file smallnetflix_mmt
Loading Matrix Market file smallnetflix_mmt TEST
Loading smallnetflix_mmt TEST
skipping file
RBM (Restriced Bolzman Machines) for matrix (95526, 3561, 1):3298163.  D=20
INFO:     rbm.hpp(rbm_init:424): RBM initialization ok
complete. Objective=8.37956e-304, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO:     pmf.cpp(run_graphlab:251): starting with scheduler: round_robin
max iterations = 10
step = 1
Entering last iter with 1
5.99073) Iter RBM 1, TRAIN RMSE=0.9242 VALIDATION RMSE=0.9762.
Entering last iter with 2
11.0763) Iter RBM 2, TRAIN RMSE=0.9109 VALIDATION RMSE=0.9673.
Entering last iter with 3
16.1259) Iter RBM 3, TRAIN RMSE=0.9054 VALIDATION RMSE=0.9633.
Entering last iter with 4
21.2074) Iter RBM 4, TRAIN RMSE=0.9015 VALIDATION RMSE=0.9600.
Entering last iter with 5
26.3222) Iter RBM 5, TRAIN RMSE=0.8986 VALIDATION RMSE=0.9560.
Entering last iter with 6
31.409) Iter RBM 6, TRAIN RMSE=0.8960 VALIDATION RMSE=0.9540.
Entering last iter with 7
36.4693) Iter RBM 7, TRAIN RMSE=0.8941 VALIDATION RMSE=0.9508.
...
Let me know if you try it out!

Thursday, May 3, 2012

John Langford's transition from Yahoo! Labs to Microsoft Research NY

Is described in his blog post.

I especially liked the following paragraph:
Machine Learning turns out to be a very hot technology. Every company and government in the world is drowning in data, and Machine Learning is the prime tool for actually using it to do interesting things. More generally, the demand for high quality seasoned machine learning researchers across startups, mature companies, government labs, and academia has been astonishing, and I expect the outcome to reflect that.
To be honest, we were not worried about you, John, staying unemployed for long.

What about Vowpal Wabbit? Amongst other things, VW is the ultrascale learning algorithm, not the kind of thing that you would want to put aside lightly. I negotiated to continue the project and succeeded. This surprised me greatly—Microsoft has made serious commitments to supporting open source in various ways and that commitment is what sealed the deal for me. In return, I would like to see Microsoft always at or beyond the cutting edge in machine learning technology.

BigDataR Linux distro

I got this note from Nick Kolegraff:


I am working on a Linux Distro (BigDataR) www.bigdatarlinux.com with a focus around machine learning and have included Graphlab! I am working on building some compelling examples around Graphlab for the Graphlab workshop http://graphlab.org/workshop2012/
I've also started a project that surrounds BigDataR with some compelling examples. The idea here is to provide stable, consistent examples (or at least that is the thought): https://github.com/koooee/BigDataR_Examples
If anyone is interested in building some compelling Graphlab examples against BigDataR feel free to reach out, would love to chat.
Cheers, Nick
PS: this is very much in gamma/dev at the moment and have a lot of work to do so be gentle :-)

Nick has an awesome idea: a single Linux distribution with all the machine learning frameworks configured and ready to run. That way anyone can compare different platforms on the same data and find the best platform for their needs. Furthermore, it will significantly help newcomers learn how to run and operate different machine learning platforms.

Since we strongly support this effort, we have invited Nick to give a demo at our upcoming workshop.