- Senior Data Scientist, LinkedIn
- TargetChoice
- Senior Machine Learning Researcher, GE Global Research
- Graduate Student, University of Rochester
- Research Staff Member, Samsung Advanced Institute of Technology
- The Harker School
- Professor, University of Washington
- Principal Data Scientist, PocketGems
- VMware
- Research Manager, Oracle Labs
- Chief Technology Officer, Cetas Cloud & Big Data A, VMware
- Vice President, Cetas Cloud & Big Data Analytics, VMware
- Carnegie Mellon University
- Netflix
- Postdoctoral Fellow, Stanford University
- Research Scientist, Google
- Graduate Student, Carnegie Mellon University
- Research Scientist, Akamai Technologies
- Team Leader, MeraLabs LLC
- Carnegie Mellon University
- Grad Student, University of Washington CSE
- University of Washington
- Ph.D. Student, University of Rochester
- pnnl/uw
- Applications Systems Engineer, Wells Fargo Bank
- Google Inc.
- Grad student, University of Washington
- R & D staff member, VMware Inc.
- Research Scientist, Intel Labs
- Research Scientist, Intel Labs
- Research Scientist, Intel Labs
- Research Scientist, Intel Labs
- Research Scientist, Intel Labs
- Rotation Engineer, Intel Labs
- Principal Engineer, Intel Labs
- Research Assistant, University of Washington
- Instructor, Yale University
- Google
- Principal Scientist, Technicolor Labs
- PhD student, UC Berkeley
- Research Scientist, Yahoo! Research
- Assistant Professor, VSB-Technical University of Ostrava
- Assistant Professor, VSB-Technical University of Ostrava
- Graduate Student Researcher, University of California, San Francisco
- Software Developer, Kaggle
- Kaggle
- Product Manager, Kaggle
- Data Scientist, Kaggle
- Research Staff Member, IBM Research
- Joanna Inc.
- Founder, Snapwiz, NA
- Principal Engineer, Docomo Innovations Inc./Data Mining Group
- Professor, VSB-Technical University of Ostrava
- Assistant Professor, VSB-Technical University of Ostrava
- Assistant Professor, VSB-Technical University of Ostrava
- Assistant Professor, VSB-Technical University of Ostrava
- Technicolor
- Mr, qlink resources
- Sr. Product Analyst, Apollo Group
- Computer Scientist, Frank Olken Consulting
- Timefire
- CTO, Yahoo!
- Graduate student, Carnegie Mellon University
- Machine Learning Dept, CMU
- Aurelius
- Tagschema
- MedHelp
- MedHelp
- Pandora
- Consultant, GLG
- Architect, Reltio
- CEO, Reltio
- student, Lynbrook Highschool
- engineer, self-employed
- Owner, Blackcloud BSG
- Carnegie Mellon University
- Carnegie Mellon University
- Data mining specialist, Shopify
- Stanford
- Associate Research Professor, Carnegie Mellon University
- Sales Rep, Oracle
- Founder, BIG DatalytiX, Inc.
- Researcher, Technicolor
- Senior Software Engineer, C3
- UCB/UCSF
- Research Scientist, Intel
- BDM, NetApp
- Senior Engineer, SAP Labs
- diegobasch.com
- Director of Playlist Engineering, Pandora
- Carnegie Mellon University
- Graduate Student, Carnegie Mellon University
- CTO, Joyent
- Carnegie Mellon University
- VP, Blackrock
- Director, Blackrock
- AdMobius
- Assistant Research Professor, TTI-Chicago
- NYU-Poly
- Data Scientist, Intuit
- University of Texas at Austin
- Co-Founder, Mailvest
- Director of Software Architecture, HauteLook, Inc.
- Software Engineer, Sift Science
- Software Engineer, Sift Science
- Software Engineer, Sift Science
- Development Director, Kabam Inc
- AppDynamics
- CEO, Ismion Inc
- VP, Corp Dev, Joyent, Inc
- Strategy Architect, undisclosed
- Chief Research Officer, ZVOOQ
- Professor, UC Berkeley
- Twitter Inc.
- Student, UPenn
- BigDataR Linux
- Student, Carnegie Mellon University
- American Express
- VP Research & Chief Data Scientist, madvertise Mobile Advertising GmbH
- VP Product & Prof Svcs, The Filter, Inc.
- Project Scientist, CMU
- Scientist, Yahoo
- Product Manager, One Kings Lane
- Sr. Software Engineer, One Kings Lane
- Founder, EigenDog
- Senior Researcher, Toyota ITC
- Postdoc, UC Santa Cruz
- Williams-Sonoma
- Senior Research Engineer, Cambridge Semantics
- Assistant Research Scientist, Dept. of Applied Math & Stats at Johns Hopkins
Tuesday, May 29, 2012
San Francisco - here we come!
I am glad to report that we already have 135 registrations for the GraphLab workshop. We are running out of space fast! Register today or you may be left out. Here is our current who's who list:
Tuesday, May 15, 2012
GraphLab News!
There are a lot of exciting developments around GraphLab that I am glad to report here.
First of all, thanks to the tremendous support of Intel Labs, and especially Ted Willke, Intel is the platinum sponsor of our GraphLab workshop. We are going to have one or two lectures from Intel about their ongoing collaboration with us.
Additionally, with the great help of another Dr. Ted, this time Ted Dunning from MapR Technologies, MapR is going to be a gold sponsor of our workshop. Ted, who is also known for initiating the Apache Mahout project, will give a lecture at our workshop.
Amazon has extended our EC2 computing grant for an additional 6 months. Thanks to Carly Buwalda and James Hamilton.
And now to some more workshop news: Alex Smola from Yahoo! Research will give a talk about his large-scale machine learning work. Joe Hellerstein from Berkeley will describe his Bloom work. Tao Ye and Eric Bieschke from Pandora Internet Radio will give a talk about music rating at Pandora. Xavier Amatriain from Netflix will describe machine learning activity at Netflix. Pankaj Gupta from Twitter will talk about Cassovary, their new graph processing system. Amol Ghoting from IBM Watson will describe his large-scale ML work. Mohit Singh, one of our earliest adopters, will talk about GraphLab deployment at One Kings Lane.
We are also expecting some cool demos. Sungpack Hong from Oracle Labs will give a demo of Green-Marl, their graph processing system. Alex Gray from Georgia Tech will also give a demo.
Email me for a discount registration code! And don't forget to mention you are reading my blog!
:-)
As registrations pile up, we have people from the following companies: One Kings Lane, Discovix, Cambridge Semantics, Williams-Sonoma, Toyota ITC, EigenDog, Yahoo! Labs, The Filter Inc, madvertise Mobile Advertising, BigDataR Linux, American Express, ZVOOQ, Twitter, Ismion, AppDynamics, Kabam, Sift Science, HauteLook, Mailvest, Intuit, AdMobius, Blackrock, Joyent, Pandora Internet Radio, DiegoBasch, SAP Labs, NetApp, Intel Labs, C3, Technicolor Labs, BIG DatalytiX, Blackcloud BSG, Shopify, Reltio, Toyota Technological Institute at Chicago.
We also have an academic presence from the following universities: Carnegie Mellon University, Stanford, UC Berkeley, UC Santa Cruz, Georgia Tech, UPenn, Polytechnic NY, Johns Hopkins.
Thanks to Shon Burton from GeekSessions, who is responsible for organizing the workshop.
We got an interesting email from Abhinav Vishnu, a senior researcher at the Pacific Northwest National Lab:
I am a research scientist at PNNL, and working on scalable execution modes, programming models and communication subsystems for some of the largest supercomputers today (InfiniBand, Cray’s, Blue Gene’s). Recently, we have started a project on large scale data analytics where we are looking at different algorithms for clustering, classification and ARM. I have been following GraphLab’s work and I think that there is a lot of synergy here.
We are now looking to find some ways to collaborate with PNNL for extending GraphLab applicability for supercomputers.
Monday, May 14, 2012
ELF (ensemble learning framework)
ELF is an ensemble learning package recommended by JustinYan. It combines the predictions of several models into a single, higher-quality prediction. It was written by Michael Jahrer, one of the winners of the Netflix Prize. We used it for KDD CUP 2011.
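To make the idea of combining several predictions concrete, here is a minimal sketch of linear blending with a small ridge penalty, written in Python with NumPy. It is not ELF's code: the function names, the synthetic data and the exact way the regularization is applied are assumptions made for illustration. The only details taken from ELF are that the blender is a linear regression, that a constant predictor is included as the first column, and the default regularization value of 1e-4.

```python
import numpy as np


def add_constant(predictions):
    """Prepend a column of ones (the constant predictor) to the predictions."""
    n = predictions.shape[0]
    return np.hstack([np.ones((n, 1)), predictions])


def blend_weights(predictions, targets, reg=1e-4):
    """Solve the ridge-regularized least squares problem for blending weights.

    predictions: (n_samples, n_models) array, one column per model.
    targets:     (n_samples,) array of true values.
    reg:         ridge penalty (assumed to be added directly to the diagonal).
    """
    X = add_constant(predictions)
    A = X.T @ X + reg * np.eye(X.shape[1])
    b = X.T @ targets
    return np.linalg.solve(A, b)


def blend(predictions, weights):
    """Apply blending weights to a matrix of model predictions."""
    return add_constant(predictions) @ weights


def rmse(a, b):
    """Root mean squared error between two vectors."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic example (hypothetical data): two noisy models of the same target.
rng = np.random.default_rng(0)
truth = rng.uniform(1.0, 5.0, size=1000)
model_a = truth + rng.normal(0.0, 0.9, size=1000)
model_b = truth + rng.normal(0.0, 1.1, size=1000)

preds = np.column_stack([model_a, model_b])
w = blend_weights(preds, truth)
blended = blend(preds, w)

print("RMSE model A:", rmse(model_a, truth))
print("RMSE model B:", rmse(model_b, truth))
print("RMSE blend:  ", rmse(blended, truth))
```

Because the two models make independent errors, the blend should score a lower RMSE than either model alone. In practice the weights should be fitted on held-out (cross-validated) predictions rather than on training predictions.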
Disclaimer: this software is very rough and not for the faint-hearted. Installation is rather complicated, usage is rather complicated, and I have experienced many crashes. However, it is a very comprehensive effort toward creating a proper ensemble library.
Installation
Run Ubuntu 11.10 on an Intel platform (on Amazon EC2 use image ami-6743ae0e). Connect to the Ubuntu instance and install the required packages:
ssh -i graphlabkey.pem ubuntu@ec2-184-73-45-88.compute-1.amazonaws.com
sudo apt-get update
sudo apt-get install build-essential ia32-libs rpm gcc-multilib curl libcurl4-openssl-dev
Download Intel c++ compiler from here:
You should select: Intel® C++ Composer XE 2011 for Linux Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® Parallel Building Blocks
Register using the form, you will get an email with the license number.
tar xvzf l_ccompxe_intel64_2011.10.319.tgz
cd l_ccompxe_intel64_2011.10.319
./install.sh
>>select option 2
Follow the instructions using the default options until completion. Add the following lines to /etc/ld.so.conf:
/opt/intel/composer_xe_2011_sp1.10.319/compiler/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/compiler/mkl/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/compiler/ipp/lib/intel64/
Run the command:
sudo ldconfig
For bash:
source /opt/intel/composer_xe_2011_sp1.10.319/bin/compilervars.sh intel64
Edit Makefile to have:
INTEL_PATH = /opt/intel/composer_xe_2011_sp1.10.319/
And also:
INCLUDE = -I$(INTEL_PATH)/compiler/include -I$(INTEL_PATH)/mkl/include -I$(INTEL_PATH)/ipp/include
LIB = -L$(INTEL_PATH)/mkl/lib/intel64/ -L$(INTEL_PATH)/ipp/lib/intel64/ -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcore -lipps -openmp -lpthread
Now run make. If all went fine you will get an executable named ELF.
Common errors:
1) YahooFinance.h(6): catastrophic error: cannot open source file "curl/curl.h"
Solution: install libcurl4-openssl-dev as instructed above.
2) AlgorithmExploration.o InputFeatureSelector.o KernelRidgeRegression.o NeuralNetworkRBMauto.o nnrbm.o Autoencoder.o GBDT.o LogisticRegression.o YahooFinance.o -L/opt/intel/composer_xe_2011_sp1.10.319//mkl/lib/em64t -L/opt/intel/composer_xe_2011_sp1.10.319//ipp/em64t/sharedlib -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcoreem64t -lippsem64t -openmp -lpthread
ld: cannot find -lippcoreem64t
ld: cannot find -lippsem64t
make: *** [main] Error 1
Solution: edit the Makefile as instructed above.
Setting up the software
Prepare your training data in CSV format, where the last column is the target. Prepare your test data in CSV format. Create a directory named CSV, and inside it a file named Master.dsc with the following configuration:
dataset=CSV
isClassificationDataset=1
maxThreads=2
maxThreadsInCross=2
nCrossValidation=6
validationType=Retraining
positiveTarget=1.0
negativeTarget=-1.0
randomSeed=124391994
nMixDataset=20
nMixTrainList=100
standardDeviationMin=0.01
blendingRegularization=1e-4
blendingEnableCrossValidation=0
blendingAlgorithm=LinearRegression
enablePostNNBlending=0
enableCascadeLearning=0
enableGlobalMeanStdEstimate=0
enableSaveMemory=1
addOutputNoise=0
enablePostBlendClipping=0
enableFeatureSelection=0
featureSelectionWriteBinaryDataset=0
enableGlobalBlendingWeights=0
errorFunction=RMSE
disableWriteDscFile=0
enableStaticNormalization=0
#staticMeanNormalization=7.5
#staticStdNormalization=10
enableProbablisticNormalization=0
dimensionalityReduction=no
subsampleTrainSet=1.0
subsampleFeatures=1.0
globalTrainingLoops=1
[ALGORITHMS]
LinearModel_1.dsc
#KNearestNeighbor_1.dsc
#NeuralNetwork_1.dsc
#KernelRidgeRegression_1.dsc
#PolynomialRegression_1.dsc
#NeuralNetwork_1.dsc
#GBDT_1.dsc
Then create a LinearModel_1.dsc file with the following configuration:
ALGORITHM=LinearModel
ID=1
#TRAIN_ON_FULLPREDICTOR=
DISABLE=0
[int]
maxTuninigEpochs=10
[double]
initMaxSwing=1.0
initReg=0.01
[bool]
tuneRigeModifiers=0
enableClipping=0
enableTuneSwing=0
minimzeProbe=0
minimzeProbeClassificationError=0
minimzeBlend=1
minimzeBlendClassificationError=0
[string]
weightFile=LinearModel_1_weights.dat
fullPrediction=LinearModel_1.dat
Now create a subfolder called CSV/DataFiles, inside it a file called settings.txt with the following:
delimiter=,
train=train.csv
trainTargetColumn=19
test=test.csv
Here train.csv and test.csv are your train and test filenames, and trainTargetColumn is the index of the last column of your data (column numbers start from zero).
Note: train and test should have the same number of columns. If the test does not have labels, then add a column with zeros.
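The data preparation step above is easy to get wrong by hand, so here is a small Python sketch that automates it. It pads an unlabeled test file with a column of zeros, checks that the column counts match, and writes settings.txt with the zero-based index of the last column. The helper names (read_rows, write_rows, prepare_elf_data) and the sample rows are hypothetical, and the one-key-per-line layout of settings.txt is an assumption; only the four settings keys come from the instructions above.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path, delimiter=","):
    """Read a CSV file into a list of rows, skipping empty lines."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f, delimiter=delimiter) if row]


def write_rows(path, rows, delimiter=","):
    """Write rows to a CSV file using plain newline line endings."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter, lineterminator="\n").writerows(rows)


def prepare_elf_data(train_path, test_path, data_dir, delimiter=","):
    """Copy train/test into data_dir and write a matching settings.txt.

    Returns the zero-based index of the target (last) column.
    """
    train = read_rows(train_path, delimiter)
    test = read_rows(test_path, delimiter)

    n_cols = len(train[0])
    if any(len(row) != n_cols for row in train):
        raise ValueError("train rows have inconsistent column counts")

    if all(len(row) == n_cols - 1 for row in test):
        # Test set has no labels: add a column of zeros as a placeholder.
        test = [row + ["0"] for row in test]
    elif any(len(row) != n_cols for row in test):
        raise ValueError("test must have the same number of columns as train")

    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    write_rows(data_dir / "train.csv", train, delimiter)
    write_rows(data_dir / "test.csv", test, delimiter)

    target_col = n_cols - 1  # column numbers start from zero
    (data_dir / "settings.txt").write_text(
        f"delimiter={delimiter}\n"
        "train=train.csv\n"
        f"trainTargetColumn={target_col}\n"
        "test=test.csv\n"
    )
    return target_col


# Tiny demonstration in a temporary directory (hypothetical data).
work = Path(tempfile.mkdtemp())
write_rows(work / "raw_train.csv", [["0.1", "0.2", "1"], ["0.3", "0.4", "-1"]])
write_rows(work / "raw_test.csv", [["0.5", "0.6"]])

target_col = prepare_elf_data(
    work / "raw_train.csv", work / "raw_test.csv", work / "CSV" / "DataFiles"
)
print("trainTargetColumn =", target_col)
```

For a real dataset with 20 columns this would write trainTargetColumn=19, matching the example settings above.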
Running ELF
For training do:
ubuntu@domU-12-31-35-00-21-42:~$ ./ELF CSV/ t
maxThreads(OPENMP): 4
Scheduler
Constructor Data
Open master .dsc file:CSV//Master.dsc
isClassificationDataset: 1
Set max. threads in MKL and IPP: 2
maxThreads(OPENMP): 2
Train 6-fold cross validation
ValidationType: Retraining
Set random seed to: 124391994
randomSeed: 124391994
frameworkMode: 0
Start scheduled training
Fill data
gradientBoostingLoops:1
DatasetReader
Read CSV from: CSV//DataFiles
#feat:5
Target values: [0]-1 [1]1
descructor DatasetReader
reduce training set (current size:6162863) to 100% of its original size [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features) [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features) [nothing to do]
Randomize the train dataset: 123257260 line swaps [..........] mixInd[0]:467808 mixInd[6162862]:3154542
Enable bagging:0
Set algorithm list (nTrained:0)
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
Alloc mem for cross validation data sets (doOnlyNormalization:0)
Cross-validation settings: 6 sets
Calculating mean and std per input
f:3lim f:4lim StdMin:0.01
Normalization:[Min|Max mean: -2.72612|-0.940528 Min|Max std: 0.01|0.687338]
Features: RawInputs[Min|Max value: -5.7863|0.64705] AfterNormalization[Min|Max value:-4.45221|10.8926] on 5 features
Targets: min|max|mean [Nr0:-1|1|0.803235] [Nr1:-1|1|-0.803235]
Save mean and std: CSV//TempFiles/normalization.dat.algo1.add0
Random seed:124391994
nFeatures:5
nClass:2
nDomain:1
nTrain:6162863
nValid:0
nTest:0
Make 616286300 index swaps (randomize sample index list)
partition size: 1.02714e+06
slot: TRAIN | PROBE
===================
0: 5135719 | 1027144
1: 5135719 | 1027144
2: 5135719 | 1027144
3: 5135720 | 1027143
4: 5135719 | 1027144
5: 5135719 | 1027144
6: 6162863 | 0
probe sum:6162863
Train algorithm:CSV//LinearModel_1.dsc
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
AlgoTemplate:CSV//LinearModel_1.dsc Algo:CSV//DscFiles/LinearModel_1.dsc
Output File for cout redirect is set now to CSV//DscFiles/LinearModel_1.dsc
Floating point precision: 4 Bytes
Partition dataset to cross validation sets
Can not open effect file:CSV//FullPredictorFiles/
Init residuals
Write first 1000 lines of the trainset(Atrain.txt) and targets(AtrainTarget.txt)
Apply mean and std correction to train input features
Min/Max feature values after apply mean/std: -4.45221/10.8926
Min/Max target: -1/1
Mean target: 0.803235 -0.803235
Constructor Data
Algorithm
StandardAlgorithm
LinearModel
Set data pointers
Start train StandardAlgorithm
Init standard algorithm
Read dsc maps (standard values)
Constructor BlendStopping
Number of predictors for blendStopping: 2 (+1 const, +1 new)
Blending regularization: 0.0001
[CalcBlend] lambda:0.0001 [classErr:9.83825%] ERR Blend:0.59568
============================ START TRAIN (param tuning) =============================
Parameters to tune:
[REAL] name:reg initValue:0.01
(min|max. epochs: 0|10)
==================== auto-optimize ====================
(epoch=0) reg=0.01 ...... [classErr:38.0955%] [probe:0.992891] [CalcBlend] lambda:0.0001 [classErr:9.83952%] ERR=0.583664 11[s][saveBest][SB]
(epoch=1) reg=0.008 ...... [classErr:38.1632%] [probe:0.992889] [CalcBlend] lambda:0.0001 [classErr:9.83963%] ERR=0.583661 11[s] !min! [saveBest][SB]
(epoch=2) reg=0.0064 ...... [classErr:38.2209%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83973%] ERR=0.58366 11[s] !min! [saveBest][SB] accelerate
(epoch=3) reg=0.0048422 ...... [classErr:38.2776%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83976%] ERR=0.583661 11[s]
(epoch=4) reg=0.008 ...... [classErr:38.1632%] [probe:0.992889] [CalcBlend] lambda:0.0001 [classErr:9.83963%] ERR=0.583661 11[s]
(epoch=5) reg=0.00535367 ...... [classErr:38.2585%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83979%] ERR=0.583661 12[s]
(epoch=6) reg=0.00738248 ...... [classErr:38.1849%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83968%] ERR=0.583661 11[s]
(epoch=7) reg=0.00570903 ...... [classErr:38.2454%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=8) reg=0.00701252 ...... [classErr:38.1978%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83968%] ERR=0.58366 11[s]
(epoch=9) reg=0.00594873 ...... [classErr:38.2369%] [probe:0.992888] [CalcBlend] lambda:0.0001 [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=10) reg=0.00678554 max. epochs reached.
expSearchErrorBest:0.58366 error:0.58366
============================ END auto-optimize =============================
Calculate FullPrediction (write the prediction of the trainingset with cross validation)
Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011
-0.799 1.011
Save blending weights: CSV//TempFiles/blendingWeights_02.dat
Write full prediction: CSV//FullPredictorFiles/LinearModel_1.dat (RMSE:0.992888)
Validation type: Retraining
Update model on whole training set
Save:CSV//TempFiles/LinearModel_1_weights.dat.006
Calculate retrain RMSE (on trainset)
Train of this algorithm (RMSE after retraining): 0.992894
Total retrain time:3[s]
===========================================================================
Constructor BlendStopping
ADD:CSV//FullPredictorFiles/LinearModel_1.dat
Number of predictors for blendStopping: 2 (+1 const)
File:CSV//FullPredictorFiles/LinearModel_1.dat RMSE:0.992888
Blending regularization: 0.0001
[CalcBlend] lambda:0.0001
Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011
-0.799 1.011
[Write train prediction:CSV//TempFiles/trainPrediction.data] nSamples:6162863
[classErr:9.83973%]
Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011
-0.799 1.011
Save blending weights: CSV//TempFiles/blendingWeights_02.dat
BLEND RMSE OF ACTUAL FULLPREDICTION PATH:0.58366
===========================================================================
destructor BlendStopping
delete algo
descructor LinearModel
descructor StandardAlgorithm
destructor BlendStopping
descructor Algorithm
destructor Data
Finished train algorithm:CSV//LinearModel_1.dsc
Finished in 275[s]
Clear output file for cout
Delete internal memory
Total training time:399[s]
descructor Scheduler
destructor Data
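The TRAIN | PROBE table in the log follows directly from splitting 6,162,863 training samples into 6 cross-validation folds. The short Python sketch below reproduces those sizes. The function name fold_sizes is hypothetical, and which fold ends up one sample smaller is an arbitrary choice here (in the log above it is slot 3); this is not ELF's partitioning code.

```python
def fold_sizes(n_samples, n_folds):
    """Split n_samples into n_folds probe sets whose sizes differ by at most one."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_folds)]


n_train = 6162863  # nTrain from the log above
probe_sizes = fold_sizes(n_train, 6)

print("slot: TRAIN | PROBE")
for slot, probe in enumerate(probe_sizes):
    # Each fold trains on everything outside its probe set.
    print(f"{slot}: {n_train - probe} | {probe}")
print("probe sum:", sum(probe_sizes))
```

Five folds get 1,027,144 probe samples and one gets 1,027,143, so every sample is used for validation exactly once. Slot 6 in the log (6162863 | 0) corresponds to the final retraining on the whole training set.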
Friday, May 11, 2012
RBM (Restricted Boltzmann Machines) in GraphLab
I am glad to announce that I have added an efficient multicore implementation of the restricted Boltzmann machine (RBM) algorithm.
The algorithm is described in Hinton's paper. The code is based on excellent C code by my collaborator JustinYan, who, by the way, is still looking for a US-based internship!
Some explanation about the algorithm parameters:
1) run mode should be set to 16
2) RBM works with binary units, so each rating is mapped to one of a fixed number of bins. For Netflix data, ratings range from 1 to 5, so we have 6 bins (0,1,2,3,4,5). For KDD CUP data, ratings range from 0 to 100. To save memory, we can scale them by 10 to get 11 bins.
--rbm_scaling tells the program how much to scale the bins.
--rbm_bins tells the program how many bins there are.
3) RBM is a gradient descent type algorithm. --rbm_alpha is the step size, and --rbm_beta is the regularization parameter. --rbm_mult_step_dec tells the program how much to decrease the step size at each iteration.
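The arithmetic behind these parameters can be sketched in a few lines of Python. The function names are hypothetical, and the multiplicative form of the step size decay is an assumption based on the parameter name; this is not the GraphLab implementation.

```python
def num_bins(maxval, scaling):
    """Number of rating bins when ratings 0..maxval are divided by `scaling`."""
    return int(maxval / scaling) + 1


def rating_to_bin(rating, scaling):
    """Map a raw rating to its bin index."""
    return int(rating / scaling)


def step_size(alpha, mult_step_dec, iteration):
    """Step size after `iteration` multiplicative decays (assumed schedule)."""
    return alpha * mult_step_dec ** iteration


# Netflix: ratings 1..5, no scaling -> bins 0..5
print(num_bins(5, 1))        # 6
# KDD CUP: ratings 0..100, scaled by 10 -> bins 0..10
print(num_bins(100, 10))     # 11
print(rating_to_bin(87, 10)) # 8

# With --rbm_alpha=0.06 and --rbm_mult_step_dec=0.8
for it in range(3):
    print(it, step_size(0.06, 0.8, it))
```

These values match the settings used in the example run: --rbm_scaling=1 with --rbm_bins=6 for Netflix-style ratings, and a step size that starts at 0.06 and shrinks by a factor of 0.8 per iteration.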
Example run:
./pmf smallnetflix_mm 16 --matrixmarket=true --scheduler="round_robin(max_iterations=10,block_size=1)" --rbm_scaling=1 --rbm_bins=6 --rbm_alpha=0.06 --rbm_beta=.1 --ncpus=8 --minval=1 --maxval=5 --rbm_mult_step_dec=0.8
INFO: pmf.cpp(do_main:430): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:434): Program compiled with Eigen Support
Setting run mode RBM (Restriced Bolzman Machines)
INFO: pmf.cpp(start:306): RBM (Restriced Bolzman Machines) starting
loading data file smallnetflix_mm
Loading Matrix Market file smallnetflix_mm TRAINING
Loading smallnetflix_mm TRAINING
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:131): Loaded total edges: 3298163
loading data file smallnetflix_mme
Loading Matrix Market file smallnetflix_mme VALIDATION
Loading smallnetflix_mme VALIDATION
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:131): Loaded total edges: 545177
loading data file smallnetflix_mmt
Loading Matrix Market file smallnetflix_mmt TEST
Loading smallnetflix_mmt TEST
skipping file
RBM (Restriced Bolzman Machines) for matrix (95526, 3561, 1):3298163. D=20
INFO: rbm.hpp(rbm_init:424): RBM initialization ok
complete. Objective=8.37956e-304, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:251): starting with scheduler: round_robin
max iterations = 10
step = 1
Entering last iter with 1
5.99073) Iter RBM 1, TRAIN RMSE=0.9242 VALIDATION RMSE=0.9762.
Entering last iter with 2
11.0763) Iter RBM 2, TRAIN RMSE=0.9109 VALIDATION RMSE=0.9673.
Entering last iter with 3
16.1259) Iter RBM 3, TRAIN RMSE=0.9054 VALIDATION RMSE=0.9633.
Entering last iter with 4
21.2074) Iter RBM 4, TRAIN RMSE=0.9015 VALIDATION RMSE=0.9600.
Entering last iter with 5
26.3222) Iter RBM 5, TRAIN RMSE=0.8986 VALIDATION RMSE=0.9560.
Entering last iter with 6
31.409) Iter RBM 6, TRAIN RMSE=0.8960 VALIDATION RMSE=0.9540.
Entering last iter with 7
36.4693) Iter RBM 7, TRAIN RMSE=0.8941 VALIDATION RMSE=0.9508.
...
Let me know if you try it out!
Thursday, May 3, 2012
John Langford's transition from Yahoo! Labs to Microsoft Research NY
Is described in his blog post.
I especially liked the following paragraph:
Machine Learning turns out to be a very hot technology. Every company and government in the world is drowning in data, and Machine Learning is the prime tool for actually using it to do interesting things. More generally, the demand for high quality seasoned machine learning researchers across startups, mature companies, government labs, and academia has been astonishing, and I expect the outcome to reflect that.
To be honest, John, we were not worried about you staying unemployed for long.
What about Vowpal Wabbit? Amongst other things, VW is the ultrascale learning algorithm, not the kind of thing that you would want to put aside lightly. I negotiated to continue the project and succeeded. This surprised me greatly—Microsoft has made serious commitments to supporting open source in various ways and that commitment is what sealed the deal for me. In return, I would like to see Microsoft always at or beyond the cutting edge in machine learning technology.
BigDataR Linux distro
I got this note from Nick Kolegraff:
I am working on a Linux Distro (BigDataR) www.bigdatarlinux.com with a focus around machine learning and have included Graphlab!
I am working on building some compelling examples around Graphlab for the Graphlab workshop http://graphlab.org/workshop2012/
I've also started a project that surrounds BigDataR with some compelling examples. The idea here is to provide stable consistent examples (or at least that is the thought): https://github.com/koooee/BigDataR_Examples
If anyone is interested in building some compelling Graphlab examples against BigDataR feel free to reach out, would love to chat.
Cheers, Nick
PS: this is very much in gamma/dev at the moment and have a lot of work to do so be gentle :-)
Nick has an awesome idea: a single Linux distribution with all the machine learning frameworks configured and ready to run. That way anyone can compare different platforms on the same data to find the best platform for their needs. Furthermore, it will significantly help newcomers learn how to run and operate different machine learning platforms.
Since we highly support this effort, we invited Nick to give a demo in our upcoming workshop.