Large Scale Machine Learning and Other Animals: Vowpal Wabbit Tutorial

Monday, January 9, 2012

Vowpal Wabbit Tutorial

Vowpal Wabbit is a popular online machine learning implementation for solving linear models like LASSO, sparse logistic regression, etc. Library was initiated in and written by John Langford, Yahoo! Research.

Download version 6.1 from here. Compile using:

make
make install

Note: A newer version of VW is now found in GitHub.

Here are some tutorial slides given in the big learning workshop.

Now to a quick example on how to run logistic regression:
Prepare an input file named inputfile with the following data in it:

-1 | 1:12 2:3.5 4:1e-2
1 | 3:11 4:12
-1 | 2:4 3:1

Explanation: -1/1 are the labels. 1:12 -> means that the first feature is 12. 2:3.5 means
that the 2nd feature is 3.5 and so on. Note that feature names can be strings, as well as their values. In case feature are string they will be hashed into integers during the run.

Now run vw using:

./vw -d A --loss_function logistic --readable_model outfile
using no cache
Reading from A
num sources = 1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
0.679009   0.679009          3      3.0    -1.0000  -0.1004        3

finished run
number of examples = 3
weighted example sum = 3
weighted label sum = -1
average loss = 0.679
best constant = -1
total feature number = 10

Explanation: -d is the input file. --loss_function is the type of loss function (can be one of: squared,logistic,hinge,quantile,classic). --readable_model speicifies the output file name in readable format.

The program output is:

bickson@thrust:~/JohnLangford-vowpal_wabbit-9c65131$ cat outfile 
Version 6.1
Min label:-100.000000 max label:100.000000
bits:18
ngram:0 skips:0
index:weight pairs:
rank:0
lda:0
1:-0.139726
2:-0.360716
3:-0.011953
4:0.074106
116060:-0.085449

In the above example, we did a single pass on the dataset. Now assume we want to make several passes for fine tuning the solution. We can do:

./vw -d inputfile --loss_function logistic --readable_model outfile --passes 6 -c

Explanation: -c means creating a cache file, which significantly speeds execution. it is required when running multiple iterations.

When running multiple passes we get:

creating cache_file = inputfile.cache
Reading from inputfile
num sources = 1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
0.895728   0.895728          3      3.0    -1.0000   0.2633        5
0.626871   0.358014          6      6.0    -1.0000  -0.9557        5
0.435506   0.205868         11     11.0     1.0000   1.1889        5

finished run
number of examples = 18
weighted example sum = 18
weighted label sum = -6
average loss = 0.3181
best constant = -0.4118
total feature number = 102

Now assume we want to compute predictions on test data. We use the same command as before:

./vw -d inputfile --loss_function logistic -f outfile

But we changes the --readable_model to -f, output binary file.

Next we compute predictions on test data using:

 ./vw -d testinputfile --loss_function logistic -i outfile -t -p out_predictions

Note that we use the -i flag to indicate the input model, and -p flag to output the predictions file to.

Further reading: a more comprehensive VW tutorial by Rob Zinkov.

28 comments:

Kevin FinkApril 24, 2012 at 6:09 PM
I'm still pretty new to VW, so thought I'd mention a couple of things that caused me to waste a fair bit of time when I first started using it.

First, the input file format is very complex and very particular. For example, it is extremely sensitive to white space - e.g.

1 |test two three

means something completely different than

1 | test two three

(The former has the features "two" and "three" in the "test" namespace, while the latter has three features and no namespace, resulting in a very different model.)

Also, the input file format is the same for generating predictions as for generating a model, even though certain fields (like the labels) are completely irrelevant and ignored when generating predictions.
ReplyDelete
Replies
radoApril 25, 2012 at 1:05 PM
Thank you for this simple end to end intro with commands you can actually run and understand the output from. The main vw site does not give you that.
ReplyDelete
Replies
AnonymousJune 7, 2012 at 9:12 AM
Thanks for this note. Super useful.
The input format is indeed very particular.
Does VW deal with multiclass classification? If yes, the first column can be any string other than 0/1. Is that correct?
ReplyDelete
Replies
Madan M DabbeeruAugust 1, 2012 at 12:27 AM
I would like to use VW for regression problems. The example provided in John's website is not clear to me.

Also, the spaces in the input format sometimes gives "malformed example". Though I changed the spaces of the input file. I still am getting these errors.
ReplyDelete
Replies
UnknownSeptember 2, 2012 at 9:59 AM
So far I've installed and compiled GraphCHI and VowpalWabbit (and boost) and both seem to be working with their test data.
I wrote a Java utility to convert flight CSV to VW input format. Head of several million records looks like this:

head /Volumes/brad/Dropbox-Overflow/ASADataExpo2009/2008.csv.txt
-14 1 1|DepDelay:8 FlightNum:335 Distance:810 DepTime:2003 ActualElapsedTime:128 ArrTime:2211 AirTime:116 DayofMonth:3 Month:1 DayOfWeek:4
2 1 2|DepDelay:19 FlightNum:3231 Distance:810 DepTime:754 ActualElapsedTime:128 ArrTime:1002 AirTime:113 DayofMonth:3 Month:1 DayOfWeek:4

Am I doing this right?

And running vw looks like this:

imac:vowpalWabbit Brad$ vw /Volumes/brad/Dropbox-Overflow/ASADataExpo2009/2008.csv.txt --cache --audit >audit.txt -p pred.txt
using cache_file = /Volumes/brad/Dropbox-Overflow/ASADataExpo2009/2008.csv.txt.cache
ignoring text input in favor of cache input
num sources = 1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = pred.txt
average since example example current current current
loss last counter weight label predict features
253.365987 253.365987 3 3.0 14.0000 -3.5527 10
222.075467 190.784946 6 6.0 11.0000 34.0000 10
183.182085 136.510027 11 11.0 1.0000 -0.3415 10
547.109904 911.037723 22 22.0 19.0000 -18.0000 10
993.467205 1439.824505 44 44.0 -26.0000 -26.0000 10
3621.129285 6309.899785 87 87.0 78.0000 120.9613 10
4906.932688 6192.736091 174 174.0 22.0000 114.8907 10
4332.321049 3757.709410 348 348.0 39.0000 39.3193 10
3316.887131 2301.453214 696 696.0 95.0000 78.1861 10
2940.052849 2563.218567 1392 1392.0 23.0000 79.7706 10
2308.099629 1676.146409 2784 2784.0 13.0000 -10.1179 10
2003.443074 1698.786520 5568 5568.0 2.0000 -5.9222 10
1964.735590 1926.021153 11135 11135.0 21.0000 0.9365 10
1384.648145 804.508600 22269 22269.0 28.0000 -0.0364 10
922.324877 459.980847 44537 44537.0 0.0000 -1.6663 10
888.889076 855.452525 89073 89073.0 -6.0000 -9.0099 10
1138.161107 1387.433138 178146 178146.0 17.0000 12.1046 10
1090.651227 1043.141080 356291 356291.0 -5.0000 -0.1046 10
1182.735683 1274.820139 712582 712582.0 45.0000 93.6976 10
1232.995228 1283.254843 1425163 1425163.0 -5.0000 -12.3362 10
1134.289299 1035.583370 2850326 2850326.0 43.0000 54.9404 10
1216.480057 1298.670845 5700651 5700651.0 42.0000 59.5428 10

finished run
number of examples = 6855029
weighted example sum = 6.855e+06
weighted label sum = 5.599e+07
average loss = 1376
best constant = 8.168
total feature number = 68550290

And the prediction file (pred.txt) looks like this:

imac:vowpalWabbit Brad$ head pred.txt
0.000000 1
-14.000000 2
-3.552719 3
-12.584439 4
34.000000 5
34.000000 6
57.000000 7
-1.245023 8
14.000894 9
-0.000006 10

imac:vowpalWabbit Brad$ tail pred.txt
27.738859 7009719
-0.110489 7009720
-3.580440 7009721
0.160012 7009722
-0.080507 7009723
3.700165 7009724
0.018907 7009725
-2.126096 7009726
15.932721 7009727
17.999048 7009728

The problem is, I don't have a clue as to what any of this means. I watched the author's presentation,
but he took so much for granted it didn't help at all.

I think what I'm missing is the insider lingo. I assume "label" means one datum
from the learning set, and id is row number in my case. If so,
what is a "prediction?", particularly insofar as there's a separate one for each input redord

And so forth for other terms in this output. such as:
average since example example current current current
loss last counter weight label predict features

So, how to read these sheep entrails? ;) Hope you can help.
ReplyDelete
Replies
AnonymousMarch 5, 2013 at 11:31 AM
How can you do online learning with this? I have a model that I've trained and would like to give new input to update the model. Is this supported by VW?
ReplyDelete
Replies
Robert PoorMarch 14, 2013 at 5:56 AM
I noticed that the biglearn.org link to the slides is broken. The old link was http://biglearn.org/files/slides/invited/langford-01.pdf, the correct link is http://biglearn.org/2011/files/slides/invited/langford-01.pdf
ReplyDelete
Replies
AnonymousJune 18, 2013 at 4:12 AM
i followed the above steps n getting the following error in ubuntu:

only testing
bad model format!
terminate called after throwing an instance of 'std::exception'
what(): std::exception
Aborted (core dumped)

am i missing something?

Thanks
ReplyDelete
Replies
AnonymousJuly 5, 2013 at 3:48 AM
Hi Danny,

I'm very new to VW (today in fact) and I am going to use it to have a play with the Kaggle Titanic data. I've formated the data file VW and run the training data. A binary model is created. Now I want to run that model against the test data. My input training data looks like this:

-1 survival| class:3 gender:0 age:22 sibsp:1 parch:0 fare:7.25
1 survival| class:1 gender:1 age:38 sibsp:1 parch:0 fare:71.2833

What format should the test data take? I've used the same layout as for the training data, but substituted the -1, 1 with 0 for survival. When I run VW it throws an exception 'bad model' error.

Any help would be greatly appreciated.

Thanks in advance
Paul
ReplyDelete
Replies
Paul AlfordJuly 5, 2013 at 3:50 AM
Here is a VW data file validator:

http://hunch.net/~vw/validate.html
ReplyDelete
Replies

Large Scale Machine Learning and Other Animals

Monday, January 9, 2012

Vowpal Wabbit Tutorial

28 comments:

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax