I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.
Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.
And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab; CREATE SCHEMA glab; -- Create a table to hold some bogus data DROP TABLE IF EXISTS glab.bs_data; CREATE TABLE glab.bs_data ( rw int DEFAULT NULL , cl int DEFAULT NULL , vl numeric DEFAULT NULL ) DISTRIBUTED BY (rw, cl) ; -- Now create some bogus data. I use a few postgres tricks. -- Ultimately, this just creates a relatively sparse "matrix" in tall format -- I'm trying to use best practices in formatting and stuff INSERT INTO glab.bs_data (rw, cl, vl) SELECT x , y , random() AS vl FROM ( SELECT CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y FROM generate_series(1, 1000) AS x , generate_series(1, 1000) AS y ) AS A WHERE 1=1 AND a.x IS NOT NULL AND a.y IS NOT NULL ;After the table is ready we can push it into a file:
-- So we have some data. Let's create an external table -- to push it OUT of the database -- You need to run gpfdist to write to an external table. -- In this example, from the directory /Users/gpadmin/glab/tables the command was -- > gpfdist DROP EXTERNAL TABLE IF EXISTS glab.ext_out; CREATE WRITABLE EXTERNAL TABLE glab.ext_out ( i bigint , j bigint , v numeric ) LOCATION ('gpfdist://macbuddha.local/glab_out.txt') FORMAT 'TEXT' (DELIMITER ',' NULL ' ') DISTRIBUTED BY (i, j) ; -- Right, now let's put the data in it. This is two queries unioned so -- we can create the first row. Best to do them separately. INSERT INTO glab.ext_out SELECT max(rw) AS i , max(cl) AS j , count(*) AS v FROM glab.bs_data; INSERT INTO glab.ext_out SELECT rw AS i , cl AS j , vl AS v FROM glab.bs_data ORDER BY 1,2 ;And here is the example output glab_out.txt:
1000,1000,10002 1,69,0.71019060537219 1,577,0.747299919836223 1,627,0.252593372948468 1,753,0.120338548440486 1,768,0.34520703041926 1,826,0.756854422856122 1,838,0.827316934708506 1,936,0.342057122848928 2,323,0.090937509201467 ...Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix mmwrite('A', A); % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m
Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.
Next, Brian promised to teach me how to export data from GraphLab back to SQL.