I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.
Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.
And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab; CREATE SCHEMA glab; -- Create a table to hold some bogus data DROP TABLE IF EXISTS glab.bs_data; CREATE TABLE glab.bs_data ( rw int DEFAULT NULL , cl int DEFAULT NULL , vl numeric DEFAULT NULL ) DISTRIBUTED BY (rw, cl) ; -- Now create some bogus data. I use a few postgres tricks. -- Ultimately, this just creates a relatively sparse "matrix" in tall format -- I'm trying to use best practices in formatting and stuff INSERT INTO glab.bs_data (rw, cl, vl) SELECT x , y , random() AS vl FROM ( SELECT CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y FROM generate_series(1, 1000) AS x , generate_series(1, 1000) AS y ) AS A WHERE 1=1 AND a.x IS NOT NULL AND a.y IS NOT NULL ;After the table is ready we can push it into a file:
-- So we have some data. Let's create an external table -- to push it OUT of the database -- You need to run gpfdist to write to an external table. -- In this example, from the directory /Users/gpadmin/glab/tables the command was -- > gpfdist DROP EXTERNAL TABLE IF EXISTS glab.ext_out; CREATE WRITABLE EXTERNAL TABLE glab.ext_out ( i bigint , j bigint , v numeric ) LOCATION ('gpfdist://macbuddha.local/glab_out.txt') FORMAT 'TEXT' (DELIMITER ',' NULL ' ') DISTRIBUTED BY (i, j) ; -- Right, now let's put the data in it. This is two queries unioned so -- we can create the first row. Best to do them separately. INSERT INTO glab.ext_out SELECT max(rw) AS i , max(cl) AS j , count(*) AS v FROM glab.bs_data; INSERT INTO glab.ext_out SELECT rw AS i , cl AS j , vl AS v FROM glab.bs_data ORDER BY 1,2 ;And here is the example output glab_out.txt:
1000,1000,10002 1,69,0.71019060537219 1,577,0.747299919836223 1,627,0.252593372948468 1,753,0.120338548440486 1,768,0.34520703041926 1,826,0.756854422856122 1,838,0.827316934708506 1,936,0.342057122848928 2,323,0.090937509201467 ...Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix mmwrite('A', A); % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m
Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.
Next, Brian promised to teach me how to export data from GraphLab back to SQL.
No comments:
Post a Comment