Showing posts with label SQL. Show all posts
Showing posts with label SQL. Show all posts

Friday, January 20, 2012

GraphLab input via Greenplum SQL



I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.

Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.

And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab;
CREATE SCHEMA glab;

-- Create a table to hold some bogus data
DROP TABLE IF EXISTS glab.bs_data;
CREATE TABLE glab.bs_data
(
  rw int DEFAULT NULL
, cl int DEFAULT NULL
, vl numeric DEFAULT NULL
) DISTRIBUTED BY (rw, cl)
;

-- Now create some bogus data.  I use a few postgres tricks.
-- Ultimately, this just creates a relatively sparse "matrix" in tall format
-- I'm trying to use best practices in formatting and stuff
INSERT INTO glab.bs_data (rw, cl, vl)
SELECT
  x
, y
, random() AS vl
FROM 
  (
  SELECT
    CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x
  , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y
  FROM
    generate_series(1, 1000) AS x
  , generate_series(1, 1000) AS y
  ) AS A
WHERE 1=1
  AND a.x IS NOT NULL 
  AND a.y IS NOT NULL 
;
After the table is ready we can push it into a file:
-- So we have some data.  Let's create an external table
-- to push it OUT of the database
-- You need to run gpfdist to write to an external table.
-- In this example, from the directory /Users/gpadmin/glab/tables the command was
-- > gpfdist 
DROP EXTERNAL TABLE IF EXISTS glab.ext_out;
CREATE WRITABLE EXTERNAL TABLE glab.ext_out (
  i bigint
, j bigint
, v numeric
) 
LOCATION ('gpfdist://macbuddha.local/glab_out.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL ' ')
DISTRIBUTED BY (i, j)
;

-- Right, now let's put the data in it.  This is two queries unioned so
-- we can create the first row.  Best to do them separately.
INSERT INTO glab.ext_out
SELECT
  max(rw) AS i
, max(cl) AS j
, count(*) AS v
FROM glab.bs_data;

INSERT INTO glab.ext_out
SELECT 
  rw AS i
, cl AS j
, vl AS v
FROM  glab.bs_data
ORDER BY 1,2
;
And here is the example output glab_out.txt:
1000,1000,10002
1,69,0.71019060537219
1,577,0.747299919836223
1,627,0.252593372948468
1,753,0.120338548440486
1,768,0.34520703041926
1,826,0.756854422856122
1,838,0.827316934708506
1,936,0.342057122848928
2,323,0.090937509201467
...
Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory
A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix
mmwrite('A', A);  % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m

Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.

Next, Brian promised to teach me how to export data from GraphLab back to SQL.