Tuesday, November 29, 2011

Oracle Labs Report on GraphLab


Recently we had some interesting discussions with Eric Sedlar, Technical Director of Oracle Labs. Oracle examined GraphLab as a potential framework for implementing graph algorithms and assigned Sungpack Hong, a researcher at Oracle Labs, to write a tech report on the subject.

A few days ago I got a copy of this seven-page tech report, titled "A brief report on Graphlab and Green-Marl". I asked Eric and Sungpack for permission to discuss their main findings here. They may publish the full tech report at a later date.

According to Sungpack, there are four characterizing aspects of the algorithms that GraphLab supports:

  1. Algorithms can be described in a node-centric way; the same computation is repeatedly performed on every node.
  2. A significant amount of computation is performed on each node.
  3. The underlying graphs have large diameter.
  4. Determinism can be neglected.
Here is my reaction to those four points:
  1. Agreed.
  2. Agreed. 
  3. This is not necessarily correct - you can use GraphLab on any graph, as long as it is sparse.
  4. Actually, we support round-robin scheduling, so it is quite easy to have full sweeps over all nodes (see the sketch below). Many of the matrix factorization algorithms we implemented fall into this category - they are completely deterministic (aside from, perhaps, some random initialization of the starting state).
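To illustrate the point, here is a conceptual sketch in plain C++ (my own illustration, not GraphLab's actual API): a round-robin scheduler performs full sweeps over the nodes in a fixed order, so two runs from the same starting state produce identical results.

#include <cstdio>
#include <vector>

int main() {
  std::vector<double> node(5, 1.0);                // per-node state
  for (int sweep = 0; sweep < 3; ++sweep)          // round robin: full sweeps...
    for (std::size_t v = 0; v < node.size(); ++v)  // ...over nodes in a fixed order
      node[v] *= 0.5;                              // stand-in for an update function
  for (std::size_t v = 0; v < node.size(); ++v)
    std::printf("node %zu: %g\n", v, node[v]);     // identical output on every run
  return 0;
}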
Next, some issues are discussed:
  1. Programmability: the user must restructure the algorithm in a node-centric way.
  2. The runtime system adds overhead when the amount of computation performed at each node is small.
  3. Small-world graphs: GraphLab's locking scheme may suffer from frequent conflicts on such graphs.
  4. Determinism: the result of an algorithm computed in GraphLab may be non-deterministic.
  5. Concerns about distributed execution.
Here are my thoughts on those points:
  1. This is a correct concern - not every algorithm is suitable for programming in GraphLab.
  2. Agreed.
  3. Agreed. In GraphLab v2, Joey is heading a significant improvement that will allow factoring the update function into several different computations to deal with this issue.
  4. Objection! As shown, for example, by our participation in the ACM KDD CUP contest this year, we got very high quality prediction results that were completely deterministic. It is possible to implement a wide variety of deterministic algorithms in GraphLab.
  5. Agreed. Yucheng is heading the effort of completely rewriting the distributed engine, and we believe significant performance improvements are possible in the distributed setting.
Finally, the following conclusion is given:
... I believe GraphLab still is a valuable back-end for Green-Marl (DB: their programming language) .. it is very natural to use GraphLab as one of Green-Marl's back-ends....  GraphLab and Green-Marl can be helpful to each other... Finally it will be interesting to wait .. for GraphLab2. A fair comparison of distributed graph processing ... would be meaningful then.

Overall, I want to thank Eric and Sungpack for their great and serious effort in evaluating GraphLab! While I am of course biased, it is important to get unbiased evaluations by third parties.

Spotlight: Dmitriy Golovashkin - Oracle and the R project


A few days ago I got the following note from Dmitriy, a principal staff member at Oracle:

Hi Danny,

I found out about GraphLab just two days ago.

I was working on a MapReduce-based QR factorization, and while searching the web for references I found your blog & GraphLab. No question, I am planning to learn more about the project. Looks very exciting!

In general, our group is focusing on in-database and Hadoop-based
data mining and statistical algorithm development.
R http://www.r-project.org/ is a big part of it.

Kind regards,
DG

As always, I am absolutely thrilled to get my readers' feedback! I asked Dmitriy if he could share some more insight about the R project, and here is what he wrote:

R is huge in the data mining and statistical camps.
The number of contributed packages is staggering; it is amongst the most complete and feature-rich environments for statistical and data mining computing.
Another very important observation concerns the quality of some of the contributed packages: outstanding work & implementation.

The biggest problem with R has to do with its inherent data storage model: everything must be stored in memory and most algorithms are sequential.
For instance, the notion of a matrix in R is captured in the following C one-liner:
double* a = (double*) malloc(sizeof(double) * nElements);

It is possible to build R with vendor-supplied matrix packages (BLAS and LAPACK) and thus have multithreaded matrix computations in R (which helps a lot).
However, if the input does not fit into memory, it is problematic to run even the simplest algorithms.

We enable R folks to carry out computations directly on the database data (no need to move the data out). The in-memory limitation has been lifted for some algorithms (not all of course).

More is here
http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html

Kind regards,
DG

We definitely agree that R is a very useful statistical package. In fact, one of our users, Steve Lianoglou from Weill Cornell Medical College, ported our Shotgun solver package to R. Here is an excerpt from Steve's homepage which summarizes his ambivalent relationship with R:

I have a love/hate relationship with R. I appreciate its power, but at times I feel like I'm "all thumbs" trying to use it to bend the computer to my will (update: I actually don't feel like this anymore. In fact I'm quite comfortable in R now, but try not to get too frustrated if you're not yet ... it only took me about 6 months or so!).
If that feeling sounds familiar to you, these references might be useful.
  • The R Inferno [PDF]. "If you are using R and you think you're in hell, this is a map for you." I stumbled on this document after using R for about 8 months or so, and I could still sympathize with that statement.
  • An R & Bioconductor Manual by Thomas Girke




Anyway, if you are a reader of this blog or a GraphLab user - send me a note!

Tuesday, November 22, 2011

Installing GraphLab on Fedora Core 16

FOR LAZY USERS: You can use Amazon EC2 image: ami-1510d87c
As instructed here.

0) Log into your FC 16 machine.
On Amazon EC2, you can launch instance ami-0316d86a

1) Installing libboost
yum install boost boost-devel

2) Installing itpp
Note: For FC 15 and earlier it is easy to set up itpp; see the packages at http://koji.fedoraproject.org/koji/packageinfo?packageID=2142 and the explanation at http://sourceforge.net/apps/wordpress/itpp/download/
Using the command:
yum install itpp itpp-devel itpp-doc

Here I will explain how to install itpp manually, for FC 16

yum install lapack autoconf autogen libtool
ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
ln -s /usr/lib64/libblas.so.3 /usr/lib64/libblas.so

yum groupinstall "Development Tools"
wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
tar xvzf itpp-4.2.tar.gz
cd itpp-4.2
./autogen.sh 
./configure --enable-debug --without-fft --with-blas=/usr/lib64/libblas.so.3  --with-lapack=/usr/lib64/liblapack.so.3 CFLAGS=-fPIC CXXFLAGS=-fPIC  CPPFLAGS=-fPIC
make && make install
Note: future versions of lapack may have different library names; check your /usr/lib64
directory for the correct lapack/blas library names.

3) Install Mercurial
yum install mercurial

4) Install cmake
yum install cmake

5a) Install GraphLab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.
Copy the command string "hg clone..." and execute it in your shell.


or 5b) Install GraphLab from a tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --itpp_include_dir=/usr/local/include/  --itpp_dynamic_link_dir=/usr/local/lib/libitpp.so  --itpp_lapack_dir=/usr/lib64/
cd release/
make -j4

7) Test GraphLab
cd tests
export LD_LIBRARY_PATH=/usr/local/lib/
./runtests.sh

8) Optional (but recommended): install Octave.
yum install octave

Sunday, November 20, 2011

Installing GraphLab on RedHat Enterprise Server 6.1

0) Log into your Red Hat Enterprise 6.1 machine (64-bit).
On Amazon EC2, you can launch instance ami-31d41658

1) Installing libboost
yum install boost boost-devel
yum groupinstall "Development Tools"

2) Install Mercurial
yum install mercurial

3) Install cmake
yum install cmake

4a) Install GraphLab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.
Copy the command string "hg clone..." and execute it in your shell.


or 4b) Install GraphLab from a tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

5) configure and compile
cd graphlabapi
./configure 
cd release/
make -j4


Saturday, November 19, 2011

Installing GraphLab on Gentoo Linux



For Lazy Users: simply use  ami-d9e820b0  Amazon EC2 image. (More detailed instructions)


Detailed instructions.
Log into your Gentoo Linux machine (I used the preinstalled Amazon EC2 image AMI-e234c78b).

1) Install itpp using:
emerge lapack
USE="blas lapack debug doc" emerge itpp

2) Install boost using:
emerge boost

3) Install mercurial using:
emerge mercurial

4) Install cmake from sources:
Note: I was not able to execute "emerge cmake" because of some versioning problem in
the Gentoo repository, so I had to install it manually from sources as follows:
wget http://www.cmake.org/files/v2.8/cmake-2.8.6.tar.gz
tar xvzf cmake-2.8.6.tar.gz
cd cmake-2.8.6
./configure
make
make install

5a) Install GraphLab from mercurial
Go to the GraphLab download page and follow the download link to the mercurial repository.

Copy the command string "hg clone..." and execute it in your Gentoo shell.

or 5b) Install GraphLab from a tgz file
Go to the GraphLab download page and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --yes
cd release/
make -j4

7) Test GraphLab
cd tests
./runtests.sh

8) Optional: install Octave.
If you don't have access to Matlab, Octave is an open source replacement. Octave is useful for preparing input formats for GraphLab's collaborative filtering library and for reading the output.
You can install Octave using the command:
emerge octave

Let me know how it went!

Thursday, November 17, 2011

Spotlight: Tianqi Chen - SVDFeature collaborative filtering project


Here is an interview with Tianqi Chen, who took third place in ACM KDD CUP track 1 this year. Since he finished two places ahead of us (the Lebusishu team), I definitely appreciate his achievement! Tianqi is one of the main contributors to the SVDFeature project. In a nutshell, SVDFeature is an efficient C++ implementation of stochastic gradient descent that allows additional feature tuning by the user. It can scale to large models since it keeps only part of the problem in memory. However, it is not fully parallel yet.

1) First, I want to congratulate you on your great achievement in the KDD CUP this year. In retrospect, can you explain which crucial features of the SVDFeature project made it so successful?

One of the key factors in winning was the ensemble predictor, of course; our teammates at HKUST did a great job on that. On the other hand, SVDFeature creates the chance to build a powerful single predictor that takes a lot of information into account; we reported the best single-method result (RMSE=22.12 without blending) in track 1 using SVDFeature. The most interesting thing about SVDFeature is that the user can build variants of the matrix factorization model by defining features. In our final predictor for track 1, we incorporated user temporal information, implicit feedback, item neighborhood information, taxonomy information, etc. Many famous models, such as time-SVD++ and SVD with neighborhood information, can also be implemented by defining user temporal features and neighborhood features. Another important thing about SVDFeature is that we don't try to load all the training data into memory, which suits large-scale CF data such as the KDD Cup datasets (of course, we still need memory for storing the model). For example, we can use less than 2GB of memory to get comparable performance on KDD Cup track 1 and track 2.

2) What is the project focus?

We really want to make the project a generic tool for developing informative collaborative filtering and ranking algorithms. By informative, we mean building matrix factorization algorithms that take the available information into account (taxonomy information, user temporal information, neighborhood, social network, users' previous click or browsing history, etc.). A similar context has long existed in traditional classification tools such as LIBSVM, SVMLight, and SVMRank: these tools take in feature vectors and make predictions using the information given in the feature vector. We want SVDFeature to be the counterpart of those feature-based predictors in collaborative filtering and ranking. Users can define features that describe the user (e.g., the user's click and browsing history), the item (e.g., taxonomy information), and global effects (e.g., neighborhood effects) to build genuinely new variants of matrix factorization. In that sense, SVDFeature is suitable for those who want to develop new recommendation algorithms via feature definition. It's also suitable for building context-aware recommendation algorithms.

3) Who are the contributors?

There are four authors on the project page; I am the main author of the project. We do welcome comments and feedback on the project.

4) What is the open source model?

The project is released under the Apache License, Version 2.0. The toolkit doesn't depend on extra libraries and compiles under both Linux (g++) and Windows (MSVC).

5) What are the advantages of using the project?

(1) Implement existing variants of matrix factorization (time-SVD++, neighborhood SVD) and composites of them.

(2) Build new CF algorithms (for ranking or for rating prediction) that make use of extra information by defining new features.

(3) Work with a team that builds feature components independently and merges them to see the result. Since the input only involves different kinds of features, we can have one person working on user temporal features using Python and another generating item neighborhood features using C++. The generated features can be used to build independent models or merged to build a hybrid predictor.

(4) Handle large-scale data: the toolkit works on existing large-scale datasets such as KDD Cup and Netflix with frugal memory usage and good speed. For features built for basic matrix factorization on KDD Cup track 1, here are some statistics for one pass over the dataset on an Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz: 274 sec (64 factors) and 357 sec (128 factors), with memory cost within 2GB.

6) Are there some algorithms which will be hard to express using your framework?

Yes, to be honest, SVDFeature currently supports the feature-based matrix factorization model, which involves user, item, and global features. That does cover a large range of existing algorithms. It currently doesn't cover models that involve multiple independent factorization parts - for example, tensor time bias, which has a time vs. user/item factor in addition to the user vs. item factor. It's not hard to add tweaks to the code for some specific features, and we do welcome requests for specific features. On the other hand, we want to think about a more abstract model that allows configuring multiple factorization parts; that's a future direction for the toolkit.

7) Can you point to some tutorials for people who want to try it out? Conceptually, what is needed from the users to be defined?

We have a detailed user reference manual and a technical report about the model, both of which come with the project archive. The user needs to define a feature file, similar to the common SVM format, and provide a configuration file that specifies parameters such as the number of latent factors, the learning rate, etc. A demo folder in the toolkit archive contains simple example scripts for the MovieLens data, and we have a non-trivial example that shows how to run experiments on the KDD Cup track 2 dataset. We also welcome questions from users.

8) Is the implementation parallel?

The training algorithm is not parallel, but we do create an independent thread to fetch the input data to ensure speed, and we also use SSE2 to optimize some vector operations.
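To give a flavor of the kind of SSE2 optimization Tianqi mentions (a minimal sketch of my own, not SVDFeature's actual code), here is a dot product over double-precision factor vectors, processed two elements per register:

#include <emmintrin.h>  // SSE2 intrinsics

// Dot product of two double arrays; the main loop handles two elements at a time.
double dot_sse2(const double* a, const double* b, int n) {
  __m128d acc = _mm_setzero_pd();
  int i = 0;
  for (; i + 2 <= n; i += 2)
    acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i)));
  double partial[2];
  _mm_storeu_pd(partial, acc);
  double sum = partial[0] + partial[1];
  for (; i < n; ++i)  // leftover element when n is odd
    sum += a[i] * b[i];
  return sum;
}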

9) Can you give some performance numbers for kdd cup data?

For KDD Cup track 1, SVDFeature can build a model with RMSE 22.1x; we also ran an after-the-fact experiment on KDD Cup track 2 (see all the scripts and codes) with error 3.1x. The performance of SVDFeature really depends on the user's skill in feature definition; KDD Cup track 2 is a good example of the performance of different feature compositions.

10) What is the difference between SVDFeature and libFM?

(1) Yes, it's closely related to FM; we respect the work of libFM and point it out in the related work section of our tech report.
(2) The difference in the model: feature-based matrix factorization distinguishes different types of features. It has factorization interactions between user features and item features, and a linear-only effect for global features, whereas FM is more like a tensor-factorization-style model that enforces factorization interactions between every pair of features. Distinguishing feature types allows linear-only features such as neighborhood information, and ensures that user temporal features and user implicit feedback have factorization interactions only with item features (see the sketch after this list).
(3) Differences in the toolkit: SVDFeature supports both rating prediction and feature-based collaborative ranking. It also has a fast algorithm for training with implicit feedback, large-scale data handling (very important if you define many features), and other minor extras, e.g., detailed options for parameter tuning and various loss functions (hinge loss, logistic loss, square loss).
(4) Some examples that may illustrate these differences: try to implement the SVD++ model or SVD with neighborhood information, or see our non-trivial experiment on KDD Cup track 2 (http://apex.sjtu.edu.cn/apex_wiki/kddtrack2).
We call our model feature-based matrix factorization because the idea descends more naturally from matrix factorization using features. We do not claim to have proposed a brand-new feature-based model; the credit goes to FM. But we do try to make a model and a toolkit that are useful for informative collaborative filtering and ranking, and SVDFeature at least works better in some scenarios :)
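For readers who want the two models side by side, here is a rough sketch in my own notation (distilled from the answers above; consult the SVDFeature tech report and the libFM paper for the exact definitions). Feature-based matrix factorization splits the input into global features $\gamma$, user features $\alpha$, and item features $\beta$; global features enter only linearly, while the user-side and item-side factors interact:

$$\hat{y} = \mu + \sum_j b^{(g)}_j \gamma_j + \sum_j b^{(u)}_j \alpha_j + \sum_j b^{(i)}_j \beta_j + \Big(\sum_j \alpha_j \mathbf{p}_j\Big)^{\top}\Big(\sum_j \beta_j \mathbf{q}_j\Big)$$

A factorization machine, by contrast, puts all features into a single vector $x$ and factorizes every pairwise interaction:

$$\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$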

News from 29 Nov, 2011:
Tianqi published a ready-to-run experiment on KDD Cup track 1 at http://apex.sjtu.edu.cn/apex_wiki/kddtrack1

Wednesday, November 16, 2011

GAMP: Generalized Approximate Message Passing in GraphLab

About a week ago I was invited to visit Ohio State University by Phil Schniter. I heard about a lot of interesting research activities, which I will probably report on in this blog in the future. Specifically, Phil is working on inference in linear models closely related to my own work.

I heard from Phil about the GAMP method, an extension of Montanari's AMP (approximate message passing) method. GAMP is supposed to overcome some of the difficulties in AMP, namely the Gaussian matrix assumption. Here is an excerpt from the GAMP wiki:
Generalized Approximate Message Passing (GAMP) is an approximate, but computationally efficient method for estimation problems with linear mixing. In the linear mixing problem an unknown vector, x, with independent components, is first passed through a linear transform z=Ax and then observed through a general probabilistic, componentwise measurement channel to yield a measurement vector y. The problem is to estimate x and z from y and A. This problem arises in a range of applications including compressed sensing. Optimal solutions to linear mixing estimation problems are, in general, computationally intractable, as the complexity of most brute force algorithms grows exponentially in the dimension of the vector x. GAMP approximately performs the estimation through a Gaussian approximation of loopy belief propagation that reduces the vector-valued estimation problem to a sequence of scalar estimation problems on the components of the vectors x and z.
This project is intended to develop the theory and applications of GAMP. We also provide open source MATLAB software for others who would like to experiment with the algorithm.
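In symbols (my notation, just summarizing the excerpt): an unknown vector $x \in \mathbb{R}^N$ with independent components is passed through the linear transform

$$z = Ax, \qquad y_m \sim p(y_m \mid z_m), \quad m = 1, \dots, M,$$

and the task is to estimate $x$ and $z$ from the observed $y$ and the known matrix $A$. GAMP reduces this vector-valued problem to a sequence of scalar estimation problems on the components of $x$ and $z$.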
Below is a simplified version of the GAMP MATLAB code I got from Phil. It is a highly simplified instance of the GAMPmatlab code, which includes a library of signal priors and matrix constructions, as well as stepsize control (to improve stability when the matrix has highly correlated columns). Automatic tuning of GAMPmatlab parameters is provided by the EM-BG-GAMP code.

% this m-file uses GAMP to estimate the matrix X (of dimension NxT) from the noisy matrix observation Y=A*X+W of dimension MxT.  here, A has independent Gaussian entries with matrix mean and variance (A_mean,A_var), W has iid Gaussian entries with scalar mean and variance (0,psi), and X has iid Bernoulli-Gaussian entries with scalar activity rate "lam" and with non-zero elements having scalar mean and variance (theta,phi).

 % declare signal parameters
 N = 1000;      % # weights     [dflt=1000]
 M = 400;   % # observations    [dflt=400]
 T = 10;   % # dimensions per observation   [dflt=10]
 lam = 0.01*ones(N,T);  % BG: prior probability of a non-zero weight [dflt=0.01]
 theta = 0*ones(N,T);  % BG: prior mean of non-zero weights  [dflt=0]
 phi = 1*ones(N,T);   % BG: prior variance of non-zero weights [dflt=1]
 SNRdB = 15;   % SNR in dB      [dflt=15]

 % declare algorithmic parameters
 iterMax = 50;   % maximum number of AMP iterations  [dflt=20]

 % generate true Bernoulli-Gaussian signal (i.e., "weights")
 B_true = (rand(N,T)>1-lam);   % sparsity pattern
 X_true = B_true.*(theta+sqrt(phi).*randn(N,T));% sparse weights

 % generate measurement (i.e., "feature") matrix
 A_mean = randn(M,N)/sqrt(M);       % element-wise mean 
 A_mean2 = A_mean.^2;
 A_var = abs(A_mean).^2;   % element-wise variance
 A_true = A_mean + sqrt(A_var).*randn(M,N); % realization

 % generate noisy observations
 psi = norm(A_true*X_true,'fro')^2/(M*T)*10^(-SNRdB/10);   % additive noise variance 
 Y = A_true*X_true + sqrt(psi).*randn(M,T);  % AWGN-corrupted observation

 % initialize GAMP
 X_mean = lam.*theta;  % initialize posterior means
 X_var = lam.*phi;  % initialize posterior variances
 S_mean = zeros(M,T);  % initialized scaled filtered gradient
 NMSEdB_ = NaN*ones(1,iterMax); % for debugging: initialize NMSE-versus-iteration

 % iterate GAMP
 for iter=1:iterMax,

   % computations at observation nodes
   Z_var = A_mean2*X_var;
   Z_mean = A_mean*X_mean;
   P_var = Z_var + A_var*(X_var + X_mean.^2);
   P_mean = Z_mean - Z_var.*S_mean;
   S_var = 1./(P_var+psi);
   S_mean = S_var.*(Y-P_mean);
   % computations at variable nodes
   R_var = 1./(A_mean2'*S_var + eps);
   R_mean = X_mean + R_var.*(A_mean'*S_mean);
   gam = (R_var.*theta + phi.*R_mean)./(R_var+phi);
   nu = phi.*R_var./(R_var+phi);
   pi = 1./(1+(1-lam)./lam .*sqrt((R_var+phi)./R_var) .* exp( ...
    -(R_mean.^2)./(2*R_var) + ((R_mean-theta).^2)./(2*(phi+R_var)) ));
   X_mean = pi.*gam;
   X_var = pi.*(nu+gam.^2) - (pi.*gam).^2;

   % for debugging: record normalized mean-squared error 
   NMSEdB_(iter) = 20*log10(norm(X_true-X_mean,'fro')/norm(X_true,'fro'));
 end;% iter

 % for debugging: plot posterior means, variances, and NMSE-versus-iteration
 subplot(311)
   errorbar(X_mean(:,1),sqrt(X_var(:,1)),'.');
   hold on; plot(X_true(:,1),'k.'); hold off;
   axis([0,size(X_true,1),-1.2*max(abs(X_true(:,1))),1.2*max(abs(X_true(:,1)))])
   ylabel('mean')
   grid on;
   title('GAMP estimates')
 subplot(312)
   plot(10*log10(X_var(:,1)),'.');
   axe=axis; axis([0,N,min(axe(3),-1),0]);
   ylabel('variance [dB]');
   grid on;
 subplot(313)
   plot(NMSEdB_,'.-')
   ylabel('NMSE [dB]')
   grid on;
   title(['NMSE=',num2str(NMSEdB_(iter),3),'dB '])
 drawnow;

Today I implemented GAMP in GraphLab. The parallel implementation is pretty straightforward. The main overhead is the two dense matrix products: with A (the Z_var/Z_mean lines in the listing above) and with A' (the R_var/R_mean lines). Each of the variable and observation nodes is a GraphLab node. There are two update functions, one for variable nodes and one for observation nodes. The code can be found in the GraphLab linear solvers library.

Here is a snapshot from the implementation of the variable node update function:
void gamp__update_function(gl_types_gamp::iscope &scope,
         gl_types_gamp::icallback &scheduler) {

  /* GET current vertex data */
  vertex_data_gamp& vdata = scope.vertex_data();
  unsigned int id = scope.vertex();
  gamp_struct var(vdata, id);

  
  /*
   * computations at variable nodes
   */
   memset(var.R_var, 0, config.D*sizeof(sdouble));
   memset(var.R_mean, 0, config.D*sizeof(sdouble));
   for (unsigned int k=ps.m; k< ps.m+ps.n; k++){
      vertex_data_gamp  & other= g_gamp->vertex_data(k);
      gamp_struct obs(other,k);

      for (int i=0; i< config.D; i++){
        var.R_var[i] += pow(var.AT_i[k-ps.m],2) * obs.S_var[i];
        var.R_mean[i] += var.AT_i[k-ps.m] * obs.S_mean[i];
      }
  }

  for (int i=0; i< config.D; i++){
     var.R_var[i] = 1.0 / (var.R_var[i] + ps.gamp_eps);
     var.R_mean[i] = var.X_mean[i] + var.R_var[i] * var.R_mean[i];
     var.gam[i] = (var.R_var[i]*ps.gamp_theta + ps.gamp_phi*var.R_mean[i]) / (var.R_var[i] + ps.gamp_phi);
     var.nu[i] = ps.gamp_phi*var.R_var[i] / (var.R_var[i] + ps.gamp_phi);
     var.pi[i] = 1.0 / (1.0 + (1.0 - ps.gamp_lam) / ps.gamp_lam * sqrt((var.R_var[i]+ps.gamp_phi)/var.R_var[i]) * exp( \
      -(powf(var.R_mean[i],2)) / (2*var.R_var[i]) + (powf(var.R_mean[i] - ps.gamp_theta,2)) / (2*(ps.gamp_phi+var.R_var[i])) ));
     var.X_mean[i] = var.pi[i] * var.gam[i];
     var.X_var[i] = var.pi[i] * (var.nu[i]+ pow(var.gam[i],2)) - pow(var.pi[i] * var.gam[i],2);
  }
}
The code is slightly longer than the MATLAB version, and looks somewhat uglier, since the vectorized loop operations have to be expanded explicitly. I am not yet happy with the performance, since MATLAB is highly optimized for products of dense matrices. I still need to think about how to improve this - maybe by blocking and by calling optimized LAPACK/BLAS routines...
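As a first experiment in that direction, here is a minimal sketch (my own, assuming a row-major dense layout; it is not part of the GraphLab code) of delegating the dense product Z_var = A_mean2 * X_var to a BLAS dgemm call, which performs the cache blocking internally:

#include <cblas.h>

// Z_var (M x T) = A_mean2 (M x N) * X_var (N x T); all arrays dense, row-major.
void observation_product(int M, int N, int T,
                         const double* A_mean2, const double* X_var,
                         double* Z_var) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              M, T, N,
              1.0, A_mean2, N,  // alpha, A, lda
              X_var, T,         // B, ldb
              0.0, Z_var, T);   // beta, C, ldc
}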