The current implementations assume that all data fit in GPU memory. How can truly huge graphs be handled? Try to implement a multi-GPU solution for graphs that cannot be stored entirely in the memory of a single device. Classical techniques for distributed multiplication of large matrices (e.g. blocked/tiled decomposition) should be applicable here.
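As a starting point, the classical blocked (tiled) scheme behind distributed matrix-multiplication algorithms such as SUMMA can be sketched on the CPU. The function name `blocked_matmul`, the tile size, and the use of NumPy as a stand-in for a GPU device are all assumptions for illustration; in a real multi-GPU implementation each tile product would be dispatched to a device, with only the tiles currently in use resident in device memory.

```python
import numpy as np

def blocked_matmul(a, b, tile=2):
    """Compute a @ b by streaming tile x tile blocks.

    Illustrative sketch: on a multi-GPU system each (i, j, k) block
    product would be staged onto a device and accumulated there; here
    NumPy on the host stands in for the device.
    """
    n, m = a.shape
    m2, p = b.shape
    assert m == m2, "inner dimensions must match"
    c = np.zeros((n, p), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, p, tile):
            for k in range(0, m, tile):
                # Load one tile of A and one tile of B, accumulate the
                # partial product into the corresponding tile of C.
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, k:k + tile] @ b[k:k + tile, j:j + tile]
                )
    return c

# Sanity check against the dense product.
a = np.arange(16.0).reshape(4, 4)
b = np.arange(16.0).reshape(4, 4)
assert np.allclose(blocked_matmul(a, b), a @ b)
```

Because each tile is touched independently, the same loop structure maps naturally onto multiple GPUs: tiles of the result matrix are assigned to devices, and the required input tiles are transferred on demand.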