[Haskell-cafe] Large graphs

Sun May 20 16:55:01 CEST 2012

2012/5/20 Benjamin Ylvisaker <benjaminy at fastmail.fm>:
> I have a problem that I'm trying to use Haskell for, and I think I'm running into scalability issues in FGL.  However, I am quite new to practical programming in Haskell, so it's possible that I have some other bone-headed performance bug in my code.  I tried looking around for concrete information about the scalability of Haskell's graph libraries, but didn't find much.  So here are the characteristics of the problem I'm working on:
>
> - Large directed graphs.  Mostly 10k-100k nodes, but some in the low 100ks.
> - Sparse graphs.  The number of edges is only 2-3x the number of nodes.
> - Immutable structure, mutable labels.  After initially reading in the graphs, their shape doesn't change, but information "flows" around the graph, changing the labels on nodes and edges.

I would like to suggest to you a representation based in 32-bit
integers as vertex index. I.e., "roll your own"

Use strict IntMap IntSet for neighbor information, it is very efficient.

> I wrote some code that reads in graphs and some some basic flow computations on them.  The first few graphs I tried were around 10k nodes, and the performance was okay (on the order of several seconds).  When I tried some larger graphs (~100k), the memory consumption spiked into multiple GB, the CPU utilization went down to single digit percentages and the overall running time was closer to hours than seconds.

Looks like your code does not force everything. It leaves some thunks
unevaluated, check for that situation.

It is common pitfall, not only for computations on graphs.

>
> Because the graph structure is basically immutable for my problem, I'm tempted to write my own graph representation based on mutable arrays.  Before I embark on that, I wonder if anyone else can share their experience with large graphs in Haskell?  Is there a library (FGL or otherwise) that should be able to scale up to the size of graph I'm interested in, if I write my code correctly?

The above structure (IntMap IntSet) allowed for fast computations on
relatively large arrays, in order of 1M vertices and 16M
undirected/32M directed edges.