[Haskell-cafe] Re: Mining Twitter data in Haskell and Clojure

braver deliverable at gmail.com
Thu Jun 17 01:23:09 EDT 2010


With @dafis's help, there's a version tagged cafe3 on the master
branch which performs better with ByteString.  I also went ahead
and interned each ByteString as an Int, converting the structure to
IntMap everywhere.  That's reflected on the new "intern" branch at
tag cafe4.
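
For readers unfamiliar with interning, here is a minimal sketch of the idea. It is not the code at cafe4; the names Intern, intern and countMentions are made up for illustration, and it assumes a containers version that ships Data.Map.Strict and Data.IntMap.Strict. Each distinct ByteString gets a small Int the first time it is seen, and all further bookkeeping is keyed by that Int in an IntMap.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.ByteString.Char8 as B
import qualified Data.IntMap.Strict as IM
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- | Hypothetical intern table: known names with their ids,
-- plus the next unused id.
data Intern = Intern
  { toId   :: !(M.Map B.ByteString Int)
  , nextId :: !Int
  }

emptyIntern :: Intern
emptyIntern = Intern M.empty 0

-- | Return the id for a name, allocating a fresh id on first sight.
intern :: Intern -> B.ByteString -> (Intern, Int)
intern tbl@(Intern m n) name =
  case M.lookup name m of
    Just i  -> (tbl, i)
    Nothing -> (Intern (M.insert name n m) (n + 1), n)

-- | Illustrative workload: count occurrences per name, keyed by
-- interned id rather than by the ByteString itself.
countMentions :: [B.ByteString] -> (Intern, IM.IntMap Int)
countMentions = foldl' step (emptyIntern, IM.empty)
  where
    -- Bang patterns force both accumulators at every step.
    step (!tbl, !counts) name =
      let (tbl', i) = intern tbl name
      in  (tbl', IM.insertWith (+) i 1 counts)

main :: IO ()
main = do
  let names = map B.pack ["alice", "bob", "alice", "carol", "alice"]
      (tbl, counts) = countMentions names
  print (M.toList (toId tbl))   -- name -> id assignments
  print (IM.toList counts)      -- id -> occurrence count
```

The one Map ByteString lookup per name remains, but every structure built afterwards compares machine Ints instead of byte strings, which is where the IntMap conversion is expected to pay off.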

Still it can't do the full 35 days for all users.  It comes close,
however, to 30 days under ghc 6.12 with the IntMap -- just where 6.10
was with Map ByteString.  Some profiling output is in the prof/
subdirectory, with the tag that produced it and the RTS profiling
option in the file name; .prof files are from -P, and the rest are -hX.

When I downsize the sample data to 1 million users, the whole run,
with -P profiling, is done in 7.5 minutes.  Something happens when
tripling that amount.  For instance, -A10G may cause a segfault:
after a fast run up to 10 days, the program seems to stall, then
dumps days up to 28 before segfaulting.  -A5G comes closest, to 30
days, when coupled with -H1G.  It's not clear to me how to use -A and
-H together.

I'll work with Simon to investigate the runtime, but would welcome any
ideas on further speeding up cafe4.

-- Alexy

