[Haskell-cafe] External Sort: Sort a 10-million integer file with just 256M of ram.

Thomas Hartman tphyahoo at gmail.com
Thu Oct 23 12:41:04 EDT 2008


Now on hackage: cabal install external-sort

demo (included in distribution):

thartman at thartman-laptop:~/external-sort>cat demo.hs
{-# LANGUAGE PatternSignatures #-}
import Algorithms.ExternalSort
import Data.List
import System.IO
import System.Environment (getArgs)
import System.Time
import HSH


-- to do: compare speed against unix sort util on a 10 million line file.

-- pure in-memory prelude sort will crash your computer when the list
gets over a million elements or so
-- externalsort caches the sublists used in the sort algorithm on your
hard drive, so you can sort a much larger list.
{-

The behavior below was on a demo executable, compiled. (In ghci, even
last on a 10 million element list
caused an out of memory error.) The test computer had 256M physical
ram and was ulimited to 256M cache.

*Main>:! ulimit -v
262144

For 10 million element list:
*Main> :! time ./demo preludesort 7
demo: out of memory (requested 1048576 bytes)
Command exited with non-zero status 1
4.88user 0.68system 0:21.11elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+64817minor)pagefaults 0swaps
*Main> :! time ./demo externalsort 7
10000000
73.87user 1.96system 1:24.25elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
792inputs+156280outputs (6major+16739minor)pagefaults 0swaps
./demo unixsort 7
wrote bigfile, time: Mon Oct 20 15:25:26 CEST 2008
demo: out of memory (requested 1048576 bytes)

For 100 million element list, external sort failed. Can it be made to
work? maybe in some future version.
time ./demo externalsort 8
demo: out of memory (requested 1048576 bytes)

real    10m14.061s
user    8m26.712s
sys     0m11.793s
thartman at thartman-laptop:~/external-sort>ls -lh ExternalSort.bin
-rw-r--r-- 1 thartman thartman 764M Oct 20 15:50 ExternalSort.bin

The problem is not fitting a 10^8 element list in memory, the
following works fine
 (when compiled, though not in ghci):
t = putStrLn . show . last $ [1..10^8::Int]

Maybe think about this more later.

-}

main = do
  [s,e] <- getArgs
  let exp = read e
  case s of
    "preludesort" -> sortwith exp $ return . sort
    "externalsort" -> sortwith exp externalSort
    "unixsort" -> unixsort exp
    _ -> let msg = "usage: ./demo preludesort 7 or ./demo externalsort
7 or ./demo unixsort 7 \
                    \(sort 10 million element list)"
         in  fail msg

sortwith exp s = putStrLn =<< return . show . last =<< s ([1..10^exp ::Int])

unixsort exp = do
  let fn = "bigfile"
  withFile fn AppendMode (\h -> (mapM_ (hPutStrLn h . show)
([1..10^exp::Int]) ) )
  putStrLn . ( ("wrote " ++ fn ++ ", time: ") ++ ) . show =<< getClockTime
  run $ "time tail -n1 | sort " ++fn :: IO String
  return ()

t = putStrLn . show . last $ [1..10^8::Int]


More information about the Haskell-Cafe mailing list