Replacement for GMP: Update

Wed Jan 24 10:30:02 EST 2007

On Tue, 23 Jan 2007, John Meacham wrote:
> I think the benchmarks are flawed in an important way, I believe, (but
> am not positive) that ARPREC uses a special factorized form of
> representing numbers, which makes multiplicative type operations
> extremely fast, but simple things like addition/subtraction quite  
> slow.

Oh, don't worry.  I've given up on ARPREC for the simple reason that  
*every* operation on small-size operands is very slow to say nothing  
of the time it would take to initialise ARPREC every time.  ARPREC  
does store numbers using a representation similar to floating point,  
essentially an array of doubles where:
	(int)(array[0])	= array_size
	(int)(array[1])	= no_mantissa_words
		array[2]	= exponent
		array[ 3 .. ] = mantissa
I hope the tests aren't "flawed," in that the main purpose of  
performing those operations was to see the median-level performance  
(extreme performance being Integers with operands greater than,  
10,000 or 30,000 bits).  The crumby graphs do show that I made a low  
cutoff of 256 bits so the most common Integer-use ( < 128 bits, 4  
uint32_t) wasn't even tested.  I probably didn't clarify why I tested  
larger size operands.  GHC should have something that is comparable  
to GMP, and that means speed (and precision) for medium-large  
operands as well as small operands.

> you are only benchmarking multiplicative or similar routines, giving
> ARPREC a huge lead, when in practice it might end up being slowest, as
> addition/subtraction are extremely more common than multiplication.

In the common case it is slowest--but not by much.

> Also, how are you choosing the numbers to test with? it is possible  
> that
> some packages are using 'sparse' representations or other specialized
> forms if all your numbers happen to be powers of two or something.

Random numbers--a different number for each iteration in size.  I  
used a PRNG based on the SNOW2 stream cipher--something I wrote  
awhile ago, as fast as arc4random and tests well on DIEHARD and other  
statistical things.  The GMP and OpenSSL random number generators  
were slower and I wanted to use the same generator across libraries.

> also, pretty much all uses of integers will be for very small  
> integers,
> we should be careful to not lose sight of speed for the common case  
> due
> to pretty asymtotic bounds.

All numbers were the same size, so cases like multiplying a 1024-bit  
operand by a 256-bit operand weren't tested; that could be a real  
flaw.  It's all very sloppy--just to get an idea of where things  
generally line up.

So, where am I now?  I worked on the GHC-Win off and on and then went  
back to making a uniform API between GHC and the replacement, and I  
am re-writing the replacement.  I thought I would be done several  
weeks ago but of course little things take over...  One area of GHC I  
would really like to change is the memory-use.  ForeignPtr seems to  
work well but places the full burden of memory management on the  
Integer library; for SIMD-use (AltiVec, SSE2), the library would  
require GHC to allocate memory aligned to 16-bytes.  Right now, the  
only choice would be allocatePinned() (in rts/sm/Storage.c), which is  
4-byte aligned and it is, of course, pinned so the GC can't move it.   
Imagine the RTS-memory problems you could have if you wrote a Haskell  
program using lots of small Integers allocated with, say, an  
allocatePinned_16() (16-byte aligned).  The alternative would be to  
use normal memory but re-align the operands for the vectorized  
operations by peeling; o.k., doable (especially easy with AltiVec),  
but slower.  In the meantime I have been working through SSE2  
assembler, which doesn't have an addc (add-carry) operation and  
doesn't set a flag for overflow, so I have been experimenting with a  
SWAR-like algorithm--essentially the same thing as GMP's Nails--to  
make the normally 32-bit operands 31 bits with the last bit for a  
carry flag.  Thorkil Naur and others have suggested writing the whole  
thing as small assembler operations and piece them together in  
Haskell; I have been looking into that as well but it seems to entail  
inlining every Integer function--imagine the code bloat.

Cheers,
Pete