Small update:<div><br></div><div>I got the first results from the hardware accelerated version on a 3.33 ghz westmere machine. Right now it does twice as well as the Gladman software version, which is also twice as well as the System.Random stdgen, and 1000 times faster than a the Haskell implementation of AES that I got from the Crypto package:<br>
</div><div><br></div><div><div> How many random numbers can we generate in a second on one thread?</div>
<div> Cost of rdtsc (ffi call): 84</div><div> Approx getCPUTime calls per second: 209,798</div><div> Approx clock frequency: 3,331,093,772</div><div> First, timing with System.Random interface:</div>
<div><span style="white-space: pre-wrap;">        </span> 76,811,104 random ints generated [constant zero gen] </div><div><span style="white-space: pre-wrap;">        </span> 14,482,725 random ints generated [System.Random stdGen] </div>
<div><span style="white-space: pre-wrap;">        </span> 16,061 random ints generated [PureHaskell/reference] </div><div><span style="white-space: pre-wrap;">        </span> 32,309 random ints generated [PureHaskell] </div>
<div><span style="white-space: pre-wrap;">        </span> 2,401,893 random ints generated [Gladman inefficient] </div><div><span style="white-space: pre-wrap;">        </span> 15,980,625 random ints generated [Gladman] </div>
<div><span style="white-space: pre-wrap;">        </span> 2,329,500 random ints generated [IntelAES inefficient] </div><div><span style="white-space: pre-wrap;">        </span> 32,383,799 random ints generated [IntelAES] </div>
<div> Comparison to C's rand():</div><div><span style="white-space: pre-wrap;">        </span> 71,392,408 random ints generated [rand/store in C loop] </div>
<div><span style="white-space: pre-wrap;">        </span> 71,347,778 random ints generated [rand in Haskell loop] </div><div><span style="white-space: pre-wrap;">        </span> 71,324,158 random ints generated [rand/store in Haskell loop] </div>
<div><br></div></div><div>This is what Burton Smith originally thought, that AES based RNG would be pretty fast and even faster with hardware acceleration.<br><br>-Ryan<br><br><div><br><br><div class="gmail_quote">
On Mon, Jan 31, 2011 at 1:25 AM, Ryan Newton <span dir="ltr"><<a href="mailto:newton@mit.edu" target="_blank">newton@mit.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Hi Cafe,<br><br>I've included Gladman's efficient, portable C implementation of AES generating random numbers now, and the hardware-accelerated version is working too. I'm currently seeing higher throughput for even the software based version than the builtin stdGen:<div>
<br>
<br><span style="font-family: courier new,monospace;"> First, timing with System.Random interface:</span><span style="font-family: courier new,monospace;"></span><br style="font-family: courier new,monospace;"></div><span style="font-family: courier new,monospace;"> 13,051,964 random ints generated [System.Random stdGen] ~ 252 cycles/int</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 15,635 random ints generated [PureHaskell/reference] ~ 210,763 cycles/int</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 31,159 random ints generated [PureHaskell] ~ 105,757 cycles/int</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 2,180,488 random ints generated [Gladman inefficient] ~ 1,511 cycles/int</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 15,015,095 random ints generated [Gladman] ~ 219 cycles/int</span><br style="font-family: courier new,monospace;">
<br>That seems like a good argument for cryptographic RNGs to me!<br><br>I'm having a lot of trouble getting cabal to build/install it successfully. You can see what I've got there now. I'd be interested to know if anyone else can build it successfully. It should work -- but only by building the assembly code into a .so and assuming the build directory is /opt/intel-aes ;-).<br>
<br>I don't have real numbers for the hardware version yet because the Westmere machine I'm logged into is redhat 5.4 and is giving me "GLIBC_2.7 not found" errors. You can run it for correctness purposes using an emulation tool called sde <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator/" target="_blank">(software development emulator)</a> that's based on dynamic binary translation.<br>
<font color="#888888">
<br>-Ryan<br></font><br>P.S. Checkout command:<br><span style="font-family: courier new,monospace;">git clone git://<a href="http://github.com/rrnewton/intel-aes.git" target="_blank">github.com/rrnewton/intel-aes.git</a></span><div>
<div></div><div><br><br><br><br><br>
<div class="gmail_quote">On Sat, Jan 29, 2011 at 8:52 AM, Ryan Newton <span dir="ltr"><<a href="mailto:rrnewton@gmail.com" target="_blank">rrnewton@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
perhaps performance? Is this approach less robust with a faster,<br>
non-cryptographic RNG?<br></blockquote></div><div><br>Yes, I don't understand that either. Is there a reason that using a weaker PRNG in this case is WORSE than using it in the non-splitting case? Is that why there is more of an impetus to use the cryptographic approach in this case?<br>
<br>Anyway, taking for granted that the Burton approach is a useful thing to have implemented, I started developing a package for this stuff -- AES based RNG including both a haskell implementation and wrapping an AESNI-based C one . I haven't posted it to Hackage yet, but you can find the git repository here:<br>
<br> <a href="https://github.com/rrnewton/intel-aes" target="_blank">https://github.com/rrnewton/intel-aes</a><br><br>If you build with cabal and run the benchmark-intel-aes-rng executable, it will give you a breakdown like this:<br>
<br>
<span style="font-family: courier new,monospace;"> How many random numbers can we generate in a second on one thread?</span><br> <span style="font-family: courier new,monospace;"> Cost of rdtsc (ffi call): 83</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> Approx getCPUTime calls per second: 205,640</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> Approx clock frequency: 3,306,891,339</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> First, timing with System.Random interface:</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 193,178,901 random ints generated [constant zero gen] </span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 14,530,358 random ints generated [System.Random stdGen] </span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 16,346 random ints generated [BurtonGenSlow/reference] </span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 32,965 random ints generated [BurtonGenSlow] </span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> Comparison to C's rand():</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 118,766,285 random ints generated [rand/store in C loop]</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 114,668,028 random ints generated [rand / Haskell loop]</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 114,675,116 random ints generated [rand/store Haskell] </span><br><br>At the moment this is Haskell-only, I haven't included the wrapped Intel assembly code library yet. As you can see, the pure-Haskell AES based RNG (BurtonGenSlow) is pretty slow.<br>
<br>Would anyone else be interested in running those RNG testing tools (diehard, big crush) on this to make sure it is working correctly?<br><br>Also I'd be happy if anyone with a performance-oriented-eye would like to take a look at what's going on. Both for the sake of the serial performance (above) and because the parallel performance is currently <b>mysterious</b> (see below).<br>
<br>I figure one of the main reasons for splittable RNG is deterministic parallel computations. Thus it's important that all threads be able to run the RNG efficiently. Right now, if you look at SimpleRNGBench.hs I'm just running the same RNG on numCapabilities threads. Yet with that simple test I'm running into problems, summarized thus:<br>
<br> * I get substantial (3X) variance in program performance on consecutive runs.<br> * I see a minor hit in performance adding -threaded, but going from -N1 to -N4 (even with a single-thread of work) yields a big hit in performance and increase in variance.<br>
* -N4 with four actual threads of work is actually pretty good for the pure haskell version. All four threads on my nehalem 3.33ghz can maintain 93% of their throughput in the serial case. BUT the variance problem persists.<br>
* I run a busy-wait loop that measures cpu frequency... and this seems to get messed up in threaded mode (even with -qm -qa). I don't know why.<br> * I cannot killThread a haskell thread (forkIO or forkOS) that is currently running a divergent FFI call (safe or unsafe). (See "time_c".)<br>
<br>You can find the details in the DEVLOG here:<br><br> <a href="https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG" target="_blank">https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG</a><br><br>Let me know if you have any ideas. I'm going to leave the Haskell version how it is and focus on wrapping the Intel asm (which has a permissive license).<br>
<br>Cheers,<br><font color="#888888"> -Ryan<br>
</font><br>P.S. Regarding this benchmarking -- would it be appropriate to use Criterion
for this? Or is it sufficient to measure aggregate throughput as I've
been doing?<br><br></div></div>
</blockquote></div><br>
</div></div></blockquote></div><br></div></div>