Small update:<div><br></div><div>I got the first results from the hardware accelerated version on a 3.33 ghz westmere machine.  Right now it does twice as well as the Gladman software version, which is also twice as well as the System.Random stdgen, and 1000 times faster than a the Haskell implementation of AES that I got from the Crypto package:<br>


</div><div><br></div><div><div>     How many random numbers can we generate in a second on one thread?</div>

<div>      Cost of rdtsc (ffi call):    84</div><div>      Approx getCPUTime calls per second: 209,798</div><div>      Approx clock frequency:  3,331,093,772</div><div>      First, timing with System.Random interface:</div>


<div><span style="white-space: pre-wrap;">        </span> 76,811,104 random ints generated [constant zero gen]       </div><div><span style="white-space: pre-wrap;">        </span> 14,482,725 random ints generated [System.Random stdGen]    </div>


<div><span style="white-space: pre-wrap;">        </span>     16,061 random ints generated [PureHaskell/reference]   </div><div><span style="white-space: pre-wrap;">        </span>     32,309 random ints generated [PureHaskell]             </div>


<div><span style="white-space: pre-wrap;">        </span>  2,401,893 random ints generated [Gladman inefficient]     </div><div><span style="white-space: pre-wrap;">        </span> 15,980,625 random ints generated [Gladman]                 </div>


<div><span style="white-space: pre-wrap;">        </span>  2,329,500 random ints generated [IntelAES inefficient]    </div><div><span style="white-space: pre-wrap;">        </span> 32,383,799 random ints generated [IntelAES]                </div>


<div>      Comparison to C&#39;s rand():</div><div><span style="white-space: pre-wrap;">        </span> 71,392,408 random ints generated [rand/store in C loop]    </div>


<div><span style="white-space: pre-wrap;">        </span> 71,347,778 random ints generated [rand in Haskell loop]    </div><div><span style="white-space: pre-wrap;">        </span> 71,324,158 random ints generated [rand/store in Haskell loop] </div>


<div><br></div></div><div>This is what Burton Smith originally thought, that AES based RNG would be pretty fast and even faster with hardware acceleration.<br><br>-Ryan<br><br><div><br><br><div class="gmail_quote">

On Mon, Jan 31, 2011 at 1:25 AM, Ryan Newton <span dir="ltr">&lt;<a href="mailto:newton@mit.edu" target="_blank">newton@mit.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


Hi Cafe,<br><br>I&#39;ve included Gladman&#39;s efficient, portable C implementation of AES generating random numbers now, and the hardware-accelerated version is working too.  I&#39;m currently seeing higher throughput for even the software based version than the builtin stdGen:<div>


<br>

<br><span style="font-family: courier new,monospace;">  First, timing with System.Random interface:</span><span style="font-family: courier new,monospace;"></span><br style="font-family: courier new,monospace;"></div><span style="font-family: courier new,monospace;">     13,051,964 random ints generated [System.Random stdGen]      ~ 252 cycles/int</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">         15,635 random ints generated [PureHaskell/reference]     ~ 210,763 cycles/int</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">         31,159 random ints generated [PureHaskell]               ~ 105,757 cycles/int</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">      2,180,488 random ints generated [Gladman inefficient]       ~ 1,511 cycles/int</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">     15,015,095 random ints generated [Gladman]                   ~ 219 cycles/int</span><br style="font-family: courier new,monospace;">


<br>That seems like a good argument for cryptographic RNGs to me!<br><br>I&#39;m having a lot of trouble getting cabal to build/install it successfully.  You can see what I&#39;ve got there now.  I&#39;d be interested to know if anyone else can build it successfully.  It should work -- but only by building the assembly code into a .so and assuming the build directory is /opt/intel-aes ;-).<br>


<br>I don&#39;t have real numbers for the hardware version yet because the Westmere machine I&#39;m logged into is redhat 5.4 and is giving me &quot;GLIBC_2.7 not found&quot; errors.  You can run it for correctness purposes using an emulation tool called sde <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator/" target="_blank">(software development emulator)</a> that&#39;s based on dynamic binary translation.<br>


<font color="#888888">

<br>-Ryan<br></font><br>P.S. Checkout command:<br><span style="font-family: courier new,monospace;">git clone git://<a href="http://github.com/rrnewton/intel-aes.git" target="_blank">github.com/rrnewton/intel-aes.git</a></span><div>


<div></div><div><br><br><br><br><br>

<div class="gmail_quote">On Sat, Jan 29, 2011 at 8:52 AM, Ryan Newton <span dir="ltr">&lt;<a href="mailto:rrnewton@gmail.com" target="_blank">rrnewton@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


<div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

perhaps performance? Is this approach less robust with a faster,<br>

non-cryptographic RNG?<br></blockquote></div><div><br>Yes, I don&#39;t understand that either.  Is there a reason that using a weaker PRNG in this case is WORSE than using it in the non-splitting case?  Is that why there is more of an impetus to use the cryptographic approach in this case?<br>


<br>Anyway, taking for granted that the Burton approach is a useful thing to have implemented, I started developing a package for this stuff -- AES based RNG including both a haskell implementation and wrapping an AESNI-based C one .  I haven&#39;t posted it to Hackage yet, but you can find the git repository here:<br>


<br>    <a href="https://github.com/rrnewton/intel-aes" target="_blank">https://github.com/rrnewton/intel-aes</a><br><br>If you build with cabal and run the benchmark-intel-aes-rng executable, it will give you a breakdown like this:<br>


<br>

<span style="font-family: courier new,monospace;">    How many random numbers can we generate in a second on one thread?</span><br> <span style="font-family: courier new,monospace;">     Cost of rdtsc (ffi call):    83</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">      Approx getCPUTime calls per second: 205,640</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">      Approx clock frequency:  3,306,891,339</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">      First, timing with System.Random interface:</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">       193,178,901 random ints generated [constant zero gen]   </span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">        14,530,358 random ints generated [System.Random stdGen] </span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">            16,346 random ints generated [BurtonGenSlow/reference] </span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">            32,965 random ints generated [BurtonGenSlow]     </span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">      Comparison to C&#39;s rand():</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">       118,766,285 random ints generated [rand/store in C loop]</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">       114,668,028 random ints generated [rand / Haskell loop]</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">       114,675,116 random ints generated [rand/store Haskell] </span><br><br>At the moment this is Haskell-only, I haven&#39;t included the wrapped Intel assembly code library yet.  As you can see, the pure-Haskell AES based RNG (BurtonGenSlow) is pretty slow.<br>


<br>Would anyone else be interested in running those RNG testing tools (diehard, big crush) on this to make sure it is working correctly?<br><br>Also I&#39;d be happy if anyone with a performance-oriented-eye would like to take a look at what&#39;s going on.  Both for the sake of the serial performance (above) and because the parallel performance is currently <b>mysterious</b> (see below).<br>


<br>I figure one of the main reasons for splittable RNG is deterministic parallel computations.  Thus it&#39;s important that all threads be able to run the RNG efficiently.  Right now, if you look at SimpleRNGBench.hs I&#39;m just running the same RNG on numCapabilities threads.  Yet with that simple test I&#39;m running into problems, summarized thus:<br>


<br>  * I get substantial (3X) variance in program performance on consecutive runs.<br>  * I see a minor hit in performance adding -threaded, but going from -N1 to -N4 (even with a single-thread of work) yields a big hit in performance and increase in variance.<br>


  * -N4 with four actual threads of work is actually pretty good for the pure haskell version.  All four threads on my nehalem 3.33ghz can maintain 93% of their throughput in the serial case.  BUT the variance problem persists.<br>


  * I run a busy-wait loop that measures cpu frequency... and this seems to get messed up in threaded mode (even with -qm -qa).  I don&#39;t know why.<br>  * I cannot killThread a haskell thread (forkIO or forkOS) that is currently running a divergent FFI call (safe or unsafe).  (See &quot;time_c&quot;.)<br>


<br>You can find the details in the DEVLOG here:<br><br>   <a href="https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG" target="_blank">https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG</a><br><br>Let me know if you have any ideas.  I&#39;m going to leave the Haskell version how it is and focus on wrapping the Intel asm (which has a permissive license).<br>


<br>Cheers,<br><font color="#888888">  -Ryan<br>

</font><br>P.S. Regarding this benchmarking -- would it be appropriate to use Criterion 

for this?  Or is it sufficient to measure aggregate throughput as I&#39;ve 

been doing?<br><br></div></div>

</blockquote></div><br>

</div></div></blockquote></div><br></div></div>