On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman <span dir="ltr">&lt;<a href="mailto:michael@snoyman.com">michael@snoyman.com</a>&gt;</span> wrote:<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div dir="ltr"><div class="gmail_quote"><div class="im">On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <span dir="ltr">&lt;<a href="mailto:johan.tibell@gmail.com" target="_blank">johan.tibell@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_quote"><div> </div></div></blockquote></div><div>Sorry, I thought I&#39;d sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:</div>


<div><br></div><div>

<a href="http://www.snoyman.com/blog/entry/bigtable-benchmarks/" target="_blank">http://www.snoyman.com/blog/entry/bigtable-benchmarks/</a></div><div><a href="http://www.snoyman.com/blog/entry/optimizing-hamlet/" target="_blank">http://www.snoyman.com/blog/entry/optimizing-hamlet/</a></div>


<div><br></div><div>Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact since it wouldn&#39;t be using Bryan&#39;s fusion logic.</div>


</div></div></blockquote><div><br></div><div>Those are great. As Bryan mentioned, we&#39;ve already improved performance, and I think I know how to improve it further.</div><div><br></div><div>I appreciate that it&#39;s difficult to show the UTF-8/UTF-16 divide. The approach we&#39;re trying at the moment is to look at benchmarks, improve performance, and repeat until we can&#39;t improve any further. It could be the case that we get a benchmark where the performance difference between bytestring and text cannot be explained or fixed by anything other than changing the internal encoding. That would be strong evidence that we should try to switch the internal encoding. We haven&#39;t seen any such benchmarks yet.</div>


<div><br></div><div>As for blaze, I&#39;m not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn&#39;t find anywhere that input ByteStrings are actually validated. If they&#39;re not, it&#39;s a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.</div>
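<div><br></div><div>To make the distinction concrete, here is a minimal sketch of what validating input could look like, assuming the text package&#39;s Data.Text.Encoding.decodeUtf8&#39; (which returns Left on malformed input). The helper name isValidUtf8 is made up for illustration, and none of this says anything about what blaze actually does:</div>
<pre>import qualified Data.ByteString as B
import Data.Either (isRight)
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: True only if the bytes are well-formed UTF-8.
isValidUtf8 :: B.ByteString -&gt; Bool
isValidUtf8 = isRight . TE.decodeUtf8'

main :: IO ()
main = do
  let ok  = B.pack [0x68, 0xC3, 0xA9]  -- "h" followed by U+00E9
      bad = B.pack [0x68, 0xC3, 0x28]  -- 0xC3 lacks its continuation byte
  print (isValidUtf8 ok)               -- True
  print (isValidUtf8 bad)              -- False
  -- Plain concatenation never inspects the bytes, so it accepts both.
  print (B.length (B.append ok bad))   -- 6
</pre>
<div>A library that only appends ByteStrings behaves like the last line: malformed input passes straight through to the output.</div>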


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div dir="ltr"><div class="gmail_quote"><div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_quote"><div>I don&#39;t see any reason why Bryan wouldn&#39;t accept a UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.<br>


<br></div></div></blockquote></div><div>I think that&#39;s the main issue, and one that Duncan hit on the head: we have to think about which benchmarks are the important ones. For Hamlet, I need fast UTF-8 bytestring generation. I don&#39;t care at all about the algorithmic speed of splitting texts, for example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster, since almost all characters are represented as 16 bits, while the big benefits of UTF-8 are in reading UTF-8 data, rendering UTF-8 data, and decreased memory usage. But as I said, that&#39;s an (uneducated) guess.</div>


</div></div></blockquote><div><br></div><div>I agree. Let&#39;s create some more benchmarks.</div><div><br></div><div>For example, lately I&#39;ve been working on a benchmark, inspired by a real-world problem, where I iterate over the lines of a ~500 MB UTF-8 encoded file, insert each line into a Data.Map, and do a bunch of further processing on it (such as splitting the strings into words). This tests text I/O throughput, memory overhead, performance of string comparison, etc.</div>
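<div><br></div><div>A rough sketch of a benchmark of that shape, not the actual one: it assumes Data.Map.Strict from a recent containers, reads the whole file strictly rather than streaming it, and uses made-up names (lineCounts, wordTotal). Counting duplicate lines and totalling words are stand-ins for the &quot;further processing&quot;:</div>
<pre>import qualified Data.ByteString as B
import Data.List (foldl')
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import System.Environment (getArgs)

-- Insert every line into a Map, counting duplicates. The Map's key
-- comparisons exercise Text's Ord instance.
lineCounts :: T.Text -&gt; M.Map T.Text Int
lineCounts = foldl' bump M.empty . T.lines
  where
    bump m l = M.insertWith (+) l 1 m

-- Further processing: split each distinct line into words and total them,
-- weighting each line by how often it occurred.
wordTotal :: M.Map T.Text Int -&gt; Int
wordTotal = M.foldlWithKey' step 0
  where
    step acc l n = acc + n * length (T.words l)

main :: IO ()
main = do
  args &lt;- getArgs
  case args of
    [path] -&gt; do
      bytes &lt;- B.readFile path
      -- decodeUtf8 throws on malformed input; acceptable for a trusted file.
      let counts = lineCounts (TE.decodeUtf8 bytes)
      print (M.size counts, wordTotal counts)
    _ -&gt; putStrLn "usage: linebench FILE"
</pre>
<div>Swapping Data.Text for Data.ByteString.Char8 in lineCounts and wordTotal would give the bytestring side of the comparison.</div>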


<div><br></div><div>We already have benchmarks for reading files (in UTF-8) in several different ways (lazy I/O and iteratee-style folds).</div><div><br></div><div>Boil down the things you care about into a self-contained benchmark and send it to this list or put it somewhere where we can retrieve it.</div>
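<div><br></div><div>As a possible starting point, a self-contained benchmark could be as small as the following sketch. It assumes the criterion package (defaultMain, bench, nf); the sample input and the two operations measured are made up for illustration, so substitute whatever your application actually does:</div>
<pre>import Criterion.Main (bench, defaultMain, nf)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Made-up sample input: a short line with one non-ASCII character,
-- repeated 100000 times.
sample :: T.Text
sample = T.replicate 100000 (T.pack "caf\233 au lait\n")

main :: IO ()
main = defaultMain
  [ -- UTF-8 generation, the operation a template engine cares about.
    bench "encodeUtf8" (nf TE.encodeUtf8 sample)
    -- Splitting, for comparison with workloads that care about it.
  , bench "lines+words" (nf (map T.words . T.lines) sample)
  ]
</pre>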


<div><br></div><div>Cheers,</div><div>Johan</div><div><br></div></div>