<div dir="ltr"><br><br><div class="gmail_quote">On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <span dir="ltr">&lt;<a href="mailto:johan.tibell@gmail.com">johan.tibell@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="gmail_quote">Hi Michael,<div class="im"><br><br>On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <span dir="ltr">&lt;<a href="mailto:michael@snoyman.com" target="_blank">michael@snoyman.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">


<div dir="ltr"><div class="gmail_quote"><div>Here&#39;s my response to the two points:</div>

<div><br></div><div>* I haven&#39;t written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (which I&#39;ll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been &quot;don&#39;t use bytestring, it&#39;s the wrong datatype, text will get fixed,&quot; which is quite underwhelming.</div>


</div></div></blockquote></div><div><br>I went through all the emails you sent with the topic &quot;String vs ByteString&quot; and &quot;Re: String vs ByteString&quot; and I can&#39;t find a single benchmark. I do agree with you that<br>


<br>    * UTF-8 is more compact than UTF-16, and<br>    * UTF-8 is by far the most used encoding on the web.<br><br>and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster.<br>


<br>

What I&#39;m looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.<br>


 </div></div></blockquote><div>Sorry, I thought I&#39;d sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:</div><div><br></div><div>

<a href="http://www.snoyman.com/blog/entry/bigtable-benchmarks/">http://www.snoyman.com/blog/entry/bigtable-benchmarks/</a></div><div><a href="http://www.snoyman.com/blog/entry/optimizing-hamlet/">http://www.snoyman.com/blog/entry/optimizing-hamlet/</a></div>

<div><br></div><div>Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact since it wouldn&#39;t be using Bryan&#39;s fusion logic.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

<div dir="ltr"><div class="gmail_quote">

<div></div><div>* Since the prevailing attitude has been such disregard for any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meantime, Jasper has released blaze-builder which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I&#39;ll be chastised by the community, I&#39;ll stick with this approach for the moment.</div>


</div></div></blockquote></div><div><br>I&#39;m not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:<br><br>    * GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.<br>


<br>    * The differences in text and bytestring&#39;s fusion frameworks get in the way of some optimization in GHC (text uses a more sophisticated fusion framework that handles some cases bytestring can&#39;t, according to Bryan).<br>


<br>    * Lingering space leaks are hurting performance (Bryan plugged one already).<br><br>    * The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.<br><br>    * Extraneous copying in the Handle implementation slows down I/O.<br>


<br>All these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.<br>  </div></div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">


<div dir="ltr"><div class="gmail_quote">

<div></div><div>Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don&#39;t have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.</div>


</div></div></blockquote></div><div><br>I don&#39;t see any reason why Bryan wouldn&#39;t accept a UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.<br>

<br></div></div></blockquote><div>I think that&#39;s the main issue, and Duncan hit the nail on the head: we have to think about which benchmarks are the important ones. For Hamlet, I need fast UTF-8 bytestring generation. I don&#39;t care at all about the algorithmic speed of splitting texts, for example. My (probably uneducated) guess is that UTF-16 tends to perform many in-memory operations faster since almost all characters are represented as 16 bits, while the big benefits of UTF-8 are in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that&#39;s an (uneducated) guess.</div>
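<div><br></div><div>To make the memory half of that guess concrete, here is a minimal sketch (not a benchmark, and it says nothing about speed) that compares encoded sizes using encodeUtf8 and encodeUtf16LE from Data.Text.Encoding. The helper name encodedSizes and the sample strings are made up for illustration.</div>
<pre>{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- encodedSizes is a hypothetical helper: (UTF-8 bytes, UTF-16 bytes).
encodedSizes :: T.Text -&gt; (Int, Int)
encodedSizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  print (encodedSizes "hello")   -- ASCII: prints (5,10)
  print (encodedSizes "日本語")   -- CJK: prints (9,6)
</pre>
<div>ASCII-heavy output such as HTML markup is half the size in UTF-8, while CJK text is larger, so the memory benefit depends on the data being handled.</div>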

<div><br></div><div>Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases. As is, I&#39;m quite happy using blaze-builder for Hamlet.</div>

<div><br></div><div>Michael</div></div></div>