<div class="gmail_quote">Hi Ketil,<br><br>On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde <span dir="ltr"><<a href="mailto:ketil@malde.org">ketil@malde.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">Johan Tibell <<a href="mailto:johan.tibell@gmail.com">johan.tibell@gmail.com</a>> writes:<br>
<br>
> It's not clear to me that using UTF-16 internally does make Data.Text<br>
> noticeably slower.<br>
<br>
</div>I haven't benchmarked it, but I'm fairly sure that, if you try to fit a<br>
3 Gbyte file (the human genome, say) into a computer with 4 Gbytes of<br>
RAM, UTF-16 will be slower than UTF-8. Many applications will get away<br>
with streaming over data, retaining only a small part, but some won't.<br></blockquote><div><br>I'm not sure this is a great example, as genome data is probably much better stored in a vector (using a few bits per "letter"). I agree that whenever one data structure fits in the available RAM and another doesn't, the smaller will win. I just don't know whether this case is worth weeks' worth of optimization work. That's why I'd like to see benchmarks for more idiomatic use cases.<br>
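As a rough illustration of the size argument, here is a back-of-the-envelope sketch (not a benchmark; the sizes ignore constructor and array overhead) comparing how many bytes n genome letters would occupy as UTF-16 code units, as UTF-8 bytes, and packed at 2 bits per base:

```haskell
-- Hypothetical back-of-the-envelope sizes for n genome letters
-- (A/C/G/T only); per-structure overhead is ignored.
bytesUtf16, bytesUtf8, bytesPacked :: Integer -> Integer
bytesUtf16  n = 2 * n                 -- A/C/G/T are BMP characters: 2 bytes each
bytesUtf8   n = n                     -- A/C/G/T are ASCII: 1 byte each
bytesPacked n = (2 * n + 7) `div` 8   -- 2 bits per letter, rounded up

main :: IO ()
main = do
  let n = 3 * 10 ^ (9 :: Int)         -- roughly the human genome
  mapM_ print [bytesUtf16 n, bytesUtf8 n, bytesPacked n]
  -- prints 6000000000, 3000000000, 750000000
```

At 2 bits per base the packed vector is about 8x smaller than UTF-16 and 4x smaller than UTF-8, which is why a bit-packed vector beats either encoding for this kind of data.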
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
In other cases (e.g. processing CJK text, and perhaps also<br>
non-Latin1 text), I'm sure it'll be faster - but my (still<br>
unsubstantiated) guess is that the difference will be much smaller, and<br>
it'll be a case of winning some and losing some - and I'd also<br>
conjecture that having 3Gb "real" text (i.e. natural language, as<br>
opposed to text-formatted data) is rare.<br></blockquote><div><br>I would like to verify this guess. In my experience, it's really hard to predict which changes will lead to a noticeable performance improvement; I'm probably wrong more often than I'm right.<br>
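One way to verify it would be a criterion benchmark group run over both Latin and CJK input. A minimal sketch, assuming the criterion and text packages; the inputs and the choice of toUpper as the workload are placeholders, not a definitive benchmark suite:

```haskell
-- Sketch: benchmark one idiomatic Data.Text operation on Latin vs. CJK input.
import Criterion.Main (defaultMain, bgroup, bench, nf)
import qualified Data.Text as T

main :: IO ()
main = do
  let latin = T.replicate 100000 (T.pack "hello world ")  -- ASCII-heavy input
      cjk   = T.replicate 100000 (T.pack "\20320\22909\19990\30028")  -- CJK input
  defaultMain
    [ bgroup "toUpper"
        [ bench "latin" (nf T.toUpper latin)
        , bench "cjk"   (nf T.toUpper cjk)
        ]
    ]
```

Running the same group against a UTF-8-backed variant of the library would then give numbers for exactly the idiomatic cases in question, rather than guesses.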
</div></div><br>Cheers,<br>Johan<br><br>