<div class="gmail_quote">Hi Ketil,<br><br>On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde <span dir="ltr"><<a href="mailto:ketil@malde.org">ketil@malde.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">Johan Tibell <<a href="mailto:johan.tibell@gmail.com">johan.tibell@gmail.com</a>> writes:<br>
<br>
> It's not clear to me that using UTF-16 internally does make Data.Text<br>
> noticeably slower.<br>
<br>
</div>I haven't benchmarked it, but I'm fairly sure that, if you try to fit a<br>
3 Gbyte file (the human genome, say) into a computer with 4 Gbytes of<br>
RAM, UTF-16 will be slower than UTF-8. Many applications will get away<br>
with streaming over data, retaining only a small part, but some won't.<br></blockquote><div><br>I'm not sure this is a great example, as genome data is probably much better stored in a vector (using a few bits per "letter"). I agree that whenever one data structure fits in the available RAM and another doesn't, the smaller will win. I just don't know whether this case is worth weeks' worth of optimization work. That's why I'd like to see benchmarks for more idiomatic use cases.<br>
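As a rough illustration of the size argument, here is a back-of-the-envelope sketch (not a benchmark; the sizes ignore constructor and array overhead) comparing how many bytes n genome letters would occupy as UTF-16 code units, as UTF-8 bytes, and packed at 2 bits per base:

```haskell
-- Hypothetical back-of-the-envelope sizes for n genome letters
-- (A/C/G/T only); per-structure overhead is ignored.
bytesUtf16, bytesUtf8, bytesPacked :: Integer -> Integer
bytesUtf16  n = 2 * n                 -- A/C/G/T are BMP characters: 2 bytes each
bytesUtf8   n = n                     -- A/C/G/T are ASCII: 1 byte each
bytesPacked n = (2 * n + 7) `div` 8   -- 2 bits per letter, rounded up

main :: IO ()
main = do
  let n = 3 * 10 ^ (9 :: Int)         -- roughly the human genome
  mapM_ print [bytesUtf16 n, bytesUtf8 n, bytesPacked n]
  -- prints 6000000000, 3000000000, 750000000
```

At 2 bits per base the packed vector is about 8x smaller than UTF-16 and 4x smaller than UTF-8, which is why a bit-packed vector beats either encoding for this kind of data.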
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
In other cases (e.g. processing CJK text, and perhaps also<br>
non-Latin1 text), I'm sure it'll be faster - but my (still<br>
unsubstantiated) guess is that the difference will be much smaller, and<br>
it'll be a case of winning some and losing some - and I'd also<br>
conjecture that having 3Gb "real" text (i.e. natural language, as<br>
opposed to text-formatted data) is rare.<br></blockquote><div><br>I would like to verify this guess. In my experience, it's really hard to predict which changes will lead to a noticeable performance improvement; I'm probably wrong more often than I'm right.<br>
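One way to verify it would be a criterion benchmark group run over both Latin and CJK input. A minimal sketch, assuming the criterion and text packages; the inputs and the choice of toUpper as the workload are placeholders, not a definitive benchmark suite:

```haskell
-- Sketch: benchmark one idiomatic Data.Text operation on Latin vs. CJK input.
import Criterion.Main (defaultMain, bgroup, bench, nf)
import qualified Data.Text as T

main :: IO ()
main = do
  let latin = T.replicate 100000 (T.pack "hello world ")  -- ASCII-heavy input
      cjk   = T.replicate 100000 (T.pack "\20320\22909\19990\30028")  -- CJK input
  defaultMain
    [ bgroup "toUpper"
        [ bench "latin" (nf T.toUpper latin)
        , bench "cjk"   (nf T.toUpper cjk)
        ]
    ]
```

Running the same group against a UTF-8-backed variant of the library would then give numbers for exactly the idiomatic cases in question, rather than guesses.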
</div></div><br>Cheers,<br>Johan<br><br>