implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Thu Apr 20 11:52:09 EDT 2006


Bulat Ziganshin <bulat.ziganshin at gmail.com> writes:

> this letter describes why i think that using hand-made (de)coder for
> support of UTF-8 encoded files is better than using iconv.

A Haskell recoder is fine, and probably a good idea for important
encodings, provided that wrapping a block recoder implemented in C
is not ruled out. The two approaches should coexist.

> 2) Einar once asked me about changing the encoding on the fly, that
> is needed for some HTML processing.

HTML can be parsed by first treating it as ISO-8859-1, looking only
for the headers that specify the encoding, and then decoding the
whole stream with the right encoding.
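
A minimal sketch of that first pass in Haskell, assuming the bytestring
package and an ASCII-compatible real encoding. The name sniffCharset and
the 1024-byte limit are illustrative choices, not part of any library:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Char (isAlphaNum, isAscii, toLower)
import Data.List (isPrefixOf, tails)

-- Look for "charset=" near the start of a document and return the
-- (lowercased) encoding name that follows it, if any.  The bytes are
-- viewed as ISO-8859-1, one Char per byte, which cannot fail and keeps
-- the ASCII markup readable whatever the real encoding is.
-- Hypothetical helper: the 1024-byte window is an assumption.
sniffCharset :: B.ByteString -> Maybe String
sniffCharset bytes =
  case [rest | t <- tails prefix, Just rest <- [stripKey t]] of
    (rest : _) -> nonEmpty (takeWhile isNameChar (dropWhile isQuote rest))
    []         -> Nothing
  where
    prefix = map toLower (BC.unpack (B.take 1024 bytes))
    key    = "charset="
    stripKey t
      | key `isPrefixOf` t = Just (drop (length key) t)
      | otherwise          = Nothing
    isQuote c    = c == '"' || c == '\''
    isNameChar c = isAscii c && (isAlphaNum c || c `elem` "-_.:")
    nonEmpty s   = if null s then Nothing else Just s
```

For input containing charset=UTF-8 this returns Just "utf-8"; the caller
would then decode the whole stream again from the start with that
encoding.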

> it is also possible that some program will need to intersperse text
> I/O with buffer/array/byte/bits I/O. it's a sort of things that are
> absolutely impossible with iconv

Of course it's possible.

HTTP specifies that headers end with an empty line. The boundary can
be found without decoding the text at all. Then the part before the
boundary is treated as ASCII text and converted to strings, and the
rest is binary.
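
A sketch of this split in Haskell, assuming the bytestring package and
that the whole message is already in memory. splitMessage is a
hypothetical name used only for illustration:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Split a raw HTTP message at the first empty line (CR LF CR LF).
-- Returns the header lines as Strings and the untouched binary body,
-- or Nothing if the boundary is not present in the input.
splitMessage :: B.ByteString -> Maybe ([String], B.ByteString)
splitMessage raw
  | B.null rest = Nothing
  | otherwise   = Just (headerLines, B.drop (B.length boundary) rest)
  where
    boundary       = BC.pack "\r\n\r\n"
    -- The search works on raw bytes; nothing is decoded here.
    (header, rest) = B.breakSubstring boundary raw
    -- Headers are taken to be ASCII, so unpacking each byte to a Char
    -- is all the "decoding" that is needed.
    headerLines    = map (filter (/= '\r')) (lines (BC.unpack header))
```

Only the part before the boundary is ever turned into Strings; the body
stays a ByteString and can be handed to binary I/O unchanged.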

Alternatively, the text can be read by decoding one character at
a time, and once the boundary is found, the rest is read from the
underlying binary stream. Even IConv can be used one character at a
time; it will only be inefficient. But here ASCII can be implemented
by hand.

Emitting HTTP is analogous.

> 3) my library support Streams that works in ANY monad (not only IO,
> ST and their derivatives). it's impossible to implement iconv
> conversion for such stream types

Which is good. It's impossible to implement a stateful encoding in a
monad which doesn't carry state.

> moreover, there are implementation issues that make me more
> enthusiastic about hand-made solution. it just already implemented
> and really works.

Your implementation doesn't detect unencodable or malformed input.
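
For illustration, a hand-written decoder that does reject malformed
input could look roughly like the sketch below. It is not the code from
either library discussed here; decodeUtf8Strict is a hypothetical name,
and it works on a plain list of bytes for simplicity:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 strictly.  Left carries the offset of the first byte of
-- the sequence that could not be decoded.
decodeUtf8Strict :: [Word8] -> Either Int String
decodeUtf8Strict = go 0
  where
    go :: Int -> [Word8] -> Either Int String
    go _ [] = Right []
    go off (b : bs)
      | b < 0x80               = (chr (fromIntegral b) :) <$> go (off + 1) bs
      | b >= 0xC2 && b <= 0xDF = multi off 1 (fromIntegral b .&. 0x1F) 0x80 bs
      | b >= 0xE0 && b <= 0xEF = multi off 2 (fromIntegral b .&. 0x0F) 0x800 bs
      | b >= 0xF0 && b <= 0xF4 = multi off 3 (fromIntegral b .&. 0x07) 0x10000 bs
      | otherwise              = Left off  -- stray continuation or invalid lead byte

    -- Read n continuation bytes, then reject truncated sequences,
    -- overlong forms, surrogates and values above U+10FFFF.
    multi :: Int -> Int -> Int -> Int -> [Word8] -> Either Int String
    multi off n acc minCode bs
      | length conts == n && all isCont conts && valid
                  = (chr code :) <$> go (off + n + 1) rest
      | otherwise = Left off
      where
        (conts, rest) = splitAt n bs
        isCont c = c .&. 0xC0 == 0x80
        code  = foldl (\a c -> (a `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) acc conts
        valid = code >= minCode && code <= 0x10FFFF
                && not (code >= 0xD800 && code <= 0xDFFF)
```

decodeUtf8Strict [0x48, 0xC3, 0xA9] gives Right "H\233", while the
overlong sequence [0xC0, 0x80] gives Left 0 instead of silently
producing a NUL character.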

And I've already implemented both an IConv wrapper and some
hand-written encodings (but not for Haskell). They work too :-)

> using iconv anyway will be much more complex than using hand-made
> routines.

An iconv wrapper is written once and tens of encodings become
available at once. By hand, each would have to be implemented
separately.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


More information about the Libraries mailing list