[Haskell-cafe] Re: Haskell (Byte)Strings - wrong to separate content from encoding?

Maciej Piechotka uzytkownik2 at gmail.com
Fri Mar 19 14:05:39 EDT 2010


On Fri, 2010-03-19 at 18:45 +0100, Mads Lindstrøm wrote:
> Hi
> 
> More and more libraries use ByteStrings these days. And it is great that
> we can get fast string handling in Haskell, but is ByteString the right
> level of abstraction for most uses?
> 
> It seems to me that libraries, like the Happstack server, should use a
> string-type which contains both the content (what ByteString contains
> today) and the encoding. After all, data in a ByteString has no meaning
> if we do not know its encoding.
> 
> An example will illustrate my point. If your web-app, implemented with
> Happstack, receives a request, it looks like
> http://happstack.com/docs/0.4/happstack-server/Happstack-Server-HTTP-Types.html#t%3ARequest :
> 
> data Request = Request { ... rqHeaders :: Headers, ... rqBody :: RqBody ... }
> 
> newtype RqBody = Body ByteString
> 
> To actually read the body, you need to find the Content-Type header and
> use some encoding-conversion package to actually know what the ByteString
> means. Furthermore, some other library may need to consume the
> ByteString. Now you need to know which encoding the consumer expects...
> 
> But all this seems avoidable if Happstack returned a string-type which
> included both content and encoding.
> 
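
To make the quoted workflow concrete: with a bare ByteString body the
consumer ends up writing something along these lines (only a sketch, not
Happstack's actual API; I assume the charset string has already been pulled
out of the Content-Type header, and only UTF-8 is handled):

import Data.Text (Text)
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- Turn raw body bytes into text, given the charset from the Content-Type
-- header.  Anything other than UTF-8 is reported as unsupported.
decodeBody :: Maybe String -> B.ByteString -> Either String Text
decodeBody (Just "utf-8") bytes = either (Left . show) Right (TE.decodeUtf8' bytes)
decodeBody (Just other)   _     = Left ("unsupported charset: " ++ other)
decodeBody Nothing        _     = Left "no charset given - the bytes alone have no textual meaning"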

I guess the problem is that... the body does not necessarily have to be
text. It can just as well be a GIF, an MP3, etc.

So you would need to have something like:

data RqBody = Text MIME String        -- decoded text, tagged with its MIME type
            | Binary MIME ByteString  -- raw bytes (a GIF, an MP3, ...), tagged with its MIME type
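
A consumer could then dispatch on the constructor instead of guessing at an
encoding. A self-contained sketch of that idea (the type is repeated so the
fragment stands alone, with MIME as a plain String alias for illustration):

import qualified Data.ByteString as B

type MIME = String

data RqBody = Text MIME String          -- already-decoded text
            | Binary MIME B.ByteString  -- raw bytes (images, audio, ...)

-- Report what kind of body we got, without ever guessing an encoding.
describeBody :: RqBody -> String
describeBody (Text mime s)    = mime ++ ": " ++ show (length s) ++ " characters"
describeBody (Binary mime bs) = mime ++ ": " ++ show (B.length bs) ++ " bytes"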

> I could tell a similar story about reading text files.
> 
> If some data structure contains a lot of small strings, having both
> encoding and content for each string is wasteful. Thus, I am not
> suggesting that ByteString should be scrapped, just that, ordinarily,
> programmers should not have to think about string encodings.
> 

In network programming you have to think about encoding - there are (or
were) too many sites encoded in IBM code pages (not much of a problem for
English-speaking users). Worse, I have read some HTML tutorials which
suggested that adding a meta content-type tag automatically changes the
page to an ISO encoding ;)

> An alternative to having a string type that contains both content and
> encoding would be standardizing on some encoding like UTF-8. I realize
> that we have the utf8-string package on Hackage, but people (at least
> Happstack and Network.HTTP) seem to prefer ByteString. I wonder why.
> 
> 
> Greetings,
> 
> Mads Lindstrøm

Hopefully most of these problems are going away as the world moves to
UTF-8. But still:

- Other Unicode encodings are used (for example fixed-width ones)
- Other data types are used (like binary data)

In many cases you cannot depend on the MIME type always being correct. In
some cases you do not need character recoding at all (you store the data
directly in a database, or you want to compress it).
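
For the compression case in particular, the raw bytes are exactly what you
want and no charset ever enters the picture. A minimal sketch using the
zlib package (the name archiveBody is just mine):

import qualified Data.ByteString.Lazy as BL
import Codec.Compression.GZip (compress)

-- Compress a raw body for storage; the bytes pass through untouched,
-- so no knowledge of any text encoding is needed.
archiveBody :: BL.ByteString -> BL.ByteString
archiveBody = compress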

Additionally, you may want to compute a checksum of a string. However,
recoding UTF-16 -> ... -> UTF-16 may change the contents (the byte-order
mark at the beginning) and therefore the checksum.
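
To illustrate the checksum point, here is a small sketch using the text and
bytestring packages (the byte-order swap in the round trip is my own
example, not something claimed above): the same text re-encoded differently
gives different bytes, and so a different checksum.

import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
    -- "hi" in UTF-16BE, preceded by a byte-order mark (FE FF).
    let original     = B.pack [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69]
        -- Decode as big-endian text, then re-encode as little-endian.
        roundTripped = TE.encodeUtf16LE (TE.decodeUtf16BE original)
    print (original == roundTripped)  -- False: the bytes differ, so any checksum differs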

Regards