GHC and UNICODE...

Simon Marlow simonmar at microsoft.com
Mon Dec 22 10:49:27 EST 2003


 
> something I have wanted to do is modify Alex so that ∀ turns into the
> regular expression 0xe2 0x88 0x80 (and so forth) so that ghc (whose
> lexer is generated from alex) can simply accept utf8 input. 

I also really want to get GHC accepting UTF-8 source files, but I don't think this is the best way to go about it.

Sure, you can run Alex over the UTF-8 source, but the grammar will be huge.  A simpler way is to take advantage of the fact that Haskell only uses 5 classes of Unicode characters: uniSmall, uniLarge, uniWhite, uniSymbol, and uniDigit.  Alex has a good input abstraction behind which you can hide the translation from UTF-8 to Char, so you can map each non-ASCII character onto one of 5 special Char values, one per class, and use Alex unmodified.

Well, perhaps Alex will need a small modification so that its upper bound on Char values is variable (currently it is fixed at 255).
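
To make the idea concrete, here is a rough sketch of the translation layer in Haskell.  Everything in it is hypothetical: the names (decodeOne, lexerChar, toLexerChars), the placeholder values '\x80'..'\x85', and the extra "other" bucket for characters outside the 5 classes are invented for illustration and are not taken from Alex or GHC.  The placeholders are picked from 0x80-0xFF so that they stay under the current bound of 255, and the decoder does not try to reject overlong encodings.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (GeneralCategory (..), chr, generalCategory)
import Data.Word (Word8)

-- Hypothetical placeholder Chars, one per Unicode class used by Haskell,
-- plus one bucket for anything else.  The values are an assumption.
uniSmall, uniLarge, uniWhite, uniSymbol, uniDigit, uniOther :: Char
uniSmall  = '\x80'
uniLarge  = '\x81'
uniWhite  = '\x82'
uniSymbol = '\x83'
uniDigit  = '\x84'
uniOther  = '\x85'

-- Decode one UTF-8 sequence from the front of the input.
-- Returns Nothing on truncated or malformed input.
decodeOne :: [Word8] -> Maybe (Char, [Word8])
decodeOne [] = Nothing
decodeOne (b : bs)
  | b < 0x80           = Just (chr (fromIntegral b), bs)
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise          = Nothing
  where
    multi :: Int -> Int -> Maybe (Char, [Word8])
    multi n lead =
      let (conts, rest) = splitAt n bs
          v = foldl step lead conts
      in if length conts == n && all isCont conts && v <= 0x10FFFF
           then Just (chr v, rest)
           else Nothing
    isCont x = x .&. 0xC0 == 0x80
    step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)

-- ASCII passes through unchanged; every other character collapses
-- to the placeholder for its class.
lexerChar :: Char -> Char
lexerChar c
  | c < '\x80'                                    = c
  | cat == LowercaseLetter                        = uniSmall
  | cat `elem` [UppercaseLetter, TitlecaseLetter] = uniLarge
  | cat == Space                                  = uniWhite
  | cat == DecimalNumber                          = uniDigit
  | cat `elem` symbolCats                         = uniSymbol
  | otherwise                                     = uniOther
  where
    cat = generalCategory c
    symbolCats =
      [ MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol
      , ConnectorPunctuation, DashPunctuation, OpenPunctuation
      , ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation ]

-- What the lexer would see for a whole UTF-8 byte string.
toLexerChars :: [Word8] -> Maybe [Char]
toLexerChars [] = Just []
toLexerChars bs = do
  (c, rest) <- decodeOne bs
  (lexerChar c :) <$> toLexerChars rest
```

With this, the bytes 0xe2 0x88 0x80 from the example above reach the lexer as the single uniSymbol placeholder.  A real version would also have to keep the original characters around, since the lexer still needs the actual text of each token.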

Then you have to think about whether GHC keeps strings internally in UTF-8 or as expanded Unicode (one code point per character).  Perhaps UTF-8 is initially easier (not much change to the FastString type), but this might have further ramifications.

Hmmm... looks like a good project to put on the GHC Task List!

Cheers,
	Simon

