[Haskell-cafe] Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

Tue Jun 5 04:36:51 EDT 2007

> >   {- Arity: 4 Strictness: LSSL -}
>
> Right. Unboxed args are always given the annotation L. So that function
> is strict in that pointer arg, but GHC is choosing not to unbox it. I'm
> not sure why that's the case.

I thought maybe it was because I hadn't said -funbox-strict-fields,
but it didn't change when I did.

> > Is there some semantic advantage to bang-patterns, or is it just a
> > syntactic convenience?
>
> It's syntactic convenience.

I've noticed differences between strictness guards and bang-patterns
with GHC. This is a problem, I think, because bang-patterns are GHC
only, and I wanted to keep the code portable. This strictness guard:

readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char x offset p
  | () !x !offset !p !False = undefined
  | otherwise =
  ...
  where
    x ! y = seq x y

results in this simplifier output:

[Arity 3
 Str: DmdType LSS]
$wreadUTF8Char_r38I =
  \ (ww_s30B :: GHC.Prim.Int#)
    (w_s30D :: GHC.Base.Int)
    (w1_s30E :: GHC.Ptr.Ptr GHC.Word.Word8) ->

However, if I change this to:

{-# OPTIONS_GHC -fbang-patterns #-}

...

readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char !x !offset !p
  | otherwise =

then I get this simpifier output:

[Arity 3
 Str: DmdType LLL]
$wreadUTF8Char_r38n =
  \ (ww_s2Zk :: GHC.Prim.Int#)
    (ww1_s2Zo :: GHC.Prim.Int#)
    (ww2_s2Zs :: GHC.Prim.Addr#) ->

Also, with bang-patterns I've noticed that fromUTF8Ptr transforms
into two functions, which contain very similar code:

[Arity 5]
Foreign.C.UTF8.$s$wfromUTF8Ptr =
  \ (acc_X151 :: GHC.Base.String)
    (new_s_a2GQ :: GHC.Prim.State# GHC.Prim.RealWorld)
    (a87_a2GR :: GHC.Base.Char)
    (ww_s2ZH :: GHC.Prim.Addr#)
    (sc_s36j :: GHC.Prim.Int#) ->

[Arity 4
 Str: DmdType LLSL]
Foreign.C.UTF8.$wfromUTF8Ptr =
  \ (ww_s2ZD :: GHC.Prim.Int#)
    (ww1_s2ZH :: GHC.Prim.Addr#)
    (w_s2ZJ :: GHC.Base.String)
    (w1_s2ZK :: GHC.Prim.State# GHC.Prim.RealWorld) ->

$wfromUTF8Ptr calls itself and $s$wfromUTF8Ptr, but $s$wfromUTF8Ptr
only every calls itself, so $wfromUTF8Ptr could be considered the
wrapper, and $s$wfromUTF8Ptr the worker, I guess.

AFAICT it's a transformation of the various cases in fromUTF8Ptr:

      | x <= 0x7F -> fromUTF8Ptr (bytes-1) p (chr x:acc)
      | x <= 0xBF && bytes == 0 -> error "fromUTF8Ptr: ..."
      | x <= 0xBF -> fromUTF8Ptr (bytes-1) p acc
      | otherwise -> do
          c <- readUTF8Char x bytes p
          fromUTF8Ptr (bytes-1) p (c:acc)

The first case, x <= 0x7F, results in a call to $s$wfromUTF8Ptr.
The third case, x <= 0xBF, results in a call to $wfromUTF8Ptr.
The last case, otherwise, results in a call to $s$wfromUTF8Ptr.

The calls to $s$wfromUTF8Ptr pass the newly constructed Char and the
rest of the String separately, and they are cons'ed in $s$wfromUTF8Ptr.
Not sure what benefit this gives...

I don't know what transformation causes this, but it was a bit of a surprise.

Alistair