I'm working on measuring and improving the performance of the text library at the moment, and the very first test I tried showed behaviour that I don't understand beyond a very shallow level. All the comments below pertain to GHC 6.10.4.<div>
<br></div>
<div>The text library uses stream fusion, and I want to measure the performance of UTF-8 decoding.</div><div><br></div><div>The code I'm measuring is very simple:</div><div><br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<div><div><font face="'courier new', monospace">import qualified Data.ByteString as B</font></div></div><div><div><font face="'courier new', monospace">import Data.Text.Encoding as T</font></div>
</div><div><div><font face="'courier new', monospace">import qualified Data.Text as T</font></div></div><div><div><font face="'courier new', monospace">import System.Environment (getArgs)</font></div>
</div><div><div><font face="'courier new', monospace">import Control.Monad (forM_)</font></div></div><div><div><font face="'courier new', monospace"><br></font></div>
</div><div><div><font face="'courier new', monospace">main = do</font></div></div><div><div><font face="'courier new', monospace"> args <- getArgs</font></div>
</div><div><div><font face="'courier new', monospace"> forM_ args $ \a -> do</font></div></div><div><div><font face="'courier new', monospace"> s <- B.readFile a</font></div>
</div><div><div><font face="'courier new', monospace"> let t = T.decodeUtf8 s</font></div></div><div><div><font face="'courier new', monospace"> print (T.length t)</font></div>
</div></blockquote><div><div><font face="'courier new', monospace"><br></font></div><div><font face="arial, helvetica, sans-serif">The streamUtf8 function looks roughly like this:</font></div>
<div><font face="'courier new', monospace"><br></font></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><div><font face="'courier new', monospace"><div>
streamUtf8 :: OnDecodeError -> ByteString -> Stream Char</div></font></div></div><div><div><font face="'courier new', monospace"><div>streamUtf8 onErr bs = Stream next 0 (maxSize l)</div>
</font></div></div><div><div><font face="'courier new', monospace"><div> where</div></font></div></div><div><div><font face="'courier new', monospace"><div>
l = B.length bs</div></font></div></div><div><div><font face="'courier new', monospace"><div> next i</div></font></div></div><div><div><font face="'courier new', monospace"><div>
| i >= l = Done</div></font></div></div><div><div><font face="'courier new', monospace"><div> | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1)</div></font></div>
</div><div><div><font face="'courier new', monospace"><div> | {- etc. -}</div></font></div></div><div><div><font face="'courier new', monospace"><div>
<div>{-# INLINE [0] streamUtf8 #-}</div></div></font></div></div></blockquote><div><div><font face="'courier new', monospace"><div><div><br></div><div><font face="arial, helvetica, sans-serif">The values being </font>Yield<font face="arial, helvetica, sans-serif">ed from the inner function are, as you can see, themselves constructed by functions.</font></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">Originally, with the inner </font>next<font face="arial, helvetica, sans-serif"> function manually marked as </font>INLINE<font face="arial, helvetica, sans-serif">, I found that functions like </font>unsafeChr8<font face="arial, helvetica, sans-serif"> were not being inlined by GHC, and performance was terrible due to the amount of boxing and unboxing happening in the inner loop.</font></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">I somehow stumbled on the idea of removing the </font><font>INLINE</font><font face="arial, helvetica, sans-serif"> annotation from </font><font>next</font><font face="arial, helvetica, sans-serif">, and performance suddenly improved by a significant integer multiple. With the annotation gone, the body of </font><font>streamUtf8</font><font face="arial, helvetica, sans-serif"> was inlined into my test program, as I had hoped.</font></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">However, I wasn't yet out of the woods. The </font><font>length</font><font face="arial, helvetica, sans-serif"> function is defined as follows:</font></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div></div></font></div></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div><font face="'courier new', monospace"><div>
<div><font face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace">length :: Text -> Int</font></div></font></div></div></font></div></div><div><div><font><div><div>
<font><div><font class="Apple-style-span" face="'courier new', monospace">length t = Stream.length (Stream.stream t)</font></div></font></div></div></font></div></div><div><div><font face="'courier new', monospace"><div>
<div><font face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace">{-# INLINE length #-}</font></div><div><font class="Apple-style-span" face="'courier new', monospace"><br>
</font></div></font></div></div></font></div></div></blockquote><font class="Apple-style-span" face="arial, helvetica, sans-serif">And the streaming </font><font class="Apple-style-span" face="'courier new', monospace">length</font><font class="Apple-style-span" face="arial, helvetica, sans-serif"> is:</font>
<div><font class="Apple-style-span" face="'courier new', monospace"><br></font></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><font class="Apple-style-span" face="'courier new', monospace"><div>
length :: Stream Char -> Int</div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>length = S.lengthI</div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
{-# INLINE[1] length #-}</div></font></div></blockquote><div><font class="Apple-style-span" face="'courier new', monospace"><div><br></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif">And the </font>lengthI<font class="Apple-style-span" face="arial, helvetica, sans-serif"> function is defined more generally, in the hope that I could use it for both </font>Int<font class="Apple-style-span" face="arial, helvetica, sans-serif"> and </font>Int64<font class="Apple-style-span" face="arial, helvetica, sans-serif"> lengths:</font></div>
<div><br></div></font></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><font class="Apple-style-span" face="'courier new', monospace"><div><div>lengthI :: Integral a => Stream Char -> a</div>
</div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div>lengthI (Stream next s0 _len) = loop_length 0 s0</div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div> where</div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div> loop_length !z s = case next s of</div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div> Done -> z</div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div> Skip s' -> loop_length z s'</div>
</div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div> Yield _ s' -> loop_length (z + 1) s'</div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div>{-# INLINE[0] lengthI #-}</div></div></font></div></blockquote><div><font class="Apple-style-span" face="'courier new', monospace"><div><div><br></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif">Unfortunately, although </font>lengthI<font class="Apple-style-span" face="arial, helvetica, sans-serif"> is inlined into the </font>Int<font class="Apple-style-span" face="arial, helvetica, sans-serif">-typed streaming </font>length<font class="Apple-style-span" face="arial, helvetica, sans-serif"> function, that function is not in turn marked with __inline_me in simplifier output, so the length/decodeUtf8 loops do not fuse. The code is pretty fast, but there's still a lot of boxing and unboxing happening for all the </font>Yield<font class="Apple-style-span" face="arial, helvetica, sans-serif">s.</font></div>
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><br></font></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif">So. I am quite baffled by this, and I confess to having no idea what to do to get the remaining functions to fuse. But that's not quite confusing enough! Here's a one-byte change to my test code:</font></div>
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><br></font></div></div></font></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace">main = do</font></div></font></div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace"> args <- getArgs</font></div></font></div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div>
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace"> forM_ args $ \a -> do</font></div></font></div></div></font></div><div>
<font class="Apple-style-span" face="'courier new', monospace"><div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace"> s <- B.readFile a</font></div>
</font></div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace"> let !t = T.decodeUtf8 s <i><font class="Apple-style-span" color="#CC0000">{- <-- notice the strictness annotation -}</font></i></font></div>
</font></div></div></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><font class="Apple-style-span" face="'courier new', monospace"> print (T.length t)</font></div>
</font></div></div></font></div></blockquote><div><font class="Apple-style-span" face="'courier new', monospace"><div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><div><br></div><div>In principle, this should make the code a little slower, because I'm deliberately forcing a <font class="Apple-style-span" face="'courier new', monospace">Text</font> value to be created, instead of allowing stream/unstream fusion to occur. Now the <font class="Apple-style-span" face="'courier new', monospace">length</font> function seems to get inlined properly, but while the <font class="Apple-style-span" face="'courier new', monospace">decodeUtf8</font> function is inlined, the functions in its inner loop that must be inlined for performance purposes are not. The result is very slow code.</div>
<div><br></div><div>For this same test, I found one more place where removing a single <font class="Apple-style-span" face="'courier new', monospace">INLINE</font> annotation makes the strict version above 2x faster, but that change stops the stream/unstream fusion rule from firing at all, so the strictness annotation no longer makes any difference to performance.</div>
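<div><br></div><div>To make the terminology concrete, here is a toy sketch of what I mean by stream/unstream fusion. It is not the text library's code: it works over plain lists, drops the size hint, and every name in it (mapS, lengthS, mapL, lengthL, and the rule name) is made up for illustration. The point is only that if the public functions are inlined early enough, the rule can cancel the intermediate stream (unstream ...) pair, so that the map and the length ought to run as a single loop.</div>

```haskell
{-# LANGUAGE BangPatterns, ExistentialQuantification #-}

-- Toy stream type. The real one in text also carries a size hint.
data Step s a = Done | Skip s | Yield a s

data Stream a = forall s. Stream (s -> Step s a) s

-- Hypothetical stand-ins for the library's stream/unstream, over plain lists.
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs
{-# INLINE [0] stream #-}

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'
{-# INLINE [0] unstream #-}

-- The fusion rule: building a list and immediately re-streaming it is a no-op.
{-# RULES "toy stream/unstream" forall s. stream (unstream s) = s #-}

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'
{-# INLINE [0] mapS #-}

-- Same shape as lengthI above, specialised to Int.
lengthS :: Stream a -> Int
lengthS (Stream next s0) = loop 0 s0
  where
    loop !z s = case next s of
      Done       -> z
      Skip s'    -> loop z s'
      Yield _ s' -> loop (z + 1) s'
{-# INLINE [0] lengthS #-}

-- Public wrappers: these must inline first so the rule can see
-- stream (unstream ...) in the caller.
mapL :: (a -> b) -> [a] -> [b]
mapL f = unstream . mapS f . stream
{-# INLINE mapL #-}

lengthL :: [a] -> Int
lengthL = lengthS . stream
{-# INLINE lengthL #-}

main :: IO ()
main = print (lengthL (mapL succ "hello"))
```

<div>This prints 5 with or without optimisation; only the generated code differs. Whether the rule actually fires depends on lengthL and mapL being inlined before stream and unstream are, which is exactly the kind of phase ordering that the INLINE [n] annotations above are trying to control.</div>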
<div><br></div><div>All of these flip-flops in inliner behaviour are very difficult to understand, and they seem to be exceedingly fragile. Should I expect the situation to be better with the new inliner in 6.12?</div><div>
<br></div><div>Thanks for bearing with that rather long narrative,</div><div>Bryan.</div></font></div></div></font></div>