The tale of keepAlive#

bgamari - 2021-06-08

In GHC 9.0 we have reworked the primops used for guaranteeing heap object lifetime in the presence of off-heap references. The result is a new primop, keepAlive#, which largely replaces the venerable touch# operation. This post will give some background on touch#, describe the rationale for this change, discuss some of the alternatives that were evaluated on the way to this new design, and provide some guidance on how this change affects users.

Users who merely want to know how to adapt their existing touch# uses for the new keepAlive# should feel free to skip to the last section of the post.

The challenge of foreign references

One of the joys of working in a garbage collected language is that one is freed from the burden of thinking about object lifetime. Instead, it is the responsibility of the runtime to retain each object as long as it is reachable from the roots of the heap graph.

Moreover, one of the great joys of Haskell is its robust foreign function interface, giving users the ability to easily weave together Haskell and foreign code in a safe and efficient manner.

However, taken together these two features pose significant challenges to language implementors; specifically, consider a program such as:

-- Recall that:
--   newPinnedByteArray       :: Int -> IO (MutableByteArray RealWorld)
--   setByteArray             :: Prim a => MutableByteArray RealWorld -> Int -> Int -> a -> IO ()
--   mutableByteArrayContents :: MutableByteArray RealWorld -> Ptr a
import Data.Primitive.ByteArray

main = printDots 10

printDots :: Int -> IO ()
printDots len = do
    arr <- newPinnedByteArray len
    setByteArray arr 0 len (fromIntegral $ ord '.' :: Word8)
    let ptr :: Ptr CChar
        ptr = byteArrayContents arr
    puts ptr

-- Recall that:
--   void puts(const char *);
foreign import ccall "puts"
    puts :: Ptr CChar -> IO ()

Here we have allocated an array on the Haskell heap, initialized it, and passed it to the C puts function, which will print it to stdout. On its face, this program looks completely innocuous: it contains no mentions of unsafe functions, the array’s bounds are clearly respected, and puts is specified to treat its argument as immutable. Indeed, running the program even prints the expected result (at least most of the time).

However, despite its appearance, this program is fatally flawed. Specifically, note that puts is (implicitly) declared to be a safe foreign call; this means that GHC is free to do as it pleases (e.g. continue running another Haskell thread or, perhaps, initiate a GC) while the C call executes in another operating system thread. This is problematic since upon entering puts there are no references to arr remaining on the Haskell heap, leading the collector to conclude that the array can be safely reclaimed. However, arr is not dead: it is still in use via the off-heap reference passed to puts. However, this reference is not visible to the garbage collector.

Premature collection of live heap objects may result in arbitrary heap corruption, which may cause a program crash at best and incorrect runtime result at worst. Also note that while FFI is the most common reason for off-heap references, it is certainly not the only reason. In particular, any use of byteArrayContents is potentially problematic in this way.

To make such off-heap references visible to the runtime, GHC has long provided the touch operation (embodied by the GHC.Exts.touch# primop):

touch :: forall a. a -> IO ()

Semantically (and, mostly, operationally) the action touch x is a no-op; it merely instructs GHC to ensure that the given value is alive at the point when the action is executed. Using this operation we can fix printDots as follows:

printDots len = 
    arr <- newPinnedByteArray len
    setByteArray arr 0 len (fromIntegral $ ord '.' :: Word8)
    puts (byteArrayContents arr)
    touch arr

Here we have added a touch after the puts, ensuring that arr is retained at least until the call has returned.

To a functional programmer, touch may smell a bit odd, merely serving to introduce a side-effect. By contrast, most idiomatic Haskell libraries rather express resource lifetimes using the “with” pattern. That is, rather than write:

hdl <- openFile "hello.txt"
hPutStrLn hdl "hello world!"
hClose hdl

We prefer to write: withFile:

withFile "hello.txt" $ \hdl ->
    hPutStrLn hdl "hello world!"

as the latter makes the lifetime of the resource apparent from the program’s syntactic structure alone. In the same way, users have historically been advised to stay away from using touch and its variants directly. For instance, while Foreign.ForeignPtr provides a touchForeignPtr operation the documentation strong recommends that the user rather use withForeignPtr.

While it is rarely seen in user code, the touch# primop has served the Haskell community well for many years: a handful of occurrences in base, bytestring, and other foundational libraries ensured that garbage colection and foreign calls could peacefully coexist. However, as we will see, the design of the primop is ripe for unsoundness.

Aside: A bit of simplifier background

Over the years GHC’s simplifier (the GHC subsystem responsible for optimising programs in its Core intermediate representation) has consistently grown in its abilities. By GHC 8.2 we started to come to the realization that all was not well with touch#. To see why, let us examine a transform performed by GHC’s simplifier which I will call dead-continuation elimination (DCE).

Consider that you have a program such as:

doThings =
    case error "hello world" :: Int of
      0  ->  {- some large expression -}
      1  ->  {- another large expression -}
      ...

Here we have a program with a divergent expression (namely the error application) scrutinized by a case with a number of large alternatives. Under Haskell’s semantics, it is impossible for program execution to reach any of these alternatives’ right-hand-sides.

By the dead-continuation elimination transform, GHC’s simplifier will cull this unreachable code, turning the program into:

doThings =
    case error "hello world" :: Int of { }

This optimisation can significantly reduce program sizes. Note that exceptions are only one source of divergence: any non-terminating scrutinee can trigger this transformation.

The shortcomings of touch#

With this transformation in mind, consider a trivial modification to the original printDots program which loops the puts call infinitely:

printManyDots :: Int -> IO ()
printManyDots len = do
    arr <- newPinnedByteArray len
    setByteArray arr 0 len (fromIntegral $ ord '.' :: Word8)
    forever $ puts (byteArrayContents arr)
    touch arr

At its face, this modification looks benign enough; indeed, many server applications take a very similar form (replacing puts with some request handling logic). However in #14346 we noticed that the simplifier could in some cases determine that the forever $ ... expression would diverge and consequently that the touch arr continuation was unreachable. By the DCE transform, this continuation could be dropped. Without the touch application, the garbage collector may prematurely collect arr, resulting in unsoundness in an otherwise reasonable-looking program.

In light of this, it becomes clear that the design of touch# is not compatible with DCE (which is otherwise a perfectly sound transform under Haskell’s semantics): to faithfully serve its purpose, touch# must be retained even if it could never be executed.

As there are relatively few uses of touch#, we tried for a while to paper over this issue with the judicious use of NOINLINE to hide the divergence from GHC’s sights. However, between 8.2 and 8.10 we encountered at least six other user reports of unsoundness due to various manifestations of this issue in various base interfaces (#15260, #14403, #15544, #13707, #14329, #18061). For this reason, last year we resolved to address the issue in the 9.0 release, marking the beginning of a nearly year-long journey towards a possible replacement which lead us through nearly a dozen design and implementation variants.

Finding an alternative design

At first glance, the design of a safer touch# seems obvious by analogy to withFile:

with :: a -> (a -> m b) -> m b

Under such a design with x f would evaluate to f x, ensuring that x is kept alive (at least) until evaluation finishes.

While this interface is simple and pleasingly similar to patterns which Haskellers expect, it does not make for a good primop design. In particular, touch# incurs nearly no runtime overhead (it doesn’t even emit code of its own) and generally occurs in performance-critical, low-level code. Consequently, it is important to minimize the performance overhead imposed by our new primitive. However, the type given to with above is extremely restrictive to the optimisations available to the simplifier.

For instance, consider a function like,

readWord8_touch :: ForeignPtr Word8 -> IO Word8
readWord8_touch fptr = do
    n <- peek (unsafeForeignPtrToPtr fptr)
    touch fptr
    return n

After a bit of simplification this will become:

readWord8_touch :: ForeignPtr Word8 -> IO Word8
readWord8_touch fptr = IO (\s0 ->
    case readWord8# (unsafeForeignPtrToPtr fptr) s0 of (# s1, n #) ->
    case touch# fptr s1 of s2 -> 
      (# s2, W8# n #)

This function will benefit from GHC’s worker/wrapper transform, resulting in a pair of Core bindings1:

-- The worker:
worker_readWord8_touch :: ForeignPtr Word8 -> State# RealWorld -> (# State# RealWorld, Word8# #)
worker_readWord8_touch fptr s0 =
    case peek (unsafeForeignPtrToPtr fptr) s0 of (# s0, n #) ->
    case touch# fptr s0 of s1 ->
      (# s1, n #)

-- The wrapper:
readWord8_touch :: ForeignPtr Word8 -> IO Word8
readWord8_touch fptr = IO $ \s ->
    case worker_readWord fptr s of (# s1, n #) ->
      (# s1, W8# n #)
{-# INLINE readWord8_touch #-}

The readWord8_touch wrapper can then be inlined into call sites, allowing elimination of the W8# allocation (via case-of-known-constructor) at call-sites which strictly consume the result.

However, if we rewrite readWord8 using with we will rather get Core of the form:

readWord8_with :: ForeignPtr Word8 -> IO Word8
readWord8_with fptr = with fptr $ \fptr' -> IO $ \s0 ->
    case peek (unsafeForeignPtrToPtr fptr') s0 of (# s1, n #) ->
      (# s1, W8# n #)

Unlike readWord8_touch, this right-hand side does not have the constructed-product property and therefore can not benefit from unboxing by the worker/wrapper transform. Consequently, readWord8_with will allocate a W8# result on every call. Empirically we found that these added allocations significantly regress runtime performance of many, if not most, Haskell programs. GHC in particular regressed in compile-time by nearly 10% in some cases.

Recovering the CPR property?

At first appearance, there is no reason why GHC could not learn to transform programs defined in terms of with to retain the CPR property. In particular, this would require three changes:

  1. since worker/wrapper may produce results of arbitrary RuntimeRep we must generalize the type of with:

    with# :: forall a (r :: RuntimeRep) (b :: TYPE r).
             a -> (a -> b) -> b
  2. use State# tokens to enforce execution ordering:

    with# :: forall a (r :: RuntimeRep) (b :: TYPE r).
             a -> State# s -> (State# s -> b) -> b
  3. teaching the simplifier to “push” strict contexts (e.g. case) into with# applications. For instance, transform

    case (with# x s (\s' -> k)) of alt -> rhs

    into

    with# x s (\s' -> case k of alt -> rhs)

    This transformation, which we call “strict context pushing” (SCP), is similar to a transform already performed on runRW# applications (#15127). In this case of with#, it is sound as it strictly grows the scope of the continuation and therefore lengthens the lifetime of the kept-alive value.

This SCP transformation allows us to eliminate the allocation incurred in the case of readWord8 by way of inlining and simplification. That is, consider a strict call-site of readWord8:

IO $ \s0 ->
  case readWord8 fptr s0 of (# s1, n #) ->
  case n of W8# n# ->
    doSomething n#

First, we inline readWord8, resulting in:

IO $ \s0 ->
  case (
    with# fptr s0 (\s1 ->
      case peek (unsafeForeignPtrToPtr fptr') s1 of (# s2, n# #) ->
        (# s2, W8# n# #)
    )
  ) of (# s3, n #) ->
  case n of W8# n# ->
    doSomething n#

Now, by the SCP transform above, the outer case can be pushed inside of the with#:

IO $ \s0 ->
  with# fptr s0 (\s1 ->
    case (
      case peek (unsafeForeignPtrToPtr fptr') s1 of (# s2, n# #) ->
        (# s2, W8# n# #)
    ) of (# s3, n #) ->
    case n of W8# n# ->
      doSomething n#
  )

Finally, case-of-known constructor can be applied twice, eliminating the W8# allocation:

IO $ \s0 ->
  with# fptr s0 (\s1 ->
    case peek (unsafeForeignPtrToPtr fptr') s1 of (# s2, n# #) ->
      doSomething n#
  )

Hooray!

The trouble with SCP

At first glance, this SCP transform looks like a promising way to eliminate a major source of overhead imposed by with#. Unfortunately, testing quickly revealed that matters are not so simple. For instance, consider a loop which uses readWord8 to sum the elements of an array:

sumArray :: ForeignPtr Word8    -- ^ an array of Word8s
         -> Int                 -- ^ array length
         -> IO Word8
sumArray fptr n = IO (go 0# n)
  where
    go :: Word8# -> Int -> State# RealWorld -> (# State# RealWorld, Word8 #)
    go accum 0 s0 = (# s0, W8# accum #)
    go accum i s0 =
      case readWord8 (fptr `plusForeignPtr` i) s0 of (# s1, n #) ->
      case n of W8# n# ->
        go (accum +# n#) (i-1) s1

As written, this program will run in constant space (modulo short-lived garbage W8# constructors). However, consider what happens if we inline readWord8#, apply the SCP transform, and simplify in the second equation of go:

    go accum i s0 =
      with# (fptr `plusForeignPtr` i) s0 (\s1 ->
        case peek (unsafeForeignPtrToPtr (fptr `plusForeignPtr` i)) s1 of (# s2, n# #) ->
          go (accum +# n#) (i-1) s1

While we have successfully eliminated the garbage W8# allocations, the recursive go call (previously a tail-call) is now beneath a with# application. Since most implementations of with# will have the operational effect of pushing a frame to the evaluation stack, this transforms the otherwise constant-space sumArray into an O(n) program. This is clearly not acceptable.

Unfortunately, this problem is quite difficult to avoid. While one can think of placing various heuristic restrictions on the contexts where SCP is applied, there is no obviously-correct, local criterion which robostly preserves CPR while avoiding stack blow-up.

While we found no way to make SCP work, the modifications we made to with#’s type at the beginning of this section are desireable regardless. That is:

  • with# must be polymorphic in the RuntimeRep of its result lest the user is unable to use it to compute unboxed results.
  • the use of State# tokens is a good hint to the user that they must take care to control execution ordering to safely use the operation

Consequently, we will adopt the following type for with#:

with# :: forall a (r :: RuntimeRep) (b :: TYPE r).
        a -> State# s -> (State# s -> b) -> b

Generating code for with#

Above we considered some of the design decisions surrounding the user-facing interface of with#. However, the interface is only half of the story: the compiler must also know how to generate code for with#.

Thankfully, doing so is (currently) reasonably straightforward as GHC’s DCE optimisation strictly occurs on Core. For this reason, we can safely desugar with# into touch# by inlining after the Core-to-Core optimisation pipeline has run (namely in the CorePrep stage of compilation). Specifically, an application:

-- Recall that
--   with# :: forall a (r :: RuntimeRep) (b :: TYPE r).
--            a -> State# s -> (State# s -> b) -> b

with# x s0 (\s0 -> k)

can be rewritten to

case k s0 of r ->
case touch# x s0 of s1 ->
  r

Admittedly, this rewrite is a bit suspicious as we there is nothing ensuring the ordering between the case k s and the touch# (since they are both applied to the State# token s). However, this is a pragmatic choice: GHC’s STG pipeline currently performs no optimisation which could reorder these expressions and forcing with# to return a (# State# s, b #) would preclude its use in some pure contexts.

On the bright side, the fact that we can reuse touch# means that we can provide with# with minimal overhead and no further changes to the backend.

keepAlive# in GHC 9.0

While we originally set out to supercede (and ultimately remove) touch#, the issues described above (and others described in the Wiki) made it clear that this plan was not feasible: what touch# lacks in robustness and aesthetics, it makes up for in efficiency and ease-of-optimisation. Moreover, there are occassionally cases which do not fit with the lexically-scoped semantics of with#.

Consequently, in GHC 9.0 we opted for a pragmatic solution: continue to allow use of touch# where necessary but prefer to use the safer with-style where possible. GHC 9.0.1 exposes this operation as keepAlive# (the kinds in the type below are slightly different than those present in 9.0, referring to BoxedRep introduced in 9.2 for added precision):

keepAlive# :: forall (lev :: Levity) (a :: TYPE (BoxedRep lev))
                     (rep :: RuntimeRep) (b :: TYPE rep) s.
              a
           -> State# s
           -> (State# s -> b)
           -> b

What implications does this have on user code? For most users, none: nearly all uses of touch# are buried in low-level libraries which have already been adapted by the GHC maintainers.

If you do for some reason use touch#, we advise that you first consider whether it is possible to replace the usage with a higher-level, less fragile abstraction: in our experience, there are few cases where touch# is necessary which would not be equally-well served by, e.g., ForeignPtr or a higher-level scoped allocation primitive.

If you have determined that the low-level primitive is indeed unavoidable then the next question is whether touch# should be replaced with keepAlive#. In the general case this unfortunately comes down to a safety/performance tradeoff. Specifically, let us consider the example of the withForeignPtr operation exposed by base. When written in terms of touch this operation is defined as:

withForeignPtr :: ForeignPtr a -> (Ptr a -> IO b) -> IO b
withForeignPtr fptr k = do
    k (unsafeForeignPtrToPtr fptr)
    touch fptr

While this can produce very efficient code, it is also deeply unsafe as the user may provide a divergent continuation (e.g. withForeignPtr fptr (forever ...)). For the reasons described above, this can result in the dropping of the touch#, resulting in unsoundness.

To avoid this, we should rather express withForeignPtr in terms of keepAlive#:

withForeignPtr :: ForeignPtr a -> (Ptr a -> IO b) -> IO b
withForeignPtr fptr k = IO $ \s0 -> 
    keepAlive# fptr s0 (unIO $ k (unsafeForeignPtrToPtr fptr))

This avoids unsoundness at the expense of needing to allocate the b result of the continuation. This is the implementation provided with base-4.15.

Introducing unsafeWithForeignPtr

While the introduction of keepAlive# affects relatively little user-code, the added cost of withForeignPtr has the potential to be more impactful. In particular, it is not uncommon to find code such as:

withForeignPtr fptr (peek @Word8)

in tight inner-loops. Adding an allocation in such a context would be catestrophic for performance and consequently GHC.ForeignPtr exports the old, touch#-based implementation of withForeignPtr as unsafeWithForeignPtr. This allows performance sensitive code to avoid boxing when it is known that the continuation is not certain to diverge. Since this operation is unsafe in the presence of known-divergent continuations, users are strong encouraged to use this operation sparingly and only in cases where the continuation is obviously not divergent.

Guidance for earlier GHC releases

As noted, use of touch# in previous GHC versions is a bit perilous unless due care is taken. Let’s consider an example:

foo :: ByteArray -> IO ()
foo x = do
    doSomething x                        -- (a)
    ...
    doForeignCall (byteArrayContents x)  -- (b)
    ...
    touch x                              -- (c)

For this program to be safe, the user must guarantee that no subexpression in the body of foo between (a) (the point that x potentially becomes unreachable) and (c) (the touch, which we must ensure DCE does not drop) is provably divergent. In the case that all of the ...s are statically known it is possible to determine this by inspection.

However, things are harder if the function takes a user-provided continuation a la withForeignPtr:

withForeignPtr :: ForeignPtr a -> (Ptr a -> IO b) -> IO b
withForeignPtr fptr k = do
    k (unsafeForeignPtrToPtr fptr)
    touch fptr

In this case the only reliable way to avoid unsoundness is to place a NOINLINE on withForeignPtr. This deprives the simplifier of knowledge of k, ensuring that it cannot conclude that touch is unreachable.

Conclusion

Here we have described a bit of the story that lead to keepAlive#. While the full story is significantly longer (and ultimately lead to the three-month deferral of the 9.0.1 release), the result is a primop which avoids a class of pernicious soundness issues which had lurked in previous releases.

If you have any questions regarding touch# and keepAlive#, do be in touch on the ghc-devs mailing list.


  1. To avoid cluttering the code too much, we are playing a bit fast-and-loose with the IO newtype here. For instance, we treat peek as a function of type State# RealWorld -> (# State# RealWorld, a #) (i.e. the representation of IO a)) instead of IO a. We will continue this pragmatic abuse of notation below.↩︎