Proposal: provide cas and barriers symbols even without -threaded

Sat Jul 20 08:37:15 CEST 2013

also: HOLY CRAP THATS AWESOME  performance :)

(i'll be wanting to do some cache aware parallel work stealing in the near
future, so this is really really handy for me)

On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton <rrnewton at gmail.com> wrote:

> Hi Carter,
>
> Yes, SMP.h is where I've copy pasted the duplicate functionality from
> (since I can't presently rely on linking the symbols).
>
> Your proposal for the LLVM backend sounds **great**.  But it also is
> going to provide additional constraints for getting "atomic-primops" right.
>
>    The goal of atomic-primops is to be a stable Haskell-level interface
> into the relevant CAS and fetch-and-add stuff.  The reason this is
> important is that one has to be very careful to defeat the GHC optimizer in
> all the relevant places and make pointer equality a reliable property.  I
> would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and
> have more "native" support in future GHC releases, where maybe the foreign
> primops would become unecessary.  (They are a pain and have already exposed
> one blocking cabal bug, fixed in upcoming 1.17.)
>
> A couple additional suggestions for the proposal in ticket #7883:
>
>    - we should use more unique symbols than "cas", especially for this
>    rewriting trick.  How about "ghc_cas" or something?
>    - it would be great to get at least fetch-and-add in addition to CAS
>    and barriers
>    - if we reliably provide this set of special symbols, libraries like
>    atomic-primops may use them in the .cmm and benefit from the CMM->LLVM
>    substitutions
>    - if we include all the primops I need in GHC proper the previous
>    bullet will stop applying ;-)
>
> Cheers,
>   -Ryan
>
> P.S. Just as a bit of motivation, here are some recent performance
> numbers.  We often wonder about how close our "pure values in a box"
> approach comes to efficient lock-free structures.  Well here are some
> numbers about using a proper unboxed counter in the Haskell heap, vs using
> an IORef Int and atomicModifyIORef':  Up to 100X performance difference
> on some platforms for microbenchmarks that hammer a counter:
>
>
> https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops
>
> And here are the performance and scaling advantages of using ChaseLev
> (based on atomic-primops), over a traditional pure-in-a-box structure
> (IORef Data.Seq). The following are timings of ChaseLev/traditional
> respectively on a 32 core westmere:
>
>     fib(42) 1 threads:  21s
>     fib(42) 2 threads:  10.1s
>     fib(42) 4 threads:  5.2s (100%prod)
>     fib(42) 8 threads:  2.7s - 3.2s (100%prod)
>     fib(42) 16 threads: 1.28s
>     fib(42) 24 threads: 1.85s
>     fib(42) 32 threads: 4.8s (high variance)
>
>     (hive) fib(42) 1 threads:  41.8s  (95% prod)
>     (hive) fib(42) 2 threads:  25.2s  (66% prod)
>     (hive) fib(42) 4 threads:  14.6s  (27% prod, 135GB alloc)
>     (hive) fib(42) 8 threads:  17.1s  (26% prod)
>     (hive) fib(42) 16 threads: 16.3s  (13% prod)
>     (hive) fib(42) 24 threads: 21.2s  (30% prod)
>     (hive) fib(42) 32 threads: 29.3s  (33% prod)
>
> And that is WITH the inefficiency of doing a "ccall" on every single
> atomic operation.
>
> Notes on parfib performance are here:
>
>
> https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158
>
>
>
>
>
>
>
> On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald <
> carter.schonwald at gmail.com> wrote:
>
>> ryan, the relevant machinery on the C side is here, see
>> ./includes/stg/SMP.h :
>> https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h
>>
>> (unless i'm missing something)
>>
>>
>> On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald <
>> carter.schonwald at gmail.com> wrote:
>>
>>> Ryan,
>>> if you look at line 270, you'll see the CAS is a C call
>>> https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270
>>>
>>>
>>> What Simon is alluding to is some work I started (but need to finish)
>>> http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and
>>> I'll need to sort out doing the same on the native code gen too
>>>
>>> there ARE no write barrier primops, they're baked into the CAS machinery
>>> in ghc's rts
>>>
>>>
>>> On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton <rrnewton at gmail.com> wrote:
>>>
>>>> Yes, I'd absolutely rather not suffer C call overhead for these
>>>> functions (or the CAS functions).  But isn't that how it's done currently
>>>> for the casMutVar# primop?
>>>>
>>>>
>>>> https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265
>>>>
>>>> To avoid the overhead, is it necessary to make each primop in-line
>>>> rather than out-of-line, or just to get rid of the "ccall"?
>>>>
>>>> Another reason it would be good to package these with GHC is that I'm
>>>> having trouble building robust libraries of foreign primops that work under
>>>> all "ways" (e.g. GHCI).  For example, this bug:
>>>>
>>>>     https://github.com/rrnewton/haskell-lockfree-queue/issues/10
>>>>
>>>> If I write .cmm code that depends on RTS functionality like
>>>> stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode
>>>> (with/without threading, profiling), but I get link errors from GHCI where
>>>> these symbols aren't defined.
>>>>
>>>> I've got a draft of the relevant primops here:
>>>>
>>>>
>>>> https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops/cbits/primops.cmm
>>>>
>>>> Which includes:
>>>>
>>>>    - variants of CAS for MutableArray# and MutableByteArray#
>>>>    - fetch-and-add for MutableByteArray#
>>>>
>>>> Also, there are some tweaks to support the new "ticketed" interface for
>>>> safer CAS:
>>>>
>>>>
>>>> http://hackage.haskell.org/packages/archive/atomic-primops/0.3/doc/html/Data-Atomics.html#g:3
>>>>
>>>> I started adding some of these primops to GHC proper (still as
>>>> out-of-line), but not all of them.  I had gone with the foreign primop
>>>> route instead...
>>>>
>>>>    https://github.com/rrnewton/ghc/commits/master
>>>>
>>>>   -Ryan
>>>>
>>>> P.S. Where is the write barrier primop?  I don't see it listed in
>>>> prelude/primops.txt...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 19, 2013 at 11:41 AM, Carter Schonwald <
>>>> carter.schonwald at gmail.com> wrote:
>>>>
>>>>> I guess I should find the time to finish the CAS primop work I
>>>>> volunteered to do then. Ill look into in a few days.
>>>>>
>>>>>
>>>>> On Friday, July 19, 2013, Simon Marlow wrote:
>>>>>
>>>>>> On 18/07/13 14:17, Ryan Newton wrote:
>>>>>>
>>>>>>> The "atomic-primops" library depends on symbols such as
>>>>>>> store_load_barrier and "cas", which are defined in SMP.h.  Thus the
>>>>>>> result is that if the program is linked WITHOUT "-threaded", the user
>>>>>>> gets a linker error about undefined symbols.
>>>>>>>
>>>>>>> The specific place it's used is in the 'foreign "C"' bits of this
>>>>>>> .cmm code:
>>>>>>>
>>>>>>> https://github.com/rrnewton/**haskell-lockfree-queue/blob/**
>>>>>>> 87e63b21b2a6c375e93c30b98c28c1**d04f88781c/AtomicPrimops/**
>>>>>>> cbits/primops.cmm<https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c30b98c28c1d04f88781c/AtomicPrimops/cbits/primops.cmm>
>>>>>>>
>>>>>>> I'm trying to explore hacks that will enable me to pull in those
>>>>>>> functions during compile time, without duplicating a whole bunch of
>>>>>>> code
>>>>>>> from the RTS.  But it's a fragile business.
>>>>>>>
>>>>>>> It seems to me that some of these routines have general utility.  In
>>>>>>> future versions of GHC, could we consider linking in those routines
>>>>>>> irrespective of "-threaded"?
>>>>>>>
>>>>>>
>>>>>> We should make the non-THREADED versions EXTERN_INLINE too, so that
>>>>>> there will be (empty) functions to call in rts/Inlines.c.  Want to submit a
>>>>>> patch?
>>>>>>
>>>>>> A better solution would be to make them into primops.  You don't
>>>>>> really want to be calling out to a C function to implement a memory
>>>>>> barrier. We have this for write_barrier(), but none of the others so far.
>>>>>>  Of couse that's a larger change.
>>>>>>
>>>>>> Cheers,
>>>>>>         Simon
>>>>>>
>>>>>>
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> ghc-devs mailing list
>>>>>> ghc-devs at haskell.org
>>>>>> http://www.haskell.org/**mailman/listinfo/ghc-devs<http://www.haskell.org/mailman/listinfo/ghc-devs>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/ghc-devs/attachments/20130720/a0e51432/attachment.htm>