I suspect it's not by design, because there are certainly plans to make them inline primops, and the reordering issue in the Cmm optimizer hasn't come up in previous design discussions. (And I should add those notes to the associated tickets.)<br>

<br>On Tuesday, December 31, 2013, Edward Z. Yang wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I was thinking about my response, and realized there was one major<br>


misleading thing in my description.  The load reordering I described<br>

applies to load instructions in C-- proper, i.e. things that show up<br>

in the C-- dump as:<br>

<br>

    W_ x = I64[...addr...]<br>

<br>

Reads to IORefs and reads to vectors get compiled inline (as they<br>

eventually translate into inline primops), so my admonitions are<br>

applicable.<br>

<br>

However, the story with *foreign primops* (which is how loadLoadBarrier<br>

in atomic-primops is defined, and how you might imagine defining a custom<br>

read function as a primop) is a little different.  First, what does a<br>

call to a foreign primop compile into? It is *not* inlined, so it will<br>

eventually get compiled into a jump (this could be a problem if you're<br>

really trying to squeeze out performance!)  Second, the optimizer is a<br>

bit more conservative when it comes to primop calls (internally referred<br>

to as "unsafe foreign calls"); at the moment, the optimizer assumes<br>

these foreign calls clobber heap memory, so we *automatically* will not<br>

push loads/stores beyond this boundary. (NB: We reserve the right to<br>

change this in the future!)<br>

<br>

This is probably why atomic-primops, as it is written today, seems to<br>

work OK, even in the presence of the optimizer.  But I also have a hard<br>

time believing it gives the speedups you want, due to the current<br>

design. (CC'd Ryan Newton, because I would love to be wrong here, and<br>

maybe he can correct me on this note.)<br>

<br>

Cheers,<br>

Edward<br>

<br>

P.S. loadLoadBarrier compiles to a no-op on x86 architectures, but<br>

because it's not inlined I think you will still end up with a jump (LLVM<br>

might be able to eliminate it).<br>

<br>

Excerpts from John Lato's message of 2013-12-31 03:01:58 +0800:<br>

> Hi Edward,<br>

><br>

> Thanks very much for this reply, it answers a lot of questions I'd had.<br>

>  I'd hoped that ordering would be preserved through C--, but c'est la vie.<br>

>  Optimizing compilers are ever the bane of concurrent algorithms!<br>

><br>

> stg/SMP.h does define a loadLoadBarrier, which is exposed in Ryan Newton's<br>

> atomic-primops package.  From the docs, I think that's a general read<br>

> barrier, and should do what I want.  Assuming it works properly, of course.<br>

>  If I'm lucky it might even be optimized out.<br>

><br>

> Thanks,<br>

> John<br>

><br>

> On Mon, Dec 30, 2013 at 6:04 AM, Edward Z. Yang <<a>ezyang@mit.edu</a>> wrote:<br>

><br>

> > Hello John,<br>

> ><br>

> > Here are some prior discussions (which I will attempt to summarize<br>

> > below):<br>

> ><br>

> >     <a href="http://www.haskell.org/pipermail/haskell-cafe/2011-May/091878.html" target="_blank">http://www.haskell.org/pipermail/haskell-cafe/2011-May/091878.html</a><br>

> >     <a href="http://www.haskell.org/pipermail/haskell-prime/2006-April/001237.html" target="_blank">http://www.haskell.org/pipermail/haskell-prime/2006-April/001237.html</a><br>

> >     <a href="http://www.haskell.org/pipermail/haskell-prime/2006-March/001079.html" target="_blank">http://www.haskell.org/pipermail/haskell-prime/2006-March/001079.html</a><br>

> ><br>

> > The guarantees that Haskell and GHC give in this area are hand-wavy at<br>

> > best; at the moment, I don't think Haskell or GHC have a formal memory<br>

> > model—this seems to be an open research problem. (Unfortunately, AFAICT<br>

> > all the researchers working on relaxed memory models have their hands<br>

> > full with things like C++ :-)<br>

> ><br>

> > If you want to go ahead and build something that /just/ works for a<br>

> > /specific version/ of GHC, you will need to answer this question<br>

> > separately for every phase of the compiler.  For Core and STG, monads<br>

> > will preserve ordering, so there is no trouble.  However, for C--, we<br>

> > will almost certainly apply optimizations which reorder reads (look at<br>

> > CmmSink.hs).  To properly support your algorithm, you will have to add<br>

> > some new read barrier mach-ops, and teach the optimizer to respect them.<br>

> > (This could be fiendishly subtle; it might be better to give C-- a<br>

> > memory model first.)  These mach-ops would then translate into<br>

> > appropriate arch-specific assembly or LLVM instructions, preserving<br>

> > the guarantees further.<br>

> ><br>

> > This is not related to your original question, but the situation is a<br>

> > bit better with regard to reordering stores: we have a WriteBarrier<br>

> > MachOp, which, in principle, prevents store reordering.  In practice, we<br>

> > don't seem to actually have any C-- optimizations that reorder stores.<br>

> > So, at least you can assume these will work OK!<br>

> ><br>

> > Hope this helps (and is not too inaccurate),<br>

> > Edward<br>

> ><br>

> > Excerpts from John Lato's message of 2013-12-20 09:36:11 +0800:<br>

> > > Hello,<br>

> > ><br>

> > > I'm working on a lock-free algorithm that's meant to be used in a<br>

> > > concurrent setting, and I've run into a possible issue.<br>

> > ><br>

> > > The crux of the matter is that a particular function needs to perform the<br>

> > > following:<br>

> > ><br>

> > > > x <- MVector.read vec ix<br>

> > > > position <- readIORef posRef<br>

> > ><br>

> > > and the algorithm is only safe if these two reads are not reordered (both<br>

> > > the vector and IORef are written to by other threads).<br>

> > ><br>

> > > My concern is, according to standard Haskell semantics this should be safe,<br>

> > > as IO sequencing should guarantee that the reads happen in-order.  Of<br>

> > > course this also relies upon the architecture's memory model, but x86 also<br>

> > > guarantees that reads happen in order.  However doubts remain; I do not<br>

> > > have confidence that the code generator will handle this properly.  In<br>

> > > particular, LLVM may freely re-order loads of NotAtomic and Unordered<br>

> > > values.<br>

> > ><br>

> > > The one hope I have is that ghc will preserve IO semantics through the<br>

> > > entire pipeline.  This see<br>

</blockquote>
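For readers following the thread, here is a minimal sketch of the barrier-based workaround John mentions: the two reads from his original message with a `loadLoadBarrier` from atomic-primops between them. The helper name `readSlotThenPos` and the sample values are hypothetical, and the sketch assumes the `atomic-primops` and `vector` packages are installed. Per Edward's note, this seems to work today only because the optimizer treats foreign primop calls as clobbering heap memory, which GHC reserves the right to change.

```haskell
module Main (main) where

import Data.Atomics (loadLoadBarrier)        -- from the atomic-primops package
import Data.IORef (IORef, newIORef, readIORef)
import qualified Data.Vector.Mutable as MV   -- from the vector package

-- Hypothetical helper (not from the thread): read a vector slot, then the
-- position IORef, with a read barrier between the two loads.
readSlotThenPos :: MV.IOVector Int -> IORef Int -> Int -> IO (Int, Int)
readSlotThenPos vec posRef ix = do
  x <- MV.read vec ix
  -- Foreign primop call: at the time of the thread, the C-- optimizer did
  -- not move loads or stores across it. This is an observed behaviour of
  -- that GHC version, not a documented guarantee.
  loadLoadBarrier
  position <- readIORef posRef
  return (x, position)

main :: IO ()
main = do
  -- Single-threaded demonstration only; a real use would have other
  -- threads writing to both the vector and the IORef.
  vec <- MV.replicate 4 0
  MV.write vec 2 42
  posRef <- newIORef 3
  result <- readSlotThenPos vec posRef 2
  print result
```

As the P.S. above points out, the barrier is not inlined, so each pair of reads still pays for a jump even on x86, where the barrier itself compiles to a no-op.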