<div dir="ltr">I&#39;m noticing that linked paper (very nice results!) mentions a prefetch primops that were added to ghc.<div style>Is there any documentation current or pending ?</div><div style><br></div><div style><a href="https://github.com/mainland/vector/commit/cfce37d3a9c228fe4bdf627ffb777399f54af5e5#Data/Vector">https://github.com/mainland/vector/commit/cfce37d3a9c228fe4bdf627ffb777399f54af5e5#Data/Vector</a> seems to have the relevant prim ops mentioned in the paper<br>


</div><div style><br></div><div style>thanks</div><div style>-Carter</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Feb 4, 2013 at 7:36 PM, Geoffrey Mainland <span dir="ltr">&lt;<a href="mailto:mainland@apeiron.net" target="_blank">mainland@apeiron.net</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 02/04/2013 11:56 PM, Johan Tibell wrote:<br>

&gt; On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland &lt;<a href="mailto:mainland@apeiron.net">mainland@apeiron.net</a>&gt; wrote:<br>

&gt;<br>

&gt; What would a sensible fallback be for AVX instructions? What should we<br>

&gt; fall back on when the LLVM backend is not being used?<br>

&gt;<br>

&gt; Depends on the instruction. A 256-bit multiply could be replaced by N<br>

&gt; multiplies etc. For popcount we have a little bit of C code in<br>

&gt; ghc-prim that we use if SSE 4.2 isn&#39;t enabled. An alternative is to<br>

&gt; emit some different assembly in e.g. the x86-64 backend if AVX isn&#39;t<br>

&gt; enabled.<br>

&gt;<br>

&gt; Maybe we could desugar AVX instructions to SSE instructions on platforms<br>

&gt; that support SSE but not AVX, but in practice people would then #ifdef<br>

&gt; anyway and just use SSE if AVX weren&#39;t available.<br>

&gt;<br>

&gt; I don&#39;t follow here. If you conditionally emitted different<br>

&gt; instructions in the backends depending on which -m flags are passed to<br>

&gt; GHC, why would people #ifdef?<br>

<br>

</div>I think you are suggesting that the user should always use 256-bit<br>

short-vector instructions, and that on platforms where AVX is not<br>

available, this would fall back to an implementation that performed<br>

multiple SSE instructions for each 256-bit vector instruction---and used<br>

multiple XMM registers to hold each 256-bit vector value (or spilled).<br>

<br>

Anyone using low-level primops should only do so if they really want<br>

low-level control. The most efficient SSE implementation of a function<br>

is not going to be whatever implementation falls out of a desugaring of<br>

generic 256-bit short-vector primitives. Therefore, I suspect that<br>

anyone using low-level vector primops like this will #ifdef and provide<br>

two implementations---one for SSE, one for AVX. Anyone who doesn&#39;t care<br>

about this level of detail should use a higher-level interface---which<br>

we have already implemented---and which does not require any<br>

ifdefs. People will #ifdef because they can provide better SSE<br>

implementations than GHC when AVX instructions are not available.<br>

<br>

I am suggesting that we push the &quot;ifdefs&quot; into a library. The vast<br>

majority of programmers will never see the ifdefs, because they will use<br>

the library.<br>

<br>

I think you are suggesting that we push the &quot;ifdefs&quot; into GHC. That way<br>

nobody will have a choice---they get whatever desugaring GHC gives them.<br>

<br>

I understand your point of view---having primops that don&#39;t work<br>

everywhere is a real pain and aesthetically unpleasing---but I prefer<br>

exposing more low-level details in our primops even if it means a bit of<br>

unpleasantness once in a while. This does mean a tiny segment of<br>

programmers will have to deal with ifdefs, but I suspect that this tiny<br>

segment of programmers would prefer ifdefs to a lack of control.<br>

<br>

If a population count operation translates to a few extra instructions,<br>

I don&#39;t think anyone will care. If a body of code performing<br>

short-vector operations desugars to twice as many instructions that<br>

require twice as many registers, thereby resulting in a bunch of extra<br>

spills, it will matter. Put differently, there is a more-or-less<br>

canonical desugaring of population count. For a given function using<br>

short-vector instructions of one width, there is not a canonical<br>

desugaring into a function using short-vector instructions of a lesser<br>

width.<br>
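<br>
To make the contrast concrete, here is a rough sketch, not code from GHC or ghc-prim. The names popCountFallback, D2, D4, plusD2 and plusD4 are hypothetical, and the boxed D2/D4 types only stand in for real 128-bit and 256-bit vector values. The population count has one obvious portable fallback, whereas the naive desugaring of a wide add simply doubles the number of operations and live values:<br>
<br>

```haskell
module Main where

import Data.Bits (popCount)
import Data.Word (Word64)

-- Portable population count: one obvious fallback when no hardware
-- instruction is available (an illustration, not the C code in ghc-prim).
popCountFallback :: Word64 -> Int
popCountFallback 0 = 0
popCountFallback w = fromEnum (odd w) + popCountFallback (w `div` 2)

-- Hypothetical boxed stand-in for a 128-bit vector of two Doubles.
data D2 = D2 !Double !Double deriving (Eq, Show)

-- Naive desugaring of a 256-bit vector: two 128-bit halves.
data D4 = D4 !D2 !D2 deriving (Eq, Show)

plusD2 :: D2 -> D2 -> D2
plusD2 (D2 a b) (D2 c d) = D2 (a + c) (b + d)

-- One wide add becomes two narrow adds over twice as many values.
plusD4 :: D4 -> D4 -> D4
plusD4 (D4 lo1 hi1) (D4 lo2 hi2) = D4 (plusD2 lo1 lo2) (plusD2 hi1 hi2)

main :: IO ()
main = do
  -- The fallback agrees with the library popCount.
  print (popCountFallback 0xFF00FF, popCount (0xFF00FF :: Word64))
  print (plusD4 (D4 (D2 1 2) (D2 3 4)) (D4 (D2 10 20) (D2 30 40)))
```

<br>
A hand-written SSE version of a whole function is free to restructure the loop rather than split every operation in this mechanical way, which is the sense in which no canonical desugaring exists.<br>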

<div class="im"><br>

&gt; The current idea is to hide the #ifdefs in a library. Clients of the<br>

&gt; library would then get the &quot;best&quot; short-vector implementation available<br>

&gt; for their platform by using this library. Right now this library is a<br>

&gt; modified version of primitive, and I have modified versions of vector<br>

&gt; and DPH that use this version of the primitive library to generate SSE<br>

&gt; code.<br>

&gt;<br>

&gt; You would still end up with a GHC.Exts that exports a different API<br>

&gt; depending on which flags (e.g. -m&lt;something&gt;) are passed to<br>

&gt; GHC. Couldn&#39;t you use ghc-prim for your fallbacks and have<br>

&gt; GHC.Exts.yourPrimOp use either those fallbacks or the AVX<br>

&gt; instructions?<br>

<br>

</div>This is basically what I&#39;ve implemented, except there is a Multi type<br>

family that &quot;picks&quot; the appropriate short-vector representation for a<br>

type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for<br>

Double on machines with AVX, and an accompanying set of short-vector<br>

operations.<br>
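<br>
As a rough illustration of the idea, and not the code in the primitive package, the selection could be sketched as below. Everything here is an assumption made for the example: the macro HYPOTHETICAL_HAVE_AVX, the class MultiPrim with its methods, the use of an associated data family, and the boxed fields that stand in for DoubleX2# and DoubleX4#.<br>
<br>

```haskell
{-# LANGUAGE CPP, TypeFamilies #-}
module Main where

-- Hypothetical class: Multi a is the widest short-vector
-- representation available for element type a on this platform.
class MultiPrim a where
  data Multi a
  lanes       :: Multi a -> Int
  broadcast   :: a -> Multi a
  plusMulti   :: Multi a -> Multi a -> Multi a
  multiToList :: Multi a -> [a]

-- The ifdef lives here, inside the library; clients never see it.
-- HYPOTHETICAL_HAVE_AVX is a made-up macro name for this sketch.
#if defined(HYPOTHETICAL_HAVE_AVX)
-- Stand-in for DoubleX4# on machines with AVX.
instance MultiPrim Double where
  data Multi Double = MultiDouble !Double !Double !Double !Double
  lanes _ = 4
  broadcast x = MultiDouble x x x x
  plusMulti (MultiDouble a b c d) (MultiDouble e f g h) =
    MultiDouble (a + e) (b + f) (c + g) (d + h)
  multiToList (MultiDouble a b c d) = [a, b, c, d]
#else
-- Stand-in for DoubleX2# on machines with only SSE.
instance MultiPrim Double where
  data Multi Double = MultiDouble !Double !Double
  lanes _ = 2
  broadcast x = MultiDouble x x
  plusMulti (MultiDouble a b) (MultiDouble c d) =
    MultiDouble (a + c) (b + d)
  multiToList (MultiDouble a b) = [a, b]
#endif

-- Client code is written once against Multi and picks up
-- whichever width the library selected.
main :: IO ()
main = do
  let v = plusMulti (broadcast 1.5) (broadcast 2.0) :: Multi Double
  print (lanes v, multiToList v)
```

<br>
Compiling with -DHYPOTHETICAL_HAVE_AVX switches the same client code to the four-lane instance without any change on the client side.<br>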

<br>

We have a concrete design and implementation---take a look at the<br>

primitive, vector, and dph packages on my github page<br>

(<a href="http://github.com/mainland" target="_blank">http://github.com/mainland</a>). I would be very happy to discuss any<br>

concrete alternative design. We also have a paper with some performance<br>

measurements<br>

(<a href="http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf" target="_blank">http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf</a>). I<br>

would not be thrilled with a design that resulted in significantly<br>

worse benchmarks.<br>

<br>

Geoff<br>

<br>

</blockquote></div><br></div>