<div>Block, line or &#39;beam&#39; decompositions tend to work well for raytracing tasks, because they give you good cache locality and don&#39;t create a ridiculous explosion of parallel jobs. </div>

<div> </div>

<div>You&#39;ll need to do some tuning to figure out the right granularity for the decomposition, but typically a few hundred tasks work a lot better than tens of thousands or millions. You need to balance the overhead of maintaining the decomposition, which grows with the number of tasks, against the time wasted by lumpy task completion times when the grain size is too coarse.</div>
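<div>As a rough sketch of what tuning the granularity can look like, the following groups scanlines into chunks and sparks one task per chunk. It assumes the parallel package (Control.Parallel.Strategies); escapeTime, renderRow and renderImage are hypothetical names made up for illustration, with a Mandelbrot escape-time count standing in for the real per-pixel work.</div>

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical per-pixel workload: Mandelbrot escape time, capped at `limit`.
escapeTime :: Int -> Double -> Double -> Int
escapeTime limit cx cy = go 0 0 0
  where
    go :: Int -> Double -> Double -> Int
    go n x y
      | n >= limit || x * x + y * y > 4 = n
      | otherwise = go (n + 1) (x * x - y * y + cx) (2 * x * y + cy)

-- Render scanline j of a w-by-h image covering [-2,1] x [-1,1].
renderRow :: Int -> Int -> Int -> [Int]
renderRow w h j =
  [ escapeTime 256 (xAt i) yAt | i <- [0 .. w - 1] ]
  where
    xAt i = -2 + 3 * fromIntegral i / fromIntegral w
    yAt = -1 + 2 * fromIntegral j / fromIntegral h

-- Line decomposition: one spark per group of `rowsPerTask` scanlines.
-- `rowsPerTask` is the tuning knob; each group is fully evaluated
-- (rdeepseq) inside its own spark.
renderImage :: Int -> Int -> Int -> [[Int]]
renderImage rowsPerTask w h =
  map (renderRow w h) [0 .. h - 1]
    `using` parListChunk rowsPerTask rdeepseq
```

<div>For example, renderImage 2 800 600 would create 300 sparks of two scanlines each. To actually run on several cores the program has to be built with -threaded and started with +RTS -N; the best value of rowsPerTask is something to measure rather than guess.</div>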


<div> </div>

<div>Unfortunately, in Haskell it is hard to do what you can do in, say, C++ with Intel Threading Building Blocks, where you get a self-tuning decomposition of your range that adapts by splitting stolen tasks. With GHC&#39;s sparks you don&#39;t get the same visibility into whether or not the task you are doing was stolen from elsewhere.</div>
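<div>For contrast, here is a minimal sketch of the kind of range splitting you can write with sparks. This is not the TBB scheme: since the code cannot tell whether a task was stolen, it simply halves the range recursively until it reaches a fixed, hand-tuned grain size. It assumes the parallel package (Control.Parallel); parSumRange and its parameters are hypothetical names, and summing a per-index function stands in for the real work.</div>

```haskell
import Control.Parallel (par, pseq)
import Data.List (foldl')

-- Sum `f i` for i in [lo, hi) by recursive halving.
-- Ranges of at most `grain` indices are processed sequentially;
-- larger ranges spark the left half and evaluate the right half
-- on the current thread before combining.
parSumRange :: Int -> (Int -> Int) -> Int -> Int -> Int
parSumRange grain f = go
  where
    g = max 1 grain -- guard against a non-positive grain size
    go lo hi
      | hi - lo <= g = foldl' (\acc i -> acc + f i) 0 [lo .. hi - 1]
      | otherwise = left `par` (right `pseq` (left + right))
      where
        mid = lo + (hi - lo) `div` 2
        left = go lo mid
        right = go mid hi
```

<div>The grain size here plays the role that the self-tuning would play in TBB: it is fixed up front, so it has to be found by experiment for each workload.</div>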


<div> </div>

<div>-Edward Kmett</div>

<div> </div>

<div class="gmail_quote">On Tue, Sep 15, 2009 at 7:48 AM, Andrew Coppin <span dir="ltr">&lt;<a href="mailto:andrewcoppin@btinternet.com">andrewcoppin@btinternet.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">

<div bgcolor="#ffffff" text="#000000">I have a number of compute-bound graphics programs written in Haskell. (Fractal generators, ray tracers, that kind of thing.) GHC offers several concurrency and parallelism abstractions, but what&#39;s the best way to use these to get images rendered as fast as possible, using the available compute power?<br>

<br>(OK, well the *best* way is to use the GPU. But AFAIK that&#39;s still a theoretical research project, so we&#39;ll leave that for now.)<br><br>I&#39;ve identified a couple of common cases. You have a 2D grid of points, and you want to compute the value at each point. Eventually you will have a grid of <i>pixels</i> where each value is a <i>colour</i>, but there may be intermediate steps before that. So, what cases exist?<br>

<br>1. A point&#39;s value is a function of its coordinates.<br><br>2. A point&#39;s value is a function of its previous value from the last frame.<br><br>3. A point&#39;s value is a function of <i>several</i> points from the last frame.<br>

<br>How can we accelerate this? I see a few options:<br><br>- Create a spark for every point in the grid.<br>- Create several explicit threads to populate non-overlapping regions of the grid.<br>- Use parallel arrays. (Does this actually work yet?)<br>

<br>I&#39;m presuming that sparking every individual point is going to create billions of absolutely tiny sparks, which probably won&#39;t give great performance. We could spark every line rather than every point?<br><br>

Using explicit threads has the nice side-effect that we can produce progress information. Few things are more frustrating than staring at a blank screen with no idea how long it&#39;s going to take. I&#39;m thinking this method might also allow you to avoid two cores tripping over each other&#39;s caches.<br>

<br>And then there&#39;s parallel arrays, which presumably are designed from the ground up for exactly this type of task. But are they usable yet?<br><br>Any further options?<br><br></div><br>_______________________________________________<br>

Haskell-Cafe mailing list<br><a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br><a href="http://www.haskell.org/mailman/listinfo/haskell-cafe" target="_blank">http://www.haskell.org/mailman/listinfo/haskell-cafe</a><br>

<br></blockquote></div><br>