[Haskell-cafe] Kernel Loops in Accelerate

Thu Dec 13 13:47:14 CET 2012

Hi Clark,

The question of sequential loops in Accelerate has come up a few times in the
past. The main sticking point is knowing how to implement them in a way that
encourages efficient programs; avoiding irregular arrays (iteration depths),
discouraging scalar versions of collective combinators, etc. Basically we need
to avoid thread divergence where possible, else the GPU SIMD hardware won't be
well utilised.

I've experimented a little with this. If you check the generated code of the
mandelbrot program [1], you'll see it has recovered the _fixed_ iteration depth
into a scalar loop.

I'll hack up something like a while loop for value iteration and see what
happens. Tentative proposal (perhaps just the second):

  iterate :: Exp Int            -- fixed iteration count (maybe just a regular Int)
          -> (Exp a -> Exp a)   -- function to repeatedly apply
          -> Exp a              -- initial (seed) value
          -> Exp a

  while   :: (Exp a -> Exp Bool)
          -> (Exp a -> Exp a)
          -> Exp a
          -> Exp a

It would be good if we could collect some additional concrete applications and
see if there are better ways to map certain classes of sequential iteration to
efficient GPU code.

Additional thoughts/comments welcome!

Cheers,
-Trev

[1]: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/mandelbrot

On 12/12/2012, at 4:34 PM, Clark Gaebel <cgaebel at uwaterloo.ca> wrote:

> Hi Trevor (and cafe),
> 
> I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.
> 
> For example, I would love to submit a patch to fix issue #52 [0] on github by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. Also, that means there are no high-level combinators I'm used to for my sequential code (such as map and fold) that would work on an accelerate CUDA kernel.
> 
> As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply? :
> 
> typedef uint64_t rand_state;
> 
> __device__ rand_state rand_next(rand_state s);
> __device__ uint32_t rand_get(rand_state s);
> __device__ rand_state rand_skip(rand_state s, uint64_t distance);
> __device__ uint32_t round_to_next_pow2(uint32_t n);
> 
> // Fills an array with random numbers given a random seed,
> // a maximum random number to generate, and an output
> // array to put the result in. The output will be in the range
> // [0, rand_max).
> __kernel__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
>     rand_state current_state = start_state;
>     int i = blockDim.x*blockIdx.x + threadIdx.x;
>     // assumes we skip less than 1 million times per element...
>     current_state = rand_skip(current_state, i*1e6);
>     uint32_t mask = round_to_next_pow2(rand_max) - 1;
>     uint32_t result;
>     do {
>         result = rand_get(current_state);
>         current_state = rand_next(current_state);
>     } while(result & mask >= rand_max);
> 
>     out[i] = result;
> } // note: code was neither debugged, run, nor compiled.
> 
> Thanks,
>   - Clark
> 
> [0] https://github.com/AccelerateHS/accelerate/issues/52
> [1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121213/279ec1de/attachment.htm>