<div dir="ltr"><div>I'm learning about the Map Reduce computation frequently used with big data. For the fun of it, I decided to write a very high-level spec of Map Reduce. Here is what I came up with. Enjoy.<br><br>
</div>John<br><div><br>> module MapReduce where<br>> import Data.List (nub)<br><br>A high-level specification of Map Reduce as a Haskell program. The<br>program uses lists to represent multisets. As multisets have no<br>
implied ordering, the ordering implied by lists in this specification<br>should be ignored.<br><br>The database is a multiset of key-value pairs.<br><br>> type Key = String<br>> type Value = String<br>> type Datum = (Key, Value)<br>
> type Data = [Datum]<br><br>A mapper maps a datum to a finite multiset of key-value pairs.<br><br>> type Mapper = Datum -> Data<br><br>A reducer takes a key and a multiset of values and produces a finite<br>multiset of values.<br>
<br>> type Reducer = (Key, [Value]) -> [Value]<br><br>A step is a mapper followed by a reducer<br><br>> type Step = (Mapper, Reducer)<br><br>A program is a finite sequence of steps<br><br>> type Program = [Step]<br>
<br>The semantics of a program is provided by the run function.<br><br>> run :: Program -> Data -> Data<br>> run [] d = d<br>> run (s : p) d =<br>> run p (step s d)<br><br>The three parts of a step are mapping, shuffling, and reducing.<br>
<br>> step :: Step -> Data -> Data<br>> step (m, r) d =<br>> let mapped = transform m d<br>> shuffled = shuffle mapped in<br>> reduce r shuffled<br><br>The first part of a step is to transform the data by applying the<br>
mapper to each datum and collecting the results.<br><br>> transform :: Mapper -> Data -> Data<br>> transform m d =<br>> [p | u <- d, p <- m u]<br><br>Next, values with common keys are collected. Keys are unique after<br>
shuffling.<br><br>> shuffle :: Data -> [(Key, [Value])]<br>> shuffle d =<br>> [(k, vs) | k <- nub (map fst d), -- nub eliminates duplicate keys<br>> let vs = [v | (k', v) <- d, k' == k]]<br>
<br>A reducer is applied to the data associated with one key, and always<br>produces data with that key.<br><br>> reduce :: Reducer -> [(Key, [Value])] -> Data<br>> reduce r rs = <br>> [(k, v) | (k, vs) <- rs, v <- r (k, vs)]<br>
<br></div></div>