[Haskell-cafe] Improving my implementation of "The Lady Tasting Tea"

Mon Mar 18 21:21:20 CET 2013

Hello,

For my first project in Haskell, I thought I would re-implement a
statistical problem that I have previously done in R. In his 1925 book,
Fisher tells of an experiment in which a lady claims she can tell the
difference in cups of tea that have the milk or the tea/water added first
(I'm a coffee drinker, so I'm unclear on the actual physical
interpretation). To test this claim, Fisher proposes an experiment in which
8 cups are created, with 4 having milk first and 4 having tea first. If the
lady can properly label 3 of the 4 milk first cups correctly, should we be
suprised? What is the probability that she would get 3 or 4 of the milk
first cups correct just due to chance (that is, independently of the actual
allocation of milk and tea first)?

Fisher shows that these questions can be answered by writing out all the
possible ways the 4 milk cups could be allocated (there are 70 such
possibilities) and then counting the subset that have 3 or 4 correct
according to the lady's guess (there are 17). Therefore the probability of
getting 3 or 4 correct just by guessing would occur about 24% of the time.

I've implemented this algorithm for problems of arbitrary size, and I'd
like some feedback on improving the efficiency of my code:

https://gist.github.com/markmfredrickson/5190212

The problem is basically a map-reduce:

1. Create the C(N,K) possible ways the cups could be set up (where N is the
total number of cups, and K is the number with milk first)
2. Map a function across them that scores the lady's guess
3. Reduce the set of scores to a p-value (a value between 0 and 1) that
indicates what percentage of all possible allocations of milk first would
have a score equal to or greater than the lady's guess.

My first attempt at profiling indicates that most time is spent in the
scoring function, unsurprising as this is in the inner most loop of the
algorithm. The generation of all possible combinations is also time
consuming. From what I can tell, this implementation is not very efficient
from a space perspective.

In the long term, I'm interested in using this same basic procedure to
solve more interesting statistical problems, so I'm particulary interested
in getting feed back on:

1. A good data type for storing the list of treatment allocations (i.e.
milk first cups).

The current problem calls for binary treatment with no extra data. In the
future, my N units may be partitioned into K groups (here K=2), and the
score function will require additional data (e.g. each unit will have an
outcome measure on a continuous or ordinal scale). In a sense, I'd like to
use tuples to hold index values from 1 to N, but I wouldn't necessarily
know the tuple size at compile time. The index values would point to an
array or similar structure holding the additional data.

Some other properties of allocations:
- They are always the same size
- Order is not important
- All values are unique

Any standard types fit this description?

2. A better algorithm and perhaps data type for the combinations.

The current implementation uses a tree recursion to get the set of all
possible combinations. In the future, this set will be too large to
enumerate entirely, so I'll need to sample from it. It would be desirable
to be able to write `take samples (permute $ combinations total treated)`
where the `permute` function (perhaps with a random seed) properly
re-orders the combinations. If these goals are supported by a data
structure better than a list, I'm very open to using other types to hold
the combinations.

I would also appreciate any suggestions on making the code more idiomatic
or readable.

Thanks,
-Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20130318/991d9664/attachment.htm>