# [Haskell-cafe] Re: Math.Statistics

ok ok at cs.otago.ac.nz
Tue Sep 25 20:25:39 EDT 2007

```There are a number of interesting issues raised by mbeddoe's
Math.Statistics.

0.  Coding issues.

Why use foldr1 (*) instead of product?

covm xs =  split' (length xs) cs
where
cs = [ cov a b | a <- xs, b <- xs]
split' n = unfoldr (\y -> if null y then Nothing
else Just \$ splitAt n y)

seems a rather odd way to write

covm xs = [[cov a b | b <- xs] | a <- xs]

which in turn is a bit wasteful because the matrix must be
symmetric.

I believe the author may have misunderstood "numerically stable".
The obvious (sum xs)/(fromIntegral \$ length \$ xs) is fine for
the mean, and the obvious two-pass algorithm for the variance is
fine.  The tricky algorithms are needed for incremental one-pass
calculation.

1.  How do you decide how general to make things?

mean, hmean: Floating a
But only + and / are needed, so Fractional a is enough.
gmean: Floating a
You need + and /, so fair enough.
median: Floating a, Ord a
High median and low median (not provided) need only Ord a.
Standard median needs + and /, so fair enough.
modes: Ord a
Right.
range: Num a, Ord a
But range needs only Ord a.
avgdev: Floating a
Right.
pvar, var, stddev, skew, kurtosis, cov, covm: Floating a
How should one define the variance of a collection of complex
numbers?  I'm not at all sure that it makes sense.  If I were
dealing with complex numbers, I might treat them as 2-vectors,
in which case the "standard deviation" would be a 2x2 matrix,
not a complex number.  I suspect that these should require
Floating a, Ord a.  For skew this is quite certain: one looks
for "positive" or "negative" skew, and this makes no sense for
complex numbers.
iqr: type not specified but appears to be :: [a] -> [a]
This is certainly the wrong type as the inter-quartile range
is supposed to be a NUMBER, not a sequence.  You have to use
comparisons and subtraction, so
iqr :: Num a, Ord a => [a] -> a

2. Statistics

This is an odd bunch of things; mostly things that are dangerous to
use.  A collection of the techniques from "Applications, Basics,
and
Computing" by Hoaglin and Velleman (and was there someone else?
memory fails me) would be nice.  A few basic robust measures like
- median absolute deviation
- trimmed mean
- Winsorized standard deviation
- percentage bend correlation
might be nice.

3.  How laziness could help.

Suppose you ask for the mean, standard deviation, and skewness of
a bunch of numbers using this library.  Calculating the standard
deviation will recalculate the mean.  Calculating the skewness
will recalculate the standard deviation, which will recalculate the
mean.

Robust statistics, like the trimmed mean, often want to sort the
data.  And you don't want to keep on sorting the data over and over
again.

This is a case where Haskell's laziness really shines: it translates
into direct practical gains for simple code.

data (Floating a, Ord a)
=> Simple_Continuous_Variate a
= SCV [a] Int a a (Array Int a)

list_to_variate xs = SCV xs n m s o
where n = length xs
m = sum xs / fromIntegral n
s = sum [(x - m)^2 | x <- xs] / fromIntegral (n - 1)
o = listArray (1,n) (sort xs)

vLength (SCV _ n _ _ _) = n
vMean   (SCV _ _ m _ _) = m
vSd     (SCV _ _ _ s _) = s
vMin    (SCV _ _ _ _ a) = a ! 1
vMax    (SCV _ n _ _ a) = a ! n
vRange   scv            = vMax scv - vMin scv
vMedian (SCV _ n _ _ a)
| odd n               = a ! ((n+1)`div`2)
| even n              = ((a ! l) + (a ! u))/2
where l = n `div` 2
u = n - l
.....

The wonderful thing about this is that there are no bangs in SCV, so
the various pieces of information don't get evaluated until they are
needed, and then don't get evaluated *again*.  For example, vRange
will cause the sorted array to be built at most once.  Trimmed mean
is very little harder to write than vMedian.

Esoteric uses of lazy evaluation abound, but this one makes sense
even to
an old number-crunching programmer.  I once built something like this in
C, but boy, it was a pain.

```