Proposal for Data.List.splitBy

Sun Jan 4 16:13:00 EST 2009

We seem to have not added any splitters such as splitBy or split to
Data.List despite the discussions.  Here is yet another perspective
with the hope that it will make it.

For the record, see the threads

    * http://www.haskell.org/pipermail/libraries/2004-July/thread.html#2342
    * http://www.haskell.org/pipermail/haskell-cafe/2006-July/thread.html#16559
    * http://www.haskell.org/pipermail/libraries/2008-January/thread.html#8922

see the article

    * http://www.haskell.org/pipermail/libraries/2006-October/006072.html

and see the wiki page

    * http://haskell.org/haskellwiki/List_function_suggestions

Essentially, I believe that the difficulty in choosing a splitter for
Data.List comes from the fact that it does not take much of a problem
to solve before a simple splitter is just not enough whereupon you
should just use a FSM, Alex, or Parsec for example.

A way forward would be to state a few meta-properties and properties
up front and then see if this leads to a reasonable solution for the
simple problems that just need a solution that is "good enough."

I would suggest that

   1. a set of splitter words should be small in number,
   2. such a set should fit within the spirit of the List module of
      the Haskell 98 standard libraries,
   3. it should pick up where break and span left off, and
   4. the results coming from this set of splitter words should be
      unsplittable.

I would argue that the above suggestions imply at least two words

    * splitBy :: (a -> Bool) -> [a] -> [([a], [a])] and
    * split :: [a] -> [a] -> [([a], [a])]

such that splitBy takes a predicate p, e.g., (=='\n'), and a list xs
and such that split takes a list of equivalent delimiters ds, e.g.,
" \t", and a list xs. The functions splitBy and split should at least
have the follow properites.

> (concatMap (\(x1, x2) -> x1 ++ x2) $ splitBy p xs) == xs
> splitBy _ [] == [([],[])]
> splitBy (\x -> False) xs == [(xs,[])]
> splitBy (\x -> True)  xs == [([],[x1]),([],[x2]),([],[x3]), ...]

and

> (concatMap (\(x1, x2) -> x1 ++ x2) $ split ds xs)  == xs
> split _  [] == [([],[])]
> split [] xs == [(xs,[])]
> split xs xs == [([],[x1]),([],[x2]), ([],[x3]), ...]

where xs == x1:x2:x2:...:[]. Examples would be

> splitBy (=='/') "/a/b"      == [("","/"),("a","/"),("b","")]
> splitBy (=='/') "//aa//bb"  ==
>     [("","/"),("","/"),("aa","/"),("","/"),("bb","")]
> splitBy (=='\n') "\na\nb"   == [("","\n"),("a","\n"),("b","")]
> split xs xs                 == [([],[x1]),([],[x2]), ([],[x3]), ...]
> split " \t" "a\tb \tc\t d " ==
>     [("a","\t"),("b"," "),("","\t"),("c","\t"),(""," "),("d"," ")]
> split "/" "/a/b"            == [("","/"),("a","/"),("b","")]

This leaves the one remaining "simple" case which is splitting using a
list such as "\r\n" which splitBy and split cannot handle easily. This
leads to

    * splitUsing :: (Eq a) => [a] -> [a] -> [([a], [a])]

such that

> (concatMap (\(x1, x2) -> x1 ++ x2) $ splitUsing xs xs') == xs'
> splitUsing _  [] == [([] ,     [])]
> splitUsing [] xs == [(xs ,     [])]
> splitUsing xs xs == [([] ,     xs)]

Examples would be

> splitUsing "\r\n" "a"          == [("a",     "")]
> splitUsing "\r\n" "\r\n"       == [("" , "\r\n")]
> splitUsing "\r\n" "a\r\n"      == [("a", "\r\n")]
> splitUsing "\r\n" "a\r\n\r\n"  == [("a", "\r\n"), ("" , "\r\n")]
> splitUsing "\r\n" "a\r\nb"     == [("a", "\r\n"), ("b",     "")]
> splitUsing "\r\n" "a\r\nb\r\n" == [("a", "\r\n"), ("b", "\r\n")]

Wrapping up, based on a sort of logical extension, you could imagine

    * splitWhere :: ([a] -> Bool) -> [a] -> [([a], [a])]

but I believe that this fourth word is one too many, cute, but not
necessary. If the above three splitter words are not enough, then I
would say that you need a more powerful tool beyond the spirit of the
Haskell 98 standard List module.

Just to check, the Haskell 98 List.lines function would be defined as

> lines98 xs = map fst $ split "\n" xs

and proper lines' and unlines' functions would be

> lines' :: String -> Either [String] [String]
> lines' [] = Left []
> lines' xs =
>     let res = split "\n" xs
>         nul = (null.snd.last) res
>     in (if nul then Left else Right) $ map fst res
>
> unlines' :: Either [String] [String] -> String
> unlines' exss = intercalate "\n" $ either (id) (++[[]]) exss

such that

> unlines' . lines' == id

This is to say, (Right [String]) if the original xs ends in '\n' and
(Left [String]) if it does not.

For your review, the attached file Split.hs defines and documents
splitBy, split, splitUsing, and splitWhere, and the attached file
SplitProperties.hs defines tests for these same words.

Finally, I would propose adding splitBy, split, and splitUsing to
Data.List as three words that fit within the spirit of the Haskell 98
List module, take off where break and span leave off, are
unsplittable, and are "good enough".

Cheers,
- Marcus

-- 
  Marcus D. Gabriel, Ph.D.                         Saint Louis, FRANCE
  http://www.marcus.gabriel.name            mailto:marcus at gabriel.name
  Tel: +33.3.89.69.05.06                   Portable: +33.6.34.56.07.75

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Split.hs
Type: text/x-haskell
Size: 6707 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/libraries/attachments/20090104/9822119d/Split-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SplitProperties.hs
Type: text/x-haskell
Size: 8412 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/libraries/attachments/20090104/9822119d/SplitProperties-0001.bin