# Proposal for Data.List.splitBy

Marcus D. Gabriel marcus at gabriel.name
Sun Jan 4 16:13:00 EST 2009

```We seem to have not added any splitters such as splitBy or split to
Data.List despite the discussions.  Here is yet another perspective
with the hope that it will make it.

For the record, see the threads

see the article

and see the wiki page

Essentially, I believe that the difficulty in choosing a splitter for
Data.List comes from the fact that it does not take much of a problem
to solve before a simple splitter is just not enough whereupon you
should just use a FSM, Alex, or Parsec for example.

A way forward would be to state a few meta-properties and properties
up front and then see if this leads to a reasonable solution for the
simple problems that just need a solution that is "good enough."

I would suggest that

1. a set of splitter words should be small in number,
2. such a set should fit within the spirit of the List module of
3. it should pick up where break and span left off, and
4. the results coming from this set of splitter words should be
unsplittable.

I would argue that the above suggestions imply at least two words

* splitBy :: (a -> Bool) -> [a] -> [([a], [a])] and
* split :: [a] -> [a] -> [([a], [a])]

such that splitBy takes a predicate p, e.g., (=='\n'), and a list xs
and such that split takes a list of equivalent delimiters ds, e.g.,
" \t", and a list xs. The functions splitBy and split should at least

> (concatMap (\(x1, x2) -> x1 ++ x2) \$ splitBy p xs) == xs
> splitBy _ [] == [([],[])]
> splitBy (\x -> False) xs == [(xs,[])]
> splitBy (\x -> True)  xs == [([],[x1]),([],[x2]),([],[x3]), ...]

and

> (concatMap (\(x1, x2) -> x1 ++ x2) \$ split ds xs)  == xs
> split _  [] == [([],[])]
> split [] xs == [(xs,[])]
> split xs xs == [([],[x1]),([],[x2]), ([],[x3]), ...]

where xs == x1:x2:x2:...:[]. Examples would be

> splitBy (=='/') "/a/b"      == [("","/"),("a","/"),("b","")]
> splitBy (=='/') "//aa//bb"  ==
>     [("","/"),("","/"),("aa","/"),("","/"),("bb","")]
> splitBy (=='\n') "\na\nb"   == [("","\n"),("a","\n"),("b","")]
> split xs xs                 == [([],[x1]),([],[x2]), ([],[x3]), ...]
> split " \t" "a\tb \tc\t d " ==
>     [("a","\t"),("b"," "),("","\t"),("c","\t"),(""," "),("d"," ")]
> split "/" "/a/b"            == [("","/"),("a","/"),("b","")]

This leaves the one remaining "simple" case which is splitting using a
list such as "\r\n" which splitBy and split cannot handle easily. This

* splitUsing :: (Eq a) => [a] -> [a] -> [([a], [a])]

such that

> (concatMap (\(x1, x2) -> x1 ++ x2) \$ splitUsing xs xs') == xs'
> splitUsing _  [] == [([] ,     [])]
> splitUsing [] xs == [(xs ,     [])]
> splitUsing xs xs == [([] ,     xs)]

Examples would be

> splitUsing "\r\n" "a"          == [("a",     "")]
> splitUsing "\r\n" "\r\n"       == [("" , "\r\n")]
> splitUsing "\r\n" "a\r\n"      == [("a", "\r\n")]
> splitUsing "\r\n" "a\r\n\r\n"  == [("a", "\r\n"), ("" , "\r\n")]
> splitUsing "\r\n" "a\r\nb"     == [("a", "\r\n"), ("b",     "")]
> splitUsing "\r\n" "a\r\nb\r\n" == [("a", "\r\n"), ("b", "\r\n")]

Wrapping up, based on a sort of logical extension, you could imagine

* splitWhere :: ([a] -> Bool) -> [a] -> [([a], [a])]

but I believe that this fourth word is one too many, cute, but not
necessary. If the above three splitter words are not enough, then I
would say that you need a more powerful tool beyond the spirit of the

Just to check, the Haskell 98 List.lines function would be defined as

> lines98 xs = map fst \$ split "\n" xs

and proper lines' and unlines' functions would be

> lines' :: String -> Either [String] [String]
> lines' [] = Left []
> lines' xs =
>     let res = split "\n" xs
>         nul = (null.snd.last) res
>     in (if nul then Left else Right) \$ map fst res
>
> unlines' :: Either [String] [String] -> String
> unlines' exss = intercalate "\n" \$ either (id) (++[[]]) exss

such that

> unlines' . lines' == id

This is to say, (Right [String]) if the original xs ends in '\n' and
(Left [String]) if it does not.

For your review, the attached file Split.hs defines and documents
splitBy, split, splitUsing, and splitWhere, and the attached file
SplitProperties.hs defines tests for these same words.

Finally, I would propose adding splitBy, split, and splitUsing to
Data.List as three words that fit within the spirit of the Haskell 98
List module, take off where break and span leave off, are
unsplittable, and are "good enough".

Cheers,
- Marcus

--
Marcus D. Gabriel, Ph.D.                         Saint Louis, FRANCE
http://www.marcus.gabriel.name            mailto:marcus at gabriel.name
Tel: +33.3.89.69.05.06                   Portable: +33.6.34.56.07.75

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Split.hs
Size: 6707 bytes
Desc: not available