[Haskell-cafe] Finding longest common prefixes in a list

Gwern Branwen gwern0 at gmail.com
Fri Jan 20 18:45:22 CET 2012


Recently I wanted to sort through a large folder of varied files and
figure out what is a 'natural' folder to split out, where natural
means something like >4 files with the same prefix. (This might be
author, genre, subject, whatever I felt was important when I was
naming the file.) Now usually I name files with hyphens as the
delimiters like the hypothetical '1998-wadler-monads.pdf', and it
would be easy to write a stdin/stdout filter to break Strings on
hyphens and sort by whatever is most common. But this is rather
hardwired, can I solve the more general problem of finding the longest
common prefixes, whatever they are?

This turns out to be much more difficult than simply finding 'the'
longest common prefix (which is usually ""). I found an algorithm of
sorts at http://stackoverflow.com/a/6634624 but it was easier
described than implemented. Eventually I wrote what I *think* is a
correct program, but it's definitely of the write-only sort. Perhaps
people have better implementations somewhere? I saw a lot of
discussion of tries, but I didn't go that route.

The code, followed by an example:

#!/usr/bin/env runhaskell

import Data.List (intercalate, isPrefixOf, nub, sort)

main :: IO ()
main = interact (unlines . intercalate [""] . chunkFiles . lines )

-- basic algorithm from <http://stackoverflow.com/a/6634624>
chunkFiles :: Ord a => [[a]] -> [[[a]]]
chunkFiles f = map (\(_,b) -> filter (isPrefixOf b) f) $ sort $
                    map (\x -> (countPrefixes x f,x)) (e $ bar f)

sharedPrefixes :: Ord a => [[a]] -> [a]
sharedPrefixes [] = []
sharedPrefixes s = foldr1 sp2 s
  where sp2 l1 l2 = map fst . takeWhile (uncurry (==)) $ zip l1 l2

traverse :: Ord a => [[a]] -> [[a]]
traverse [] = []
traverse x = sharedPrefixes (take 2 x) : traverse (drop 1 x)

bar :: Ord a => [[a]] -> [[a]]
bar = nub . sort . traverse . sort

countPrefixes :: (Ord a) => [a] -> [[a]] -> Int
countPrefixes x xs = length $ filter (x `isPrefixOf`) xs

e :: Eq a => [[a]] -> [[a]]
e y = map fst $ filter snd $ map (\x -> (x, (==) 1 $ length . filter
id $ map (x `isPrefixOf`) y)) y

{- Example input from `ls`:

chorus-kiminoshiranaimonogatari.ogg
chorus-mrmusic.ogg
choucho-lastnightgoodnight.ogg
dylanislame-aikotoba.ogg
electriclove-エレクトリック・ラブ-korskremix.ogg
gumi-bacon8-justhangingaround.ogg
gumi-iapologizetoyou.ogg
gumi-montblanc.ogg
gumi-mozaikrole.ogg
gumi-ハッピーシンセサイザ.ogg
gumi-showasengirl.ogg
gumi-sweetfloatflatsスイートフロートアパート.ogg
gumi-timewarpedafterchoppingmystagbeetle.ogg
gumi-オリジナル曲-付きホシメグリ.ogg
gumi-ミクオリジナル親友.ogg
kaito-byakkoyano.ogg
kaito-flowertail.ogg
kasaneteto-tam-ochamekinou重音テト吹っ切れたおちゃめ機能.ogg
len-crime-timetosaygoodbye.ogg
len-fire◎flower.ogg
len-ponponpon.ogg
lily-prototype.ogg
luka-apolxcore-waitingforyou.ogg
luka-dimトロイ.ogg
luka-dion-myheartwillgoon.ogg
luka-dirgefilozofio-dirgeasleepinjesus.ogg
luka-アゴアニキ-doubelariatダブルラリアット.ogg
luka-emon-heartbeats.ogg
luka-emonloid3-ハローハロー.ogg
luka-everybreathyoutake.ogg
luka-オリジナル-garden.ogg
luka-justbefriends.ogg
lukameiko-gemini.ogg
luka-milkyway.ogg
luka-やみくろ-かいぎ.ogg
luka-tic-tick.ogg
luka-torinouta.ogg
luka-zeijakukei-shounenshoujo.ogg
luka-勝手にアニメ-nologic-作ってみた.ogg
luka-駄目人間.ogg
meiko-artemis-awake.ogg
miku-9ronicleプラチナ.ogg
miku-acolorlinkingworld-この世界の下で.ogg
miku-acolorlinkingworld-青い花.ogg
miku-a+jugos-lullabyforkindness.ogg
miku-akayaka-beacon.ogg
miku-akayakap-sunrise.ogg
miku-aoihana.ogg
miku-arabianresponse.ogg
miku-avtechno-tear.ogg
miku-こえをきかせてcicci.ogg
miku-cleantears-remind2011natsu-greenhillzonecrystiararemix.ogg
miku-cleantears-remind2011natsu-夏影summerwindremix.ogg
miku-clocklockworks.ogg
miku-dancedancevol2-runner.ogg
miku-daniwellp-chaoticuniverse.ogg
miku-dixieflatline-shinonomescrumble.ogg
miku-electricloveエレクトリックラヴ.ogg
miku-elegumitokyo-kissmebaby.ogg
miku-galaxyodyssey-cryingirl.ogg
miku-galaxyodyssey-galaxyspacelines.ogg
miku-hakamairi.ogg
miku-haruna.ogg
miku-heartshooter.ogg
miku-hoshikuzutokakera.ogg
miku-innes.ogg
miku-innocence初音ミク.ogg
miku-jemappelle-motion-likeyou.ogg
miku-jemappelle-motion-ohwell.ogg
miku-jevannip-myfavoritesummer.ogg
miku-kakokyuudance-過呼吸ダンス.ogg
miku-kz-packaged.ogg
miku-kz-tellyourworld.ogg
miku-lastscene.ogg
miku-lostmemories付き-初音ミク.ogg
miku-lovelyday.ogg
miku-いいわけlove_song.ogg
mikulukagumi-prayfor.ogg
miku-maple-初音ミク楓-オリジナル曲.ogg
miku-more1.5.ogg
miku-m at rk-eklosion.ogg
miku-m at rk-kirch.ogg
miku-nana-ボーナストラック-ハッピー般若コア.ogg
miku-nekomimiswitch.ogg
miku-nightrainbow.ogg
miku-noyounome.ogg
miku-むかしむかしのきょうのぼくオリジナル.ogg
miku-pandolistp-neverendinghammertime.ogg
miku-ジラートP-birthdayofeden-deepsleep.ogg
miku-ジラートP-birthdayofeden-水中読書.ogg
miku-plustellia-dear.ogg
miku-plustellia-壁の彩度-crazygirl.ogg
miku-plustellia-壁の彩度-discoradio.ogg
miku-ぽわぽわP-ストロボライト.ogg
miku-rabbitforgets.ogg
miku-re:package-lastnightgoodnight.ogg
miku-re:package-ourmusic.ogg
miku-re:package-sutorobonaitsu.ogg
miku-rollinggirl.ogg
miku-ryo-メルト-melt.ogg
miku-senseiniitteyaro.ogg
miku-sevencolors-レモネード.ogg
miku-shoukinosatadenia.ogg
miku-stratosphere.ogg
miku-supernova.ogg
miku-tam-lastnightgoodnight.ogg
miku-tanatofobia.ogg
miku-thearmyforyourenvy-スーパー・ノヴァ.ogg
miku-theendlesslove.ogg
miku-tinyparadise-snowflake.ogg
miku-tinyparadise-tinyparadise.ogg
miku-unfragment.ogg
miku-worldismine-ルドイズマイン.ogg
miku-yakiimo.ogg
miku-文学少年の憂鬱-オリジナル.ogg
miku-カラフルポップビートオリジナル曲.ogg
miku-杯本選life.ogg
miku-杯本選初音ミクどういうことなのダンス.ogg
miku-般若心経beautyfloor-buddhamix.ogg
miku-般若心経ポップ.ogg
niconicochorus-blackrockshooter.ogg
niconicochorus-justbefriends.ogg
rin-dixieflatline-gemini.ogg
rin-elegumitokyo-二人、恋してgirlsside.ogg
rin-helloworld.ogg
rin-jutenija.ogg
rin-lastnightgoodnight.ogg
rin-ripples-evergreen.ogg
rin-っ´ω`c.ogg
rollinggirl-piano.ogg
seeu-gagain-따라리라ddadada.ogg
utau-雪歌ユフbeyondオリジナル曲.ogg
yuki-discochocolatheque.ogg
yuki-shouwasenhosiga^ru.ogg
yuki-shouwasenhosiga^ru.ogg

Example output from `ls | ~/lcp.hs`:

chorus-kiminoshiranaimonogatari.ogg
chorus-mrmusic.ogg

gumi-montblanc.ogg
gumi-mozaikrole.ogg

gumi-showasengirl.ogg
gumi-sweetfloatflatsスイートフロートアパート.ogg

kaito-byakkoyano.ogg
kaito-flowertail.ogg

luka-emon-heartbeats.ogg
luka-emonloid3-ハローハロー.ogg

luka-tic-tick.ogg
luka-torinouta.ogg

miku-acolorlinkingworld-この世界の下で.ogg
miku-acolorlinkingworld-青い花.ogg

miku-akayaka-beacon.ogg
miku-akayakap-sunrise.ogg

miku-cleantears-remind2011natsu-greenhillzonecrystiararemix.ogg
miku-cleantears-remind2011natsu-夏影summerwindremix.ogg

miku-dancedancevol2-runner.ogg
miku-daniwellp-chaoticuniverse.ogg

miku-electricloveエレクトリックラヴ.ogg
miku-elegumitokyo-kissmebaby.ogg

miku-galaxyodyssey-cryingirl.ogg
miku-galaxyodyssey-galaxyspacelines.ogg

miku-hakamairi.ogg
miku-haruna.ogg

miku-innes.ogg
miku-innocence初音ミク.ogg

miku-jemappelle-motion-likeyou.ogg
miku-jemappelle-motion-ohwell.ogg

miku-kz-packaged.ogg
miku-kz-tellyourworld.ogg

miku-lostmemories付き-初音ミク.ogg
miku-lovelyday.ogg

miku-m at rk-eklosion.ogg
miku-m at rk-kirch.ogg

miku-plustellia-壁の彩度-crazygirl.ogg
miku-plustellia-壁の彩度-discoradio.ogg

miku-senseiniitteyaro.ogg
miku-sevencolors-レモネード.ogg

miku-tam-lastnightgoodnight.ogg
miku-tanatofobia.ogg

miku-thearmyforyourenvy-スーパー・ノヴァ.ogg
miku-theendlesslove.ogg

miku-tinyparadise-snowflake.ogg
miku-tinyparadise-tinyparadise.ogg

miku-ジラートP-birthdayofeden-deepsleep.ogg
miku-ジラートP-birthdayofeden-水中読書.ogg

miku-杯本選life.ogg
miku-杯本選初音ミクどういうことなのダンス.ogg

miku-般若心経beautyfloor-buddhamix.ogg
miku-般若心経ポップ.ogg

niconicochorus-blackrockshooter.ogg
niconicochorus-justbefriends.ogg

len-crime-timetosaygoodbye.ogg
len-fire◎flower.ogg
len-ponponpon.ogg

luka-dimトロイ.ogg
luka-dion-myheartwillgoon.ogg
luka-dirgefilozofio-dirgeasleepinjesus.ogg

miku-re:package-lastnightgoodnight.ogg
miku-re:package-ourmusic.ogg
miku-re:package-sutorobonaitsu.ogg

miku-nana-ボーナストラック-ハッピー般若コア.ogg
miku-nekomimiswitch.ogg
miku-nightrainbow.ogg
miku-noyounome.ogg

rin-dixieflatline-gemini.ogg
rin-elegumitokyo-二人、恋してgirlsside.ogg
rin-helloworld.ogg
rin-jutenija.ogg
rin-lastnightgoodnight.ogg
rin-ripples-evergreen.ogg
rin-っ´ω`c.ogg -}

-- 
gwern
http://www.gwern.net



More information about the Haskell-Cafe mailing list