Personal tools

Hitchhikers guide to Haskell

From HaskellWiki

(Difference between revisions)
Jump to: navigation, search
m (Preface: DONT PANIC!: sp.)
m (Corrections: apostrophes, articles, spelling mistakes.)
Line 16: Line 16:
 
This text intends to introduce the reader to the practical aspects of Haskell
 
This text intends to introduce the reader to the practical aspects of Haskell
 
from the very beginning (plans for the first chapters include: I/O, darcs,
 
from the very beginning (plans for the first chapters include: I/O, darcs,
Parsec, QuickCheck, profiling and debugging, to mention the few). The reader
+
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader
 
is expected to know (where to find) at least the basics of haskell: how to run
 
is expected to know (where to find) at least the basics of haskell: how to run
 
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do
 
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do
 
not plan to take radical leaps, and will go one step at a time in order not to
 
not plan to take radical leaps, and will go one step at a time in order not to
lose the reader along the way. So DONT PANIC, take your towel with you and
+
lose the reader along the way. So DON'T PANIC, take your towel with you and
 
read along.
 
read along.
   
Line 36: Line 36:
 
information on CDs and DVDs. While CD (or DVD) burning itself is easy
 
information on CDs and DVDs. While CD (or DVD) burning itself is easy
 
these days, it usually takes some (or quite a lot ot) time to decide
 
these days, it usually takes some (or quite a lot ot) time to decide
how to put a several Gb's of digital photos on CD-Rs, when directories
+
how to put a several GB of digital photos on CD-Rs, when directories
with images range from 10 to 300 Mb's in size, and you dont want to
+
with images range from 10 to 300 Mb's in size, and you don't want to
 
burn half-full (or half-empty) CD-Rs.
 
burn half-full (or half-empty) CD-Rs.
   
Line 59: Line 59:
 
OK, we've done it. Move along now, nothing interesting here :)
 
OK, we've done it. Move along now, nothing interesting here :)
   
Any serious development must be done with the help of version control
+
Any serious development must be done with the help of a version control
system, and we will not make an exception. We will use modern
+
system, and we will not make an exception. We will use the modern
 
distributed version control system "darcs". "Modern" means that it is
 
distributed version control system "darcs". "Modern" means that it is
 
written in Haskell, "distributed" means that each working copy is
 
written in Haskell, "distributed" means that each working copy is
repository in itself.
+
a repository in itself.
   
First, lets create an empty directory for all our code, and invoke
+
First, let's create an empty directory for all our code, and invoke
 
"darcs init" there, which will create subdirectory "_darcs" to store
 
"darcs init" there, which will create subdirectory "_darcs" to store
 
all version-control-related stuff there.
 
all version-control-related stuff there.
   
 
Fire up your favorite editor and create new file "cd-fit.hs" in our
 
Fire up your favorite editor and create new file "cd-fit.hs" in our
working directory. Now lets think for a moment about how our program
+
working directory. Now let's think for a moment about how our program
 
will operate and express it in pseudocode:
 
will operate and express it in pseudocode:
   
Line 79: Line 79:
 
Sounds reasonable? I thought so.
 
Sounds reasonable? I thought so.
   
Lets simplify our life a little and assume for now that we will
+
Let's simplify our life a little and assume for now that we will
 
compute directory sizes somewhere outside our program (for example,
 
compute directory sizes somewhere outside our program (for example,
 
with "du -sb *") and read this information from stdin.
 
with "du -sb *") and read this information from stdin.
Line 104: Line 104:
 
How did I know that? Did I memorize all the functions by heart? Of course not!
 
How did I know that? Did I memorize all the functions by heart? Of course not!
 
Each function has a type, which, along with function's name, usually tells a
 
Each function has a type, which, along with function's name, usually tells a
lot about what function will do.
+
lot about what the function will do.
   
Let's fire up interactive Haskell environment and examine this function
+
Let's fire up an interactive Haskell environment and examine this function
 
up close:
 
up close:
   
Line 130: Line 130:
 
return its result (if any).
 
return its result (if any).
   
We can choose not to evaluate action obtained from "getContents", but to carry
+
We can choose not to evaluate the action obtained from "getContents", but to carry it around a bit and evaluate later:
it around a bit and evaluate later:
 
   
 
let x = getContents
 
let x = getContents
Line 143: Line 143:
 
Haskell allows you to do better than that.
 
Haskell allows you to do better than that.
   
Standard language library (named "Prelude", by the way) provides us with lots
+
The standard language library (named "Prelude", by the way) provides us with lots of functions that return useful primitive IO actions. In order to combine them to produce more complex actions, we use a "do":
of functions that return useful primitive IO actions. In order to combine them
 
to produce more complex actions, we use a "do":
 
   
 
c = do a <- someAction
 
c = do a <- someAction
Line 158: Line 158:
 
* then, print the word "done"
 
* then, print the word "done"
   
When all this will actually be executed? Answer: as soon as we evaluate "c"
+
When will all this actually be executed? Answer: as soon as we evaluate "c"
 
using the "<-" (if it returns result, as "getContents" does) or just
 
using the "<-" (if it returns result, as "getContents" does) or just
 
by using it as a function name (if it does not return a result, as "print"
 
by using it as a function name (if it does not return a result, as "print"
Line 169: Line 169:
 
Notice that we took a bunch of functions ("someAction", "someOtherAction",
 
Notice that we took a bunch of functions ("someAction", "someOtherAction",
 
"print", "putStrLn") and using "do" created from them a new function, which we
 
"print", "putStrLn") and using "do" created from them a new function, which we
bound to symbol "c". Now we could use "c" as a building block to produce even
+
bound to symbol "c". Now we could use "c" as a building block to produce an even
 
more complex function, "process", and we could carry this on and on.
 
more complex function, "process", and we could carry this on and on.
 
Eventually, some of the functions will be mentioned in the code of function
 
Eventually, some of the functions will be mentioned in the code of function
 
"main", to which the ultimate topmost IO action any Haskell program is bound.
 
"main", to which the ultimate topmost IO action any Haskell program is bound.
   
When the "main" will be executed/evaluated/forced? As soon as we run the
+
When will the "main" be executed/evaluated/forced? As soon as we run the
 
program. Read this twice and try to comprehend:
 
program. Read this twice and try to comprehend:
   
''Execution of the Haskell program is an evaluation of the symbol "main" to
+
''The execution of a Haskell program is an evaluation of the symbol "main" to
 
which we have bound an IO action. Via evaluation we obtain the result of that
 
which we have bound an IO action. Via evaluation we obtain the result of that
 
action''.
 
action''.
   
Readers familiar with advanced C++ or Java programming and arcane body of
+
Readers familiar with advanced C++ or Java programming and that arcane body of
 
knowledge named "OOP Design Patterns" might note that "build actions from
 
knowledge named "OOP Design Patterns" might note that "build actions from
 
actions" and "evaluate actions to get result" is essentially a "Command
 
actions" and "evaluate actions to get result" is essentially a "Command
pattern" and "Composition pattern" combined. Good new: in Haskell you get them
+
pattern" and "Composition pattern" combined. Good news: in Haskell you get them
 
for all your IO, and get them '''for free''' :)
 
for all your IO, and get them '''for free''' :)
   
Line 231: Line 231:
 
Haskell program in order to examine "main"? :)
 
Haskell program in order to examine "main"? :)
   
Lets celebrate that by putting it under version control: execute
+
let's celebrate that by putting it under version control: execute
 
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions
 
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions
 
and provide a commit comment "Skeleton of cd-fit.hs"
 
and provide a commit comment "Skeleton of cd-fit.hs"
Line 243: Line 243:
 
'''Exercises''':
 
'''Exercises''':
   
* Try to write program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);
+
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);
   
* Try to write program that asks for you name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).
+
* Try to write a program that asks for you name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).
   
 
== Chapter 2: Parsing the input ==
 
== Chapter 2: Parsing the input ==
   
 
OK, now that we have proper understanding of the powers of Haskell IO
 
OK, now that we have proper understanding of the powers of Haskell IO
(and are awed by them, I hope), lets forget about IO and actually do
+
(and are awed by them, I hope), let's forget about IO and actually do
some usefull work.
+
some useful work.
   
 
As you remember, we set forth to pack some CD-Rs as tightly as
 
As you remember, we set forth to pack some CD-Rs as tightly as
Line 275: Line 275:
   
 
Unlike other tools in this area (lex/yacc or JavaCC to name a few),
 
Unlike other tools in this area (lex/yacc or JavaCC to name a few),
[[Parsec]] parsers do not require separate preprocessing stage. Since in
+
[[Parsec]] parsers do not require a separate preprocessing stage. Since in
 
Haskell we can return function as a result of function and thus
 
Haskell we can return function as a result of function and thus
construct functions "from the thin air", there is no need for separate
+
construct functions "from the thin air", there is no need for a separate
syntax for parser description. But enough advertisements, lets actually
+
syntax for parser description. But enough advertisements, let's actually
 
do some parsing:
 
do some parsing:
   
Line 337: Line 337:
 
paint, which nails to choose, etc) from you.
 
paint, which nails to choose, etc) from you.
   
Lets use the interactive Haskell environment to decipher all the
+
let's use the interactive Haskell environment to decipher all the
 
instructions we've written for the parsing library. As usually, we'll
 
instructions we've written for the parsing library. As usually, we'll
 
go top-down:
 
go top-down:
Line 377: Line 377:
 
the curtains" to make this ''datatype'' conform to the interface of
 
the curtains" to make this ''datatype'' conform to the interface of
 
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for
 
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for
now lets just say that this will allow us to "print" instances of
+
now let's just say that this will allow us to "print" instances of
 
"Dir".
 
"Dir".
   
Line 432: Line 432:
 
because "parse" is an evaluator for "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parser" implements all monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.
 
because "parse" is an evaluator for "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parser" implements all monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.
   
Lets extend our "main" function to use "parse" and actually parse the input
+
let's extend our "main" function to use "parse" and actually parse the input
 
and show us the parsed data structures:
 
and show us the parsed data structures:
   
Line 462: Line 462:
 
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]
 
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]
   
Seems to be doing exactly as planned. Now lets try some erroneous
+
Seems to be doing exactly as planned. Now let's try some erroneous
 
input:
 
input:
   
Line 513: Line 513:
 
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==
 
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==
   
Enough preliminaries already. Lets go pack some CDs.
+
Enough preliminaries already. let's go pack some CDs.
   
 
As you might already have recognized, our problem is a classical one. It is
 
As you might already have recognized, our problem is a classical one. It is
Line 519: Line 519:
 
is. There are more than 100000 links).
 
is. There are more than 100000 links).
   
Lets start from the greedy solution, but first let's slightly modify our "Dir"
+
let's start from the greedy solution, but first let's slightly modify our "Dir"
 
datatype to allow easy extraction of its components:
 
datatype to allow easy extraction of its components:
   
Line 533: Line 533:
 
The Greedy algorithm sorts directories from the biggest down, and tries to puts
 
The Greedy algorithm sorts directories from the biggest down, and tries to puts
 
them on CD one by one, until there is no room for more. We will need to track
 
them on CD one by one, until there is no room for more. We will need to track
which directories we added to CD, so lets add another datatype, and code this
+
which directories we added to CD, so let's add another datatype, and code this
 
simple packing algorithm:
 
simple packing algorithm:
   
Line 543: Line 543:
 
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show
 
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show
 
 
-- For simplicity, lets assume that we deal with standard 700 Mb CDs for now
+
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now
 
media_size = 700*1024*1024
 
media_size = 700*1024*1024
 
 
Line 570: Line 570:
   
 
In order to actually use our greedy packer we must call it from our "main"
 
In order to actually use our greedy packer we must call it from our "main"
function, so lets add a lines:
+
function, so let's add a lines:
   
 
main = do ...
 
main = do ...
Line 587: Line 587:
 
about establishing with reasonable degree of certainty that our code, parts
 
about establishing with reasonable degree of certainty that our code, parts
 
and the whole, works properly, and doing so in re-usable manner? In other
 
and the whole, works properly, and doing so in re-usable manner? In other
words, how about writing some test?
+
words, how about writing some test?
   
 
Java programmers used to JUnit probably thought about screens of boiler-plate
 
Java programmers used to JUnit probably thought about screens of boiler-plate
Line 597: Line 597:
 
[[QuickCheck]] is a tool to do automated testing of your functions using
 
[[QuickCheck]] is a tool to do automated testing of your functions using
 
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of
 
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of
praise" lets show the code for testing the following ''property'': An attempt to pack directories returned by "greedy_pack" should return "DirPack" of exactly the same pack:
+
praise" let's show the code for testing the following ''property'': An attempt to pack directories returned by "greedy_pack" should return "DirPack" of exactly the same pack:
   
 
import Test.QuickCheck
 
import Test.QuickCheck
Line 623: Line 623:
 
in pack_size pack == pack_size (greedy_pack (dirs pack))
 
in pack_size pack == pack_size (greedy_pack (dirs pack))
   
Lets run the test, after which I'll explain how it all works:
+
let's run the test, after which I'll explain how it all works:
   
 
Prelude> :r
 
Prelude> :r
Line 636: Line 636:
 
completely) random lists of "Dir"s, and it seems that property indeed holds.
 
completely) random lists of "Dir"s, and it seems that property indeed holds.
   
Lets dissect the code. The most intriguing part is "instance Arbitrary Dir
+
let's dissect the code. The most intriguing part is "instance Arbitrary Dir
 
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a
 
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a
 
bit.
 
bit.
Line 674: Line 674:
 
on any type which is an instance of "Arbitrary"!
 
on any type which is an instance of "Arbitrary"!
   
Lets say it again. Someone (maybe even you) writes the code (API or library),
+
let's say it again. Someone (maybe even you) writes the code (API or library),
 
which requires that input values implement certain ''interfaces'', which is
 
which requires that input values implement certain ''interfaces'', which is
 
described in terms of functions. Once you show how your type implements this
 
described in terms of functions. Once you show how your type implements this
Line 722: Line 722:
 
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.
 
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.
   
Now, lets take a look back to the definition of "Dir":
+
Now, let's take a look back to the definition of "Dir":
   
 
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show
 
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

Revision as of 01:36, 3 April 2006

Contents

1 Preface: DON'T PANIC!

Recent experiences from a few of my fellow C++/Java programmers indicate that they read various Haskell tutorials with "exponential speedup" (think about how TCP/IP session starts up). They start slow and cautious, but when they see that the first 3-5 pages do not contain "anything interesting" in terms of code and examples, they begin skipping paragraphs, then chapters, then whole pages, only to slow down - often to a complete halt - somewhere on page 50, finding themselves in the thick of concepts like "type classes", "type constructors", "monadic IO", at which point they usually panic, think of a perfectly rational excuse not to read further anymore, and happily forget this sad and scary encounter with Haskell (as human beings usually tend to forget sad and scary things).

This text intends to introduce the reader to the practical aspects of Haskell from the very beginning (plans for the first chapters include: I/O, darcs, Parsec, QuickCheck, profiling and debugging, to mention a few). The reader is expected to know (where to find) at least the basics of haskell: how to run "hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do not plan to take radical leaps, and will go one step at a time in order not to lose the reader along the way. So DON'T PANIC, take your towel with you and read along.

Oh, almost forgot: author is very interested in ANY feedback. Drop him a line or a word (see Adept for contact info)

2 Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell

Each chapter will be dedicated to one small real-life task which we will complete from the ground up.

So here is the task for this chapter: in order to free up space on your hard drive for all the haskell code you are going to write in the nearest future, you are going to archive some of the old and dusty information on CDs and DVDs. While CD (or DVD) burning itself is easy these days, it usually takes some (or quite a lot ot) time to decide how to put a several GB of digital photos on CD-Rs, when directories with images range from 10 to 300 Mb's in size, and you don't want to burn half-full (or half-empty) CD-Rs.

So, the task is to write a program which will help us to put a given collection of directories on the minimum possible amount of media, while packing the media as tight as possible. Let's name this program "cd-fit".

Oh. Wait. Let's do the usual "hello world" thing, before we forget about it, and then move on to more interesting things:

 -- put this in hello.hs
 module Main where
 main = putStrLn "Hello world!"

Run it:

 $ runhaskell ./hello.hs
 Hello world!

OK, we've done it. Move along now, nothing interesting here :)

Any serious development must be done with the help of a version control system, and we will not make an exception. We will use the modern distributed version control system "darcs". "Modern" means that it is written in Haskell, "distributed" means that each working copy is a repository in itself.

First, let's create an empty directory for all our code, and invoke "darcs init" there, which will create subdirectory "_darcs" to store all version-control-related stuff there.

Fire up your favorite editor and create new file "cd-fit.hs" in our working directory. Now let's think for a moment about how our program will operate and express it in pseudocode:

 main = read list of directories and their sizes
        decide how to fit them on CD-Rs
        print solution

Sounds reasonable? I thought so.

Let's simplify our life a little and assume for now that we will compute directory sizes somewhere outside our program (for example, with "du -sb *") and read this information from stdin. Now let me convert all this to Haskell:

 module Main where
 main = do input <- getContents
           putStrLn ("DEBUG: got input " ++ input)
           -- compute solution and print it

Not really working, but pretty close to plain English, eh? Let's stop for a moment and look closer at whats written here line-by-line

Let's begin from the top:

 input <- getContents

This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from the stdin, return it as a single string, and bind to the symbol "input", so we can process this string any way we want.

How did I know that? Did I memorize all the functions by heart? Of course not! Each function has a type, which, along with function's name, usually tells a lot about what the function will do.

Let's fire up an interactive Haskell environment and examine this function up close:

 $ ghci
    ___         ___ _
   / _ \ /\  /\/ __(_)
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/
 \____/\/ /_/\____/|_|      Type :? for help.
 
 Loading package base-1.0 ... linking ... done.
 Prelude> :type getContents
 getContents :: IO String
 Prelude> 
 

We see that "getContents" is a function without arguments, that will return "IO String". Prefix "IO" meant that this is an IO action. It will return String, when evaluated. Action will be evaluated as soon as we use "<-" to bind its result to some symbol.

Note that "<-" is not a fancy way to do assignment to variable. It is a way to evaluate (execute) IO actions, in other words - to actually do some I/O and return its result (if any).

We can choose not to evaluate the action obtained from "getContents", but to carry it around a bit and evaluate later:

 let x = getContents
 -- 300 lines of code here
 input <- x

So, as you see, IO actions can act like an ordinary values. Suppose that we have built a list of IO actions and found a way to execute them one by one. This would be a way to simulate imperative programming with its notion of "order of execution".

Haskell allows you to do better than that.

The standard language library (named "Prelude", by the way) provides us with lots of functions that return useful primitive IO actions. In order to combine them to produce more complex actions, we use a "do":

 c = do a <- someAction
        b <- someOtherAction
        print (bar b)
        print (foo a)
        putStrLn "done"

Here we bind "c" to an action with the following "scenario":

  • evaluate action "someAction" and bind its result to "a"
  • then, evaluate "someOtherAction" and bind its result to "b"
  • then, process "b" with function "bar" and print result
  • then, process "a" with function "foo" and print result
  • then, print the word "done"

When will all this actually be executed? Answer: as soon as we evaluate "c" using the "<-" (if it returns result, as "getContents" does) or just by using it as a function name (if it does not return a result, as "print" does):

 process = do putStrLn "Will do some processing"
              c
              putStrLn "Done"
 

Notice that we took a bunch of functions ("someAction", "someOtherAction", "print", "putStrLn") and using "do" created from them a new function, which we bound to symbol "c". Now we could use "c" as a building block to produce an even more complex function, "process", and we could carry this on and on. Eventually, some of the functions will be mentioned in the code of function "main", to which the ultimate topmost IO action any Haskell program is bound.

When will the "main" be executed/evaluated/forced? As soon as we run the program. Read this twice and try to comprehend:

The execution of a Haskell program is an evaluation of the symbol "main" to which we have bound an IO action. Via evaluation we obtain the result of that action.

Readers familiar with advanced C++ or Java programming and that arcane body of knowledge named "OOP Design Patterns" might note that "build actions from actions" and "evaluate actions to get result" is essentially a "Command pattern" and "Composition pattern" combined. Good news: in Haskell you get them for all your IO, and get them for free :)


Exercise: Consider the following code:

 module Main where
 c = putStrLn "C!"
 
 combine before after =
   do before
      putStrLn "In the middle"
      after
 
 main = do combine c c
           let b = combine (putStrLn "Hello!") (putStrLn "Bye!)
           let d = combine (b) (combine c c)
           putStrLn "So long!"

See how we construct code out of thin air? Try to imagine what this code will do, then run it and check yourself.

Do you understand why "Hello!" and "Bye!" are not printed?


Let's examine our "main" function closer:

 Prelude> :load cd-fit.hs
 Compiling Main             ( ./cd-fit.hs, interpreted )
 Ok, modules loaded: Main.
 *Main> :type main
 main :: IO ()
 *Main> 

We see that "main" is indeed an IO action which will return nothing when evaluated. When combining actions with "do", the type of the result will be the type of the last action, and "putStrLn something" has type "IO ()":

 *Main> :type putStrLn "Hello world!"
 putStrLn "Hello world!" :: IO ()
 *Main> 

Oh, by the way: have you noticed that we actually compiled our first Haskell program in order to examine "main"? :)

let's celebrate that by putting it under version control: execute "darcs add cd-fit.hs" and "darcs record", answer "y" to all questions and provide a commit comment "Skeleton of cd-fit.hs"

Let's try to run it:

 $ echo "foo" | runhaskell cd-fit.hs
 DEBUG: got input foo

Exercises:

  • Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);
  • Try to write a program that asks for you name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).

3 Chapter 2: Parsing the input

OK, now that we have proper understanding of the powers of Haskell IO (and are awed by them, I hope), let's forget about IO and actually do some useful work.

As you remember, we set forth to pack some CD-Rs as tightly as possible with data scattered in several input directories. We assume that "du -sb" will compute the sizes of input directories and output something like:

 65572 /home/adept/photos/raw-to-burn/dir1
 68268 /home/adept/photos/raw-to-burn/dir2
 53372 /home/adept/photos/raw-to-burn/dir3
 713124  /home/adept/photos/raw-to-burn/dir4
 437952  /home/adept/photos/raw-to-burn/dir5

Our next task is to parse that input into some suitable internal representation.

For that we will use powerful library of parsing combinators named "Parsec" which ships with most Haskell implementations.

Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine into more complex parsing constructs.

Unlike other tools in this area (lex/yacc or JavaCC to name a few), Parsec parsers do not require a separate preprocessing stage. Since in Haskell we can return function as a result of function and thus construct functions "from the thin air", there is no need for a separate syntax for parser description. But enough advertisements, let's actually do some parsing:

 import Text.ParserCombinators.Parsec
 
 -- parseInput parses output of "du -sb", which consists of many lines,
 -- each of which describes single directory
 parseInput = 
   do dirs <- many dirAndSize
      eof
      return dirs
 
 -- Datatype Dir holds information about single directory - its size and name
 data Dir = Dir Int String deriving Show
 
 -- `dirAndSize` parses information about single directory, which is:
 -- a size in bytes (number), some spaces, then directory name, which extends till newline
 dirAndSize = 
   do size <- many1 digit
      spaces
      dir_name <- anyChar `manyTill` newline
      return (Dir (read size) dir_name)

Just add those lines into "cd-fit.hs". Here we see quite a lot of new things, and several those that we know already.

First of all, note the familiar "do" construct, which, as we know, is used to combine IO actions to produce new IO actions. Here we use it to combine "parsing" actions into new "parsing" actions. Does this mean that "parsing" implies "doing IO"? Not at all. Thing is, I must admit that I lied to you - "do" is used not only to combine IO actions. "Do" is used to combine any kind of so-called monadic actions or monadic values together.

Think about monad as of "design pattern" in the functional world. Monad is a way to hide from the user (programmer) all the machinery required for complex functionality to operate.

As you might have heard, Haskell has no notion of "assignment", "mutable state", "variables", and is a "pure functional language", which means that every function called with the same input parameters will return exactly the same result. Meanwhile "doing IO" requires hauling around file handles and their states and dealing with IO errors. "Parsing" requires to track position in the input and dealing with parsing errors.

In both cases Wise Men Who Wrote Libraries cared for our needs and hid all underlying complexities from us, exposing the API of their libraries (IO and parsing) in the form of "monadic action" which we are free to combine as we see fit.

Think of programming with monads as of doing the remodelling with the help of professional remodelling crew. You describe sequence of actions on the piece of paper (that's us writing in "do" notation), and then, when required, that sequence will be evaluated by the remodelling crew ("in the monad") which will provide you with end result, hiding all the underlying complexity (how to prepare the paint, which nails to choose, etc) from you.

let's use the interactive Haskell environment to decipher all the instructions we've written for the parsing library. As usually, we'll go top-down:

 *Main> :reload
 Ok, modules loaded: Main.
 *Main> :t parseInput
 parseInput :: GenParser Char st [Dir]
 *Main> :t dirAndSize
 dirAndSize :: GenParser Char st Dir
 *Main> 

Assuming (well, take my word for it) that "GenParser Char st" is our parsing monad, we could see that "parseInput", when evaluated, will produce a list of "Dir", and "dirAndSize", when evaluated, will produce "Dir". Assuming that "Dir" somehow represents information about single directory, that is pretty much what we wanted, isn't it?

Let's see what a "Dir" means. We defined datatype Dir as a record, which holds an Int and a String:

 data Dir = Dir Int String deriving Show

In order to construct such records, we must use data constructor Dir:

 *Main> :t Dir 1 "foo"
 Dir 1 "foo" :: Dir

In order to reduce confusion for newbies, we could have written:

 data Dir = D Int String deriving Show

, which would define datatype "Dir" with data constructor "D". However, traditionally name of the datatype and its constructor are chosen to be the same.

Clause "deriving Show" instructs the compiler to make enough code "behind the curtains" to make this datatype conform to the interface of the type class Show. We will explain type classes later, for now let's just say that this will allow us to "print" instances of "Dir".

Exercises:

  • examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from single ones.
  • compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another syntax sugar. Note that when function is supplied with less arguments that it actually needs, we get not a value, but a new function, which is called partial application.


OK. So, we combined a lot of primitive parsing actions to get ourselves a parser for output of "du -sb". How can we actually parse something? the Parsec library supplies us with function "parse":

 *Main> :t parse
 parse :: GenParser tok () a
 	 -> SourceName
 	 -> [tok]
 	 -> Either ParseError a
 *Main> :t parse parseInput
 parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]
 *Main> 

At first the type might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise type.

Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.

OK, back to the code. We can observe that the "parser" is a function, which, given a parser, a name of the source file or channel (f.e. "stdin"), and source data (String, which is a list of "Char"s, which is written "[Char]"), will either produce parse error, or parse us a list of "Dir".

Datatype "Either" is an example of datatype whose constructor has name, different from the name of the datatype. In fact, "Either" has two constructors:

 data Either a b = Left a | Right b

In order to undestand better what does this mean consider the following example:

 *Main> :t Left 'a'
 Left 'a' :: Either Char b
 *Main> :t Right "aaa"
 Right "aaa" :: Either a [Char]
 *Main> 

You see that "Either" is a union (much like the C/C++ "union") which could hold value of one of the two distinct types. However, unlike C/C++ "union", when presented with value of type "Either Int Char" we could immediately see whether its an Int or a Char - by looking at the constructor which was used to produce the value. Such datatypes are called "tagged unions", and they are another power tool in the Haskell toolset.

Did you also notice that we provide "parse" with parser, which is a monadic value, but receive not a new monadic value, but a parsing result? That is because "parse" is an evaluator for "Parser" monad, much like the GHC or Hugs runtime is an evaluator for the IO monad. The function "parser" implements all monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.

let's extend our "main" function to use "parse" and actually parse the input and show us the parsed data structures:

 main = do input <- getContents
           putStrLn ("DEBUG: got input " ++ input)
           let dirs = case parse parseInput "stdin" input of
                           Left err -> error $ "Input:\n" ++ show input ++ 
                                               "\nError:\n" ++ show err
                           Right result -> result
           putStrLn "DEBUG: parsed:"; print dirs

Exercise:

  • In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).
  • Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.
  • Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.

Let's try to run what we have so far:

 $ du -sb * | runhaskell ./cd-fit.hs
 
 DEBUG: got input 22325  Article.txt
 18928   Article.txt~
 1706    cd-fit.hs
 964     cd-fit.hs~
 61609   _darcs
 
 DEBUG: parsed:
 [Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]

Seems to be doing exactly as planned. Now let's try some erroneous input:

 $ echo "foo" | runhaskell cd-fit.hs
 DEBUG: got input foo
 
 DEBUG: parsed:
 *** Exception: Input:
 "foo\n"
 Error:
 "stdin" (line 1, column 1):
 unexpected "f"
 expecting digit or end of input

Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".


Here is complete "cd-fit.hs" what we should have written so far:

 module Main where
 
 import Text.ParserCombinators.Parsec
 
 -- Output of "du -sb" -- which is our input -- consists of many lines,
 -- each of which describes single directory
 parseInput = 
   do dirs <- many dirAndSize
      eof
      return dirs
 
 -- Information about single direcory is a size (number), some spaces,
 -- then directory name, which extends till newline
 data Dir = Dir Int String deriving Show
 dirAndSize = 
   do size <- many1 digit
      spaces
      dir_name <- anyChar `manyTill` newline
      return $ Dir (read size) dir_name
 
 main = do input <- getContents
           putStrLn ("DEBUG: got input " ++ input)
           let dirs = case parse parseInput "stdin" input of
                           Left err -> error $ "Input:\n" ++ show input ++ 
                                               "\nError:\n" ++ show err
                           Right result -> result
           putStrLn "DEBUG: parsed:"; print dirs
           -- compute solution and print it


4 Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!)

Enough preliminaries already. let's go pack some CDs.

As you might already have recognized, our problem is a classical one. It is called a "knapsack problem" (google it up, if you don't know already what it is. There are more than 100000 links).

let's start from the greedy solution, but first let's slightly modify our "Dir" datatype to allow easy extraction of its components:

 data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

Exercise: examine types of "Dir", "dir_size" and "dir_name"


From now on, we could use "dir_size d" to get a size of directory, and "dir_name d" to get its name, provided that "d" is of type "Dir".

The Greedy algorithm sorts directories from the biggest down, and tries to puts them on CD one by one, until there is no room for more. We will need to track which directories we added to CD, so let's add another datatype, and code this simple packing algorithm:

 import Data.List (sortBy)
 
 -- DirPack holds a set of directories which are to be stored on single CD.
 -- 'pack_size' could be calculated, but we will store it separately to reduce
 -- amount of calculation
 data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show
 
 -- For simplicity, let's assume that we deal with standard 700 Mb CDs for now
 media_size = 700*1024*1024
 
 -- Greedy packer tries to add directories one by one to initially empty 'DirPack'
 greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs
   where
   cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)
 
 -- Helper function, which only adds directory "d" to the pack "p" when new
 -- total size does not exceed media_size
 maybe_add_dir p d =
   let new_size = pack_size p + dir_size d
       new_dirs = d:(dirs p)
       in if new_size > media_size then p else DirPack new_size new_dirs

I'll highlight the areas which you could explore on your own (using other nice tutorials out there, of which I especially recommend "Yet Another Haskell Tutorial" by Hal Daume):

  • We choose to import a single function "sortBy" from a module Data.List, not the whole thing.
  • Instead of coding case-by-case recursive definition of "greedy_pack", we go with higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful function from the same category are "map", "foldr", "scanl" and "scanr". Look them up!
  • To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".
  • To code the quite complex function "maybe_add_dir", we introduced several local definitions in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.
  • Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"

In order to actually use our greedy packer we must call it from our "main" function, so let's add a lines:

 main = do ...
           -- compute solution and print it
           putStrLn "Solution:" ; print (greedy_pack dirs)

Verify integrity of our definitions by (re)loading our code in ghci. Compiles? Thought so :)

Now it is time to test our creation. We could do it by actually running it in the wild like this:

 $ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs

This will prove that our code seems to be working. At least, this once. How about establishing with reasonable degree of certainty that our code, parts and the whole, works properly, and doing so in re-usable manner? In other words, how about writing some test?

Java programmers used to JUnit probably thought about screens of boiler-plate code and hand-coded method invocations. Never fear, we will not do anything as silly :)

Enter QuickCheck.

QuickCheck is a tool to do automated testing of your functions using (semi)random input data. In the spirit of "100b of code examples is worth 1kb of praise" let's show the code for testing the following property: An attempt to pack directories returned by "greedy_pack" should return "DirPack" of exactly the same pack:

 import Test.QuickCheck
 import Control.Monad (liftM2)
 
 -- We must teach QuickCheck how to generate arbitrary "Dir"s
 instance Arbitrary Dir where
   -- Let's just skip "coarbitrary" for now, ok? 
   -- I promise, we will get back to it later :)
   coarbitrary = undefined
   -- We generate arbitrary "Dir" by generating random size and random name
   -- and stuffing them inside "Dir"
   arbitrary = liftM2 Dir gen_size gen_name
           -- Generate random size between 10 and 1400 Mb
     where gen_size = do s <- choose (10,1400)
                         return (s*1024*1024)
           -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" 
           gen_name = do n <- choose (1,300)
                         sequence $ take (n*10+1) $ repeat (elements "fubar/")
 
 -- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".
 -- Assume that "ds" will be a random list of "Dir"s and code your test.
 prop_greedy_pack_is_fixpoint ds =
   let pack = greedy_pack ds 
       in pack_size pack == pack_size (greedy_pack (dirs pack))

let's run the test, after which I'll explain how it all works:

 Prelude> :r
 Compiling Main             ( ./cd-fit.hs, interpreted )
 Ok, modules loaded: Main.
 *Main> quickCheck prop_greedy_pack_is_fixpoint
 [numbers spinning]
 OK, passed 100 tests.
 *Main> 

We've just seen our "greedy_pack" run on a 100 completely (well, almost completely) random lists of "Dir"s, and it seems that property indeed holds.

let's dissect the code. The most intriguing part is "instance Arbitrary Dir where", which declares that "Dir" is an instance of typeclass "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a bit.

What is a typeclass? A typeclass is a Haskell way of dealing with the following situation: suppose that you are writing a library of usefull functions and you dont know in advance how exactly they will be used, so you want to make them generic. Now, on one hand you dont want to restrict your users to certain type (e.g. String). On the other hand, you want to enforce the convention that arguments for your function must satisfy a certain set of constraints. That is where typeclass comes in handy.

Think of typeclass as a contract (or "interface", in Java terms) that your type must fulfill in order to be admitted as an argument to certain functions.

Let's examine the typeclass "Arbitrary":

 *Main> :i Arbitrary
 class Arbitrary a where
   arbitrary :: Gen a
   coarbitrary :: a -> Gen b -> Gen b
   	-- Imported from Test.QuickCheck
 instance Arbitrary Dir
   	-- Defined at ./cd-fit.hs:61:0
 instance Arbitrary Bool 	-- Imported from Test.QuickCheck
 instance Arbitrary Double 	-- Imported from Test.QuickCheck
 instance Arbitrary Float 	-- Imported from Test.QuickCheck
 instance Arbitrary Int 	-- Imported from Test.QuickCheck
 instance Arbitrary Integer 	-- Imported from Test.QuickCheck
 -- rest skipped --

It could be read this way: "Any type (let's name it 'a') could be a member of the class Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instance of class Arbitrary".

Now, if you write a function which operates on its arguments solely by means of "arbitrary" and "coarbitrary", you can be sure that this function will work on any type which is an instance of "Arbitrary"!

let's say it again. Someone (maybe even you) writes the code (API or library), which requires that input values implement certain interfaces, which is described in terms of functions. Once you show how your type implements this interface you are free to use API or library.

Conside the function "sort" from standard library:

 *Main> :t Data.List.sort
 Data.List.sort :: (Ord a) => [a] -> [a]

We see that it sorts lists of any values which are instance of typeclass "Ord". Let's examine that class:

 *Main> :i Ord
 class Eq a => Ord a where
   compare :: a -> a -> Ordering
   (<) :: a -> a -> Bool
   (>=) :: a -> a -> Bool
   (>) :: a -> a -> Bool
   (<=) :: a -> a -> Bool
   max :: a -> a -> a
   min :: a -> a -> a
 -- skip
 instance Ord Double 	-- Imported from GHC.Float
 instance Ord Float 	-- Imported from GHC.Float
 instance Ord Bool 	-- Imported from GHC.Base
 instance Ord Char 	-- Imported from GHC.Base
 instance Ord Integer 	-- Imported from GHC.Num
 instance Ord Int 	-- Imported from GHC.Base
 -- skip
 *Main> 

We see a couple of interesting things: first, there is an additional requirement listed: in order to be an instance of "Ord", type must first be an instance of typeclass "Eq". Then, we see that there is an awful lot of functions to define in order to be an instance of "Ord". Wait a second, isn't it silly to define both (<) and (>) when one could be expressed via another?

Right you are! Usually, typeclass contains several "default" implementation for its functions, when it is possible to express them through each other (as it is with "Ord"). In this case it is possible to supply only a minimal definition (which in case of "Ord" consists of any single function) and others will be automatically derived. If you supplied fewer functions than are required for minimal implementation, the compiler/interpreter will say so and explain which functions you still have to define.

Once again, we see that a lot of types are already instances of typeclass Ord, and thus we are able to sort them.

Now, let's take a look back to the definition of "Dir":

 data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

See that "deriving" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bound, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of that type!

Sidenote for Java programmers: just imagine java compiler which derives code for "implements Storable" for you...

Sidenote for C++ programmers: just imagine that deep copy constructors are being written for you by compiler....


Exercises:

  • Examine typeclasses Eq and Show
  • Examine types of (==) and "print"
  • Try to make "Dir" instance of "Eq"

OK, back to our tests. So, what we have had to do in order to make "Dir" an instance of "Arbitrary"? Minimal definition consists of "arbitrary". Let's examine it up close:

 *Main> :t arbitrary
 arbitrary :: (Arbitrary a) => Gen a

See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser a" which we've seen already. This is yet another example of action-returning function, which could be used inside "do"-notation. (You might ask yourself, wouldn't it be useful to generalize that convenient concept of actions and "do"? Of course! It is already done, the concept is called "Monad" and we will talk about it in Chapter 400 :) )

Since 'a' here is a type variable which is an instance of "Arbitrary", we could substitute "Dir" here. So, how we can make and return an action of type "Gen Dir"?

Let's look at the code:

   arbitrary = liftM2 Dir gen_size gen_name
           -- Generate random size between 10 and 1400 Mb
     where gen_size = do s <- choose (10,1400)
                         return (s*1024*1024)
           -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" 
           gen_name = do n <- choose (1,300)
                         sequence $ take (n*10+1) $ repeat (elements "fubar/")
 

We have used the library-provided functions "choose" and "elements" to build up "gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my word on that. Find a way to check types of "gen_name" and "gen_size"). Since "Int" and "String" are components of "Dir", we sure must be able to use "Gen Int" and "Gen String" to build "Gen Dir". But where is the "do" block for that? There is none, and there is only single call to "liftM2".

Let's examine it:

 *Main> :t liftM2
 liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r

Kind of scary, right? Let's provide typechecker with more context:

 *Main> :t liftM2 Dir
 liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir

Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -> Gen Dir". Exactly what we wanted!

Consider "liftM2" to be "advanced topic" of this chapter (which we will cover later) and just note for now that:

  • "2" is a number of arguments for data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"
  • There are also "liftM", "liftM3", "liftM4", "liftM5"
  • "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"

Hopefully, this will all make sense after you read it for the third time ;)

5 Chapter 4: REALLY packing the knapsack this time

In this chapter we are going to write several not-so-trivial packing methods, compare their efficiency, and learn something new about debugging and profiling of the Haskell programs along the way

6 Chapter 400: Monads up close

Google "All about monads" and read it. 'Nuff said :)

7 Chapter 500: IO up close

Shows that:

 c = do a <- someAction
        b <- someOtherAction
        print (bar b)
        print (foo a)
        print "done"
 

really is just a syntax sugar for:

 c = someAction >>= \a ->
     someOtherAction >>= \b ->
     print (bar b) >>
     print (foo a) >>
     print "done"
 

and explains about ">>=" and ">>". Oh wait. This was already explained in Chapter 400 :)

8 Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software

Plenty of material on this on the web and this wiki. Just go get yourself installation of GHC (6.4 or above) or Hugs (v200311 or above) and "darcs", which we will use for version control.

9 Chapter 10000: Thanks!

Thanks for comments, proofreading, good advice and kind words go to: Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin Percossi. If I should have mentioned YOU and forgot - tell me so.

Without you I would have stopped after Chapter 1 :)