HaskellWiki - User contributions [en] (https://wiki.haskell.org/api.php?action=feedcontributions&user=Adept&feedformat=atom), 2024-03-19T05:06:59Z, MediaWiki 1.35.5
User:Adept (https://wiki.haskell.org/index.php?title=User:Adept&diff=65501), 2023-01-11T22:10:30Z
<p>Adept: /* Personal trivia */</p>
<hr />
<div>== Personal trivia ==<br />
I am known as adept (or ADEpt) at #haskell<br />
<br />
You can reach me via dastapov-at-gmail-dot-com<br />
<br />
== Texts and articles ==<br />
[[QuickCheck as Test Set Generator]]<br />
<br />
[[Hitchhikers Guide to the Haskell]]<br />
<br />
Sources for both of them are available in my [http://adept.linux.kiev.ua:8080/repos darcs repo]</div>
Author: Adept
Hitchhikers guide to Haskell (https://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65499), 2023-01-09T12:26:55Z
<p>Adept: /* Chapter 400: Monads up close */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
Decide how to fit them on CD-Rs.<br />
Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return<br />
a String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
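A minimal sketch of that idea, assuming only the Prelude. The list name "actions" is made up for this illustration, and "sequence_" is just one ready-made way to execute a list of actions one by one:

```haskell
module Main where

-- A list of IO actions is an ordinary value: building it prints nothing.
actions :: [IO ()]
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]

main :: IO ()
main = do
  putStrLn "The list exists, but nothing from it has run yet"
  -- sequence_ executes the actions one by one, left to right
  sequence_ actions
  -- Being a plain list, it can be rearranged before it is executed
  sequence_ (reverse actions)
```
<br />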
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of these functions will be mentioned in the code of the function<br />
"main", to which the topmost IO action of any Haskell program is bound.<br />
<br />
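The snippets for "c" and "process" can be made runnable with a few stand-ins. In the sketch below, the bodies of "someAction", "someOtherAction", "foo" and "bar" are assumptions invented for illustration (the text never defines them), and "return" is used simply as an action that yields a given value without doing any I/O:

```haskell
module Main where

-- Hypothetical stand-ins for the names used in the text
someAction :: IO String
someAction = return "forty-two"

someOtherAction :: IO Int
someOtherAction = return 42

foo :: String -> Int
foo = length

bar :: Int -> Int
bar = (* 2)

-- "c" is only a description of what to do; defining it runs nothing
c :: IO ()
c = do a <- someAction
       b <- someOtherAction
       print (bar b)
       print (foo a)
       putStrLn "done"

-- "c" is used as a building block of a bigger action
process :: IO ()
process = do putStrLn "Will do some processing"
             c
             putStrLn "Done"

-- Everything runs only because "main" mentions "process"
main :: IO ()
main = process
```

Running it should print "Will do some processing", then 84, 9 and "done", and finally "Done".<br />
<br />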
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
  do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
      let {d = combine (b) (combine c c)}; <br />
           putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
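<br />
To see the "type of the last action" rule with something other than "IO ()", here is a small sketch. The action "answer" is made up for this illustration, and "return" is used as an action that simply yields a value:

```haskell
module Main where

-- The last action is "return 42", so the whole block has type IO Int
answer :: IO Int
answer = do putStrLn "Thinking..."
            return 42

-- The last action is a "putStrLn", so the whole block has type IO ()
main :: IO ()
main = do n <- answer
          putStrLn ("The answer is " ++ show n)
```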
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. The thing is, I must<br />
admit that I lied to you - "do" is not only used to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
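A small sketch of what the derived code gives us, assuming nothing beyond the Prelude: "show" turns a "Dir" into a String that looks like the expression used to build it.

```haskell
module Main where

data Dir = Dir Int String deriving Show

main :: IO ()
main = do
  let d = Dir 1 "foo"
  -- "show" comes from the derived Show instance
  putStrLn ("Parsed: " ++ show d)   -- Parsed: Dir 1 "foo"
  -- Lists of showable values are showable too
  putStrLn (show [d, Dir 2 "bar"])  -- [Dir 1 "foo",Dir 2 "bar"]
```
<br />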
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from single ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
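<br />
Partial application and the backtick syntax are not specific to Parsec. The sketch below illustrates both with plain Prelude functions; "volume" and "slab" are names invented for this illustration:

```haskell
module Main where

-- A function of three arguments (made up for this illustration)
volume :: Int -> Int -> Int -> Int
volume w h d = w * h * d

-- Supplying only two arguments yields a new function
-- that still waits for the third one
slab :: Int -> Int
slab = volume 2 3

main :: IO ()
main = do
  print (slab 4)                      -- same as volume 2 3 4
  print (10 `div` 3)                  -- backticks make "div" infix
  print (div 10 3)                    -- the same call in prefix form
  print (map (`div` 2) [10, 20, 30])  -- partially applied infix call
```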
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
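<br />
Here is a sketch of calling "parse" directly on string literals, without any IO input. It repeats the parsers from 'cd-fit-2-1.hs' with explicit type signatures added; the source name "example" and the sample strings are made up, and "Eq" is derived only so that results can be compared:

```haskell
module Main where

import Text.ParserCombinators.Parsec

data Dir = Dir Int String deriving (Show, Eq)

-- One line of "du -sb" output: size, spaces, name up to the newline
dirAndSize :: Parser Dir
dirAndSize =
  do size <- many1 digit
     spaces
     dir_name <- anyChar `manyTill` newline
     return (Dir (read size) dir_name)

-- Many such lines, followed by the end of input
parseInput :: Parser [Dir]
parseInput =
  do dirs <- many dirAndSize
     eof
     return dirs

main :: IO ()
main = do
  -- Well-formed input: we get a Right with the list of Dir
  print (parse parseInput "example" "10 a\n20 b\n")
  -- Malformed input: we get a Left with a ParseError
  print (parse parseInput "example" "oops\n")
```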
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
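<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch, where the function "describe" is a name invented for this illustration:

```haskell
module Main where

-- One equation per constructor: the tag tells us which type is inside
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ show c

main :: IO ()
main = do
  putStrLn (describe (Left 5))     -- an Int: 5
  putStrLn (describe (Right 'x'))  -- a Char: 'x'
```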
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
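A short sketch of the record syntax in use. The sample values are taken from the "du -sb" listing above; building a value by field name is shown as an alternative to the positional form:

```haskell
module Main where

data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

main :: IO ()
main = do
  let d1 = Dir 65572 "dir1"                          -- positional construction
      d2 = Dir {dir_name = "dir2", dir_size = 68268} -- construction by field name
  print (dir_size d1)     -- 65572
  putStrLn (dir_name d2)  -- dir2
  -- The derived Show instance now mentions the field names
  print d1                -- Dir {dir_size = 65572, dir_name = "dir1"}
```
<br />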
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on a CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 MB CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack',<br />
-- starting with the biggest directory<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments are swapped so that the sort is descending (biggest first)<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
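<br />
As a starting point for that exploration, here is a sketch that exercises a few of the functions mentioned above on toy data. The names "sizes" and "byLength" are made up for this illustration:

```haskell
module Main where

import Data.List (sortBy, maximumBy)

sizes :: [Int]
sizes = [30, 10, 20]

main :: IO ()
main = do
  -- foldl walks the list from the left: ((0 + 30) + 10) + 20
  print (foldl (+) 0 sizes)
  -- scanl does the same but keeps every intermediate result
  print (scanl (+) 0 sizes)
  -- sortBy takes a custom comparison; swapping a and b sorts descending
  print (sortBy (\a b -> compare b a) sizes)
  -- maximumBy follows the same "custom modifier" convention
  print (maximumBy byLength ["du", "runhaskell", "ghci"])
  where
    -- a local definition introduced with "where"
    byLength a b = compare (length a) (length b)
```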
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
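<br />
To see the packer at work without running "du", here is a self-contained sketch on made-up data. It restates the definitions with type signatures and sorts the biggest directory first, as the description of the greedy algorithm says; the helper "mb" and the list "sample" are assumptions introduced for this illustration:

```haskell
module Main where

import Data.List (sortBy)

data Dir = Dir {dir_size::Int, dir_name::String} deriving Show
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show

media_size :: Int
media_size = 700*1024*1024

greedy_pack :: [Dir] -> DirPack
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds
  where
    -- biggest directory first
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)

maybe_add_dir :: DirPack -> Dir -> DirPack
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
      new_dirs = d:(dirs p)
  in if new_size > media_size then p else DirPack new_size new_dirs

-- Hypothetical helper: megabytes to bytes
mb :: Int -> Int
mb n = n*1024*1024

-- Made-up input: four directories of 100, 400, 200 and 300 MB
sample :: [Dir]
sample = [ Dir (mb 100) "d100", Dir (mb 400) "d400"
         , Dir (mb 200) "d200", Dir (mb 300) "d300" ]

main :: IO ()
main = do
  let pack = greedy_pack sample
  -- d400 and d300 fill the CD exactly, so d200 and d100 are left out
  print (map dir_name (dirs pack))   -- ["d300","d400"]
  print (pack_size pack == mb 700)   -- True
```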
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a reusable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boilerplate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is a Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with the signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary.<br />
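<br />
If you are following along with QuickCheck 2 or later, the output of ":i Arbitrary" will look different: there "coarbitrary" lives in a separate typeclass "CoArbitrary", and "Arbitrary" asks only for "arbitrary" (plus an optional "shrink"). A minimal sketch of the "Dir" instance for that setup follows; the property "prop_size_positive" is made up purely to exercise the generator and is not part of cd-fit:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
-- With QuickCheck 2 only "arbitrary" is required;<br />
-- there is no "coarbitrary" to stub out in this class.<br />
instance Arbitrary Dir where<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- Hypothetical property, used here only to run the generator<br />
prop_size_positive :: Dir -> Bool<br />
prop_size_positive d = dir_size d > 0<br />
<br />
main :: IO ()<br />
main = quickCheck prop_size_positive<br />
</haskell><br />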
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
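<br />
To make the idea of a contract concrete, here is a small sketch. The typeclass "Sized", the types "File" and "Note" and the function "total_size" are all invented for this illustration; none of them belong to cd-fit or to any library:<br />
<br />
<haskell><br />
-- Hypothetical contract: anything that can report its size in bytes<br />
class Sized a where<br />
  size_of :: a -> Integer<br />
<br />
data File = File String Integer<br />
data Note = Note String<br />
<br />
-- Two unrelated types sign the same contract<br />
instance Sized File where<br />
  size_of (File _ bytes) = bytes<br />
<br />
instance Sized Note where<br />
  size_of (Note text) = fromIntegral (length text)<br />
<br />
-- Works for every type that fulfills the "Sized" contract<br />
total_size :: Sized a => [a] -> Integer<br />
total_size xs = sum (map size_of xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (total_size [File "a.txt" 10, File "b.txt" 32])<br />
  print (total_size [Note "hello", Note "hi"])<br />
</haskell><br />
<br />
Note that "total_size" never mentions "File" or "Note": it relies on "size_of" alone, so it keeps working for any instance added later.<br />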
<br />
Consider the function "sort" from standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or (<=)) and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
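<br />
As a sketch of such a minimal definition, the code below makes a made-up type "Track" (not part of cd-fit) an instance of "Ord" by supplying "compare" alone, and then sorts a list of tracks:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type, invented for this example<br />
data Track = Track { track_length :: Int, track_title :: String }<br />
  deriving (Eq, Show)<br />
<br />
-- Supplying "compare" alone is enough; (<), (>=), max and friends<br />
-- come from the default implementations.<br />
instance Ord Track where<br />
  compare a b = compare (track_length a, track_title a)<br />
                        (track_length b, track_title b)<br />
<br />
main :: IO ()<br />
main = do<br />
  let tracks = [Track 300 "b", Track 120 "a", Track 200 "c"]<br />
  print (map track_title (sort tracks))<br />
  print (Track 120 "a" < Track 200 "c")<br />
</haskell><br />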
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler....<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of a value that describes<br />
an action and that could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any type that is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you have already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
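<br />
If you would like to convince yourself that "liftM2" and the spelled-out "do" block really do the same thing, here is a small sketch. It uses "Maybe" (a standard monad from the Prelude) instead of "Gen" so that the results can be printed and compared; that choice, and the names "lifted" and "spelled_out", are mine and are made only for illustration:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq, Show)<br />
<br />
-- "liftM2 Dir" in the Maybe monad ...<br />
lifted :: Maybe Int -> Maybe String -> Maybe Dir<br />
lifted = liftM2 Dir<br />
<br />
-- ... and the same thing spelled out with "do"<br />
spelled_out :: Maybe Int -> Maybe String -> Maybe Dir<br />
spelled_out a1 a2 = do x <- a1<br />
                       y <- a2<br />
                       return (Dir x y)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (lifted (Just 10) (Just "foo"))<br />
  print (lifted (Just 10) (Just "foo") == spelled_out (Just 10) (Just "foo"))<br />
  print (lifted Nothing (Just "foo") == spelled_out Nothing (Just "foo"))<br />
</haskell><br />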
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter, we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if yes - in which particular way? Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
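<br />
The central trick of "precomputeDisksFor" is a lazily built list that serves as a lookup table and whose elements refer back to the list itself. Stripped of all the packing details, the same idea can be sketched on Fibonacci numbers (the example is my own choice and has nothing to do with cd-fit):<br />
<br />
<haskell><br />
-- Table of Fibonacci numbers, built the same way as "precomp":<br />
-- each element is computed on demand and may look up earlier elements.<br />
fibs :: [Integer]<br />
fibs = map fib [0..]<br />
  where fib :: Int -> Integer<br />
        fib 0 = 0<br />
        fib 1 = 1<br />
        fib n = fibs !! (n-1) + fibs !! (n-2)<br />
<br />
main :: IO ()<br />
main = print (fibs !! 50)<br />
</haskell><br />
<br />
Every element is calculated at most once, because later lookups find the already evaluated value in the list.<br />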
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in overflow: with GHC the result silently wraps around instead of raising an error.<br />
Standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
the 'Int' is a so-called "signed 32-bit integer" (on a 64-bit platform you will see -9223372036854775808 and 9223372036854775807 instead), which means that on a 32-bit system we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
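<br />
"Bounded" is not reserved for numbers, and it is one of the classes the compiler can derive. A small sketch with a made-up enumeration "Media" (not part of cd-fit), together with "Int32" from "Data.Int", which is 32 bits wide on every platform:<br />
<br />
<haskell><br />
import Data.Int (Int32)<br />
<br />
-- Hypothetical enumeration, invented for this example<br />
data Media = CD | DVD | BluRay deriving (Show, Eq, Enum, Bounded)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- Bounds derived by the compiler: first and last constructor<br />
  print (minBound :: Media, maxBound :: Media)<br />
  -- Together with Enum this lets us list all values<br />
  print [minBound .. maxBound :: Media]<br />
  -- Int32 has the same bounds regardless of the platform<br />
  print (maxBound :: Int32)<br />
</haskell><br />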
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to explicitly ask for that.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass <hask>Num</hask>, thus we could use <hask>fromInteger</hask> to turn an <hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
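<br />
To see the wrap-around for yourself without needing a 32-bit machine, here is a small sketch using "Int32" from "Data.Int". It also shows "genericIndex" from "Data.List", which accepts an "Integer" index directly. This is merely one possible way around the problem and not the route this chapter takes:<br />
<br />
<haskell><br />
import Data.Int (Int32)<br />
import Data.List (genericIndex)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- 2^32 + 5 does not fit into 32 bits; the conversion wraps around<br />
  print (fromInteger (2^32 + 5) :: Int32)<br />
  -- "genericIndex" works like (!!), but with any Integral index type<br />
  print (genericIndex "knapsack" (3 :: Integer))<br />
</haskell><br />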
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of element 100000 fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the column "individual %alloc". As we thought, nearly all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. Heap profile was saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those random-generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 88% of the run time<br />
and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
  -- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
  -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
  -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
  -- producing the disk of size <= limit. Let's do that for all "candidate"<br />
  -- dirs that are not yet on our disk:<br />
  case [ DirPack (dir_size d + s) (d:ds) <br />
       | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
       , dir_size d > 0<br />
       , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
       , d `notElem` ds<br />
       ] of<br />
         -- We either fail to add any dirs (probably, because all of them too big).<br />
         -- Well, just report that disk must be left empty:<br />
         [] -> DirPack 0 []<br />
         -- Or we produce some alternative packings. Let's choose the best of them all:<br />
         packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word on it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
  case [ DirPack (dir_size d + s) (d:ds) <br />
       | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
       , d <- small_enough<br />
       , dir_size d > 0<br />
       , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and the<br />
number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
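<br />
Instead of re-running the test by hand, you could also ask for more test cases per run. The sketch below assumes QuickCheck 2 or later, where "quickCheckWith", "stdArgs" and the field "maxSuccess" are available; the property "prop_reverse_twice" is a stand-in of my own, used only to show the call:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
<br />
-- Stand-in property, used only to show the call<br />
prop_reverse_twice :: [Int] -> Bool<br />
prop_reverse_twice xs = reverse (reverse xs) == xs<br />
<br />
-- Run 500 random test cases instead of the default 100<br />
main :: IO ()<br />
main = quickCheckWith stdArgs { maxSuccess = 500 } prop_reverse_twice<br />
</haskell><br />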
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of the real world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value can be '''either''' a string or a<br />
list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by colons. Thus, lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
-- First, we lookup 'Maybe AttValue' by name and<br />
-- check whether we are successful:<br />
case (lookup nm attrs) of<br />
-- Pass the lookup error through.<br />
Nothing -> Nothing <br />
    -- If the given name exists, see if it is a value or a reference:<br />
Just attv -> case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> <br />
let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
-- .. then, we will exclude lookup failures<br />
wo_failures = filter (/=Nothing) vals<br />
-- ... find a way to remove annoying 'Just' wrapper<br />
stripJust (Just v) = v<br />
-- ... use it to extract all lookup results as strings<br />
strings = map stripJust wo_failures<br />
in<br />
-- ... finally, combine them into single String. <br />
-- If all lookups failed, we should pass failure to caller.<br />
case null strings of<br />
True -> Nothing<br />
False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for such a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
-- First, we lookup 'AttValue' by name<br />
attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failed lookup already makes the whole block return Nothing by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into a single String,<br />
                     -- ... and return failure if it is empty. <br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never ever check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
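<br />
The "magic" can also be spelled out by hand. Here is a minimal sketch<br />
(the names <hask>fooDo</hask>, <hask>fooBind</hask> and <hask>fooCase</hask> are made up for<br />
illustration and are not part of the tutorial's source files) that writes the same<br />
computation three ways, assuming nothing beyond the Prelude:<br />
<br />
<haskell><br />
-- Illustrative names only; a sketch, not library code.<br />
<br />
-- The "do" version, as in the ghci session above<br />
fooDo :: Maybe Int -> Maybe Int<br />
fooDo x = do v <- x<br />
             return (v + 1)<br />
<br />
-- The same thing without the sugar: ">>=" passes the unwrapped value on<br />
fooBind :: Maybe Int -> Maybe Int<br />
fooBind x = x >>= \v -> return (v + 1)<br />
<br />
-- What ">>=" and "return" amount to for Maybe, written as an explicit case<br />
fooCase :: Maybe Int -> Maybe Int<br />
fooCase x = case x of<br />
              Nothing -> Nothing<br />
              Just v  -> Just (v + 1)<br />
</haskell><br />
<br />
All three should agree on every input, so the <hask>case</hask> that we dropped from<br />
<hask>lookupAttr</hask> has not disappeared: it has moved into the library.<br />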
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such<br />
a use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
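<br />
In other words, <hask>either</hask> is a <hask>case</hask> over <hask>Left</hask> and<br />
<hask>Right</hask> packaged as a function. A hand-written equivalent might look like this<br />
(<hask>myEither</hask> is a made-up name; this is a sketch, not the library source):<br />
<br />
<haskell><br />
-- 'myEither' is a hypothetical stand-in for the Prelude's 'either'<br />
myEither :: (a -> c) -> (b -> c) -> Either a b -> c<br />
myEither f _ (Left x)  = f x   -- apply the first function to a Left<br />
myEither _ g (Right y) = g y   -- apply the second function to a Right<br />
</haskell><br />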
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
attv <- lookup nm attrs<br />
either Just (dereference attrs) attv<br />
where<br />
dereference attrs refs = do <br />
vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
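<br />
If you want to go one step further, here is a sketch of the same function that<br />
leans on two more library functions: <hask>mapM</hask> and <hask>intercalate</hask>.<br />
The name <hask>lookupAttr3</hask> and the shortened sample data are made up for<br />
this illustration; the code is not taken from the tutorial's repository:<br />
<br />
<haskell><br />
-- A sketch under the assumptions stated above<br />
import Control.Monad (guard)<br />
import Data.List (intercalate)<br />
<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
type AttValue  = Either Value [Reference]<br />
type Attribute = (Name, AttValue)<br />
<br />
-- Shortened version of the sample data used earlier<br />
complex_attrs :: [Attribute]<br />
complex_attrs = [ ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns",    Left "jabber" )<br />
                , ( "subns", Left "client" ) ]<br />
<br />
lookupAttr3 :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr3 nm attrs = lookup nm attrs >>= either Just dereference<br />
  where<br />
    dereference refs = do<br />
      -- mapM f xs does the same job as sequence (map f xs)<br />
      vals <- mapM (\ref -> lookupAttr3 ref attrs) refs<br />
      guard (not (null vals))<br />
      -- intercalate sep xs does the same job as concat (intersperse sep xs)<br />
      return (intercalate ":" vals)<br />
</haskell><br />
<br />
With this definition, <hask>lookupAttr3 "xmlns" complex_attrs</hask> should again<br />
give <hask>Just "jabber:client"</hask>.<br />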
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the monad <hask>Maybe</hask>, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where it stops at the first failing action. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
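<br />
To see why, here is one way <hask>sequence</hask> could be written for lists. This is<br />
a sketch under a made-up name, <hask>mySequence</hask>; the library's own definition<br />
may differ in detail:<br />
<br />
<haskell><br />
-- 'mySequence' is a hypothetical re-implementation, for illustration only<br />
mySequence :: Monad m => [m a] -> m [a]<br />
mySequence []     = return []<br />
mySequence (x:xs) = do v  <- x<br />
                       vs <- mySequence xs<br />
                       return (v:vs)<br />
</haskell><br />
<br />
There is no mention of <hask>Nothing</hask> or of exceptions here. The "stop at<br />
the first failure" behavior comes entirely from what <hask><-</hask> means in the monad<br />
we happen to use.<br />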
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
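<br />
For <hask>Maybe</hask>, the effect of <hask>guard</hask> can be imitated with a<br />
two-line function. The sketch below uses a made-up name, <hask>guardMaybe</hask>;<br />
the real <hask>guard</hask> from <hask>Control.Monad</hask> is more general:<br />
<br />
<haskell><br />
-- 'guardMaybe' is hypothetical: it mimics 'guard' for the Maybe monad only<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()    -- condition holds: carry on<br />
guardMaybe False = Nothing    -- condition fails: the rest of the block is skipped<br />
<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do v <- x<br />
           guardMaybe (v /= 5)<br />
           return (v + 1)<br />
</haskell><br />
<br />
<hask>map foo [Just 4, Just 5, Just 6]</hask> gives the same result as the ghci session above.<br />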
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65498Hitchhikers guide to Haskell2023-01-09T12:26:36Z<p>Adept: /* Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
nearest future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
Decide how to fit them on CD-Rs.<br />
Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
putStrLn ("DEBUG: got input " ++ input)<br />
-- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from the stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
___ ___ _<br />
/ _ \ /\ /\/ __(_)<br />
/ /_\// /_/ / / | | GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_| Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
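Here is a small sketch of that idea. The names <hask>actions</hask> and<br />
<hask>runAll</hask> are made up for illustration; <hask>sequence_</hask> comes from the Prelude:<br />
<br />
<haskell><br />
-- Illustrative only: 'actions' and 'runAll' are hypothetical names<br />
actions :: [IO ()]<br />
actions = [ putStrLn "first", putStrLn "second", putStrLn "third" ]<br />
<br />
-- Building the list prints nothing: it merely holds three actions.<br />
-- sequence_ turns the list into one action that runs them in order.<br />
runAll :: IO ()<br />
runAll = sequence_ actions<br />
</haskell><br />
<br />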
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
c<br />
putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
do before<br />
putStrLn "In the middle"<br />
after<br />
<br />
main = do combine c c<br />
let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
let d = combine (b) (combine c c)<br />
putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
do dirs <- many dirAndSize<br />
eof :: Parser ()<br />
return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
do size <- many1 digit<br />
spaces<br />
dir_name <- anyChar `manyTill` newline<br />
return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* Examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* Compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
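<br />
Partial application and the backtick syntax can be tried out on plain Prelude functions before turning to the parser combinators. This is a small sketch; the names "firstThree", "quotient" and "halve" are invented for the example:<br />
<br />
<haskell><br />
-- 'take' needs two arguments; given only one, it yields a new function.<br />
firstThree :: [a] -> [a]<br />
firstThree = take 3<br />
<br />
-- Backticks turn any two-argument function into an infix operator.<br />
quotient :: Int<br />
quotient = 17 `div` 5      -- the same as: div 17 5<br />
<br />
-- A backticked function can be partially applied too (a "section").<br />
halve :: Int -> Int<br />
halve = (`div` 2)<br />
</haskell><br />
<br />
With these definitions, "firstThree "foobar"" gives "foo", "quotient" is 3 and "halve 10" is 5.<br />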
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you didn't have to specify the types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will produce either a parse error or, in our case, a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike with a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
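<br />
"Looking at the constructor" is done with pattern matching. The sketch below uses two illustrative functions, "describe" and "asString" (both names are assumptions of this example), together with the Prelude function "either":<br />
<br />
<haskell><br />
-- Pattern matching on the constructor tells us which alternative we hold.<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ show c<br />
<br />
-- The Prelude function 'either' does the same dispatch with two handlers.<br />
asString :: Either Int Char -> String<br />
asString = either show (\c -> [c])<br />
</haskell><br />
<br />
This is exactly what the "case ... of" with "Left err" and "Right result" does with the result of "parse" in our "main" function.<br />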
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++<br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine the type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference; examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine the types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
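Record syntax gives us a bit more than the two accessors. The following sketch repeats the "Dir" declaration so that it stands alone; the values "example", "other" and "renamed" are invented for illustration:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show<br />
<br />
-- Fields can be named when a value is constructed...<br />
example :: Dir<br />
example = Dir {dir_size = 1706, dir_name = "cd-fit.hs"}<br />
<br />
-- ...but positional construction still works with record types.<br />
other :: Dir<br />
other = Dir 964 "cd-fit.hs~"<br />
<br />
-- Record update syntax builds a modified copy; 'example' is unchanged.<br />
renamed :: Dir<br />
renamed = example {dir_name = "cd-fit-old.hs"}<br />
</haskell><br />
<br />
Here "dir_size renamed" is still 1706, and "dir_name example" is still "cd-fit.hs".<br />
<br />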
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Biggest first: note the swapped arguments to 'compare'<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We chose to import a single function, "sortBy", from the module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort the list of "Dir" by size only, we use a custom comparison function and a parametrized sort, "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs".<br />
----<br />
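<br />
The fold and the "By" family can be explored in isolation on a toy list. This is a sketch with made-up data ("files") and made-up names ("totalSize", "bySize", "biggest"); only "foldl", "sortBy", "maximumBy" and "comparing" come from the standard libraries:<br />
<br />
<haskell><br />
import Data.List (sortBy, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Hypothetical sample data: (name, size) pairs.<br />
files :: [(String, Int)]<br />
files = [("b.txt", 30), ("a.txt", 10), ("c.txt", 20)]<br />
<br />
-- 'foldl' threads an accumulator through the list, left to right.<br />
totalSize :: Int<br />
totalSize = foldl (\acc (_, size) -> acc + size) 0 files<br />
<br />
-- 'sortBy' takes the comparison as an argument; 'comparing snd'<br />
-- is shorthand for \x y -> compare (snd x) (snd y).<br />
bySize :: [(String, Int)]<br />
bySize = sortBy (comparing snd) files<br />
<br />
-- 'maximumBy' follows the same "By" convention.<br />
biggest :: (String, Int)<br />
biggest = maximumBy (comparing snd) files<br />
</haskell><br />
<br />
"totalSize" is 60, "bySize" lists a.txt, c.txt, b.txt in that order, and "biggest" is the b.txt entry.<br />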
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool for automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to repack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok?<br />
  -- I promise, we will get back to it later :)<br />
  -- (QuickCheck 2 moved "coarbitrary" into a separate class, CoArbitrary;<br />
  -- if you use that version, simply drop the next line.)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/"<br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
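<br />
It may help to see that a property is nothing magical. The sketch below checks a similar "fixpoint" property for the standard "sort" without QuickCheck, using a hand-written list of inputs ("samples", invented for this example) in place of random generation. It only illustrates the idea; it is not a replacement for QuickCheck:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- A property is an ordinary function that returns Bool.<br />
-- Sorting an already sorted list should change nothing.<br />
prop_sort_is_fixpoint :: [Int] -> Bool<br />
prop_sort_is_fixpoint xs =<br />
  let sorted = sort xs<br />
  in sorted == sort sorted<br />
<br />
-- Hypothetical hand-picked inputs standing in for QuickCheck's generator.<br />
samples :: [[Int]]<br />
samples = [[], [1], [3, 1, 2], [5, 5, 1], [10, 9 .. 1]]<br />
<br />
-- The property holds if it is True for every sample.<br />
checkSamples :: Bool<br />
checkSamples = all prop_sort_is_fixpoint samples<br />
</haskell><br />
<br />
What QuickCheck adds is the part we faked here: it produces the inputs itself, and for that it needs to know how to generate values of the argument type.<br />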
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is Haskell's way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy.<br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions.<br />
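<br />
As a small sketch of such a contract, consider a class made up for this example, "Sized", together with an invented type "File" and a library-style function "fitsOn". None of these exist in cd-fit or in the standard libraries:<br />
<br />
<haskell><br />
-- Hypothetical contract: anything that can report its size in bytes.<br />
class Sized a where<br />
  sizeOf :: a -> Integer<br />
<br />
-- A made-up type and its instance, i.e. its "signature on the contract".<br />
data File = File String Integer<br />
<br />
instance Sized File where<br />
  sizeOf (File _ bytes) = bytes<br />
<br />
-- Lists of sized things are themselves sized.<br />
instance Sized a => Sized [a] where<br />
  sizeOf = sum . map sizeOf<br />
<br />
-- Library-style function: it knows nothing about 'a' except the contract.<br />
fitsOn :: Sized a => Integer -> a -> Bool<br />
fitsOn capacity x = sizeOf x <= capacity<br />
</haskell><br />
<br />
"fitsOn 100 (File "a" 60)" is True, while "fitsOn 100 [File "a" 60, File "b" 60]" is False; the same function serves both types because both fulfil the contract.<br />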
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for its functions when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
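<br />
Here is a sketch of a minimal "Ord" instance written by hand. The type "Version" is invented for this example; only "compare" is defined, and everything else comes from the defaults:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type: a version number such as 1.4<br />
data Version = Version Int Int deriving (Eq, Show)<br />
<br />
-- Only 'compare' is written by hand; (<), (<=), (>), (>=), 'max' and<br />
-- 'min' come from the default implementations in the class.<br />
instance Ord Version where<br />
  compare (Version a b) (Version c d) =<br />
    case compare a c of<br />
      EQ    -> compare b d<br />
      other -> other<br />
</haskell><br />
<br />
After this, "Version 1 4 < Version 2 0" is True, "max (Version 1 4) (Version 1 10)" is "Version 1 10", and "sort" works on lists of "Version".<br />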
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
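<br />
To see what the derived instances actually do, here is a sketch on a throwaway type ("Track" and the values below are invented for illustration):<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type with three derived instances.<br />
data Track = Track Int String deriving (Eq, Ord, Show)<br />
<br />
tracks :: [Track]<br />
tracks = [Track 2 "b", Track 1 "z", Track 1 "a"]<br />
<br />
-- Derived Ord compares fields left to right, so this sorts by number<br />
-- first and by title second.<br />
sortedTracks :: [Track]<br />
sortedTracks = sort tracks<br />
<br />
-- Derived Show renders the value the way you would type it in.<br />
shown :: String<br />
shown = show (Track 1 "a")<br />
</haskell><br />
<br />
"sortedTracks" comes out as Track 1 "a", Track 1 "z", Track 2 "b", and "shown" is the string Track 1 "a".<br />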
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] standing for any type which is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/"<br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2".<br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
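<br />
Since "liftM2" works in any monad, it can be tried out without QuickCheck. The sketch below uses the "Maybe" and list monads; all the names on the left-hand sides are invented for this illustration:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
-- In the Maybe monad: both arguments must be present.<br />
sumJust :: Maybe Int<br />
sumJust = liftM2 (+) (Just 1) (Just 2)<br />
<br />
sumNothing :: Maybe Int<br />
sumNothing = liftM2 (+) (Just 1) Nothing<br />
<br />
-- The same thing spelled out with "do", following the definition of liftM2.<br />
sumWithDo :: Maybe Int<br />
sumWithDo = do<br />
  x <- Just 1<br />
  y <- Just 2<br />
  return (x + y)<br />
<br />
-- In the list monad: every combination of the two arguments.<br />
pairs :: [(Int, Char)]<br />
pairs = liftM2 (,) [1, 2] "ab"<br />
</haskell><br />
<br />
"sumJust" and "sumWithDo" are both "Just 3", "sumNothing" is "Nothing", and "pairs" contains all four combinations, starting with (1,'a').<br />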
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter, we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs =<br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..]<br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit =<br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
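<br />
The central trick, a lazily built list of solutions in which each element may refer back to earlier elements of the same list, can be seen in isolation on a much simpler problem. The following sketch (the name "memoFib" and the whole example are made up for illustration; it is not part of cd-fit) computes Fibonacci numbers the way "precomp" computes disks:<br />
<br />
<haskell><br />
-- Each entry of 'table' is computed at most once, on demand, and later<br />
-- entries look up earlier ones, just like 'precomp' and 'bestDisk' do.<br />
memoFib :: Int -> Integer<br />
memoFib n = table !! n<br />
  where<br />
    table = map fib [0 ..]<br />
    fib 0 = 0<br />
    fib 1 = 1<br />
    fib i = table !! (i - 1) + table !! (i - 2)<br />
</haskell><br />
<br />
"memoFib 50" returns 12586269025 almost instantly, whereas the naive doubly recursive definition would take far longer. Note that every "!!" still walks the list from its head, so a lookup at index n costs n steps.<br />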
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line.<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs.<br />
<br />
---- <br />
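<br />
For readers who have not met list comprehensions before, here is a sketch on a deliberately different example than the one in the exercise. The function names are invented; the second definition shows one possible way to write the same thing without the sugar:<br />
<br />
<haskell><br />
-- Squares of the even numbers in a list, written as a comprehension:<br />
-- a generator (x <- xs), a guard (even x) and a local binding (let).<br />
evenSquares :: [Int] -> [Int]<br />
evenSquares xs = [ sq | x <- xs, even x, let sq = x * x ]<br />
<br />
-- The same function without the sugar, using 'filter' and 'map'.<br />
evenSquares' :: [Int] -> [Int]<br />
evenSquares' xs = map (\x -> x * x) (filter even xs)<br />
</haskell><br />
<br />
Both versions turn [1..6] into [4,16,36].<br />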
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in an overflow: in GHC, the result silently wraps around instead of raising an error.<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
the 'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB. (On a 64-bit platform, GHC's 'Int' is 64 bits wide, so the bounds you see there will be much larger.)<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
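<br />
The same observations can be written down as a small sketch that does not depend on the width of 'Int' on your machine. The names are invented for the example, and the wrap-around behaviour shown is what GHC does, not something the language report promises:<br />
<br />
<haskell><br />
-- 'Int' is bounded; its limits depend on the platform.<br />
intLimits :: (Int, Int)<br />
intLimits = (minBound, maxBound)<br />
<br />
-- With GHC, going past the upper bound wraps around to the lower bound<br />
-- instead of raising an error.<br />
wrapsAround :: Bool<br />
wrapsAround = (maxBound :: Int) + 1 == minBound<br />
<br />
-- 'Integer' has no bounds: 2^100 is far outside any machine word.<br />
big :: Integer<br />
big = 2 ^ 100<br />
</haskell><br />
<br />
With GHC, "wrapsAround" is True, and "big" is 1267650600228229401496703205376.<br />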
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
Seems like the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts any type to some other type<br />
automatically; the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass <hask>Num</hask>, thus we could use <hask>fromInteger</hask> to turn an <hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the value of the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, an overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
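<br />
To see the danger in isolation, here is a sketch (not the fix used later in this text) that contrasts <hask>fromInteger</hask> with the checked conversion <hask>toIntegralSized</hask> from Data.Bits. The names "truncated" and "checked" are invented for the example, and the truncation shown is GHC's behaviour:<br />
<br />
<haskell><br />
import Data.Bits (toIntegralSized)<br />
<br />
-- 'fromInteger' never fails; with GHC a value that does not fit in an<br />
-- 'Int' is silently reduced modulo the word size.<br />
truncated :: Int<br />
truncated = fromInteger (2 ^ 64 + 5)<br />
<br />
-- 'toIntegralSized' is a checked alternative: it returns Nothing when<br />
-- the value does not fit into the target type.<br />
checked :: Integer -> Maybe Int<br />
checked = toIntegralSized<br />
</haskell><br />
<br />
With GHC, "truncated" is 5 on both 32-bit and 64-bit platforms, "checked 42" is "Just 42", and "checked (2^64)" is "Nothing".<br />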
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computing the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this:<br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory.<br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
COST CENTRE                    MODULE   %time %alloc<br />
<br />
precomputeDisksFor             Main      88.1   99.8<br />
dynamic_pack                   Main      11.0    0.0<br />
<br />
                                                      individual     inherited<br />
COST CENTRE                    MODULE   no.  entries  %time %alloc   %time %alloc<br />
<br />
MAIN                           MAIN       1        0    0.0    0.0   100.0  100.0<br />
CAF                            Main     174       11    0.9    0.2   100.0  100.0<br />
prop_dynamic_pack_small_disk   Main     181      100    0.0    0.0    99.1   99.8<br />
dynamic_pack                   Main     182      200   11.0    0.0    99.1   99.8<br />
precomputeDisksFor             Main     183      200   88.1   99.8    88.1   99.8<br />
main                           Main     180        1    0.0    0.0     0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size,<br />
onto the imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for roughly 90% of the<br />
run time and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>).<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up an element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
        , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably because all of them are too big).<br />
     -- Well, just report that disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
COST CENTRE                    MODULE      %time %alloc<br />
<br />
CAF                            GHC.Float     0.0    4.4<br />
main                           Main          0.0   93.9<br />
<br />
                                                      individual     inherited<br />
COST CENTRE                    MODULE   no.  entries  %time %alloc   %time %alloc<br />
MAIN                           MAIN       1        0    0.0    0.0     0.0  100.0<br />
main                           Main     180        1    0.0   93.9     0.0   94.2<br />
prop_dynamic_pack_small_disk   Main     181      100    0.0    0.0     0.0    0.3<br />
dynamic_pack                   Main     182      200    0.0    0.2     0.0    0.3<br />
bestDisk                       Main     183      200    0.0    0.1     0.0    0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of roughly 640 (from about 721 MB to about 1.1 MB)! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently:<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Run times may still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
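<br />
For reference, the pieces above can be assembled into one self-contained sketch. The <hask>Dir</hask> and <hask>DirPack</hask> declarations are assumptions reconstructed from the ghci output shown earlier (the field name <hask>dir_name</hask> in particular is a guess), the candidate filter is written as an explicit lambda instead of <hask>inRange</hask>, and <hask>comparing</hask> from <hask>Data.Ord</hask> stands in for <hask>cmpSize</hask>:<br />
<br />
<haskell><br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Assumed data types; only dir_size, pack_size and dirs appear in the text.<br />
data Dir = Dir { dir_size :: Integer, dir_name :: String }<br />
  deriving (Eq, Show)<br />
data DirPack = DirPack { pack_size :: Integer, dirs :: [Dir] }<br />
  deriving Show<br />
<br />
-- Most tightly packed disk of total size no more than 'limit'.<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d:rest)<br />
       -- keep only non-empty dirs that fit on the disk by themselves<br />
       | let small_enough = filter (\x -> dir_size x > 0 && dir_size x <= limit) ds<br />
       , d <- small_enough<br />
       -- pack the remaining space with the remaining candidates only<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
</haskell><br />
<br />
Because every recursive call receives a strictly shorter list, the recursion is guaranteed to stop, although the search is still exponential in the number of directories.<br />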
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisit monads once again. I will let other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, a lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing -> Nothing<br />
    -- If given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -><br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- .. then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into single String.<br />
          -- If all lookups failed, we should pass failure to caller.<br />
          case null strings of<br />
            True -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any lookup failure has already been propagated by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into single String,<br />
                     -- ... and return failure if it is empty.<br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way on our sample<br />
attribute lists (they differ when a reference points to a missing<br />
attribute: the first version skips it, the second fails as a whole). Try to<br />
write a QuickCheck test for that, defining the<br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- almost a two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens whenever we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
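<br />
Spelled out by hand, that check looks roughly like the sketch below. The name <hask>bindMaybe</hask> is made up for illustration; it mirrors what the library operator <hask>>>=</hask> does for <hask>Maybe</hask>, and <hask>incr'</hask> is what the "do" block in <hask>incr</hask> amounts to once the sugar is removed:<br />
<br />
<haskell><br />
-- Hand-written stand-in for (>>=) specialised to Maybe.<br />
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
bindMaybe Nothing  _ = Nothing   -- failure: skip the rest of the block<br />
bindMaybe (Just v) f = f v       -- success: unwrap and carry on<br />
<br />
-- With "do" notation ...<br />
incr :: Maybe Int -> Maybe Int<br />
incr x = do v <- x<br />
            return (v + 1)<br />
<br />
-- ... and the same thing without it.<br />
incr' :: Maybe Int -> Maybe Int<br />
incr' x = x `bindMaybe` \v -> Just (v + 1)<br />
</haskell><br />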
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such<br />
a use case has to be quite a common one.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
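<br />
If the type still looks scary, it may help to see that <hask>either</hask> can be written in two lines. This is a sketch with made-up names (<hask>either'</hask> to avoid clashing with the Prelude, <hask>describe</hask> for the example from the session above); the library definition may differ in detail:<br />
<br />
<haskell><br />
-- Apply the first function to a Left, the second one to a Right.<br />
either' :: (a -> c) -> (b -> c) -> Either a b -> c<br />
either' f _ (Left x)  = f x<br />
either' _ g (Right y) = g y<br />
<br />
-- Same shape as 'either (+1) (length)' in the ghci session.<br />
describe :: Either Int String -> Int<br />
describe = either' (+1) length<br />
</haskell><br />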
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
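<br />
If you want something to paste into a file and run, here is a self-contained sketch of the same idea. It is a variation rather than the code from 'chapter5-3.hs': the name <hask>lookupAttrM</hask> and the list <hask>sample</hask> are made up, <hask>mapM</hask> replaces <hask>sequence $ map ...</hask>, and <hask>intercalate</hask> from <hask>Data.List</hask> replaces <hask>concat (intersperse ...)</hask>:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intercalate)<br />
<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
type AttValue  = Either Value [Reference]<br />
type Attribute = (Name, AttValue)<br />
<br />
-- Hypothetical name, chosen to avoid clashing with the versions above.<br />
lookupAttrM :: Name -> [Attribute] -> Maybe Value<br />
lookupAttrM nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just dereference attv<br />
  where<br />
    dereference refs = do<br />
      -- mapM f xs is the same as sequence (map f xs)<br />
      vals <- mapM (\ref -> lookupAttrM ref attrs) refs<br />
      guard (not (null vals))<br />
      -- intercalate sep xs is the same as concat (intersperse sep xs)<br />
      return (intercalate ":" vals)<br />
<br />
sample :: [Attribute]<br />
sample = [ ("xmlns", Right ["ns", "subns"])<br />
         , ("ns",    Left "jabber")<br />
         , ("subns", Left "client") ]<br />
</haskell><br />
<br />
Like the versions above, this sketch does not protect against attributes that refer to themselves; such a cycle would make the lookup loop forever.<br />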
<br />
Now, as semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, which stops at the first failing action. Take into account that these behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
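<br />
One way to convince yourself is to write <hask>sequence</hask> using nothing but "do" notation. The sketch below (<hask>sequence'</hask> is our own name, and the library definition may differ in detail) never mentions <hask>Maybe</hask> or <hask>IO</hask>, so the "stop at the first failure" behavior has to come from the monad itself:<br />
<br />
<haskell><br />
-- Run the actions from left to right and collect their results.<br />
sequence' :: Monad m => [m a] -> m [a]<br />
sequence' []     = return []<br />
sequence' (a:as) = do v  <- a<br />
                      vs <- sequence' as<br />
                      return (v:vs)<br />
</haskell><br />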
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
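<br />
For <hask>Maybe</hask>, <hask>guard</hask> boils down to the sketch below (<hask>guardMaybe</hask> and <hask>safeDiv</hask> are made-up names; the real <hask>guard</hask> is more general and works for other types too). The sketch also covers <hask>flip</hask>, which the ghci sessions above did not show: it merely swaps the first two arguments of a function, so <hask>flip lookupAttr' attrs</hask> is a function that only awaits the attribute name:<br />
<br />
<haskell><br />
-- What guard does when the monad is Maybe.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()    -- condition holds: carry on<br />
guardMaybe False = Nothing    -- condition fails: abort the block<br />
<br />
-- Division that fails instead of crashing on a zero divisor.<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv x y = do guardMaybe (y /= 0)<br />
                 return (x `div` y)<br />
<br />
-- Hand-written flip: swap the first two arguments.<br />
flip' :: (a -> b -> c) -> b -> a -> c<br />
flip' f y x = f x y<br />
<br />
-- flip' (-) 1 10 means 10 - 1<br />
nine :: Int<br />
nine = flip' (-) 1 10<br />
</haskell><br />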
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65497Hitchhikers guide to Haskell2023-01-09T12:26:14Z<p>Adept: /* Chapter 6: Where do you want to go tomorrow? */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
nearest future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
       Decide how to fit them on CD-Rs.<br />
       Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that returns<br />
"IO String". The prefix "IO" means that this is an IO action. It will return<br />
a String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words, to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
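To make this concrete, here is a minimal sketch (not part of the "cd-fit" sources; the name "actions" is made up for illustration). It stores three IO actions in an ordinary list and then uses the Prelude function "sequence_" to execute them one by one:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A list of IO actions. Building the list performs no I/O at all.<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
-- "sequence_" executes the actions from left to right.<br />
-- Since actions are ordinary values, we can also reverse the list first.<br />
main :: IO ()<br />
main = do sequence_ actions<br />
          sequence_ (reverse actions)<br />
</haskell><br />
<br />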
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
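The snippets above leave "someAction", "someOtherAction", "foo" and "bar" undefined. Here is a minimal sketch that fills them in with made-up definitions (all four are hypothetical placeholders, chosen only so that the program compiles) to show how "c" and "process" fit together:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Hypothetical stand-ins for the names used in the text<br />
someAction :: IO String<br />
someAction = return "a-value"<br />
<br />
someOtherAction :: IO Int<br />
someOtherAction = return 41<br />
<br />
foo :: String -> Int<br />
foo = length<br />
<br />
bar :: Int -> Int<br />
bar n = n + 1<br />
<br />
-- "c" is built from smaller actions ...<br />
c :: IO ()<br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
<br />
-- ... and is itself a building block for "process"<br />
process :: IO ()<br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
<br />
main :: IO ()<br />
main = process<br />
</haskell><br />
<br />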
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and a "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
   do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
      let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexity from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
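<br />
To see that "do" is not tied to IO, here is a small sketch that uses it with "Maybe", another monad available from the Prelude. The table "sizes" and the function "totalOf" are invented for this example; the "crew" in this case takes care of stopping at the first failed lookup:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A made-up table of directory sizes<br />
sizes :: [(String, Int)]<br />
sizes = [("dir1", 65572), ("dir2", 68268)]<br />
<br />
-- The same "do" syntax, but no I/O is involved: "lookup" returns a Maybe,<br />
-- and the whole block yields Nothing as soon as one lookup fails<br />
totalOf :: String -> String -> Maybe Int<br />
totalOf x y = do a <- lookup x sizes<br />
                 b <- lookup y sizes<br />
                 return (a + b)<br />
<br />
main :: IO ()<br />
main = do print (totalOf "dir1" "dir2")<br />
          print (totalOf "dir1" "dir9")<br />
</haskell><br />
<br />
The first "print" shows "Just 133840"; the second shows "Nothing", because "dir9" is not in the table.<br />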
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
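A small sketch of both points, assuming the "D" spelling of the constructor shown above: the value is built with the data constructor, and the derived "Show" instance is what lets "print" display it.<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Datatype "Dir" with data constructor "D"<br />
data Dir = D Int String deriving Show<br />
<br />
main :: IO ()<br />
main = do let d = D 1 "foo"<br />
          print d             -- needs the derived Show instance<br />
          putStrLn (show d)   -- "show" turns the value into a String<br />
</haskell><br />
<br />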
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for "manyTill anyChar newline". Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
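<br />
Partial application and the backtick syntax are not specific to parsers. Here is a minimal sketch with two Prelude functions ("firstThree" is a made-up name):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- "take" needs two arguments: a count and a list.<br />
-- Here only the count is supplied, so the result is a new function.<br />
firstThree :: [a] -> [a]<br />
firstThree = take 3<br />
<br />
main :: IO ()<br />
main = do print (firstThree "foobar")<br />
          -- Backticks turn a two-argument function into an infix operator<br />
          print (3 `take` "foobar")<br />
          -- "drop 1" is another partially applied function, handed over to "map"<br />
          print (map (drop 1) ["foo", "bar"])<br />
</haskell><br />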
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of a source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will produce either a parse error or a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
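<br />
A minimal sketch of how the tag is inspected in practice (the function "describe" is hypothetical): pattern matching on the constructor tells us which of the two types we are holding.<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- One equation per constructor: the tag selects the equation<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ show c<br />
<br />
main :: IO ()<br />
main = do putStrLn (describe (Left 42))<br />
          putStrLn (describe (Right 'x'))<br />
</haskell><br />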
<br />
Did you also notice that we provide "parse" with parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
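<br />
As a sketch of "parse" in its role as an evaluator, consider a tiny made-up parser "number" (not part of "cd-fit") run on two inputs. It assumes the same Parsec import as the code above:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- A tiny parser: one or more digits, converted to an Int<br />
number :: Parser Int<br />
number = do ds <- many1 digit<br />
            return (read ds)<br />
<br />
main :: IO ()<br />
main = do print (parse number "example" "123abc")   -- Right 123<br />
          print (parse number "example" "abc")      -- a Left holding a ParseError<br />
</haskell><br />
<br />
Note that "number" by itself does nothing; only "parse" runs it against an input and hands back a plain "Either" value.<br />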
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start from the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
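A short sketch of the record syntax in use (the sample values are made up). Besides the accessor functions, it shows that a record can still be built positionally, or with named fields in any order:<br />
<br />
<haskell><br />
module Main where<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do let d = Dir 65572 "dir1"                            -- positional construction<br />
              e = Dir {dir_name = "dir2", dir_size = 68268}   -- named fields, any order<br />
          print (dir_size d)<br />
          putStrLn (dir_name e)<br />
          print e<br />
</haskell><br />
<br />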
The greedy algorithm sorts directories by size (as written below, "compare" puts the smallest first) and tries to put<br />
them on the CD one by one, skipping every directory that would not fit. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort a list of "Dir" by size only, we use a custom comparison function and the parametrized sort "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
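<br />
If "foldl", "sortBy" and "where" are new to you, here is a stripped-down sketch of the same ideas on plain numbers. The names "descending" and "fill" are invented for this example and are not part of "cd-fit":<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Data.List (sortBy)<br />
<br />
-- Sort from the biggest down by swapping the arguments of "compare"<br />
descending :: [Int] -> [Int]<br />
descending xs = sortBy cmp xs<br />
  where cmp x y = compare y x<br />
<br />
-- "foldl" threads an accumulator (the running total) through the list,<br />
-- skipping every number that would push the total over the limit<br />
fill :: Int -> [Int] -> Int<br />
fill limit xs = foldl step 0 xs<br />
  where step total x = if total + x > limit then total else total + x<br />
<br />
main :: IO ()<br />
main = do print (descending [3, 1, 2])<br />
          print (fill 10 [7, 5, 2, 4])<br />
</haskell><br />
<br />
This prints "[3,2,1]" and "9": 7 is taken, 5 is skipped (it would give 12), 2 is taken, 4 is skipped.<br />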
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a couple of lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
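<br />
Before looking at a real library class, here is a minimal sketch of such a contract. The class "HasSize", its function "byteSize" and the helper "report" are all invented for this example and exist in no library:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The contract: a type is admitted once it says how to measure itself<br />
class HasSize a where<br />
  byteSize :: a -> Int<br />
<br />
data Dir = Dir Int String<br />
<br />
-- "Dir" fulfils the contract<br />
instance HasSize Dir where<br />
  byteSize (Dir size _) = size<br />
<br />
-- A list has a size whenever its elements do<br />
instance HasSize a => HasSize [a] where<br />
  byteSize xs = sum (map byteSize xs)<br />
<br />
-- Works for every instance of "HasSize", present or future<br />
report :: HasSize a => a -> String<br />
report x = show (byteSize x) ++ " bytes"<br />
<br />
main :: IO ()<br />
main = do putStrLn (report (Dir 10 "a"))<br />
          putStrLn (report [Dir 10 "a", Dir 20 "b"])<br />
</haskell><br />
<br />
The function "report" knows nothing about "Dir" or lists; it relies only on the contract, which is exactly the point.<br />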
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary.<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
<br />
Consider the function "sort" from standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
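<br />
A small sketch of a minimal "Ord" instance, using a made-up type "Size" so as not to spoil the exercises below. Only "compare" is written by hand; the comparison operators and "sort" work through the defaults:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Data.List (sort)<br />
<br />
data Size = Size Int deriving Show<br />
<br />
-- "Ord" requires "Eq", so we provide that first<br />
instance Eq Size where<br />
  (Size a) == (Size b) = a == b<br />
<br />
-- Minimal definition: "compare" alone.<br />
-- (<), (>=), "max" and the rest come from the default implementations.<br />
instance Ord Size where<br />
  compare (Size a) (Size b) = compare a b<br />
<br />
main :: IO ()<br />
main = do print (Size 1 < Size 2)<br />
          print (sort [Size 3, Size 1, Size 2])<br />
</haskell><br />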
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
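<br />
A minimal sketch of deriving several instances at once, on a made-up type "Colour". The derived "Ord" instance follows the order in which the constructors are listed:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Data.List (sort)<br />
<br />
-- No instance code is written by hand; the compiler derives all three<br />
data Colour = Red | Green | Blue deriving (Eq, Ord, Show)<br />
<br />
main :: IO ()<br />
main = do print (Red == Red)                  -- uses Eq<br />
          print (sort [Blue, Red, Green])     -- uses Ord (and Show to print)<br />
          print (maximum [Green, Blue, Red])  -- Blue is listed last, so it is the largest<br />
</haskell><br />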
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
-- Generate random size between 10 and 1400 Mb<br />
where gen_size = do s <- choose (10,1400)<br />
return (s*1024*1024)<br />
-- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
gen_name = do n <- choose (1,300)<br />
replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
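<br />
If you want to play with "liftM2" without QuickCheck, here is a minimal sketch that uses "Maybe" in place of "Gen". It takes on trust that "Maybe" is a monad too (we will meet it in that role later), and the values are invented for the illustration:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq,Show)<br />
<br />
-- liftM2 applies the two-argument constructor "inside" the monad<br />
viaLiftM2 :: Maybe Dir<br />
viaLiftM2 = liftM2 Dir (Just 10) (Just "docs")<br />
<br />
-- The same thing spelled out with "do", following the definition of liftM2<br />
viaDo :: Maybe Dir<br />
viaDo = do<br />
  s <- Just 10<br />
  n <- Just "docs"<br />
  return (Dir s n)<br />
<br />
-- If either argument is Nothing, so is the result<br />
failed :: Maybe Dir<br />
failed = liftM2 Dir Nothing (Just "docs")<br />
</haskell><br />
<br />
Both "viaLiftM2" and "viaDo" build the same value, which is the whole point of the third bullet above.<br />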
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter, we are going to write another not-so-trivial packing<br />
method, compare packing methods' efficiency and learn something new<br />
about debugging and profiling the Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
-- By calculating `bestDisk' for all possible disk sizes, we could<br />
-- obtain a solution for particular case by simple lookup in our list of<br />
-- solutions :)<br />
let precomp = map bestDisk [0..] <br />
<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty<br />
bestDisk 0 = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit = <br />
-- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Lets do that for all "candidate"<br />
-- dirs that are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code quite complex implementation of <tt>precomputeDisksFor</tt> we split it up in several smaller pieces and put them as a '''local bindings''' inside '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose complex condition to filter the list of dirs<br />
<br />
---- <br />
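<br />
The last two points can be tried in isolation. The sketch below is not taken from 'cd-fit-4-1.hs'; the helper names "fitting" and "names" are invented for the illustration:<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq,Show)<br />
<br />
-- Function composition: first take dir_size, then test it with inRange.<br />
-- Keeps only the dirs whose size lies between 1 and limit (inclusive).<br />
fitting :: Int -> [Dir] -> [Dir]<br />
fitting limit = filter (inRange (1,limit) . dir_size)<br />
<br />
-- Pattern matching inside "let" in a list comprehension:<br />
-- de-construct each Dir and keep only its name.<br />
names :: [Dir] -> [String]<br />
names ds = [ n | d <- ds, let (Dir _ n) = d ]<br />
</haskell><br />
<br />
For example, <tt>names (fitting 20 [Dir 0 "empty", Dir 10 "a", Dir 20 "b", Dir 30 "c"])</tt> keeps only "a" and "b".<br />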
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in overflow: in GHC the value silently wraps around.<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate, will spot at once that in this case<br />
the 'Int' is so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
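<br />
The same effect can be captured in a tiny module. This is only a sketch, and it assumes GHC, where 'Int' arithmetic wraps around on overflow (the names "wrapped" and "exact" are invented for the illustration):<br />
<br />
<haskell><br />
-- One step past the largest Int: with GHC this wraps around to minBound<br />
wrapped :: Int<br />
wrapped = maxBound + 1<br />
<br />
-- The same computation done in Integer keeps growing<br />
exact :: Integer<br />
exact = toInteger (maxBound :: Int) + 1<br />
</haskell><br />
<br />
Compare <tt>wrapped == minBound</tt> and <tt>exact > toInteger (maxBound :: Int)</tt> in ghci; the actual numbers depend on the width of 'Int' on your platform.<br />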
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to explicitly ask for that.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass <hask>Num</hask>, so <hask>fromInteger</hask> can convert an <hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the value of the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
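<br />
For the curious, here is a sketch of two ways to stay out of trouble when an 'Integer' has to be used as an index. The helper "toIntChecked" is invented for this illustration and is not part of 'cd-fit-4-2.hs'; "genericIndex" comes from Data.List:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- Hypothetical helper: convert only when the value really fits into Int<br />
toIntChecked :: Integer -> Maybe Int<br />
toIntChecked n<br />
  | n >= lo && n <= hi = Just (fromInteger n)<br />
  | otherwise          = Nothing<br />
  where lo = toInteger (minBound :: Int)<br />
        hi = toInteger (maxBound :: Int)<br />
<br />
-- genericIndex is a version of (!!) that accepts any Integral index,<br />
-- so no conversion is needed at all<br />
third :: Char<br />
third = "haskell" `genericIndex` (2 :: Integer)<br />
</haskell><br />
<br />
Neither of them makes walking a list of several hundred million elements any cheaper, of course.<br />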
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack ds <br />
in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be a problem, since computation of 100000 fails to terminate in "reasonable" time, and to think that we have tried to compute <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack media_size ds <br />
in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
let pack = dynamic_pack 50000 ds<br />
in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. Heap profile was saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those random-generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - empty directory set.<br />
<br />
Our random directories are too big, but nevertheless code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce PostScript picture for<br />
this memory profile. This will give us detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
-- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Lets do that for all "candidate"<br />
-- dirs that are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word on it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with profile. According to a profile, program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
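<br />
For reference, here is how the pieces from the last two listings might fit together into one self-contained module. This is a sketch assembled from the fragments above rather than the actual 'cd-fit-4-5.hs' (the argument is renamed to "ds", "cmpSize" is moved into a "where" clause, and the sample directories in "example" are invented):<br />
<br />
<haskell><br />
import Data.List (maximumBy, delete)<br />
import Data.Ix (inRange)<br />
<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
-- Recursion base: nothing fits on an empty disk, nothing to pack from an empty list<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: try every dir that fits, pack the rest of the space<br />
-- using only the remaining candidates, and keep the fullest result<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d:rest)<br />
       | let small_enough = filter (inRange (0,limit) . dir_size) ds<br />
       , d <- small_enough<br />
       , dir_size d > 0<br />
       , let (DirPack s rest) = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy cmpSize packs<br />
  where cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
-- Made-up input: a disk of size 10 and four small dirs<br />
example :: DirPack<br />
example = bestDisk 10 [Dir 6 "a", Dir 5 "b", Dir 4 "c", Dir 3 "d"]<br />
</haskell><br />
<br />
On this tiny input the disk can be filled completely (6 + 4), so <tt>pack_size example</tt> should be 10.<br />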
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (See Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not to come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisit the monads once again. I will let the other sources teach you<br />
the theory behind the monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of the real world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name from the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr', but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
-- First, we lookup 'Maybe AttValue' by name and<br />
-- check whether we are successful:<br />
case (lookup nm attrs) of<br />
-- Pass the lookup error through.<br />
Nothing -> Nothing <br />
       -- If the given name exists, see if it is a value or a reference:<br />
Just attv -> case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> <br />
let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
-- .. then, we will exclude lookup failures<br />
wo_failures = filter (/=Nothing) vals<br />
-- ... find a way to remove annoying 'Just' wrapper<br />
stripJust (Just v) = v<br />
-- ... use it to extract all lookup results as strings<br />
strings = map stripJust wo_failures<br />
in<br />
-- ... finally, combine them into single String. <br />
-- If all lookups failed, we should pass failure to caller.<br />
case null strings of<br />
True -> Nothing<br />
False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into function<br />
<hask>lookupAttr</hask> in a way that would take into account possible<br />
failures.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
-- First, we lookup 'AttValue' by name<br />
attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
-- ... since all failures are already excluded by "monad magic",<br />
-- ... all all 'Just's have been removed likewise,<br />
-- ... we just combine values into single String,<br />
-- ... and return failure if it is empty. <br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that the<br />
type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
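<br />
As a rough sketch of what this "magic" amounts to for <hask>Maybe</hask>, the check can be spelled out by hand.<br />
The names <hask>andThen</hask>, <hask>fooDo</hask> and <hask>fooBind</hask> are made up for this illustration;<br />
<hask>andThen</hask> is only assumed to mimic what the library's bind operator does for <hask>Maybe</hask>:<br />
<br />
```haskell
-- A sketch only: 'andThen', 'fooDo' and 'fooBind' are hypothetical names.
-- 'andThen' mimics the failure check that happens between two actions.
andThen :: Maybe a -> (a -> Maybe b) -> Maybe b
andThen Nothing  _ = Nothing   -- failure: skip everything that follows
andThen (Just v) f = f v       -- success: unwrap the value and continue

-- The "foo" from the ghci session, written with do-notation ...
fooDo :: Maybe Int -> Maybe Int
fooDo x = do
  v <- x
  return (v + 1)

-- ... and the same computation with the check written out explicitly.
fooBind :: Maybe Int -> Maybe Int
fooBind x = x `andThen` \v -> Just (v + 1)
```
<br />
Both functions should give the same answers as <hask>foo</hask> in the session above, for <hask>Just 5</hask> as well as for <hask>Nothing</hask>.<br />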
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such a<br />
use case has to be quite common.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
This seems to be exactly our case. Let's replace the<br />
<hask>case</hask> with an invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
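<br />
For experimenting outside the chapter's source files, here is a self-contained sketch of the same idea.<br />
The type synonyms are assumptions (the real ones are defined earlier in the chapter), and <hask>lookupAttrM</hask> and <hask>sampleAttrs</hask> are hypothetical names.<br />
It also swaps in <hask>mapM</hask> and <hask>intercalate</hask> for the <hask>sequence</hask>/<hask>map</hask> and <hask>concat</hask>/<hask>intersperse</hask> pairs:<br />
<br />
```haskell
import Control.Monad (guard)
import Data.List (intercalate)

-- Assumed type synonyms; the tutorial's own definitions may differ.
type Name      = String
type Value     = String
type Reference = String
type AttValue  = Either Value [Reference]

-- Hypothetical variant of lookupAttr'': look the name up, then either
-- return the plain value or resolve every reference in turn.
lookupAttrM :: Name -> [(Name, AttValue)] -> Maybe String
lookupAttrM nm attrs = lookup nm attrs >>= either Just dereference
  where
    dereference refs = do
      -- mapM f xs is the same as sequence (map f xs)
      vals <- mapM (`lookupAttrM` attrs) refs
      guard (not (null vals))
      return (intercalate ":" vals)

-- Made-up sample data, including one dangling reference.
sampleAttrs :: [(Name, AttValue)]
sampleAttrs =
  [ ("xmlns",  Right ["ns", "subns"])
  , ("ns",     Left "jabber")
  , ("subns",  Left "client")
  , ("broken", Right ["missing"])
  ]
```
<br />
With this data, looking up "xmlns" should give <hask>Just "jabber:client"</hask>, while "broken" and any unknown name should give <hask>Nothing</hask>.<br />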
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
only until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where execution stops at the first failing action. Note that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
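<br />
To see why nothing needs to be hardcoded, here is one way such a function could be written.<br />
<hask>mySequence</hask> is a hypothetical re-implementation for illustration only; the library's actual definition may well look different.<br />
It uses nothing but "do" notation, so whatever the monad does between two actions decides when execution stops:<br />
<br />
```haskell
-- Hypothetical re-implementation of 'sequence', for illustration only.
-- It never mentions Maybe or IO: the "stop at the first failure"
-- behaviour comes entirely from the monad's own plumbing.
mySequence :: Monad m => [m a] -> m [a]
mySequence []       = return []
mySequence (a : as) = do
  x  <- a               -- run the first action
  xs <- mySequence as   -- then run the rest
  return (x : xs)       -- and collect the results in order
```
<br />
Trying it on the lists from the sessions above should give the same answers as <hask>sequence</hask>.<br />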
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
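<br />
A minimal sketch of how this could work for <hask>Maybe</hask> alone is shown below.<br />
<hask>guardMaybe</hask> and <hask>fooGuard</hask> are hypothetical names, and the real <hask>guard</hask> is more general than this <hask>Maybe</hask>-only stand-in:<br />
<br />
```haskell
-- Hypothetical, Maybe-only stand-in for 'guard'.
-- On True it succeeds with a dummy value; on False it fails,
-- which makes the rest of the "do" block get skipped.
guardMaybe :: Bool -> Maybe ()
guardMaybe True  = Just ()
guardMaybe False = Nothing

-- The "foo" from the session above, using the stand-in.
fooGuard :: Maybe Int -> Maybe Int
fooGuard x = do
  v <- x
  guardMaybe (v /= 5)
  return (v + 1)
```
<br />
Mapping <hask>fooGuard</hask> over the same list should reproduce the result of the session above.<br />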
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
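<br />
If you want to run the two forms side by side, the placeholders need concrete definitions.<br />
The sketch below invents some: <hask>someAction</hask>, <hask>someOtherAction</hask>, <hask>foo</hask> and <hask>bar</hask> are hypothetical stand-ins, and <hask>withDo</hask> and <hask>withBind</hask> are made-up names for the two versions of "c":<br />
<br />
```haskell
-- Hypothetical stand-ins for the placeholder names used in this chapter.
someAction :: IO Int
someAction = return 1

someOtherAction :: IO Int
someOtherAction = return 2

foo, bar :: Int -> Int
foo = (+ 10)
bar = (* 10)

-- The "do" version ...
withDo :: IO ()
withDo = do a <- someAction
            b <- someOtherAction
            print (bar b)
            print (foo a)
            print "done"

-- ... and the same thing written with (>>=) and (>>) directly.
withBind :: IO ()
withBind = someAction >>= \a ->
           someOtherAction >>= \b ->
           print (bar b) >>
           print (foo a) >>
           print "done"
```
<br />
Running <hask>withDo</hask> and then <hask>withBind</hask> in ghci should print the same three lines twice.<br />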
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65496Hitchhikers guide to Haskell2023-01-09T12:25:41Z<p>Adept: /* Chapter 5: (Ab)using monads and destructing constructors for fun and profit */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
of images ranging from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
Decide how to fit them on CD-Rs.<br />
Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
putStrLn ("DEBUG: got input " ++ input)<br />
-- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from the stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
___ ___ _<br />
/ _ \ /\ /\/ __(_)<br />
/ /_\// /_/ / / | | GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_| Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
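As a small sketch of that idea (the names <hask>actions</hask> and <hask>runAll</hask> are made up for this example), a list of IO actions is just an ordinary list.<br />
Building it prints nothing; the Prelude function <hask>sequence_</hask> turns the whole list into a single action that executes the elements one by one:<br />
<br />
```haskell
-- 'actions' and 'runAll' are hypothetical names for this sketch.
-- Building the list does not execute anything: actions are just values.
actions :: [IO ()]
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]

-- One action that executes the list from left to right.
runAll :: IO ()
runAll = sequence_ actions
```
<br />
Nothing is printed until <hask>runAll</hask> itself is executed, for example by typing it at the ghci prompt.<br />
<br />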
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially the "Command<br />
pattern" and the "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
  do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
      let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
, which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally name of the data[[type]] and its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
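<br />
Before trying the exercises, it may help to see partial application and the backtick syntax on something simpler than parsers.<br />
All names in this sketch (<hask>add3</hask>, <hask>addTo10</hask>, <hask>addTo15</hask>, <hask>viaBackticks</hask>) are made up for illustration:<br />
<br />
```haskell
-- Hypothetical example functions, unrelated to Parsec.
add3 :: Int -> Int -> Int -> Int
add3 x y z = x + y + z

-- Supplying one argument gives a function that still needs two ...
addTo10 :: Int -> Int -> Int
addTo10 = add3 10

-- ... and supplying another gives a function that needs one.
addTo15 :: Int -> Int
addTo15 = addTo10 5

-- Backticks let a two-argument function be written between its arguments.
viaBackticks :: Int
viaBackticks = 7 `div` 2   -- the same as: div 7 2
```
<br />
Asking ghci for the types of <hask>add3</hask>, <hask>add3 10</hask> and <hask>add3 10 5</hask> shows one argument disappearing at each step.<br />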
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and the<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
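<br />
To try this without piping anything through stdin, "parse" can be fed a string directly.<br />
The sketch below repeats the parser definitions from above so that it stands alone (with type signatures and an extra "Eq" added, which are not in the original code); the names <hask>goodSample</hask> and <hask>badSample</hask> and the source name "sample" are made up for this example:<br />
<br />
```haskell
import Text.ParserCombinators.Parsec

-- Same datatype and parsers as in the chapter, repeated here so that
-- the sketch is self-contained; signatures and Eq are additions.
data Dir = Dir Int String deriving (Show, Eq)

dirAndSize :: Parser Dir
dirAndSize = do
  size <- many1 digit
  spaces
  dir_name <- anyChar `manyTill` newline
  return (Dir (read size) dir_name)

parseInput :: Parser [Dir]
parseInput = do
  dirs <- many dirAndSize
  eof
  return dirs

-- Hypothetical test inputs: one well-formed, one not.
goodSample :: Either ParseError [Dir]
goodSample = parse parseInput "sample" "10 foo\n20 bar baz\n"

badSample :: Either ParseError [Dir]
badSample = parse parseInput "sample" "foo\n"
```
<br />
Here <hask>goodSample</hask> should be a "Right" holding two "Dir" values, and <hask>badSample</hask> a "Left" holding the parse error.<br />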
<br />
The datatype "Either" is an example of a datatype whose constructors have names that differ<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
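<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch, with <hask>describe</hask> being a made-up function name:<br />
<br />
```haskell
-- 'describe' is a hypothetical helper for illustration.
-- Each equation matches one constructor, so we always know which
-- of the two types we are holding.
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ show c
```
<br />
For example, <hask>describe (Left 5)</hask> takes the first equation and <hask>describe (Right 'x')</hask> the second.<br />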
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" to actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it], if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
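As a small sketch of what the record syntax gives us (the values <tt>example</tt>, <tt>exampleSize</tt> and <tt>renamed</tt> are invented for this illustration):<br />
<br />
```haskell
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show

-- The constructor can still be applied positionally...
example :: Dir
example = Dir 1706 "cd-fit.hs"

-- ...and the field names double as accessor functions.
exampleSize :: Int
exampleSize = dir_size example

-- Record update syntax builds a new value with one field changed;
-- the original value is left untouched.
renamed :: Dir
renamed = example { dir_name = "cd-fit-3-1.hs" }
```
<br />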
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort the list of "Dir"s by size only, we use a custom comparison function and the parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
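<br />
To make the "higher-order versus case-by-case" remark concrete, here is a simplified sketch that packs plain numbers instead of "Dir"s. All names in it (<tt>addIfFits</tt>, <tt>packFold</tt>, <tt>packRec</tt>, <tt>descending</tt>) are hypothetical and exist only for this comparison:<br />
<br />
```haskell
import Data.List (sortBy)

-- One step of the greedy packer: add x to the running total
-- only if the result still fits into the limit.
addIfFits :: Int -> Int -> Int -> Int
addIfFits limit total x = if total + x > limit then total else total + x

-- Higher-order version: foldl threads the running total through the list.
packFold :: Int -> [Int] -> Int
packFold limit = foldl (addIfFits limit) 0

-- The same traversal written as explicit case-by-case recursion.
packRec :: Int -> [Int] -> Int
packRec limit = go 0
  where go total []     = total
        go total (x:xs) = go (addIfFits limit total x) xs

-- sortBy with a flipped comparison sorts from the biggest down.
descending :: [Int] -> [Int]
descending = sortBy (flip compare)
```
<br />
Both packers compute the same result for every input; the fold simply hides the recursion pattern that <tt>go</tt> spells out.<br />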
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will show that our code seems to be working, at least this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to re-pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is a Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
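<br />
Here is a tiny sketch of such a contract. The class <tt>HasSize</tt>, the type <tt>File</tt> and the function <tt>fitsOn</tt> are hypothetical and serve only to illustrate the idea; they are not part of cd-fit or of any library:<br />
<br />
```haskell
-- The contract: a type belongs to HasSize once it can report its size.
class HasSize a where
  sizeInBytes :: a -> Integer

data File = File String Integer

-- A concrete type fulfils the contract by defining the function.
instance HasSize File where
  sizeInBytes (File _ s) = s

-- A list of things with a size has a size too: the sum of its elements.
instance HasSize a => HasSize [a] where
  sizeInBytes = sum . map sizeInBytes

-- A generic function: it works for every instance of HasSize and
-- uses its argument only through the class method.
fitsOn :: HasSize a => Integer -> a -> Bool
fitsOn capacity x = sizeInBytes x <= capacity
```
<br />
Note that <tt>fitsOn</tt> accepts both a single <tt>File</tt> and a list of them, without knowing anything about either type beyond the contract.<br />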
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement a certain ''interface'', which is<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or (<=)) and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
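<br />
A short sketch of a minimal "Ord" instance. The type <tt>Priority</tt> and the helper <tt>rank</tt> are invented for this example:<br />
<br />
```haskell
data Priority = Low | Medium | High deriving (Eq, Show)

-- Hypothetical helper: map each constructor to a number.
rank :: Priority -> Int
rank Low    = 0
rank Medium = 1
rank High   = 2

-- We define only 'compare'. The operators (<), (<=), (>), (>=) and the
-- functions 'max' and 'min' come from the default implementations.
instance Ord Priority where
  compare a b = compare (rank a) (rank b)
```
<br />
After loading this into ghci, expressions such as <tt>Low < High</tt> or <tt>maximum [Low, High, Medium]</tt> work even though we never defined (<) ourselves.<br />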
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
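<br />
A quick sketch of what deriving several classes at once buys us. The type <tt>Track</tt> is made up for this example, so as not to spoil the exercise below:<br />
<br />
```haskell
import Data.List (sort)

-- One line of "deriving" gives us equality, ordering and printing.
data Track = Track {track_length :: Int, track_title :: String}
  deriving (Eq, Ord, Show)

-- The derived Ord instance compares fields from left to right:
-- first by track_length, then by track_title.
sortedTracks :: [Track]
sortedTracks = sort [Track 300 "b", Track 120 "z", Track 300 "a"]
```
<br />
Evaluating <tt>sortedTracks</tt> in ghci puts the 120-second track first and orders the two 300-second tracks by title.<br />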
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] standing for any instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
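Since "liftM2" works in any monad, we can try it out without QuickCheck. The sketch below uses "Maybe" in place of "Gen"; the names <tt>viaLift</tt> and <tt>viaDo</tt> are hypothetical, and "Dir" is redeclared (with "Eq" added) only to keep the example self-contained:<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving (Eq, Show)

-- liftM2 turns the two-argument constructor Dir into a function
-- on monadic values.
viaLift :: Maybe Int -> Maybe String -> Maybe Dir
viaLift = liftM2 Dir

-- The same thing written out with "do", following the definition of liftM2.
viaDo :: Maybe Int -> Maybe String -> Maybe Dir
viaDo ms mn = do s <- ms
                 n <- mn
                 return (Dir s n)
```
<br />
In ghci, <tt>viaLift (Just 1) (Just "foo")</tt> yields <tt>Just (Dir 1 "foo")</tt> shown in record syntax, while a "Nothing" in either argument yields "Nothing".<br />
<br />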
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter, we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its run time,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
           -- We either fail to add any dirs (probably, because all of them are too big).<br />
           -- Well, just report that disk must be left empty:<br />
           [] -> DirPack 0 []<br />
           -- Or we produce some alternative packings. Let's choose the best of them all:<br />
           packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build a complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will silently overflow (in GHC, the result wraps around).<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
the 'Int' is a so-called "signed 32-bit integer" (on a 64-bit GHC the bounds are correspondingly larger), which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
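<br />
Because the exact bounds of 'Int' depend on the platform, the sketch below phrases the same observation in terms of "maxBound". It assumes GHC, where 'Int' arithmetic wraps around on overflow; the names <tt>wrapped</tt> and <tt>big</tt> are invented for this illustration:<br />
<br />
```haskell
-- In GHC, going one past the upper bound of Int wraps around
-- to the lower bound instead of raising an error.
wrapped :: Int
wrapped = maxBound + 1

-- Integer has no such limit: we can convert the bound with
-- 'toInteger' and keep counting.
big :: Integer
big = toInteger (maxBound :: Int) + 1
```
<br />
Comparing <tt>wrapped</tt> with <tt>minBound</tt> and <tt>big</tt> with <tt>toInteger (maxBound :: Int)</tt> in ghci shows the difference regardless of whether your 'Int' is 32 or 64 bits wide.<br />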
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of the typeclass <hask>Num</hask>, thus we can use <hask>fromInteger</hask> to convert an <hask>Integer</hask> to an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
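<br />
To see the danger in isolation, here is a small sketch, again assuming GHC's wrap-around behaviour for 'Int'. The names <tt>truncated</tt> and <tt>third</tt> are hypothetical. As an aside, it also shows "genericIndex" from Data.List, which accepts an 'Integer' index directly; this is not the route this tutorial takes, it is mentioned only for comparison:<br />
<br />
```haskell
import Data.List (genericIndex)

-- fromInteger does not complain when the Integer does not fit into Int:
-- in GHC the value silently wraps around.
truncated :: Int
truncated = fromInteger (toInteger (maxBound :: Int) + 1)

-- genericIndex is like (!!) but takes any Integral index,
-- so no conversion (and no silent wrap-around) is involved.
third :: Char
third = "foobar" `genericIndex` (2 :: Integer)
```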
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since the computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (newer GHC versions spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all the memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size,<br />
onto the imaginary disks of 50000 bytes... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 88% of the run<br />
time and 99.8% of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp". (Wondering why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>.)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up an element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds)<br />
     | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
     , dir_size d > 0<br />
     , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
     , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since there is no way to tell that a given<br />
element will never be needed again and can be garbage-collected.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this also deals with the possible Int overflow when indexing "precomp" via the (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let `bestDisk x' be the "most tightly packed" disk of total<br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best-packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit', bigger than 0, the best-packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably because all of them are too big).<br />
     -- Well, just report that the disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                     MODULE     %time %alloc<br />
<br />
 CAF                             GHC.Float    0.0    4.4<br />
 main                            Main         0.0   93.9<br />
<br />
                                                      individual     inherited<br />
 COST CENTRE                     MODULE  no. entries %time %alloc  %time %alloc<br />
 MAIN                            MAIN      1       0   0.0    0.0    0.0  100.0<br />
  main                           Main    180       1   0.0   93.9    0.0   94.2<br />
   prop_dynamic_pack_small_disk  Main    181     100   0.0    0.0    0.0    0.3<br />
    dynamic_pack                 Main    182     200   0.0    0.2    0.0    0.3<br />
     bestDisk                    Main    183     200   0.0    0.1    0.0    0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of about 700! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
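<br />
For readers who want to experiment without the full source file, here is a<br />
minimal self-contained sketch of the same idea. The <tt>Dir</tt> and<br />
<tt>DirPack</tt> declarations are assumed to mirror the ones from the earlier<br />
chapters, <tt>sampleDirs</tt> is made up for illustration, and the two size<br />
checks are folded into a single <tt>inRange (1, limit)</tt> filter:<br />

```haskell
import Data.Ix (inRange)
import Data.List (delete, maximumBy)
import Data.Ord (comparing)

-- Assumed to mirror the record types used earlier in the tutorial.
data Dir = Dir { dir_size :: Integer, dir_name :: String }
  deriving (Eq, Show)

data DirPack = DirPack { pack_size :: Integer, dirs :: [Dir] }
  deriving Show

-- Best packing of a disk of size `limit'. A directory that has been
-- placed on the disk is removed from the candidates of the recursive call.
bestDisk :: Integer -> [Dir] -> DirPack
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
bestDisk limit ds =
  case [ DirPack (dir_size d + s) (d : rest)
       | let small_enough = filter (inRange (1, limit) . dir_size) ds
       , d <- small_enough
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)
       ] of
    []    -> DirPack 0 []
    packs -> maximumBy (comparing pack_size) packs

-- Hypothetical input, for illustration only.
sampleDirs :: [Dir]
sampleDirs = [Dir 40 "a", Dir 30 "b", Dir 25 "c", Dir 10 "d"]
```

Loading this into ghci, <tt>pack_size (bestDisk 70 sampleDirs)</tt> should give 70<br />
(directories "a" and "b" fill the disk exactly), and a disk of size 5 stays empty.<br />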
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
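<br />
As a starting point for that exercise, here is one possible sketch of the<br />
scaling step. The names <tt>ceilDiv</tt> and <tt>scaleSizes</tt> are made up, and the<br />
rounding policy is an assumption of mine, not something the text prescribes:<br />
directory sizes are rounded up and the medium size is rounded down, so that<br />
any packing found for the scaled numbers still fits on the real medium (at the<br />
price of possibly missing some very tight packings):<br />

```haskell
-- Integer division that rounds up; assumes a positive divisor.
ceilDiv :: Integer -> Integer -> Integer
ceilDiv x unit = (x + unit - 1) `div` unit

-- Scale a medium size and a list of positive directory sizes by the size
-- of the smallest directory.
-- Returns (unit, scaled medium size, scaled directory sizes).
scaleSizes :: Integer -> [Integer] -> (Integer, Integer, [Integer])
scaleSizes limit sizes = (unit, limit `div` unit, map (`ceilDiv` unit) sizes)
  where
    -- With no directories there is nothing to scale by, so use 1.
    unit = if null sizes then 1 else minimum sizes
```

For example, <tt>scaleSizes 700 [100, 250, 330]</tt> gives <tt>(100, 7, [1, 3, 4])</tt>.<br />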
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value can be '''either''' a string or a list of<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of<br />
an attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by colons. Thus, a lookup of the attribute "xmlns" in both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we look up 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing<br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -><br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- ... then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove the annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into a single String.<br />
          -- If all lookups failed, we should pass the failure to the caller.<br />
          case null strings of<br />
            True  -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for such a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible<br />
failures into account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
import Data.List (intersperse)<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we look up 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do<br />
      vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
      -- ... since "monad magic" has already stopped us with Nothing on any failed lookup,<br />
      -- ... and all 'Just's have been removed likewise,<br />
      -- ... we just combine values into a single String,<br />
      -- ... and return failure if it is empty.<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
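<br />
A hint for this exercise: the two versions agree on the sample lists, but as<br />
far as I can tell they part ways when only some of the references resolve. The<br />
sketch below uses condensed versions of both functions (<tt>catMaybes</tt> and<br />
<tt>intercalate</tt> stand in for the hand-written filtering and for<br />
<tt>concat (intersperse ...)</tt>), and <tt>dangling_attrs</tt> is a made-up attribute<br />
list with one reference that points nowhere:<br />

```haskell
import Control.Monad (guard)
import Data.List (intercalate)
import Data.Maybe (catMaybes)

type Name      = String
type Value     = String
type Reference = String
type AttValue  = Either Value [Reference]
type Attribute = (Name, AttValue)

-- Condensed form of the case-based version: failed references are dropped.
lookupAttr :: Name -> [Attribute] -> Maybe Value
lookupAttr nm attrs =
  case lookup nm attrs of
    Nothing           -> Nothing
    Just (Left val)   -> Just val
    Just (Right refs) ->
      let strings = catMaybes [ lookupAttr ref attrs | ref <- refs ]
      in if null strings then Nothing else Just (intercalate ":" strings)

-- The monadic version: 'sequence' gives Nothing as soon as one reference fails.
lookupAttr' :: Name -> [Attribute] -> Maybe Value
lookupAttr' nm attrs = do
  attv <- lookup nm attrs
  case attv of
    Left val   -> Just val
    Right refs -> do
      vals <- sequence (map (flip lookupAttr' attrs) refs)
      guard (not (null vals))
      return (intercalate ":" vals)

-- Hypothetical attribute list with one dangling reference ("missing").
dangling_attrs :: [Attribute]
dangling_attrs = [ ("xmlns", Right ["ns", "missing"])
                 , ("ns",    Left "jabber") ]
```

Here <tt>lookupAttr "xmlns" dangling_attrs</tt> gives <tt>Just "jabber"</tt>, while<br />
<tt>lookupAttr' "xmlns" dangling_attrs</tt> gives <tt>Nothing</tt>. Which of the two<br />
is the "right" answer depends on what you want a broken reference to mean.<br />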
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- almost a two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never ever check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
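<br />
To make the "magic" a bit less magical, here is a small sketch that writes the<br />
first ghci example in three equivalent ways. The function names are made up for<br />
this illustration:<br />

```haskell
-- The do-block from the ghci session, given a name.
incDo :: Maybe Int -> Maybe Int
incDo x = do
  v <- x
  return (v + 1)

-- The same thing with (>>=) written out explicitly.
incBind :: Maybe Int -> Maybe Int
incBind x = x >>= \v -> return (v + 1)

-- What (>>=) and 'return' amount to for Maybe, spelled out by hand.
incCase :: Maybe Int -> Maybe Int
incCase x = case x of
  Nothing -> Nothing
  Just v  -> Just (v + 1)
```

All three give <tt>Just 6</tt> for <tt>Just 5</tt> and <tt>Nothing</tt> for <tt>Nothing</tt>.<br />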
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such a<br />
use case has to be quite a common one.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
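<br />
If the type signature still looks scary, it may help to see how little is<br />
behind it. Below is a hand-written equivalent (named <tt>either'</tt> here to avoid<br />
clashing with the Prelude; <tt>describe</tt> is a made-up example function):<br />

```haskell
-- A hand-written equivalent of the Prelude's 'either':
-- apply the first function to a Left, the second one to a Right.
either' :: (a -> c) -> (b -> c) -> Either a b -> c
either' f _ (Left x)  = f x
either' _ g (Right y) = g y

-- The example from the ghci session above, with the types fixed.
describe :: Either Int String -> Int
describe = either' (+1) length
```

As in the session above, <tt>describe (Left 5)</tt> is 6 and <tt>describe (Right "foo")</tt> is 3.<br />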
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with an invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, sequence continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, which stops at the first failing action. Take into<br />
account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
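<br />
To see why nothing needs to be hardcoded, here is one possible definition,<br />
written as a sketch (named <tt>sequence'</tt> to avoid clashing with the Prelude;<br />
the library's own definition is more general). It uses nothing but "do", so<br />
the stopping behavior comes entirely from the monad it is used with:<br />

```haskell
-- A possible definition of 'sequence' for lists: run the actions one by
-- one and collect their results. How failure is handled is decided by the
-- monad 'm', not by this code.
sequence' :: Monad m => [m a] -> m [a]
sequence' []       = return []
sequence' (m : ms) = do
  x  <- m
  xs <- sequence' ms
  return (x : xs)
```

Trying it on the examples above, <tt>sequence' [Just 'a', Just 'b']</tt> gives<br />
<tt>Just "ab"</tt> and <tt>sequence' [Just 'a', Nothing]</tt> gives <tt>Nothing</tt>.<br />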
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
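<br />
For the <hask>Maybe</hask> monad, the idea can be sketched in a few lines. The name<br />
<tt>guardMaybe</tt> is made up; the real <hask>guard</hask> from Control.Monad is more<br />
general and is not limited to <hask>Maybe</hask>:<br />

```haskell
-- What 'guard' boils down to in the Maybe monad: succeed with a dummy
-- value when the condition holds, fail otherwise.
guardMaybe :: Bool -> Maybe ()
guardMaybe True  = Just ()
guardMaybe False = Nothing

-- The 'foo' from the ghci session above, using the hand-written guard.
foo :: Maybe Int -> Maybe Int
foo x = do
  v <- x
  guardMaybe (v /= 5)
  return (v + 1)
```

With this definition, <tt>map foo [Just 4, Just 5, Just 6]</tt> gives the same<br />
<tt>[Just 5,Nothing,Just 7]</tt> as before.<br />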
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65495Hitchhikers guide to Haskell2023-01-09T12:25:09Z<p>Adept: /* Chapter 4: REALLY packing the knapsack this time */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
nearest future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
       Decide how to fit them on CD-Rs.<br />
       Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
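The "list of IO actions" idea can be tried out directly. In the sketch below,<br />
<tt>actions</tt> and <tt>runAll</tt> are names made up for this illustration (the Prelude<br />
function <tt>sequence_</tt> does the same job as <tt>runAll</tt>):<br />

```haskell
-- A list of IO actions is an ordinary value; building it prints nothing.
actions :: [IO ()]
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]

-- Execute the actions one by one, in list order.
runAll :: [IO ()] -> IO ()
runAll []            = return ()
runAll (act : rest) = do
  act
  runAll rest
```

Evaluating <tt>runAll actions</tt> in ghci prints the three words in order, while<br />
<tt>runAll (reverse actions)</tt> prints them backwards - the order of execution is<br />
decided by how we walk the list, not by how the list was written down.<br />
<br />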
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
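<br />
The rule "the type of a "do" block is the type of its last action" can be checked with a small sketch like this one (the name "greeting" is made up for the illustration; "return" simply wraps a plain value into an action):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The last action is 'return "Hello!"', so the whole block is an IO String<br />
greeting :: IO String<br />
greeting = do putStrLn "Computing a greeting"<br />
              return "Hello!"<br />
<br />
-- The last action is a putStrLn, so the whole block is an IO ()<br />
main :: IO ()<br />
main = do g <- greeting<br />
          putStrLn g<br />
</haskell><br />
<br />
Try swapping the two lines of "greeting" and see what the typechecker has to say about it.<br />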
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then trying to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
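<br />
To convince yourself that "do" is not tied to IO, here is a small sketch that uses the same notation with the standard "Maybe" type, where the hidden machinery is "stop as soon as one step fails" (the names "safeDiv" and "calc" are made up for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Division that reports failure instead of crashing<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- No IO in sight, yet we combine steps with "do".<br />
-- If any step yields Nothing, the whole block yields Nothing.<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do q <- safeDiv a b<br />
                r <- safeDiv q c<br />
                return (q + r)<br />
<br />
main :: IO ()<br />
main = do print (calc 100 5 2)<br />
          print (calc 100 0 2)<br />
</haskell><br />
<br />
The first line printed should be "Just 30", the second "Nothing". Notice that "calc" contains no explicit check for failure: the monad takes care of it.<br />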
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of a data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
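<br />
A small sketch with the "D" variant makes the two roles easier to tell apart: the type name appears in type signatures, while the constructor appears in expressions and patterns (the names "example" and "sizeOf" are made up for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The type is called Dir, its only constructor is called D<br />
data Dir = D Int String deriving Show<br />
<br />
-- The constructor is used to build values ...<br />
example :: Dir<br />
example = D 1 "foo"<br />
<br />
-- ... and to take them apart again by pattern matching<br />
sizeOf :: Dir -> Int<br />
sizeOf (D size _) = size<br />
<br />
main :: IO ()<br />
main = do print example<br />
          print (sizeOf example)<br />
</haskell><br />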
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of a source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names that differ<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
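<br />
"Looking at the constructor" is done by pattern matching. A minimal sketch (the function "describe" is made up for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- One equation per constructor: the "tag" tells us which type we hold<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
<br />
main :: IO ()<br />
main = mapM_ (putStrLn . describe) [Left 42, Right 'x']<br />
</haskell><br />
<br />
This should print "an Int: 42" followed by "a Char: x".<br />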
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
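<br />
Before wiring "parse" into "main", you can try it on literal strings. The following is only a sketch: "sizeAndName" is a made-up parser that returns a plain tuple so that the snippet stands on its own, and it assumes that the Parsec library is installed:<br />
<br />
<haskell><br />
module Main where<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- A number, some spaces, then the rest of the line<br />
sizeAndName :: Parser (Int, String)<br />
sizeAndName = do size <- many1 digit<br />
                 spaces<br />
                 name <- anyChar `manyTill` newline<br />
                 return (read size, name)<br />
<br />
main :: IO ()<br />
main = do -- Well-formed input: we get a "Right" with the parsed value<br />
          print (parse sizeAndName "example" "42 some/dir\n")<br />
          -- Malformed input: we get a "Left" with a ParseError inside<br />
          case parse sizeAndName "example" "oops\n" of<br />
            Left _  -> putStrLn "second input was rejected"<br />
            Right r -> print r<br />
</haskell><br />
<br />
The string "example" is only used as the source name in error messages.<br />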
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
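A quick sketch of the accessors in action (the directory names are made up; the last line also shows that a record can be constructed by naming its fields, in any order):<br />
<br />
<haskell><br />
module Main where<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do let d = Dir 65572 "photos/dir1"<br />
          -- Field names double as accessor functions<br />
          print (dir_size d)<br />
          putStrLn (dir_name d)<br />
          -- Named construction builds the same kind of value<br />
          print (Dir {dir_name = "photos/dir2", dir_size = 68268})<br />
</haskell><br />
<br />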
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, skipping those for which there is no room left. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on a single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- the amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 MB CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to an initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Arguments are compared in reverse, so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when the new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort a list of "Dir" by size only, we use a custom comparison function and the parametrized sort "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
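<br />
If the combination of "foldl" and "sortBy" looks dense, here is the same shape in miniature, with plain numbers instead of "Dir"s and a capacity of 10 instead of a CD. This is only a sketch; "capacity", "addIfFits" and "greedy" are made-up names:<br />
<br />
<haskell><br />
module Main where<br />
import Data.List (sortBy)<br />
<br />
capacity :: Int<br />
capacity = 10<br />
<br />
-- Add x to the running (total, chosen) pair only if it still fits<br />
addIfFits :: (Int, [Int]) -> Int -> (Int, [Int])<br />
addIfFits (total, chosen) x<br />
  | total + x > capacity = (total, chosen)<br />
  | otherwise            = (total + x, x : chosen)<br />
<br />
-- Sort biggest first, then fold over the list, carrying the pair along<br />
greedy :: [Int] -> (Int, [Int])<br />
greedy xs = foldl addIfFits (0, []) (sortBy biggestFirst xs)<br />
  where biggestFirst a b = compare b a<br />
<br />
main :: IO ()<br />
main = print (greedy [3, 8, 4, 1, 5])<br />
</haskell><br />
<br />
Here 8 is taken first, then 5, 4 and 3 are skipped because they would overflow the capacity, and finally 1 still fits, so the result is "(9,[1,8])". Note that 5, 4 and 1 together would have filled the capacity completely: greedy is simple, but not always optimal.<br />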
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is a Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It can be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements such an<br />
''interface'', you are free to use the API or library.<br />
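<br />
Here is the whole cycle in a small sketch: a contract, two types that fulfil it, and a generic function that relies on nothing but the contract. The class "HasSize" and the types "File" and "Note" are made up for this illustration:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The contract: anything that can report its size<br />
class HasSize a where<br />
  sizeOf :: a -> Int<br />
<br />
data File = File String Int<br />
data Note = Note String<br />
<br />
-- Each type shows how it fulfils the contract<br />
instance HasSize File where<br />
  sizeOf (File _ bytes) = bytes<br />
<br />
instance HasSize Note where<br />
  sizeOf (Note text) = length text<br />
<br />
-- Works for any type that is an instance of HasSize<br />
totalSize :: HasSize a => [a] -> Int<br />
totalSize xs = sum (map sizeOf xs)<br />
<br />
main :: IO ()<br />
main = do print (totalSize [File "a.txt" 10, File "b.txt" 32])<br />
          print (totalSize [Note "hello", Note "hi"])<br />
</haskell><br />
<br />
This should print 42 and 7. The author of "totalSize" never had to hear about "File" or "Note".<br />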
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal definition, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
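<br />
A short sketch of a minimal "Ord" instance (the type "Size" and the helper "rank" are made up for this illustration). Only "compare" is written by hand; everything else comes from the defaults:<br />
<br />
<haskell><br />
module Main where<br />
import Data.List (sort)<br />
<br />
data Size = Small | Medium | Large deriving (Eq, Show)<br />
<br />
-- Helper: map each constructor to a number we can compare<br />
rank :: Size -> Int<br />
rank Small  = 0<br />
rank Medium = 1<br />
rank Large  = 2<br />
<br />
-- Only "compare" is supplied; (<), (>=), max, min etc. use the defaults<br />
instance Ord Size where<br />
  compare a b = compare (rank a) (rank b)<br />
<br />
main :: IO ()<br />
main = do print (Small < Large)<br />
          print (max Medium Small)<br />
          print (sort [Large, Small, Medium])<br />
</haskell><br />
<br />
This should print "True", "Medium" and "[Small,Medium,Large]".<br />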
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
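<br />
A sketch of what we get for free (the type "Track" is made up for this illustration, so that "Dir" is left for you to play with). Derived comparison goes field by field, from left to right:<br />
<br />
<haskell><br />
module Main where<br />
import Data.List (sort)<br />
<br />
data Track = Track {track_length::Int, track_title::String}<br />
  deriving (Eq, Ord, Show)<br />
<br />
main :: IO ()<br />
main = do let a = Track 215 "Intro"<br />
              b = Track 180 "Outro"<br />
          print (a == b)       -- derived Eq<br />
          print (a < b)        -- derived Ord: 215 is compared with 180 first<br />
          print (sort [a, b])  -- derived Ord and Show together<br />
</haskell><br />
<br />
Both comparisons should print "False", and the sorted list should start with the shorter track.<br />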
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
        -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
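<br />
Since "liftM2" works in any monad, you can try it out without QuickCheck at all. The sketch below uses "Maybe" in place of "Gen"; "myLiftM2" is a made-up name for a hand-written copy of the definition quoted in the list above:<br />
<br />
<haskell><br />
module Main where<br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- The definition from the list above, spelled out as a "do" block<br />
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r<br />
myLiftM2 f a1 a2 = do x <- a1<br />
                      y <- a2<br />
                      return (f x y)<br />
<br />
main :: IO ()<br />
main = do -- The library version and our copy agree<br />
          print (liftM2 Dir (Just 1) (Just "foo"))<br />
          print (myLiftM2 Dir (Just 1) (Just "foo"))<br />
          -- If one of the ingredients is missing, so is the result<br />
          print (liftM2 Dir Nothing (Just "foo"))<br />
</haskell><br />
<br />
The first two lines printed should both be 'Just (Dir 1 "foo")', the third one "Nothing".<br />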
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter, we are going to write another not-so-trivial packing<br />
method, compare the packing methods' efficiency and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for a particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing a disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably because all of them are too big).<br />
                -- Well, just report that the disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- Once we have precomputed disks of all possible sizes for the given set of dirs, the solution to a<br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm as to write its implementation. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build a complex condition for filtering the list of dirs<br />
<br />
---- <br />
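<br />
As a warm-up for the list comprehension exercise, here is a small sketch (using a made-up comprehension, not the one from the exercise) of how a generator, a guard and a local binding can be spelled out with <tt>concatMap</tt>. The names <tt>sugared</tt> and <tt>desugared</tt> are invented for this illustration:<br />
<br />
<haskell><br />
-- Illustration only: `sugared' and `desugared' are hypothetical names.<br />
sugared :: [Int] -> [Int]<br />
sugared xs = [ y * 2 | x <- xs, x > 0, let y = x + 1 ]<br />
<br />
-- The same computation without the sugar: the generator becomes a<br />
-- `concatMap', the guard decides between a one-element list and an empty<br />
-- list, and the local binding becomes a plain `let'.<br />
desugared :: [Int] -> [Int]<br />
desugared xs = concatMap step xs<br />
  where<br />
    step x<br />
      | x > 0     = let y = x + 1 in [y * 2]<br />
      | otherwise = []<br />
</haskell><br />
<br />
Both functions should return the same list for any input, which is easy to check in ghci or with a QuickCheck property.<br />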
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds results in overflow (GHC does not<br />
report an error; the value silently wraps around).<br />
Standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
 Prelude> :i Bounded<br />
 class Bounded a where<br />
   minBound :: a<br />
   maxBound :: a<br />
 -- skip --<br />
 instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
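Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB. (On a 64-bit GHC the bounds are<br />
-9223372036854775808 and 9223372036854775807, and the overflow example<br />
below needs a bigger exponent, such as 2^64, to show the same effect.)<br />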
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
<br />
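The difference between the two types can also be probed from a source file. The following is a minimal sketch (the names are made up for this example); note that the wrap-around shown for <hask>Int</hask> is what GHC does in practice, and that the actual bounds depend on your platform:<br />
<br />
<haskell><br />
-- Illustration only: both names below are hypothetical.<br />
<br />
-- Adding 1 to the largest 'Int' does not raise an error in GHC:<br />
-- the value silently wraps around to the smallest 'Int'.<br />
wrappedInt :: Int<br />
wrappedInt = maxBound + 1<br />
<br />
-- The same arithmetic on 'Integer' simply produces a bigger number.<br />
biggerInteger :: Integer<br />
biggerInteger = toInteger (maxBound :: Int) + 1<br />
</haskell><br />
<br />
With GHC, <hask>wrappedInt == minBound</hask> should evaluate to <hask>True</hask>, while <hask>biggerInteger</hask> is one more than the largest <hask>Int</hask>.<br />
<br />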
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
 cd-fit-4-2.hs:73:79:<br />
     Couldn't match `Int' against `Integer'<br />
       Expected type: Int<br />
       Inferred type: Integer<br />
     In the expression: limit - (dir_size d)<br />
     In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
 cd-fit-4-2.hs:89:47:<br />
     Couldn't match `Int' against `Integer'<br />
       Expected type: Int<br />
       Inferred type: Integer<br />
     In the second argument of `(!!)', namely `media_size'<br />
     In the definition of `dynamic_pack':<br />
         dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
 Prelude> :i Num<br />
 class (Eq a, Show a) => Num a where<br />
   (+) :: a -> a -> a<br />
   (*) :: a -> a -> a<br />
   (-) :: a -> a -> a<br />
   negate :: a -> a<br />
   abs :: a -> a<br />
   signum :: a -> a<br />
   fromInteger :: Integer -> a<br />
         -- Imported from GHC.Num<br />
 instance Num Float -- Imported from GHC.Float<br />
 instance Num Double -- Imported from GHC.Float<br />
 instance Num Integer -- Imported from GHC.Num<br />
 instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass <hask>Num</hask>, so we can use <hask>fromInteger</hask> to convert an <hask>Integer</hask> into the <hask>Int</hask> that '(!!)' expects and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the value of the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
<br />
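In the meantime, here is a hedged sketch of two ways to make the conversion less error-prone. <hask>genericIndex</hask> from <tt>Data.List</tt> accepts any integral index, and a small range check makes a failed conversion explicit instead of silent. The helpers <hask>toIntChecked</hask> and <hask>elementAt</hask> are names made up for this example and are not part of <tt>cd-fit</tt>:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- Hypothetical helper: convert only when the value fits into 'Int'.<br />
toIntChecked :: Integer -> Maybe Int<br />
toIntChecked n<br />
  | n < toInteger (minBound :: Int) = Nothing<br />
  | n > toInteger (maxBound :: Int) = Nothing<br />
  | otherwise                       = Just (fromInteger n)<br />
<br />
-- Hypothetical helper: look up a list element by an 'Integer' index<br />
-- without converting the index to 'Int' first.<br />
elementAt :: [a] -> Integer -> a<br />
elementAt = genericIndex<br />
</haskell><br />
<br />
Keep in mind that <hask>genericIndex</hask> still walks the list element by element, so it fixes the type error and the silent overflow, but it does not make huge indices practical.<br />
<br />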
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them yet - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since the computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC versions spell the <tt>-auto-all</tt> flag as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                   MODULE  %time %alloc<br />
<br />
 precomputeDisksFor            Main     88.1   99.8<br />
 dynamic_pack                  Main     11.0    0.0<br />
<br />
                                                     individual      inherited<br />
 COST CENTRE                   MODULE  no.  entries  %time %alloc   %time %alloc<br />
<br />
 MAIN                          MAIN      1        0    0.0    0.0   100.0  100.0<br />
 CAF                           Main    174       11    0.9    0.2   100.0  100.0<br />
 prop_dynamic_pack_small_disk  Main    181      100    0.0    0.0    99.1   99.8<br />
 dynamic_pack                  Main    182      200   11.0    0.0    99.1   99.8<br />
 precomputeDisksFor            Main    183      200   88.1   99.8    88.1   99.8<br />
 main                          Main    180        1    0.0    0.0     0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for almost 90% of the<br />
run time and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
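<br />
The same pattern can be seen in miniature in the following sketch, which is not part of <tt>cd-fit</tt>: a lazily built list serves as a lookup table for a recursive function (the names <hask>fibs</hask> and <hask>fib</hask> are made up for this illustration). Every table entry that has been created stays reachable through the list for as long as the list itself is alive:<br />
<br />
<haskell><br />
-- Illustration only: a lazily filled lookup table, in the spirit of `precomp'.<br />
<br />
-- An infinite table of results; entries are computed on demand.<br />
fibs :: [Integer]<br />
fibs = map fib [0 ..]<br />
<br />
-- Each result (for n >= 0) is defined in terms of earlier table entries.<br />
fib :: Int -> Integer<br />
fib 0 = 0<br />
fib 1 = 1<br />
fib n = fibs !! (n - 1) + fibs !! (n - 2)<br />
</haskell><br />
<br />
Here <hask>fib 50</hask> returns quickly because earlier entries are shared through the table. The price is the one we have just observed: nothing in the table can be freed while the table is still in use.<br />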
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit', bigger than 0, the best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably because all of them are too big).<br />
     -- Well, just report that the disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                   MODULE     %time %alloc<br />
<br />
 CAF                           GHC.Float    0.0    4.4<br />
 main                          Main         0.0   93.9<br />
<br />
                                                     individual      inherited<br />
 COST CENTRE                   MODULE  no.  entries  %time %alloc   %time %alloc<br />
 MAIN                          MAIN      1        0    0.0    0.0     0.0  100.0<br />
 main                          Main    180        1    0.0   93.9     0.0   94.2<br />
 prop_dynamic_pack_small_disk  Main    181      100    0.0    0.0     0.0    0.3<br />
 dynamic_pack                  Main    182      200    0.0    0.2     0.0    0.3<br />
 bestDisk                      Main    183      200    0.0    0.1     0.0    0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are not lucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Run times could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
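For reference, the pieces of this version can be assembled into one small self-contained module. This is only a sketch under a few assumptions of mine: it repeats the data types from 'cd-fit-4-2.hs', names the directory list <tt>ds</tt>, uses <hask>comparing</hask> from <tt>Data.Ord</tt> in place of <tt>cmpSize</tt>, and folds the "non-empty" check into the <hask>inRange (1, limit)</hask> filter:<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving (Eq, Show)<br />
data DirPack = DirPack { pack_size :: Integer, dirs :: [Dir] } deriving Show<br />
<br />
-- Best packing of the given directories onto a disk of size `limit'.<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d : rest)<br />
         -- only directories that fit on the disk by themselves are candidates<br />
       | let small_enough = filter (inRange (1, limit) . dir_size) ds<br />
       , d <- small_enough<br />
         -- pack the remaining space using the remaining candidates<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
<br />
dynamic_pack :: Integer -> [Dir] -> DirPack<br />
dynamic_pack = bestDisk<br />
</haskell><br />
<br />
For instance, packing directories of sizes 6, 5 and 4 onto a disk of size 10 should give a pack of size 10 (the directories of sizes 6 and 4).<br />
<br />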
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
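<br />
One possible starting point for such scaling is sketched below. It works on plain sizes rather than on <tt>Dir</tt> values, and the helper name <hask>scaleProblem</hask> as well as the rounding policy are assumptions of this sketch: directory sizes are rounded up and the disk size is rounded down, so that any packing that fits the scaled problem also fits the original one (at the price of possibly missing the tightest packing):<br />
<br />
<haskell><br />
-- Illustration only: `scaleProblem' is a hypothetical helper.<br />
-- Express the disk size and all directory sizes in units of the<br />
-- smallest non-empty directory.<br />
scaleProblem :: Integer -> [Integer] -> (Integer, [Integer])<br />
scaleProblem limit sizes =<br />
  case filter (> 0) sizes of<br />
    []       -> (limit, sizes)                    -- nothing to scale by<br />
    positive -><br />
      let unit      = minimum positive<br />
          ceilDiv s = (s + unit - 1) `div` unit   -- round sizes up<br />
      in (limit `div` unit, map ceilDiv sizes)    -- round the limit down<br />
</haskell><br />
<br />
For example, <hask>scaleProblem 100 [10, 25, 30]</hask> gives <hask>(10, [1, 3, 3])</hask>.<br />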
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a lot of progress with Haskell, it's time we<br />
revisit monads once again. I will let other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attributes in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of the attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing<br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -><br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- .. then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into single String.<br />
          -- If all lookups failed, we should pass failure to caller.<br />
          case null strings of<br />
            True  -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failed lookup already aborts the block by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into single String,<br />
                     -- ... and return failure if it is empty.<br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
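<br />
To see that there is nothing supernatural going on, here is a sketch of the same function written three ways (the names are made up for this illustration): with <hask>do</hask>, with the bind operator (<hask>>>=</hask>) that a <hask>do</hask> block is translated into, and with the explicit <hask>case</hask> that bind amounts to for <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- Illustration only: three hypothetical spellings of the same function.<br />
<br />
-- With "do" notation, as in the ghci session above.<br />
incDo :: Maybe Int -> Maybe Int<br />
incDo x = do<br />
  v <- x<br />
  return (v + 1)<br />
<br />
-- The same thing with the sugar removed.<br />
incBind :: Maybe Int -> Maybe Int<br />
incBind x = x >>= \v -> return (v + 1)<br />
<br />
-- What bind and return do for Maybe, spelled out by hand.<br />
incCase :: Maybe Int -> Maybe Int<br />
incCase x = case x of<br />
  Nothing -> Nothing<br />
  Just v  -> Just (v + 1)<br />
</haskell><br />
<br />
All three should agree on every input: <hask>Just 5</hask> becomes <hask>Just 6</hask>, and <hask>Nothing</hask> stays <hask>Nothing</hask>.<br />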
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such a<br />
use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
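<br />
If you feel like polishing further, here is one more hedged variant, written as a self-contained module. The name <hask>lookupAttrM</hask> is made up for this sketch; it relies on two standard shortcuts: <hask>mapM f xs</hask> does the same job as <hask>sequence (map f xs)</hask>, and <hask>intercalate</hask> from <tt>Data.List</tt> does the same job as <hask>concat</hask> combined with <hask>intersperse</hask>:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intercalate)<br />
<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
type AttValue  = Either Value [Reference]<br />
type Attribute = (Name, AttValue)<br />
<br />
-- Hypothetical variant of lookupAttr'' (illustration only).<br />
lookupAttrM :: Name -> [Attribute] -> Maybe Value<br />
lookupAttrM nm attrs = lookup nm attrs >>= either Just dereference<br />
  where<br />
    -- Resolve every reference; a single failed lookup fails the whole thing.<br />
    dereference refs = do<br />
      vals <- mapM (`lookupAttrM` attrs) refs<br />
      guard (not (null vals))<br />
      return (intercalate ":" vals)<br />
</haskell><br />
<br />
Like the versions above, this sketch does not protect against attributes that refer to each other in a cycle; such input would make the lookup loop forever.<br />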
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where execution stops at the first failing action. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
<br />
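The sessions above cover "sequence" and "guard", but "flip" got no session of its own. Here is a small sketch to play with.<br />
The helper "half" is invented for this example; everything else comes from the Prelude and Control.Monad:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
import Control.Monad (guard)<br />
<br />
-- "guard" stops the computation with Nothing when x is odd<br />
half :: Int -> Maybe Int<br />
half x = do<br />
  guard (even x)<br />
  return (x `div` 2)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (div 10 2)                    -- 5<br />
  print (flip div 10 2)               -- same as "div 2 10", which is 0<br />
  print (map (flip div 2) [10, 20])   -- [5,10]<br />
  print (sequence (map half [4, 6]))  -- Just [2,3]<br />
  print (sequence (map half [4, 5]))  -- Nothing<br />
</haskell><br />
<br />
So "flip f" is "f" with its first two arguments swapped. That is why "flip lookupAttr'' attrs" is a function that still waits for the attribute name, which is exactly what "map" needs.<br />
<br />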
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
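If you want to see the two forms behave identically, here is a minimal runnable sketch.<br />
The definitions of "someAction", "someOtherAction", "foo" and "bar" are stand-ins invented for this example:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
someAction, someOtherAction :: IO Int<br />
someAction = return 1<br />
someOtherAction = return 2<br />
<br />
foo, bar :: Int -> Int<br />
foo = (+ 10)<br />
bar = (* 10)<br />
<br />
-- Written with "do" ...<br />
sugared :: IO ()<br />
sugared = do a <- someAction<br />
             b <- someOtherAction<br />
             print (bar b)<br />
             print (foo a)<br />
             print "done"<br />
<br />
-- ... and with the operators that "do" is translated into<br />
desugared :: IO ()<br />
desugared = someAction >>= \a -><br />
            someOtherAction >>= \b -><br />
            print (bar b) >><br />
            print (foo a) >><br />
            print "done"<br />
<br />
main :: IO ()<br />
main = sugared >> desugared<br />
</haskell><br />
<br />
Each of the two actions prints 20, then 11, then "done" (with the quotes, since we use "print" and not "putStrLn").<br />
<br />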
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65494Hitchhikers guide to Haskell2023-01-09T12:24:00Z<p>Adept: /* Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
       Decide how to fit them on CD-Rs.<br />
       Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from the stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
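The "list of IO actions" idea can be tried out right away. Below is a minimal sketch; the name "actions" is made up for the example, and "sequence_" is the Prelude function that runs a list of actions one by one:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
-- A plain list whose elements are IO actions.<br />
-- Building the list performs no I/O at all.<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
main :: IO ()<br />
main = do<br />
  putStrLn "The list exists, but nothing from it has been printed yet"<br />
  -- We decide the order of execution, here: back to front<br />
  sequence_ (reverse actions)<br />
</haskell><br />
<br />
This prints the first message and then "third", "second", "first": the actions were ordinary values until we chose to run them.<br />
<br />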
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate, topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput =<br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize =<br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
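To see "do" at work with no IO in sight, here is a minimal sketch in the <hask>Maybe</hask> monad. The functions "safeDiv" and "calc" are invented for this example:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
-- Division that fails (returns Nothing) instead of crashing on zero<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- The same "do" syntax, but the actions live in Maybe, not in IO.<br />
-- If any step yields Nothing, the whole block yields Nothing.<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do<br />
  q <- safeDiv a b<br />
  r <- safeDiv q c<br />
  return (q + r)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (calc 100 5 2)  -- Just 30<br />
  print (calc 100 0 2)  -- Nothing<br />
</haskell><br />
<br />
Only "main" does any IO here; "calc" is a pure function that merely uses the same notation.<br />
<br />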
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit.<br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of a data[[type]] and its [[constructor]] are<br />
chosen to be the same.<br />
<br />
Clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
<br />
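Partial application and the backtick syntax do not need a parsing library to be tried out. Here is a minimal sketch using only the Prelude; "add" and "addFive" are names invented for the example:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
-- "add" takes two arguments ...<br />
add :: Int -> Int -> Int<br />
add x y = x + y<br />
<br />
-- ... so giving it only one yields a new function awaiting the other<br />
addFive :: Int -> Int<br />
addFive = add 5<br />
<br />
main :: IO ()<br />
main = do<br />
  print (addFive 10)             -- 15<br />
  print (map (add 1) [1, 2, 3])  -- [2,3,4]<br />
  print (10 `add` 20)            -- backticks let us write a function infix: 30<br />
</haskell><br />
<br />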
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function, which,<br />
given a parser, a name of the source file or channel (e.g. "stdin"), and<br />
source data (String, which is a list of "Char"s, which is written "[Char]"),<br />
will either produce a parse error, or parse us a list of "Dir".<br />
<br />
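To get a feeling for "parse" before wiring it into "main", here is a minimal sketch with a much smaller parser. It assumes the Parsec library is installed, as the rest of this chapter does; the parser "number" is invented for this example:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- A tiny parser: one or more digits, converted to Int<br />
number :: Parser Int<br />
number = do ds <- many1 digit<br />
            return (read ds)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- The second argument is only a label used in error messages<br />
  print (parse number "example" "42abc")  -- Right 42<br />
  case parse number "example" "abc" of<br />
    Left _  -> putStrLn "parse error"<br />
    Right n -> print n<br />
</haskell><br />
<br />
The first call succeeds because "number" is happy with the leading digits and does not insist on reaching the end of input; the second one goes down the "Left" branch.<br />
<br />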
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which could<br />
hold a value of one of two distinct types. However, unlike the C/C++ "union",<br />
when presented with a value of type "Either Int Char" we could immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
<br />
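"Looking at the constructor" is done with pattern matching. A minimal sketch, with the function "describe" made up for this example:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
-- The constructor is the "tag": matching on it tells us which type is inside<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
<br />
main :: IO ()<br />
main = mapM_ (putStrLn . describe) [Left 42, Right 'x']<br />
</haskell><br />
<br />
This prints "an Int: 42" and "a Char: x". There is no way to read the Char out of a "Left" by mistake, which is what makes the union "tagged".<br />
<br />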
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++<br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
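Record syntax gives a bit more than the two accessors. The following minimal sketch shows what can be done with such a "Dir"; the sample values are made up:<br />
<br />
<haskell><br />
-- Hypothetical illustration, not taken from the tutorial sources<br />
module Main where<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do<br />
  let d = Dir 65572 "dir1"  -- positional construction still works<br />
      e = Dir {dir_name = "dir2", dir_size = 68268}  -- fields can be named, in any order<br />
  print (dir_size d)        -- 65572<br />
  putStrLn (dir_name e)     -- dir2<br />
  print (d {dir_size = 0})  -- record update builds a modified copy of "d"<br />
</haskell><br />
<br />
Note that the last line does not change "d"; it creates a new value that differs from "d" in one field.<br />
<br />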
The greedy algorithm sorts directories by size (smallest first, the way the code below is written) and tries to put<br />
them on CD one by one, until there is no room for more. We will need to track<br />
which directories we added to CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
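To watch the packer work without any real directories, here is a self-contained sketch assembled from the snippets above. Two things are assumptions made for the demo only: the toy "media_size" of 10 units and the four sample directories. The type signatures are added for clarity:<br />
<br />
<haskell><br />
-- A sketch for experimenting, based on the snippets from 'cd-fit-3-1.hs'<br />
module Main where<br />
import Data.List (sortBy)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- Assumption for this demo only: a toy "disc" holding 10 units<br />
media_size :: Int<br />
media_size = 10<br />
<br />
greedy_pack :: [Dir] -> DirPack<br />
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds<br />
  where<br />
    -- "compare x y" makes sortBy sort in ascending order<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
maybe_add_dir :: DirPack -> Dir -> DirPack<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
main :: IO ()<br />
main = do<br />
  let pack = greedy_pack [Dir 7 "a", Dir 4 "b", Dir 3 "c", Dir 5 "d"]<br />
  print (pack_size pack)            -- 7<br />
  print (map dir_name (dirs pack))  -- ["b","c"]<br />
</haskell><br />
<br />
The directories are visited as "c" (3), "b" (4), "d" (5), "a" (7); the first two fit, the other two would overflow the disc and are skipped. The disc ends up holding 7 units out of 10, so there is clearly room for a smarter packer.<br />
<br />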
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
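<br />
If you wonder what "foldl" is doing to the pack step by step, here is a small sketch that swaps in "scanl" to expose every intermediate accumulator. It is not part of 'cd-fit': the names "maybeAddDir", "greedyPackWith", "greedyTrace" and "sample" are made up for illustration, and the capacity is passed as an argument (an assumption of this sketch) so that tiny numbers can be used instead of "media_size".<br />
<br />
```haskell
import Data.List (sortBy)

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving (Eq, Show)

-- Hypothetical variant of "maybe_add_dir": the capacity is an argument
-- instead of the global "media_size".
maybeAddDir :: Int -> DirPack -> Dir -> DirPack
maybeAddDir capacity p d =
  let newSize = pack_size p + dir_size d
  in if newSize > capacity then p else DirPack newSize (d : dirs p)

-- Smallest directories first, as in "greedy_pack".
bySize :: [Dir] -> [Dir]
bySize = sortBy (\a b -> compare (dir_size a) (dir_size b))

-- "foldl" keeps only the final accumulator ...
greedyPackWith :: Int -> [Dir] -> DirPack
greedyPackWith capacity ds =
  foldl (maybeAddDir capacity) (DirPack 0 []) (bySize ds)

-- ... while "scanl" returns the initial pack and every intermediate one.
-- Here we only keep the pack sizes.
greedyTrace :: Int -> [Dir] -> [Int]
greedyTrace capacity ds =
  map pack_size (scanl (maybeAddDir capacity) (DirPack 0 []) (bySize ds))

-- Made-up sample data
sample :: [Dir]
sample = [Dir 5 "music", Dir 2 "docs", Dir 4 "photos"]
```
<br />
With a capacity of 7, "greedyTrace 7 sample" should give [0,2,6,6]: the directories of size 2 and 4 are added, and the one of size 5 is skipped because it would not fit.<br />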
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) <br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will show that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': packing the directories returned by "greedy_pack" once more should produce a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  -- (QuickCheck 2 moved "coarbitrary" into a separate class, "CoArbitrary",<br />
  -- so with QuickCheck 2 this line has to be dropped)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
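<br />
If you have no QuickCheck at hand, the same property can be checked by brute force over a handful of small inputs. The sketch below is not how QuickCheck works (it is exhaustive rather than random), and everything in it apart from "Dir" and "DirPack" is invented for illustration: "greedyPackWith" takes the capacity as an argument, and "smallDirs" is a made-up data set.<br />
<br />
```haskell
import Data.List (sortBy, subsequences)

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving Show

-- Greedy packer with the capacity as a parameter (an assumption of this
-- sketch; the tutorial's version reads the global "media_size").
greedyPackWith :: Int -> [Dir] -> DirPack
greedyPackWith capacity ds = foldl add (DirPack 0 []) (sortBy cmpSize ds)
  where
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)
    add p d
      | pack_size p + dir_size d > capacity = p
      | otherwise = DirPack (pack_size p + dir_size d) (d : dirs p)

-- The property from the text: repacking a pack does not change its size.
propFixpoint :: Int -> [Dir] -> Bool
propFixpoint capacity ds =
  let pack = greedyPackWith capacity ds
  in pack_size pack == pack_size (greedyPackWith capacity (dirs pack))

-- Six made-up directories; "subsequences" yields all 64 subsets of them.
smallDirs :: [Dir]
smallDirs = [Dir s ("d" ++ show s) | s <- [1, 2, 3, 5, 8, 13]]

-- Check the property on every subset, for a disk of capacity 10.
allSmallCasesHold :: Bool
allSmallCasesHold = all (propFixpoint 10) (subsequences smallDirs)
```
<br />
Exhaustive checks like this only scale to tiny inputs, which is exactly the gap that randomly generated test data is meant to fill.<br />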
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is Haskell's way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements such an<br />
''interface'', you are free to use the API or library.<br />
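<br />
To make the "contract" idea concrete, here is a tiny sketch. The class "HasSize", the type "Note" and the function "totalSize" are all invented for this illustration; nothing like them exists in 'cd-fit' or in the standard library.<br />
<br />
```haskell
-- Hypothetical contract: anything that can report its size.
class HasSize a where
  sizeOf :: a -> Integer

data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving Show

-- "Dir" fulfils the contract by reporting its recorded size.
instance HasSize Dir where
  sizeOf = dir_size

-- A second, unrelated type; its "size" is taken to be the text length
-- (an arbitrary choice made for this sketch).
newtype Note = Note String

instance HasSize Note where
  sizeOf (Note s) = toInteger (length s)

-- "Library" function: it knows nothing about Dir or Note and works for
-- any type that is an instance of HasSize.
totalSize :: HasSize a => [a] -> Integer
totalSize = sum . map sizeOf
```
<br />
The author of "totalSize" never had to hear about "Dir" or "Note"; the "HasSize a =>" part of the signature is the whole agreement between the two sides.<br />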
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for those of its functions that can be expressed through each other (as<br />
is the case with "Ord"). It is then enough to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be filled in automatically from the defaults. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
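<br />
Here is a small sketch of a minimal "Ord" instance written by hand. The type "Version" and the helpers "newest" and "sortedVersions" are made up for this example, so as not to spoil the exercises about "Dir" below.<br />
<br />
```haskell
import Data.List (sort)

-- Hypothetical type: a major and a minor version number.
data Version = Version Int Int deriving Show

-- "Ord" requires "Eq", so that comes first. Defining (==) is enough,
-- (/=) comes from the default implementation.
instance Eq Version where
  Version a b == Version c d = a == c && b == d

-- Only "compare" is defined here. (<), (<=), (>), (>=), "max" and "min"
-- are all supplied by the default implementations of the class.
instance Ord Version where
  compare (Version a b) (Version c d) = compare (a, b) (c, d)

-- Functions with an "Ord a" constraint now accept "Version".
-- Note: "maximum" fails on an empty list.
newest :: [Version] -> Version
newest = maximum

sortedVersions :: [Version] -> [Version]
sortedVersions = sort
```
<br />
With these two instances in place, "Version 1 2 < Version 1 10" is True, even though (<) was never written down.<br />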
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's look back at the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
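<br />
A quick sketch of what derived instances buy you. The types "Medium" and "File" and the helper names are invented for this illustration (they are not part of 'cd-fit').<br />
<br />
```haskell
import Data.List (sort)

-- Hypothetical enumeration with five derived instances.
data Medium = Floppy | CD | DVD
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Enum and Bounded together let us list every constructor.
allMedia :: [Medium]
allMedia = [minBound .. maxBound]

-- Hypothetical record. The derived Ord compares fields in declaration
-- order: first by size, then by name.
data File = File { file_size :: Int, file_name :: String }
  deriving (Eq, Ord, Show)

-- "sort" needs Ord, which the compiler has written for us.
smallestFirst :: [File] -> [File]
smallestFirst = sort
```
<br />
Derived "Ord" follows the order of the constructors, so "CD < DVD" holds here simply because "CD" is listed first.<br />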
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler....<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which is an instance of "Arbitrary", we can substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word for it. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
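<br />
Since "liftM2" works in any monad, you can play with it without QuickCheck. The sketch below uses "Maybe" and lists in place of "Gen" (a substitution made purely for illustration); the names "mkDir", "mkDir'" and "allDirs" are made up.<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)

-- "liftM2 Dir" in the Maybe monad: builds a Dir only when both
-- components are present.
mkDir :: Maybe Int -> Maybe String -> Maybe Dir
mkDir = liftM2 Dir

-- The same function spelled out with "do", following the definition
-- of liftM2 quoted in the text.
mkDir' :: Maybe Int -> Maybe String -> Maybe Dir
mkDir' ms mn = do
  s <- ms
  n <- mn
  return (Dir s n)

-- In the list monad, liftM2 combines every size with every name.
allDirs :: [Dir]
allDirs = liftM2 Dir [1, 2] ["a", "b"]
```
<br />
Here "mkDir (Just 10) Nothing" is "Nothing", while "allDirs" has four elements: one "Dir" per combination of size and name.<br />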
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its run time,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method", and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for a particular case by a simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing a disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- Once we have precomputed disks of all possible sizes for the given set of dirs, the solution to a<br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
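<br />
The trick of defining "precomp" as "map bestDisk [0..]" and letting "bestDisk" look up smaller answers in that very list may look circular. Here is the same pattern in miniature, using Fibonacci numbers instead of disks (an example chosen only because it is short; "fibs" and "fib" are not part of 'cd-fit'):<br />
<br />
```haskell
-- Lazily built table: element n holds "fib n". Nothing is computed
-- until somebody asks for a particular element.
fibs :: [Integer]
fibs = map fib [0 ..]

-- "fib" looks up smaller answers in the table instead of recomputing
-- them, just as "bestDisk" consults "precomp". Assumes n >= 0.
fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)
```
<br />
Thanks to laziness each element of "fibs" is computed at most once, so "fib 50" returns promptly, whereas a naive doubly recursive definition would take ages. Note, however, that the whole table stays in memory for as long as "fibs" is reachable.<br />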
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent fixed-size integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds results in an overflow: GHC does not report an error, the value silently wraps around.<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which lets us define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
the 'Int' is a so-called "signed 32-bit integer" (on a 64-bit platform GHC's 'Int' is 64 bits wide and the bounds are correspondingly larger), which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
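<br />
The exact numbers depend on your platform, so here is a sketch that demonstrates the same thing without hard-coding the width of 'Int'. The names "intMax", "exact" and "wrapped" are made up; the wrap-around shown is what GHC does, and other implementations may behave differently.<br />
<br />
```haskell
-- The largest Int on this platform, viewed as an Integer.
intMax :: Integer
intMax = toInteger (maxBound :: Int)

-- One past the largest Int, computed with Integer arithmetic:
-- no overflow, the result is simply one bigger.
exact :: Integer
exact = intMax + 1

-- The same sum computed with Int arithmetic: with GHC the value
-- wraps around to the smallest Int, and no error is reported.
wrapped :: Int
wrapped = maxBound + 1
```
<br />
With GHC, "wrapped == minBound" is True, which is exactly the kind of silent surprise we would like to avoid when adding up directory sizes.<br />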
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass <hask>Num</hask>, thus we can use <hask>fromInteger</hask> to turn an <hask>Integer</hask> into the <hask>Int</hask> that '(!!)' expects, and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the value of the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, an overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
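<br />
If you would rather not rely on luck, there are two simple ways out, sketched below. Neither is what 'cd-fit' does: "toIntChecked", "squares" and "squareAt" are names invented for this illustration. The first refuses to convert values that do not fit, the second uses "genericIndex" from Data.List, which accepts an index of any integral type.<br />
<br />
```haskell
import Data.List (genericIndex)

-- Checked conversion: Nothing when the value does not fit into an Int
-- on this platform, instead of silently wrapping around.
toIntChecked :: Integer -> Maybe Int
toIntChecked n
  | n < toInteger (minBound :: Int) = Nothing
  | n > toInteger (maxBound :: Int) = Nothing
  | otherwise = Just (fromInteger n)

-- An infinite list to index into (a stand-in for "precomp").
squares :: [Integer]
squares = map (\n -> n * n) [0 ..]

-- "genericIndex" is like (!!), but takes the index as any Integral
-- type, so no conversion to Int is needed. It fails on a negative
-- index, just as (!!) does.
squareAt :: Integer -> Integer
squareAt i = squares `genericIndex` i
```
<br />
Keep in mind that "genericIndex" still walks the list element by element, so it cures the overflow but not the cost of looking up a far-away element.<br />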
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem: computing element 100000 fails to terminate in "reasonable" time, and to think that we tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC releases spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size, onto<br />
imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which, according to the profile, is responsible for 88% of the run time and<br />
99.8% of the memory allocated) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element of "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that any given element<br />
will not be needed again and can be garbage collected.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing "precomp" via the (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) <br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
        , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
<br />
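For the adventurous, the scaling step itself might look roughly like the sketch below. The <tt>Dir</tt> record is assumed to match the one used earlier in this tutorial, and <tt>scaleProblem</tt> is a name made up for this illustration. Sizes are rounded up and the limit is rounded down, so any packing that fits the scaled disk also fits the real one; the price is that the scaled answer may be slightly worse than the exact one.<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Assumed to mirror the record type used by the tutorial's cd-fit code.<br />
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Show, Eq)<br />
<br />
-- Hypothetical helper: divide the limit and all sizes by the size of the<br />
-- smallest non-empty directory. Directory sizes are rounded up, the limit<br />
-- is rounded down.<br />
scaleProblem :: Int -> [Dir] -> (Int, [Dir])<br />
scaleProblem limit dirs =<br />
  case [ dir_size d | d <- dirs, dir_size d > 0 ] of<br />
    -- Nothing to scale by: leave the problem untouched.<br />
    []    -> (limit, dirs)<br />
    sizes -> let unit = minimum sizes<br />
                 up n = (n + unit - 1) `div` unit<br />
             in ( limit `div` unit<br />
                , [ d { dir_size = up (dir_size d) } | d <- dirs ] )<br />
<br />
main :: IO ()<br />
main = print (scaleProblem 700 [Dir 100 "a", Dir 250 "b", Dir 0 "empty"])<br />
</haskell><br />
<br />
The scaled limit and directories could then be fed to <tt>dynamic_pack</tt> unchanged; mapping the result back to the original sizes is left to the reader.<br />
<br />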
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of<br />
an attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of the attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr', but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing<br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -><br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- .. then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into single String.<br />
          -- If all lookups failed, we should pass failure to caller.<br />
          case null strings of<br />
            True  -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that would take into account possible<br />
failures.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
    -- First, we lookup 'AttValue' by name<br />
    attv <- lookup nm attrs<br />
    -- See if it is a value or a reference:<br />
    case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -> do<br />
        vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
        -- ... since all failures are already excluded by "monad magic",<br />
        -- ... and all 'Just's have been removed likewise,<br />
        -- ... we just combine values into single String,<br />
        -- ... and return failure if it is empty.<br />
        guard (not (null vals))<br />
        return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
<br />
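To take some of the magic out of it, here is a small sketch (the function names are made up for this example) of the same computation written three ways: with "do", with the bind operator ">>=", and with the check for <hask>Nothing</hask> spelled out by hand. All three should behave identically:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Written with "do", as in the ghci session above.<br />
incDo :: Maybe Int -> Maybe Int<br />
incDo x = do<br />
  v <- x<br />
  return (v + 1)<br />
<br />
-- The same thing with the bind operator spelled out.<br />
incBind :: Maybe Int -> Maybe Int<br />
incBind x = x >>= \v -> return (v + 1)<br />
<br />
-- The same thing with the check for Nothing written by hand.<br />
incCase :: Maybe Int -> Maybe Int<br />
incCase x = case x of<br />
  Nothing -> Nothing<br />
  Just v  -> Just (v + 1)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (map incDo   [Just 5, Nothing])<br />
  print (map incBind [Just 5, Nothing])<br />
  print (map incCase [Just 5, Nothing])<br />
</haskell><br />
<br />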
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such<br />
a use case has to be quite a common one.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
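If you want something you can compile in one go, here is a self-contained sketch of the same idea. It is not taken from the tutorial's repository: the names <hask>lookupAttr3</hask> and <hask>complexAttrs</hask> and the "broken" attribute are made up for this illustration, and it uses <hask>mapM</hask> and <hask>intercalate</hask> as shorthands for <hask>sequence . map</hask> and <hask>concat . intersperse</hask>. Note that in the monadic versions a single unresolved reference makes the whole lookup fail:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Control.Monad (guard)<br />
import Data.List (intercalate)<br />
<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
type AttValue  = Either Value [Reference]<br />
type Attribute = (Name, AttValue)<br />
<br />
-- Hypothetical sample data, including one dangling reference.<br />
complexAttrs :: [Attribute]<br />
complexAttrs =<br />
  [ ("xmlns",  Right ["ns", "subns"])<br />
  , ("ns",     Left "jabber")<br />
  , ("subns",  Left "client")<br />
  , ("broken", Right ["missing"])<br />
  ]<br />
<br />
-- mapM f xs is the same as sequence (map f xs);<br />
-- intercalate sep xs is the same as concat (intersperse sep xs).<br />
lookupAttr3 :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr3 nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just dereference attv<br />
  where<br />
    dereference refs = do<br />
      vals <- mapM (\ref -> lookupAttr3 ref attrs) refs<br />
      guard (not (null vals))<br />
      return (intercalate ":" vals)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (lookupAttr3 "xmlns"  complexAttrs)  -- all references resolve<br />
  print (lookupAttr3 "broken" complexAttrs)  -- a reference is missing<br />
  print (lookupAttr3 "absent" complexAttrs)  -- the name itself is missing<br />
</haskell><br />
<br />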
Now, as semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where it stops at the first failing action. Note that these behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>: they come from the monads themselves!<br />
<br />
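To see why nothing needs to be hardcoded, consider the following sketch of a hand-written stand-in for <hask>sequence</hask> (the name <hask>mySequence</hask> is made up, and the library version is more general). It only uses "do" and <hask>return</hask>, so whatever "magic" the particular monad provides kicks in automatically:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A hand-written stand-in for the library's 'sequence'.<br />
-- It never mentions Nothing or exceptions.<br />
mySequence :: Monad m => [m a] -> m [a]<br />
mySequence []     = return []<br />
mySequence (m:ms) = do<br />
  x  <- m<br />
  xs <- mySequence ms<br />
  return (x : xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- In the Maybe monad the first Nothing ends the computation.<br />
  print (mySequence [Just 'a', Just 'b'])<br />
  print (mySequence [Just 'a', Nothing, Just 'c'])<br />
  -- In the IO monad the actions are simply run in order.<br />
  units <- mySequence [putStrLn "a", putStrLn "b"]<br />
  print (length units)<br />
</haskell><br />
<br />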
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
<br />
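For the <hask>Maybe</hask> monad there is very little behind it. Below is a sketch of <hask>guard</hask> specialised to <hask>Maybe</hask>; the name <hask>guardMaybe</hask> is made up for this example, and the real <hask>guard</hask> from Control.Monad is more general:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- 'guard' specialised to Maybe: succeed with a dummy value when the<br />
-- condition holds, fail otherwise.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()<br />
guardMaybe False = Nothing<br />
<br />
-- Same shape as the ghci example above, but using our own guard.<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)<br />
  return (v + 1)<br />
<br />
main :: IO ()<br />
main = print (map foo [Just 4, Just 5, Just 6])<br />
</haskell><br />
<br />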
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
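If you would like to check the claim yourself, here is a runnable sketch. The definitions of <hask>someAction</hask>, <hask>someOtherAction</hask>, <hask>foo</hask> and <hask>bar</hask> are stand-ins invented for this example; both versions should print exactly the same lines:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Stand-ins (our assumptions) for the placeholders used in the text.<br />
someAction :: IO Int<br />
someAction = return 1<br />
<br />
someOtherAction :: IO Int<br />
someOtherAction = return 2<br />
<br />
foo, bar :: Int -> Int<br />
foo = (+ 10)<br />
bar = (* 10)<br />
<br />
-- The "do" version.<br />
withDo :: IO ()<br />
withDo = do<br />
  a <- someAction<br />
  b <- someOtherAction<br />
  print (bar b)<br />
  print (foo a)<br />
  print "done"<br />
<br />
-- The same thing written with >>= and >> directly.<br />
withBind :: IO ()<br />
withBind =<br />
  someAction >>= \a -><br />
  someOtherAction >>= \b -><br />
  print (bar b) >><br />
  print (foo a) >><br />
  print "done"<br />
<br />
main :: IO ()<br />
main = withDo >> withBind<br />
</haskell><br />
<br />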
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65493Hitchhikers guide to Haskell2023-01-09T12:23:30Z<p>Adept: /* Chapter 2: Parsing the input */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
nearest future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
with images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
       Decide how to fit them on CD-Rs.<br />
       Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from the stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
 $ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
 Loading package base-1.0 ... linking ... done.<br />
 Prelude> :type getContents<br />
 getContents :: IO String<br />
 Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
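As a rough sketch of that idea (the list <hask>actions</hask> is made up for this example), the Prelude function <hask>sequence_</hask> is one such way to run a list of actions one by one. Merely building the list prints nothing; the output appears only when the actions are executed, and we are free to execute them again in a different order:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A plain list whose elements happen to be IO actions.<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
main :: IO ()<br />
main = do<br />
  putStrLn "Nothing from 'actions' has been printed yet."<br />
  -- Run the actions in list order ...<br />
  sequence_ actions<br />
  -- ... and once more, in reverse order.<br />
  sequence_ (reverse actions)<br />
</haskell><br />
<br />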
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after =<br />
   do { before;<br />
  putStrLn "In the middle";<br />
        after; };<br />
<br />
main =<br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
 let {d = combine (b) (combine c c)};<br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
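Here is a small sketch of that rule (the action <tt>describe</tt> is made up for this example): the last action in its "do" block is <tt>return "finished"</tt>, so the whole block has type <tt>IO String</tt>, while "main" ends with a "putStrLn" and therefore has type <tt>IO ()</tt>:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The last action is 'return "finished"', so the block is an IO String.<br />
describe :: IO String<br />
describe = do<br />
  putStrLn "Working..."<br />
  return "finished"<br />
<br />
-- The last action is a putStrLn, so the block is an IO ().<br />
main :: IO ()<br />
main = do<br />
  result <- describe<br />
  putStrLn result<br />
</haskell><br />
<br />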
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that, we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this library provides a set of basic parsers and means to combine them into more complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses the output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
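<br />
To see "do" at work outside of IO, here is a minimal sketch that uses the standard "Maybe" monad, where an action either yields a value or fails with "Nothing". The names "safeDiv" and "ratioSum" are made up for this example and are not part of cd-fit:<br />
<br />
```haskell
-- A sketch, not part of cd-fit: "do" used with the Maybe monad instead of IO.
-- "safeDiv" and "ratioSum" are hypothetical helpers invented for this example.

-- Division that yields Nothing instead of crashing on a zero divisor
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- Two "actions" combined with "do"; if either step yields Nothing,
-- the whole block yields Nothing and the remaining steps are skipped
ratioSum :: Int -> Int -> Int -> Maybe Int
ratioSum a b c =
  do x <- safeDiv a c
     y <- safeDiv b c
     return (x + y)
```
<br />
In ghci, "ratioSum 10 20 5" should give "Just 6", while "ratioSum 1 2 0" gives "Nothing", and no IO is involved anywhere.<br />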
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" of the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile, "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexity from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling job with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use the ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, by tradition the name of a data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
<br />
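Partial application and the backtick syntax can be tried out with nothing but the Prelude. A small sketch ("firstThree" and "sameQuotient" are names made up for this example):<br />
<br />
```haskell
-- "take" wants two arguments (a count and a list); we supply only the count
-- and get back a new function that is still waiting for the list
firstThree :: [a] -> [a]
firstThree = take 3

-- Backticks turn any two-argument function into an infix operator,
-- so both sides of (==) below are the same call written in two ways
sameQuotient :: Bool
sameQuotient = (7 `div` 2) == div 7 2
```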
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike with a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
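<br />
As a sketch of what "looking at the constructor" means in code, here is a function that inspects a value of type "Either Int Char" with a "case" expression ("describe" is a name made up for this example):<br />
<br />
```haskell
-- The "case" expression checks which constructor (tag) was used
-- and gives us access to the value stored inside
describe :: Either Int Char -> String
describe v =
  case v of
    Left n  -> "an Int: " ++ show n
    Right c -> "a Char: " ++ show c
```
<br />
This is exactly the shape of code we will need to tell a "ParseError" from a successful parse.<br />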
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
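A short sketch of what else the record syntax gives us, assuming the declaration above ("photos", "music" and "renamed" are made-up values used only here):<br />
<br />
```haskell
-- The record declaration from the text
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

-- The constructor still accepts positional arguments...
photos :: Dir
photos = Dir 65572 "/home/adept/photos"

-- ...but fields can also be given by name, in any order
music :: Dir
music = Dir {dir_name = "/home/adept/music", dir_size = 1024}

-- Record update: a copy of "photos" with one field replaced.
-- "photos" itself is not changed, since values are immutable
renamed :: Dir
renamed = photos {dir_name = "/tmp/photos"}
```
<br />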
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we have added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort a list of "Dir" by size only, we use a custom comparison function and a parametrized sort, "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
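<br />
Here is a small sketch of "sortBy" and "foldl" on plain (name, size) pairs rather than "Dir" values. The names "bySizeDesc" and "totalSize" are made up for this example; "comparing" comes from the standard module Data.Ord:<br />
<br />
```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sort by the second component of each pair, biggest first.
-- "comparing snd" builds the comparison function, "flip" reverses the order
bySizeDesc :: [(String, Int)] -> [(String, Int)]
bySizeDesc = sortBy (flip (comparing snd))

-- foldl threads an accumulator (here: the running total) through the list,
-- starting from 0
totalSize :: [(String, Int)] -> Int
totalSize = foldl (\acc (_, size) -> acc + size) 0
```
<br />
"greedy_pack" has the same shape: its accumulator is a "DirPack" instead of a number, and the step function is "maybe_add_dir".<br />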
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
-- compute solution and print it<br />
putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boilerplate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool for automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  -- (QuickCheck 2 moved it to a separate class, CoArbitrary; drop this line there.)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of the '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
<br />
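To make this concrete, here is a sketch of a tiny home-made typeclass. Everything in it ("HasSize", "sizeInBytes", "File", "totalSizeOf") is invented for illustration and belongs to no library:<br />
<br />
```haskell
-- The contract: a type must say how to compute its size in bytes
class HasSize a where
  sizeInBytes :: a -> Integer

-- A hypothetical type: a file with a name and a size
data File = File String Integer

-- Two types that fulfil the contract
instance HasSize File where
  sizeInBytes (File _ bytes) = bytes

instance HasSize Bool where
  sizeInBytes _ = 1

-- A generic function: it uses its arguments only through "sizeInBytes",
-- so it works for any type that is an instance of HasSize
totalSizeOf :: HasSize a => [a] -> Integer
totalSizeOf xs = sum (map sizeInBytes xs)
```
<br />
"totalSizeOf" accepts a list of "File"s as happily as a list of "Bool"s, because both types fulfil the contract; a list of "Char"s would be rejected at compile time.<br />
<br />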
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for its functions when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
<br />
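As a sketch of how little is needed, here is a made-up type "Size" whose "Ord" instance defines "compare" and nothing else ("Size" and "rank" are names invented for this example):<br />
<br />
```haskell
data Size = Small | Medium | Large deriving (Eq, Show)

-- A helper mapping each constructor to a number we can compare
rank :: Size -> Int
rank Small  = 0
rank Medium = 1
rank Large  = 2

-- We define only "compare"; (<), (>=), max, min and the rest
-- come from the default implementations in the class
instance Ord Size where
  compare a b = compare (rank a) (rank b)
```
<br />
With that single definition in place, expressions like "Small < Large" and "max Medium Large" work as expected.<br />
<br />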
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back at the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable, to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
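A quick sketch of what such a clause buys us, on a made-up type "Track" (so as not to spoil the exercise with "Dir" below; "sortedTracks" is likewise a name invented for this example):<br />
<br />
```haskell
import Data.List (sort)

-- One deriving clause gives us equality, ordering and printing at once
data Track = Track Int String deriving (Eq, Ord, Show)

-- The derived Ord compares the fields from left to right,
-- so sorting orders the tracks by their number first
sortedTracks :: [Track]
sortedTracks = sort [Track 3 "c", Track 1 "a", Track 2 "b"]
```
<br />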
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word for that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
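<br />
If you want to play with "liftM2" without dragging QuickCheck in, here is a sketch that uses two monads from the standard library, "Maybe" and lists, in place of "Gen". The names "addBoth", "addBothDo" and "pairs" are made up for this example:<br />
<br />
```haskell
import Control.Monad (liftM2)

-- liftM2 turns a plain two-argument function into one
-- that works on monadic values
addBoth :: Maybe Int -> Maybe Int -> Maybe Int
addBoth = liftM2 (+)

-- The same thing spelled out with "do", following the definition of liftM2
addBothDo :: Maybe Int -> Maybe Int -> Maybe Int
addBothDo a1 a2 = do x <- a1
                     y <- a2
                     return (x + y)

-- liftM2 with a data constructor, as in "liftM2 Dir gen_size gen_name";
-- here the pair constructor (,) plays the role of "Dir"
-- and the list monad plays the role of "Gen"
pairs :: [(Int, Char)]
pairs = liftM2 (,) [1, 2] "ab"
```
<br />
"addBoth" and "addBothDo" behave identically, and "pairs" evaluates to [(1,'a'),(1,'b'),(2,'a'),(2,'b')].<br />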
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) <br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write an implementation of it. Nice, eh?<br />
<br />
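The trick that makes this work is that "precomp" is a lazy list which "bestDisk" consults for the answers to smaller problems, so each answer is computed at most once. Here is a sketch of the same idea on a much smaller problem, Fibonacci numbers; "fibs" and "fib" are names made up for this example:<br />
<br />
```haskell
-- An infinite, lazily evaluated table of answers, playing the role of "precomp"
fibs :: [Integer]
fibs = map fib [0..]

-- Each answer is defined in terms of earlier entries of the table,
-- just like "bestDisk" looks up "precomp !! (limit - dir_size d)"
fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)
```
<br />
Without the table, the naive recursive definition would need an exponential number of calls to reach "fib 50"; with it, "fibs !! 50" comes back at once.<br />
<br />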
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define new instances of any classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory sizes. In<br />
Haskell, 'Int' is a platform-dependent fixed-width integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will overflow: in GHC the result silently wraps around, without any error being reported.<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer" (on 64-bit platforms GHC's 'Int' is 64 bits wide), which means that we<br />
would get wrong results trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
<br />
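The session above comes from a 32-bit machine. As a sketch that behaves the same on any platform, we can use the fixed-width type "Int32" from the standard module Data.Int (the names "wrapped", "exact" and "pastTheEnd" are made up for this example). The wrap-around shown is what GHC does:<br />
<br />
```haskell
import Data.Int (Int32)

-- Int32 is always 32 bits wide, so 2^50 does not fit
-- and silently wraps around to 0
wrapped :: Int32
wrapped = 2 ^ (50 :: Int)

-- Integer has arbitrary precision, so there is no overflow
exact :: Integer
exact = 2 ^ (50 :: Int)

-- One step past the largest value lands on the smallest one
pastTheEnd :: Int32
pastTheEnd = maxBound + 1
```
<br />
Note that no exception is raised anywhere: "pastTheEnd == minBound" is simply "True".<br />
<br />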
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems that Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] or<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is a member of typeclass <hask>Num</hask>, so <hask>fromInteger</hask> can turn an <hask>Integer</hask> into the <hask>Int</hask> that <hask>(!!)</hask> expects, and we can use it to make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
      case [ DirPack (dir_size d + s) (d:ds)<br />
           | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
           , dir_size d > 0<br />
           , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
           , d `notElem` ds<br />
           ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
<br />
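As an aside: if you would rather not narrow the index at all, <hask>Data.List</hask> offers <hask>genericIndex</hask>, which accepts any <hask>Integral</hask> index. A minimal sketch follows; the list <tt>squares</tt> and both function names are made-up stand-ins, and this is not what 'cd-fit-4-2.hs' does:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- Hypothetical stand-in for our precomputed list.<br />
squares :: [Integer]<br />
squares = [ n * n | n <- [0 ..] ]<br />
<br />
-- (!!) wants an Int, so the index is narrowed first<br />
-- (and could wrap around for a huge index).<br />
viaFromInteger :: Integer -> Integer<br />
viaFromInteger i = squares !! fromInteger i<br />
<br />
-- genericIndex takes the Integer index as it is.<br />
viaGenericIndex :: Integer -> Integer<br />
viaGenericIndex i = squares `genericIndex` i<br />
</haskell><br />
<br />
For a small index such as 12 both functions give the same answer, 144.<br />
<br />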
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem: computing the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (newer GHC versions spell the last profiling flag <tt>-fprof-auto</tt>) and run it like this:<br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory.<br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program's run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 90% of the total<br />
run time and practically all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp". (Wondering why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>.)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up an element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
      case [ DirPack (dir_size d + s) (d:ds)<br />
           | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
           , dir_size d > 0<br />
           , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
           , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can never be sure that an element<br />
will not be needed again and could therefore be garbage collected.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let `bestDisk x' be the "most tightly packed" disk of total<br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best-packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, the best-packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably because all of them are too big).<br />
     -- Well, just report that the disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
<br />
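Since the snippet above shows only the changed lines, here is how the whole function might look once they are put in place. Treat it as a sketch, not as the contents of 'cd-fit-4-5.hs': the type signature, the use of <hask>comparing</hask> instead of <tt>cmpSize</tt>, the variable names and the sample directories below are my assumptions, and the <hask>dir_size d > 0</hask> check is folded into the range (1,limit):<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
<br />
-- Assumed signature: best packing of the given dirs onto a disk of size 'limit'.<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d : packed)<br />
       -- Keep only non-empty dirs that fit on the disk by themselves ...<br />
       | let small_enough = filter (inRange (1, limit) . dir_size) ds<br />
       , d <- small_enough<br />
       -- ... and pack the remaining space using the remaining candidates only.<br />
       , let DirPack s packed = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    [] -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
</haskell><br />
<br />
For example, <hask>pack_size (bestDisk 10 [Dir 6 "a", Dir 5 "b", Dir 4 "c"])</hask> should be 10, since the directories of sizes 6 and 4 fill the disk completely.<br />
<br />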
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a lot of progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you the<br />
theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and a value could be '''either''' a string or a list of<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, a lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
       -- Pass the lookup error through.<br />
       Nothing -> Nothing<br />
       -- If the given name exists, see if it is a value or a reference:<br />
       Just attv -> case attv of<br />
                         -- It's a value. Return it!<br />
                         Left val -> Just val<br />
                         -- It's a list of references :(<br />
                         -- We have to look them up, accounting for<br />
                         -- possible failures.<br />
                         -- First, we will perform lookup of all references ...<br />
                         Right refs -><br />
                           let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
                               -- .. then, we will exclude lookup failures<br />
                               wo_failures = filter (/=Nothing) vals<br />
                               -- ... find a way to remove the annoying 'Just' wrapper<br />
                               stripJust (Just v) = v<br />
                               -- ... use it to extract all lookup results as strings<br />
                               strings = map stripJust wo_failures<br />
                           in<br />
                             -- ... finally, combine them into a single String.<br />
                             -- If all lookups failed, we should pass the failure to the caller.<br />
                             case null strings of<br />
                               True -> Nothing<br />
                               False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
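The second point can be softened a little even before we reach for monads: the "exclude failures, then strip 'Just'" dance has a ready-made counterpart in the standard libraries, <hask>catMaybes</hask> from <hask>Data.Maybe</hask>. Here is a small sketch of the last few steps only. The helper name <tt>joinResolved</tt> is hypothetical, and I use <hask>intercalate</hask> from <hask>Data.List</hask> in place of <hask>concat (intersperse ...)</hask>:<br />
<br />
<haskell><br />
import Data.List (intercalate)<br />
import Data.Maybe (catMaybes)<br />
<br />
-- Hypothetical helper: join the successful lookups with ':',<br />
-- or report failure when none of them succeeded.<br />
joinResolved :: [Maybe String] -> Maybe String<br />
joinResolved results =<br />
  case catMaybes results of<br />
    [] -> Nothing<br />
    strings -> Just (intercalate ":" strings)<br />
</haskell><br />
<br />
With it, <hask>joinResolved [Just "jabber", Nothing, Just "client"]</hask> gives <hask>Just "jabber:client"</hask>. The checks after each step are still there, though.<br />
<br />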
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures<br />
into account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failed lookup already aborts the whole block ("monad magic"),<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into a single String,<br />
                     -- ... and return failure if it is empty.<br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way (hint: pay attention to attributes where only some of the references fail to resolve). Try to<br />
write a QuickCheck test for that, defining the<br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never ever check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that the<br />
type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
<br />
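To see that there is no real magic involved, here is a sketch of the same kind of function written twice: once with "do", and once with the check spelled out by hand, the way we did it in the first version of our lookup. The names <tt>incDo</tt> and <tt>incCase</tt> are made up for this illustration:<br />
<br />
<haskell><br />
-- Hypothetical names, for illustration only.<br />
<br />
-- With "do": the check for Nothing is hidden between the actions.<br />
incDo :: Maybe Int -> Maybe Int<br />
incDo x = do<br />
  v <- x<br />
  return (v + 1)<br />
<br />
-- By hand: the same check written out as a 'case'.<br />
incCase :: Maybe Int -> Maybe Int<br />
incCase x =<br />
  case x of<br />
    Nothing -> Nothing<br />
    Just v -> Just (v + 1)<br />
</haskell><br />
<br />
Both functions turn <hask>Just 5</hask> into <hask>Just 6</hask> and leave <hask>Nothing</hask> alone.<br />
<br />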
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such a<br />
use case has to be quite a common one.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
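If the type signature still looks scary, it may help to see how such a function could be written. This is a sketch under the hypothetical name <tt>eitherSketch</tt>; I make no claims about how the library actually defines <hask>either</hask>, only that it behaves like this:<br />
<br />
<haskell><br />
-- Hypothetical re-implementation, for illustration only.<br />
-- The first function handles 'Left' values, the second one handles 'Right' values.<br />
eitherSketch :: (a -> c) -> (b -> c) -> Either a b -> c<br />
eitherSketch f _ (Left x) = f x<br />
eitherSketch _ g (Right y) = g y<br />
</haskell><br />
<br />
Just like in the ghci session above, <hask>eitherSketch (+1) length (Left 5)</hask> gives 6 and <hask>eitherSketch (+1) length (Right "foo")</hask> gives 3.<br />
<br />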
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
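While we are at it, the <hask>sequence $ map (flip ...) refs</hask> line deserves a closer look. Below is a sketch with a made-up lookup table (all names in it are hypothetical). It shows what <hask>flip</hask> buys us, and that "map, then sequence" can also be spelled <hask>mapM</hask>:<br />
<br />
<haskell><br />
-- Hypothetical lookup table, for illustration only.<br />
table :: [(String, Int)]<br />
table = [("one", 1), ("two", 2)]<br />
<br />
-- 'flip' swaps the arguments of 'lookup', so that we can supply the table<br />
-- first and leave the key for later.<br />
findInTable :: String -> Maybe Int<br />
findInTable = flip lookup table<br />
<br />
-- Look up every key, then turn the list of Maybes into a Maybe list ...<br />
viaSequence :: [String] -> Maybe [Int]<br />
viaSequence keys = sequence (map findInTable keys)<br />
<br />
-- ... which is the same thing as 'mapM'.<br />
viaMapM :: [String] -> Maybe [Int]<br />
viaMapM keys = mapM findInTable keys<br />
</haskell><br />
<br />
Both variants return <hask>Just [1,2]</hask> for <hask>["one","two"]</hask> and <hask>Nothing</hask> for <hask>["one","three"]</hask>.<br />
<br />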
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, sequence continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
<br />
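To convince yourself that nothing is hardcoded, here is a sketch of how <hask>sequence</hask> could be written for lists using nothing but "do". The name <tt>sequenceSketch</tt> is hypothetical, and the library version is more general:<br />
<br />
<haskell><br />
-- Hypothetical re-implementation, for illustration only.<br />
-- Run the actions one by one and collect their results; the "stop at the<br />
-- first failure" behavior comes from the monad, not from this code.<br />
sequenceSketch :: Monad m => [m a] -> m [a]<br />
sequenceSketch [] = return []<br />
sequenceSketch (action : rest) = do<br />
  v <- action<br />
  vs <- sequenceSketch rest<br />
  return (v : vs)<br />
</haskell><br />
<br />
Try it on the lists from the session above: <hask>sequenceSketch [Just 'a', Just 'b']</hask> gives <hask>Just "ab"</hask>, and adding a <hask>Nothing</hask> to the list gives <hask>Nothing</hask>.<br />
<br />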
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
<br />
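For the <hask>Maybe</hask> monad, the effect of <hask>guard</hask> can be imitated with a two-line function. This is only a sketch (<tt>guardMaybe</tt> and <tt>bumpUnlessFive</tt> are made-up names); the real <hask>guard</hask> from <hask>Control.Monad</hask> works for more than just <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- Hypothetical, Maybe-only imitation of 'guard', for illustration only.<br />
-- A true condition lets the block continue, a false one makes it fail.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True = Just ()<br />
guardMaybe False = Nothing<br />
<br />
-- The same function as 'foo' in the session above, with 'guardMaybe' inside.<br />
bumpUnlessFive :: Maybe Int -> Maybe Int<br />
bumpUnlessFive x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)<br />
  return (v + 1)<br />
</haskell><br />
<br />
Mapping <hask>bumpUnlessFive</hask> over <hask>[Just 4, Just 5, Just 6]</hask> gives <hask>[Just 5,Nothing,Just 7]</hask>, exactly like the version with <hask>guard</hask>.<br />
<br />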
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
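Since <tt>someAction</tt>, <tt>foo</tt> and <tt>bar</tt> above are just placeholders, here is a sketch of the same translation that you can actually load into ghci. It uses <hask>Maybe</hask> instead of IO to keep things small, and all names in it are made up:<br />
<br />
<haskell><br />
-- Hypothetical names, for illustration only.<br />
<br />
-- The sugared version ...<br />
sumDo :: Maybe Int -> Maybe Int -> Maybe Int<br />
sumDo ma mb = do<br />
  a <- ma<br />
  b <- mb<br />
  return (a + b)<br />
<br />
-- ... and the same thing written with ">>=" and lambdas.<br />
sumBind :: Maybe Int -> Maybe Int -> Maybe Int<br />
sumBind ma mb =<br />
  ma >>= \a -><br />
  mb >>= \b -><br />
  return (a + b)<br />
</haskell><br />
<br />
Both give <hask>Just 5</hask> for <hask>(Just 2)</hask> and <hask>(Just 3)</hask>, and <hask>Nothing</hask> as soon as one of the arguments is <hask>Nothing</hask>.<br />
<br />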
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65492Hitchhikers guide to Haskell2023-01-09T12:21:21Z<p>Adept: /* Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read any further, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, with directories<br />
of images ranging from 10 to 300 Mb in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the GitHub<br />
-- repository "https://github.com/adept/hhgth" by issuing<br />
-- command "git clone https://github.com/adept/hhgth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control system, and I hope you are not making an exception even for the tutorial code such as the one we will write here.<br />
<br />
Create an empty directory for all our code, invoke<br />
"git init" there (or use another version control system of your choice), <br />
and fire up your favourite editor to create a new file called "cd-fit.hs"<br />
in our working directory. <br />
<br />
Now let's think for a moment about how our program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read the list of directories and their sizes.<br />
Decide how to fit them on CD-Rs.<br />
Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute the solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This line is an instruction to read all the information available from stdin, return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
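<br />
Here is a small runnable sketch of that difference (the name "hello" is made up for this illustration): giving an action a name with "let" performs no I/O, while using it as a statement, or with "<-", actually runs it:<br />
<br />
<haskell><br />
module Main where<br />
<br />
main :: IO ()<br />
main = do<br />
  -- "let" only gives the action a name; nothing is printed by this line<br />
  let hello = putStrLn "Hello!"<br />
  putStrLn "The action 'hello' has not been run yet"<br />
  -- using the action as a statement runs it, as many times as we like<br />
  hello<br />
  hello<br />
  -- "<-" runs an action and binds the result it produced<br />
  line <- getLine<br />
  putStrLn ("You typed: " ++ line)<br />
</haskell><br />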
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate it later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate, topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation, we obtain the result of that<br />
action''. <br />
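<br />
A runnable sketch of this layering, with placeholder actions invented for the illustration ("askName" and "greet" are not part of the tutorial's code):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A small action that produces a result.<br />
askName :: IO String<br />
askName = do<br />
  putStrLn "What is your name?"<br />
  getLine<br />
<br />
-- A bigger action built from smaller ones with "do".<br />
greet :: IO ()<br />
greet = do<br />
  name <- askName   -- run askName and bind its result<br />
  putStrLn ("Hello, " ++ name ++ "!")<br />
<br />
-- "main" is the topmost action; running the program executes it.<br />
main :: IO ()<br />
main = do<br />
  putStrLn "Will do some processing"<br />
  greet<br />
  putStrLn "Done"<br />
</haskell><br />
<br />
Nothing above is executed when "askName" or "greet" are defined; everything happens only because "main" mentions "greet", and "greet" mentions "askName".<br />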
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get a result" is essentially a "Command<br />
pattern" and a "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
  do { before; <br />
 putStrLn "In the middle"; <br />
       after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
         let {d = combine (b) (combine c c)}; <br />
 putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control and then try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favourite colour, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several that we already know. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. The thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
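<br />
To see that "do" is not tied to IO, here is a small sketch in the "Maybe" monad from the Prelude, where a failed step ("Nothing") aborts the rest of the block. The helper names "safeDiv" and "calc" are invented for this example:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Division that fails (returns Nothing) instead of crashing on zero.<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- Same "do" syntax as with IO, but no I/O happens here:<br />
-- the Maybe monad just threads possible failure through the steps.<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do<br />
  q <- safeDiv a b<br />
  r <- safeDiv q c<br />
  return (q + r)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (calc 100 5 2)   -- Just 30<br />
  print (calc 100 0 2)   -- Nothing<br />
</haskell><br />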
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
  :: Text.Parsec.Prim.ParsecT<br />
       [Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
  :: Text.Parsec.Prim.ParsecT<br />
       [Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
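<br />
A small sketch of building a "Dir" with its constructor and taking it apart again by pattern matching on that same constructor (the helper "dirSize" is invented for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- Pattern matching on the constructor gives the fields back.<br />
dirSize :: Dir -> Int<br />
dirSize (Dir size _) = size<br />
<br />
main :: IO ()<br />
main = do<br />
  let d = Dir 1 "foo"<br />
  print d            -- Dir 1 "foo"  (possible thanks to "deriving Show")<br />
  print (dirSize d)  -- 1<br />
</haskell><br />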
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
, which would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for "manyTill anyChar newline". Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
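<br />
Partial application and the backtick syntax are easier to see on a toy function than on parsers. A minimal sketch (the functions "add" and "addTen" are invented for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A function of two arguments...<br />
add :: Int -> Int -> Int<br />
add x y = x + y<br />
<br />
-- ...applied to only one argument yields a new function of one argument.<br />
addTen :: Int -> Int<br />
addTen = add 10<br />
<br />
main :: IO ()<br />
main = do<br />
  print (addTen 5)               -- 15<br />
  print (map (add 1) [1, 2, 3])  -- [2,3,4]<br />
  -- Backticks turn any two-argument function into an infix operator:<br />
  print (10 `add` 5)             -- 15<br />
</haskell><br />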
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
  :: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
     Text.Parsec.Prim.Parsec s () a<br />
     -> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
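<br />
To try "parse" in isolation, here is a sketch with a deliberately tiny parser. It assumes the parsec package is available to your compiler; the parser name "number" and the source name "example" are invented for this illustration:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- A tiny parser: one or more digits, converted to an Int.<br />
number :: Parser Int<br />
number = do<br />
  ds <- many1 digit<br />
  return (read ds)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- Successful parse: the result comes back wrapped in Right<br />
  print (parse number "example" "42 and more")<br />
  -- Failed parse: a ParseError comes back wrapped in Left<br />
  print (parse number "example" "oops")<br />
</haskell><br />
<br />
Note that the parser does not have to consume the whole input: the first call succeeds even though " and more" is left unread, which is why "parseInput" ends with an explicit "eof".<br />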
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
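<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch (the function "describe" is invented for this illustration):<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- Decide which alternative we were given by matching on the constructor.<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ show c<br />
<br />
main :: IO ()<br />
main = do<br />
  putStrLn (describe (Left 42))    -- an Int: 42<br />
  putStrLn (describe (Right 'x'))  -- a Char: 'x'<br />
</haskell><br />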
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" to actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far. (If you chose git instead, the<br />
rough equivalents are "git status", "git diff", "git add -p" followed by<br />
"git commit", and "git log".)<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
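<br />
With the record syntax above, each field name also becomes an accessor function. A minimal sketch (the sample values are invented; construction by field name and record update are standard Haskell, though the tutorial's own code does not rely on them):<br />
<br />
<haskell><br />
module Main where<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do<br />
  let d = Dir 65572 "dir1"                           -- positional construction<br />
      e = Dir {dir_name = "dir2", dir_size = 68268}  -- by field name, in any order<br />
  print (dir_size d)        -- 65572<br />
  putStrLn (dir_name e)     -- dir2<br />
  print (d {dir_size = 0})  -- record update: a copy of d with one field changed<br />
</haskell><br />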
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts the directories by size (the "cmpSize" comparison below orders them from the smallest up) and tries to put<br />
them on the CD one by one, skipping those for which there is no room left. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from the module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort the list of "Dir" by size only, we use a custom comparison function and a parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we passed the fields to the constructor "DirPack" by position; the accessor functions "pack_size" and "dirs" were needed only to read the fields of the old value.<br />
----<br />
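<br />
Two of the points above, sketched on plain numbers instead of "Dir"s. The names "capacity" and "addIfFits" are invented for this illustration, and "comparing" comes from the standard module Data.Ord, which the tutorial's code does not use:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Data.List (sortBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Capacity used in this illustration only.<br />
capacity :: Int<br />
capacity = 10<br />
<br />
-- Step function for foldl: keep the running total unchanged when<br />
-- the next item would push it over the capacity.<br />
addIfFits :: Int -> Int -> Int<br />
addIfFits total x = if total + x > capacity then total else total + x<br />
<br />
main :: IO ()<br />
main = do<br />
  -- foldl feeds the list elements to the step function left to right:<br />
  -- 0 -> 2 -> 5 -> 5 (7 does not fit) -> 9<br />
  print (foldl addIfFits 0 [2, 3, 7, 4])   -- 9<br />
  -- "comparing f" is shorthand for \a b -> compare (f a) (f b)<br />
  print (sortBy (comparing snd) [(1, 'c'), (2, 'a'), (3, 'b')])<br />
</haskell><br />
<br />
The step function has the same shape as "maybe_add_dir": it takes the accumulated state first and the next list element second, which is exactly what "foldl" expects.<br />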
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a couple of lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify the integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" (or "git commit") and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a reusable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
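<br />
A caveat, and a sketch: the instance above is written against QuickCheck 1, where "coarbitrary" is a method of "Arbitrary". In QuickCheck 2 that method lives in a separate class ("CoArbitrary"), so only "arbitrary" goes into the instance. Assuming QuickCheck 2 is installed, a self-contained version of the same test could look roughly like this (the definitions from the previous snippets are repeated so that the file loads on its own, and "vectorOf" is used here in place of "replicateM"):<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Data.List (sortBy)<br />
import Test.QuickCheck<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
media_size :: Int<br />
media_size = 700*1024*1024<br />
<br />
greedy_pack :: [Dir] -> DirPack<br />
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds<br />
  where<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
maybe_add_dir :: DirPack -> Dir -> DirPack<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
-- QuickCheck 2 style: only "arbitrary" is defined in the instance<br />
instance Arbitrary Dir where<br />
  arbitrary = do<br />
    s <- choose (10, 1400)                  -- size in Mb<br />
    n <- choose (1, 300)                    -- length of the name<br />
    name <- vectorOf n (elements "fubar/")  -- n random characters<br />
    return (Dir (s*1024*1024) name)<br />
<br />
prop_greedy_pack_is_fixpoint :: [Dir] -> Bool<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
<br />
main :: IO ()<br />
main = quickCheck prop_greedy_pack_is_fixpoint<br />
</haskell><br />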
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of the '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. On the one hand, you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments to your functions must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy.<br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions.<br />
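<br />
To make the "contract" idea concrete, here is a small sketch. Everything in it except "Dir" is hypothetical: the class "Sized", the type "File" and the function "totalSize" are invented for this illustration and do not appear in cd-fit.<br />

```haskell
-- Hypothetical names: "Sized", "size", "File" and "totalSize" are made up
-- for this illustration.

-- The contract: a type is "Sized" once it says how to measure itself.
class Sized a where
  size :: a -> Int

data Dir = Dir { dir_size :: Int, dir_name :: String }
newtype File = File Int

-- Two unrelated types sign the same contract.
instance Sized Dir where
  size = dir_size

instance Sized File where
  size (File n) = n

-- The library function only relies on the contract, not on a concrete type.
totalSize :: Sized a => [a] -> Int
totalSize xs = sum (map size xs)
```

"totalSize" works on a list of "Dir"s and on a list of "File"s alike, but the typechecker rejects "totalSize [True, False]", because "Bool" never signed the contract.<br />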
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements such an<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other?<br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for several of its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" is either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's look back at the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
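<br />
As a sketch of both routes to an instance, the snippet below derives "Eq", "Ord" and "Show" for "Dir" and, for contrast, writes an "Ord" instance by hand that supplies nothing but "compare". The wrapper type "ByName" is hypothetical and exists only for this illustration.<br />

```haskell
import Data.List (sort)

-- Derived instances: Eq and Ord compare the fields in declaration order,
-- so "Dir"s are ordered by size first and by name second.
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Ord, Show)

-- A hypothetical wrapper that orders directories by name first.
newtype ByName = ByName Dir deriving (Eq, Show)

-- Hand-written instance: only "compare" is supplied; (<), (<=), (>), (>=),
-- max and min all come from the default implementations in the class.
instance Ord ByName where
  compare (ByName a) (ByName b) =
    compare (dir_name a, dir_size a) (dir_name b, dir_size b)

-- Sorting with the derived ordering (by size) ...
bySize :: [Dir] -> [Dir]
bySize = sort

-- ... and with the hand-written one (by name).
byName :: [Dir] -> [Dir]
byName ds = [ d | ByName d <- sort (map ByName ds) ]
```

The same list therefore sorts in two different ways, depending on which instance the typechecker picks.<br />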
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Does it remind you of something? Right! Think of "IO a" and "Parser<br />
a", which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] that stands for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/"<br />
          gen_name = do n <- choose (1,300)<br />
                        replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word for it. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, only a single call to "liftM2".<br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be an "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
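<br />
The last point can be tried out without QuickCheck. The sketch below uses "Maybe" in place of "Gen" (an assumption made only so the example runs with nothing but the standard library; "Maybe" is a monad too). The function names "viaLiftM2" and "viaDo" are invented for this illustration.<br />

```haskell
import Control.Monad (liftM2)

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)

-- Build a "Maybe Dir" out of a "Maybe Int" and a "Maybe String" ...
viaLiftM2 :: Maybe Int -> Maybe String -> Maybe Dir
viaLiftM2 = liftM2 Dir

-- ... which behaves like the "do" block spelled out by hand.
viaDo :: Maybe Int -> Maybe String -> Maybe Dir
viaDo ms mn = do s <- ms
                 n <- mn
                 return (Dir s n)
```

Both functions produce a "Dir" when both arguments are present, and "Nothing" as soon as either argument is "Nothing".<br />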
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs =<br />
  -- By calculating `bestDisk' for all possible disk sizes, we could<br />
  -- obtain a solution for particular case by simple lookup in our list of<br />
  -- solutions :)<br />
  let precomp = map bestDisk [0..]<br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit =<br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate"<br />
         --    dirs that are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
              , d `notElem` ds<br />
              ] of<br />
           -- We either fail to add any dirs (probably, because all of them are too big).<br />
           -- Well, just report that disk must be left empty:<br />
           [] -> DirPack 0 []<br />
           -- Or we produce some alternative packings. Let's choose the best of them all:<br />
           packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm as to write an implementation of it. Nice, eh?<br />
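<br />
The trick behind "precomp" (a lazily built list of answers, indexed by problem size, which the recursive function consults instead of recomputing) can be seen in isolation in the following sketch. It is a hypothetical example: "fibs" and "fib" have nothing to do with cd-fit and only show the same pattern on a smaller problem.<br />

```haskell
-- Hypothetical example: "fibs" and "fib" are not part of cd-fit.

-- The table of answers. It is built lazily: an element is computed at most
-- once, and only when somebody asks for it.
fibs :: [Integer]
fibs = map fib [0 ..]

-- The recursive definition looks its sub-answers up in the table
-- instead of recomputing them.
fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)
```

Without the table, the doubly recursive definition would take exponential time; with it, "fib 50" returns at once. The flip side is that the table stays in memory for as long as it is reachable.<br />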
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse the modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds its bounds results in an overflow (GHC silently wraps the value around).<br />
The standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer" (on a 64-bit GHC you will see 64-bit bounds instead), which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
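<br />
The difference can also be written down as a pair of definitions that do not depend on the word size of your machine. This is a sketch: the names "wraps" and "beyondInt" are invented here, and the first definition relies on GHC's behaviour (the Haskell Report leaves the result of 'Int' overflow unspecified).<br />

```haskell
-- Relies on GHC: stepping past the largest Int wraps around to the
-- smallest one instead of raising an error.
wraps :: Bool
wraps = (maxBound :: Int) + 1 == minBound

-- Integer has no upper bound: this is one more than the largest Int,
-- and it is still a perfectly ordinary positive number.
beyondInt :: Integer
beyondInt = toInteger (maxBound :: Int) + 1
```

In ghci, "wraps" evaluates to True under GHC, and "beyondInt" is a positive number on both 32-bit and 64-bit platforms.<br />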
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems that Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of the typeclass <hask>Num</hask>, so we can use <hask>fromInteger</hask> to turn an <hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
of <hask>Int</hask>, an overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
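<br />
For the curious, here are two ways of avoiding a silent overflow in such a conversion. This is only a sketch and not what cd-fit does: "toIntChecked" and "nth" are hypothetical helpers, while "genericIndex" is a real function from Data.List.<br />

```haskell
import Data.List (genericIndex)

-- Hypothetical helper: convert only when the value really fits into an Int,
-- and report failure otherwise instead of wrapping around.
toIntChecked :: Integer -> Maybe Int
toIntChecked n
  | n >= toInteger (minBound :: Int) && n <= toInteger (maxBound :: Int) = Just (fromInteger n)
  | otherwise = Nothing

-- genericIndex accepts any Integral index, so an Integer can be used
-- directly and no conversion is needed at all.
nth :: [a] -> Integer -> a
nth = genericIndex
```

Note that "genericIndex" still walks the list element by element, so it removes the type error and the overflow, but not the cost of reaching a far-away element.<br />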
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC versions spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this:<br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory.<br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. Heap profile was saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those random-generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 90% of the total<br />
run time and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds)<br />
              | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
              , dir_size d > 0<br />
              , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
              , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably, because all of them are too big).<br />
     -- Well, just report that disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds)<br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
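<br />
The key ingredient of this change is "delete" from Data.List, which removes the first occurrence of an element from a list (and therefore needs an "Eq" instance for the element type). The pattern "pick one candidate, recurse on the rest" looks like this in isolation; "picks" is a hypothetical helper written for this illustration only.<br />

```haskell
import Data.List (delete)

-- Hypothetical helper: pair every element with the list of the remaining
-- candidates, i.e. the list with the first occurrence of that element removed.
picks :: Eq a => [a] -> [(a, [a])]
picks xs = [ (x, delete x xs) | x <- xs ]
```

Each recursive call thus receives a strictly shorter list of candidates, which is why the number of calls to <tt>bestDisk</tt> drops.<br />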
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes).<br />
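<br />
As a starting point for that exercise, here is a sketch of one possible scaling step. Note the assumption: instead of the size of the smallest directory it divides by the greatest common divisor of all directory sizes, a technique swapped in here so that every division is exact and no precision is lost. The function "scaleDown" is hypothetical and works on bare sizes rather than on "Dir" values.<br />

```haskell
-- Hypothetical helper, not part of cd-fit. Returns the scaling factor, the
-- scaled disk limit and the scaled directory sizes.
scaleDown :: Integer -> [Integer] -> (Integer, Integer, [Integer])
scaleDown limit sizes = (g, limit `div` g, map (`div` g) sizes)
  where
    -- gcd of all sizes; "max 1" guards against an empty or all-zero list
    g = max 1 (foldr gcd 0 sizes)
```

Since every directory size is a multiple of the factor, so is the size of every pack, and rounding the limit down does not exclude any packing that fitted before. The sizes produced by our QuickCheck generator are whole megabytes, so for the tests the factor is at least 1024*1024.<br />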
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (See Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisited monads once again. I will let other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and a value could be '''either''' a string or<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of<br />
an attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of the attribute "xmlns" in both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing <br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -> <br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- .. then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into single String. <br />
          -- If all lookups failed, we should pass failure to caller.<br />
          case null strings of<br />
            True  -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for such a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from the data constructors of <hask>Maybe</hask> and <hask>Either</hask> and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used to construct computations from other<br />
computations. That is just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible<br />
failures into account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failure has already been handled by "monad magic"<br />
                     -- ... (a single failed lookup makes the whole result 'Nothing'),<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into single String,<br />
                     -- ... and return failure if it is empty. <br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
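<br />
To take some of the mystery out of the "magic", here is a sketch of what the "do" block in <hask>foo</hask> boils down to for <hask>Maybe</hask>. The name <hask>bindMaybe</hask> is made up for illustration; the library spells this operation <hask>>>=</hask> (see Chapter 400), and as far as I know it behaves the same way for <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- 'bindMaybe' is a stand-in written for illustration only.<br />
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
bindMaybe Nothing  _ = Nothing   -- failure: skip the rest of the computation<br />
bindMaybe (Just v) f = f v       -- success: unwrap the value and carry on<br />
<br />
-- The same 'foo' as in the session above, written without "do":<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = x `bindMaybe` \v -> Just (v + 1)<br />
</haskell><br />
<br />
Here <hask>foo (Just 5)</hask> is <hask>Just 6</hask> and <hask>foo Nothing</hask> is <hask>Nothing</hask>, just as before.<br />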
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such<br />
a use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
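<br />
If the signature still looks scary, it may help to see how little code hides behind it. This is a sketch of how such a function can be written by hand (I call it <hask>myEither</hask> to avoid a clash with the Prelude; I believe the Prelude definition is essentially the same):<br />
<br />
<haskell><br />
-- Hand-written stand-in for the Prelude's 'either'.<br />
myEither :: (a -> c) -> (b -> c) -> Either a b -> c<br />
myEither f _ (Left  x) = f x   -- a Left goes through the first function<br />
myEither _ g (Right y) = g y   -- a Right goes through the second function<br />
</haskell><br />
<br />
In other words, <hask>either</hask> is the <hask>case</hask> on <hask>Left</hask> and <hask>Right</hask> that we keep writing, packaged as a function.<br />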
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do <br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
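<br />
For the record, here is one possible further step, sketched as a self-contained file that you can load into ghci. The name <hask>lookupAttrM</hask> is mine, not from the chapter's sources. It relies on two library functions not used so far: <hask>mapM f</hask> does the same job as <hask>sequence . map f</hask>, and <hask>Data.List.intercalate</hask> does the job of <hask>concat</hask> plus <hask>intersperse</hask>:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intercalate)<br />
<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
type AttValue  = Either Value [Reference]<br />
type Attribute = (Name, AttValue)<br />
<br />
-- Same sample data as in the text above.<br />
complex_attrs :: [Attribute]<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" ) ]<br />
<br />
-- Hypothetical variant of lookupAttr'' (the name is made up).<br />
lookupAttrM :: Name -> [Attribute] -> Maybe Value<br />
lookupAttrM nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just dereference attv<br />
  where<br />
    dereference refs = do<br />
      -- look up every reference; one failure fails the whole lookup<br />
      vals <- mapM (`lookupAttrM` attrs) refs<br />
      guard (not (null vals))<br />
      return (intercalate ":" vals)<br />
</haskell><br />
<br />
With this loaded, <hask>lookupAttrM "xmlns" complex_attrs</hask> again gives <hask>Just "jabber:client"</hask>, and a reference to a missing attribute makes the whole lookup return <hask>Nothing</hask>.<br />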
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where it stops at the first failure. Note that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
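<br />
To see why, here is a sketch of how <hask>sequence</hask> could be written (I call it <hask>mySequence</hask>; the real library definition is more general and may differ in detail). Nothing in it mentions <hask>Nothing</hask> or exceptions: all the "stop at the first failure" behavior comes from what <hask><-</hask> means in each particular monad:<br />
<br />
<haskell><br />
-- 'mySequence' is a hand-written stand-in for the library's 'sequence'.<br />
mySequence :: Monad m => [m a] -> m [a]<br />
mySequence []       = return []<br />
mySequence (m : ms) = do<br />
  x  <- m               -- run the first action ...<br />
  xs <- mySequence ms   -- ... then all the others ...<br />
  return (x : xs)       -- ... and collect the results<br />
</haskell><br />
<br />
Trying it on the lists from the session above should give the same answers as <hask>sequence</hask> does.<br />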
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
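<br />
For <hask>Maybe</hask>, the idea behind <hask>guard</hask> fits in two lines. The following is only a sketch with a made-up name, <hask>guardMaybe</hask>; the library's <hask>guard</hask> works for more monads than just <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- 'guardMaybe' is a hypothetical, Maybe-only version of 'guard'.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()   -- condition holds: carry on<br />
guardMaybe False = Nothing   -- condition fails: the whole "do" block becomes Nothing<br />
<br />
-- The same 'foo' as in the session above, using our own guard:<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)<br />
  return (v + 1)<br />
</haskell><br />
<br />
Here <hask>map foo [Just 4, Just 5, Just 6]</hask> gives <hask>[Just 5,Nothing,Just 7]</hask> again.<br />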
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
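<br />
If you want to convince yourself that the two spellings agree, here is a small sketch. I use <hask>Maybe</hask> instead of IO so that the results can be compared with <hask>==</hask>; <hask>someAction</hask>, <hask>someOtherAction</hask>, <hask>foo</hask> and <hask>bar</hask> are made-up stand-ins:<br />
<br />
<haskell><br />
-- All names below are hypothetical, chosen to mirror the snippets above.<br />
someAction, someOtherAction :: Maybe Int<br />
someAction      = Just 2<br />
someOtherAction = Just 20<br />
<br />
foo, bar :: Int -> Int<br />
foo = (+ 1)<br />
bar = (* 2)<br />
<br />
-- Written with "do" ...<br />
withDo :: Maybe (Int, Int)<br />
withDo = do a <- someAction<br />
            b <- someOtherAction<br />
            return (bar b, foo a)<br />
<br />
-- ... and with explicit ">>=".<br />
withBind :: Maybe (Int, Int)<br />
withBind = someAction >>= \a -><br />
           someOtherAction >>= \b -><br />
           return (bar b, foo a)<br />
</haskell><br />
<br />
Both <hask>withDo</hask> and <hask>withBind</hask> evaluate to <hask>Just (40,3)</hask>.<br />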
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this is available on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=65491Hitchhikers guide to Haskell2023-01-09T12:10:43Z<p>Adept: /* Preface: DON'T PANIC! */</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via [https://github.com/adept/hhgth github] or directly to this<br />
Wiki.<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua:8080/repos/hhgtth" by issuing<br />
-- command "darcs get http://adept.linux.kiev.ua:8080/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = Read list of directories and their sizes.<br />
       Decide how to fit them on CD-Rs.<br />
       Print solution.<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
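Such a list of actions, and a way to execute them one by one, can be sketched in a few lines. The names <hask>steps</hask> and <hask>runAll</hask> are made up for this illustration:<br />
<br />
<haskell><br />
-- A list of IO actions is an ordinary value; building it prints nothing.<br />
steps :: [IO ()]<br />
steps = [putStrLn "one", putStrLn "two", putStrLn "three"]<br />
<br />
-- Execute the actions one by one, in list order.<br />
runAll :: [IO ()] -> IO ()<br />
runAll []       = return ()<br />
runAll (a : as) = do a<br />
                     runAll as<br />
</haskell><br />
<br />
Typing <hask>runAll steps</hask> at the ghci prompt prints the three lines in order, while <hask>runAll (reverse steps)</hask> prints them backwards, which shows that the list is just data until we run it.<br />
<br />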
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
can explicitly mark the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
   do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
      let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use the powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof :: Parser ()<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to "cd-fit.hs", between the declaration of <br />
the Main module and the definition of main.<br />
<br />
Here we see quite a lot of new<br />
things, and several that we already know. <br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize<br />
:: Text.Parsec.Prim.ParsecT<br />
[Char] u Data.Functor.Identity.Identity Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "Text.Parsec.Prim.ParsecT [Char] u Data.Functor.Identity.Identity" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what "Dir" means. We defined the ''data[[type]]'' Dir as a record<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use the ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of a data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
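<br />
To keep the two roles apart, here is a minimal sketch that uses the differently named constructor "D" from above. The names "sample" and "sizeOfDir" are invented for this example only.<br />
<br />
```haskell
-- Type name on the left of '=', constructor name on the right.
data Dir = D Int String

-- The constructor builds values ...
sample :: Dir
sample = D 1 "foo"

-- ... and the same constructor takes them apart again in a pattern.
sizeOfDir :: Dir -> Int
sizeOfDir (D size _) = size
```
<br />
"Dir" appears only in type signatures, while "D" appears only in expressions and patterns. When both carry the same name, the compiler still tells them apart by where they are used.<br />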
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
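<br />
Partial application and the backtick syntax are not specific to parsers. The sketch below shows both with ordinary Prelude functions; "firstThree" and "halves" are illustrative names, not part of the program we are building.<br />
<br />
```haskell
-- 'take' needs two arguments; giving it only one yields a new function.
firstThree :: [a] -> [a]
firstThree = take 3

-- Backticks turn a two-argument function into an infix operator,
-- and a section like (`div` 2) partially applies it on one side.
halves :: [Int]
halves = map (`div` 2) [10, 20, 30]
```
<br />
Asking ghci for ":t take", ":t take 3" and ":t take 3 "foobar"" shows the type shrinking by one argument at each step, just as it does for "manyTill".<br />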
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse<br />
:: Text.Parsec.Prim.Stream s Data.Functor.Identity.Identity t =><br />
Text.Parsec.Prim.Parsec s () a<br />
-> SourceName -> s -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
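<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch, with "describe" being a name made up for this example:<br />
<br />
```haskell
-- The constructor acts as the tag, so each equation handles one alternative.
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ [c]
```
<br />
There is no way to read a "Left" value as if it were a "Right" one; the compiler only lets us reach the payload through the matching constructor.<br />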
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
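<br />
To make the split between "description" and "evaluator" more tangible, here is a deliberately tiny parser monad written from scratch. It is only a sketch and not how Parsec is implemented: it has no position tracking and no backtracking, and every name in it ("Parser", "satisfy", "many0", "many1", "sizeAndName", "parseToy") is an assumption of this example. For simplicity it separates size and name with a single space.<br />
<br />
```haskell
import Data.Char (isDigit)

-- A toy parser: consumes a prefix of the input, or fails with a message.
newtype Parser a = Parser { runParser :: String -> Either String (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Left err        -> Left err
    Right (a, rest) -> Right (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Right (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Left err        -> Left err
    Right (f, rest) -> case pa rest of
      Left err         -> Left err
      Right (a, rest') -> Right (f a, rest')

-- The hidden machinery: thread the remaining input, stop at the first error.
instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Left err        -> Left err
    Right (a, rest) -> runParser (f a) rest

-- Primitive action: accept one character satisfying a predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \s -> case s of
  (c:cs) | ok c -> Right (c, cs)
  _             -> Left ("unexpected input near " ++ show (take 5 s))

-- Zero or more repetitions; stops quietly at the first failure.
many0 :: Parser a -> Parser [a]
many0 p = Parser $ \s -> case runParser p s of
  Left _          -> Right ([], s)
  Right (a, rest) -> case runParser (many0 p) rest of
    Right (as, rest') -> Right (a : as, rest')
    Left err          -> Left err

-- One or more repetitions.
many1 :: Parser a -> Parser [a]
many1 p = do
  x  <- p
  xs <- many0 p
  return (x : xs)

-- A combined action in "do" notation: a size, a space, a name.
sizeAndName :: Parser (Int, String)
sizeAndName = do
  size <- many1 (satisfy isDigit)
  _    <- satisfy (== ' ')
  name <- many1 (satisfy (/= '\n'))
  return (read size, name)

-- The evaluator: runs the description and hands back a plain result.
parseToy :: Parser a -> String -> Either String a
parseToy p input = fmap fst (runParser p input)
```
<br />
"sizeAndName" never mentions the remaining input or the error case; both live in the "Monad" instance and are set in motion by "parseToy", which plays the role that "parse" plays for Parsec.<br />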
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
putStrLn ("DEBUG: got input " ++ input)<br />
let dirs = case parse parseInput "stdin" input of<br />
Left err -> error $ "Input:\n" ++ show input ++ <br />
"\nError:\n" ++ show err<br />
Right result -> result<br />
putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",<br />
Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
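<br />
The field names in a record declaration double as ordinary functions. The sketch below repeats the declaration so that it is self-contained; "example" and "renamed" are names chosen for this illustration.<br />
<br />
```haskell
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show

-- The constructor still accepts its arguments positionally.
example :: Dir
example = Dir 1706 "cd-fit.hs"

-- Record-update syntax builds a modified copy; 'example' itself is unchanged.
renamed :: Dir
renamed = example {dir_name = "cd-fit.hs~"}
```
<br />
Nothing is mutated here: "renamed" is a new value that shares the size of "example" and differs only in the name.<br />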
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories by size (smallest first in the code below) and tries to put<br />
them on the CD one by one, skipping any directory that no longer fits. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
where<br />
cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
let new_size = pack_size p + dir_size d<br />
new_dirs = d:(dirs p)<br />
in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "[[Yet Another Haskell Tutorial]]" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we did not need the field names "pack_size" and "dirs": the constructor "DirPack" still accepts its arguments positionally.<br />
----<br />
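<br />
Two of the points above can be tried in isolation. The sketch below contrasts explicit recursion with "foldl", and passes a custom comparison to "sortBy"; it uses "comparing" from Data.Ord as a shortcut for writing the comparison by hand. All three function names are invented for this example.<br />
<br />
```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Explicit recursion: sum the sizes case by case.
totalRec :: [Int] -> Int
totalRec []     = 0
totalRec (x:xs) = x + totalRec xs

-- The same traversal expressed with the higher-order 'foldl'.
totalFold :: [Int] -> Int
totalFold = foldl (+) 0

-- 'comparing snd' builds the comparison function that 'sortBy' expects.
bySize :: [(String, Int)] -> [(String, Int)]
bySize = sortBy (comparing snd)
```
<br />
"comparing snd a b" is the same as "compare (snd a) (snd b)", which is the shape of "cmpSize" in "greedy_pack".<br />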
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
-- compute solution and print it<br />
putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2, replicateM)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
-- Let's just skip "coarbitrary" for now, ok? <br />
-- I promise, we will get back to it later :)<br />
coarbitrary = undefined<br />
-- We generate arbitrary "Dir" by generating random size and random name<br />
-- and stuffing them inside "Dir"<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
-- Generate random size between 10 and 1400 Mb<br />
where gen_size = do s <- choose (10,1400)<br />
return (s*1024*1024)<br />
-- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
gen_name = do n <- choose (1,300)<br />
replicateM n (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
let pack = greedy_pack ds <br />
in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of the '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is Haskell's way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
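<br />
A home-made contract makes the idea concrete. The class "HasSize", its method "sizeOf" and the function "fitsOn" below are all hypothetical; they exist only in this sketch.<br />
<br />
```haskell
-- The contract: any type that can report how much room it takes.
class HasSize a where
  sizeOf :: a -> Int

-- Two types sign the contract.
instance HasSize Bool where
  sizeOf _ = 1

instance HasSize a => HasSize [a] where
  sizeOf = sum . map sizeOf

-- A generic function that relies on nothing but the contract.
fitsOn :: HasSize a => Int -> a -> Bool
fitsOn capacity x = sizeOf x <= capacity
```
<br />
"fitsOn" knows nothing about Bool or lists. The constraint "HasSize a =>" in its type is the admission rule, and any type with an instance may pass.<br />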
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary. (The listing shows the QuickCheck 1 interface; in QuickCheck 2, "coarbitrary" lives in a separate class called "CoArbitrary".)<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>=) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for some of its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
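<br />
Here is a minimal sketch of such an instance written by hand. The type "Size" and the helper "rank" are made up for the example; only "compare" is defined, and the remaining methods come from the defaults.<br />
<br />
```haskell
data Size = Small | Medium | Large deriving (Eq, Show)

-- Helper for this sketch only: map each constructor to a number.
rank :: Size -> Int
rank Small  = 0
rank Medium = 1
rank Large  = 2

-- Only 'compare' is defined; (<), (>=), max, min and the rest use defaults.
instance Ord Size where
  compare a b = compare (rank a) (rank b)
```
<br />
With this instance in scope, "Small < Large", "max Medium Large" and "Data.List.sort" all work on "Size" values, even though none of them was written by us.<br />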
<br />
Once again, we see that a lot of [[type]]s are already instances of the typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back at the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
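<br />
A short sketch of what the derived instances give us. The declaration is repeated with the longer "deriving" list so that the block stands alone; "sortedDirs" is an illustrative name.<br />
<br />
```haskell
import Data.List (sort)

data Dir = Dir {dir_size :: Int, dir_name :: String}
  deriving (Eq, Ord, Show)

-- Derived Ord compares fields left to right: size first, then name.
sortedDirs :: [Dir]
sortedDirs = sort [Dir 964 "cd-fit.hs~", Dir 1706 "cd-fit.hs", Dir 964 "a.txt"]
```
<br />
The derived ordering is purely structural. It happens to sort by size here only because "dir_size" is the first field; when a different order is needed, "sortBy" with a custom comparison remains the tool of choice.<br />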
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler....<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
-- Generate random size between 10 and 1400 Mb<br />
where gen_size = do s <- choose (10,1400)<br />
return (s*1024*1024)<br />
-- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
gen_name = do n <- choose (1,300)<br />
replicateM n (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you have already heard that "Gen" is a "Monad", you can substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
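<br />
Since "liftM2" works in any monad, it can be tried without QuickCheck. The sketch below uses the "Maybe" and list monads from the standard library; "maybeDir", "noDir" and "allDirs" are names invented for the example.<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir Int String deriving (Eq, Show)

-- In the Maybe monad the result is Just only when both inputs are Just.
maybeDir :: Maybe Dir
maybeDir = liftM2 Dir (Just 1706) (Just "cd-fit.hs")

noDir :: Maybe Dir
noDir = liftM2 Dir Nothing (Just "cd-fit.hs")

-- In the list monad we get every combination of size and name.
allDirs :: [Dir]
allDirs = liftM2 Dir [1, 2] ["a", "b"]
```
<br />
In each case "liftM2" takes the plain two-argument constructor and makes it work on values wrapped in a monad, which is exactly what happens with "Gen Int" and "Gen String" above.<br />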
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
-- By calculating `bestDisk' for all possible disk sizes, we could<br />
-- obtain a solution for particular case by simple lookup in our list of<br />
-- solutions :)<br />
let precomp = map bestDisk [0..] <br />
<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty<br />
bestDisk 0 = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit = <br />
-- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Lets do that for all "candidate"<br />
-- dirs that are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm as to write the implementation for it. Nice, eh?<br />
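<br />
The trick that makes "precomp" work, a lazily built list whose elements are defined in terms of earlier elements of the same list, is easier to see on a smaller problem. The sketch below applies it to Fibonacci numbers; it only illustrates the technique, and "fibTable" and "fibAt" are names made up here.<br />
<br />
```haskell
-- Every table entry is defined in terms of earlier entries of the same table.
-- Laziness means an entry is computed only when somebody first asks for it,
-- and is then remembered for all later lookups.
fibTable :: [Integer]
fibTable = map fib [0 ..]
  where
    fib :: Int -> Integer
    fib 0 = 0
    fib 1 = 1
    fib n = fibTable !! (n - 1) + fibTable !! (n - 2)

-- Solving a particular case is then a simple lookup.
fibAt :: Int -> Integer
fibAt n = fibTable !! n
```
<br />
"fib" plays the role of "bestDisk" and "fibTable" the role of "precomp". Note that each "!!" walks the list from its head, so lookups get more expensive as the index grows.<br />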
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build a complex condition for filtering the list of dirs<br />
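<br />
As a warm-up for these exercises, here is a different and much simpler comprehension next to one possible hand-written translation, plus a filter condition built by function composition. This is a sketch rather than the official desugaring from the Haskell Report, and all names in it are invented.<br />
<br />
```haskell
-- Sugared: squares of the positive numbers in a list.
squaresSugar :: [Int] -> [Int]
squaresSugar xs = [sq | x <- xs, x > 0, let sq = x * x]

-- By hand: generator becomes concatMap, guard becomes if, let stays let.
squaresPlain :: [Int] -> [Int]
squaresPlain xs = concatMap step xs
  where
    step x = if x > 0 then (let sq = x * x in [sq]) else []

-- Function composition builds the filter condition from small pieces:
-- first take the size out of the pair, then compare it with 100.
bigEnough :: [(Int, String)] -> [(Int, String)]
bigEnough = filter ((> 100) . fst)
```
<br />
"(> 100) . fst" has the same shape as "(inRange (1,limit)).dir_size" in the listing: extract a component, then test it.<br />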
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent fixed-size integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will overflow and silently produce a wrong result.<br />
Standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
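<br />
The same contrast can be written down without depending on whether 'Int' is 32 or 64 bits wide. In the sketch below the names "beyondInt" and "wrapped" are made up; the wrap-around shown for 'Int' is what GHC does in practice, and the language definition does not promise it.<br />
<br />
```haskell
-- Convert the bound to Integer first, then step past it safely.
beyondInt :: Integer
beyondInt = toInteger (maxBound :: Int) + 1

-- Staying in Int, the same step wraps around to the other end under GHC.
wrapped :: Int
wrapped = maxBound + 1
```
<br />
"beyondInt" is a perfectly ordinary number one larger than the biggest 'Int', while "wrapped" quietly ends up equal to "minBound" without any error being reported.<br />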
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems that Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass <hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of the typeclass <hask>Num</hask>, so we can use <hask>fromInteger</hask> to convert an <hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
case [ DirPack (dir_size d + s) (d:ds) <br />
| d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
of <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
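<br />
To make the overflow concrete, here is a small self-contained sketch (it is not taken from the 'cd-fit' sources). It assumes GHC, where <hask>fromInteger</hask> silently wraps around when the result type is <hask>Int</hask>; the Haskell Report, as far as I know, leaves overflow on fixed-precision types undefined. It also shows <hask>genericIndex</hask> from <hask>Data.List</hask>, which takes an <hask>Integer</hask> index directly and is one possible way to avoid the conversion altogether:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- One past the largest Int, computed as an Integer (no overflow here).<br />
  let tooBig = toInteger (maxBound :: Int) + 1<br />
  -- Assumption: on GHC the conversion wraps around to the smallest Int.<br />
  print (fromInteger tooBig == (minBound :: Int))<br />
  -- genericIndex accepts any Integral index, so no conversion is needed.<br />
  print (genericIndex "abcdef" (3 :: Integer))  -- 'd'<br />
</haskell><br />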
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You did take my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them yet - they produce a report after the program terminates, and ours doesn't seem to terminate without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem: computing element 100000 fails to terminate in "reasonable" time, and to think that we tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program's run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 90% of the total<br />
run time and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp". (Wondering why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>.)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element of "precomp" in this piece of code:<br />
<br />
<haskell><br />
   case [ DirPack (dir_size d + s) (d:ds) <br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
        , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can never be sure that some element<br />
will not be needed again and could therefore be garbage-collected.<br />
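<br />
The same situation can be reproduced in miniature. The sketch below is not part of 'cd-fit'; the names <hask>table</hask> and <hask>entry</hask> are made up for illustration. It builds a lazy list whose elements are computed by looking up earlier elements of that very list via <hask>(!!)</hask>, which is roughly how "precomp" is used above. Since any element may be requested again later, the whole evaluated prefix of the list has to stay alive:<br />
<br />
<haskell><br />
-- A lazily built lookup table: element n is the n-th Fibonacci number.<br />
table :: [Integer]<br />
table = map entry [0 ..]<br />
  where<br />
    entry :: Int -> Integer<br />
    entry 0 = 0<br />
    entry 1 = 1<br />
    -- Each entry is defined via lookups into the table itself.<br />
    entry n = table !! (n - 1) + table !! (n - 2)<br />
<br />
main :: IO ()<br />
main = print (table !! 50)  -- 12586269025<br />
</haskell><br />
<br />
Each lookup also walks the list from its head, so reaching element n costs time proportional to n. That is harmless for index 50, but it adds up quickly for indices in the hundreds of millions.<br />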
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow when accessing "precomp" via the (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best-packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit', bigger than 0, the best-packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate"<br />
   -- dirs that are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) <br />
        | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
        , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably because all of them are too big).<br />
     -- Well, just report that the disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: total allocation dropped by a factor<br />
of about 640! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds) <br />
        | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Run times may still be lengthy, but bearable, and the<br />
number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt>, where we divide all directory<br />
sizes and the medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster run times).<br />
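<br />
As a starting point, the scaling step itself might look like the sketch below. It works on plain <hask>Integer</hask> sizes instead of the <hask>Dir</hask> type, and both the name <hask>scaleSizes</hask> and the rounding policy are assumptions of this sketch rather than part of 'cd-fit': directory sizes are rounded up and the medium size is rounded down, so a packing that fits after scaling also fits before it (at the price of possibly missing some tight packings):<br />
<br />
<haskell><br />
-- Scale the medium size and all directory sizes by the smallest<br />
-- positive directory size.<br />
scaleSizes :: Integer -> [Integer] -> (Integer, [Integer])<br />
scaleSizes limit sizes =<br />
  case filter (> 0) sizes of<br />
    []       -> (limit, sizes)   -- nothing to scale by<br />
    positive -><br />
      let unit      = minimum positive<br />
          -- Round up, so a scaled size never understates the real one.<br />
          roundUp s = (s + unit - 1) `div` unit<br />
      in (limit `div` unit, map roundUp sizes)<br />
<br />
main :: IO ()<br />
main = print (scaleSizes 2000 [300, 700, 1050])  -- (6,[1,3,4])<br />
</haskell><br />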
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisited monads once again. I will let other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value can be '''either''' a string or a list of<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, a lookup of the attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- the code to compile already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
  -- First, we look up 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing <br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -> <br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- ... then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove the annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
          -- ... finally, combine them into a single String. <br />
          -- If all lookups failed, we should pass the failure to the caller.<br />
          case null strings of<br />
            True  -> Nothing<br />
            False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by a single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (look up a value, look up a reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
import Data.List (intersperse)<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we look up 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failed lookup already makes the whole block<br />
                     -- ... return Nothing ("monad magic"), and all 'Just's have been removed,<br />
                     -- ... we just combine the values into a single String,<br />
                     -- ... and return failure if it is empty. <br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code occupies a mere 7 lines instead of 13<br />
- almost a two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that the<br />
type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
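<br />
This "check" is not built into the language; it lives in the definition of the bind operator <hask>>>=</hask> for <hask>Maybe</hask>, which is what a "do" block is translated into. The following sketch re-implements it under the made-up name <hask>bindMaybe</hask> and uses it to write a small function (<hask>increment</hask>, also made up for this illustration) without "do":<br />
<br />
<haskell><br />
-- A hand-written equivalent of (>>=) for Maybe.<br />
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
bindMaybe Nothing  _ = Nothing   -- failure: skip the rest of the computation<br />
bindMaybe (Just v) f = f v       -- success: unwrap the value and continue<br />
<br />
-- The same as: increment x = do v <- x; return (v+1)<br />
increment :: Maybe Int -> Maybe Int<br />
increment x = x `bindMaybe` \v -> Just (v + 1)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (increment (Just 5))  -- Just 6<br />
  print (increment Nothing)   -- Nothing<br />
</haskell><br />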
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such a<br />
use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
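<br />
For reference, a function with this behavior can be written in two lines. The sketch below uses the made-up name <hask>either'</hask> to avoid clashing with the Prelude's <hask>either</hask>; it is meant as an illustration, not as a copy of the library source:<br />
<br />
<haskell><br />
-- Apply the first function to a Left value, the second to a Right value.<br />
either' :: (a -> c) -> (b -> c) -> Either a b -> c<br />
either' f _ (Left a)  = f a<br />
either' _ g (Right b) = g b<br />
<br />
main :: IO ()<br />
main = do<br />
  print (either' (+1) length (Left 5 :: Either Int String))       -- 6<br />
  print (either' (+1) length (Right "foo" :: Either Int String))  -- 3<br />
</haskell><br />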
<br />
It seems that <hask>either</hask> is exactly what we need. Let's replace the<br />
<hask>case</hask> with an invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do <br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
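<br />
To see why nothing needs to be hardcoded, here is one way <hask>sequence</hask> could be written for lists. This is a sketch under the made-up name <hask>sequenceList</hask>, not the library source. It only uses "do" and <hask>return</hask>, so stopping at the first <hask>Nothing</hask> (or at the first IO exception) comes entirely from the monad it is used with:<br />
<br />
<haskell><br />
-- Run the actions from left to right and collect their results.<br />
sequenceList :: Monad m => [m a] -> m [a]<br />
sequenceList []       = return []<br />
sequenceList (m : ms) = do<br />
  x  <- m                -- the monad decides what a "failure" of m means<br />
  xs <- sequenceList ms<br />
  return (x : xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (sequenceList [Just 'a', Just 'b'])           -- Just "ab"<br />
  print (sequenceList [Just 'a', Nothing, Just 'c'])  -- Nothing<br />
  units <- sequenceList [putStrLn "a", putStrLn "b"]  -- prints a, then b<br />
  print units                                         -- [(),()]<br />
</haskell><br />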
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution when some<br />
condition is not met.<br />
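<br />
For the <hask>Maybe</hask> monad, the effect of <hask>guard</hask> can be sketched as below. The name <hask>guardMaybe</hask> is made up for this illustration; the real <hask>guard</hask> from <hask>Control.Monad</hask> is more general and works for other monads as well:<br />
<br />
<haskell><br />
-- Succeed with a dummy value when the condition holds, fail otherwise.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()<br />
guardMaybe False = Nothing<br />
<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)   -- a Nothing here skips the 'return' below<br />
  return (v + 1)<br />
<br />
main :: IO ()<br />
main = print (map foo [Just 4, Just 5, Just 6])  -- [Just 5,Nothing,Just 7]<br />
</haskell><br />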
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open to proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.haskell.org/haskellwiki/All_About_Monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)<br />
<br />
Languages: [[Haskellへのヒッチハイカーガイド|jp]], [[Es/Guía de Haskell para autoestopistas|es]]</div>Adepthttps://wiki.haskell.org/index.php?title=CamHac/PostHackathonReport&diff=41601CamHac/PostHackathonReport2011-08-14T15:25:37Z<p>Adept: </p>
<hr />
<div>= Post-Hackathon Report =<br />
<br />
This page is for listing what was done during the Hackathon. Please add a short description of what you worked on, with links to relevant blog posts, hackage packages, commits, etc.<br />
<br />
== fclabels 1.0 release ==<br />
<br />
New release of the '''fclabels''' package. The new package has a lot of code and documentation cleanups, support for partial labels in the case of multi-constructor datatypes and is about 20x as fast for setting and modifying as the previous version. Thanks everyone for helping me out!<br />
<br />
Hackage: http://hackage.haskell.org/package/fclabels-1.0.1<br />
<br />
Github: http://github.com/sebastiaanvisser/fclabels<br />
<br />
== GHC and base library improvements ==<br />
<br />
* [http://hackage.haskell.org/trac/ghc/ticket/5413 Add primops for bit population count]. These primops compile down to `POPCNT` instructions where available and fast fallbacks (implemented in C) otherwise.<br />
<br />
* [http://hackage.haskell.org/trac/ghc/ticket/5414 Add unchecked left and right bit shifts]: The Data.Bits.shift method uses a branch to check if the shift amount is larger than the word size and returns 0 in these cases. This extra safety makes performance worse in bit twiddling code.<br />
<br />
* Discussed unpacking of enums in GHC (not yet implemented).<br />
<br />
== Darcs ==<br />
<br />
New contributors:<br />
<br />
* use red text to report when <font color="red">we have a conflict</font> ([http://bugs.darcs.net/issue1681 issue1681],[http://bugs.darcs.net/patch646 patch646], Jeff Foster)<br />
* support 'since' in English dates parser<br />
* filter SSH output ([http://bugs.darcs.net/issue845 issue845], Jeff Foster and Sebastian Korten)<br />
* support arbitrary darcs command in darcs-test (Alexander Njemz)<br />
* output ISO dates in darcs changes? ([http://bugs.darcs.net/issue140 issue140], Alexander N, may be not a good idea)<br />
* add a last regrets prompt to interactive patch selection ([http://bugs.darcs.net/issue1920 issue1920], [http://bugs.darcs.net/patch655 patch655], Johannes Weiß)<br />
* [in-progress] support removing changes in amend-record ([http://bugs.darcs.net/issue1470 issue1470], Johannes Weiß)<br />
<br />
== wxHaskell ==<br />
<br />
* a Windows build fix ([https://sourceforge.net/mailarchive/forum.php?thread_name=CAA5%3D7kb%2BCmJ178tvrxtOnuswWBeBxYpSotiQWRXuKE3m_sByXA%40mail.gmail.com&forum_name=wxhaskell-devel patch])<br />
* a fix for [https://sourceforge.net/tracker/?func=detail&aid=3019730&group_id=73133&atid=536845 an issue with colorDialog] ([https://sourceforge.net/mailarchive/forum.php?thread_name=20110813150933.GA758%40dewdrop.local&forum_name=wxhaskell-devel patch])<br />
<br />
== hCole-server ==<br />
<br />
* A Snap-based web application that interacts with the COLE (see http://portal.acm.org/citation.cfm?id=1356080 and http://portal.acm.org/citation.cfm?id=1772965) framework for exploring compiler optimisation levels. The purpose of the web app is that collaborators can submit optimisation sequences to the COLE backend and retrieve the results when they are available after measuring.<br />
* Git repository of the web application can be found at https://github.com/itkovian/hcole-server<br />
<br />
== GObject Introspection ==<br />
<br />
* Work-in-progress binding generator for GObject-based libraries such as Gtk+ 3.<br />
* Started switching to [http://hackage.haskell.org/package/haskell-src-exts-1.11.1 haskell-src-exts] for code generation.<br />
* Patches currently on the ''camhac'' branch on [https://gitorious.org/haskell-gi/haskell-gi gitorious].<br />
<br />
== Snap Framework ==<br />
<br />
* Some work has been done on the authentication Snaplet, including an (incomplete) HDBC backend for it. An early work-in-progress can be found here: https://github.com/norm2782/snap<br />
* An example application which uses Snap 0.6 has been improved to use the authentication Snaplet. Another work-in-progress: https://github.com/norm2782/snap-guestbook<br />
<br />
== Data.Text ==<br />
<br />
* Further benchmarking, bug fixing to support the UTF-8 port. Progress can be found in the ''utf8'' branch here: http://github.com/jaspervdj/text<br />
<br />
== hs-poker ==<br />
<br />
* A "redneck naive" poker hand evaluator. Code is on GitHub (https://github.com/fffej/HS-Poker). The hope is to turn this into a poker bot playground for Haskell (Jeff / Sebastian)<br />
<br />
== haskell-mpi ==<br />
* New version 1.1.0 uploaded to Hackage, including support for more MPI implementations, bugfixes and general awesomeness<br />
* The upcoming Monad Reader will feature an article about parallel programming with MPI, written during the course of the hackathon (Dmitry Astapov)</div>Adepthttps://wiki.haskell.org/index.php?title=CamHac&diff=40537CamHac2011-06-16T21:25:31Z<p>Adept: </p>
<hr />
<div>Haskell Hackathon in Cambridge, UK, '''August 12-14, 2011'''<br />
<br />
== About ==<br />
<br />
Come and spend a weekend in Cambridge hacking Haskell code in great surroundings with fantastic company! Haskell Hackathons are a tradition where everyone is welcome; we get together and work on projects with others or just do our own thing, the overall goal being to improve the Haskell ecosystem.<br />
<br />
CamHac will be held from 12-14 August 2011, at [http://www.homertonconference.com/ Homerton College] in Cambridge. As with previous Hackathons, all are welcome -- you do not have to be a Haskell guru. All you need is a basic knowledge of Haskell, a willingness to learn, and a project you're excited to help with (or a project of your own to work on).<br />
<br />
There will be lots of hacking, good food, and, of course, fun! <br />
<br />
* Organiser: [mailto:marlowsd@gmail.com Simon Marlow] (<tt>JaffaCake</tt> on IRC)<br />
* Mailing list: [http://www.haskell.org/mailman/listinfo/hackathon hackathon@haskell.org]<br />
* IRC channel: #ghc on FreeNode<br />
<br />
Many thanks to [http://research.microsoft.com/en-us/labs/cambridge/default.aspx Microsoft Research Cambridge] for agreeing to sponsor the event.<br />
<br />
== Registration ==<br />
<br />
'''Registration is now closed''' We are full, sorry!<br />
<br />
== Venue ==<br />
<br />
We're in the [http://www.homertonconference.com/Leah-Manning.html Leah Manning Room] of [http://www.homertonconference.com/ Homerton Conference Centre]. It is about [http://www.google.co.uk/maps?f=d&source=s_d&saddr=United+Kingdom+(Cambridge,+Railway+Station+(Stop+B))&daddr=CB2+8PH&hl=en&geocode=FehrHAMdjhUCACHpLU_p7S-CNg%3BFc5LHAMdNhMCACmn-uB8eXrYRzFlrDhff7fJ9A&mra=iwd&dirflg=w&sll=52.190667,0.134583&sspn=0.021547,0.040598&ie=UTF8&z=16 15 minutes walk from the train station], and Cambridge town centre is about 30 minutes walk.<br />
<br />
'''Times''': we have the room booked all day for the three days, and we'll probably start around 10am and finish around 6pm. Exact time details to be confirmed later. <br />
<br />
There will be WiFi access.<br />
<br />
There will be a projector for giving talks/demos. We will probably reserve a part of the time for talks and demos.<br />
<br />
== Food ==<br />
<br />
Tea and coffee will be supplied. We will have to go out to find lunch, but there are various places to eat and buy food at the [http://www.cambridge-x.co.uk Cambridge Leisure Park] a few minutes walk towards Cambridge town centre. In the evening we will probably head towards the town where there are plenty of good restaurants.<br />
<br />
We have been advised that only food provided by or purchased from Homerton College can be consumed on the premises.<br />
<br />
== Local arrangements ==<br />
<br />
=== Getting to Cambridge ===<br />
<br />
==== By Plane ====<br />
<br />
* [http://www.stanstedairport.com/ Stansted Airport]: Stansted is the nearest of the London-area airports to Cambridge. It is mostly served by flights to and from mainland Europe, Ireland, and elsewhere in the UK. By train it is about 30 minutes to Cambridge, bus about 1 hour.<br />
<br />
* [http://www.heathrowairport.com/ Heathrow Airport]: Heathrow is the principal London-area airport and one of the busiest in Europe with a wide range of national, European, and international services. By train it is about 1h30 to 2h to Cambridge (Heathrow Express is faster but more expensive).<br />
<br />
* [http://www.gatwickairport.com/ Gatwick Airport]: Gatwick is the second "London" airport with a wide range of national, European and international services. By train it is about 2h to Cambridge.<br />
<br />
* Other airports: [http://www.london-luton.co.uk/ Luton Airport], [http://www.norwichairport.co.uk/ Norwich airport], and [http://www.southendairport.com/ Southend airport] are other regional airports in the East Anglia region. If you use these, car or taxi is the best option for travel to Cambridge.<br />
<br />
==== Trains from London ====<br />
<br />
London has two train lines into Cambridge, London Kings Cross and London Liverpool Street. There is a regular service on both lines and duration is under an hour on the direct trains. Go to [http://www.nationalrail.co.uk National Rail] to check train times.<br />
<br />
You can usually buy tickets at the station, either at a ticket machine or at a staffed counter. You usually will ''not'' be able to buy tickets on the train without paying a fine. Tickets can be cheaper if you buy an off-peak return. Off-peak tickets are usually valid on weekends and after 10 a.m. on weekdays. Make sure, though, to check [http://www.nationalrail.co.uk National Rail] for which trains are eligible for off-peak tickets.<br />
<br />
=== Getting to the venue ===<br />
<br />
[http://www.google.co.uk/maps?f=d&source=s_d&saddr=United+Kingdom+(Cambridge,+Railway+Station+(Stop+B))&daddr=CB2+8PH&hl=en&geocode=FehrHAMdjhUCACHpLU_p7S-CNg%3BFc5LHAMdNhMCACmn-uB8eXrYRzFlrDhff7fJ9A&mra=iwd&dirflg=w&sll=52.190667,0.134583&sspn=0.021547,0.040598&ie=UTF8&z=16 Walk from the train station] (about 15 minutes)<br />
<br />
[http://www.homertonconference.com/How-to-find-us.html How to find the venue]<br />
<br />
'''Local Taxis''': Panther Taxis 01223 715715<br />
<br />
=== Accommodation ===<br />
<br />
[http://www.visitcambridge.org/VisitCambridge/WhereToStay.aspx VisitCambridge: Where to Stay in Cambridge]<br />
<br />
The nearest hotels to the venue seem to be:<br />
<br />
* [http://www2.travelodge.co.uk/ Travelodge] (Cambridge Central) is just a few minutes walk from the venue. It is currently charging £65.80 per night for 11-14 August.<br />
* [http://www.helenhotel.co.uk/index.htm Helen Hotel]<br />
* [http://www.bandbincambridgeshire.co.uk/ Bridge Guest House]<br />
* [http://www.cheapguesthouses.com/ Fairways Guest House]<br />
* [http://www.abbeyfieldguesthouse.com/ Abbeyfield Guest House]<br />
* [http://rockviewguesthouse.co.uk/default.aspx Rock View Guest House]<br />
* [http://alingtonhouse.com/default.aspx Alington House Guest House]<br />
* [http://www.yha.org.uk/find-accommodation/east-of-england/hostels/cambridge/index.aspx Cambridge Youth Hostel]. The hostel does not offer single rooms, but you might be able to organise a group to occupy one 4-bed room.<br />
* [http://www.cambridgerooms.co.uk/ Stay in Cambridge Colleges]<br />
<br />
If you contact any of the above and find they're booked up, please remove them from the list.<br />
<br />
Microsoft Research recommends the following hotels to visitors. These are closer to the city centre but are probably a lot more expensive than those above:<br />
<br />
* [http://www.hilton.co.uk/cambridgegardenhouse Double Tree by Hilton Garden House Cambridge]<br />
* [http://www.ichotelsgroup.com/h/d/cp/1/en/hotel/cbguk Crowne Plaza Cambridge]<br />
* [http://www.devere.co.uk/our-locations/university-arms.html De Vere University Arms]<br />
<br />
== Projects ==<br />
<br />
Use this space to list projects you are interested in working on, and add your name to projects you are interested in helping with.<br />
<br />
* General hacking away at Snap Framework (exact goals TBD), perhaps adding/improving documentation/tutorials at the same time. (Jurriën Stutterheim, Twey)<br />
* Darcs<br />
* Something games/3d related? (Stephen L)<br />
* [http://code.google.com/p/lambdacube LambdaCube 3D engine] (Csaba Hruska)<br />
* Designing/proposing/implementing a richer or more 'haskelly' API for the network package (Ben Millwood, Twey)<br />
* Writing a library that implements the ideas of [http://web.cecs.pdx.edu/~mpj/thih/ Typing Haskell In Haskell] to type-check, say, a haskell-src-exts AST (Ben Millwood, Stijn van Drongelen)<br />
* wxHaskell (Maciek Makowski)<br />
<br />
== Attendees ==<br />
<br />
# Simon Marlow<br />
# Jurriën Stutterheim<br />
# Neil Mitchell<br />
# Jasper Van der Jeugt<br />
# Max Bolingbroke<br />
# Ben Millwood ‘benmachine’<br />
# Roman Leshchinskiy<br />
# Gregory Collins<br />
# Martijn van Steenbergen<br />
# Sjoerd Visscher<br />
# Sebastiaan Visser<br />
# Tom Lokhorst<br />
# Erik Hesselink<br />
# Jeff Foster<br />
# Sebastian Korten<br />
# Alessandro Vermeulen<br />
# Vlad Hanciuta<br />
# Ganesh Sittampalam<br />
# Eric Kow<br />
# Alexander Njemz<br />
# Mikolaj Konarski<br />
# Ian Lynagh<br />
# Andres Löh<br />
# Jeroen Janssen<br />
# Nicolas Wu<br />
# Duncan Coutts<br />
# Dominic Orchard<br />
# Jacek Generowicz<br />
# Owen Stephens<br />
# Benedict Eastaugh<br />
# Stephen Lavelle<br />
# Sam Martin<br />
# Alex Horsman<br />
# Andy Georges<br />
# Niklas Larsson<br />
# Raeez Lorgat<br />
# Maryna Strelchuk<br />
# Vincent Hanquez<br />
# Chris Done<br />
# Tomas Petricek<br />
# Thomas Schilling<br />
# Dragos Ionita<br />
# Simon Meier<br />
# Will Thompson<br />
# Sergii Strelchuk<br />
# Lennart Kolmodin<br />
# Philippa Cowderoy<br />
# Steven Keuchel<br />
# Michal Terepeta<br />
# Maciek Makowski<br />
# Johannes Weiß<br />
# Alejandro Serrano<br />
# Mike McClurg<br />
# Stefan Wehr<br />
# David Leuschner<br />
# James ‘Twey’ Kay<br />
# Simon PJ<br />
# Neill Bogie<br />
# Csaba Hruska<br />
# Bart Coppens<br />
# Stijn van Drongelen<br />
# Jeremy Yallop<br />
# Paul Wilson<br />
# Dmitry Astapov<br />
* Add your name here, once registered...</div>Adepthttps://wiki.haskell.org/index.php?title=Ghent_Functional_Programming_Group/BelHac/Register&diff=37348Ghent Functional Programming Group/BelHac/Register2010-10-28T17:21:03Z<p>Adept: </p>
<hr />
<div>Important: Please wait for a confirmation email before booking any flights/hotels.<br />
<br />
Registration is via email to Jasper Van der Jeugt at<br />
<br />
jaspervdj+belhac@gmail.com<br />
<br />
with the subject<br />
<br />
BelHac registration <br />
<br />
and body containing the following information:<br />
<br />
Name:<br />
#haskell nick: (if applicable)<br />
Email:<br />
Food restrictions:<br />
Days attending: <br />
<br />
Here is an example:<br />
<br />
Name: Jasper Van der Jeugt<br />
Nick: jaspervdj<br />
Email: jaspervdj@gmail.com<br />
Food restrictions: Raw flesh only<br />
Days attending: Friday, Saturday and Sunday<br />
<br />
If you want, you can also add your name here:<br />
<br />
{| class="wikitable"<br />
! Nickname<br />
! Real Name<br />
! Affiliation<br />
! Mobile #<br />
! Email<br />
! Arriving - Departing<br />
! Accommodation<br />
|-<br />
| jaspervdj<br />
| Jasper Van der Jeugt<br />
| Ghent University<br />
| +32 476 26 48 47<br />
| jaspervdj@gmail.com<br />
| <br />
| Has a small place in Ghent<br />
|-<br />
| Itkovian<br />
| Andy Georges<br />
| Ghent University/FWO<br />
| <br />
| itkovian@gmail.com<br />
|<br />
| Lives in Ostend, arrives by train on daily basis<br />
|-<br />
| Javache<br />
| Pieter De Baets<br />
| Ghent University<br />
| <br />
| pieter.debaets@gmail.com<br />
|<br />
|<br />
|-<br />
| boegel<br />
| Kenneth Hoste<br />
| Ghent University<br />
| <br />
| kenneth.hoste@ugent.be<br />
|<br />
| commute to/from Ghent daily<br />
|-<br />
| BCoppens<br />
| Bart Coppens<br />
| Ghent University<br />
| <br />
| bart.coppens@elis.ugent.be<br />
|<br />
|<br />
|-<br />
| jejansse<br />
| Jeroen Janssen<br />
| VUB<br />
|<br />
| jejansse@gmail.com<br />
|<br />
| Lives in Ghent.<br />
|-<br />
| Feuerbach<br />
| Roman Cheplyaka<br />
| <br />
| +380662285780<br />
| roma@ro-che.info<br />
| unknown yet<br />
| unknown yet<br />
|-<br />
| solidsnack<br />
| Jason Dusek<br />
| Heroku<br />
| +1 415 894 2162<br />
| jason.dusek@gmail.com<br />
| 5th-7th<br />
| Probably Monasterium.<br />
|-<br />
| kosmikus<br />
| Andres L&ouml;h<br />
| Well-Typed LLP<br />
|<br />
| mail@andres-loeh.de<br />
| 5th-7th<br />
|<br />
|-<br />
| Igloo<br />
| Ian Lynagh<br />
| Well-Typed LLP<br />
|<br />
| igloo@earth.li<br />
| 5th-7th<br />
|<br />
|-<br />
| dcoutts<br />
| Duncan Coutts<br />
| Well-Typed LLP<br />
|<br />
| duncan.coutts@googlemail.com<br />
| 5th-7th<br />
|<br />
|-<br />
| wvdschel<br />
| Wim Vander Schelden<br />
| Ghent University<br />
| <br />
| belhac@fixnum.org<br />
| <br />
| Lives in Ghent<br />
|-<br />
| sjoerd_visscher<br />
| Sjoerd Visscher<br />
| SDL Xopus<br />
| <br />
| sjoerd@w3future.com<br />
| 5th-7th<br />
| youth hostel<br />
|-<br />
| mietek<br />
| Miëtek Bak<br />
| Erlang Solutions<br />
| <br />
| mietek@gmail.com<br />
| 5th-7th<br />
| Open for suggestions<br />
|-<br />
| <br />
| Steven Keuchel<br />
| Utrecht University<br />
| <br />
| <br />
| 5th-7th<br />
| <br />
|-<br />
| chr1s<br />
| Chris Eidhof<br />
| <br />
| <br />
| chris@eidhof.nl<br />
| 5th-7th<br />
| youth hostel<br />
|-<br />
|fphh<br />
|Heinrich Hördegen<br />
|Funktionale Programmierung<br />
|<br />
|hoerdegen@laposte.net<br />
|5th-7th<br />
|Open for suggestions<br />
|-<br />
| <br />
| Tom Lokhorst<br />
| <br />
| <br />
| tom@lokhorst.eu<br />
| 5th-7th<br />
| youth hostel<br />
|-<br />
| sioraiocht<br />
| Tom Harper<br />
| Oxford University Computing Laboratory<br />
| +44 7533 998 591<br />
| rtomharper@gmail.com<br />
| Friday-Sunday<br />
| <br />
|-<br />
| mcclurmc<br />
| Mike McClurg<br />
| <br />
| <br />
| mike.mcclurg@gmail.com<br />
| Friday-Sunday<br />
| <br />
|-<br />
| chrisdone<br />
| Chris Done<br />
| CREATE-NET<br />
| TBA<br />
| chrisdone@gmail.com<br />
| 5th-7th<br />
| <br />
|-<br />
| lpeterse<br />
| Lars Petersen<br />
| University of Osnabrück<br />
| <br />
| info@lars-petersen.net<br />
| 5th-7th<br />
| Hostel<br />
|-<br />
| sfvisser<br />
| Sebastiaan Visser<br />
| Silk<br />
| <br />
| haskell@fvisser.nl<br />
| 5th-7th<br />
| youth hostel<br />
|-<br />
| alatter<br />
| Antoine Latter<br />
|<br />
|<br />
| aslatter@gmail.com<br />
| 5th - 7th<br />
| Hostel<br />
|-<br />
| marcmo<br />
| oliver mueller<br />
|<br />
|<br />
| oliver.mueller@gmail.com<br />
| 5th - 7th<br />
| Hostel<br />
|-<br />
| kayuri<br />
| Yuriy Kashnikov<br />
| <br />
| <br />
| yuriy.kashnikov@gmail.com<br />
| 5th-7th<br />
| <br />
|-<br />
| nomeata<br />
| Joachim Breitner<br />
| <br />
| <br />
| mail@joachim-breitner.de<br />
| 5th-7th<br />
| still looking<br />
|-<br />
| dons<br />
| Don Stewart<br />
| [http://www.galois.com Galois, Inc]<br />
| 07961033604<br />
| dons@galois.com<br />
| 5th - 7th.<br />
| <br />
|-<br />
| basvandijk<br />
| Bas van Dijk<br />
| <br />
| +31 614065248<br />
| v.dijk.bas@gmail.com<br />
| 5th - 7th<br />
| Faja Lobi<br />
|-<br />
| roelvandijk<br />
| Roel van Dijk<br />
| <br />
| +31 612856453<br />
| vandijk.roel@gmail.com<br />
| 5th - 7th<br />
| Faja Lobi<br />
|-<br />
| <br />
| Martijn van Steenbergen<br />
| <br />
| <br />
| <br />
| 5th - 7th<br />
| hostel47<br />
|-<br />
| <br />
| Atze Dijkstra<br />
| Utrecht University<br />
| <br />
| atze@cs.uu.nl<br />
| 5th - 6th, possibly morning 7th<br />
| <br />
|-<br />
| mrtrac<br />
| Martin Kiefel<br />
| <br />
| <br />
| mk@nopw.de<br />
| 5th-7th<br />
| Hotel Flandria<br />
|-<br />
| <br />
| Jan-Willem Roorda<br />
| <br />
| <br />
| <br />
| 5th-7th<br />
| Monasterium<br />
|}</div>Adepthttps://wiki.haskell.org/index.php?title=LtU-Kiev/Hackathon&diff=37209LtU-Kiev/Hackathon2010-10-10T19:14:50Z<p>Adept: </p>
<hr />
<div>'''LtU-Kiev Hackathon''' is a collaborative coding festival aimed towards<br />
building and improving Haskell libraries, tools, and infrastructure.<br />
<br />
This event is also a get-together of functional programming<br />
enthusiasts. With this in mind, attendees are free to choose any<br />
programming language they like for their hacking.<br />
<br />
LtU-Kiev Hackathon is open to all. You don't have to be a Haskell guru<br />
to attend. All you need is a project you are excited to help with or<br />
an individual project to work on.<br />
<br />
== Date and Venue ==<br />
16-17 October 2010, 10 am to 6 pm.<br />
<br />
The hackathon will be hosted by GlobalLogic and will be held in Kiev's<br />
GL-Club: Mykoly Grinchenka Str., 2/1<br />
([http://maps.google.com/maps?hl=en-GB&t=h&ie=UTF8&ll=50.42498,30.506662&spn=0.002239,0.005681&z=18&iwloc=lyrftr:h,14410286290370450397,50.424939,30.506802 map]).<br />
<br />
== Registration ==<br />
If you will be attending, please<br />
[https://spreadsheets.google.com/viewform?formkey=dE5OdmhpUnJwU19VYXVzYjZaRjdKcFE6MQ register].<br />
The space is limited (up to 40 attendees), so don't delay!<br />
<br />
== Projects ==<br />
If you have a project that you want to work on at the Hackathon,<br />
please describe it here. This page is only meant for coordination, it<br />
does not impose anything.<br />
<br />
If you are interested in one of these projects, add your name to the<br />
list of hackers under that project. If you have another project you<br />
want to hack on, please add it to the list.<br />
<br />
=== darcs ===<br />
Hackers: ADEpt<br />
<br />
Adding tests to pathlib and fixing path-handling bugs reported by me<br />
<br />
=== loker ===<br />
[http://github.com/feuerbach/loker UNIX Shell scripts parser]<br />
<br />
I plan to conduct an introduction to the project for new and wannabe<br />
contributors at the beginning of the first day.<br />
<br />
Hackers: Roman Cheplyaka<br />
<br />
=== vty-ui ===<br />
http://codevine.org/vty-ui/<br />
<br />
Hackers: jtootf<br />
<br />
=== cabal ===<br />
Going to check out some [http://hackage.haskell.org/trac/hackage/report/13 easy tickets].<br />
<br />
Hackers: Oleg Smirnov, maybe ADEpt (not-so-easy tickets)<br />
<br />
=== turnir ===<br />
http://github.com/sphynx/turnir<br />
<br />
Hackers: Ivan Veselov<br />
<br />
=== TermWare ===<br />
http://redmine.gradsoft.ua/projects/show/termware<br />
<br />
Hackers: rssh<br />
<br />
=== under.c ===<br />
<br />
`[http://github.com/vvv/under.c#readme under]' utility decodes binary<br />
DER data to textual S-expressions and/or encodes sexps back to DER.<br />
This C program is based on [http://okmij.org/ftp/Streams.html iteratees]<br />
and is rather ''functional'' in this respect.<br />
<br />
Plan is to extend existing functionality with pluggable codecs.<br />
<br />
Hackers: vvv<br />
<br />
=== nullfs ===<br />
Hackers: xrgtn</div>Adepthttps://wiki.haskell.org/index.php?title=Ghent_Functional_Programming_Group/BelHac/Register&diff=36741Ghent Functional Programming Group/BelHac/Register2010-09-10T14:28:07Z<p>Adept: </p>
<hr />
<div>Important: Please wait for a confirmation email before booking any flights/hotels.<br />
<br />
Registration is via email to Jasper Van der Jeugt at<br />
<br />
jaspervdj+belhac@gmail.com<br />
<br />
with the subject<br />
<br />
BelHac registration <br />
<br />
and body containing the following information:<br />
<br />
Name:<br />
#haskell nick: (if applicable)<br />
Email:<br />
Food restrictions:<br />
Days attending: <br />
<br />
Here is an example:<br />
<br />
Name: Jasper Van der Jeugt<br />
Nick: jaspervdj<br />
Email: jaspervdj@gmail.com<br />
Food restrictions: Raw flesh only<br />
Days attending: Friday, Saturday and Sunday<br />
<br />
If you want, you can also add your name here:<br />
<br />
{| class="wikitable"<br />
! Nickname<br />
! Real Name<br />
! Affiliation<br />
! Mobile #<br />
! Email<br />
! Arriving - Departing<br />
! Accommodation<br />
|-<br />
| jaspervdj<br />
| Jasper Van der Jeugt<br />
| Ghent University<br />
| +32 476 26 48 47<br />
| jaspervdj@gmail.com<br />
| <br />
| Has a small place in Ghent<br />
|-<br />
| Itkovian<br />
| Andy Georges<br />
| Ghent University/FWO<br />
| <br />
| itkovian@gmail.com<br />
|<br />
| Lives in Ostend, arrives by train on daily basis<br />
|-<br />
| Javache<br />
| Pieter De Baets<br />
| Ghent University<br />
| <br />
| pieter.debaets@gmail.com<br />
|<br />
|<br />
|-<br />
| boegel<br />
| Kenneth Hoste<br />
| Ghent University<br />
| <br />
| kenneth.hoste@ugent.be<br />
|<br />
| commute to/from Ghent daily<br />
|-<br />
| BCoppens<br />
| Bart Coppens<br />
| Ghent University<br />
| <br />
| bart.coppens@elis.ugent.be<br />
|<br />
|<br />
|-<br />
| jejansse<br />
| Jeroen Janssen<br />
| VUB<br />
|<br />
| jejansse@gmail.com<br />
|<br />
| Lives in Ghent.<br />
|-<br />
| Feuerbach<br />
| Roman Cheplyaka<br />
| <br />
| +380662285780<br />
| roma@ro-che.info<br />
| unknown yet<br />
| unknown yet<br />
|-<br />
| solidsnack<br />
| Jason Dusek<br />
| Heroku<br />
| +1 415 894 2162<br />
| jason.dusek@gmail.com<br />
| 5th-7th<br />
| Probably Monasterium.<br />
|-<br />
| kosmikus<br />
| Andres L&ouml;h<br />
| Well-Typed LLP<br />
|<br />
| mail@andres-loeh.de<br />
| 5th-7th<br />
|<br />
|-<br />
| Igloo<br />
| Ian Lynagh<br />
| Well-Typed LLP<br />
|<br />
| igloo@earth.li<br />
| 5th-7th<br />
|<br />
|-<br />
| dcoutts<br />
| Duncan Coutts<br />
| Well-Typed LLP<br />
|<br />
| duncan.coutts@googlemail.com<br />
| 5th-7th<br />
|<br />
|-<br />
| ADEpt<br />
| Dmitry Astapov<br />
| Well-Typed LLP<br />
|<br />
| dastapov@gmail.com<br />
| 5th-7th<br />
|<br />
|-<br />
| wvdschel<br />
| Wim Vander Schelden<br />
| Ghent University<br />
| <br />
| belhac@fixnum.org<br />
| <br />
| Lives in Ghent<br />
|-<br />
| sjoerd_visscher<br />
| Sjoerd Visscher<br />
| SDL Xopus<br />
| <br />
| sjoerd@w3future.com<br />
| 5th-7th<br />
| youth hostel<br />
|-<br />
| mietek<br />
| Miëtek Bak<br />
| Erlang Solutions<br />
| <br />
| mietek@gmail.com<br />
| 5th-7th<br />
| Open for suggestions<br />
|-<br />
| <br />
| Steven Keuchel<br />
| Utrecht University<br />
| <br />
| <br />
| 5th-7th<br />
| <br />
|}</div>Adepthttps://wiki.haskell.org/index.php?title=Practice_of_Functional_Programming&diff=30565Practice of Functional Programming2009-10-03T21:01:04Z<p>Adept: </p>
<hr />
<div>'''Practice of Functional Programming''' is a Russian electronic magazine dedicated to promoting functional programming, with theoretical and explanatory articles as well as practical ones (FP success stories). Much of the material is related to Haskell. Visit the [http://fprog.ru/ '''Official web-site'''] of this magazine (in Russian).<br />
<br />
The magazine is officially registered with '''ISSN 2075-8456'''.<br />
<br />
We want to distribute this magazine free of charge, but it is hard to produce it for free. You can therefore make a [http://fprog.ru/donate/ '''donation'''] with WebMoney ('''R218751601599'''), Yandex.Money ('''41001450150424''') or PayPal (donation with an ordinary credit card). The editors and authors will be grateful for any financial support.<br />
<br />
== First issue ==<br />
<br />
[[Image:Pfp2009-01.png|thumb|125px|The cover of the first issue]]<br />
<br />
The first issue was released on July 20, 2009. The magazine can be downloaded as a [http://fprog.ru/2009/issue1/practice-fp-1-print.pdf '''PDF document''']. It consists of the following articles (the links point to HTML versions of the articles):<br />
<br />
* [http://fprog.ru/2009/issue1/serguey-zefirov-lazy-to-fear/ ''Lazy to Fear''] by Serguey A. Zefirov<br />
* [http://fprog.ru/2009/issue1/roman-dushkin-functional-approach/ ''Functions and Functional Approach''] by Roman V. Dushkin<br />
* [http://fprog.ru/2009/issue1/eugene-kirpichev-fighting-mutable-state/ ''The Perils of Mutable State and Methods for Fighting Them''] by Eugene R. Kirpichov<br />
* [http://fprog.ru/2009/issue1/dmitry-astapov-checkers/ ''I haven't Taken the Checkers for a Long Time''] (there is a play on words in Russian) by Dmitry E. Astapov<br />
* [http://fprog.ru/2009/issue1/dan-piponi-haskell-monoids-and-their-uses/ ''Haskell Monoids and their Uses''] by Dan Piponi (translated into Russian by Kirill V. Zaborski)<br />
* [http://fprog.ru/2009/issue1/alex-ott-literature-overview/ ''An Overview of Bibliography on Functional Programming''] by Alexey Y. Ott<br />
<br />
The editor of the first issue is Lev Walkin.<br />
<br />
== Second issue ==<br />
<br />
[[Image:Pfp2009-02.png|thumb|125px|The cover of the second issue]]<br />
<br />
The second issue was released on September 28, 2009. It was devoted mainly to FP success stories. The magazine can be downloaded as a [http://fprog.ru/2009/issue2/practice-fp-2-print.pdf '''PDF document''']. It consists of the following articles:<br />
<br />
* ''The History of One Compiler Development'' by Dmitry Zuikov<br />
* ''Use of Haskell for Maintenance of a Mission-critical Informational System'' by Dmitry E. Astapov<br />
* ''Prototyping with the Aid of Functional Languages'' by Serguey A. Zefirov and Vladislav Balin<br />
* ''Use of Scheme in Development of the Dozor-Jet Product Set'' by Alexey Y. Ott<br />
* ''How to Steal a Billion'' by Alexander Samoilovich<br />
* ''Algebraic Data Types and their Uses in Programming'' by Roman V. Dushkin<br />
<br />
The editor of the second issue is Lev Walkin.<br />
<br />
== Third issue ==<br />
<br />
The third issue is in the works and is planned for release on November 20, 2009.<br />
<br />
The editors of the third issue are Lev Walkin and Dmitry E. Astapov.</div>Adepthttps://wiki.haskell.org/index.php?title=User:Adept&diff=27412User:Adept2009-04-10T18:30:06Z<p>Adept: </p>
<hr />
<div>== Personal trivia ==<br />
I am known as adept (or ADEpt) at #haskell<br />
<br />
You can reach me via dastapov-at-gmail-dot-com, UIN 18-22-53-38 or JID adept-at-jabber-dot-kiev-dot-ua<br />
<br />
== Texts and articles ==<br />
[[QuickCheck as Test Set Generator]]<br />
<br />
[[Hitchhikers Guide to the Haskell]]<br />
<br />
Sources for both of them are available in my [http://adept.linux.kiev.ua:8080/repos darcs repo]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=26776Hitchhikers guide to Haskell2009-02-28T11:05:30Z<p>Adept: Spelling fixes from Jens Kubieziel</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
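<br />
To make the point concrete, here is a small sketch of what "sensitive to indentation" means in practice. The function name "hypotenuse" is made up for this illustration and is not part of the tutorial's code. The two bindings after "let" form one block only because they start in the same column; if the second one is shifted a space to the left or to the right, the compiler will most likely reject the definition:<br />

```haskell
-- A sketch of the layout rule: "a2" and "b2" start in the same column,
-- which is what tells the compiler that they belong to the same "let" block.
hypotenuse :: Double -> Double -> Double
hypotenuse a b =
  let a2 = a * a
      b2 = b * b
  in  sqrt (a2 + b2)
```

The same alignment rule applies to the statements of a "do" block, which is where it tends to bite first when code is pasted from a web page.<br />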
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua:8080/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua:8080/repos/hhgtth" by issuing<br />
-- the command "darcs get http://adept.linux.kiev.ua:8080/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create the subdirectory "_darcs" to store<br />
all the version-control-related stuff.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
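<br />
One way to read that pseudocode is as three ordinary functions glued together, with the types showing what flows from one step to the next. The sketch below only illustrates that shape and is not the code this chapter goes on to develop: the names "parseSizes", "pack" and "render", the input format (one "size name" pair per line) and the naive first-fit packing rule are all assumptions made for the illustration.<br />

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- | A directory: its size in bytes and its name (hypothetical representation).
type Dir = (Integer, String)

-- | Step 1: turn text such as "1024 photos" (one pair per line) into Dirs.
-- Lines that do not start with a whole number are silently skipped.
parseSizes :: String -> [Dir]
parseSizes input =
  [ (size, unwords name)
  | (sizeText : name) <- map words (lines input)
  , [(size, "")] <- [reads sizeText]
  ]

-- | Step 2 (a deliberately naive placeholder): take directories from largest
-- to smallest and put each one on the first disc that still has room,
-- opening a new disc when none does. A directory larger than the capacity
-- simply ends up alone on a disc of its own.
pack :: Integer -> [Dir] -> [[Dir]]
pack capacity = foldl place [] . sortBy (flip (comparing fst))
  where
    place [] dir = [[dir]]
    place (disc : rest) dir
      | used disc + fst dir <= capacity = (disc ++ [dir]) : rest
      | otherwise                       = disc : place rest dir
    used = sum . map fst

-- | Step 3: one line of text per disc, listing the directory names.
render :: [[Dir]] -> String
render discs =
  unlines
    [ "disc " ++ show n ++ ": " ++ unwords (map snd disc)
    | (n, disc) <- zip [1 :: Int ..] discs
    ]
```

Under these assumptions the whole program would be the composition "render . pack capacity . parseSizes" applied to the text read from the input. First-fit is only a stand-in here: it is easy to read, but it does not promise the tightest possible packing.<br />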
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what the function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that has the type<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words, to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
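A minimal sketch of that idea, assuming nothing beyond the Prelude: the list "greetings" (a name made up for this illustration) holds three IO actions, and the standard function "sequence_" is one way to execute the elements of such a list in order.<br />
<br />
```haskell
module Main where

-- A plain list whose elements are IO actions.
-- Building the list performs no I/O at all.
greetings :: [IO ()]
greetings = [putStrLn "one", putStrLn "two", putStrLn "three"]

main :: IO ()
main = do
  -- Reversing the list is ordinary list manipulation.
  let reversed = reverse greetings
  -- sequence_ executes the actions in list order and discards their results.
  sequence_ reversed
```
<br />
Nothing is printed while the list is built or reversed; output appears only when "main" executes the actions, in the order "three", "two", "one".<br />
<br />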
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate, topmost IO action of any Haskell program is bound.<br />
<br />
When will "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" are essentially the "Command<br />
pattern" and the "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly mark the end of each and every expression and use<br />
arbitrary layout, as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
 do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
 let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
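<br />
The rule that a "do" block takes the type of its last action can be sketched as follows. The name "measure" is made up for this illustration, and the sketch assumes the standard function "return", which wraps a plain value into an action (it shows up again in Chapter 2).<br />
<br />
```haskell
module Main where

-- The last action is "return (length s)", which has type IO Int,
-- so the whole "do" block has type IO Int.
measure :: String -> IO Int
measure s = do
  putStrLn ("Measuring " ++ show s)
  return (length s)

-- The last action is a "putStrLn ...", which has type IO (),
-- so the whole "do" block (and therefore "main") has type IO ().
main :: IO ()
main = do
  n <- measure "hello"
  putStrLn ("Length: " ++ show n)
```
<br />
Swapping the two lines inside "measure" would be a type error, because the block would then end with an action of type "IO ()" while the signature promises "IO Int".<br />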
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
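<br />
To see "do" at work outside of IO, here is a small sketch in the standard "Maybe" monad. The helpers "safeDiv" and "calc" are hypothetical names invented for this illustration; only "Maybe", "Just", "Nothing" and "return" come from the Prelude.<br />
<br />
```haskell
module Main where

-- Hypothetical helper: division that fails on a zero divisor.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- The same "do" notation, but in the Maybe monad: no I/O happens here.
-- If any step yields Nothing, the whole block yields Nothing.
calc :: Int -> Int -> Int -> Maybe Int
calc a b c = do
  q <- safeDiv a b
  r <- safeDiv q c
  return (q + r)

main :: IO ()
main = do
  print (calc 100 5 2)   -- prints: Just 30
  print (calc 100 0 2)   -- prints: Nothing
```
<br />
The "plumbing" hidden here is the check for failure after every step, much as the IO and parsing libraries hide their own bookkeeping.<br />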
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic action" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the scenes" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for the latter. Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
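<br />
Partial application is not specific to parsers. The following sketch shows it with an ordinary function; "between" and "isDigitValue" are hypothetical names used only for this illustration.<br />
<br />
```haskell
module Main where

-- Hypothetical three-argument function: is x within [lo, hi]?
between :: Int -> Int -> Int -> Bool
between lo hi x = lo <= x && x <= hi

-- Supplying only two of the three arguments yields a new function
-- of type Int -> Bool, which still waits for the last argument.
isDigitValue :: Int -> Bool
isDigitValue = between 0 9

main :: IO ()
main = do
  print (isDigitValue 7)                       -- prints: True
  print (filter isDigitValue [5, 12, 0, -3])   -- prints: [5,0]
```
<br />
In the same way, "manyTill anyChar" is a function that still waits for the "end" parser.<br />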
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function, which,<br />
given a parser, a name of the source file or channel (e.g. "stdin"), and<br />
source data (String, which is a list of "Char"s, which is written "[Char]"),<br />
will either produce a parse error, or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which could<br />
hold a value of one of two distinct types. However, unlike the C/C++ "union",<br />
when presented with a value of type "Either Int Char" we could immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
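<br />
A short sketch of "looking at the constructor", using pattern matching on the standard "Either" type. The functions "checkSize" and "describe" are hypothetical names made up for this illustration.<br />
<br />
```haskell
module Main where

-- Hypothetical validator: Left carries an error message,
-- Right carries a valid size.
checkSize :: Int -> Either String Int
checkSize n
  | n < 0     = Left ("negative size: " ++ show n)
  | otherwise = Right n

-- Pattern matching on the constructor tells us which alternative we hold.
describe :: Either String Int -> String
describe (Left msg) = "error: " ++ msg
describe (Right n)  = "ok: " ++ show n

main :: IO ()
main = mapM_ (putStrLn . describe . checkSize) [42, -1]
```
<br />
By convention "Left" is used for the failure case and "Right" for success, which is also how "parse" reports its outcome.<br />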
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes relative to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it], if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
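As a quick illustration of what the record syntax buys us, here is a self-contained sketch that repeats the "Dir" definition from above. The sample values are invented; the field-name construction and record update forms are standard Haskell record syntax that the text has not used so far.<br />
<br />
```haskell
module Main where

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show

main :: IO ()
main = do
  let d = Dir 65572 "dir1"                           -- positional construction still works
      e = Dir {dir_name = "dir2", dir_size = 68268}  -- fields may also be given by name
  print (dir_size d)          -- prints: 65572
  putStrLn (dir_name e)       -- prints: dir2
  print (d {dir_size = 0})    -- record update: a copy of "d" with one field changed
```
<br />
Note that the record update does not modify "d"; it builds a new "Dir" value.<br />
<br />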
The greedy algorithm sorts the directories by size (the code below puts the<br />
smallest first) and tries to put them on the CD one by one, until there is no<br />
room for more. We will need to track which directories we added to the CD, so<br />
let's add another datatype, and code this simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parametrized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs".<br />
----<br />
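<br />
To make the "case-by-case recursion versus foldl" point concrete, here is a small sketch that sums a list of sizes both ways. The names "totalRec" and "totalFold" are hypothetical; "foldl" and "scanl" are the Prelude functions mentioned above.<br />
<br />
```haskell
module Main where

-- Explicit, case-by-case recursion over the list.
totalRec :: [Int] -> Int
totalRec []     = 0
totalRec (x:xs) = x + totalRec xs

-- The same computation with foldl: start from 0 and
-- add the elements one by one, from left to right.
totalFold :: [Int] -> Int
totalFold = foldl (+) 0

main :: IO ()
main = do
  print (totalRec  [65572, 68268, 53372])   -- prints: 187212
  print (totalFold [65572, 68268, 53372])   -- prints: 187212
  -- scanl is like foldl, but keeps every intermediate result.
  print (scanl (+) 0 [1, 2, 3])             -- prints: [0,1,3,6]
```
<br />
"greedy_pack" follows the same shape: the accumulator is a "DirPack" instead of a number, and "maybe_add_dir" plays the role of "(+)".<br />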
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a couple of lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take n $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
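<br />
Here is a minimal sketch of that arrangement with a typeclass of our own. Everything in it is hypothetical and invented for this illustration: the class "Sized", its method "sizeOf", the library function "fitsOn", and a simplified "Dir" without record syntax.<br />
<br />
```haskell
module Main where

-- The "contract": anything that can report its size in bytes.
class Sized a where
  sizeOf :: a -> Int

-- A simplified stand-in for the tutorial's Dir type.
data Dir = Dir Int String

-- Dir fulfills the contract ...
instance Sized Dir where
  sizeOf (Dir size _) = size

-- ... and so does a list of anything that fulfills it.
instance Sized a => Sized [a] where
  sizeOf = sum . map sizeOf

-- "Library" function: works for any instance of Sized,
-- including instances written long after this function.
fitsOn :: Sized a => Int -> a -> Bool
fitsOn capacity x = sizeOf x <= capacity

main :: IO ()
main = do
  print (fitsOn 100 (Dir 42 "a"))                  -- prints: True
  print (fitsOn 100 [Dir 60 "a", Dir 60 "b"])      -- prints: False
```
<br />
The author of "fitsOn" never had to know about "Dir"; the instance declaration is all it takes to plug a new type in.<br />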
<br />
Consider the function "sort" from standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
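<br />
A small sketch of a minimal "Ord" instance follows. The type "Priority" and the helper "rank" are hypothetical names made up for this illustration; the "Eq" instance that "Ord" requires is obtained here with a "deriving" clause, just as we did for "Show" earlier.<br />
<br />
```haskell
module Main where

import Data.List (sort)

-- Hypothetical type used only for this example.
data Priority = Low | Medium | High deriving (Show, Eq)

-- Helper: map each priority to a number we can compare.
rank :: Priority -> Int
rank Low    = 0
rank Medium = 1
rank High   = 2

-- Only "compare" is defined; (<), (>=), max, min and the rest
-- come from the default implementations in the class.
instance Ord Priority where
  compare a b = compare (rank a) (rank b)

main :: IO ()
main = do
  print (Low < High)                -- prints: True
  print (max Medium High)           -- prints: High
  print (sort [High, Low, Medium])  -- prints: [Low,Medium,High]
```
<br />
"sort" works on "[Priority]" without any further effort, because its only requirement is the "Ord" constraint.<br />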
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
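<br />
As a small illustration (a sketch, not taken from the cd-fit sources), here is "Dir" with three derived instances and a sort that relies on them. The sample directories are invented:<br />
<br />
```haskell
import Data.List (sort)

-- Same fields as "Dir" above, but deriving three classes at once.
data Dir = Dir {dir_size :: Int, dir_name :: String}
  deriving (Eq, Ord, Show)

-- The derived Ord compares field by field, in declaration order:
-- first by dir_size, then by dir_name.
sortedDirs :: [Dir]
sortedDirs = sort [Dir 30 "c", Dir 10 "b", Dir 10 "a"]
```
<br />
Note that the derived ordering is purely structural. If you want directories sorted by name first, you have to write the "Ord" instance by hand.<br />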
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler....<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
  -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take n $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
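<br />
If you want to play with "liftM2" without pulling in QuickCheck, here is a small sketch that uses "Maybe" instead of "Gen" (this assumes you accept for now that "Maybe" is a monad too; we get to that later). The names "myLiftM2" and "Pair" are invented for the example:<br />
<br />
```haskell
import Control.Monad (liftM2)

-- The definition quoted above, written out under a different
-- (hypothetical) name so that it does not clash with the library one.
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r
myLiftM2 f a1 a2 = do
  x <- a1
  y <- a2
  return (f x y)

-- A two-argument data constructor, like "Dir".
data Pair = Pair Int String
  deriving (Eq, Show)

-- Lifting the constructor into the Maybe monad.
fromLibrary, fromSketch, missing :: Maybe Pair
fromLibrary = liftM2   Pair (Just 1) (Just "one")
fromSketch  = myLiftM2 Pair (Just 1) (Just "one")
missing     = liftM2   Pair Nothing  (Just "one")
```
<br />
Both versions build the same value, and a missing argument makes the whole result go missing.<br />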
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
        -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
        --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
        --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
        --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
        --    are not yet on our disk:<br />
        case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                               , dir_size d > 0<br />
                                               , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                               , d `notElem` ds<br />
             ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
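<br />
The trick of using a lazy, infinite list as a lookup table for already-solved subproblems can be seen more clearly on a stripped-down version of the same idea. The sketch below mirrors the structure of the code above, but works on plain "Int" sizes instead of "Dir" values; the name "bestSums" is invented for this example:<br />
<br />
```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- bestSums sizes !! n is a selection of sizes whose sum is as close
-- to n as this scheme can get without exceeding it.
bestSums :: [Int] -> [[Int]]
bestSums sizes = table
  where
    -- The table refers to 'best', and 'best' refers back to the table.
    -- Laziness makes this work: an entry is computed on first access.
    table = map best [0 ..]

    best 0 = []
    best limit =
      case [ s : rest
           | s <- sizes, s >= 1, s <= limit
           , let rest = table !! (limit - s)
           , s `notElem` rest
           ] of
        []    -> []
        packs -> maximumBy (comparing sum) packs
```
<br />
For example, "bestSums [5,4,3] !! 10" picks the sizes 4 and 5. Note that "(!!)" walks the list from the start on every lookup, which is one reason why this scheme gets slow for big limits.<br />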
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's do a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will silently overflow (wrap around).<br />
Standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate, will spot at once that in this case<br />
the 'Int' is so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
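<br />
The difference can also be checked in code rather than by eye. This is a small sketch that relies on GHC's behaviour of wrapping around on 'Int' overflow; the names are invented for the example and the checks do not depend on whether your 'Int' is 32 or 64 bits wide:<br />
<br />
```haskell
-- Adding one to the largest Int silently wraps around to the smallest.
wrapped :: Bool
wrapped = (maxBound :: Int) + 1 == minBound

-- The same step done with Integer just gives a bigger number.
grows :: Bool
grows = toInteger (maxBound :: Int) + 1 > toInteger (maxBound :: Int)

-- 2^50 always fits in an Integer, whatever the platform.
big :: Integer
big = 2 ^ 50
```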
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to explicitly ask for that.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are organized. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass<br />
<hask>Num</hask>, thus we could use <hask>fromInteger</hask> to convert an <hask>Integer</hask> to an <hask>Int</hask> and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the<br />
expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
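<br />
To see the problem in isolation, here is a small sketch (not part of the cd-fit sources). It shows what an out-of-range <hask>fromInteger</hask> does under GHC, and it also shows <hask>genericIndex</hask> from Data.List, which is one possible alternative to "(!!)" that accepts an 'Integer' index directly. The names "narrowed" and "third" are invented for the example:<br />
<br />
```haskell
import Data.List (genericIndex)

-- fromInteger does not check the range: a value that does not fit
-- into Int is silently reduced modulo the word size (GHC behaviour).
narrowed :: Int
narrowed = fromInteger (2 ^ 64 + 5)

-- genericIndex accepts any Integral index, so no conversion is needed.
third :: Char
third = "abcdef" `genericIndex` (2 :: Integer)
```
<br />
Keep in mind that <hask>genericIndex</hask> only removes the conversion. It still walks the list element by element, so it does nothing for run time or memory.<br />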
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since the computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (newer GHC releases spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
  -- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
  -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
  -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
  -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
  -- are not yet on our disk:<br />
  case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                         , dir_size d > 0<br />
                                         , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                         , d `notElem` ds<br />
       ] of<br />
    -- We either fail to add any dirs (probably, because all of them are too big).<br />
    -- Well, just report that disk must be left empty:<br />
    [] -> DirPack 0 []<br />
    -- Or we produce some alternative packings. Let's choose the best of them all:<br />
    packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: total allocation is reduced by a factor<br />
of about 640! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word on it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
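<br />
Since the listing above is only a fragment, here is a complete, stand-alone sketch of the same idea. To keep it self-contained it assumes plain 'Integer' sizes in place of "Dir" values, and the name "bestSum" is invented for the example. The important part is the <hask>delete</hask> call, which removes the item we have just "packed" before recursing:<br />
<br />
```haskell
import Data.List (delete, maximumBy)
import Data.Ord (comparing)

-- Plain Integer sizes stand in for "Dir" values (an assumption made
-- to keep the sketch self-contained).
bestSum :: Integer -> [Integer] -> [Integer]
bestSum 0 _  = []
bestSum _ [] = []
bestSum limit sizes =
  case [ s : bestSum (limit - s) (delete s fits)
       | let fits = filter (\x -> x >= 1 && x <= limit) sizes
       , s <- fits
       ] of
    -- Nothing fits: the "disk" stays empty.
    []    -> []
    -- Otherwise keep the candidate with the biggest total.
    packs -> maximumBy (comparing sum) packs
```
<br />
Every recursive call now works on a strictly shorter list, so the "d `notElem` ds" check from the earlier version is no longer needed.<br />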
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (See Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisited the monads once again. I will let the other sources teach you the<br />
theory behind the monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of the real world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name from the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of the attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr', but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing <br />
    -- If the given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -><br />
        let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
            -- .. then, we will exclude lookup failures<br />
            wo_failures = filter (/=Nothing) vals<br />
            -- ... find a way to remove annoying 'Just' wrapper<br />
            stripJust (Just v) = v<br />
            -- ... use it to extract all lookup results as strings<br />
            strings = map stripJust wo_failures<br />
        in<br />
            -- ... finally, combine them into single String. <br />
            -- If all lookups failed, we should pass failure to caller.<br />
            case null strings of<br />
              True  -> Nothing<br />
              False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used to construct computations from other<br />
computations. That is just what we need: we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
import Data.List (intersperse)<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since all failures are already handled by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into single String,<br />
                     -- ... and return failure if it is empty.<br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code occupies a mere 7 lines instead of 13,<br />
an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we no longer check whether some computation<br />
returns <hask>Nothing</hask>. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens whenever we use '<-' or 'return', or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is a monad in which we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations are skipped and the whole "do" block<br />
returns <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
<br />
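If you want to see that there is no real magic here, below is a minimal sketch of the check that happens between two actions in the <hask>Maybe</hask> monad. The name <hask>bindMaybe</hask> is made up for this illustration; as far as <hask>Maybe</hask> is concerned, it does the same job as the library operator ">>=".<br />
<br />
<haskell><br />
-- 'bindMaybe' is a hypothetical name; for Maybe it behaves like (>>=)<br />
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
bindMaybe Nothing  _ = Nothing   -- failure: skip the rest of the block<br />
bindMaybe (Just v) f = f v       -- success: unwrap the value and carry on<br />
<br />
-- The 'foo' from the session above, written without "do"<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = x `bindMaybe` \v -> Just (v + 1)<br />
</haskell><br />
<br />
With these definitions, <hask>foo (Just 5)</hask> gives <hask>Just 6</hask> and <hask>foo Nothing</hask> gives <hask>Nothing</hask>, just like in the session above.<br />
<br />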
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such a<br />
use case must be quite common.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
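In case the type signature still looks scary, here is a sketch of how such a function could be written by hand. The name <hask>either'</hask> is hypothetical (chosen so that it does not clash with the Prelude), and the library's own source may well look different:<br />
<br />
<haskell><br />
-- 'either'' is a made-up name for a hand-written version of 'either'<br />
either' :: (a -> c) -> (b -> c) -> Either a b -> c<br />
either' f _ (Left x)  = f x   -- a 'Left' value goes to the first function<br />
either' _ g (Right y) = g y   -- a 'Right' value goes to the second one<br />
</haskell><br />
<br />
In ghci, <hask>either' (+1) length (Left 5)</hask> and <hask>either' (+1) length (Right "foo")</hask> should give the same answers as the session above.<br />
<br />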
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with an invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. Similar behavior can be<br />
observed for the IO monad, where execution stops at the first failing action. Note that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
<br />
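To see that nothing monad-specific is hidden inside, here is a sketch of how <hask>sequence</hask> could be defined using nothing but "do" notation. The name <hask>sequence'</hask> is ours, and the definition in the standard library is more general than this one:<br />
<br />
<haskell><br />
-- 'sequence'' is a made-up name for a list-only version of 'sequence'<br />
sequence' :: Monad m => [m a] -> m [a]<br />
sequence' []            = return []<br />
sequence' (act : rest) = do v  <- act             -- run the first action<br />
                            vs <- sequence' rest  -- then all the others<br />
                            return (v : vs)       -- and collect the results<br />
</haskell><br />
<br />
With this definition, <hask>sequence' [Just 'a', Just 'b']</hask> gives <hask>Just "ab"</hask>, while a single <hask>Nothing</hask> in the list makes the whole result <hask>Nothing</hask>. All the "stopping" is done by the monad's own rules for moving from one action to the next.<br />
<br />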
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
<br />
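For the <hask>Maybe</hask> monad, the behaviour of <hask>guard</hask> can be sketched as follows. The name <hask>guardMaybe</hask> is made up for this illustration; the real <hask>guard</hask> from Control.Monad is more general and is not limited to <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- 'guardMaybe' is a hypothetical, Maybe-only stand-in for 'guard'<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()   -- condition holds: carry on with the block<br />
guardMaybe False = Nothing   -- condition fails: the whole block is Nothing<br />
<br />
-- The 'foo' from the session above, using our stand-in<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do v <- x<br />
           guardMaybe (v /= 5)<br />
           return (v + 1)<br />
</haskell><br />
<br />
Mapping this <hask>foo</hask> over <hask>[Just 4, Just 5, Just 6]</hask> should reproduce the result shown above.<br />
<br />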
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
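If you want to convince yourself, here is a small pair of definitions that you can actually load into ghci. The names <hask>sugared</hask> and <hask>desugared</hask> are made up for this illustration:<br />
<br />
<haskell><br />
-- Written with "do" notation ...<br />
sugared :: Monad m => m Int -> m Int -> m Int<br />
sugared ma mb = do a <- ma<br />
                   b <- mb<br />
                   return (a + b)<br />
<br />
-- ... and the same thing spelled out with ">>="<br />
desugared :: Monad m => m Int -> m Int -> m Int<br />
desugared ma mb = ma >>= \a -><br />
                  mb >>= \b -><br />
                  return (a + b)<br />
</haskell><br />
<br />
Both should give <hask>Just 3</hask> for <hask>(Just 1) (Just 2)</hask> and <hask>Nothing</hask> as soon as one of the arguments is <hask>Nothing</hask>.<br />
<br />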
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic, Jens Kubieziel.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=24869Hitchhikers guide to Haskell2008-12-16T08:13:12Z<p>Adept: More thanks and attributions</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible number of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what the function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that returns<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any).<br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
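Such a list could be sketched as follows. The names "actions" and "runAll" are made up for this illustration and are not part of the tutorial's source files:<br />
<br />
<haskell><br />
-- A list of IO actions: building the list does not execute anything<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
-- Execute the actions one by one, in the order of the list<br />
runAll :: [IO ()] -> IO ()<br />
runAll []           = return ()<br />
runAll (a : rest) = do a<br />
                       runAll rest<br />
</haskell><br />
<br />
Nothing is printed until "runAll actions" itself is executed, for example by typing it at the ghci prompt.<br />
<br />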
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that tab stops are 8 characters<br />
apart.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout, as in this example:<br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after =<br />
 do { before;<br />
  putStrLn "In the middle";<br />
        after; };<br />
<br />
main =<br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
              let {d = combine (b) (combine c c)};<br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
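As a small sketch of this rule, here are two "do" blocks whose types differ only because their last actions differ. The names "prompt" and "echo" are made up for this illustration:<br />
<br />
<haskell><br />
-- The last action is 'getLine', so the whole block has type IO String<br />
prompt :: IO String<br />
prompt = do putStrLn "Say something:"<br />
            getLine<br />
<br />
-- The last action is 'putStrLn ...', so the whole block has type IO ()<br />
echo :: IO ()<br />
echo = do line <- prompt<br />
          putStrLn ("You said: " ++ line)<br />
</haskell><br />
<br />
You can check both types with ":type" in ghci, the same way we did for "main".<br />
<br />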
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]", which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput =<br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize =<br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know.<br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic action" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling job with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
, which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
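As a small sketch, here is the constructor used in both directions: to build a value and, via pattern matching, to take it apart again. The names "example", "dirSize" and "dirName" are made up for this illustration and are not part of "cd-fit.hs":<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
<br />
-- Build a value with the data constructor ...<br />
example :: Dir<br />
example = Dir 65572 "dir1"<br />
<br />
-- ... and take it apart by pattern matching on the same constructor<br />
dirSize :: Dir -> Int<br />
dirSize (Dir size _) = size<br />
<br />
dirName :: Dir -> String<br />
dirName (Dir _ name) = name<br />
</haskell><br />
<br />
Thanks to "deriving Show", typing "example" at the ghci prompt prints the value back in the same form in which we wrote it.<br />
<br />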
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
<br />
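Partial application and the backtick syntax are not specific to parsers. Here is a minimal sketch with ordinary numbers; the names "add", "increment" and "seven" are made up for this illustration:<br />
<br />
<haskell><br />
add :: Int -> Int -> Int<br />
add x y = x + y<br />
<br />
-- Supplying only the first argument yields a new function Int -> Int<br />
increment :: Int -> Int<br />
increment = add 1<br />
<br />
-- Backticks let us write a two-argument function between its arguments<br />
seven :: Int<br />
seven = 3 `add` 4<br />
</haskell><br />
<br />
Compare the types of "add", "add 1" and "add 1 2" in ghci, just as the exercise suggests for "manyTill".<br />
<br />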
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of a source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
<br />
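Here is a short sketch of what "looking at the constructor" means in code; the function name "describe" is made up for this illustration:<br />
<br />
<haskell><br />
-- Pattern matching on the constructor tells us which type we got<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
</haskell><br />
<br />
For example, "describe (Left 5)" takes the first branch, and "describe (Right 'x')" takes the second one.<br />
<br />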
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
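To see this evaluation in isolation, here is a minimal sketch that is separate from "cd-fit.hs". It assumes that the Parsec library is installed; the names "number" and "parseNumber" and the source name "example" are made up for this illustration:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- 'number' is a monadic value in the parsing monad ...<br />
number :: Parser Int<br />
number = do digits <- many1 digit<br />
            return (read digits)<br />
<br />
-- ... and 'parse' evaluates it, giving back a plain 'Either'<br />
parseNumber :: String -> Either ParseError Int<br />
parseNumber input = parse number "example" input<br />
</haskell><br />
<br />
Here "parseNumber "42"" should produce a "Right" with the number inside, while input that does not start with a digit should produce a "Left" carrying the parse error.<br />
<br />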
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
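<br />
A quick sketch of what the record syntax buys us (the sample values are invented for the illustration):<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do<br />
  let d = Dir 1706 "cd-fit.hs"                          -- positional construction still works<br />
      e = Dir {dir_name = "_darcs", dir_size = 61609}   -- fields may be named, in any order<br />
  print (dir_size d)         -- prints: 1706<br />
  putStrLn (dir_name e)      -- prints: _darcs<br />
  print (d {dir_size = 0})   -- record update: a copy of d with one field changed<br />
</haskell><br />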
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Arguments are swapped on purpose, so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
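<br />
To give you a head start on the points above, here is a small sketch that puts explicit recursion next to "foldl" and shows the "...By" family in action. All names and sample data are invented for the illustration, and "comparing" from Data.Ord is just a shortcut for writing the comparison function by hand:<br />
<br />
<haskell><br />
import Data.List (sortBy, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Explicit case-by-case recursion ...<br />
sumRec :: [Int] -> Int<br />
sumRec []     = 0<br />
sumRec (x:xs) = x + sumRec xs<br />
<br />
-- ... and the same traversal expressed with 'foldl'<br />
sumFold :: [Int] -> Int<br />
sumFold = foldl (+) 0<br />
<br />
main :: IO ()<br />
main = do<br />
  let pairs = [("a", 3), ("b", 1), ("c", 2)] :: [(String, Int)]<br />
  print (sumRec [1,2,3], sumFold [1,2,3])   -- prints: (6,6)<br />
  print (sortBy (comparing snd) pairs)      -- prints: [("b",1),("c",2),("a",3)]<br />
  print (maximumBy (comparing snd) pairs)   -- prints: ("a",3)<br />
</haskell><br />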
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': packing the directories returned by "greedy_pack" once more should produce a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take n $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
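<br />
The same workflow applies to any property you can write as a function returning "Bool". Here is a tiny stand-alone sketch on a type QuickCheck already knows how to generate; the property name is invented, and the sketch assumes only that the QuickCheck package is installed:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
<br />
-- Reversing a list twice must give back the original list<br />
prop_reverse_twice :: [Int] -> Bool<br />
prop_reverse_twice xs = reverse (reverse xs) == xs<br />
<br />
main :: IO ()<br />
main = quickCheck prop_reverse_twice<br />
</haskell><br />
<br />
No "instance Arbitrary" is needed here, because lists of "Int" are covered by instances that ship with the library.<br />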
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is a Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
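<br />
Before we look at a real typeclass, here is a toy contract of our own. Everything in this sketch ("HasSize", "sizeOf", "File", "Folder", "totalSize") is invented for the illustration:<br />
<br />
<haskell><br />
-- The contract: a type is admitted if it can report its size<br />
class HasSize a where<br />
  sizeOf :: a -> Int<br />
<br />
data File   = File Int        -- a file knows its own size<br />
data Folder = Folder [File]   -- a folder is as big as its files together<br />
<br />
instance HasSize File where<br />
  sizeOf (File n) = n<br />
<br />
instance HasSize Folder where<br />
  sizeOf (Folder fs) = sum (map sizeOf fs)<br />
<br />
-- Works for any type that fulfills the 'HasSize' contract<br />
totalSize :: HasSize a => [a] -> Int<br />
totalSize = sum . map sizeOf<br />
<br />
main :: IO ()<br />
main = do<br />
  print (totalSize [File 10, File 20])                     -- prints: 30<br />
  print (totalSize [Folder [File 1, File 2], Folder []])   -- prints: 3<br />
</haskell><br />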
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements such an<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
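<br />
Here is a sketch of a hand-written instance that defines "compare" only and gets the rest of "Ord" for free. The type "Size" and the helper "rank" are invented for the illustration:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Size = Small | Medium | Large deriving (Eq, Show)<br />
<br />
-- 'rank' is a hypothetical helper that maps each constructor to a number<br />
rank :: Size -> Int<br />
rank Small  = 0<br />
rank Medium = 1<br />
rank Large  = 2<br />
<br />
-- We supply 'compare' only; (<), (>), max, min etc. come from the defaults<br />
instance Ord Size where<br />
  compare a b = compare (rank a) (rank b)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (Small < Large)                 -- prints: True<br />
  print (max Medium Small)              -- prints: Medium<br />
  print (sort [Large, Small, Medium])   -- prints: [Small,Medium,Large]<br />
</haskell><br />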
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
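<br />
A short sketch of what those derived instances give us (the sample values are invented for the illustration):<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq, Ord, Show)<br />
<br />
main :: IO ()<br />
main = do<br />
  let a = Dir 964 "cd-fit.hs~"<br />
      b = Dir 1706 "cd-fit.hs"<br />
  print (a == b)        -- prints: False<br />
  print (a < b)         -- prints: True (fields are compared left to right, so size decides first)<br />
  print (sort [b, a])   -- the smaller directory comes out first<br />
</haskell><br />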
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take n $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
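<br />
If "Gen" feels too exotic, you can watch "liftM2" at work in monads you can print directly. The sketch below uses "Maybe" and lists; "myLiftM2" is simply the definition quoted above under a made-up name:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
-- The quoted definition, spelled out under a hypothetical name<br />
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r<br />
myLiftM2 f a1 a2 = do<br />
  x <- a1<br />
  y <- a2<br />
  return (f x y)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (liftM2 (+) (Just 1) (Just 2))                 -- prints: Just 3<br />
  print (myLiftM2 (+) (Just 1) (Just 2))               -- prints: Just 3<br />
  print (liftM2 (+) (Just 1) (Nothing :: Maybe Int))   -- prints: Nothing<br />
  print (liftM2 (,) [1,2] "ab")                        -- prints: [(1,'a'),(1,'b'),(2,'a'),(2,'b')]<br />
</haskell><br />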
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its run time,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build a complex condition for filtering the list of dirs<br />
<br />
---- <br />
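<br />
As a warm-up for the de-sugaring exercise, here is a sketch on a simpler comprehension than the one in the exercise. Both function names are invented, and the hand-written version is one possible reading of the sugar, not the exact translation given in the Haskell Report:<br />
<br />
<haskell><br />
-- Sugared: squares of the even numbers in a list<br />
evenSquares :: [Int] -> [Int]<br />
evenSquares xs = [ sq | x <- xs, even x, let sq = x * x ]<br />
<br />
-- De-sugared by hand: each element yields either a one-element list or nothing<br />
evenSquares' :: [Int] -> [Int]<br />
evenSquares' xs = concatMap step xs<br />
  where<br />
    step x | even x    = let sq = x * x in [sq]<br />
           | otherwise = []<br />
<br />
main :: IO ()<br />
main = print (evenSquares [1..6], evenSquares' [1..6])   -- prints: ([4,16,36],[4,16,36])<br />
</haskell><br />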
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in overflow: GHC reports no error, and the value silently wraps around.<br />
Standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate, will spot at once that in this case<br />
the 'Int' is so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 Gb.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
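<br />
The ghci session above comes from a 32-bit system. On a 64-bit GHC the bounds of 'Int' are larger, so here is a sketch that behaves the same way on either kind of platform (assuming GHC, where 'Int' arithmetic wraps around):<br />
<br />
<haskell><br />
main :: IO ()<br />
main = do<br />
  let biggest = maxBound :: Int<br />
  print (biggest + 1 == minBound)      -- prints: True on GHC, since Int wraps around silently<br />
  print (toInteger biggest + 1)        -- one past maxBound, computed as an Integer without wrapping<br />
  print (product [1..25 :: Integer])   -- 25!, far beyond the range of a 64-bit Int<br />
</haskell><br />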
<br />
Let's change definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
Seems like the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to explicitly ask for that.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass<br />
<hask>Num</hask>, thus we could use <hask>fromInteger</hask> to convert an <hask>Integer</hask> to an <hask>Int</hask> and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
Type errors went away, but the careful reader will spot at once that when the<br />
expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
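<br />
To see both the conversion and the overflow hazard in isolation, consider the sketch below. It also shows "genericIndex" from Data.List, which accepts any integral index; this is just one possible way around the problem and not necessarily the fix this tutorial adopts later. The list "squares" is invented for the illustration:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
main :: IO ()<br />
main = do<br />
  let squares = map (^2) [0..] :: [Integer]<br />
      i = 12 :: Integer<br />
  print (squares !! fromInteger i)    -- prints: 144 (explicit Integer -> Int conversion)<br />
  print (squares `genericIndex` i)    -- prints: 144 (indexing with an Integer directly)<br />
  print (fromInteger (2^64) :: Int)   -- prints: 0 on GHC, because the conversion wraps around<br />
</haskell><br />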
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computing the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE                   MODULE   %time  %alloc<br />
<br />
precomputeDisksFor            Main      88.1    99.8<br />
dynamic_pack                  Main      11.0     0.0<br />
<br />
                                                        individual      inherited<br />
COST CENTRE                   MODULE   no.   entries   %time  %alloc   %time  %alloc<br />
<br />
MAIN                          MAIN       1         0     0.0     0.0   100.0   100.0<br />
CAF                           Main     174        11     0.9     0.2   100.0   100.0<br />
prop_dynamic_pack_small_disk  Main     181       100     0.0     0.0    99.1    99.8<br />
dynamic_pack                  Main     182       200    11.0     0.0    99.1    99.8<br />
precomputeDisksFor            Main     183       200    88.1    99.8    88.1    99.8<br />
main                          Main     180         1     0.0     0.0     0.0     0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 Mb, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program's run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture, which<br />
is worth a thousand words. View it with "gv", "ghostview" or the full<br />
Adobe Acrobat (not Reader). (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the allocated memory was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which, according to the profile above, is responsible for 88% of the run time and 99.8% of the<br />
memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp". (Wondering why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>.)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element of "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                       , dir_size d > 0<br />
                                       , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                       , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since any element might be needed again<br />
and therefore none of them can be garbage-collected.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow when accessing "precomp" via the (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best-packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit', bigger than 0, the best-packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably because all of them are too big).<br />
          -- Well, just report that the disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
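<br />
To poke at the recursion in ghci without the rest of the program, the definition can be pasted together with its data types.<br />
The following is only a minimal sketch, not the contents of 'cd-fit-4-4.hs': the record declarations for <tt>Dir</tt> and <tt>DirPack</tt> are assumed from the way they are used in this chapter,<br />
the redundant <tt>dir_size d > 0</tt> guard is dropped, and <tt>sampleDirs</tt> and <tt>samplePack</tt> are made-up names for illustration.<br />

```haskell
import Data.Ix (inRange)
import Data.List (maximumBy)

-- Assumed shapes of the records used throughout the tutorial.
data Dir = Dir { dir_size :: Integer, dir_name :: String }
  deriving (Eq, Show)

data DirPack = DirPack { pack_size :: Integer, dirs :: [Dir] }
  deriving Show

-- Same recursion as above: try every directory that fits, put it on top of
-- the best packing of the remaining space, and keep the fullest result.
bestDisk :: Integer -> [Dir] -> DirPack
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
bestDisk limit ds =
  case [ DirPack (dir_size d + s) (d : packed)
       | d <- filter (inRange (1, limit) . dir_size) ds
       , let DirPack s packed = bestDisk (limit - dir_size d) ds
       , d `notElem` packed
       ] of
    []    -> DirPack 0 []
    packs -> maximumBy cmpSize packs

cmpSize :: DirPack -> DirPack -> Ordering
cmpSize a b = compare (pack_size a) (pack_size b)

-- Hypothetical toy input: four directories and a 700-unit medium.
sampleDirs :: [Dir]
sampleDirs = [Dir 300 "photos", Dir 250 "music", Dir 200 "docs", Dir 450 "video"]

samplePack :: DirPack
samplePack = bestDisk 700 sampleDirs
```

Here <tt>pack_size samplePack</tt> evaluates to 700, since the 450 and 250 directories fill the medium exactly.<br />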
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                    MODULE     %time %alloc<br />
<br />
 CAF                            GHC.Float    0.0    4.4<br />
 main                           Main         0.0   93.9<br />
<br />
                                                       individual    inherited<br />
 COST CENTRE                    MODULE  no. entries  %time %alloc  %time %alloc<br />
 MAIN                           MAIN      1       0    0.0    0.0    0.0  100.0<br />
 main                           Main    180       1    0.0   93.9    0.0   94.2<br />
 prop_dynamic_pack_small_disk   Main    181     100    0.0    0.0    0.0    0.3<br />
 dynamic_pack                   Main    182     200    0.0    0.2    0.0    0.3<br />
 bestDisk                       Main    183     200    0.0    0.1    0.0    0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600 (from 721,433,008 to 1,129,520 bytes)! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, this is done rather inefficiently:<br />
each time, we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
                                          , d <- small_enough<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
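<br />
As a starting point for that exercise, here is a rough sketch of the scaling step alone, working on plain sizes.<br />
It is my own guess at one reasonable way to do it, not code from the repository: <tt>scaleSizes</tt> is a made-up name, and rounding directory sizes up while rounding the medium size down is an assumption,<br />
chosen so that any packing which fits in scaled units is guaranteed to fit in real bytes as well.<br />

```haskell
-- Divide the medium size and all directory sizes by the size of the
-- smallest non-empty directory. Directory sizes are rounded up and the
-- medium size is rounded down (an assumption of this sketch), so a packing
-- that fits after scaling also fits before scaling.
scaleSizes :: Integer -> [Integer] -> (Integer, [Integer])
scaleSizes limit sizes =
  case filter (> 0) sizes of
    []       -> (limit, sizes)   -- nothing to scale by
    positive -> let unit      = minimum positive
                    ceilDiv x = (x + unit - 1) `div` unit
                in (limit `div` unit, map ceilDiv sizes)
```

For a 700-unit medium and sizes 300, 250, 200 and 450 this yields <tt>(3,[2,2,1,3])</tt>. The price of the smaller numbers is lost precision: the exact 450+250 packing no longer fits in scaled units (3+2 > 3), so scaling trades packing quality for speed.<br />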
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, a lookup of the attribute "xmlns" in both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
  -- First, we look up 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
       -- Pass the lookup error through.<br />
       Nothing   -> Nothing <br />
       -- If the given name exists, see if it is a value or a reference:<br />
       Just attv -> case attv of<br />
                         -- It's a value. Return it!<br />
                         Left val -> Just val<br />
                         -- It's a list of references :(<br />
                         -- We have to look them up, accounting for<br />
                         -- possible failures.<br />
                         -- First, we will perform lookup of all references ...<br />
                         Right refs -> let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
                                           -- ... then, we will exclude lookup failures<br />
                                           wo_failures = filter (/=Nothing) vals<br />
                                           -- ... find a way to remove the annoying 'Just' wrapper<br />
                                           stripJust (Just v) = v<br />
                                           -- ... use it to extract all lookup results as strings<br />
                                           strings = map stripJust wo_failures<br />
                                       in<br />
                                           -- ... finally, combine them into a single String. <br />
                                           -- If all lookups failed, we should pass failure to caller.<br />
                                           case null strings of<br />
                                             True  -> Nothing<br />
                                             False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for such a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. That is just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we look up 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since all failures are already excluded by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into a single String,<br />
                     -- ... and return failure if it is empty. <br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop the comments, the code occupies a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that the<br />
type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
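<br />
The "check" that happens between actions can also be spelled out by hand. The sketch below writes the same<br />
increment three ways; the names <hask>incDo</hask>, <hask>incBind</hask> and <hask>incCase</hask> are made up for this illustration,<br />
and <hask>incCase</hask> shows the behaviour of the <hask>Maybe</hask> monad rather than the library's literal source code.<br />

```haskell
-- The do-block version, as in the ghci session above.
incDo :: Maybe Int -> Maybe Int
incDo x = do
  v <- x
  return (v + 1)

-- What the do-block is shorthand for: a call to the bind operator (>>=).
incBind :: Maybe Int -> Maybe Int
incBind x = x >>= \v -> return (v + 1)

-- What bind amounts to for Maybe, written as a plain case:
-- Nothing is passed through, Just is unwrapped and wrapped again.
incCase :: Maybe Int -> Maybe Int
incCase x = case x of
  Nothing -> Nothing
  Just v  -> Just (v + 1)
```

All three return <hask>Just 6</hask> for <hask>Just 5</hask> and <hask>Nothing</hask> for <hask>Nothing</hask>.<br />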
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such a<br />
use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
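<br />
If the type signature still looks scary, it may help to see that <hask>either</hask> is nothing more than a packaged <hask>case</hask>.<br />
The following is a sketch of an equivalent definition under the made-up name <hask>eitherSketch</hask> (the library version may be written differently);<br />
<hask>describe</hask> is also a made-up name and mirrors the ghci example above.<br />

```haskell
-- Apply the first function to a Left value, the second to a Right value.
eitherSketch :: (a -> c) -> (b -> c) -> Either a b -> c
eitherSketch f _ (Left x)  = f x
eitherSketch _ g (Right y) = g y

-- The example from the ghci session: increment a number or measure a string.
describe :: Either Int String -> Int
describe = eitherSketch (+ 1) length
```
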
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do <br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where execution stops at the first failing action. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
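<br />
To see why nothing needs to be hardcoded, here is a sketch of how a list-only <hask>sequence</hask> could be written with plain "do" notation.<br />
It is an illustration under the made-up name <hask>sequenceSketch</hask>, not the library's actual definition (which is more general).<br />
The helper <hask>lookupAll</hask> is likewise made up; it shows <hask>flip</hask> doing the same job it does in <hask>lookupAttr'</hask>.<br />

```haskell
-- Run the actions one after another and collect their results.
-- All monad-specific behaviour (stopping at Nothing, performing IO, ...)
-- comes from the monad's own bind, which the do-block uses implicitly.
sequenceSketch :: Monad m => [m a] -> m [a]
sequenceSketch []       = return []
sequenceSketch (m : ms) = do
  x  <- m
  xs <- sequenceSketch ms
  return (x : xs)

-- 'flip' swaps the two arguments of a function, so 'flip lookup table'
-- is a one-argument function that looks a key up in a fixed table.
lookupAll :: [(String, Int)] -> [String] -> Maybe [Int]
lookupAll table keys = sequenceSketch (map (flip lookup table) keys)
```

With the table <hask>[("a",1),("b",2)]</hask>, looking up <hask>["a","b"]</hask> gives <hask>Just [1,2]</hask>, while looking up <hask>["a","z"]</hask> gives <hask>Nothing</hask>.<br />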
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
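<br />
There is no magic inside <hask>guard</hask> either. A sketch of an equivalent definition, under the made-up name <hask>guardSketch</hask>, fits in two lines;<br />
the library version is more general, so treat this as an illustration only. <hask>bumpUnlessFive</hask> is a made-up name for the function from the ghci session above.<br />

```haskell
import Control.Monad (MonadPlus (..))

-- Succeed with a dummy value when the condition holds, otherwise produce
-- the monad's notion of "no result" (Nothing for Maybe, [] for lists).
guardSketch :: MonadPlus m => Bool -> m ()
guardSketch True  = return ()
guardSketch False = mzero

-- Increment the value unless it is 5, in which case the whole block fails.
bumpUnlessFive :: Maybe Int -> Maybe Int
bumpUnlessFive x = do
  v <- x
  guardSketch (v /= 5)
  return (v + 1)
```
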
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go and get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram, Jake Luck, Ketil<br />
Malde, Mike Mimic.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=24868Hitchhikers guide to Haskell2008-12-16T08:05:21Z<p>Adept: Fixed mismatch between text and sources</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in the text editor with proportional fonts.<br />
<br />
Oh, almost forgot: author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contain the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
decide how to fit them on CD-Rs<br />
print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that returns<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
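Such a "list of actions plus a way to run them" takes only a few lines. This is just a sketch with made-up names<br />
(<hask>actions</hask>, <hask>runAll</hask>); the ">>" operator used here simply means "do this, then do that".<br />

```haskell
-- A list of IO actions is an ordinary list: building it prints nothing.
actions :: [IO ()]
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]

-- Execute the actions one by one, in list order.
runAll :: [IO ()] -> IO ()
runAll []            = return ()
runAll (act : rest)  = act >> runAll rest
```

Evaluating <hask>runAll actions</hask> in ghci prints the three words in order, one per line.<br />
<br />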
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c",<br />
either with "<-" (if it returns a result, as "getContents" does) or simply<br />
by writing its name as a line of a "do" block (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of these functions will be mentioned in the code of the function<br />
"main", to which the ultimate, topmost IO action of any Haskell program is bound.<br />
<br />
When will "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend:<br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" are essentially the "Command<br />
pattern" and the "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will work out the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly mark the end of each and every expression and use<br />
an arbitrary layout, as in this example:<br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
 do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
              let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
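<br />
As a small sketch of the "type of the last action" rule (the names "greet" and "shout" are invented for this example), compare a block that ends with "putStrLn" to one that ends with "return":<br />
<br />
<haskell><br />
-- The last action is "putStrLn ...", so the whole block has type IO ().<br />
greet :: IO ()<br />
greet = do putStrLn "Hello!"<br />
           putStrLn "Nice to meet you!"<br />
<br />
-- The last action is "return ...", which here has type IO String,<br />
-- so the whole block has type IO String.<br />
shout :: String -> IO String<br />
shout s = do putStrLn ("Shouting: " ++ s)<br />
             return (s ++ "!")<br />
</haskell><br />
<br />
In ghci, ":type greet" and ":type shout "hey"" should report "IO ()" and "IO String" respectively. The type signatures are optional here; they are written out only to make the point visible.<br />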
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide the commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]", which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC, to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell a function can return another function as its result, and we can thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertising, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know.<br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
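<br />
To see "do" at work with no IO in sight, here is a small sketch that uses "Maybe" from the Prelude, which is one such monad. The functions "safeDiv" and "calc" are invented for this example:<br />
<br />
<haskell><br />
-- Division that fails (returns Nothing) instead of crashing on zero.<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- The same "do" notation, but no IO anywhere: if any step yields<br />
-- Nothing, the whole block yields Nothing.<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do q <- safeDiv a b<br />
                r <- safeDiv q c<br />
                return (q + r)<br />
</haskell><br />
<br />
Here "calc 100 5 2" evaluates to "Just 30", while "calc 1 0 2" evaluates to "Nothing" because the very first division fails.<br />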
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", or "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile, "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit.<br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
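A tiny sketch of both points, using a made-up type "Point" whose data constructor "P" deliberately has a different name, and which derives "Show" so that its values can be printed:<br />
<br />
<haskell><br />
-- The datatype is called "Point"; its only data constructor is "P".<br />
-- "deriving Show" asks the compiler to write the printing code for us.<br />
data Point = P Int Int deriving Show<br />
<br />
-- Values are built with the data constructor, not with the type name.<br />
origin :: Point<br />
origin = P 0 0<br />
<br />
-- "show" is available for Point only because of "deriving Show".<br />
describePoint :: Point -> String<br />
describePoint p = "The point is " ++ show p<br />
</haskell><br />
<br />
With these definitions, "describePoint origin" yields the string "The point is P 0 0". Without the "deriving Show" clause the call to "show" would be rejected by the compiler.<br />
<br />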
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for "manyTill anyChar newline". Note also that when a function is supplied with fewer arguments than it needs, we get not a value but a new function; this is called ''partial application''.<br />
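<br />
Partial application and the backtick syntax are not specific to parsers. A minimal sketch with ordinary Prelude functions (the names "dropTwo", "prefixStyle" and "infixStyle" are invented here):<br />
<br />
<haskell><br />
-- "drop" needs two arguments: how many elements to drop, and a list.<br />
-- Supplying only the first one gives a new function that still<br />
-- expects the list.<br />
dropTwo :: [a] -> [a]<br />
dropTwo = drop 2<br />
<br />
-- Backticks turn any two-argument function into an infix operator,<br />
-- so these two definitions mean exactly the same thing.<br />
prefixStyle :: Int<br />
prefixStyle = div 17 5<br />
<br />
infixStyle :: Int<br />
infixStyle = 17 `div` 5<br />
</haskell><br />
<br />
Here "dropTwo "foobar"" gives "obar", and both "prefixStyle" and "infixStyle" are 3.<br />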
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which could<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we could immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
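<br />
"Looking at the constructor" is done by pattern matching. A short sketch, with the function name "whatIs" and the list "samples" made up for illustration:<br />
<br />
<haskell><br />
-- The constructor acts as the tag: each equation handles one case.<br />
whatIs :: Either Int Char -> String<br />
whatIs (Left n)  = "an Int: " ++ show n<br />
whatIs (Right c) = "a Char: " ++ [c]<br />
<br />
-- Both kinds of value live happily in the same list.<br />
samples :: [Either Int Char]<br />
samples = [Left 5, Right 'x']<br />
</haskell><br />
<br />
Evaluating "map whatIs samples" gives the list ["an Int: 5","a Char: x"].<br />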
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
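A short sketch of what the record syntax buys us. The values "example", "example2" and "renamed" and the function "totalSize" are invented for this illustration:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
-- The constructor can still be used positionally...<br />
example :: Dir<br />
example = Dir 65572 "/home/adept/photos/raw-to-burn/dir1"<br />
<br />
-- ...or with named fields, in any order.<br />
example2 :: Dir<br />
example2 = Dir {dir_name = "dir2", dir_size = 68268}<br />
<br />
-- Record update syntax builds a new value with one field changed;<br />
-- "example" itself is not modified.<br />
renamed :: Dir<br />
renamed = example {dir_name = "backup"}<br />
<br />
-- Field names double as accessor functions.<br />
totalSize :: [Dir] -> Int<br />
totalSize ds = sum (map dir_size ds)<br />
</haskell><br />
<br />
Here "totalSize [example, example2]" is 133840, and "dir_size renamed" is still 65572.<br />
<br />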
The greedy algorithm sorts the directories by size (smallest first in the code below) and tries to put<br />
them on the CD one by one, skipping any directory that no longer fits. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
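<br />
If these ideas are new to you, here is a small warm-up sketch that uses "sortBy", "foldl", "where" and "let" on a plain list of pairs. All function names in it are made up for this illustration:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- "sortBy" takes the comparison as an argument. Here we compare<br />
-- pairs by their second component only.<br />
bySecond :: [(String, Int)] -> [(String, Int)]<br />
bySecond = sortBy cmp<br />
  where cmp (_, a) (_, b) = compare a b<br />
<br />
-- "foldl" walks the list left to right, threading an accumulator:<br />
-- here the accumulator is the running total.<br />
total :: [(String, Int)] -> Int<br />
total xs = foldl step 0 xs<br />
  where step acc (_, n) = acc + n<br />
<br />
-- The same helper written with "let" instead of "where".<br />
total' :: [(String, Int)] -> Int<br />
total' xs =<br />
  let step acc (_, n) = acc + n<br />
  in foldl step 0 xs<br />
</haskell><br />
<br />
For the input [("a",3),("b",1),("c",2)], "bySecond" returns [("b",1),("c",2),("a",3)] and both "total" and "total'" return 6.<br />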
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add these lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to re-pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take n $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
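<br />
A side note for readers with a newer QuickCheck installed: the listing above defines "coarbitrary" as a method of "Arbitrary".<br />
Assuming QuickCheck 2.x, where (as far as I know) "coarbitrary" lives in a separate class called "CoArbitrary", the same test might be sketched as follows.<br />
"vectorOf" is used here in place of the "sequence $ take n $ repeat ..." construction, and type signatures are added for readability; everything else mirrors the code above:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
import Test.QuickCheck<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
media_size :: Int<br />
media_size = 700*1024*1024<br />
<br />
greedy_pack :: [Dir] -> DirPack<br />
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds<br />
  where<br />
    cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)<br />
<br />
maybe_add_dir :: DirPack -> Dir -> DirPack<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
-- Assumption: QuickCheck 2.x, so only "arbitrary" has to be defined.<br />
instance Arbitrary Dir where<br />
  arbitrary = do s <- choose (10,1400)<br />
                 n <- choose (1,300)<br />
                 name <- vectorOf n (elements "fubar/")<br />
                 return (Dir (s*1024*1024) name)<br />
<br />
prop_greedy_pack_is_fixpoint :: [Dir] -> Bool<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
As before, the test is run with "quickCheck prop_greedy_pack_is_fixpoint" from ghci.<br />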
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit.<br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy.<br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions.<br />
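<br />
Before looking at a real library class, here is a toy contract of our own. The class "Sized", the types "File" and "Folder" and the function "fitsOn" are all invented for this sketch:<br />
<br />
<haskell><br />
-- The contract: a type is "Sized" if we can report its size in bytes.<br />
class Sized a where<br />
  sizeOf :: a -> Int<br />
<br />
-- Two unrelated types fulfil the contract, each in its own way.<br />
data File = File String Int<br />
data Folder = Folder String [File]<br />
<br />
instance Sized File where<br />
  sizeOf (File _ bytes) = bytes<br />
<br />
instance Sized Folder where<br />
  sizeOf (Folder _ files) = sum (map sizeOf files)<br />
<br />
-- A generic function: it works for any type that is an instance of<br />
-- "Sized", and relies only on what the contract promises.<br />
fitsOn :: Sized a => Int -> a -> Bool<br />
fitsOn capacity x = sizeOf x <= capacity<br />
</haskell><br />
<br />
The "Sized a =>" part of the signature of "fitsOn" is how the contract is enforced: passing a value of a type without a "Sized" instance is a compile-time error.<br />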
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other?<br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
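<br />
A minimal sketch of such a "minimal definition", using a made-up type "Priority". Only "compare" is written by hand; "Eq", which "Ord" requires, is left to the compiler via a "deriving" clause:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Priority = Low | Medium | High deriving (Eq, Show)<br />
<br />
-- We define "compare" only. The operators (<), (<=), (>), (>=) and<br />
-- the functions "max" and "min" come from the default implementations.<br />
instance Ord Priority where<br />
  compare a b = compare (rank a) (rank b)<br />
    where rank :: Priority -> Int<br />
          rank Low    = 0<br />
          rank Medium = 1<br />
          rank High   = 2<br />
</haskell><br />
<br />
With this instance in place, "High > Low" is True, "max Low Medium" is Medium, and "sort [High, Low, Medium]" gives [Low,Medium,High].<br />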
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] standing for any type that is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
  -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take n $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2".<br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
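<br />
If a third reading does not help, it may be easier to watch "liftM2" at work in a simpler monad. The sketch below uses "Maybe" from the Prelude; the names "maybeDir" and "maybeDir'" are made up, and "Dir" is redefined locally so that the snippet stands on its own:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir Int String deriving (Eq, Show)<br />
<br />
-- "liftM2" in the Maybe monad: both inputs must be present.<br />
maybeDir :: Maybe Int -> Maybe String -> Maybe Dir<br />
maybeDir = liftM2 Dir<br />
<br />
-- The same thing written out with "do", following the definition<br />
-- of "liftM2" quoted above.<br />
maybeDir' :: Maybe Int -> Maybe String -> Maybe Dir<br />
maybeDir' ms mn = do s <- ms<br />
                     n <- mn<br />
                     return (Dir s n)<br />
</haskell><br />
<br />
Both versions turn "Just 1" and "Just "foo"" into "Just (Dir 1 "foo")", and both return "Nothing" as soon as either argument is "Nothing".<br />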
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results good enough? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total<br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs =<br />
      -- By calculating `bestDisk' for all possible disk sizes, we can<br />
      -- obtain a solution for a particular case by a simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..]<br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: the best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, the best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit =<br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk on their own.<br />
         --    Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably because all of them are too big).<br />
                -- Well, just report that the disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- Once we have precomputed disks of all possible sizes for the given set of dirs, the solution to<br />
-- a particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm as to write its implementation. Nice, eh?<br />
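<br />
The core trick of <tt>precomputeDisksFor</tt>, a lazily evaluated list used as a lookup table by the very function that fills it, can be seen in isolation in the following sketch. It computes Fibonacci numbers instead of disk packings; "fibs" and "fib" are names made up for this example:<br />
<br />
<haskell><br />
-- Element n of the list holds `fib n'. Thanks to laziness each element<br />
-- is computed on first access only and then shared by later lookups.<br />
fibs :: [Integer]<br />
fibs = map fib [0..]<br />
  where<br />
    fib :: Int -> Integer<br />
    fib 0 = 0<br />
    fib 1 = 1<br />
    -- The recursive step looks up smaller results in the table<br />
    -- instead of recomputing them.<br />
    fib n = fibs !! (n - 1) + fibs !! (n - 2)<br />
</haskell><br />
<br />
Here <hask>fibs !! 10</hask> is 55. Keep in mind that every <hask>(!!)</hask> walks the list from its head, so a lookup of element n costs n steps, and the whole list stays in memory for as long as "fibs" is reachable.<br />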
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which can lead to nicely readable code, but, when abused, can lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | y <- [0..], y * y == x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs<br />
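<br />
The last point can be tried out on its own. The sketch below repeats the record type from this chapter (with <hask>Integer</hask> sizes) and isolates the filter; "fitting" and "fitting'" are hypothetical helper names, not part of "cd-fit":<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
<br />
data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving (Eq, Show)<br />
<br />
-- `inRange (1,limit) . dir_size' first extracts the size of a directory<br />
-- and then checks that it lies between 1 and `limit' (inclusive).<br />
fitting :: Integer -> [Dir] -> [Dir]<br />
fitting limit = filter (inRange (1, limit) . dir_size)<br />
<br />
-- The same condition spelled out as a list comprehension.<br />
fitting' :: Integer -> [Dir] -> [Dir]<br />
fitting' limit dirs = [ d | d <- dirs, let s = dir_size d, 1 <= s, s <= limit ]<br />
</haskell><br />
<br />
For example, <hask>fitting 10 [Dir 0 "a", Dir 5 "b", Dir 11 "c"]</hask> keeps only <hask>Dir 5 "b"</hask>, and "fitting'" gives the same answer.<br />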
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent fixed-size integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will overflow and silently produce a wrong result.<br />
Standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 Gb. (On a 64-bit platform GHC uses a 64-bit 'Int', so the bounds you see will be much larger, but the limitation remains.)<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
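<br />
If you would rather not depend on the width of 'Int' on your machine, the following sketch shows the same effect in a portable way. It assumes GHC, where 'Int' arithmetic wraps around on overflow; the names "wrapped" and "big" are made up for this example:<br />
<br />
<haskell><br />
-- One past the largest Int: on GHC this wraps around to the smallest Int.<br />
wrapped :: Int<br />
wrapped = maxBound + 1<br />
<br />
-- The same computation done with Integer does not overflow.<br />
big :: Integer<br />
big = toInteger (maxBound :: Int) + 1<br />
</haskell><br />
<br />
Evaluating <hask>wrapped == minBound</hask> in ghci gives <hask>True</hask>, while "big" is a perfectly ordinary positive number.<br />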
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is a member of typeclass<br />
<hask>Num</hask>, thus we can use <hask>fromInteger</hask> to convert an <hask>Integer</hask> to an <hask>Int</hask> and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when<br />
the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
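<br />
To see both the conversion and the overflow hazard in isolation, consider this sketch. It also shows <hask>genericIndex</hask> from Data.List, which accepts any integral index; using it here is a suggestion of this example, not something "cd-fit" does. All names are made up for the illustration, and the value of "truncated" assumes GHC:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
xs :: String<br />
xs = "abcdef"<br />
<br />
-- Explicit conversion of an Integer index to the Int that (!!) wants<br />
viaConversion :: Char<br />
viaConversion = xs !! fromInteger (3 :: Integer)<br />
<br />
-- genericIndex takes the Integer index directly<br />
viaGeneric :: Char<br />
viaGeneric = xs `genericIndex` (3 :: Integer)<br />
<br />
-- fromInteger does not check bounds: on GHC an Integer that does not<br />
-- fit into Int is silently truncated<br />
truncated :: Int<br />
truncated = fromInteger (2 ^ 64)<br />
</haskell><br />
<br />
Both lookups return <hask>'d'</hask>, while "truncated" is 0 instead of 18446744073709551616. Note that <hask>genericIndex</hask> still walks the list element by element, so it fixes the type problem but not the cost of the lookup.<br />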
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC versions spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this:<br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory.<br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv", "ghostview" or the full<br />
Adobe Acrobat (not Reader). (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the allocated memory was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (Wonder why?<br />
Look again at the code and try to find a name matching "pre.*" which is<br />
used inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total<br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best packed disk of size 0 is empty and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, the best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk on their own.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably because all of them are too big).<br />
          -- Well, just report that the disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We have achieved a major improvement: memory consumption is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
                                          , d <- small_enough<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes may still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
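<br />
The pattern used in this amendment, "pick one element and recurse on the rest", relies on <hask>delete</hask> from Data.List, which removes the first occurrence of an element. A minimal sketch of the pattern on plain numbers (the name "picks" is made up for this example):<br />
<br />
<haskell><br />
import Data.List (delete)<br />
<br />
-- Pair every element with the list that remains once it is taken out.<br />
-- This is the shape of the recursion in the amended `bestDisk': the<br />
-- chosen directory is never offered to the recursive call again.<br />
picks :: Eq a => [a] -> [(a, [a])]<br />
picks xs = [ (x, delete x xs) | x <- xs ]<br />
</haskell><br />
<br />
For instance, <hask>picks [1,2,3]</hask> is <hask>[(1,[2,3]),(2,[1,3]),(3,[1,2])]</hask>. Since the candidate list shrinks with every recursive call, the <hask>d `notElem` ds</hask> check of the previous version is no longer needed.<br />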
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
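<br />
One possible shape of such scaling is sketched below. The function name "scaleProblem" and the rounding choices are assumptions of this sketch: directory sizes are rounded up and the medium size is rounded down, so that a packing found for the scaled problem never overfills the real disk (at the price of possibly missing the very best packing). It assumes that all sizes are positive.<br />
<br />
<haskell><br />
-- Express the medium size and all directory sizes in units of the<br />
-- smallest directory.<br />
scaleProblem :: Integer -> [Integer] -> (Integer, [Integer])<br />
scaleProblem limit sizes<br />
  | null sizes = (limit, [])<br />
  | otherwise  = (limit `div` unit, map roundUp sizes)<br />
  where<br />
    unit      = minimum sizes<br />
    -- integer division that rounds up instead of down<br />
    roundUp s = (s + unit - 1) `div` unit<br />
</haskell><br />
<br />
For example, <hask>scaleProblem 700 [100, 250, 301]</hask> gives <hask>(7, [1,3,4])</hask>: the recursion then has to consider disk sizes up to 7 instead of up to 700.<br />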
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a lot of progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not valid XML, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of<br />
an attribute by its name in the given list of attributes. When the<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but already want<br />
-- to compile the code, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs =<br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
       -- Pass the lookup error through.<br />
       Nothing   -> Nothing<br />
       -- If the given name exists, see if it is a value or a reference:<br />
       Just attv -> case attv of<br />
         -- It's a value. Return it!<br />
         Left val -> Just val<br />
         -- It's a list of references :(<br />
         -- We have to look them up, accounting for<br />
         -- possible failures.<br />
         -- First, we will perform lookup of all references ...<br />
         Right refs -><br />
           let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
               -- .. then, we will exclude lookup failures<br />
               wo_failures = filter (/=Nothing) vals<br />
               -- ... find a way to remove the annoying 'Just' wrapper<br />
               stripJust (Just v) = v<br />
               -- ... use it to extract all lookup results as strings<br />
               strings = map stripJust wo_failures<br />
           in<br />
             -- ... finally, combine them into a single String.<br />
             -- If all lookups failed, we should pass the failure to the caller.<br />
             case null strings of<br />
               True  -> Nothing<br />
               False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
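<br />
As an aside, part of the second kind of boilerplate could be trimmed with library helpers alone. The following sketch replaces the "filter, strip, concatenate" steps with <hask>catMaybes</hask> from Data.Maybe and <hask>intercalate</hask> from Data.List; "joinFound" is a name made up for this example, and the explicit checks for failure remain:<br />
<br />
<haskell><br />
import Data.List (intercalate)<br />
import Data.Maybe (catMaybes)<br />
<br />
-- Keep the successful lookups, drop the failures, and join what is<br />
-- left with ":". Fail only if nothing is left at all.<br />
joinFound :: [Maybe String] -> Maybe String<br />
joinFound vals =<br />
  case catMaybes vals of<br />
    []      -> Nothing<br />
    strings -> Just (intercalate ":" strings)<br />
</haskell><br />
<br />
Here <hask>joinFound [Just "jabber", Nothing, Just "client"]</hask> is <hask>Just "jabber:client"</hask>, which matches what the code above does with partially failed lookups.<br />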
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes into account possible<br />
failures.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failed lookup makes the whole block return Nothing ("monad magic"),<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into a single String,<br />
                     -- ... and return failure if it is empty.<br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
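<br />
If you wonder what the "magic" looks like without the <hask>do</hask> sugar, here is a sketch. "incDo", "incBind", "incByHand" and "bindMaybe" are names made up for this example; "bindMaybe" mimics what the library's <hask>(>>=)</hask> does for <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- The function from the ghci session above ...<br />
incDo :: Maybe Int -> Maybe Int<br />
incDo x = do v <- x; return (v + 1)<br />
<br />
-- ... and what the "do" block stands for: "<-" becomes the operator (>>=)<br />
incBind :: Maybe Int -> Maybe Int<br />
incBind x = x >>= \v -> return (v + 1)<br />
<br />
-- A hand-written version of (>>=) for Maybe: this is where the check<br />
-- for Nothing that we no longer write ourselves is hiding.<br />
bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
bindMaybe Nothing  _ = Nothing<br />
bindMaybe (Just v) f = f v<br />
<br />
incByHand :: Maybe Int -> Maybe Int<br />
incByHand x = x `bindMaybe` \v -> Just (v + 1)<br />
</haskell><br />
<br />
All three functions turn <hask>Just 5</hask> into <hask>Just 6</hask> and leave <hask>Nothing</hask> as <hask>Nothing</hask>.<br />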
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such<br />
a use case has to be quite a common one.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
This seems to be exactly our case. Let's replace the<br />
<hask>case</hask> with an invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do <br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
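<br />
If you want to play with this function outside of the full program, here is a self-contained sketch. The type synonyms and the <hask>sample</hask> table are my assumptions for this illustration (values and references are plain strings here); the file in the repository may define them differently:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intersperse)<br />
<br />
-- Assumed types: an attribute holds either a value or references to other attributes.<br />
type Value     = String<br />
type Reference = String<br />
type Attrs     = [(String, Either Value [Reference])]<br />
<br />
lookupAttr'' :: String -> Attrs -> Maybe Value<br />
lookupAttr'' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  either Just (dereference attrs) attv<br />
  where<br />
    dereference table refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' table) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
<br />
-- Made-up sample data.<br />
sample :: Attrs<br />
sample =<br />
  [ ("first",  Left "John")<br />
  , ("last",   Left "Smith")<br />
  , ("full",   Right ["first", "last"])<br />
  , ("broken", Right ["first", "nosuch"])<br />
  , ("empty",  Right [])<br />
  ]<br />
<br />
main :: IO ()<br />
main = do<br />
  print (lookupAttr'' "first"  sample)  -- Just "John"<br />
  print (lookupAttr'' "full"   sample)  -- Just "John:Smith"<br />
  print (lookupAttr'' "broken" sample)  -- Nothing: one reference is missing<br />
  print (lookupAttr'' "empty"  sample)  -- Nothing: guard rejects the empty list<br />
</haskell><br />
<br />
Note that this sketch does not protect against attributes that refer to each other in a cycle.<br />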
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the monad <hask>Maybe</hask>, sequence continues execution<br />
until the first <hask>Nothing</hask>. The same behavior can be<br />
observed for the IO monad, where execution stops at the first failing action.<br />
Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
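<br />
To make that last claim tangible, here is a sketch of how such a function could be written. <hask>mySequence</hask> is my own name and this is not the library source, which may well be defined differently. The point is that the definition mentions neither <hask>Maybe</hask> nor IO; the "stop" behavior comes entirely from the monad in which it is used:<br />
<br />
<haskell><br />
-- A hand-written stand-in for sequence (illustrative, not the library code).<br />
mySequence :: Monad m => [m a] -> m [a]<br />
mySequence []       = return []<br />
mySequence (m : ms) = do<br />
  x  <- m              -- run the first action<br />
  xs <- mySequence ms  -- then the rest<br />
  return (x : xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (mySequence [Just 'a', Just 'b'])           -- Just "ab"<br />
  print (mySequence [Just 'a', Nothing, Just 'c'])  -- Nothing<br />
  units <- mySequence [putStrLn "a", putStrLn "b"]  -- prints a, then b<br />
  print units                                       -- [(),()]<br />
</haskell><br />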
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
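<br />
For <hask>Maybe</hask>, the idea behind <hask>guard</hask> can be sketched in three lines. <hask>guardMaybe</hask> is a made-up, Maybe-only stand-in; the real <hask>guard</hask> from Control.Monad is more general:<br />
<br />
<haskell><br />
-- Succeed with a dummy value when the condition holds, fail otherwise.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()<br />
guardMaybe False = Nothing<br />
<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)  -- a Nothing here skips the rest of the block<br />
  return (v + 1)<br />
<br />
main :: IO ()<br />
main = print (map foo [Just 4, Just 5, Just 6])  -- [Just 5,Nothing,Just 7]<br />
</haskell><br />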
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open to proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
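<br />
If you would like to see both forms run side by side, here is a small sketch. The definitions of <hask>someAction</hask>, <hask>someOtherAction</hask>, <hask>foo</hask> and <hask>bar</hask> are placeholders I invented so that the code compiles; substitute anything of a suitable type:<br />
<br />
<haskell><br />
-- Placeholder actions and functions (assumptions for this sketch).<br />
someAction :: IO Int<br />
someAction = return 1<br />
<br />
someOtherAction :: IO String<br />
someOtherAction = return "two"<br />
<br />
foo :: Int -> Int<br />
foo = (+ 10)<br />
<br />
bar :: String -> Int<br />
bar = length<br />
<br />
withDo :: IO ()<br />
withDo = do<br />
  a <- someAction<br />
  b <- someOtherAction<br />
  print (bar b)<br />
  print (foo a)<br />
  print "done"<br />
<br />
withBind :: IO ()<br />
withBind =<br />
  someAction >>= \a -><br />
  someOtherAction >>= \b -><br />
  print (bar b) >><br />
  print (foo a) >><br />
  print "done"<br />
<br />
main :: IO ()<br />
main = withDo >> withBind<br />
</haskell><br />
<br />
Each half prints 3, then 11, then "done" (with the quotes, since we use print and not putStrLn).<br />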
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=24866Hitchhikers guide to Haskell2008-12-16T07:56:07Z<p>Adept: Explicitly mentioned the fix for In overflow in (!!)</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
[[Category:Tutorials]]<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create a subdirectory "_darcs" to store<br />
all the version-control-related stuff.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
___ ___ _<br />
/ _ \ /\ /\/ __(_)<br />
/ /_\// /_/ / / | | GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_| Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return<br />
a String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
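<br />
Here is a minimal runnable sketch of the same idea. The names <hask>greet</hask> and <hask>actions</hask> are made up for this illustration, and <hask>sequence_</hask> is the Prelude function that runs a list of actions in order:<br />
<br />
<haskell><br />
-- Defining an action does not run it.<br />
greet :: IO ()<br />
greet = putStrLn "hello"<br />
<br />
-- Actions can be stored in a list like any other value.<br />
actions :: [IO ()]<br />
actions = [greet, putStrLn "second", greet]<br />
<br />
main :: IO ()<br />
main = do<br />
  putStrLn "Nothing has been printed by greet so far"<br />
  let later = greet   -- still just a value<br />
  later               -- now it runs and prints hello<br />
  sequence_ actions   -- prints hello, second, hello<br />
</haskell><br />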
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them into even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will infer the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
can explicitly mark the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
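<br />
Partial application and the backtick syntax do not need Parsec to be demonstrated. Here is a tiny sketch using only the Prelude; <hask>within</hask> and <hask>isDigitValue</hask> are names I made up for it:<br />
<br />
<haskell><br />
-- A three-argument function (illustrative).<br />
within :: Int -> Int -> Int -> Bool<br />
within lo hi x = lo <= x && x <= hi<br />
<br />
-- Supplying only two arguments yields a new function of type Int -> Bool.<br />
isDigitValue :: Int -> Bool<br />
isDigitValue = within 0 9<br />
<br />
main :: IO ()<br />
main = do<br />
  print (map isDigitValue [-1, 5, 10])  -- [False,True,False]<br />
  -- Backticks let us write a two-argument function between its arguments.<br />
  print (7 `div` 2, div 7 2)            -- (3,3)<br />
</haskell><br />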
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error, or parse us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
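<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch (the function name <hask>describe</hask> is made up for this illustration):<br />
<br />
<haskell><br />
-- The constructor (the "tag") tells us which alternative we hold.<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ show c<br />
<br />
main :: IO ()<br />
main = mapM_ (putStrLn . describe) [Left 42, Right 'x']<br />
</haskell><br />
<br />
This prints "an Int: 42" and then "a Char: 'x'".<br />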
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes against the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it], if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with the higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
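<br />
To see what the "higher-order approach" from the list above buys us, here is a minimal sketch that packs plain numbers instead of "Dir"s. The names "capacity", "step", "packFold" and "packRec" are made up for this illustration and are not part of cd-fit. Both functions do the same job: "packFold" lets "foldl" do the list traversal, while "packRec" spells the recursion out case by case:<br />
<br />
<haskell><br />
-- Hypothetical, simplified packer: sizes are plain Ints and the "pack"<br />
-- is a pair of (total size, sizes taken so far)<br />
capacity :: Int<br />
capacity = 10<br />
<br />
-- One step: take size "s" only when the new total still fits<br />
step :: (Int, [Int]) -> Int -> (Int, [Int])<br />
step (total, taken) s<br />
  | total + s > capacity = (total, taken)<br />
  | otherwise            = (total + s, s : taken)<br />
<br />
-- Higher-order version: "foldl" does the list traversal for us<br />
packFold :: [Int] -> (Int, [Int])<br />
packFold = foldl step (0, [])<br />
<br />
-- The same traversal written as a case-by-case recursive definition<br />
packRec :: [Int] -> (Int, [Int])<br />
packRec = go (0, [])<br />
  where<br />
    go acc []     = acc<br />
    go acc (s:ss) = go (step acc s) ss<br />
<br />
main :: IO ()<br />
main = do<br />
  print (packFold [7, 5, 3, 1])<br />
  print (packRec  [7, 5, 3, 1])<br />
</haskell><br />
<br />
Both calls in "main" print the same pair, since the two functions differ only in how the traversal is written down.<br />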
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
-- compute solution and print it<br />
putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to re-pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok?<br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/"<br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
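<br />
A QuickCheck property is an ordinary function that returns "Bool", so nothing stops us from trying it by hand. The sketch below is self-contained and assumes that the packer sorts directories from the biggest down, as described earlier. The list "samples" and the helper "mb" are invented for this illustration and stand in for the random data that QuickCheck would generate:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show<br />
data DirPack = DirPack {pack_size :: Int, dirs :: [Dir]} deriving Show<br />
<br />
media_size :: Int<br />
media_size = 700*1024*1024<br />
<br />
greedy_pack :: [Dir] -> DirPack<br />
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds<br />
  where<br />
    -- Biggest directories first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
maybe_add_dir :: DirPack -> Dir -> DirPack<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d : dirs p<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
-- The property itself: just a function from a list of "Dir"s to "Bool"<br />
prop_greedy_pack_is_fixpoint :: [Dir] -> Bool<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
<br />
-- Hand-picked inputs (sizes given in Mb), hypothetical test data<br />
samples :: [[Dir]]<br />
samples =<br />
  [ []<br />
  , [mb 10 "a"]<br />
  , [mb 400 "a", mb 400 "b", mb 300 "c"]<br />
  , [mb 1400 "too-big", mb 700 "exact"]<br />
  ]<br />
  where mb n name = Dir (n*1024*1024) name<br />
<br />
-- Check the property on every sample<br />
main :: IO ()<br />
main = print (all prop_greedy_pack_is_fixpoint samples)<br />
</haskell><br />
<br />
What QuickCheck adds on top of this is the generation of the inputs, and a report of the offending input when the property fails.<br />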
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary.<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
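<br />
A minimal sketch of this arrangement, with all names ("HasSize", "byteSize", "File", "Folder", "totalSize") invented for the illustration: the class describes the ''interface'', the two instances show how concrete types implement it, and "totalSize" plays the role of library code that knows nothing but the interface:<br />
<br />
<haskell><br />
-- Hypothetical interface: anything that can report its size in bytes<br />
class HasSize a where<br />
  byteSize :: a -> Integer<br />
<br />
-- Hypothetical "user" types<br />
data File   = File String Integer<br />
data Folder = Folder String [File]<br />
<br />
instance HasSize File where<br />
  byteSize (File _ s) = s<br />
<br />
instance HasSize Folder where<br />
  byteSize (Folder _ fs) = sum (map byteSize fs)<br />
<br />
-- "Library" code: written once, works for every instance of HasSize<br />
totalSize :: HasSize a => [a] -> Integer<br />
totalSize = sum . map byteSize<br />
<br />
main :: IO ()<br />
main = do<br />
  print (totalSize [File "a" 10, File "b" 32])<br />
  print (totalSize [Folder "d" [File "a" 10, File "b" 32], Folder "e" []])<br />
</haskell><br />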
<br />
Consider the function "sort" from standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! A typeclass often contains "default" implementations<br />
for some of its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is enough to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
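<br />
Here is a small sketch of such a minimal definition. The type "Track" is made up for this example (so as not to spoil the exercises for "Dir" below), and the decision to compare tracks by length only is an assumption of the sketch:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type used only in this illustration<br />
data Track = Track {track_len :: Int, track_title :: String} deriving Show<br />
<br />
-- "Ord" requires "Eq", so we provide it first: two tracks are<br />
-- considered equal when their lengths are equal<br />
instance Eq Track where<br />
  t1 == t2 = track_len t1 == track_len t2<br />
<br />
-- Only "compare" is defined here; (<), (<=), (>), (>=), max and min<br />
-- are filled in by the default implementations<br />
instance Ord Track where<br />
  compare t1 t2 = compare (track_len t1) (track_len t2)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (Track 1 "short" < Track 2 "long")<br />
  print (map track_title (sort [Track 3 "c", Track 1 "a", Track 2 "b"]))<br />
</haskell><br />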
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
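<br />
A short sketch of what several derived instances give you at once. The type "Media" is hypothetical and exists only for this example:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type; all five instances are written by the compiler<br />
data Media = CD | DVD | BluRay<br />
  deriving (Eq, Ord, Show, Enum, Bounded)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (CD == DVD)                      -- uses Eq<br />
  print (sort [BluRay, CD, DVD])         -- uses Ord (and Show to print)<br />
  print [minBound .. maxBound :: Media]  -- uses Enum and Bounded<br />
</haskell><br />
<br />
The derived "Ord" instance follows the order in which the constructors are listed in the declaration.<br />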
<br />
Side note for Java programmers: just imagine a Java compiler which derives the code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] that stands for any instance of "Arbitrary", we could substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
  -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/"<br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is a number of arguments for data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
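<br />
"liftM2" is not tied to "Gen"; it works in any monad. The sketch below tries it in two monads from the standard library, "Maybe" and lists, which are not otherwise used in this chapter and serve here only as a convenient playground. "myLiftM2" is the definition quoted above, written out under a different name:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving (Eq, Show)<br />
<br />
-- The definition from the list above, under a made-up name<br />
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r<br />
myLiftM2 f a1 a2 = do x <- a1; y <- a2; return (f x y)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- Both components are available, so we get a "Dir" wrapped in "Just"<br />
  print (liftM2 Dir (Just 10) (Just "foo"))<br />
  -- One component is missing, so the whole result is missing<br />
  print (liftM2 Dir (Just 10) Nothing)<br />
  -- In the list monad every size is combined with every name<br />
  print (myLiftM2 Dir [1, 2] ["a", "b"])<br />
</haskell><br />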
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs =<br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..]<br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit =<br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                , dir_size d > 0<br />
                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
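<br />
The trick used by "precomp" (a lazily built list whose elements are computed by looking up earlier elements of the very same list) may be easier to see on a smaller problem. Here is a sketch with Fibonacci numbers, which have nothing to do with cd-fit and are used purely as an illustration:<br />
<br />
<haskell><br />
-- "fibs" plays the role of "precomp": element number n holds "fib n"<br />
fibs :: [Integer]<br />
fibs = map fib [0..]<br />
  where<br />
    fib :: Int -> Integer<br />
    -- Recursion base<br />
    fib 0 = 0<br />
    fib 1 = 1<br />
    -- Recursion step: look the smaller answers up in the list<br />
    fib n = fibs !! (n - 1) + fibs !! (n - 2)<br />
<br />
main :: IO ()<br />
main = print (fibs !! 50)<br />
</haskell><br />
<br />
Each element of "fibs" is computed at most once and then remembered, which is what makes the naive-looking recursion affordable. The price is that the list has to stay in memory.<br />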
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a fixed-size, platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in overflow (GHC does not report an error; the value silently wraps around).<br />
Standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate, will spot at once that in this case<br />
the 'Int' is so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 Gb.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
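<br />
A small sketch of moving between the two types explicitly, using only the Prelude functions "toInteger" and "fromInteger" (the name "top" is chosen just for this example):<br />
<br />
<haskell><br />
main :: IO ()<br />
main = do<br />
  let top = maxBound :: Int<br />
  -- Converted to Integer, "one more than the biggest Int" is no problem<br />
  print (toInteger top + 1 > toInteger top)<br />
  -- Going back is safe as long as the value fits into Int<br />
  print (fromInteger (toInteger top) == top)<br />
  -- Integer has no upper bound to worry about<br />
  print ((2 :: Integer) ^ (50 :: Int))<br />
</haskell><br />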
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass<br />
<hask>Num</hask>, thus we could use <hask>fromInteger</hask> to turn our <hask>Integer</hask> values into the <hask>Int</hask> that <hask>(!!)</hask> expects and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
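<br />
Here is the same conversion in isolation, as a small sketch. The list "squares" is made up for the example. The sketch also shows "genericIndex" from Data.List, a library alternative to "(!!)" which accepts an index of any integral type, so that no conversion is needed:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- Hypothetical list to index into<br />
squares :: [Integer]<br />
squares = map (^ 2) [0..]<br />
<br />
main :: IO ()<br />
main = do<br />
  let i = 12 :: Integer<br />
  -- Explicit conversion of the index to Int, as (!!) demands<br />
  print (squares !! fromInteger i)<br />
  -- genericIndex takes the Integer index as it is<br />
  print (squares `genericIndex` i)<br />
</haskell><br />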
<br />
Type errors went away, but a careful reader will spot at once that when the value of<br />
expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since the computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                   MODULE   %time %alloc<br />
<br />
 precomputeDisksFor            Main      88.1   99.8<br />
 dynamic_pack                  Main      11.0    0.0<br />
<br />
                                                       individual     inherited<br />
 COST CENTRE                   MODULE   no.  entries  %time %alloc   %time %alloc<br />
<br />
 MAIN                          MAIN       1        0    0.0    0.0   100.0  100.0<br />
 CAF                           Main     174       11    0.9    0.2   100.0  100.0<br />
 prop_dynamic_pack_small_disk  Main     181      100    0.0    0.0    99.1   99.8<br />
 dynamic_pack                  Main     182      200   11.0    0.0    99.1   99.8<br />
 precomputeDisksFor            Main     183      200   88.1   99.8    88.1   99.8<br />
 main                          Main     180        1    0.0    0.0     0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 Mb, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. Heap profile was saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
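<br />
A "thunk" is a value that has been promised but not yet computed. Here is a minimal sketch of thunks that get allocated and are never used (the list "cells" is invented for this illustration): every element except number 5 would abort the program if it were ever evaluated, yet the lookup succeeds, because "(!!)" walks past the first five elements without looking inside them:<br />
<br />
<haskell><br />
-- Every element except number 5 is a booby trap<br />
cells :: [Int]<br />
cells = map (\n -> if n == 5 then n else error "evaluated!") [0..]<br />
<br />
-- (!!) walks over list cells 0..4 but never evaluates their contents<br />
main :: IO ()<br />
main = print (cells !! 5)<br />
</haskell><br />
<br />
The skipped elements still occupy memory as list cells holding unevaluated thunks, which is the kind of memory the heap profile reports as "VOID".<br />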
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
  -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
  -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
  -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
  -- producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
  -- are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of 700! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably large test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes may still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
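<br />
If you want to play with this step in isolation, the fragment above can be<br />
fleshed out into a minimal self-contained sketch. The <tt>Dir</tt> and<br />
<tt>DirPack</tt> declarations below are assumed stand-ins for the types from<br />
the earlier chapters, and the <tt>dir_size d > 0</tt> check is folded into<br />
the <tt>inRange (1,limit)</tt> filter, so treat it as an illustration and not<br />
as the contents of 'cd-fit-4-5.hs':<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
import Data.List (delete, maximumBy)<br />
<br />
-- Assumed stand-ins for the types used throughout the guide.<br />
data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving (Eq, Show)<br />
data DirPack = DirPack { pack_size :: Integer, dirs :: [Dir] } deriving Show<br />
<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d:rest)<br />
       -- Only non-empty dirs that fit on the disk by themselves are candidates.<br />
       | let small_enough = filter (inRange (1, limit) . dir_size) ds<br />
       , d <- small_enough<br />
       -- The recursive call no longer sees `d' or the dirs that were too big.<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy cmpSize packs<br />
  where<br />
    cmpSize a b = compare (pack_size a) (pack_size b)<br />
</haskell><br />
<br />
In ghci, <hask>pack_size (bestDisk 10 [Dir 6 "a", Dir 5 "b", Dir 4 "c"])</hask><br />
evaluates to 10: the directories of size 6 and 4 fill the disk exactly.<br />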
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
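<br />
As a starting point for that exercise, here is one possible sketch of the<br />
scaling step alone. The <tt>Dir</tt> type is an assumed stand-in for the one<br />
used in the guide, and <tt>divUp</tt> and <tt>scaleProblem</tt> are<br />
hypothetical names. Directory sizes are rounded up and the medium size is<br />
rounded down, so any packing that fits in the scaled problem also fits in the<br />
original one, at the price of possibly missing the very best packing:<br />
<br />
<haskell><br />
-- Assumed stand-in for the guide's Dir type.<br />
data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving (Eq, Show)<br />
<br />
-- Integer division that rounds up instead of down.<br />
divUp :: Integer -> Integer -> Integer<br />
divUp a b = (a + b - 1) `div` b<br />
<br />
-- Returns the scaling unit, the scaled medium size and the scaled dirs.<br />
scaleProblem :: Integer -> [Dir] -> (Integer, Integer, [Dir])<br />
scaleProblem limit ds =<br />
  case [ dir_size d | d <- ds, dir_size d > 0 ] of<br />
    -- Nothing to scale by: leave the problem untouched.<br />
    []    -> (1, limit, ds)<br />
    sizes -><br />
      let unit   = minimum sizes<br />
          scaled = [ d { dir_size = dir_size d `divUp` unit } | d <- ds ]<br />
      in (unit, limit `div` unit, scaled)<br />
</haskell><br />
<br />
For example, scaling a 700-unit medium with directories of sizes 100 and 250<br />
gives a unit of 100, a medium of size 7 and directories of sizes 1 and 3.<br />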
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (See Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisited monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or a<br />
list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr', but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
-- First, we lookup 'Maybe AttValue' by name and<br />
-- check whether we are successful:<br />
case (lookup nm attrs) of<br />
-- Pass the lookup error through.<br />
Nothing -> Nothing <br />
    -- If the given name exists, see if it is a value or a reference:<br />
Just attv -> case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
-- .. then, we will exclude lookup failures<br />
wo_failures = filter (/=Nothing) vals<br />
-- ... find a way to remove annoying 'Just' wrapper<br />
stripJust (Just v) = v<br />
-- ... use it to extract all lookup results as strings<br />
strings = map stripJust wo_failures<br />
in<br />
-- ... finally, combine them into single String. <br />
-- If all lookups failed, we should pass failure to caller.<br />
case null strings of<br />
True -> Nothing<br />
False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether the error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see section 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into function<br />
<hask>lookupAttr</hask> in a way that would take into account possible<br />
failures.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
-- First, we lookup 'AttValue' by name<br />
attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
-- ... since all failures are already excluded by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
-- ... we just combine values into single String,<br />
-- ... and return failure if it is empty. <br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
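<br />
As a warm-up for the exercise, the comparison can be sketched without<br />
QuickCheck. The two functions below are condensed copies of the versions<br />
above (the first one uses <tt>catMaybes</tt> instead of the hand-written<br />
filtering), while <tt>sample_attrs</tt> and <tt>agreeOn</tt> are hypothetical<br />
names introduced only for this sketch. It suggests that the two versions<br />
agree as long as every reference resolves, but differ when only some of the<br />
references are missing: the first version drops the failed lookups, whereas<br />
<hask>sequence</hask> turns a single failure into <hask>Nothing</hask>:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intersperse)<br />
import Data.Maybe (catMaybes)<br />
<br />
type Name      = String<br />
type Attribute = (Name, Either String [String])<br />
<br />
-- Condensed first version: failed references are silently dropped.<br />
lookupAttr :: Name -> [Attribute] -> Maybe String<br />
lookupAttr nm attrs = case lookup nm attrs of<br />
  Nothing           -> Nothing<br />
  Just (Left val)   -> Just val<br />
  Just (Right refs) -><br />
    let strings = catMaybes [ lookupAttr ref attrs | ref <- refs ]<br />
    in if null strings then Nothing<br />
                       else Just (concat (intersperse ":" strings))<br />
<br />
-- Monadic version: a single failed reference fails the whole lookup.<br />
lookupAttr' :: Name -> [Attribute] -> Maybe String<br />
lookupAttr' nm attrs = do<br />
  attv <- lookup nm attrs<br />
  case attv of<br />
    Left val   -> Just val<br />
    Right refs -> do<br />
      vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
<br />
-- Hypothetical test data: "broken" refers to a missing attribute.<br />
sample_attrs :: [Attribute]<br />
sample_attrs = [ ( "xmlns",  Right ["ns","subns"] )<br />
               , ( "ns",     Left "jabber" )<br />
               , ( "subns",  Left "client" )<br />
               , ( "broken", Right ["ns","missing"] ) ]<br />
<br />
agreeOn :: Name -> Bool<br />
agreeOn nm = lookupAttr nm sample_attrs == lookupAttr' nm sample_attrs<br />
</haskell><br />
<br />
Here <hask>agreeOn "xmlns"</hask> is <hask>True</hask>, while<br />
<hask>agreeOn "broken"</hask> is <hask>False</hask>, which is worth keeping in<br />
mind when you choose the names your QuickCheck property will generate.<br />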
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code occupies a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens whenever we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
type constructor <hask>Maybe</hask> is such a monad that we could use<br />
<hask><-</hask> to "extract" values from wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
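<br />
To take some of the magic out of it, here is a small sketch of what the<br />
"do" block does for <hask>Maybe</hask>. The helper <tt>andThen</tt> is a<br />
hypothetical name that spells out the check by hand; the library provides<br />
the same behaviour through the <hask>>>=</hask> operator:<br />
<br />
<haskell><br />
-- Hand-written "check and continue" step for Maybe (hypothetical helper).<br />
andThen :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
andThen Nothing  _ = Nothing   -- failure: skip the rest of the computation<br />
andThen (Just v) f = f v       -- success: unwrap the value and carry on<br />
<br />
-- The same function written three ways.<br />
fooDo, fooBind, fooByHand :: Maybe Int -> Maybe Int<br />
fooDo     x = do v <- x; return (v+1)<br />
fooBind   x = x >>= \v -> return (v+1)<br />
fooByHand x = x `andThen` \v -> Just (v+1)<br />
</haskell><br />
<br />
All three give <hask>Just 6</hask> for <hask>Just 5</hask> and<br />
<hask>Nothing</hask> for <hask>Nothing</hask>.<br />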
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value<br />
[Reference]</hask>. Surely we are not the first to do this, and such<br />
a use case has to be quite common.<br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
attv <- lookup nm attrs<br />
either Just (dereference attrs) attv<br />
where<br />
dereference attrs refs = do <br />
vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for monad <hask>Maybe</hask> sequence continues execution<br />
until the first <hask>Nothing</hask>. The same behavior could be<br />
observed for IO monad. Take into account that different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
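<br />
To see why nothing needs to be hardcoded, here is a sketch of how<br />
<hask>sequence</hask> could be written using nothing but "do" notation. The<br />
name <tt>sequence'</tt> is made up to avoid clashing with the Prelude, and<br />
the actual library definition may well differ:<br />
<br />
<haskell><br />
-- Run the actions one by one and collect their results.<br />
-- All the monad-specific behaviour comes from '<-' and 'return'.<br />
sequence' :: Monad m => [m a] -> m [a]<br />
sequence' []     = return []<br />
sequence' (m:ms) = do<br />
  x  <- m<br />
  xs <- sequence' ms<br />
  return (x:xs)<br />
</haskell><br />
<br />
With <hask>Maybe</hask>, the first <hask>Nothing</hask> short-circuits the<br />
rest because that is what '<-' does in that monad; with IO, the actions are<br />
simply performed in order.<br />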
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
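<br />
The same idea as a loadable sketch: <tt>guardMaybe</tt> is a hypothetical,<br />
<hask>Maybe</hask>-only rendition of what <hask>guard</hask> does (the library<br />
function is more general), and <tt>table</tt> is made-up data that shows<br />
<hask>flip</hask> swapping the arguments of <hask>lookup</hask>, just as we<br />
did with <hask>lookupAttr'</hask>:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
<br />
-- What 'guard' boils down to in the Maybe monad.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()    -- condition holds: carry on<br />
guardMaybe False = Nothing    -- condition fails: abort the "do" block<br />
<br />
foo, fooLib :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)<br />
  return (v + 1)<br />
-- The same function using the library 'guard'.<br />
fooLib x = do<br />
  v <- x<br />
  guard (v /= 5)<br />
  return (v + 1)<br />
<br />
-- 'flip lookup table' takes the key as its only remaining argument,<br />
-- which makes it convenient to map over a list of keys.<br />
table :: [(String, Int)]<br />
table = [("one", 1), ("two", 2)]<br />
<br />
lookups :: [Maybe Int]<br />
lookups = map (flip lookup table) ["one", "three"]<br />
</haskell><br />
<br />
Here <hask>lookups</hask> is <hask>[Just 1,Nothing]</hask>.<br />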
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Haskell/Understanding_monads this wikibook chapter]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
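<br />
If you would like to run both forms side by side, here is a minimal sketch.<br />
The bodies of <tt>someAction</tt>, <tt>someOtherAction</tt>, <tt>foo</tt> and<br />
<tt>bar</tt> are placeholders invented for this example:<br />
<br />
<haskell><br />
-- Placeholder actions and functions, just to make the example runnable.<br />
someAction :: IO Int<br />
someAction = return 1<br />
<br />
someOtherAction :: IO String<br />
someOtherAction = return "two"<br />
<br />
foo :: Int -> Int<br />
foo = (+ 1)<br />
<br />
bar :: String -> Int<br />
bar = length<br />
<br />
-- The "do" version ...<br />
withDo :: IO ()<br />
withDo = do a <- someAction<br />
            b <- someOtherAction<br />
            print (bar b)<br />
            print (foo a)<br />
            print "done"<br />
<br />
-- ... and the same thing written with (>>=) and (>>).<br />
withBind :: IO ()<br />
withBind = someAction >>= \a -><br />
           someOtherAction >>= \b -><br />
           print (bar b) >><br />
           print (foo a) >><br />
           print "done"<br />
</haskell><br />
<br />
Running <hask>withDo</hask> and <hask>withBind</hask> in ghci should print the<br />
same three lines.<br />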
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this is available on the web and this wiki. Just get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Parsers&diff=15134Parsers2007-08-18T15:11:34Z<p>Adept: Collection of links to parsers implemented in haskell</p>
<hr />
<div>[[Category:Libraries]]<br />
Language/File format Parsers in Haskell<br />
<br />
This page is intended to serve as a collection of links to various readily-available parsers, implemented in Haskell.<br />
<br />
== ASN.1 ==<br />
[[Parsec]] parser for large subset of ASN.1 grammar (circa 98): http://adept.linux.kiev.ua/repos/asn1/<br />
<br />
== E-mail/SMTP ==<br />
[[Parsec]] parsers for grammars from RFC2821 and RFC2822 : http://cryp.to/hsemail/<br />
<br />
== Javascript ==<br />
[[Parsec]] parser for Javascript 1.5: [[Libraries_and_tools/HJS]]<br />
<br />
== JSON ==<br />
[[Parsec]] parser for JSON: http://www.tom.sfc.keio.ac.jp/~sakai/d/data/200604/JSON.hs<br />
<br />
Another [[Parsec]] parser for JSON: http://snippets.dzone.com/posts/show/3660<br />
<br />
== Other places to look ==<br />
Make sure you have visited [[Applications_and_libraries/Compilers_and_interpreters]].<br />
<br />
Found parser which I forgot to mention? Add link here.</div>Adepthttps://wiki.haskell.org/index.php?title=QuickCheck_as_a_test_set_generator&diff=13082QuickCheck as a test set generator2007-05-18T10:42:18Z<p>Adept: Added missing code for "neStringOf"</p>
<hr />
<div><center><span style='font-size:xx-large; font-weight:bold'>Haskell as an ultimate "smoke testing" tool </span><p>OR</p> <p><span style='font-size:x-large; font-weight:bold'>Using QuickCheck as a DIY test data generator</span></p></center><br />
<br />
== Preface ==<br />
<br />
Recently, my wife approached me with the following problem: they had to<br />
test their re-implementation (in Java) of the part of the huge<br />
software system previously written in C++. The original system is poorly<br />
documented and only a small part of the sources were available.<br />
<br />
Among other things, they had to write a parser for a home-brewed DSL<br />
designed to describe data structures. The DSL is a mix of ASN.1 and BNF<br />
grammars; it describes the structure of some data records and simple<br />
business rules relevant to the processing of said records. The DSL is not<br />
Turing-complete, but allows users to define their own functions and<br />
specify math and boolean expressions on fields, and was designed as<br />
"ASN.1 on steroids".<br />
<br />
The problem is that their implementation (in JavaCC) of this DSL parser<br />
was based on the single available description of the DSL grammar,<br />
which was presumably incomplete. They tested the implementation on the few<br />
examples available, but the question remained how to test the parser on a<br />
large set of data in order to be fairly sure that "everything<br />
works".<br />
<br />
== The fame of Quick Check ==<br />
<br />
My wife observed me during the last (2005) ICFP contest and was amazed<br />
at the ease with which our team tested our protocol parser and<br />
printer using Quick Check. So, she asked me whether it is possible to<br />
generate pseudo-random test data in a similar manner for use<br />
"outside" of Haskell?<br />
<br />
"Why not?" I thought. After all, I found it quite easy to generate<br />
instances of 'Arbitrary' for quite complex data structures.<br />
<br />
== Concept of the '''Variant''' ==<br />
<br />
The task was formulated as follows:<br />
<br />
* The task is to generate test datasets for the external program. Each dataset consists of several files, each containing 1 "record"<br />
<br />
* A "record" is essentially a Haskell data type<br />
<br />
* We must be able to generate pseudo-random "valid" and "invalid" data, to test that external program consumes all "valid" samples and fails to consume all "invalid" ones. Deviation from this behavior signifies an error in external program.<br />
<br />
Let's capture this notion of "valid" and "invalid" data in a type<br />
class:<br />
<br />
<haskell><br />
module Variant where<br />
<br />
import Control.Monad<br />
import Test.QuickCheck<br />
<br />
class Variant a where<br />
valid :: Gen a<br />
invalid :: Gen a<br />
</haskell> <br />
<br />
So, in order to make a set of test data of some type, the user must<br />
provide means to generate "valid" and "invalid" data of this type.<br />
<br />
If we can make a "valid" Foo (for suitable "data Foo = ...") and<br />
"invalid" Foo, then we should also be able to make a "random" Foo:<br />
<br />
<haskell><br />
instance Variant a => Arbitrary a where<br />
coarbitrary = undefined -- Not needed, Easily fixable<br />
arbitrary = oneof [valid, invalid]<br />
</haskell><br />
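<br />
A note for readers on a recent toolchain: a catch-all instance like the one<br />
above needs GHC extensions and overlaps with the instances that QuickCheck<br />
already ships, and in QuickCheck 2 <tt>coarbitrary</tt> has moved to a class of<br />
its own. A more modest alternative, sketched here under the assumption that<br />
QuickCheck 2 is installed, is an ordinary generator function. The names<br />
<tt>anyVariant</tt>, <tt>Digit</tt> and <tt>sampleDigits</tt> are hypothetical:<br />
<br />
<haskell><br />
import Test.QuickCheck (Gen, elements, generate, oneof, vectorOf)<br />
<br />
class Variant a where<br />
  valid   :: Gen a<br />
  invalid :: Gen a<br />
<br />
-- Mix valid and invalid samples without declaring an Arbitrary instance.<br />
anyVariant :: Variant a => Gen a<br />
anyVariant = oneof [valid, invalid]<br />
<br />
-- A made-up type to try it on.<br />
newtype Digit = Digit Char deriving (Eq, Show)<br />
<br />
instance Variant Digit where<br />
  valid   = Digit <$> elements "0123456789"<br />
  invalid = Digit <$> elements "abc!?"<br />
<br />
-- Twenty samples, some valid and some not.<br />
sampleDigits :: IO [Digit]<br />
sampleDigits = generate (vectorOf 20 anyVariant)<br />
</haskell><br />
<br />
The rest of this article works the same way if you write<br />
<tt>anyVariant</tt> wherever a "random" value is wanted.<br />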
<br />
Thus, taking for example the following definition for our<br />
"data-to-test":<br />
<br />
<haskell><br />
data Record = InputRecord Name Number<br />
| OutputRecord Name Number OutputType<br />
data Number = Number String<br />
data Name = Name String <br />
data OutputType = OutputType String<br />
</haskell><br />
<br />
we could produce the following instances of the class "Variant":<br />
<br />
<haskell><br />
-- For the definition of `neStringOf` see below; for now it is sufficient<br />
-- to say that `neStringOf first next` produces a non-empty string whose<br />
-- first character is taken from `first` and all subsequent ones from<br />
-- `next`<br />
garbledString = neStringOf ".,_+-" "abc0!@#$%^&*()."<br />
instance Variant Number where<br />
valid = liftM Number $ resize 4 $ neStringOf "123456789" "0123456789"<br />
invalid = liftM Number $ resize 4 $ garbledString<br />
instance Variant Name where<br />
valid = liftM Name $ elements [ "foo", "bar", "baz" ]<br />
invalid = liftM Name garbledString<br />
instance Variant OutputType where<br />
valid = liftM OutputType $ elements [ "Binary", "Ascii" ]<br />
invalid = liftM OutputType garbledString<br />
<br />
instance Variant Record where<br />
valid = oneof [ liftM2 InputRecord valid valid<br />
, liftM3 OutputRecord valid valid valid ]<br />
invalid = oneof [ liftM2 InputRecord valid invalid<br />
, liftM2 InputRecord invalid valid<br />
, liftM2 InputRecord invalid invalid<br />
, liftM3 OutputRecord invalid valid valid <br />
, liftM3 OutputRecord valid invalid valid <br />
, liftM3 OutputRecord valid valid invalid<br />
, liftM3 OutputRecord invalid invalid valid <br />
, liftM3 OutputRecord valid invalid invalid <br />
, liftM3 OutputRecord invalid valid invalid<br />
, liftM3 OutputRecord invalid invalid invalid<br />
]<br />
</haskell><br />
<br />
The careful reader will have already spotted that once we hand-coded the instances of 'Variant' for a few "basic" types (like 'Name', 'Number', 'OutputType' etc), defining instances of Variant for more complex datatypes becomes easy, though quite tedious. We call to the rescue a set of simple helpers to facilitate this task.<br />
<br />
== Helper tools ==<br />
<br />
It could easily be seen that we consider an instance of a data type to be "invalid" if at least one of the arguments to the constructor is "invalid", whereas a "valid" instance should have all arguments to data type constructor to be "valid". This calls for some permutations:<br />
<br />
<haskell><br />
proper1 f = liftM f valid<br />
proper2 f = liftM2 f valid valid<br />
proper3 f = liftM3 f valid valid valid<br />
proper4 f = liftM4 f valid valid valid valid<br />
proper5 f = liftM5 f valid valid valid valid valid<br />
<br />
bad1 f = liftM f invalid<br />
bad2 f = oneof $ tail [ liftM2 f g1 g2 | g1<-[valid, invalid], g2<-[valid, invalid] ]<br />
bad3 f = oneof $ tail [ liftM3 f g1 g2 g3 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid] ]<br />
bad4 f = oneof $ tail [ liftM4 f g1 g2 g3 g4 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid] ]<br />
bad5 f = oneof $ tail [ liftM5 f g1 g2 g3 g4 g5 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid], g5<-[valid, invalid] ]<br />
</haskell><br />
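<br />
The <hask>tail</hask> in the <tt>bad</tt> helpers deserves a word: the list<br />
comprehension enumerates every valid/invalid combination, and the very first<br />
one is always "all valid", which is exactly the one we must not produce. A<br />
QuickCheck-free sketch of the same enumeration, with hypothetical names<br />
<tt>combos</tt> and <tt>badCombos</tt> and strings standing in for generators:<br />
<br />
<haskell><br />
import Control.Monad (replicateM)<br />
<br />
-- All ways to label n constructor arguments as valid or invalid.<br />
-- The first combination is always the all-"valid" one.<br />
combos :: Int -> [[String]]<br />
combos n = replicateM n ["valid", "invalid"]<br />
<br />
-- Drop the all-valid combination (same effect as 'tail' here).<br />
badCombos :: Int -> [[String]]<br />
badCombos = drop 1 . combos<br />
</haskell><br />
<br />
For two arguments this leaves three combinations, for three arguments seven,<br />
and each of them has at least one "invalid" entry.<br />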
<br />
With those helper definitions we could rewrite our Record instance as follows:<br />
<br />
<haskell><br />
instance Variant Record where<br />
valid = oneof [ proper2 InputRecord<br />
, proper3 OutputRecord ]<br />
invalid = oneof [ bad2 InputRecord<br />
, bad3 OutputRecord ]<br />
</haskell><br />
<br />
Note the drastic decrease in the size of the declaration!<br />
<br />
Oh, almost forgot to include the code for "neStringOf":<br />
<haskell><br />
neStringOf chars_start chars_rest =<br />
do s <- elements chars_start<br />
r <- listOf' $ elements chars_rest<br />
return (s:r)<br />
<br />
listOf' :: Gen a -> Gen [a]<br />
listOf' gen = sized $ \n -><br />
do k <- choose (0,n)<br />
vectorOf' k gen<br />
<br />
vectorOf' :: Int -> Gen a -> Gen [a]<br />
vectorOf' k gen = sequence [ gen | _ <- [1..k] ]<br />
</haskell><br />
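<br />
If you use QuickCheck 2, the library itself exports <hask>listOf</hask> and<br />
<hask>vectorOf</hask>, so the primed helpers are probably not needed. A<br />
sketch of <tt>neStringOf</tt> under that assumption (<tt>sampleNumber</tt> is<br />
a hypothetical name, added only to show the generator in use):<br />
<br />
<haskell><br />
import Test.QuickCheck (Gen, elements, generate, listOf, resize)<br />
<br />
-- Non-empty string: first character from charsStart, the rest from charsRest.<br />
neStringOf :: [Char] -> [Char] -> Gen String<br />
neStringOf charsStart charsRest =<br />
  (:) <$> elements charsStart <*> listOf (elements charsRest)<br />
<br />
-- One sample in the style of the "valid Number" generator above.<br />
sampleNumber :: IO String<br />
sampleNumber = generate (resize 4 (neStringOf "123456789" "0123456789"))<br />
</haskell><br />
<br />
With <hask>resize 4</hask> the tail has at most four characters, so the<br />
samples are between one and five digits long and never start with a zero.<br />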
<br />
== Producing test data ==<br />
<br />
OK, but how to use all those fancy declarations to actually produce some test data?<br />
<br />
Let's take a look at the following code:<br />
<br />
<haskell><br />
data DataDefinition = DataDefinition Name Record<br />
<br />
main = <br />
do let num = 200 -- Number of test cases in each dataset.<br />
let config = -- Describe several test datasets for "DataDefinition"<br />
-- by defining how we want each component of DataDefinition<br />
-- for each particular dataset - valid, invalid or random<br />
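            -- Fields: file name prefix, file extension ("txt" is a placeholder),<br />
            -- number of test cases, generators for (Name, Record)<br />
            [ ("All_Valid",      "txt", num, (valid,     valid    ))<br />
            , ("Invalid_Name",   "txt", num, (invalid,   valid    ))<br />
            , ("Invalid_Record", "txt", num, (valid,     invalid  ))<br />
            , ("Random",         "txt", num, (arbitrary, arbitrary))<br />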
]<br />
mapM_ create_test_set config<br />
<br />
create_test_set (fname, ext, count, gens) =<br />
do rnd <- newStdGen <br />
let test_set = generate 100 rnd $ vectorOf' count (mkDataDef gens)<br />
sequence_ $ zipWith (writeToFile fname ext) [1..] test_set <br />
where<br />
mkDataDef (gen_name, gen_rec) = liftM2 DataDefinition gen_name gen_rec<br />
<br />
writeToFile name_prefix suffix n x =<br />
do h <- openFile (name_prefix ++ "_" ++ pad n ++ "." ++ suffix) WriteMode <br />
hPutStrLn h $ show x<br />
hClose h <br />
where pad n = reverse $ take 4 $ (reverse $ show n) ++ (repeat '0')<br />
</haskell><br />
<br />
You see that we could control size, nature and destination of each test dataset. This approach was taken to produce test datasets for the task I described earlier. The final Haskell module had definitions for 40 Haskell datatypes, and the topmost datatype had a single constructor with 9 fields. <br />
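<br />
The file naming scheme used by <tt>writeToFile</tt> can be tried out on its<br />
own. In this sketch <tt>pad</tt> is lifted out of the code above, while<br />
<tt>fileNameFor</tt> and the "txt" extension are hypothetical:<br />
<br />
<haskell><br />
-- Left-pad a number with zeroes to exactly four characters.<br />
-- Numbers with more than four digits keep only their last four.<br />
pad :: Int -> String<br />
pad n = reverse $ take 4 $ reverse (show n) ++ repeat '0'<br />
<br />
-- The name of the n-th file of a dataset.<br />
fileNameFor :: String -> String -> Int -> FilePath<br />
fileNameFor prefix suffix n = prefix ++ "_" ++ pad n ++ "." ++ suffix<br />
</haskell><br />
<br />
For instance, <hask>fileNameFor "All_Valid" "txt" 7</hask> gives<br />
<hask>"All_Valid_0007.txt"</hask>. Since <tt>pad</tt> truncates, datasets with<br />
more than 9999 cases would overwrite their own files.<br />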
<br />
This proved to be A Whole Lot Of Code(tm), and the declarations of "instance Variant ..." made up a good 30% of the total amount. Since most of them were variations of the "oneof [proper1 Foo, proper2 Bar, proper4 Baz]" theme, I started looking for a way to simplify/automate generation of such instances.<br />
<br />
== Deriving Variant instances automagically ==<br />
<br />
I took a post made by Bulat Ziganshin on the TemplateHaskell mailing list that shows how to derive instances of 'Show' automatically, and hacked it to derive instances of "Variant" in much the same way:<br />
<br />
<haskell><br />
import Language.Haskell.TH<br />
import Language.Haskell.TH.Syntax<br />
<br />
data T3 = T3 String<br />
<br />
deriveVariant t = do<br />
-- Get list of constructors for type t<br />
TyConI (DataD _ _ _ constructors _) <- reify t<br />
<br />
-- Make `valid` or `invalid` clause for one constructor:<br />
-- for "(A x1 x2)" makes "Variant.proper2 A"<br />
let mkClause f (NormalC name fields) = <br />
appE (varE (mkName ("Variant."++f++show(length fields)))) (conE name)<br />
<br />
-- Make body for functions `valid` and `invalid`:<br />
  --   valid = oneof [ proper2 A, proper1 C ]<br />
-- or<br />
-- valid = proper3 B, depending on the number of constructors<br />
validBody <- case constructors of<br />
[c] -> normalB [| $(mkClause "proper" c) |]<br />
cs -> normalB [| oneof $(listE (map (mkClause "proper") cs)) |]<br />
invalidBody <- case constructors of<br />
[c] -> normalB [| $(mkClause "bad" c) |]<br />
cs -> normalB [| oneof $(listE (map (mkClause "bad") cs)) |]<br />
<br />
  -- Generate template instance declaration and replace type name (T3)<br />
  -- and function bodies (valid, invalid) with our data<br />
d <- [d| instance Variant T3 where<br />
valid = liftM T3 valid<br />
invalid = liftM T3 invalid<br />
|]<br />
let [InstanceD [] (AppT showt (ConT _T3)) [ ValD validf _valid [], ValD invalidf _invalid [] ]] = d<br />
return [InstanceD [] (AppT showt (ConT t )) [ ValD validf validBody [], ValD invalidf invalidBody [] ]]<br />
<br />
-- Usage:<br />
$(deriveVariant ''Record)<br />
</haskell><br />
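<br />
To make it clearer what the splice produces, here is a hand-written sketch of the kind of instance that "deriveVariant" would generate for a type with two constructors. The class definition, the "properN"/"badN" helpers, the type "Shape" and the "Variant Int" instance are all assumptions made for illustration; the real definitions are the ones that live in the "Variant" module.<br />
<br />
<haskell><br />
module VariantSketch where<br />
<br />
import Control.Monad (liftM, liftM2)<br />
import Test.QuickCheck (Gen, elements, oneof)<br />
<br />
-- Assumed shape of the class: one generator for valid data, one for invalid<br />
class Variant a where<br />
  valid   :: Gen a<br />
  invalid :: Gen a<br />
<br />
-- Assumed helpers: apply a constructor to N generated fields<br />
proper1 f = liftM  f valid<br />
proper2 f = liftM2 f valid valid<br />
bad1    f = liftM  f invalid<br />
bad2    f = liftM2 f invalid invalid<br />
<br />
-- Hypothetical base instance, so that the fields can be generated<br />
instance Variant Int where<br />
  valid   = elements [1 .. 100]<br />
  invalid = elements [-1, 0]<br />
<br />
-- Hypothetical datatype with two constructors<br />
data Shape = Circle Int | Rect Int Int deriving Show<br />
<br />
-- What the derived instance would amount to: one clause per constructor,<br />
-- named after the number of its fields, all wrapped in "oneof"<br />
instance Variant Shape where<br />
  valid   = oneof [proper1 Circle, proper2 Rect]<br />
  invalid = oneof [bad1 Circle, bad2 Rect]<br />
</haskell><br />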
<br />
[[User:Adept|Adept]]<br />
<br />
[[Category:Idioms]]</div>Adepthttps://wiki.haskell.org/index.php?title=IRC_channel&diff=8771IRC channel2006-11-29T12:36:57Z<p>Adept: added note about haskell.ru</p>
<hr />
<div>Internet Relay Chat is a worldwide text chat service with many thousands<br />
of users among various IRC networks.<br />
<br />
The Freenode IRC network has a #haskell channel, with a high water mark<br />
of 276 concurrent clients, as of November 2006. One famous resident is<br />
[[Lambdabot]].<br />
<br />
The IRC channel can be an excellent place to learn more about Haskell,<br />
and to just keep in the loop on new things in the Haskell world. Many<br />
new developments in the Haskell world first appear on the IRC channel.<br />
<br />
== Getting there ==<br />
<br />
If you point your IRC client to [irc://chat.freenode.net/haskell chat.freenode.net] <br />
and then join the #haskell channel, you'll be there.<br />
<br />
Example, using [http://www.irssi.org/ irssi]:<br />
<br />
$ irssi -c chat.freenode.net -n myname -w mypassword<br />
/join #haskell<br />
<br />
and you're there.<br />
<br />
[[Image:Irc--haskell-screenshot.png|frame|A screenshot of an irssi session in #haskell]]<br />
<br />
== Principles ==<br />
<br />
The #haskell channel is a friendly, welcoming place to hang out, teach<br />
and learn. The goal of #haskell is to encourage learning and discussion<br />
of Haskell, functional programming, and programming in general. As part<br />
of this we welcome newbies, and encourage teaching of the language.<br />
<br />
Part of #haskell's success comes from the fact that the community<br />
is quite tight-knit -- we know each other -- it's not just a homework<br />
channel. As a result, many collaborative projects have arisen between #haskell citizens.<br />
<br />
== History ==<br />
<br />
The #haskell channel appeared in the late 90s, and really got going<br />
in early 2001, with the help of Shae Erisson (aka shapr).<br />
<br />
A fairly extensive analysis of the traffic on #haskell over the years is<br />
[http://www.cse.unsw.edu.au/~dons/irc/ kept here].<br />
<br />
<br />
== Related channels ==<br />
<br />
In addition to the main Haskell channel there are also:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! Channel<br />
! Purpose<br />
|- <br />
| #haskell.de<br />
| German speakers<br />
|-<br />
| #haskell.es<br />
| Spanish speakers<br />
|-<br />
| #haskell.fi<br />
| Finnish speakers<br />
|-<br />
| #haskell.fr <br />
| French speakers <br />
|-<br />
| #haskell.hr<br />
| Croatian speakers<br />
|-<br />
| #haskell.it <br />
| Italian speakers<br />
|-<br />
| #haskell.jp <br />
| Japanese speakers<br />
|-<br />
| #haskell.no <br />
| Norwegian speakers<br />
|-<br />
| #haskell.ru <br />
| Russian speakers. It seems that most of them have migrated to the Jabber conference (haskell@conference.jabber.ru)<br />
|-<br />
| #haskell.se <br />
| Swedish speakers<br />
|-<br />
| #haskell-overflow<br />
| Overflow conversations<br />
|-<br />
| #haskell-blah <br />
| Haskell people talking about anything except Haskell itself<br />
|-<br />
| #gentoo-haskell <br />
| Gentoo/Linux specific Haskell conversations<br />
|-<br />
| #darcs <br />
| Channel for the Darcs revision control system (which is written in Haskell)<br />
|-<br />
| #perl6 <br />
| Perl 6 development (plenty of Haskell chat there too)<br />
|-<br />
|}<br />
<br />
[[Image:Nick-activity.png|frame|Growth of #haskell]]<br />
<br />
== Logs ==<br />
<br />
'''Logs''' are kept at a few places, including<br />
<br />
* [http://tunes.org/~nef/logs/haskell/ tunes.org]<br />
* [http://meme.b9.com/clog/haskell/ meme]<br />
<br />
[[Category:Community]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=4246Hitchhikers guide to Haskell2006-06-07T08:37:56Z<p>Adept: Massive spellchecking</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- the command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create a subdirectory "_darcs" to store<br />
all version-control-related stuff.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what the function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that has the type<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
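The "list of IO actions" idea can be sketched directly. This is only an illustration, not part of "cd-fit": the name "actions" is made up, while "sequence_" is a standard Prelude function that executes a list of actions one by one.<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A list of IO actions is an ordinary value: building it prints nothing<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
main :: IO ()<br />
main = do putStrLn "The list is built, but nothing from it has run yet"<br />
          -- We can manipulate the list like any other list before running it<br />
          sequence_ (reverse actions)<br />
</haskell><br />
<br />
Because the list is reversed before it is executed, "third" is printed before "second" and "first": the order of execution is decided by us, not by the order in which the actions were written down.<br />
<br />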
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout, as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
 do { before; <br />
  putStrLn "In the middle"; <br />
        after; };<br />
<br />
main = <br />
  do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
      let {d = combine (b) (combine c c)}; <br />
   putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
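<br />
The rule about the type of a "do" block can be seen in a small sketch. The names "banner" and "askLine" are made up for this illustration; "getLine" is the standard Prelude action of type "IO String".<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- The last action is putStrLn, so the whole block has type IO ()<br />
banner :: IO ()<br />
banner = do putStrLn "----------"<br />
            putStrLn "A banner"<br />
<br />
-- The last action is getLine, so the whole block has type IO String<br />
askLine :: IO String<br />
askLine = do putStrLn "Type something:"<br />
             getLine<br />
<br />
main :: IO ()<br />
main = do banner<br />
          line <- askLine<br />
          putStrLn ("You typed: " ++ line)<br />
</haskell><br />
<br />
Swapping the two type signatures makes the compiler reject the program, which is a quick way to convince yourself that the last action decides the type.<br />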
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we know already. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. The thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parse errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
Clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
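A minimal sketch of what the derived code buys us. The type "Opaque" is made up for this illustration and deliberately lacks the "deriving" clause:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- With "deriving Show" the compiler writes the Show code for us<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- Hypothetical type without it: values cannot be turned into a String<br />
data Opaque = Opaque Int<br />
<br />
main :: IO ()<br />
main = do print (Dir 1 "foo")            -- prints: Dir 1 "foo"<br />
          putStrLn (show (Dir 2 "bar"))  -- the same thing, spelled out<br />
          -- print (Opaque 1)            -- rejected: no Show code exists for Opaque<br />
</haskell><br />
<br />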
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for the last one. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
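<br />
Partial application and the backtick syntax are not specific to parsers. Here is a small sketch with plain numbers; the functions "add" and "addTen" are made up for this illustration:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- A hypothetical two-argument function<br />
add :: Int -> Int -> Int<br />
add x y = x + y<br />
<br />
-- Supplying only the first argument yields a new function<br />
addTen :: Int -> Int<br />
addTen = add 10<br />
<br />
main :: IO ()<br />
main = do print (addTen 5)               -- 15<br />
          print (map (add 1) [1, 2, 3])  -- [2,3,4]<br />
          print (10 `add` 5)             -- backticks: the same as add 10 5<br />
</haskell><br />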
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike the C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
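<br />
"Looking at the constructor" is done by pattern matching. A minimal sketch, where the function "describe" is made up for this illustration:<br />
<br />
<haskell><br />
module Main where<br />
<br />
-- One equation per constructor: the tag tells us which type is inside<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
<br />
main :: IO ()<br />
main = do putStrLn (describe (Left 42))    -- an Int: 42<br />
          putStrLn (describe (Right 'x'))  -- a Char: x<br />
</haskell><br />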
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
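<br />
Before wiring "parse" into "main", it may help to run it on a tiny parser. This is a sketch only: the parser "sizeOnly" and the source name "example" are made up, while "parse", "many1", "digit" and the type "Parser" come from the Parsec module we already import.<br />
<br />
<haskell><br />
module Main where<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Hypothetical parser: reads only the leading number of a "du" line<br />
sizeOnly :: Parser Int<br />
sizeOnly = do digits <- many1 digit<br />
              return (read digits)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- The second argument is a source name, used only in error messages<br />
  case parse sizeOnly "example" "65572 /some/dir" of<br />
    Left err -> putStrLn ("parse error: " ++ show err)<br />
    Right n  -> putStrLn ("size is " ++ show n)<br />
  -- Input that does not start with a digit ends up in the Left branch<br />
  case parse sizeOnly "example" "oops" of<br />
    Left _  -> putStrLn "no number at the start of the input"<br />
    Right n -> print n<br />
</haskell><br />
<br />
Nothing here lives in the IO monad except the printing: "parse" itself is an ordinary function from a parser and a string to an "Either".<br />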
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
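A short sketch of the record syntax in use. The directory names and the local names "d" and "e" are made up for this illustration:<br />
<br />
<haskell><br />
module Main where<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do let d = Dir 65572 "dir1"    -- positional construction still works<br />
              e = Dir {dir_name = "dir2", dir_size = 68268}  -- named fields, any order<br />
          print (dir_size d)          -- 65572<br />
          putStrLn (dir_name e)       -- dir2<br />
          print (dir_size d + dir_size e)<br />
</haskell><br />
<br />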
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 MB CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- arguments are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
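<br />
A small worked run may help here. The sketch below repeats the two datatypes and uses a hypothetical variant of the packer, "greedyPackWith", which takes the media size as a parameter (an assumption made only so that the example can use tiny numbers). It sorts the biggest directories first, as the text describes.<br />
<br />
<haskell><br />
module Main where<br />
import Data.List (sortBy)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- Hypothetical variant of greedy_pack with an explicit capacity<br />
greedyPackWith :: Int -> [Dir] -> DirPack<br />
greedyPackWith limit ds = foldl add (DirPack 0 []) (sortBy biggestFirst ds)<br />
  where<br />
    biggestFirst d1 d2 = compare (dir_size d2) (dir_size d1)<br />
    -- keep the pack unchanged when the directory does not fit<br />
    add p d<br />
      | pack_size p + dir_size d > limit = p<br />
      | otherwise = DirPack (pack_size p + dir_size d) (d : dirs p)<br />
<br />
main :: IO ()<br />
main = print (greedyPackWith 10 [Dir 4 "a", Dir 7 "b", Dir 3 "c", Dir 5 "d"])<br />
</haskell><br />
<br />
With a limit of 10, "b" (7) is taken first, "d" (5) and "a" (4) no longer fit, and "c" (3) fills the pack exactly, so the resulting pack holds "c" and "b" with a total size of 10.<br />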
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding case-by-case recursive definition of "greedy_pack", we go with higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"; we applied the data constructor "DirPack" directly.<br />
----<br />
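<br />
To see what "foldl" buys us, here is a minimal, self-contained sketch that writes the same packing loop twice: once with explicit recursion and once with "foldl". Everything in it is made up for illustration (a pack is just a pair of sizes, "mediaSize" is tiny, and the names "packRec", "packFold" and "biggestFirst" do not appear in cd-fit):<br />
<br />
```haskell
import Data.List (sortBy)

-- Hypothetical, simplified setting: a "pack" is (total size, sizes added so far).
type Pack = (Int, [Int])

-- Made-up small capacity to keep the example readable.
mediaSize :: Int
mediaSize = 10

-- Add size "d" to the pack only if the new total still fits.
maybeAdd :: Pack -> Int -> Pack
maybeAdd (total, ds) d
  | total + d > mediaSize = (total, ds)
  | otherwise             = (total + d, d : ds)

-- Explicit case-by-case recursion over the list ...
packRec :: Pack -> [Int] -> Pack
packRec p []       = p
packRec p (d : ds) = packRec (maybeAdd p d) ds

-- ... and the same traversal expressed with foldl.
packFold :: [Int] -> Pack
packFold = foldl maybeAdd (0, [])

-- Biggest first: swapping the arguments of "compare" sorts in descending order.
biggestFirst :: [Int] -> [Int]
biggestFirst = sortBy (\a b -> compare b a)
```
<br />
Both "packRec (0, [])" and "packFold" walk the list left to right and thread the pack through; "foldl" simply captures that pattern once and for all.<br />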
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will show that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': packing the directories that "greedy_pack" has already chosen once more should yield a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
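<br />
A property is nothing more than a function that returns "Bool", so you can also poke at one by hand. The sketch below is self-contained and does not need QuickCheck at all: it repeats the chapter's definitions (with the biggest-first ordering) and checks a second, hypothetical property, "prop_pack_fits", on a few hand-picked inputs. The names "prop_pack_fits", "mb" and "sampleInputs" are inventions for this example:<br />
<br />
```haskell
import Data.List (sortBy)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show
data DirPack = DirPack {pack_size :: Int, dirs :: [Dir]} deriving Show

media_size :: Int
media_size = 700 * 1024 * 1024

greedy_pack :: [Dir] -> DirPack
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) (sortBy cmpSize ds)
  where
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1) -- biggest first

maybe_add_dir :: DirPack -> Dir -> DirPack
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
  in if new_size > media_size then p else DirPack new_size (d : dirs p)

-- Hypothetical extra property: a pack never exceeds the media size, and its
-- recorded size equals the sum of the sizes of the directories in it.
prop_pack_fits :: [Dir] -> Bool
prop_pack_fits ds =
  let pack = greedy_pack ds
  in pack_size pack <= media_size
       && pack_size pack == sum (map dir_size (dirs pack))

-- Megabytes to bytes.
mb :: Int -> Int
mb n = n * 1024 * 1024

-- A few hand-picked inputs standing in for QuickCheck's random ones.
sampleInputs :: [[Dir]]
sampleInputs =
  [ []
  , [Dir (mb 800) "too/big"]
  , [Dir (mb 300) "a", Dir (mb 300) "b", Dir (mb 300) "c"]
  ]
```
<br />
At the ghci prompt, "all prop_pack_fits sampleInputs" checks the hand-picked cases; with the "Arbitrary Dir" instance in scope, the same function could be handed to "quickCheck" just like "prop_greedy_pack_is_fixpoint".<br />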
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is Haskell's way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
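<br />
Here is a minimal sketch of such a contract. The typeclass "HasSize", the type "File" and the function "totalSize" are all made up for this illustration; none of them is part of cd-fit or of any library:<br />
<br />
```haskell
-- The contract: a type is admitted once it says how to measure its values.
class HasSize a where
  sizeOf :: a -> Int

data Dir = Dir {dir_size :: Int, dir_name :: String}

-- "Dir" fulfils the contract by pointing at its size field.
instance HasSize Dir where
  sizeOf = dir_size

-- A second, unrelated type that fulfils the same contract.
newtype File = File Int

instance HasSize File where
  sizeOf (File n) = n

-- Written once against the contract, usable with every instance.
totalSize :: HasSize a => [a] -> Int
totalSize = sum . map sizeOf
```
<br />
"totalSize" never mentions "Dir" or "File"; the constraint "HasSize a" in its signature is all it needs to know about its argument.<br />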
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there are an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be filled in automatically from the defaults. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
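<br />
A small sketch of a minimal definition in action. The type "Disk" and the function "byCapacity" are hypothetical and only serve this example; the instance supplies "compare" alone and gets everything else from the defaults:<br />
<br />
```haskell
import Data.List (sort)

-- A made-up type for this example.
data Disk = Disk {capacity :: Int, label :: String} deriving Show

-- "Ord" requires "Eq"; here two disks count as equal when their capacities are.
instance Eq Disk where
  d1 == d2 = capacity d1 == capacity d2

-- Only "compare" is supplied; (<), (<=), (>), (>=), max and min
-- come from the default implementations in the class.
instance Ord Disk where
  compare d1 d2 = compare (capacity d1) (capacity d2)

-- "sort" needs nothing beyond the "Ord" instance.
byCapacity :: [Disk] -> [Disk]
byCapacity = sort
```
<br />
With this in place, expressions such as "Disk 650 "cd" < Disk 4700 "dvd"" or "max d1 d2" typecheck even though we never wrote "(<)" or "max" ourselves.<br />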
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Side note for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Side note for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
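<br />
A quick sketch of what "deriving (Eq,Ord,Show)" gives you. The helper "sortedNames" is made up for the example; note that the derived "Ord" compares the fields in declaration order, so the size decides first and the name only breaks ties:<br />
<br />
```haskell
import Data.List (sort)

data Dir = Dir {dir_size :: Int, dir_name :: String}
  deriving (Eq, Ord, Show)

-- Sort with the derived ordering (size first, then name) and keep the names.
sortedNames :: [Dir] -> [String]
sortedNames = map dir_name . sort
```
<br />
If you want a different ordering (say, by name only), you would write the "Ord" instance by hand instead of deriving it.<br />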
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any type that is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
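<br />
Since "liftM2" works for any monad, you can try it out without QuickCheck. In the sketch below "Maybe" stands in for "Gen"; the names "myLiftM2", "mkDir" and "mkDir'" are made up, and "myLiftM2" merely repeats the definition from the last bullet under a different name:<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving (Eq, Show)

-- The definition quoted in the list, spelled out with a "do" block.
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r
myLiftM2 f a1 a2 = do
  x <- a1
  y <- a2
  return (f x y)

-- "Maybe" plays the role of "Gen": both arguments must be present
-- for the resulting "Dir" to be present.
mkDir :: Maybe Int -> Maybe String -> Maybe Dir
mkDir = liftM2 Dir

-- The same thing via the hand-written definition.
mkDir' :: Maybe Int -> Maybe String -> Maybe Dir
mkDir' = myLiftM2 Dir
```
<br />
"mkDir (Just 1) (Just "a")" wraps a freshly built "Dir" in "Just", while a "Nothing" in either position makes the whole result "Nothing".<br />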
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its run time,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs<br />
<br />
---- <br />
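<br />
To make the last point concrete, here is a self-contained sketch of that filter on its own. The names "candidates" and "candidates'" are invented for this example; the second version spells out the same condition without composition:<br />
<br />
```haskell
import Data.Ix (inRange)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving (Eq, Show)

-- "inRange (1, limit) . dir_size" first extracts the size of a directory
-- and then checks that it lies between 1 and limit (inclusive).
candidates :: Int -> [Dir] -> [Dir]
candidates limit = filter (inRange (1, limit) . dir_size)

-- The same condition written out by hand.
candidates' :: Int -> [Dir] -> [Dir]
candidates' limit ds = [d | d <- ds, 1 <= dir_size d, dir_size d <= limit]
```
<br />
Reading the composition right to left ("take the size, then test the range") is usually the easiest way to decode such expressions.<br />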
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds its bounds will overflow (typically wrapping around silently rather than raising an error).<br />
The standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 Gb.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
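<br />
If you want to know whether a particular number is safe to store in an 'Int' on your machine, you can ask 'Integer' to do the comparison. A minimal sketch; the names "maxInt", "fitsInInt" and "twoGb" are made up for this example, and the answer for "twoGb" depends on your platform:<br />
<br />
```haskell
-- The largest Int on this platform, widened to Integer so that
-- arithmetic on it cannot overflow.
maxInt :: Integer
maxInt = toInteger (maxBound :: Int)

-- Does this Integer fit into an Int without overflow?
fitsInInt :: Integer -> Bool
fitsInInt n = n >= toInteger (minBound :: Int) && n <= maxInt

-- Two gigabytes, the size at which a signed 32-bit Int gives up.
twoGb :: Integer
twoGb = 2 * 1024 * 1024 * 1024
```
<br />
Try "fitsInInt twoGb" at the ghci prompt to see which kind of 'Int' your installation has.<br />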
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of the typeclass<br />
<hask>Num</hask>, thus we could use <hask>fromInteger</hask> to convert an <hask>Integer</hask> into an <hask>Int</hask> and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but a careful reader will spot at once that when the<br />
expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
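<br />
For the curious, here is one possible way to index a list with an 'Integer' without silently overflowing. It is only a sketch under my own assumptions, not necessarily the route this guide takes: "safeIndex" is a made-up helper, while "genericIndex" is a real function from Data.List that accepts any integral index:<br />
<br />
```haskell
import Data.List (genericIndex)

-- Hypothetical helper: refuse indices that are negative, that do not
-- fit into an Int, or that lie beyond the end of the list.
safeIndex :: [a] -> Integer -> Maybe a
safeIndex xs n
  | n < 0 = Nothing
  | n > toInteger (maxBound :: Int) = Nothing
  | otherwise =
      case drop (fromInteger n) xs of
        (x : _) -> Just x
        []      -> Nothing

-- Library alternative: index directly with an Integer.
-- Like (!!), it is partial and fails on out-of-range indices.
thirdLetter :: Char
thirdLetter = genericIndex "abc" (2 :: Integer)
```
<br />
Neither variant makes the lookup any faster: both still walk the list from its head, just like '(!!)' does.<br />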
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                    MODULE    %time %alloc<br />
<br />
 precomputeDisksFor             Main       88.1   99.8<br />
 dynamic_pack                   Main       11.0    0.0<br />
<br />
                                                        individual     inherited<br />
 COST CENTRE                    MODULE   no. entries  %time %alloc  %time %alloc<br />
<br />
 MAIN                           MAIN       1       0    0.0    0.0  100.0  100.0<br />
 CAF                            Main     174      11    0.9    0.2  100.0  100.0<br />
 prop_dynamic_pack_small_disk   Main     181     100    0.0    0.0   99.1   99.8<br />
 dynamic_pack                   Main     182     200   11.0    0.0   99.1   99.8<br />
 precomputeDisksFor             Main     183     200   88.1   99.8   88.1   99.8<br />
 main                           Main     180       1    0.0    0.0    0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 Mb, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we build far more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
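<br />
The following sketch, which is independent of cd-fit and uses invented names ("boobyTrapped", "lookupWanted"), shows how a lookup with '(!!)' can allocate lots of list elements that are never used. Every element except the requested one would crash the program if it were ever evaluated, yet the lookup succeeds, because '(!!)' only walks past the earlier list cells without looking inside them:<br />
<br />
```haskell
-- An infinite list in which only the element at position "wanted"
-- is safe to evaluate; all the others are booby-trapped.
boobyTrapped :: Int -> [Int]
boobyTrapped wanted =
  map (\i -> if i == wanted then i * 2 else error "evaluated!") [0 ..]

-- Walks past "n" unevaluated elements and evaluates only the last one.
lookupWanted :: Int -> Int
lookupWanted n = boobyTrapped n !! n
```
<br />
The skipped elements still had to be created as unevaluated "thunks" along the way, and that is the kind of memory a biographical profile reports as "VOID".<br />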
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that any given element<br />
will not be needed again and could therefore be garbage collected.<br />
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of 700! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably large test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time, we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes may still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
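<br />
For readers who want a head start on that exercise, here is a minimal sketch of the size conversion only. The names <tt>scaleUnit</tt> and <tt>scaleDown</tt> and the standalone <hask>Dir</hask> record are assumptions made up for this illustration, not code from the darcs repository. Rounding directory sizes up and the limit down is also my own choice: it guarantees that a packing which fits in scaled units still fits in bytes.<br />
<br />
<haskell><br />
-- Hypothetical stand-in for the tutorial's directory type.<br />
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)<br />
<br />
-- The scaling unit: the size of the smallest non-empty directory,<br />
-- or 1 if there is no such directory.<br />
scaleUnit :: [Dir] -> Int<br />
scaleUnit ds = case filter (> 0) (map dir_size ds) of<br />
                 [] -> 1<br />
                 ss -> minimum ss<br />
<br />
-- Express the limit and all directory sizes in multiples of `unit'.<br />
-- Sizes are rounded up and the limit is rounded down.<br />
scaleDown :: Int -> Int -> [Dir] -> (Int, [Dir])<br />
scaleDown unit limit ds =<br />
  ( limit `div` unit<br />
  , [ d { dir_size = (dir_size d + unit - 1) `div` unit } | d <- ds ] )<br />
</haskell><br />
<br />
The scaled limit and directories could then presumably be fed to <tt>dynamic_pack</tt>; the price is some rounding slack, so the result may be slightly less tight than the unscaled one.<br />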
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We have already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (see Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite a bit of progress with Haskell, it's time we<br />
revisited monads once again. I will let other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
, ( "xmlns", Left "jabber:client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
, ( "lang", Left "en" )<br />
, ( "xmlns", Right ["ns","subns"] )<br />
, ( "ns", Left "jabber" )<br />
, ( "subns", Left "client" )<br />
, ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of the attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile code already, we use the function 'undefined' to<br />
-- provide default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
-- First, we lookup 'Maybe AttValue' by name and<br />
-- check whether we are successful:<br />
case (lookup nm attrs) of<br />
-- Pass the lookup error through.<br />
Nothing -> Nothing <br />
-- If the given name exists, see if it is a value or a reference:<br />
Just attv -> case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
-- .. then, we will exclude lookup failures<br />
wo_failures = filter (/=Nothing) vals<br />
-- ... find a way to remove annoying 'Just' wrapper<br />
stripJust (Just v) = v<br />
-- ... use it to extract all lookup results as strings<br />
strings = map stripJust wo_failures<br />
in<br />
-- ... finally, combine them into single String. <br />
-- If all lookups failed, we should pass failure to caller.<br />
case null strings of<br />
True -> Nothing<br />
False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (look up a value, look up a reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that takes possible failures into<br />
account.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
-- First, we lookup 'AttValue' by name<br />
attv <- lookup nm attrs<br />
-- See if it is a value or a reference:<br />
case attv of<br />
-- It's a value. Return it!<br />
Left val -> Just val<br />
-- It's a list of references :(<br />
-- We have to look them up, accounting for<br />
-- possible failures.<br />
-- First, we will perform lookup of all references ...<br />
Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
-- ... since all failures are already excluded by "monad magic",<br />
-- ... and all 'Just's have been removed likewise,<br />
-- ... we just combine values into single String,<br />
-- ... and return failure if it is empty. <br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Did you notice the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- almost a two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existent attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
has to happen when we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that the<br />
type constructor <hask>Maybe</hask> is a monad in which we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
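<br />
The check that happens between actions can also be spelled out by hand. The following is a minimal sketch, not the library source: <tt>andThen</tt> is a name invented here that does for <hask>Maybe</hask> what the real operator <hask>>>=</hask> does, and <tt>foo</tt> repeats the first ghci example above without "do".<br />
<br />
<haskell><br />
-- Hand-written equivalent of (>>=) specialised to Maybe.<br />
-- `andThen' is a hypothetical name used only for this illustration.<br />
andThen :: Maybe a -> (a -> Maybe b) -> Maybe b<br />
andThen Nothing  _ = Nothing   -- failure: skip everything that follows<br />
andThen (Just v) f = f v       -- success: unwrap the value and continue<br />
<br />
-- The same `foo' as in the ghci session, written without "do".<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = x `andThen` \v -> Just (v + 1)<br />
</haskell><br />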
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we have already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type<br />
<hask>Either Value [Reference]</hask>. Surely we are not the first to do this, and such<br />
a use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
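<br />
If the type signature still looks scary, it may help to see how such a function could be written. This is a sketch under the name <tt>either'</tt> (chosen here to avoid clashing with the Prelude), not a copy of the library source:<br />
<br />
<haskell><br />
-- Apply the first function to a Left value and the second to a Right value.<br />
either' :: (a -> c) -> (b -> c) -> Either a b -> c<br />
either' f _ (Left x)  = f x<br />
either' _ g (Right y) = g y<br />
</haskell><br />
<br />
Both handlers must produce the same result type, which is why the two cases can be merged into one return value.<br />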
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
attv <- lookup nm attrs<br />
either Just (dereference attrs) attv<br />
where<br />
dereference attrs refs = do <br />
vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
guard (not (null vals))<br />
return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. Similar behavior can be<br />
observed for the IO monad, where execution stops at the first failure. Take into account that these different behaviors are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
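<br />
To see why nothing needs to be hardcoded, here is one way such a function could be written using only "do" and <hask>return</hask>. It is a sketch named <tt>sequence'</tt>, not the actual library definition; the stopping behavior comes entirely from the monad it is used with.<br />
<br />
<haskell><br />
-- Run the actions from left to right and collect their results.<br />
-- Nothing here mentions Maybe, IO or failure.<br />
sequence' :: Monad m => [m a] -> m [a]<br />
sequence' []     = return []<br />
sequence' (m:ms) = do<br />
  x  <- m<br />
  xs <- sequence' ms<br />
  return (x : xs)<br />
</haskell><br />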
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
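<br />
For the <hask>Maybe</hask> monad, the effect of <hask>guard</hask> can be imitated with a two-line function. The name <tt>guardMaybe</tt> is made up for this sketch; the real <hask>guard</hask> from Control.Monad is more general and works for other types too.<br />
<br />
<haskell><br />
-- A Maybe-only imitation of guard: succeed with a dummy value,<br />
-- or fail and thereby cut the rest of the "do" block short.<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()<br />
guardMaybe False = Nothing<br />
<br />
-- The same `foo' as in the ghci session above.<br />
foo :: Maybe Int -> Maybe Int<br />
foo x = do<br />
  v <- x<br />
  guardMaybe (v /= 5)<br />
  return (v + 1)<br />
</haskell><br />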
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Programming:Haskell_monads this wikibook]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
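<br />
The snippets above use placeholder names, so here is a small version that can actually be loaded and compared. It is only a sketch: it uses the <hask>Maybe</hask> monad instead of IO so that both results can be checked for equality, and <tt>withDo</tt> and <tt>withBind</tt> are names invented for this illustration.<br />
<br />
<haskell><br />
-- Written with "do" ...<br />
withDo :: [(String, Int)] -> Maybe Int<br />
withDo env = do<br />
  a <- lookup "a" env<br />
  b <- lookup "b" env<br />
  return (a + b)<br />
<br />
-- ... and the same computation written with ">>=" and lambdas.<br />
withBind :: [(String, Int)] -> Maybe Int<br />
withBind env =<br />
  lookup "a" env >>= \a -><br />
  lookup "b" env >>= \b -><br />
  return (a + b)<br />
</haskell><br />
<br />
Both functions should give the same answer for any input, including inputs where one of the lookups fails.<br />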
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=4242Hitchhikers guide to Haskell2006-06-05T22:33:00Z<p>Adept: Chapter 5 added</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the haskell code you are going to write in the<br />
nearest future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contain the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
decide how to fit them on CD-Rs<br />
print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
putStrLn ("DEBUG: got input " ++ input)<br />
-- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here line-by-line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
___ ___ _<br />
/ _ \ /\ /\/ __(_)<br />
/ /_\// /_/ / / | | GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_| Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
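<br />
The difference between naming an action and executing it can be observed directly. The sketch below is my own illustration, not code from the repository: it uses a mutable counter from Data.IORef, and <tt>tick</tt> and <tt>demo</tt> are invented names.<br />
<br />
<haskell><br />
import Data.IORef<br />
<br />
-- An action that increments a counter and returns its new value.<br />
tick :: IORef Int -> IO Int<br />
tick ref = do<br />
  modifyIORef ref (+1)<br />
  readIORef ref<br />
<br />
demo :: IO (Int, Int, Int)<br />
demo = do<br />
  ref <- newIORef 0<br />
  let x = tick ref          -- nothing runs here; x is just a name for the action<br />
  before <- readIORef ref   -- the counter has not been touched yet<br />
  first  <- x               -- executes the action once<br />
  second <- x               -- executes the same action again<br />
  return (before, first, second)<br />
</haskell><br />
<br />
The "let" line leaves the counter alone, while each use of "<-" runs the action anew.<br />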
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
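A minimal sketch of that idea follows. The helper names <tt>runAll</tt> and <tt>demoOrder</tt> are made up for this illustration, and the actions record their order in an IORef so that the result can be inspected. Since the actions are ordinary list elements, we are free to reverse the list before running it.<br />
<br />
<haskell><br />
import Data.IORef<br />
<br />
-- Execute a list of actions one by one, in list order.<br />
runAll :: [IO ()] -> IO ()<br />
runAll []     = return ()<br />
runAll (a:as) = do<br />
  a<br />
  runAll as<br />
<br />
demoOrder :: IO [String]<br />
demoOrder = do<br />
  logRef <- newIORef []<br />
  let say s   = modifyIORef logRef (++ [s])<br />
      actions = [say "first", say "second", say "third"]<br />
  runAll (reverse actions)  -- building the list ran nothing; order is decided here<br />
  readIORef logRef<br />
</haskell><br />
<br />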
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
c<br />
putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
do before<br />
putStrLn "In the middle"<br />
after<br />
<br />
main = do combine c c<br />
let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
let d = combine (b) (combine c c)<br />
putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that a tabstop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly specify the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide the commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]", which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC, to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser descriptions. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
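<br />
To see that "do" really is not tied to IO, here is a minimal sketch that uses it with the standard "Maybe" type, which is also a monad. The names "safeDiv" and "calc" are made up for this illustration and are not part of cd-fit.hs:<br />
<br />
```haskell
-- Hypothetical helper: division that fails (returns Nothing) on a zero divisor
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- The same "do" notation we used for IO, now combining Maybe actions.
-- If any step yields Nothing, the whole block yields Nothing.
calc :: Int -> Int -> Int -> Maybe Int
calc a b c =
  do q <- safeDiv a b
     r <- safeDiv q c
     return (q + r)
```
<br />
In ghci, "calc 100 5 2" should give "Just 30", while "calc 1 0 2" gives "Nothing": the failed division aborts the rest of the block without any explicit error checking on our side.<br />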
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexity from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc.) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of a data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the scenes" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* Examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* Compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for "manyTill anyChar newline". Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
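<br />
As a warm-up for these exercises, here is a small sketch of partial application and of the backtick syntax on an ordinary function. The function "add" and the names "addFive" and "seven" are hypothetical and exist only for this illustration:<br />
<br />
```haskell
-- A function of two arguments (hypothetical, for illustration only)
add :: Int -> Int -> Int
add x y = x + y

-- Supplying only one argument yields a new function of type Int -> Int
addFive :: Int -> Int
addFive = add 5

-- Backticks turn any two-argument function into an infix operator:
-- "3 `add` 4" means exactly the same as "add 3 4"
seven :: Int
seven = 3 `add` 4
```
<br />
Asking ghci for ":t add", ":t add 5" and ":t add 5 10" shows the type shrinking by one argument at each step, which is the same effect you will observe with "manyTill".<br />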
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will produce either a parse error or a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names that differ<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
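<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch, in which the function "describe" is a hypothetical name used only here:<br />
<br />
```haskell
-- Hypothetical function: inspect the constructor to learn which
-- alternative of the tagged union we are holding
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ [c]
```
<br />
Here "describe (Left 42)" evaluates to "an Int: 42" and "describe (Right 'x')" to "a Char: x". The compiler knows which branch applies, so there is no way to misread an Int as a Char.<br />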
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
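<br />
To watch the evaluator at work outside of cd-fit.hs, here is a small self-contained sketch. The parser "number" and the values "good" and "bad" are hypothetical names for this illustration; "many1", "digit", "parse" and the type synonym "Parser" come from the Parsec module we already import:<br />
<br />
```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical parser: one or more digits, converted to an Int
number :: Parser Int
number = do ds <- many1 digit
            return (read ds)

-- "parse" runs the parser on a plain String and hands back an Either:
-- Right with the result on success, Left with a ParseError on failure
good, bad :: Either ParseError Int
good = parse number "example" "123"
bad  = parse number "example" "abc"
```
<br />
Evaluating "good" in ghci should show a "Right" value holding 123, while "bad" should show a "Left" value describing what was expected at line 1, column 1.<br />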
<br />
Let's extend our "main" function to use "parse" to actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, skipping those for which there is no room left. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Arguments are compared in reverse, so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We chose to import a single function, "sortBy", from the module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort the list of "Dir" by size only, we use a custom comparison function and the parameterized sort "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in the function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs".<br />
----<br />
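<br />
If "foldl" and "sortBy" are new to you, the following sketch shows both on plain numbers and pairs before you look them up. The names "bySizeDesc" and "sumUpTo" are hypothetical and are not part of cd-fit.hs:<br />
<br />
```haskell
import Data.List (sortBy)

-- Sort pairs by their first component, largest first, ignoring the second
bySizeDesc :: [(Int, String)] -> [(Int, String)]
bySizeDesc = sortBy cmp
  where cmp (s1, _) (s2, _) = compare s2 s1

-- foldl threads an accumulator through the list from left to right;
-- here the accumulator is a running total that never exceeds "limit"
sumUpTo :: Int -> [Int] -> Int
sumUpTo limit = foldl step 0
  where step acc x = if acc + x > limit then acc else acc + x
```
<br />
For example, "sumUpTo 10 [6,5,3,2]" takes 6, skips 5 (the total would be 11), takes 3 and skips 2, giving 9. This is the same skip-or-add decision that "maybe_add_dir" makes for directories.<br />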
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a reusable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise", let's show the code for testing the following ''property'': packing the directories returned by "greedy_pack" once more should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  -- (This is the QuickCheck 1 API. In QuickCheck 2, "coarbitrary" lives in a<br />
  -- separate class "CoArbitrary", and this line must be left out.)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long (n*10+1), consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of the '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your functions must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement a certain ''interface'', which is<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
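<br />
To make the idea of a contract concrete, here is a minimal sketch of a home-made typeclass. The class "HasSize", its method "size" and the function "fitsOn" are hypothetical and are not part of cd-fit.hs; only "Dir" and "DirPack" are borrowed from our program:<br />
<br />
```haskell
data Dir = Dir {dir_size::Int, dir_name::String}
data DirPack = DirPack {pack_size::Int, dirs::[Dir]}

-- Hypothetical contract: anything that can report its size in bytes
class HasSize a where
  size :: a -> Int

-- Each type shows how it fulfils the contract
instance HasSize Dir where
  size = dir_size

instance HasSize DirPack where
  size = pack_size

-- Works for every instance of HasSize, present or future
fitsOn :: HasSize a => Int -> a -> Bool
fitsOn capacity x = size x <= capacity
```
<br />
The function "fitsOn" never mentions "Dir" or "DirPack"; it relies only on the contract. Any type that later becomes an instance of "HasSize" can be passed to it without changing a line of its code.<br />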
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there are an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for several of its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for a minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
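<br />
Here is a sketch of such a minimal definition. The type "Size" and the helper "rank" are hypothetical and serve only this illustration:<br />
<br />
```haskell
-- Hypothetical type used only for this illustration
data Size = Small | Medium | Large deriving (Eq, Show)

-- Hypothetical helper that maps each size to a number
rank :: Size -> Int
rank Small  = 0
rank Medium = 1
rank Large  = 2

-- Only "compare" is defined; (<), (<=), (>), (>=), max and min
-- come from the default implementations in the class
instance Ord Size where
  compare a b = compare (rank a) (rank b)
```
<br />
With just this one function in place, expressions such as "Small < Large" and "max Medium Large" work as expected.<br />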
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable, to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
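<br />
A quick sketch of what the derived instances buy us. The value "sorted" is a hypothetical name, and the "Dir" definition below differs from the one in cd-fit.hs only in its longer "deriving" list:<br />
<br />
```haskell
import Data.List (sort)

-- The same record as in cd-fit.hs, but with more derived instances
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq, Ord, Show)

-- Derived Ord compares fields left to right: first dir_size, then dir_name
sorted :: [Dir]
sorted = sort [Dir 20 "b", Dir 10 "z", Dir 20 "a"]
```
<br />
The directory of size 10 comes first, and the two directories of size 20 are then ordered by name, because the derived comparison falls back to the second field only when the first fields are equal.<br />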
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives the code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
  -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 11 to 3001 chars long (n*10+1), consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word for it. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
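<br />
The notes above can be tried out without QuickCheck at all. The sketch below swaps in the "Maybe" monad for "Gen" (an assumption made purely so that the example runs on its own), and the names "viaLift" and "viaDo" are hypothetical:<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir Int String deriving (Eq, Show)

-- liftM2 applies the two-argument constructor "inside" the monad...
viaLift :: Maybe Int -> Maybe String -> Maybe Dir
viaLift = liftM2 Dir

-- ...which is shorthand for this explicit "do" block
viaDo :: Maybe Int -> Maybe String -> Maybe Dir
viaDo ms mn = do s <- ms
                 n <- mn
                 return (Dir s n)
```
<br />
Both functions give "Just (Dir 1 "foo")" for the arguments "Just 1" and "Just "foo"", and both give "Nothing" as soon as either argument is "Nothing".<br />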
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare packing methods efficiency, and learn something new<br />
about debugging and profiling of the Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write an implementation of it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse the modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent fixed-size integer, which imposes certain<br />
limitations on the values of this type. In GHC, an attempt to compute a value<br />
of type 'Int' that exceeds the bounds silently overflows (wraps around) instead of raising an error.<br />
The standard Haskell libraries have a special typeclass,<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
'Int' is a so-called "signed 32-bit integer" (on a 64-bit platform GHC<br />
uses a 64-bit 'Int', so your bounds may be larger), which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 GB.<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
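<br />
To make the difference between the two types concrete, here is a minimal sketch that can be compiled as a standalone program. It is not taken from the 'cd-fit' sources, and it assumes GHC, where 'Int' arithmetic wraps around on overflow (the Haskell Report itself leaves the result of an overflow undefined):<br />
<br />
<haskell><br />
-- A standalone illustration, not part of the 'cd-fit' sources.<br />
main :: IO ()<br />
main = do<br />
  -- Bounds of the platform-dependent 'Int' (32-bit and 64-bit systems differ)<br />
  print (minBound :: Int, maxBound :: Int)<br />
  -- Assumption: GHC. Stepping past the upper bound wraps around to the lower bound<br />
  print ((maxBound :: Int) + 1 == minBound)<br />
  -- 'Integer' has no bounds, so adding one always gives a bigger number<br />
  print (toInteger (maxBound :: Int) + 1 > toInteger (maxBound :: Int))<br />
  -- A value that fits neither a 32-bit nor a 64-bit 'Int'<br />
  print (2 ^ 100 :: Integer)<br />
</haskell><br />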
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not<br />
an 'Integer'. Haskell never converts a value of one type to another type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is a member of typeclass<br />
<hask>Num</hask>, thus we can use <hask>fromInteger</hask> to turn an<br />
<hask>Integer</hask> into an <hask>Int</hask> and make the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                          , d `notElem` ds<br />
        ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when<br />
the expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, an overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
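<br />
If the conversion to 'Int' bothers you, one possible alternative (not the route this tutorial takes) is <hask>genericIndex</hask> from Data.List, which accepts an index of any integral type, <hask>Integer</hask> included. The sketch below uses a made-up list called <hask>squares</hask> just to have something to index into:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- A made-up infinite list, for illustration only.<br />
squares :: [Integer]<br />
squares = [ n * n | n <- [0 ..] ]<br />
<br />
main :: IO ()<br />
main = do<br />
  let i = 12 :: Integer<br />
  -- (!!) wants an 'Int', so the index has to be converted explicitly<br />
  print (squares !! fromInteger i)<br />
  -- 'genericIndex' takes the 'Integer' index as it is<br />
  print (squares `genericIndex` i)<br />
</haskell><br />
<br />
Both lines print 144. Note that <hask>genericIndex</hask> still walks the list element by element, so it removes the overflow, not the cost of the lookup.<br />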
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in  pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behaviour.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in  pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in  pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent versions of GHC spell the <tt>-auto-all</tt> flag <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                          , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can never be sure that an element<br />
will not be needed again and can therefore be garbage-collected.<br />
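<br />
The effect can be reproduced in miniature. The sketch below is not taken from the 'cd-fit' sources: it builds a made-up list called <hask>table</hask> whose elements are defined in terms of earlier elements of the same list, much like "precomp" refers back to itself. Since any element may be requested again through <hask>(!!)</hask>, the whole evaluated prefix of the list stays in memory for as long as <hask>table</hask> itself is reachable:<br />
<br />
<haskell><br />
-- Each element is computed from two earlier elements of the same list,<br />
-- so the list acts as a lazily filled lookup table.<br />
table :: [Integer]<br />
table = map entry [0 ..]<br />
  where<br />
    entry :: Int -> Integer<br />
    entry 0 = 0<br />
    entry 1 = 1<br />
    entry n = table !! (n - 1) + table !! (n - 2)<br />
<br />
main :: IO ()<br />
main = print (table !! 50)   -- the 50th Fibonacci number, 12586269025<br />
</haskell><br />
<br />
This is handy when the table is small, and exactly what hurts us when it is supposed to have several hundred million entries.<br />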
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of roughly 640! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are not lucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
                                          , d <- small_enough<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
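<br />
One possible starting point for the scaling exercise is sketched below. It is only my guess at how the scaling could be done, not a finished solution: the helper names <hask>scale</hask> and <hask>ceilDiv</hask> are made up, directory sizes are rounded up and the medium size is rounded down, so that any packing found for the scaled problem still fits on the real disk (at the price of possibly missing the very best packing):<br />
<br />
<haskell><br />
-- Same definition as in 'cd-fit-4-2.hs', repeated to keep the sketch self-contained.<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
<br />
-- Hypothetical helper: integer division that rounds up instead of down.<br />
ceilDiv :: Integer -> Integer -> Integer<br />
ceilDiv a b = (a + b - 1) `div` b<br />
<br />
-- Hypothetical helper: express the medium size and all directory sizes<br />
-- in units of the smallest non-empty directory.<br />
scale :: Integer -> [Dir] -> (Integer, [Dir])<br />
scale limit dirs<br />
  | null sizes = (limit, dirs)<br />
  | otherwise  = ( limit `div` unit<br />
                 , [ d { dir_size = dir_size d `ceilDiv` unit } | d <- dirs ] )<br />
  where<br />
    sizes = filter (> 0) (map dir_size dirs)<br />
    unit  = minimum sizes<br />
<br />
main :: IO ()<br />
main = print (scale 700 [Dir 100 "a", Dir 250 "b", Dir 330 "c"])<br />
</haskell><br />
<br />
For the sample input the unit is 100, so the medium shrinks to 7 units and the directories to 1, 3 and 4 units. You would then run <tt>dynamic_pack</tt> on the scaled values and map the chosen directories back to the original ones by name.<br />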
<br />
== Chapter 5: (Ab)using monads and destructing constructors for fun and profit ==<br />
<br />
We already mentioned monads quite a few times. They are described in<br />
numerous articles and tutorials (See Chapter 400). It's hard to read a<br />
daily dose of any Haskell mailing list and not to come across the word<br />
"monad" a dozen times.<br />
<br />
Since we have already made quite some progress with Haskell, it's time we<br />
revisit the monads once again. I will let the other sources teach you<br />
the theory behind monads, the overall usefulness of the concept, etc.<br />
Instead, I will focus on providing you with examples.<br />
<br />
Let's take a part of a real-world program which involves XML<br />
processing. We will work with XML tag attributes, which are<br />
essentially named values:<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
type Attribute = (Name, AttValue)<br />
</haskell><br />
<br />
'Name' is a plain string, and the value could be '''either''' a string or<br />
a list of references (also strings) to other attributes which hold the actual<br />
value (now, this is not a valid XML thing, but for the sake of<br />
providing a nice example, let's accept this). The word "either" suggests<br />
that we use the 'Either' datatype:<br />
<haskell><br />
type AttValue = Either Value [Reference]<br />
type Name = String<br />
type Value = String<br />
type Reference = String<br />
<br />
-- Sample list of simple attributes:<br />
simple_attrs = [ ( "xml:lang", Left "en" )<br />
               , ( "xmlns", Left "jabber:client" )<br />
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
<br />
-- Sample list of attributes with references:<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]<br />
</haskell><br />
<br />
'''Our task is:''' to write a function that will look up the value of an<br />
attribute by its name in the given list of attributes. When an<br />
attribute contains reference(s), we resolve them (looking for the<br />
referenced attribute in the same list) and concatenate their values,<br />
separated by a colon. Thus, lookup of attribute "xmlns" from both<br />
sample sets of attributes should return the same value.<br />
<br />
Following the example set by the <hask>Data.List.lookup</hask> from<br />
the standard libraries, we will call our function<br />
<hask>lookupAttr</hask> and it will return <hask>Maybe Value</hask>,<br />
allowing for lookup errors:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
-- Since we don't have code for 'lookupAttr' yet, but want<br />
-- to compile the code already, we use the function 'undefined' to<br />
-- provide a default, "always-fail-with-runtime-error" function body.<br />
lookupAttr = undefined<br />
</haskell><br />
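<br />
As a quick reminder of what <hask>lookup</hask> does with a list of key/value pairs, here is a tiny standalone sketch (the list <hask>table</hask> is made up for illustration):<br />
<br />
<haskell><br />
main :: IO ()<br />
main = do<br />
  -- A made-up association list<br />
  let table = [ ("lang", "en"), ("ns", "jabber") ]<br />
  -- A key that is present yields its value wrapped in 'Just'<br />
  print (lookup "ns" table)<br />
  -- A key that is absent yields 'Nothing'<br />
  print (lookup "missing" table)<br />
</haskell><br />
<br />
The first line prints <hask>Just "jabber"</hask>, the second one prints <hask>Nothing</hask>.<br />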
<br />
Let's try to code <hask>lookupAttr</hask> using <hask>lookup</hask> in<br />
a very straightforward way:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-1.hs'<br />
import Data.List<br />
<br />
lookupAttr :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr nm attrs = <br />
  -- First, we lookup 'Maybe AttValue' by name and<br />
  -- check whether we are successful:<br />
  case (lookup nm attrs) of<br />
    -- Pass the lookup error through.<br />
    Nothing   -> Nothing <br />
    -- If given name exists, see if it is a value or a reference:<br />
    Just attv -> case attv of<br />
      -- It's a value. Return it!<br />
      Left val -> Just val<br />
      -- It's a list of references :(<br />
      -- We have to look them up, accounting for<br />
      -- possible failures.<br />
      -- First, we will perform lookup of all references ...<br />
      Right refs -> let vals = [ lookupAttr ref attrs | ref <- refs ]<br />
                        -- .. then, we will exclude lookup failures<br />
                        wo_failures = filter (/=Nothing) vals<br />
                        -- ... find a way to remove annoying 'Just' wrapper<br />
                        stripJust (Just v) = v<br />
                        -- ... use it to extract all lookup results as strings<br />
                        strings = map stripJust wo_failures<br />
                    in<br />
                        -- ... finally, combine them into single String. <br />
                        -- If all lookups failed, we should pass failure to caller.<br />
                        case null strings of<br />
                          True  -> Nothing<br />
                          False -> Just (concat (intersperse ":" strings))<br />
</haskell><br />
<br />
Testing:<br />
<br />
*Main> lookupAttr "xmlns" complex_attrs<br />
Just "jabber:client"<br />
*Main> lookupAttr "xmlns" simple_attrs<br />
Just "jabber:client"<br />
*Main><br />
<br />
It works, but ... It seems strange that such a boatload of code is<br />
required for quite a simple task. If you examine the code closely,<br />
you'll see that the code bloat is caused by:<br />
<br />
* the fact that after each step we check whether an error occurred<br />
<br />
* unwrapping Strings from <hask>Maybe</hask> and <hask>Either</hask> data constructors and wrapping them back.<br />
<br />
At this point C++/Java programmers would say that since we just pass<br />
errors upstream, all those cases could be replaced by the single "try<br />
... catch ..." block, and they would be right. Does this mean that<br />
Haskell programmers are reduced to using "case"s, which were already<br />
obsolete 10 years ago?<br />
<br />
Monads to the rescue! As you can read elsewhere (see Chapter 400),<br />
monads are used in advanced ways to construct computations from other<br />
computations. Just what we need - we want to combine several simple<br />
steps (lookup value, lookup reference, ...) into the function<br />
<hask>lookupAttr</hask> in a way that would take into account possible<br />
failures.<br />
<br />
Let's start with the code and dissect it afterwards:<br />
<haskell><br />
-- Taken from 'chapter5-2.hs'<br />
import Control.Monad<br />
<br />
lookupAttr' nm attrs = do<br />
  -- First, we lookup 'AttValue' by name<br />
  attv <- lookup nm attrs<br />
  -- See if it is a value or a reference:<br />
  case attv of<br />
    -- It's a value. Return it!<br />
    Left val -> Just val<br />
    -- It's a list of references :(<br />
    -- We have to look them up, accounting for<br />
    -- possible failures.<br />
    -- First, we will perform lookup of all references ...<br />
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs<br />
                     -- ... since any failure has already been propagated by "monad magic",<br />
                     -- ... and all 'Just's have been removed likewise,<br />
                     -- ... we just combine values into single String,<br />
                     -- ... and return failure if it is empty. <br />
                     guard (not (null vals))<br />
                     return (concat (intersperse ":" vals))<br />
</haskell><br />
<br />
'''Exercise''': compile the code, test that <hask>lookupAttr</hask><br />
and <hask>lookupAttr'</hask> really behave in the same way. Try to<br />
write a QuickCheck test for that, defining the <br />
<hask>instance Arbitrary Name</hask> such that arbitrary names will be taken from<br />
names available in <hask>simple_attrs</hask>.<br />
<br />
Well, back to the story. Noticed the drastic reduction in code size?<br />
If you drop the comments, the code will occupy a mere 7 lines instead of 13<br />
- an almost two-fold reduction. How did we achieve this?<br />
<br />
First, notice that we never ever check whether some computation<br />
returns <hask>Nothing</hask> anymore. Yet, try to look up some<br />
non-existing attribute name, and <hask>lookupAttr'</hask> will return<br />
<hask>Nothing</hask>. How does this happen? The secret lies in the fact<br />
that the type constructor <hask>Maybe</hask> is a "monad".<br />
<br />
We use the keyword <hask>do</hask> to indicate that the following block of<br />
code is a sequence of '''monadic actions''', where '''monadic magic'''<br />
happens whenever we use '<-', 'return' or move from one action to<br />
another.<br />
<br />
Different monads have different '''magic'''. Library code says that<br />
the type constructor <hask>Maybe</hask> is such a monad that we can use<br />
<hask><-</hask> to "extract" values from the wrapper <hask>Just</hask> and<br />
use <hask>return</hask> to put them back in the form of<br />
<hask>Just some_value</hask>. When we move from one action in the "do" block to<br />
another, a check happens. If the action returned <hask>Nothing</hask>,<br />
all subsequent computations will be skipped and the whole "do" block<br />
will return <hask>Nothing</hask>.<br />
<br />
Try this to understand it all better:<br />
<haskell><br />
*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)<br />
Just 6<br />
*Main> let foo x = do v <- x; return (v+1) in foo Nothing <br />
Nothing<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')<br />
Just 97<br />
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing <br />
Nothing<br />
*Main> <br />
</haskell><br />
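<br />
What happens behind the scenes can also be spelled out by hand. The sketch below (with made-up function names) shows the same function three times: written with "do", written with the <hask>>>=</hask> operator that "do" is translated into, and written with the explicit <hask>case</hask> that <hask>>>=</hask> and <hask>return</hask> amount to for <hask>Maybe</hask>:<br />
<br />
<haskell><br />
-- The "do" version, as in the ghci session above.<br />
addOne :: Maybe Int -> Maybe Int<br />
addOne x = do<br />
  v <- x<br />
  return (v + 1)<br />
<br />
-- The same thing after the "do" sugar is removed.<br />
addOneBind :: Maybe Int -> Maybe Int<br />
addOneBind x = x >>= \v -> return (v + 1)<br />
<br />
-- What '>>=' and 'return' amount to for 'Maybe'.<br />
addOneCase :: Maybe Int -> Maybe Int<br />
addOneCase x = case x of<br />
  Nothing -> Nothing<br />
  Just v  -> Just (v + 1)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (map addOne     [Just 5, Nothing])<br />
  print (map addOneBind [Just 5, Nothing])<br />
  print (map addOneCase [Just 5, Nothing])<br />
</haskell><br />
<br />
All three lines print <hask>[Just 6,Nothing]</hask>.<br />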
<br />
Do not mind <hask>sequence</hask> and <hask>guard</hask> just for now<br />
- we will get to them in a little while.<br />
<br />
Since we already removed one reason for code bloat, it is time to deal<br />
with the other one. Notice that we have to use <hask>case</hask> to<br />
'''deconstruct''' the value of type <hask>Either Value [Reference]</hask>.<br />
Surely we are not the first to do this, and such a<br />
use case has to be quite a common one. <br />
<br />
Indeed, there is a simple remedy for our case, and it is called<br />
<hask>either</hask>:<br />
<br />
*Main> :t either<br />
either :: (a -> c) -> (b -> c) -> Either a b -> c<br />
<br />
Scary type signature, but here are examples to help you grok it:<br />
<br />
*Main> :t either (+1) (length) <br />
either (+1) (length) :: Either Int [a] -> Int<br />
*Main> either (+1) (length) (Left 5)<br />
6<br />
*Main> either (+1) (length) (Right "foo")<br />
3<br />
*Main> <br />
<br />
Seems like this is exactly our case. Let's replace the<br />
<hask>case</hask> with invocation of <hask>either</hask>:<br />
<br />
<haskell><br />
-- Taken from 'chapter5-3.hs'<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do <br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
</haskell><br />
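<br />
If you do not have the source files at hand, the pieces of this chapter can be assembled into one standalone program, as sketched below. The type signatures, the shortened sample list, the <hask>main</hask> function and the extra attribute "broken" (which holds a dangling reference) are my additions for illustration; the body of <hask>lookupAttr&#39;&#39;</hask> is the one shown above:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
import Data.List (intersperse)<br />
<br />
type Attribute = (Name, AttValue)<br />
type AttValue  = Either Value [Reference]<br />
type Name      = String<br />
type Value     = String<br />
type Reference = String<br />
<br />
-- A shortened sample list; "broken" is an added, hypothetical entry.<br />
complex_attrs :: [Attribute]<br />
complex_attrs = [ ( "xml:lang", Right ["lang"] )<br />
                , ( "lang", Left "en" )<br />
                , ( "xmlns", Right ["ns","subns"] )<br />
                , ( "ns", Left "jabber" )<br />
                , ( "subns", Left "client" )<br />
                , ( "broken", Right ["ns","nowhere"] ) ]<br />
<br />
lookupAttr'' :: Name -> [Attribute] -> Maybe Value<br />
lookupAttr'' nm attrs = do<br />
    attv <- lookup nm attrs<br />
    either Just (dereference attrs) attv<br />
  where<br />
    dereference attrs refs = do<br />
      vals <- sequence $ map (flip lookupAttr'' attrs) refs<br />
      guard (not (null vals))<br />
      return (concat (intersperse ":" vals))<br />
<br />
main :: IO ()<br />
main = do<br />
  print (lookupAttr'' "xmlns" complex_attrs)         -- Just "jabber:client"<br />
  print (lookupAttr'' "xml:lang" complex_attrs)      -- Just "en"<br />
  print (lookupAttr'' "no-such-attr" complex_attrs)  -- Nothing<br />
  print (lookupAttr'' "broken" complex_attrs)        -- Nothing<br />
</haskell><br />
<br />
The last line is worth a second look: a single dangling reference makes the whole monadic lookup fail, whereas the first, case-based <hask>lookupAttr</hask> silently skips failed references and would return <hask>Just "jabber"</hask> here. Keep this in mind when you compare the versions with QuickCheck.<br />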
<br />
It keeps getting better and better :)<br />
<br />
Now, as a semi-exercise, try to understand the meaning of "sequence",<br />
"guard" and "flip" by looking at the following ghci sessions:<br />
<br />
*Main> :t sequence<br />
sequence :: (Monad m) => [m a] -> m [a]<br />
*Main> :t [Just 'a', Just 'b', Nothing, Just 'c']<br />
[Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]<br />
*Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]<br />
<br />
*Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b', Nothing]<br />
Nothing<br />
*Main> sequence [Just 'a', Just 'b']<br />
Just "ab"<br />
<br />
*Main> :t [putStrLn "a", putStrLn "b"]<br />
[putStrLn "a", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", putStrLn "b"]<br />
sequence [putStrLn "a", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", putStrLn "b"]<br />
a<br />
b<br />
<br />
*Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]<br />
[putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]<br />
*Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]<br />
*Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]<br />
a<br />
*** Exception: user error (stop here)<br />
<br />
Notice that for the <hask>Maybe</hask> monad, <hask>sequence</hask> continues execution<br />
until the first <hask>Nothing</hask>. The same behaviour can be<br />
observed for the IO monad. Take into account that these different behaviours are<br />
not hardcoded into the definition of <hask>sequence</hask>!<br />
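<br />
To see why nothing needs to be hardcoded, here is one way <hask>sequence</hask> could be written for lists. This is a sketch, not the actual library source, and it is called <hask>sequenceSketch</hask> to avoid a clash with the Prelude. It only uses "do" and <hask>return</hask>, so each monad brings its own behaviour:<br />
<br />
<haskell><br />
-- A possible definition of 'sequence' for lists (not the library source).<br />
sequenceSketch :: Monad m => [m a] -> m [a]<br />
sequenceSketch []     = return []<br />
sequenceSketch (m:ms) = do<br />
  x  <- m                  -- run the first action<br />
  xs <- sequenceSketch ms  -- run the rest<br />
  return (x : xs)          -- collect the results<br />
<br />
main :: IO ()<br />
main = do<br />
  print (sequenceSketch [Just 'a', Just 'b'])           -- Just "ab"<br />
  print (sequenceSketch [Just 'a', Nothing, Just 'c'])  -- Nothing<br />
  units <- sequenceSketch [putStrLn "a", putStrLn "b"]  -- prints a, then b<br />
  print units                                           -- [(),()]<br />
</haskell><br />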
<br />
Now, let's examine <hask>guard</hask>:<br />
<br />
*Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] <br />
[Just 5,Nothing,Just 7]<br />
<br />
As you can see, it's just a simple way to "stop" execution at some<br />
condition.<br />
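<br />
For the <hask>Maybe</hask> monad, <hask>guard</hask> boils down to something very small. In the sketch below <hask>guardMaybe</hask>, <hask>safeDiv</hask> and <hask>safeDiv'</hask> are made-up names; the library version of <hask>guard</hask> is more general, but it behaves the same way here:<br />
<br />
<haskell><br />
import Control.Monad (guard)<br />
<br />
-- What 'guard' amounts to when the monad is 'Maybe' (hypothetical helper).<br />
guardMaybe :: Bool -> Maybe ()<br />
guardMaybe True  = Just ()<br />
guardMaybe False = Nothing<br />
<br />
-- Division that refuses to divide by zero, using the library 'guard' ...<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv x y = do<br />
  guard (y /= 0)<br />
  return (x `div` y)<br />
<br />
-- ... and the same thing using our hand-made 'guardMaybe'.<br />
safeDiv' :: Int -> Int -> Maybe Int<br />
safeDiv' x y = do<br />
  guardMaybe (y /= 0)<br />
  return (x `div` y)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (map (safeDiv  10) [2, 0, 5])<br />
  print (map (safeDiv' 10) [2, 0, 5])<br />
</haskell><br />
<br />
Both lines print <hask>[Just 5,Nothing,Just 2]</hask>: when the condition is false, the <hask>Nothing</hask> skips the rest of the "do" block, just as described above.<br />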
<br />
If you have been hooked on monads, I urge you to read "All About<br />
Monads" right now (link in Chapter 400).<br />
<br />
== Chapter 6: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Read [http://en.wikibooks.org/wiki/Programming:Haskell_monads this wikibook]. <br />
Then, read [http://www.nomaware.com/monads "All about monads"].<br />
'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
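<br />
If you want to convince yourself, here is a runnable sketch of the two forms side by side. The definitions of <hask>someAction</hask>, <hask>someOtherAction</hask>, <hask>foo</hask> and <hask>bar</hask> are placeholders I made up so that the program compiles:<br />
<br />
<haskell><br />
-- Placeholder actions and functions, for illustration only.<br />
someAction :: IO Int<br />
someAction = return 2<br />
<br />
someOtherAction :: IO Int<br />
someOtherAction = return 3<br />
<br />
foo, bar :: Int -> Int<br />
foo = (+ 1)<br />
bar = (* 2)<br />
<br />
-- The version with "do" ...<br />
sugared :: IO ()<br />
sugared = do a <- someAction<br />
             b <- someOtherAction<br />
             print (bar b)<br />
             print (foo a)<br />
             print "done"<br />
<br />
-- ... and the same program written with '>>=' and '>>'.<br />
desugared :: IO ()<br />
desugared = someAction >>= \a -><br />
            someOtherAction >>= \b -><br />
            print (bar b) >><br />
            print (foo a) >><br />
            print "done"<br />
<br />
main :: IO ()<br />
main = sugared >> desugared<br />
</haskell><br />
<br />
Both halves of the run print 6, then 3, then "done".<br />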
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm,<br />
HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram.<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=4221Hitchhikers guide to Haskell2006-06-02T08:43:22Z<p>Adept: Added links to source code files, massive review and bugfixing</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how a TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that when cutting and pasting or manually<br />
aligning code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- Taken from 'hello.hs'<br />
-- From now on, a comment at the beginning of the code snippet<br />
-- will specify the file which contains the full program from<br />
-- which the snippet is taken. You can get the code from the darcs<br />
-- repository "http://adept.linux.kiev.ua/repos/hhgtth" by issuing<br />
-- the command "darcs get http://adept.linux.kiev.ua/repos/hhgtth"<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-1-1.hs'<br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
 $ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
 <br />
 Loading package base-1.0 ... linking ... done.<br />
 Prelude> :type getContents<br />
 getContents :: IO String<br />
 Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
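A minimal sketch of that idea: we keep a few IO actions in an ordinary list and only later<br />
run them in order with "sequence_" from the Prelude. The names "todo", "runTodo" and<br />
"runBackwards" are made up for this illustration:<br />
<br />
<haskell><br />
-- A plain list of IO actions; building it performs no I/O at all<br />
todo :: [IO ()]<br />
todo = [ putStrLn "first"<br />
       , putStrLn "second"<br />
       , print (6 * 7)<br />
       ]<br />
<br />
-- "sequence_" runs the actions one by one, left to right<br />
runTodo :: IO ()<br />
runTodo = sequence_ todo<br />
<br />
-- The list is an ordinary value, so ordinary list functions apply to it<br />
runBackwards :: IO ()<br />
runBackwards = sequence_ (reverse todo)<br />
</haskell><br />
<br />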
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
-- Taken from 'exercise-1-1.hs'<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
Often people complain that it is very difficult to write Haskell<br />
because it requires them to align code. Actually, this is not true. If<br />
you align your code, the compiler will guess the beginnings and endings of<br />
syntactic blocks. However, if you don't want to indent your code, you<br />
could explicitly mark the end of each and every expression and use<br />
arbitrary layout as in this example: <br />
<haskell><br />
-- Taken from 'exercise-1-2.hs'<br />
combine before after = <br />
do { before; <br />
putStrLn "In the middle"; <br />
after; };<br />
<br />
main = <br />
do { combine c c; let { b = combine (putStrLn "Hello!") (putStrLn "Bye!")};<br />
let {d = combine (b) (combine c c)}; <br />
putStrLn "So long!" };<br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
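<br />
To see the "type of the last action" rule at work, here is a small sketch with two<br />
made-up actions ("secondLine" and "shout" are names invented for this example):<br />
<br />
<haskell><br />
-- Last action is "getLine :: IO String", so the whole block is an IO String<br />
secondLine :: IO String<br />
secondLine = do _ <- getLine<br />
                getLine<br />
<br />
-- Last action is "putStrLn ...", so the whole block is an IO ()<br />
shout :: String -> IO ()<br />
shout s = do putStrLn s<br />
             putStrLn (s ++ "!")<br />
</haskell><br />
<br />
Ask ghci for ":type secondLine" and ":type shout" to check that the annotations agree with the rule.<br />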
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic action" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<haskell><br />
data Dir = D Int String deriving Show<br />
</haskell><br />
<br />
This would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
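As a quick illustration of what the derived code buys us, here is a tiny sketch<br />
(the name "example" is invented for it):<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
<br />
-- "show" is the function that "deriving Show" gives us for free;<br />
-- "print" is just "show" followed by "putStrLn"<br />
example :: String<br />
example = show (Dir 65572 "dir1")<br />
-- example == "Dir 65572 \"dir1\""<br />
</haskell><br />
<br />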
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
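<br />
Partial application is not specific to parsers. A tiny sketch with ordinary functions<br />
(the names "add", "addTen", "dropOne" and "fifteen" are invented for this example):<br />
<br />
<haskell><br />
add :: Int -> Int -> Int<br />
add x y = x + y<br />
<br />
-- "add" wants two arguments; given only one, we get a new function<br />
addTen :: Int -> Int<br />
addTen = add 10<br />
<br />
-- The same trick with the Prelude function "drop"<br />
dropOne :: [a] -> [a]<br />
dropOne = drop 1<br />
<br />
-- Backticks turn any two-argument function into an infix operator<br />
fifteen :: Int<br />
fifteen = 10 `add` 5<br />
</haskell><br />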
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function, which,<br />
given a parser, a name of the source file or channel (e.g. "stdin"), and<br />
source data (String, which is a list of "Char"s, which is written "[Char]"),<br />
will either produce a parse error, or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which could<br />
hold value of one of the two distinct types. However, unlike C/C++ "union",<br />
when presented with value of type "Either Int Char" we could immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
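<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch<br />
(the function "describe" and the list "examples" are invented for this illustration):<br />
<br />
<haskell><br />
-- Look at the constructor to find out what is inside<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
<br />
examples :: [String]<br />
examples = map describe [Left 42, Right 'x']<br />
-- examples == ["an Int: 42","a Char: x"]<br />
</haskell><br />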
<br />
Did you also notice that we provide "parse" with parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
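<br />
Here is a small sketch of "parse" used as an evaluator, with no IO involved at all.<br />
It assumes that the Parsec library is installed; the names "number" and "runNumber"<br />
are invented for this example and are not part of cd-fit:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- A tiny parser: one or more digits, converted to a number<br />
number :: Parser Int<br />
number = do ds <- many1 digit<br />
            return (read ds)<br />
<br />
-- "parse" runs the parser; we pattern-match on the Either it returns<br />
runNumber :: String -> String<br />
runNumber s = case parse number "example" s of<br />
                Left err -> "no parse: " ++ show err<br />
                Right n  -> "parsed " ++ show n<br />
-- runNumber "42abc" == "parsed 42"<br />
</haskell><br />
<br />
Note that "number" does not demand the end of input, so the trailing "abc" is simply left unread.<br />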
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-2-1.hs'<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. <br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't know already what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
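A short sketch of the record syntax in use (the values "photos", "music" and "sizes"<br />
are invented for this example):<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
-- Fields can still be given by position, as before ...<br />
photos :: Dir<br />
photos = Dir 65572 "dir1"<br />
<br />
-- ... or by name, in any order<br />
music :: Dir<br />
music = Dir {dir_name = "dir2", dir_size = 68268}<br />
<br />
-- Each field name doubles as a function from "Dir" to the field's type<br />
sizes :: [Int]<br />
sizes = map dir_size [photos, music]<br />
-- sizes == [65572,68268]<br />
</haskell><br />
<br />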
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on CD one by one, until there is no room for more. We will need to track<br />
which directories we added to CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Compare in reverse order, so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
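<br />
To get a feel for the building blocks listed above before looking them up, here is a small<br />
sketch on plain numbers. All the names in it ("total", "biggestFirst", "areaLet",<br />
"areaWhere") are invented for this illustration:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- "foldl" threads an accumulator through the list, left to right<br />
total :: [Int] -> Int<br />
total = foldl (+) 0<br />
<br />
-- "sortBy" takes the comparison as a parameter<br />
biggestFirst :: [Int] -> [Int]<br />
biggestFirst = sortBy (\a b -> compare b a)<br />
<br />
-- The same local definition, once with "let" and once with "where"<br />
areaLet :: Int -> Int<br />
areaLet r = let sq = r * r in 3 * sq<br />
<br />
areaWhere :: Int -> Int<br />
areaWhere r = 3 * sq<br />
  where sq = r * r<br />
</haskell><br />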
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-1.hs'<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :) Now, do "darcs record" and add some sensible commit message.<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long (n*10+1 for n from 1 to 300),<br />
          -- consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that property indeed holds.<br />
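<br />
A note for readers with a newer QuickCheck: in QuickCheck 2 and later, "coarbitrary" is no longer a method of "Arbitrary" (it lives in a separate class, "CoArbitrary"), so the instance above will most likely be rejected there. The following is a minimal sketch of the same generator for QuickCheck 2. It assumes the "Dir" type used in this chapter; the helper names "genSize" and "genName" and the property "prop_dir_within_bounds" are mine, and "greedy_pack" is left out so that the block stands on its own:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq,Show)<br />
<br />
-- In QuickCheck 2 only "arbitrary" has to be defined<br />
instance Arbitrary Dir where<br />
  arbitrary = liftM2 Dir genSize genName<br />
    where<br />
      -- Random size between 10 and 1400 Mb, expressed in bytes<br />
      genSize = do s <- choose (10,1400)<br />
                   return (s*1024*1024)<br />
      -- Random name, 11 to 3001 chars long, built from the symbols "fubar/"<br />
      genName = do n <- choose (1,300 :: Int)<br />
                   vectorOf (n*10+1) (elements "fubar/")<br />
<br />
-- A property that can be checked without the rest of the program:<br />
-- every generated "Dir" respects the bounds stated above<br />
prop_dir_within_bounds :: Dir -> Bool<br />
prop_dir_within_bounds d =<br />
  dir_size d >= 10*1024*1024<br />
    && dir_size d <= 1400*1024*1024<br />
    && length (dir_name d) >= 11<br />
    && all (`elem` "fubar/") (dir_name d)<br />
</haskell><br />
<br />
Load it into ghci and run "quickCheck prop_dir_within_bounds"; it should report that all tests passed.<br />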
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your functions must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
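<br />
To make the "contract" idea a bit more tangible, here is a small sketch. The class "HasSize", its method "sizeOf" and the types "File" and "Folder" are invented for this illustration and are not part of the cd-fit program:<br />
<br />
<haskell><br />
-- The contract: any type that wants to be "HasSize" must say<br />
-- how to compute its size in bytes<br />
class HasSize a where<br />
  sizeOf :: a -> Integer<br />
<br />
data File   = File String Integer     -- name, size<br />
data Folder = Folder String [File]    -- name, contents<br />
<br />
-- Two types fulfilling the contract, each in its own way<br />
instance HasSize File where<br />
  sizeOf (File _ s) = s<br />
<br />
instance HasSize Folder where<br />
  sizeOf (Folder _ files) = sum (map sizeOf files)<br />
<br />
-- A generic function: it accepts a list of any type that fulfills the<br />
-- contract, and the "HasSize a =>" part of the signature enforces that<br />
totalSize :: HasSize a => [a] -> Integer<br />
totalSize = sum . map sizeOf<br />
</haskell><br />
<br />
Here "totalSize [File "a" 10, File "b" 20]" gives 30, while "totalSize "abc"" is rejected at compile time, because nobody has made "Char" an instance of "HasSize".<br />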
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it: 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal definition, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
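<br />
A short sketch of such a minimal definition follows. The type "Priority" and the helper "rank" are hypothetical and serve only this illustration:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Priority = Low | Medium | High deriving (Eq,Show)<br />
<br />
-- Helper that maps each value to a number we already know how to compare<br />
rank :: Priority -> Int<br />
rank Low    = 0<br />
rank Medium = 1<br />
rank High   = 2<br />
<br />
-- We define only "compare"; (<), (<=), (>), (>=), max and min<br />
-- are supplied by the default implementations of the class<br />
instance Ord Priority where<br />
  compare a b = compare (rank a) (rank b)<br />
</haskell><br />
<br />
With just that, everything that relies on "Ord" becomes available:<br />
<br />
*Main> sort [High, Low, Medium]<br />
[Low,Medium,High]<br />
*Main> High > Low<br />
True<br />
*Main> max Low Medium<br />
Medium<br />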
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
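<br />
As a quick sketch of what deriving buys you, consider the hypothetical type "Track" below (it is not part of cd-fit). The derived "Ord" instance compares the fields in the order in which they are declared:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Track = Track {track_no::Int, title::String} deriving (Eq,Ord,Show)<br />
<br />
-- Sorting works out of the box, thanks to the derived "Ord"<br />
sortedTracks :: [Track]<br />
sortedTracks = sort [Track 2 "b", Track 1 "z"]<br />
</haskell><br />
<br />
*Main> sortedTracks<br />
[Track {track_no = 1, title = "z"},Track {track_no = 2, title = "b"}]<br />
*Main> Track 1 "z" == Track 1 "z"<br />
True<br />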
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-3-2.hs'<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long (n*10+1 chars for n from 1 to 300),<br />
          -- consisting of symbols "fubar/"<br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct a "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third<br />
time ;)<br />
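<br />
If you would like to convince yourself that "liftM2" and the "do" block really are the same thing, here is a small sketch. It uses "Maybe" instead of "Gen", so it can be tried without QuickCheck; the type "Point" and the function names are made up for the occasion:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
-- A two-argument constructor, standing in for "Dir"<br />
data Point = Point Int Int deriving (Eq,Show)<br />
<br />
-- Version 1: an explicit "do" block<br />
pointDo :: Maybe Int -> Maybe Int -> Maybe Point<br />
pointDo mx my = do<br />
  x <- mx<br />
  y <- my<br />
  return (Point x y)<br />
<br />
-- Version 2: the same thing, spelled with "liftM2"<br />
pointLift :: Maybe Int -> Maybe Int -> Maybe Point<br />
pointLift = liftM2 Point<br />
</haskell><br />
<br />
*Main> pointDo (Just 1) (Just 2)<br />
Just (Point 1 2)<br />
*Main> pointLift (Just 1) (Just 2)<br />
Just (Point 1 2)<br />
*Main> pointLift Nothing (Just 2)<br />
Nothing<br />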
<br />
Oh, by the way - don't forget to "darcs record" your changes!<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-1.hs'<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs =<br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..]<br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit =<br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to<br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write an implementation for it. Nice, eh?<br />
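<br />
The trick behind "precomp" (a lazily built list of answers, which the function itself consults for smaller sub-problems) may be easier to see on a smaller problem. The sketch below applies the same idea to a task that has nothing to do with cd-fit: counting the ways to climb a staircase of n steps, taking 1 or 2 steps at a time. The names "table", "ways" and "climb" are mine:<br />
<br />
<haskell><br />
-- table !! n holds the number of ways to climb n steps<br />
table :: [Integer]<br />
table = map ways [0..]<br />
  where<br />
    ways :: Int -> Integer<br />
    -- Recursion base<br />
    ways 0 = 1<br />
    ways 1 = 1<br />
    -- Recursion step: look up the smaller answers in the table instead of<br />
    -- recomputing them. Thanks to laziness, each element of the table is<br />
    -- computed at most once.<br />
    ways n = table !! (n - 1) + table !! (n - 2)<br />
<br />
climb :: Int -> Integer<br />
climb n = table !! n<br />
</haskell><br />
<br />
*Main> climb 10<br />
89<br />
<br />
Just like in <tt>precomputeDisksFor</tt>, every "!!" walks the list from its head, and the list cells stay in memory for as long as something still refers to the list.<br />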
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition used to filter the list of dirs<br />
<br />
---- <br />
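<br />
In case no decent tutorial is at hand: below is a sketch of how a list comprehension can be de-sugared. To avoid spoiling the exercise, it uses a different comprehension (the functions "evenPairs" and "evenPairs'" are invented for this purpose). Generators turn into "concatMap", a "let" stays a "let", and a guard turns into an "if" which yields either a one-element list or an empty one:<br />
<br />
<haskell><br />
-- With sugar: pairs (x,y) with x from 1..n and y from x..n, whose sum is even<br />
evenPairs :: Int -> [(Int,Int)]<br />
evenPairs n = [ (x,y) | x <- [1..n], let lo = x, y <- [lo..n], even (x+y) ]<br />
<br />
-- Without sugar<br />
evenPairs' :: Int -> [(Int,Int)]<br />
evenPairs' n =<br />
  concatMap (\x -><br />
    let lo = x in<br />
    concatMap (\y -><br />
      if even (x+y) then [(x,y)] else [])<br />
      [lo..n])<br />
    [1..n]<br />
</haskell><br />
<br />
*Main> evenPairs 3<br />
[(1,1),(1,3),(2,2),(3,3)]<br />
*Main> evenPairs 6 == evenPairs' 6<br />
True<br />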
<br />
Before we move any further, let's make a small cosmetic change to our<br />
code. Right now our solution uses 'Int' to store directory size. In<br />
Haskell, 'Int' is a platform-dependent integer, which imposes certain<br />
limitations on the values of this type. An attempt to compute a value<br />
of type 'Int' that exceeds the bounds will result in overflow (in GHC the value silently wraps around).<br />
The standard Haskell libraries have a special typeclass<br />
<hask>Bounded</hask>, which allows us to define and examine such bounds:<br />
<br />
Prelude> :i Bounded <br />
class Bounded a where<br />
minBound :: a<br />
maxBound :: a<br />
-- skip --<br />
instance Bounded Int -- Imported from GHC.Enum<br />
<br />
We see that 'Int' is indeed bounded. Let's examine the bounds:<br />
<br />
Prelude> minBound :: Int <br />
-2147483648<br />
Prelude> maxBound :: Int<br />
2147483647<br />
Prelude> <br />
<br />
Those of you who are C-literate will spot at once that in this case<br />
the 'Int' is a so-called "signed 32-bit integer", which means that we<br />
would run into errors trying to operate on directories/directory packs<br />
which are bigger than 2 Gb. (On a 64-bit platform GHC's 'Int' is 64 bits wide, so you will see much larger bounds there.)<br />
<br />
Luckily for us, Haskell has integers of arbitrary precision (limited<br />
only by the amount of available memory). The appropriate type is<br />
called 'Integer':<br />
<br />
Prelude> (2^50) :: Int<br />
0 -- overflow<br />
Prelude> (2^50) :: Integer<br />
1125899906842624 -- no overflow<br />
Prelude><br />
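<br />
The session above comes from a 32-bit machine; on a 64-bit one "(2^50) :: Int" does not overflow. If you want to find out what your own platform does, the following sketch may help. The names "intBits", "maxInt" and "fitsInInt" are mine; "finiteBitSize" comes from Data.Bits:<br />
<br />
<haskell><br />
import Data.Bits (finiteBitSize)<br />
<br />
-- Width of "Int" on the machine this code runs on (e.g. 32 or 64)<br />
intBits :: Int<br />
intBits = finiteBitSize (0 :: Int)<br />
<br />
-- The largest "Int", converted to "Integer" so that we can safely<br />
-- do arithmetic with it<br />
maxInt :: Integer<br />
maxInt = toInteger (maxBound :: Int)<br />
<br />
-- Does the given "Integer" fit into an "Int" without overflow?<br />
fitsInInt :: Integer -> Bool<br />
fitsInInt n = n >= toInteger (minBound :: Int) && n <= maxInt<br />
</haskell><br />
<br />
Try "intBits" and "fitsInInt (2^50)" at the ghci prompt and compare the answers with the bounds shown above.<br />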
<br />
Let's change the definitions of 'Dir' and 'DirPack' to allow for bigger<br />
directory sizes:<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
</haskell><br />
<br />
Try to compile the code or load it into ghci. You will get the<br />
following errors:<br />
<br />
cd-fit-4-2.hs:73:79:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the expression: limit - (dir_size d)<br />
In the second argument of `(!!)', namely `(limit - (dir_size d))'<br />
<br />
cd-fit-4-2.hs:89:47:<br />
Couldn't match `Int' against `Integer'<br />
Expected type: Int<br />
Inferred type: Integer<br />
In the second argument of `(!!)', namely `media_size'<br />
In the definition of `dynamic_pack':<br />
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size<br />
<br />
<br />
It seems like Haskell has some trouble using 'Integer' with '(!!)'.<br />
Let's see why:<br />
<br />
Prelude> :t (!!)<br />
(!!) :: [a] -> Int -> a<br />
<br />
It seems that the definition of '(!!)' demands that the index be an 'Int', not an<br />
'Integer'. Haskell never converts any type to some other type<br />
automatically - the programmer has to ask for that explicitly.<br />
<br />
I will not repeat the section "Standard Haskell Classes" from<br />
[http://haskell.org/onlinereport/basic.html the Haskell Report] and<br />
explain why the typeclasses for various numbers are organized the way they<br />
are. I will just say that the standard typeclass<br />
<hask>Num</hask> demands that numeric types implement the method<br />
<hask>fromInteger</hask>:<br />
<br />
Prelude> :i Num<br />
class (Eq a, Show a) => Num a where<br />
(+) :: a -> a -> a<br />
(*) :: a -> a -> a<br />
(-) :: a -> a -> a<br />
negate :: a -> a<br />
abs :: a -> a<br />
signum :: a -> a<br />
fromInteger :: Integer -> a<br />
-- Imported from GHC.Num<br />
instance Num Float -- Imported from GHC.Float<br />
instance Num Double -- Imported from GHC.Float<br />
instance Num Integer -- Imported from GHC.Num<br />
instance Num Int -- Imported from GHC.Num<br />
<br />
We see that <hask>Int</hask> is an instance of typeclass<br />
<hask>Num</hask>, thus we could use <hask>fromInteger</hask> to convert an <hask>Integer</hask> into an <hask>Int</hask> and make<br />
the type errors go away:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
-- snip<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
              ] of<br />
-- snip<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)<br />
-- snip<br />
</haskell><br />
<br />
The type errors went away, but the careful reader will spot at once that when the<br />
expression <hask>(limit - dir_size d)</hask> exceeds the bounds<br />
for <hask>Int</hask>, overflow will occur, and we will not access the<br />
correct list element. Don't worry, we will deal with this in a short while.<br />
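<br />
For the impatient, here are two ways to stay out of trouble when mixing 'Integer' and list indices. This is only a sketch and not what cd-fit ends up doing: "toIntChecked" and "nth" are hypothetical helpers of mine, while "genericIndex" is a real function from Data.List which accepts an index of any integral type:<br />
<br />
<haskell><br />
import Data.List (genericIndex)<br />
<br />
-- Convert an "Integer" to an "Int" only when it fits; otherwise report failure<br />
toIntChecked :: Integer -> Maybe Int<br />
toIntChecked n<br />
  | n < toInteger (minBound :: Int) = Nothing<br />
  | n > toInteger (maxBound :: Int) = Nothing<br />
  | otherwise                       = Just (fromInteger n)<br />
<br />
-- Index a list directly with an "Integer", no conversion needed<br />
nth :: [a] -> Integer -> a<br />
nth = genericIndex<br />
</haskell><br />
<br />
*Main> toIntChecked (2^100)<br />
Nothing<br />
*Main> nth "abc" 1<br />
'b'<br />
<br />
Keep in mind that "genericIndex" still walks the list element by element, so it cures the overflow, not the slowness.<br />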
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-2.hs'<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-3.hs'<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC versions spell the <tt>-auto-all</tt> flag as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
COST CENTRE                   MODULE   %time  %alloc<br />
<br />
precomputeDisksFor            Main      88.1    99.8<br />
dynamic_pack                  Main      11.0     0.0<br />
<br />
                                                        individual      inherited<br />
COST CENTRE                   MODULE   no.   entries   %time  %alloc   %time  %alloc<br />
<br />
MAIN                          MAIN       1         0     0.0     0.0   100.0   100.0<br />
CAF                           Main     174        11     0.9     0.2   100.0   100.0<br />
prop_dynamic_pack_small_disk  Main     181       100     0.0     0.0    99.1    99.8<br />
dynamic_pack                  Main     182       200    11.0     0.0    99.1    99.8<br />
precomputeDisksFor            Main     183       200    88.1    99.8    88.1    99.8<br />
main                          Main     180         1     0.0     0.0     0.0     0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 Mb, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size, onto<br />
imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))<br />
                                                , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-4.hs'<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
COST CENTRE                   MODULE      %time  %alloc<br />
<br />
CAF                           GHC.Float     0.0     4.4<br />
main                          Main          0.0    93.9<br />
<br />
                                                        individual      inherited<br />
COST CENTRE                   MODULE   no.   entries   %time  %alloc   %time  %alloc<br />
MAIN                          MAIN       1         0     0.0     0.0     0.0   100.0<br />
main                          Main     180         1     0.0    93.9     0.0    94.2<br />
prop_dynamic_pack_small_disk  Main     181       100     0.0     0.0     0.0     0.3<br />
dynamic_pack                  Main     182       200     0.0     0.2     0.0     0.3<br />
bestDisk                      Main     183       200     0.0     0.1     0.0     0.1<br />
<br />
We achieved a major improvement: total allocation is reduced by a factor<br />
of roughly 640! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are not lucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word on it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
                                          , d <- small_enough<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
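<br />
For reference, here is how the pieces of this chapter could fit together in one file that compiles on its own. This is a sketch under my own assumptions and not the contents of 'cd-fit-4-5.hs': the record names match the ones used in the chapter, but a plain lambda and "comparing" (from Data.Ord) are used instead of "inRange" and "cmpSize" for brevity:<br />
<br />
<haskell><br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
<br />
bestDisk :: Integer -> [Dir] -> DirPack<br />
-- Recursion base: nothing fits on an empty disk, and there is<br />
-- nothing to pack when no directories are left<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: try every non-empty directory that fits, pack the rest<br />
-- of the directories into the remaining space, and keep the fullest result<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d:rest)<br />
       | let small_enough = filter (\x -> dir_size x > 0 && dir_size x <= limit) ds<br />
       , d <- small_enough<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
<br />
dynamic_pack :: Integer -> [Dir] -> DirPack<br />
dynamic_pack = bestDisk<br />
</haskell><br />
<br />
*Main> pack_size (dynamic_pack 10 [Dir 6 "a", Dir 5 "b", Dir 4 "c"])<br />
10<br />
<br />
A greedy packer that grabs the 6 and then the 5 would have to stop there, while this version finds the combination 6 + 4 that fills the disk completely.<br />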
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
-- Taken from 'cd-fit-4-5.hs'<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
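<br />
To see the equivalence in something that compiles, here is a minimal sketch. The definitions of <tt>someAction</tt>, <tt>someOtherAction</tt>, <tt>foo</tt> and <tt>bar</tt> are hypothetical stand-ins, and both versions return a value instead of printing so that the results can be compared.<br />
<br />
```haskell
-- Hypothetical stand-ins for someAction, someOtherAction, foo and bar.
someAction :: IO Int
someAction = return 20

someOtherAction :: IO Int
someOtherAction = return 1

foo, bar :: Int -> Int
foo = (* 2)
bar = (+ 1)

-- Written with "do" notation.
viaDo :: IO (Int, Int)
viaDo = do a <- someAction
           b <- someOtherAction
           return (bar b, foo a)

-- The same action written with explicit (>>=).
viaBind :: IO (Int, Int)
viaBind = someAction >>= \a ->
          someOtherAction >>= \b ->
          return (bar b, foo a)
```
<br />
Both actions yield the same pair when run.<br />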
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to:<br />
Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson,<br />
Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor<br />
Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins,<br />
Anastasia Gornostaeva (ermine).<br />
<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=QuickCheck_as_a_test_set_generator&diff=4212QuickCheck as a test set generator2006-05-29T14:38:01Z<p>Adept: </p>
<hr />
<div>= <center>Haskell as an ultimate "smoke testing" tool<p>OR</p><p>Using QuickCheck as DIY test data generator</p></center> =<br />
<br />
== Preface ==<br />
<br />
Recently, my wife approached me with the following problem: they had to<br />
test their re-implementation (in Java) of a part of a huge<br />
software system previously written in C++. The original system is poorly<br />
documented and only a small part of the sources was available.<br />
<br />
Among other things, they had to write a parser for a home-brewed DSL<br />
designed to describe data structures. The DSL is a mix of ASN.1 and BNF<br />
grammars; it describes the structure of some data records and simple<br />
business rules relevant to the processing of said records. The DSL is not<br />
Turing-complete, but allows users to define their own functions and<br />
specify math and boolean expressions on fields, and was designed as<br />
"ASN.1 on steroids".<br />
<br />
The problem is that their implementation (in JavaCC) of this DSL parser<br />
was based on the single available description of the DSL grammar,<br />
which was presumably incomplete. They tested the implementation on the few<br />
examples available, but the question remained how to test the parser on a<br />
large set of data in order to be fairly sure that "everything<br />
works".<br />
<br />
== The fame of Quick Check ==<br />
<br />
My wife observed me during the last (2005) ICFP contest and was amazed<br />
at the ease with which our team tested our protocol parser and<br />
printer using Quick Check. So, she asked me whether it is possible to<br />
generate pseudo-random test data in a similar manner for use<br />
"outside" of Haskell.<br />
<br />
"Why not?" I thought. After all, I found it quite easy to generate<br />
instances of 'Arbitrary' for quite complex data structures.<br />
<br />
== Concept of the '''Variant''' ==<br />
<br />
The task was formulated as follows:<br />
<br />
* The task is to generate test datasets for the external program. Each dataset consists of several files, each containing 1 "record"<br />
<br />
* A "record" is essentially a Haskell data type<br />
<br />
* We must be able to generate pseudo-random "valid" and "invalid" data, to test that the external program consumes all "valid" samples and fails to consume all "invalid" ones. Deviation from this behavior signifies an error in the external program.<br />
<br />
Let's capture this notion of "valid" and "invalid" data in a type<br />
class:<br />
<br />
<haskell><br />
module Variant where<br />
<br />
import Control.Monad<br />
import Test.QuickCheck<br />
<br />
class Variant a where<br />
  valid   :: Gen a<br />
  invalid :: Gen a<br />
</haskell> <br />
<br />
So, in order to make a set of test data of some type, the user must<br />
provide means to generate "valid" and "invalid" data of this type.<br />
<br />
If we can make a "valid" Foo (for suitable "data Foo = ...") and<br />
"invalid" Foo, then we should also be able to make a "random" Foo:<br />
<br />
<haskell><br />
instance Variant a => Arbitrary a where<br />
  coarbitrary = undefined -- Not needed, Easily fixable<br />
  arbitrary   = oneof [valid, invalid]<br />
</haskell><br />
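<br />
A note for readers on a current toolchain: with GHC, an instance head that is a bare type variable needs the <tt>FlexibleInstances</tt> and <tt>UndecidableInstances</tt> extensions and overlaps with every other <tt>Arbitrary</tt> instance, and in QuickCheck 2 <tt>coarbitrary</tt> lives in a separate <tt>CoArbitrary</tt> class. One possible way around this, sketched here under the assumption that QuickCheck 2 is installed, is a wrapper type. The names <tt>AnyVariant</tt> and <tt>Digit</tt> are hypothetical and do not come from the original code.<br />
<br />
```haskell
import Test.QuickCheck

class Variant a where
  valid   :: Gen a
  invalid :: Gen a

-- Hypothetical wrapper: gives every Variant type an Arbitrary instance
-- without a catch-all "instance ... => Arbitrary a".
newtype AnyVariant a = AnyVariant a deriving Show

instance Variant a => Arbitrary (AnyVariant a) where
  arbitrary = fmap AnyVariant (oneof [valid, invalid])

-- Tiny example type so that the sketch is self-contained.
newtype Digit = Digit Char deriving Show

instance Variant Digit where
  valid   = fmap Digit (elements "0123456789")
  invalid = fmap Digit (elements "abc!@#")
```
<br />
With this in place, <tt>arbitrary :: Gen (AnyVariant Digit)</tt> picks either a valid or an invalid <tt>Digit</tt>.<br />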
<br />
Thus, taking for example the following definition for our<br />
"data-to-test":<br />
<br />
<haskell><br />
data Record     = InputRecord Name Number<br />
                | OutputRecord Name Number OutputType<br />
data Number     = Number String<br />
data Name       = Name String<br />
data OutputType = OutputType String<br />
</haskell><br />
<br />
we could produce the following instances of the class "Variant":<br />
<br />
<haskell><br />
-- `neStringOf` is a small helper generator: `neStringOf first next`<br />
-- produces a non-empty string whose first character is taken from<br />
-- `first` and all subsequent ones from `next`<br />
garbledString = neStringOf ".,_+-" "abc0!@#$%^&*()."<br />
instance Variant Number where<br />
  valid   = liftM Number $ resize 4 $ neStringOf "123456789" "0123456789"<br />
  invalid = liftM Number $ resize 4 $ garbledString<br />
instance Variant Name where<br />
  valid   = liftM Name $ elements [ "foo", "bar", "baz" ]<br />
  invalid = liftM Name garbledString<br />
instance Variant OutputType where<br />
  valid   = liftM OutputType $ elements [ "Binary", "Ascii" ]<br />
  invalid = liftM OutputType garbledString<br />
<br />
instance Variant Record where<br />
  valid   = oneof [ liftM2 InputRecord valid valid<br />
                  , liftM3 OutputRecord valid valid valid ]<br />
  invalid = oneof [ liftM2 InputRecord valid invalid<br />
                  , liftM2 InputRecord invalid valid<br />
                  , liftM2 InputRecord invalid invalid<br />
                  , liftM3 OutputRecord invalid valid valid<br />
                  , liftM3 OutputRecord valid invalid valid<br />
                  , liftM3 OutputRecord valid valid invalid<br />
                  , liftM3 OutputRecord invalid invalid valid<br />
                  , liftM3 OutputRecord valid invalid invalid<br />
                  , liftM3 OutputRecord invalid valid invalid<br />
                  , liftM3 OutputRecord invalid invalid invalid<br />
                  ]<br />
</haskell><br />
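<br />
The text uses <tt>neStringOf</tt> without showing its definition. A minimal sketch that matches the description in the comment could look as follows. This is an assumed definition written against QuickCheck 2 (it relies on <tt>elements</tt> and <tt>listOf</tt>), not necessarily the one the author used.<br />
<br />
```haskell
import Test.QuickCheck

-- Assumed definition: the article uses neStringOf but does not show it.
-- The first character comes from `first`, the remaining ones from `next`.
neStringOf :: [Char] -> [Char] -> Gen String
neStringOf first next = do
  c  <- elements first          -- guarantees that the string is non-empty
  cs <- listOf (elements next)  -- length is bounded by the size parameter
  return (c : cs)
```
<br />
Since the length of the tail depends on the size parameter, wrapping the generator in <tt>resize 4</tt> keeps the generated strings short.<br />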
<br />
The careful reader will have already spotted that once we have hand-coded the instances of 'Variant' for a few "basic" types (like 'Name', 'Number', 'OutputType' etc), defining instances of Variant for more complex datatypes becomes easy, though quite tedious. A set of simple helpers comes to the rescue.<br />
<br />
== Helper tools ==<br />
<br />
It is easy to see that we consider an instance of a data type to be "invalid" if at least one of the arguments to the constructor is "invalid", whereas a "valid" instance must have "valid" values for all arguments of the data type constructor. This calls for enumerating the combinations:<br />
<br />
<haskell><br />
proper1 f = liftM f valid<br />
proper2 f = liftM2 f valid valid<br />
proper3 f = liftM3 f valid valid valid<br />
proper4 f = liftM4 f valid valid valid valid<br />
proper5 f = liftM5 f valid valid valid valid valid<br />
<br />
bad1 f = liftM f invalid<br />
bad2 f = oneof $ tail [ liftM2 f g1 g2 | g1<-[valid, invalid], g2<-[valid, invalid] ]<br />
bad3 f = oneof $ tail [ liftM3 f g1 g2 g3 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid] ]<br />
bad4 f = oneof $ tail [ liftM4 f g1 g2 g3 g4 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid] ]<br />
bad5 f = oneof $ tail [ liftM5 f g1 g2 g3 g4 g5 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid], g5<-[valid, invalid] ]<br />
</haskell><br />
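<br />
The <tt>tail</tt> in the <tt>bad</tt> helpers deserves a word: the list comprehension enumerates its combinations starting with the all-valid one, and <tt>tail</tt> drops exactly that first element. The following sketch shows this with plain strings standing in for the generators. The names <tt>combos2</tt> and <tt>badCombos2</tt> are made up for this illustration.<br />
<br />
```haskell
-- Labels stand in for the `valid` and `invalid` generators.
combos2 :: [(String, String)]
combos2 = [ (g1, g2) | g1 <- ["valid", "invalid"], g2 <- ["valid", "invalid"] ]

-- What bad2 chooses from: everything except the first, all-valid combination.
badCombos2 :: [(String, String)]
badCombos2 = tail combos2
```
<br />
For a constructor with n arguments this leaves 2^n - 1 combinations, each containing at least one invalid argument.<br />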
<br />
With those helper definitions we could rewrite our Record instance as follows:<br />
<br />
<haskell><br />
instance Variant Record where<br />
  valid   = oneof [ proper2 InputRecord<br />
                  , proper3 OutputRecord ]<br />
  invalid = oneof [ bad2 InputRecord<br />
                  , bad3 OutputRecord ]<br />
</haskell><br />
<br />
Note the drastic decrease in the size of the declaration!<br />
<br />
== Producing test data ==<br />
<br />
OK, but how do we use all those fancy declarations to actually produce some test data?<br />
<br />
Let's take a look at the following code:<br />
<br />
<haskell><br />
-- NB: `show x` in writeToFile needs Show instances for DataDefinition and its fields<br />
data DataDefinition = DataDefinition Name Record<br />
<br />
main = <br />
  do let num = 200    -- Number of test cases in each dataset.<br />
     let ext = "txt"  -- File extension; a placeholder, use whatever suits you<br />
     let config =     -- Describe several test datasets for "DataDefinition"<br />
                      -- by defining how we want each component of DataDefinition<br />
                      -- for each particular dataset - valid, invalid or random<br />
           [ ("All_Valid",      ext, num, (valid,     valid    ))<br />
           , ("Invalid_Name",   ext, num, (invalid,   valid    ))<br />
           , ("Invalid_Record", ext, num, (valid,     invalid  ))<br />
           , ("Random",         ext, num, (arbitrary, arbitrary))<br />
           ]<br />
     mapM_ create_test_set config<br />
<br />
-- NB: `generate 100 rnd` is the QuickCheck 1 interface; vectorOf' is not defined in this text<br />
create_test_set (fname, ext, count, gens) =<br />
  do rnd <- newStdGen<br />
     let test_set = generate 100 rnd $ vectorOf' count (mkDataDef gens)<br />
     sequence_ $ zipWith (writeToFile fname ext) [1..] test_set<br />
  where<br />
  mkDataDef (gen_name, gen_rec) = liftM2 DataDefinition gen_name gen_rec<br />
<br />
writeToFile name_prefix suffix n x =<br />
  do h <- openFile (name_prefix ++ "_" ++ pad n ++ "." ++ suffix) WriteMode<br />
     hPutStrLn h $ show x<br />
     hClose h<br />
  where pad n = reverse $ take 4 $ (reverse $ show n) ++ (repeat '0')<br />
</haskell><br />
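<br />
The <tt>pad</tt> helper inside <tt>writeToFile</tt> is terse enough to deserve a closer look. Here it is on its own, lifted to the top level and given a type signature (the signature is an addition for this sketch):<br />
<br />
```haskell
-- Same definition as in writeToFile, lifted to the top level.
-- Reversing the digits puts the least significant digit first, so
-- appending zeros and taking four characters pads on the left once the
-- result is reversed back.
pad :: Int -> String
pad n = reverse $ take 4 $ (reverse $ show n) ++ repeat '0'
```
<br />
So <tt>pad 7</tt> gives <tt>"0007"</tt> and <tt>pad 200</tt> gives <tt>"0200"</tt>. Numbers with more than four digits keep only their last four digits (<tt>pad 12345</tt> gives <tt>"2345"</tt>), so the scheme assumes fewer than 10000 files per dataset.<br />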
<br />
You see that we could control size, nature and destination of each test dataset. This approach was taken to produce test datasets for the task I described earlier. The final Haskell module had definitions for 40 Haskell datatypes, and the topmost datatype had a single constructor with 9 fields. <br />
<br />
This proved to be A Whole Lot Of Code(tm), and the declarations of "instance Variant ..." proved to be a good 30% of the total amount. Since most of them were variations on the "oneof [proper1 Foo, proper2 Bar, proper4 Baz]" theme, I started looking for a way to simplify/automate generation of such instances.<br />
<br />
== Deriving Variant instances automagically ==<br />
<br />
I took a post made by Bulat Ziganshin on the TemplateHaskell mailing list, showing how to derive instances of 'Show' automatically, and hacked it to be able to derive instances of "Variant" in much the same way:<br />
<br />
<haskell><br />
import Language.Haskell.TH<br />
import Language.Haskell.TH.Syntax<br />
<br />
data T3 = T3 String<br />
<br />
deriveVariant t = do<br />
  -- Get list of constructors for type t<br />
  TyConI (DataD _ _ _ constructors _) <- reify t<br />
<br />
  -- Make `valid` or `invalid` clause for one constructor:<br />
  -- for "(A x1 x2)" makes "Variant.proper2 A"<br />
  let mkClause f (NormalC name fields) = <br />
        appE (varE (mkName ("Variant."++f++show(length fields)))) (conE name)<br />
<br />
  -- Make body for functions `valid` and `invalid`:<br />
  -- valid = oneof [ proper2 A, proper1 C ]<br />
  -- or<br />
  -- valid = proper3 B, depending on the number of constructors<br />
  validBody <- case constructors of<br />
                 [c] -> normalB [| $(mkClause "proper" c) |]<br />
                 cs  -> normalB [| oneof $(listE (map (mkClause "proper") cs)) |]<br />
  invalidBody <- case constructors of<br />
                   [c] -> normalB [| $(mkClause "bad" c) |]<br />
                   cs  -> normalB [| oneof $(listE (map (mkClause "bad") cs)) |]<br />
<br />
  -- Generate template instance declaration and replace the type name (T3)<br />
  -- and the bodies of `valid` and `invalid` with our data<br />
  d <- [d| instance Variant T3 where<br />
             valid   = liftM T3 valid<br />
             invalid = liftM T3 invalid<br />
       |]<br />
  let [InstanceD [] (AppT showt (ConT _T3)) [ ValD validf _valid [], ValD invalidf _invalid [] ]] = d<br />
  return [InstanceD [] (AppT showt (ConT t )) [ ValD validf validBody [], ValD invalidf invalidBody [] ]]<br />
<br />
-- Usage:<br />
$(deriveVariant ''Record)<br />
</haskell><br />
<br />
[[User:Adept|Adept]]<br />
<br />
[[Category:Idioms]]</div>Adepthttps://wiki.haskell.org/index.php?title=QuickCheck_as_a_test_set_generator&diff=4211QuickCheck as a test set generator2006-05-29T14:36:33Z<p>Adept: Added proper syntax highlighting</p>
<hr />
<div>= <center>Haskell as an ultimate "smoke testing" tool<p>OR</p><p>Using QuickCheck as DIY test data generator</p></center> =<br />
<br />
== Preface ==<br />
<br />
Recently, my wife approached me with the following problem: they had to<br />
test their re-implementation (in Java) of a part of a huge<br />
software system previously written in C++. The original system is poorly<br />
documented and only a small part of the sources was available.<br />
<br />
Among other things, they had to write a parser for a home-brewed DSL<br />
designed to describe data structures. The DSL is a mix of ASN.1 and BNF<br />
grammars; it describes the structure of some data records and simple<br />
business rules relevant to the processing of said records. The DSL is not<br />
Turing-complete, but allows users to define their own functions and<br />
specify math and boolean expressions on fields, and was designed as<br />
"ASN.1 on steroids".<br />
<br />
The problem is that their implementation (in JavaCC) of this DSL parser<br />
was based on the single available description of the DSL grammar,<br />
which was presumably incomplete. They tested the implementation on the few<br />
examples available, but the question remained how to test the parser on a<br />
large set of data in order to be fairly sure that "everything<br />
works".<br />
<br />
== The fame of Quick Check ==<br />
<br />
My wife observed me during the last (2005) ICFP contest and was amazed<br />
at the ease with which our team tested our protocol parser and<br />
printer using Quick Check. So, she asked me whether it is possible to<br />
generate pseudo-random test data in a similar manner for use<br />
"outside" of Haskell.<br />
<br />
"Why not?" I thought. After all, I found it quite easy to generate<br />
instances of 'Arbitrary' for quite complex data structures.<br />
<br />
== Concept of the '''Variant''' ==<br />
<br />
The task was formulated as follows:<br />
<br />
* The task is to generate test datasets for the external program. Each dataset consists of several files, each containing 1 "record"<br />
<br />
* A "record" is essentially a Haskell data type<br />
<br />
* We must be able to generate pseudo-random "valid" and "invalid" data, to test that external program consumes all "valid" samples and fails to consume all "invalid" ones. Deviation from this behavior signifies an error in external program.<br />
<br />
Let's capture this notion of "valid" and "invalid" data in a type<br />
class:<br />
<br />
<haskell><br />
module Variant where<br />
<br />
import Control.Monad<br />
import Test.QuickCheck<br />
<br />
class Variant a where<br />
  valid   :: Gen a<br />
  invalid :: Gen a<br />
</haskell> <br />
<br />
So, in order to make a set of test data of some type, the user must<br />
provide means to generate "valid" and "invalid" data of this type.<br />
<br />
If we can make a "valid" Foo (for suitable "data Foo = ...") and<br />
"invalid" Foo, then we should also be able to make a "random" Foo:<br />
<br />
<haskell><br />
instance Variant a => Arbitrary a where<br />
  coarbitrary = undefined -- Not needed, Easily fixable<br />
  arbitrary   = oneof [valid, invalid]<br />
</haskell><br />
<br />
Thus, taking for example the following definition for our<br />
"data-to-test":<br />
<br />
<haskell><br />
data Record     = InputRecord Name Number<br />
                | OutputRecord Name Number OutputType<br />
data Number     = Number String<br />
data Name       = Name String<br />
data OutputType = OutputType String<br />
</haskell><br />
<br />
we could produce the following instances of the class "Variant":<br />
<br />
<haskell><br />
-- `neStringOf` is a small helper generator: `neStringOf first next`<br />
-- produces a non-empty string whose first character is taken from<br />
-- `first` and all subsequent ones from `next`<br />
garbledString = neStringOf ".,_+-" "abc0!@#$%^&*()."<br />
instance Variant Number where<br />
  valid   = liftM Number $ resize 4 $ neStringOf "123456789" "0123456789"<br />
  invalid = liftM Number $ resize 4 $ garbledString<br />
instance Variant Name where<br />
  valid   = liftM Name $ elements [ "foo", "bar", "baz" ]<br />
  invalid = liftM Name garbledString<br />
instance Variant OutputType where<br />
  valid   = liftM OutputType $ elements [ "Binary", "Ascii" ]<br />
  invalid = liftM OutputType garbledString<br />
<br />
instance Variant Record where<br />
  valid   = oneof [ liftM2 InputRecord valid valid<br />
                  , liftM3 OutputRecord valid valid valid ]<br />
  invalid = oneof [ liftM2 InputRecord valid invalid<br />
                  , liftM2 InputRecord invalid valid<br />
                  , liftM2 InputRecord invalid invalid<br />
                  , liftM3 OutputRecord invalid valid valid<br />
                  , liftM3 OutputRecord valid invalid valid<br />
                  , liftM3 OutputRecord valid valid invalid<br />
                  , liftM3 OutputRecord invalid invalid valid<br />
                  , liftM3 OutputRecord valid invalid invalid<br />
                  , liftM3 OutputRecord invalid valid invalid<br />
                  , liftM3 OutputRecord invalid invalid invalid<br />
                  ]<br />
</haskell><br />
<br />
The careful reader will have already spotted that once we have hand-coded the instances of 'Variant' for a few "basic" types (like 'Name', 'Number', 'OutputType' etc), defining instances of Variant for more complex datatypes becomes easy, though quite tedious. A set of simple helpers comes to the rescue.<br />
<br />
== Helper tools ==<br />
<br />
It is easy to see that we consider an instance of a data type to be "invalid" if at least one of the arguments to the constructor is "invalid", whereas a "valid" instance must have "valid" values for all arguments of the data type constructor. This calls for enumerating the combinations:<br />
<br />
<haskell><br />
proper1 f = liftM f valid<br />
proper2 f = liftM2 f valid valid<br />
proper3 f = liftM3 f valid valid valid<br />
proper4 f = liftM4 f valid valid valid valid<br />
proper5 f = liftM5 f valid valid valid valid valid<br />
<br />
bad1 f = liftM f invalid<br />
bad2 f = oneof $ tail [ liftM2 f g1 g2 | g1<-[valid, invalid], g2<-[valid, invalid] ]<br />
bad3 f = oneof $ tail [ liftM3 f g1 g2 g3 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid] ]<br />
bad4 f = oneof $ tail [ liftM4 f g1 g2 g3 g4 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid] ]<br />
bad5 f = oneof $ tail [ liftM5 f g1 g2 g3 g4 g5 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid], g5<-[valid, invalid] ]<br />
</haskell><br />
<br />
With those helper definitions we could rewrite our Record instance as follows:<br />
<br />
<haskell><br />
instance Variant Record where<br />
  valid   = oneof [ proper2 InputRecord<br />
                  , proper3 OutputRecord ]<br />
  invalid = oneof [ bad2 InputRecord<br />
                  , bad3 OutputRecord ]<br />
</haskell><br />
<br />
Note the drastic decrease in the size of the declaration!<br />
<br />
== Producing test data ==<br />
<br />
OK, but how do we use all those fancy declarations to actually produce some test data?<br />
<br />
Let's take a look at the following code:<br />
<br />
<haskell><br />
-- NB: `show x` in writeToFile needs Show instances for DataDefinition and its fields<br />
data DataDefinition = DataDefinition Name Record<br />
<br />
main = <br />
  do let num = 200    -- Number of test cases in each dataset.<br />
     let ext = "txt"  -- File extension; a placeholder, use whatever suits you<br />
     let config =     -- Describe several test datasets for "DataDefinition"<br />
                      -- by defining how we want each component of DataDefinition<br />
                      -- for each particular dataset - valid, invalid or random<br />
           [ ("All_Valid",      ext, num, (valid,     valid    ))<br />
           , ("Invalid_Name",   ext, num, (invalid,   valid    ))<br />
           , ("Invalid_Record", ext, num, (valid,     invalid  ))<br />
           , ("Random",         ext, num, (arbitrary, arbitrary))<br />
           ]<br />
     mapM_ create_test_set config<br />
<br />
-- NB: `generate 100 rnd` is the QuickCheck 1 interface; vectorOf' is not defined in this text<br />
create_test_set (fname, ext, count, gens) =<br />
  do rnd <- newStdGen<br />
     let test_set = generate 100 rnd $ vectorOf' count (mkDataDef gens)<br />
     sequence_ $ zipWith (writeToFile fname ext) [1..] test_set<br />
  where<br />
  mkDataDef (gen_name, gen_rec) = liftM2 DataDefinition gen_name gen_rec<br />
<br />
writeToFile name_prefix suffix n x =<br />
  do h <- openFile (name_prefix ++ "_" ++ pad n ++ "." ++ suffix) WriteMode<br />
     hPutStrLn h $ show x<br />
     hClose h<br />
  where pad n = reverse $ take 4 $ (reverse $ show n) ++ (repeat '0')<br />
</haskell><br />
<br />
You see that we could control size, nature and destination of each test dataset. This approach was taken to produce test datasets for the task I described earlier. The final Haskell module had definitions for 40 Haskell datatypes, and the topmost datatype had a single constructor with 9 fields. <br />
<br />
This proved to be A Whole Lot Of Code(tm), and the declarations of "instance Variant ..." proved to be a good 30% of the total amount. Since most of them were variations on the "oneof [proper1 Foo, proper2 Bar, proper4 Baz]" theme, I started looking for a way to simplify/automate generation of such instances.<br />
<br />
== Deriving Variant instances automagically ==<br />
<br />
I took a post made by Bulat Ziganshin on the TemplateHaskell mailing list, showing how to derive instances of 'Show' automatically, and hacked it to be able to derive instances of "Variant" in much the same way:<br />
<br />
<haskell><br />
import Language.Haskell.TH<br />
import Language.Haskell.TH.Syntax<br />
<br />
data T3 = T3 String<br />
<br />
deriveVariant t = do<br />
  -- Get list of constructors for type t<br />
  TyConI (DataD _ _ _ constructors _) <- reify t<br />
<br />
  -- Make `valid` or `invalid` clause for one constructor:<br />
  -- for "(A x1 x2)" makes "Variant.proper2 A"<br />
  let mkClause f (NormalC name fields) = <br />
        appE (varE (mkName ("Variant."++f++show(length fields)))) (conE name)<br />
<br />
  -- Make body for functions `valid` and `invalid`:<br />
  -- valid = oneof [ proper2 A, proper1 C ]<br />
  -- or<br />
  -- valid = proper3 B, depending on the number of constructors<br />
  validBody <- case constructors of<br />
                 [c] -> normalB [| $(mkClause "proper" c) |]<br />
                 cs  -> normalB [| oneof $(listE (map (mkClause "proper") cs)) |]<br />
  invalidBody <- case constructors of<br />
                   [c] -> normalB [| $(mkClause "bad" c) |]<br />
                   cs  -> normalB [| oneof $(listE (map (mkClause "bad") cs)) |]<br />
<br />
  -- Generate template instance declaration and replace the type name (T3)<br />
  -- and the bodies of `valid` and `invalid` with our data<br />
  d <- [d| instance Variant T3 where<br />
             valid   = liftM T3 valid<br />
             invalid = liftM T3 invalid<br />
       |]<br />
  let [InstanceD [] (AppT showt (ConT _T3)) [ ValD validf _valid [], ValD invalidf _invalid [] ]] = d<br />
  return [InstanceD [] (AppT showt (ConT t )) [ ValD validf validBody [], ValD invalidf invalidBody [] ]]<br />
<br />
-- Usage:<br />
$(deriveVariant ''Record)<br />
</haskell><br />
<br />
[[Adept]]<br />
<br />
[[Category:Idioms]]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=4210Hitchhikers guide to Haskell2006-05-29T06:38:10Z<p>Adept: Int -> Integer in few places</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
 $ runhaskell ./hello.hs<br />
 Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
 $ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
 <br />
 Loading package base-1.0 ... linking ... done.<br />
 Prelude> :type getContents<br />
 getContents :: IO String<br />
 Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any).<br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
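The idea of a list of IO actions that is executed one by one can be tried out directly. The sketch below is only an illustration: <tt>mkActions</tt> and <tt>runAll</tt> are hypothetical names, and an <tt>IORef</tt> is used to record the order in which the actions actually ran.<br />
<br />
```haskell
import Data.IORef

-- Build a list of actions; nothing is executed at this point.
mkActions :: IORef [Int] -> [IO ()]
mkActions ref = [ modifyIORef ref (++ [n]) | n <- [1, 2, 3] ]

-- Run the actions one by one, in list order.
runAll :: [IO ()] -> IO ()
runAll = sequence_
```
<br />
Because the actions are ordinary values, the list can be rearranged before it is run: <tt>runAll (reverse (mkActions ref))</tt> records 3, 2, 1.<br />
<br />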
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of these functions will be mentioned in the code of the function<br />
"main", to which the ultimate, topmost IO action of any Haskell program is bound.<br />
<br />
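To see the scenario above actually run, here is a sketch with made-up stand-ins: "someAction", "someOtherAction", "foo" and "bar" are not library functions, and the definitions below are just one arbitrary way to fill them in:<br />
<br />
<haskell><br />
-- Stand-ins for "someAction" and "someOtherAction": they do no real I/O<br />
-- and simply return fixed values<br />
someAction :: IO Int<br />
someAction = return 20<br />
<br />
someOtherAction :: IO String<br />
someOtherAction = return "abc"<br />
<br />
foo :: Int -> Int<br />
foo = (+ 1)<br />
<br />
bar :: String -> Int<br />
bar = length<br />
<br />
-- The same shape as "c" in the text<br />
c :: IO ()<br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
<br />
-- "c" reused as a building block of a bigger action<br />
process :: IO ()<br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Evaluating "process" in ghci should print "Will do some processing", then 3, 21 and "done" (that is "c" at work), and finally "Done".<br />
<br />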
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabulation to indent your sources, take into<br />
account that Haskell compilers assume that tabstop is 8 characters<br />
wide.<br />
<br />
If you don't want to indent your code, you could explicitly mark the end<br />
of each and every expression and use arbitrary layout:<br />
<haskell><br />
combine before after = do { before; putStrLn "In the middle"; after }<br />
<br />
main = do { combine c c<br />
; let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
; let d = combine (b) (combine c c)<br />
; putStrLn "So long!"<br />
} <br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
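<br />
A small sketch of this rule; "sayTwice", "sayAndCount" and "demo" are names invented for the example:<br />
<br />
<haskell><br />
-- The last action is "putStrLn s", so the whole block has type IO ()<br />
sayTwice :: String -> IO ()<br />
sayTwice s = do putStrLn s<br />
                putStrLn s<br />
<br />
-- The last action is "return (length s)", so the whole block has type IO Int<br />
sayAndCount :: String -> IO Int<br />
sayAndCount s = do putStrLn s<br />
                   return (length s)<br />
<br />
demo :: IO ()<br />
demo = do sayTwice "hi"<br />
          n <- sayAndCount "hello"<br />
          print n<br />
</haskell><br />
<br />
Evaluating "demo" in ghci should print "hi" twice, then "hello", then 5.<br />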
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide the commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]", which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC, to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser descriptions. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
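<br />
To get a feeling for "do" outside of IO, here is a small sketch that uses the standard "Maybe" type (which also happens to be a monad). The functions "safeDiv" and "calc" are invented for this example and do not come from any library:<br />
<br />
<haskell><br />
-- Division that fails (returns Nothing) instead of crashing on zero<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- "do" here combines Maybe values, not IO actions: if any step yields<br />
-- Nothing, the whole block yields Nothing<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do q <- safeDiv a b<br />
                r <- safeDiv q c<br />
                return (q + r)<br />
</haskell><br />
<br />
In ghci, "calc 100 5 2" should give "Just 30", while "calc 100 0 2" should give "Nothing". No I/O is involved anywhere.<br />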
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexity from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
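<br />
To give an idea of the kind of machinery that gets hidden, here is a deliberately tiny toy parser. It is '''not''' how [[Parsec]] is implemented: it tracks neither positions nor error messages, and all the names in it ("Parser", "runParser", "satisfy", "twoDigits") are defined right here for illustration only. Keep it in a scratch file of its own, since some of these names would clash with the real library:<br />
<br />
<haskell><br />
import Data.Char (isDigit)<br />
<br />
-- A toy parser: a function from the remaining input to either a failure<br />
-- or a result paired with the input left over<br />
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }<br />
<br />
instance Functor Parser where<br />
  fmap f (Parser p) = Parser $ \s -> case p s of<br />
    Nothing        -> Nothing<br />
    Just (a, rest) -> Just (f a, rest)<br />
<br />
instance Applicative Parser where<br />
  pure a = Parser $ \s -> Just (a, s)<br />
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of<br />
    Nothing        -> Nothing<br />
    Just (f, rest) -> case pa rest of<br />
      Nothing         -> Nothing<br />
      Just (a, rest') -> Just (f a, rest')<br />
<br />
-- The Monad instance is where the "machinery" lives: it passes the<br />
-- leftover input from one step to the next and stops at the first failure<br />
instance Monad Parser where<br />
  Parser p >>= f = Parser $ \s -> case p s of<br />
    Nothing        -> Nothing<br />
    Just (a, rest) -> runParser (f a) rest<br />
<br />
-- Consume one character satisfying the predicate<br />
satisfy :: (Char -> Bool) -> Parser Char<br />
satisfy ok = Parser $ \s -> case s of<br />
  (c:cs) | ok c -> Just (c, cs)<br />
  _             -> Nothing<br />
<br />
-- Two digits in a row; note that no input is passed around explicitly<br />
twoDigits :: Parser String<br />
twoDigits = do a <- satisfy isDigit<br />
               b <- satisfy isDigit<br />
               return [a, b]<br />
</haskell><br />
<br />
In ghci, evaluating runParser twoDigits "42x" should give Just ("42","x"), and runParser twoDigits "4x" should give Nothing. The "do" block in "twoDigits" never mentions the input string; the instance of "Monad" hauls it around for us.<br />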
<br />
Think of programming with monads as doing the remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
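A small sketch that puts the constructor and the derived "Show" instance to work. It uses the alternative spelling with constructor "D" from above; "describe" and "demo" are names invented for the example, and this should live in a scratch file of its own so that it does not clash with the "Dir" in "cd-fit.hs":<br />
<br />
<haskell><br />
-- Type "Dir" with data constructor "D"<br />
data Dir = D Int String deriving Show<br />
<br />
-- Pattern matching on the constructor takes a value apart again<br />
describe :: Dir -> String<br />
describe (D size name) = name ++ " takes " ++ show size ++ " bytes"<br />
<br />
demo :: IO ()<br />
demo = do print (D 1 "foo")               -- uses the derived Show instance<br />
          putStrLn (describe (D 1 "foo")) -- uses our own function<br />
</haskell><br />
<br />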
'''Exercises:''' <br />
* examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
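<br />
A short sketch of partial application and of the backtick syntax, using only Prelude functions. The names on the left-hand sides are invented for the example:<br />
<br />
<haskell><br />
-- "take" needs two arguments; giving it only one yields a new function<br />
firstThree :: [a] -> [a]<br />
firstThree = take 3<br />
<br />
-- Backticks turn a two-argument function into an infix operator,<br />
-- so these two definitions mean exactly the same<br />
viaPrefix, viaInfix :: Int<br />
viaPrefix = div 17 5<br />
viaInfix  = 17 `div` 5<br />
<br />
-- Operators can be partially applied as well<br />
addTen :: Int -> Int<br />
addTen = (+) 10<br />
</haskell><br />
<br />
In ghci, firstThree "foobar" should give "foo", both "viaPrefix" and "viaInfix" are 3, and "addTen 5" is 15.<br />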
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike the C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
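<br />
Here is a sketch of a tagged union at work. "parseAge" and "report" are hypothetical helpers written for this example, not part of any library:<br />
<br />
<haskell><br />
import Data.Char (isDigit)<br />
<br />
-- Turn a string into a number, or explain why that is impossible<br />
parseAge :: String -> Either String Int<br />
parseAge s<br />
  | null s        = Left "empty input"<br />
  | all isDigit s = Right (read s)<br />
  | otherwise     = Left ("not a number: " ++ s)<br />
<br />
-- The constructor is the "tag": pattern matching tells us which case we got<br />
report :: Either String Int -> String<br />
report (Left err)  = "Error: " ++ err<br />
report (Right age) = "Age is " ++ show age<br />
</haskell><br />
<br />
In ghci, report (parseAge "42") should give "Age is 42", while report (parseAge "x") should give "Error: not a number: x".<br />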
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about a single directory is a size (number), some spaces,<br />
-- then directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes since the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up], if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
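A small sketch of what record syntax buys us. The helper "total_size" and the sample values "d1" and "d2" are made up for the example:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving Show<br />
<br />
-- Field names double as accessor functions<br />
total_size :: [Dir] -> Integer<br />
total_size ds = sum (map dir_size ds)<br />
<br />
-- Values can be built positionally or with named fields, in any order<br />
d1, d2 :: Dir<br />
d1 = Dir 65572 "dir1"<br />
d2 = Dir {dir_name = "dir2", dir_size = 68268}<br />
</haskell><br />
<br />
In ghci, "total_size [d1, d2]" should give 133840, and "dir_name d2" should give "dir2".<br />
<br />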
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on a single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 MB CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Arguments are swapped so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort a list of "Dir" by size only, we use a custom comparison function and the parameterized sort "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we applied the data constructor "DirPack" positionally; the field names "pack_size" and "dirs" are not needed for construction.<br />
----<br />
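<br />
To get you started with those explorations, here is a sketch that touches "foldl", the "...By" family and "let" versus "where". All function names are invented for the example; "comparing" comes from the standard module Data.Ord:<br />
<br />
<haskell><br />
import Data.List (sortBy, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- foldl walks the list left to right, threading an accumulator through<br />
sumSizes :: [Integer] -> Integer<br />
sumSizes = foldl (+) 0<br />
<br />
-- The "...By" functions take the comparison as an extra argument;<br />
-- here pairs are ordered by their second component only<br />
sortBySecond :: [(String, Integer)] -> [(String, Integer)]<br />
sortBySecond = sortBy (comparing snd)<br />
<br />
-- Fails on an empty list, just like maximumBy itself<br />
biggest :: [(String, Integer)] -> (String, Integer)<br />
biggest = maximumBy (comparing snd)<br />
<br />
-- The same helper written once with "let" and once with "where"<br />
tripleSquareLet, tripleSquareWhere :: Integer -> Integer<br />
tripleSquareLet r = let sq = r * r in 3 * sq<br />
tripleSquareWhere r = 3 * sq<br />
  where sq = r * r<br />
</haskell><br />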
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a couple of lines:<br />
<br />
<haskell><br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': packing the directories returned by "greedy_pack" once more should produce a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 MB<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of n*10+1 (that is, 11 to 3001) chars, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
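<br />
The listing above is written against the QuickCheck of its day, in which "coarbitrary" was a method of "Arbitrary". Assuming you have QuickCheck 2.x installed (an assumption about your setup, not something this tutorial relies on), where "coarbitrary" lives in a separate class, the same test might be sketched as below. The sketch is self-contained, so it repeats the packing code with type signatures added, sorts biggest first as the text describes, and generates shorter names (1 to 30 characters, an arbitrary choice) to keep things quick:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
import Data.List (sortBy)<br />
<br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving Show<br />
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show<br />
<br />
media_size :: Integer<br />
media_size = 700*1024*1024<br />
<br />
greedy_pack :: [Dir] -> DirPack<br />
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize ds<br />
  where cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
maybe_add_dir :: DirPack -> Dir -> DirPack<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
  in if new_size > media_size then p else DirPack new_size (d : dirs p)<br />
<br />
-- With QuickCheck 2.x only "arbitrary" has to be defined here<br />
instance Arbitrary Dir where<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- vectorOf n g runs the generator g exactly n times<br />
          gen_name = do n <- choose (1,30)<br />
                        vectorOf n (elements "fubar/")<br />
<br />
prop_greedy_pack_is_fixpoint :: [Dir] -> Bool<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds<br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Load it in ghci and evaluate "quickCheck prop_greedy_pack_is_fixpoint", just as in the session above.<br />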
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
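<br />
As a sketch, here is a home-made typeclass together with two instances. The class "Sized" and the types "File" and "Folder" are hypothetical and exist only in this example:<br />
<br />
<haskell><br />
-- The contract: a type is "Sized" if it can report its size in bytes<br />
class Sized a where<br />
  sizeInBytes :: a -> Integer<br />
<br />
-- Two unrelated types fulfilling the contract<br />
newtype File = File Integer<br />
data Folder = Folder [File]<br />
<br />
instance Sized File where<br />
  sizeInBytes (File n) = n<br />
<br />
instance Sized Folder where<br />
  sizeInBytes (Folder fs) = sum (map sizeInBytes fs)<br />
<br />
-- A generic function: it works for any type that is an instance of Sized<br />
fitsOn :: Sized a => Integer -> a -> Bool<br />
fitsOn capacity x = sizeInBytes x <= capacity<br />
</haskell><br />
<br />
Here "fitsOn" knows nothing about files or folders; all it requires is that its argument honours the "Sized" contract.<br />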
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it, "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary.<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements such an<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for its functions when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" is either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal definition, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
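<br />
A sketch of a minimal definition in action, on a hypothetical "Version" type (deliberately not "Dir", so as not to spoil the exercises that follow). The function "older" is likewise invented for the example:<br />
<br />
<haskell><br />
-- A made-up version number: major and minor part<br />
data Version = Version Int Int<br />
<br />
-- "Ord" requires "Eq", so we provide that first<br />
instance Eq Version where<br />
  Version a b == Version c d = a == c && b == d<br />
<br />
-- Only "compare" is defined here; (<), (<=), (>), (>=), max and min<br />
-- come from the default implementations in the class<br />
instance Ord Version where<br />
  compare (Version a b) (Version c d) = compare (a, b) (c, d)<br />
<br />
older :: Version -> Version -> Bool<br />
older v w = v < w<br />
</haskell><br />
<br />
In ghci, "older (Version 1 2) (Version 1 10)" should give True, even though we never wrote a definition of (<) ourselves.<br />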
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Integer, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
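<br />
A sketch of deriving several typeclasses at once, on a made-up "Color" type; "sorted" and "allColors" are names invented for the example:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Color = Red | Green | Blue<br />
  deriving (Eq, Ord, Show, Enum, Bounded)<br />
<br />
-- Derived Ord: constructors are ordered as listed in the declaration<br />
sorted :: [Color]<br />
sorted = sort [Blue, Red, Green]<br />
<br />
-- Derived Enum and Bounded: enumerate every value of the type<br />
allColors :: [Color]<br />
allColors = [minBound .. maxBound]<br />
</haskell><br />
<br />
In ghci, both "sorted" and "allColors" should print as [Red,Green,Blue], and the printing itself is the work of the derived "Show".<br />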
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any type that is an instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 MB<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of n*10+1 (that is, 11 to 3001) chars, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Integer" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Integer" and "String" are the components of "Dir", we sure must be able to use "Gen<br />
Integer" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
 *Main> :t liftM2 Dir<br />
 liftM2 Dir :: (Monad m) => m Integer -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Integer -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
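<br />
To make the last point concrete, here is a sketch that re-defines "liftM2" under the made-up name "myLiftM2" and uses both versions outside of "Gen". "sumMaybe" and "bothIO" are likewise invented for the example:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
-- The definition quoted above, under a different name to avoid a clash<br />
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r<br />
myLiftM2 f a1 a2 = do x <- a1<br />
                      y <- a2<br />
                      return (f x y)<br />
<br />
-- Lifting (+) into Maybe: the result is Nothing if either argument is<br />
sumMaybe :: Maybe Int -> Maybe Int -> Maybe Int<br />
sumMaybe = liftM2 (+)<br />
<br />
-- Lifting the pair constructor into IO: run both actions, pair the results<br />
bothIO :: IO a -> IO b -> IO (a, b)<br />
bothIO = myLiftM2 (,)<br />
</haskell><br />
<br />
In ghci, "sumMaybe (Just 1) (Just 2)" should give "Just 3", and "sumMaybe (Just 1) Nothing" should give "Nothing".<br />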
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its run time and<br />
memory consumption acceptable? Are its results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we have precomputed disks of all possible sizes for the given set of dirs, the solution to <br />
-- a particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm as to write an implementation of it. Nice, eh?<br />
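<br />
The trick behind <tt>precomp = map bestDisk [0..]</tt> is a lazily built list that doubles as a lookup table: each element is computed when it is first demanded and then remembered. Here is the same idea on a smaller problem, as a sketch only (the names <tt>fibs</tt> and <tt>fib</tt> are made up for this illustration and have nothing to do with cd-fit):<br />
<br />
<haskell><br />
-- Lazily built table: element n is computed on first lookup and then remembered<br />
fibs :: [Integer]<br />
fibs = map fib [0..]<br />
  where<br />
    fib :: Int -> Integer<br />
    fib 0 = 0<br />
    fib 1 = 1<br />
    -- The recursive calls go through the table, not through fib itself<br />
    fib n = fibs !! (n - 1) + fibs !! (n - 2)<br />
</haskell><br />
<br />
In ghci, <tt>fibs !! 30</tt> gives 832040. Keep in mind that <tt>!!</tt> walks the list from its head on every lookup, and that the whole table stays in memory for as long as anything refers to it.<br />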
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs<br />
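<br />
As a warm-up for the de-sugaring exercise, here is a different and much smaller list comprehension next to one possible hand-written equivalent. This is just a sketch (the names <tt>sugared</tt>, <tt>desugared</tt> and <tt>step</tt> are made up, and a compiler is free to translate comprehensions differently):<br />
<br />
<haskell><br />
-- Squares of the positive elements, written as a list comprehension ...<br />
sugared :: [Int] -> [Int]<br />
sugared xs = [ sq | x <- xs, x > 0, let sq = x * x ]<br />
<br />
-- ... and the same thing written with concatMap, "if" and "let"<br />
desugared :: [Int] -> [Int]<br />
desugared xs = concatMap step xs<br />
  where step x = if x > 0<br />
                   then let sq = x * x in [sq]<br />
                   else []<br />
</haskell><br />
<br />
Both functions turn <tt>[-1,2,0,3]</tt> into <tt>[4,9]</tt>: the generator becomes <tt>concatMap</tt>, the guard becomes <tt>if</tt>, and the local binding stays a <tt>let</tt>.<br />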
<br />
---- <br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
 *Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we cannot use them: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>prop_dynamic_pack_is_fixpoint</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since the computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC releases spell the last profiling flag <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                       MODULE        %time %alloc<br />
<br />
 precomputeDisksFor                Main           88.1   99.8<br />
 dynamic_pack                      Main           11.0    0.0<br />
<br />
                                                           individual      inherited<br />
 COST CENTRE                       MODULE    no.    entries  %time %alloc    %time %alloc<br />
<br />
 MAIN                              MAIN        1          0    0.0    0.0    100.0  100.0<br />
  CAF                              Main      174         11    0.9    0.2    100.0  100.0<br />
   prop_dynamic_pack_small_disk    Main      181        100    0.0    0.0     99.1   99.8<br />
    dynamic_pack                   Main      182        200   11.0    0.0     99.1   99.8<br />
     precomputeDisksFor            Main      183        200   88.1   99.8     88.1   99.8<br />
   main                            Main      180          1    0.0    0.0      0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 Mb, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program's run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size,<br />
onto imaginary disks of 50000 bytes... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result: an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for nearly all of the<br />
memory allocation and about 90% of the run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (Wondering why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>).<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we materialize far more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                          , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably, because all of them are too big).<br />
          -- Well, just report that disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                       MODULE        %time %alloc<br />
<br />
 CAF                               GHC.Float       0.0    4.4<br />
 main                              Main            0.0   93.9<br />
<br />
                                                           individual      inherited<br />
 COST CENTRE                       MODULE    no.    entries  %time %alloc    %time %alloc<br />
 MAIN                              MAIN        1          0    0.0    0.0      0.0  100.0<br />
  main                             Main      180          1    0.0   93.9      0.0   94.2<br />
   prop_dynamic_pack_small_disk    Main      181        100    0.0    0.0      0.0    0.3<br />
    dynamic_pack                   Main      182        200    0.0    0.2      0.0    0.3<br />
     bestDisk                      Main      183        200    0.0    0.1      0.0    0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of about 640! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently:<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
                                          , d <- small_enough<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
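<br />
The amended comprehension relies on <tt>delete</tt> from Data.List, which removes the first occurrence of an element from a list. The "pick one candidate, recurse on the rest" pattern can be sketched in isolation like this (<tt>picks</tt> is a name made up for this illustration, not part of cd-fit):<br />
<br />
<haskell><br />
import Data.List (delete)<br />
<br />
-- Every way to pick one element, paired with the candidates that remain.<br />
-- "delete x xs" drops only the first occurrence of x, so duplicates survive.<br />
picks :: Eq a => [a] -> [(a, [a])]<br />
picks xs = [ (x, delete x xs) | x <- xs ]<br />
</haskell><br />
<br />
For example, <tt>picks [1,2,3]</tt> yields <tt>[(1,[2,3]),(2,[1,3]),(3,[1,2])]</tt>. Since every recursive call gets a shorter list, the recursion is guaranteed to bottom out, and a directory can no longer be packed twice.<br />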
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
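<br />
One possible starting point for the scaling exercise is sketched below. It is only an assumption about how the idea could be coded: <tt>scaleProblem</tt> is a hypothetical helper that works on plain sizes rather than on "Dir" values, and the rounding directions are a design choice of this sketch, not something prescribed above.<br />
<br />
<haskell><br />
-- Hypothetical helper: express all sizes in units of `unit' bytes.<br />
-- Directory sizes are rounded up and the medium size is rounded down, so<br />
-- anything that fits after scaling is guaranteed to fit in real bytes too.<br />
scaleProblem :: Int -> Int -> [Int] -> (Int, [Int])<br />
scaleProblem unit media sizes =<br />
  ( media `div` unit<br />
  , [ (s + unit - 1) `div` unit | s <- sizes ] )<br />
</haskell><br />
<br />
With <tt>unit</tt> set to the size of the smallest directory (for instance <tt>minimum sizes</tt>), <tt>scaleProblem 10 105 [10,25,40]</tt> gives <tt>(10,[1,3,4])</tt>. The price of the smaller numbers is precision: because of the rounding, the packing found for the scaled problem may be slightly worse than the true optimum.<br />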
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
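<br />
To check the claim yourself, both forms can be put side by side in one file. In the sketch below everything except the two bodies is a hypothetical stand-in invented so that the snippet compiles on its own (<tt>someAction</tt> and <tt>someOtherAction</tt> just return numbers, <tt>foo</tt> and <tt>bar</tt> are arbitrary functions):<br />
<br />
<haskell><br />
-- Hypothetical stand-ins, made up for this illustration<br />
someAction, someOtherAction :: IO Int<br />
someAction      = return 1<br />
someOtherAction = return 2<br />
<br />
foo, bar :: Int -> Int<br />
foo = (+ 10)<br />
bar = (* 10)<br />
<br />
-- The "do" version ...<br />
sugared :: IO ()<br />
sugared = do a <- someAction<br />
             b <- someOtherAction<br />
             print (bar b)<br />
             print (foo a)<br />
             print "done"<br />
<br />
-- ... and the same action written with ">>=" and ">>"<br />
desugared :: IO ()<br />
desugared = someAction >>= \a -><br />
            someOtherAction >>= \b -><br />
            print (bar b) >><br />
            print (foo a) >><br />
            print "done"<br />
</haskell><br />
<br />
Run <tt>sugared</tt> and then <tt>desugared</tt> at the ghci prompt: both should print the same three lines.<br />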
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=4209Hitchhikers guide to Haskell2006-05-28T19:53:09Z<p>Adept: Added yet another bit on layout rules</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 Mb's in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
 $ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
 Loading package base-1.0 ... linking ... done.<br />
 Prelude> :type getContents<br />
 getContents :: IO String<br />
 Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
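The "list of IO actions" idea from the previous paragraph can be sketched with <tt>sequence_</tt> from the Prelude, which runs the actions of a list one by one and throws their results away. The names <tt>greetings</tt>, <tt>runAll</tt> and <tt>runBackwards</tt> are made up for this illustration:<br />
<br />
<haskell><br />
-- A plain list of IO actions: nothing is printed when the list is built<br />
greetings :: [IO ()]<br />
greetings = [ putStrLn "one", putStrLn "two", putStrLn "three" ]<br />
<br />
-- Executing them one by one, in list order<br />
runAll :: IO ()<br />
runAll = sequence_ greetings<br />
<br />
-- Actions are ordinary values, so the list can be rearranged before running<br />
runBackwards :: IO ()<br />
runBackwards = sequence_ (reverse greetings)<br />
</haskell><br />
<br />
Evaluating <tt>runAll</tt> at the ghci prompt prints the three words in order, while <tt>runBackwards</tt> prints them in reverse.<br />
<br />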
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
Notice how we carefully indent lines so that the source looks neat?<br />
Actually, Haskell code has to be aligned this way, or it will not<br />
compile. If you use tabs to indent your sources, take into<br />
account that Haskell compilers assume that a tab stop is 8 characters<br />
wide.<br />
<br />
If you don't want to indent your code, you can explicitly specify the end<br />
of each and every expression and use arbitrary layout:<br />
<haskell><br />
combine before after = do { before; putStrLn "In the middle"; after }<br />
<br />
main = do { combine c c<br />
          ; let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          ; let d = combine (b) (combine c c)<br />
          ; putStrLn "So long!"<br />
          } <br />
</haskell><br />
<br />
Back to the exercise - see how we construct code out of thin air? Try<br />
to imagine what this code will do, then run it and check yourself.<br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide the commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know.<br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
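<br />
As a small taste of "do" outside of IO, here is a sketch that uses the standard "Maybe" type, which also happens to be a monad. The helper names "safeDiv" and "calc" are invented for this example, and nothing here touches IO or Parsec:<br />
<br />
<haskell><br />
-- Division that reports failure with "Nothing" instead of crashing<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- The same "do" notation, but every step is a "Maybe" action, not an IO action<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do q <- safeDiv a b<br />
                r <- safeDiv q c<br />
                return (q + r)<br />
</haskell><br />
<br />
In ghci, "calc 100 5 2" gives "Just 30", while "calc 1 0 2" gives "Nothing": the first failed step aborts the rest of the block, and we did not have to write that check ourselves.<br />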
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit.<br />
<br />
Think of programming with monads as doing some remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
, which would define ''data[[type]]'' "Dir" with ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
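<br />
To see the two roles side by side, here is a minimal sketch that uses the alternative definition with the constructor "D". The names "example" and "sizeOf" are made up for this illustration; we will not need them in cd-fit.hs:<br />
<br />
<haskell><br />
-- "Dir" is the name of the type, "D" is the name of its data constructor<br />
data Dir = D Int String deriving Show<br />
<br />
-- The type name appears in type signatures ...<br />
example :: Dir<br />
-- ... while the constructor is used to build values ...<br />
example = D 65572 "dir1"<br />
<br />
-- ... and to take them apart again by pattern matching<br />
sizeOf :: Dir -> Int<br />
sizeOf (D size _) = size<br />
</haskell><br />
<br />
Here "sizeOf example" evaluates to 65572.<br />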
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
'''Exercises:''' <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another piece of syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
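<br />
Both notes from the last exercise are easy to try out on ordinary functions first. The following sketch uses only the Prelude; the names "dropOne", "addThree" and "sameThing" are invented for the illustration:<br />
<br />
<haskell><br />
-- "drop" takes two arguments; given only the first one, it yields a new<br />
-- function that still waits for the list<br />
dropOne :: [a] -> [a]<br />
dropOne = drop 1<br />
<br />
-- Operators can be partially applied too<br />
addThree :: Int -> Int<br />
addThree = (+) 3<br />
<br />
-- Backticks turn a two-argument function into an infix operator,<br />
-- so both sides of (==) mean exactly the same thing<br />
sameThing :: Bool<br />
sameThing = (10 `div` 2) == div 10 2<br />
</haskell><br />
<br />
In ghci, "dropOne "foobar"" gives "oobar", "addThree 4" gives 7, and "sameThing" is True.<br />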
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function, which,<br />
given a parser, a name of the source file or channel (e.g. "stdin"), and<br />
source data (String, which is a list of "Char"s, which is written "[Char]"),<br />
will either produce a parse error, or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names that differ<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which could<br />
hold value of one of the two distinct types. However, unlike C/C++ "union",<br />
when presented with value of type "Either Int Char" we could immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
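<br />
Here is a minimal sketch of "looking at the constructor". The function "describe" is a made-up example; it uses pattern matching to find out which of the two alternatives it was given:<br />
<br />
<haskell><br />
-- One equation per constructor: the "tag" tells us which type is inside<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
</haskell><br />
<br />
In ghci, "describe (Left 5)" gives "an Int: 5" and "describe (Right 'x')" gives "a Char: x". We will use the very same technique on the result of "parse" in a moment.<br />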
<br />
Did you also notice that we provide "parse" with parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
'''Exercise:'''<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about single directory is a size (number), some spaces,<br />
-- then directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes against the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on CD one by one, until there is no room for more. We will need to track<br />
which directories we added to CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments of "compare" are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
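<br />
If you want a starting point for these explorations, here is a small sketch that is independent of cd-fit.hs. All function names in it are invented for the illustration:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- "foldl" walks the list from the left, carrying an accumulator along<br />
total :: [Int] -> Int<br />
total xs = foldl (+) 0 xs<br />
<br />
-- "sortBy" takes the comparison function as its first argument;<br />
-- here the local helper lives in a "where" clause<br />
longestFirst :: [String] -> [String]<br />
longestFirst ws = sortBy cmpLen ws<br />
  where cmpLen a b = compare (length b) (length a)<br />
<br />
-- The same kind of local definition, written with "let ... in"<br />
average :: [Int] -> Int<br />
average xs =<br />
  let n = length xs<br />
  in if n == 0 then 0 else total xs `div` n<br />
</haskell><br />
<br />
In ghci, "total [1,2,3]" gives 6, "longestFirst ["aa","b","cccc"]" gives ["cccc","aa","b"], and "average [1,2,3,6]" gives 3.<br />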
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a few lines:<br />
<br />
<haskell><br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a reusable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy.<br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
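<br />
As a sketch of what such a contract looks like in code, here is a tiny made-up typeclass. "Sized", "size" and "totalSize" are hypothetical names chosen for this illustration; they are not part of any library:<br />
<br />
<haskell><br />
-- The contract: a type is "Sized" if it can report its size as an Int<br />
class Sized a where<br />
  size :: a -> Int<br />
<br />
-- Two types that fulfill the contract<br />
instance Sized Bool where<br />
  size _ = 1<br />
<br />
instance Sized [b] where<br />
  size xs = length xs<br />
<br />
-- A generic function: it works for any type that is an instance of "Sized",<br />
-- and touches its argument only through the "size" function<br />
totalSize :: Sized a => [a] -> Int<br />
totalSize things = sum (map size things)<br />
</haskell><br />
<br />
In ghci, "totalSize ["ab","cde"]" gives 5 and "totalSize [True,False]" gives 2, while "totalSize [1.5]" is rejected by the typechecker, because nobody declared an instance of "Sized" for fractional numbers.<br />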
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with the signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary.<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal definition, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
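<br />
Here is a minimal sketch of this mechanism at work. The type "Priority" and the helper "rank" are made up for this example; the instance defines only "compare" and leaves everything else to the defaults:<br />
<br />
<haskell><br />
data Priority = Low | Medium | High deriving (Show, Eq)<br />
<br />
-- Helper that maps each constructor to a number<br />
rank :: Priority -> Int<br />
rank Low    = 0<br />
rank Medium = 1<br />
rank High   = 2<br />
<br />
-- We define only "compare"; (<), (<=), (>), (>=), "max" and "min"<br />
-- come from the default implementations in the class<br />
instance Ord Priority where<br />
  compare a b = compare (rank a) (rank b)<br />
</haskell><br />
<br />
In ghci, "Low < High" gives True and "max Low Medium" gives Medium, even though we never wrote (<) or "max" for "Priority" ourselves.<br />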
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
'''Exercises:'''<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Does it remind you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
arbitrary = liftM2 Dir gen_size gen_name<br />
        -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 11 to 3001 chars long, consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2".<br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
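<br />
To get a feeling for "liftM2" without QuickCheck, you could try it in a simpler monad. The sketch below uses the standard "Maybe" type; "myLiftM2" is our own copy of the definition quoted above under a different name, and "sumOfJusts" and "noSum" are made-up examples:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
-- The definition from the list above, spelled out with a type signature<br />
myLiftM2 :: Monad m => (a -> b -> r) -> m a -> m b -> m r<br />
myLiftM2 f a1 a2 = do x <- a1<br />
                      y <- a2<br />
                      return (f x y)<br />
<br />
-- Both arguments are present, so the plain function (+) is applied "inside"<br />
sumOfJusts :: Maybe Int<br />
sumOfJusts = liftM2 (+) (Just 1) (Just 2)<br />
<br />
-- If any argument is missing, so is the result<br />
noSum :: Maybe Int<br />
noSum = myLiftM2 (+) (Just 1) Nothing<br />
</haskell><br />
<br />
In ghci, "sumOfJusts" gives "Just 3" and "noSum" gives "Nothing". In the same way, "liftM2 Dir" applies the plain constructor "Dir" to the results of two generators.<br />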
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if yes - in which particular way? Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write the implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
'''Exercises:'''<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse the modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at the ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which can lead to nicely readable code but, when abused, can lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try it anyway.)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line.<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs.<br />
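<br />
As a warm-up for the de-sugaring exercise, here is a minimal sketch of how a different, smaller comprehension might be rewritten by hand with <tt>concatMap</tt>. The names <tt>sugared</tt>, <tt>desugared</tt> and <tt>step</tt> are made up for this illustration, and this is only one of several equivalent translations:<br />
<br />
<haskell><br />
-- A comprehension with a generator, a guard and a local binding...<br />
sugared :: [Int] -> [Int]<br />
sugared xs = [ y + 1 | x <- xs, x > 0, let y = x * x ]<br />
<br />
-- ...and one possible hand-written equivalent: every element of the<br />
-- input yields either a one-element list or an empty one.<br />
desugared :: [Int] -> [Int]<br />
desugared xs = concatMap step xs<br />
  where<br />
    step x | x > 0     = let y = x * x in [y + 1]<br />
           | otherwise = []<br />
</haskell><br />
<br />
Both functions should agree on any input list; comparing them on a few lists in ghci is a quick way to convince yourself.<br />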
<br />
---- <br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack ds <br />
in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You did take my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> builds quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem: computing the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack media_size ds <br />
in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
let pack = dynamic_pack 50000 ds<br />
in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all the memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program's run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked "VOID". <br />
This means that the allocated memory was never used. Notice that there are<br />
'''no''' areas marked "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
onto imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
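<br />
The mismatch is easy to reproduce in isolation. The sketch below assumes that <tt>Dir</tt> is a record with the size first and the name second (as the ghci session above suggests); the field name <tt>dir_name</tt>, the helper <tt>candidates</tt> and the sample data are made up for this illustration:<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
<br />
-- Assumed shape of Dir: size in bytes first, then the name.<br />
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Show, Eq)<br />
<br />
-- The same kind of filter that bestDisk uses to pick candidate dirs.<br />
candidates :: Int -> [Dir] -> [Dir]<br />
candidates limit = filter (inRange (1, limit) . dir_size)<br />
<br />
-- Hypothetical sample: directories of 10 Mb and 1400 Mb.<br />
sampleDirs :: [Dir]<br />
sampleDirs = [ Dir (10 * 1024 * 1024) "small", Dir (1400 * 1024 * 1024) "huge" ]<br />
</haskell><br />
<br />
With these definitions, <tt>candidates 50000 sampleDirs</tt> is empty, while <tt>candidates (700*1024*1024) sampleDirs</tt> keeps only the "small" directory. No candidate ever fits on a 50000-byte disk, so every packing comes out empty.<br />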
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which, according to the profile, is responsible for about 88% of the<br />
run time and 99.8% of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it has been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can never be sure that an element<br />
will not be needed again and can therefore be garbage-collected.<br />
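<br />
The same pattern can be seen in miniature in the sketch below. It is not part of <tt>cd-fit.hs</tt>; the names <tt>table</tt> and <tt>entry</tt> and the recurrence itself are invented purely to show a lazily built list that looks up its own earlier elements with <tt>!!</tt>:<br />
<br />
<haskell><br />
-- table !! n holds the answer for n. Each entry is defined in terms of an<br />
-- earlier entry, so the list must stay reachable for as long as any entry<br />
-- might still be demanded.<br />
table :: [Integer]<br />
table = map entry [0 ..]<br />
  where<br />
    entry :: Int -> Integer<br />
    entry 0 = 0<br />
    entry n = 1 + table !! (n `div` 2)<br />
</haskell><br />
<br />
Asking for <tt>table !! 100000</tt> walks the spine of the list up to that index, creating one unevaluated "thunk" per element on the way, even though only a handful of entries are ever evaluated. Our "precomp" list behaves the same way, only with much bigger indices.<br />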
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Lets opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
-- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Lets do that for all "candidate" dirs that<br />
-- are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are not lucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes may be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
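<br />
For reference, here is how the amended pieces might fit together in one self-contained file. This is a sketch under assumptions: the <tt>Dir</tt> and <tt>DirPack</tt> declarations are reconstructed from the ghci output shown earlier (the field name <tt>dir_name</tt> is a guess), the range <tt>(1,limit)</tt> is used so that the separate <tt>dir_size d > 0</tt> guard can be dropped, and <tt>comparing</tt> stands in for the hand-written <tt>cmpSize</tt>:<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Assumed data types (delete needs the Eq instance on Dir).<br />
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Show, Eq)<br />
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving Show<br />
<br />
bestDisk :: Int -> [Dir] -> DirPack<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit ds =<br />
  case [ DirPack (dir_size d + s) (d : rest)<br />
       | let small_enough = filter (inRange (1, limit) . dir_size) ds<br />
       , d <- small_enough<br />
         -- Recurse only on the dirs that still fit, minus the one just taken.<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
<br />
dynamic_pack :: Int -> [Dir] -> DirPack<br />
dynamic_pack = bestDisk<br />
</haskell><br />
<br />
For example, <tt>pack_size (dynamic_pack 10 [Dir 4 "a", Dir 7 "b", Dir 3 "c"])</tt> evaluates to 10, because the directories of size 7 and 3 fill the disk completely.<br />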
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
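<br />
To see why a greedy strategy can lose, consider the small sketch below. It works on bare sizes rather than on <tt>Dir</tt> values, and both functions are hypothetical stand-ins written for this illustration: <tt>greedyFirstFit</tt> is not the <tt>greedy_pack</tt> from the earlier chapters, and <tt>bruteForceBest</tt> simply tries every subset instead of using <tt>bestDisk</tt>:<br />
<br />
<haskell><br />
import Data.List (subsequences)<br />
<br />
-- Walk the sizes in the given order and take every one that still fits.<br />
greedyFirstFit :: Int -> [Int] -> Int<br />
greedyFirstFit limit = foldl add 0<br />
  where<br />
    add used size | used + size <= limit = used + size<br />
                  | otherwise            = used<br />
<br />
-- Try every subset and keep the best total that fits.<br />
-- Exponential, so only for tiny inputs; assumes limit >= 0.<br />
bruteForceBest :: Int -> [Int] -> Int<br />
bruteForceBest limit sizes =<br />
  maximum [ total | sub <- subsequences sizes, let total = sum sub, total <= limit ]<br />
</haskell><br />
<br />
With a limit of 10 and the sizes <tt>[6,5,5]</tt>, the greedy walk grabs 6 and then finds that nothing else fits, while the exhaustive search finds 5 + 5 = 10.<br />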
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
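<br />
One possible starting point for such scaling is sketched below, again on bare sizes. The function name <tt>scaleProblem</tt> and the rounding policy are assumptions of this sketch: directory sizes are rounded up and the limit is rounded down, so that any packing which fits in the scaled units also fits in the original ones (at the price of possibly missing some tighter packings):<br />
<br />
<haskell><br />
-- Scale a size limit and a list of directory sizes down by the size of<br />
-- the smallest non-empty directory.<br />
scaleProblem :: Int -> [Int] -> (Int, [Int])<br />
scaleProblem limit sizes<br />
  | null positive = (limit, sizes)   -- nothing to scale by<br />
  | otherwise     = (limit `div` unit, map roundUp sizes)<br />
  where<br />
    positive  = filter (> 0) sizes<br />
    unit      = minimum positive<br />
    roundUp s = (s + unit - 1) `div` unit<br />
</haskell><br />
<br />
For instance, <tt>scaleProblem 700 [100, 250, 300]</tt> gives <tt>(7,[1,3,3])</tt>: the recursion then has to consider disk sizes from 0 to 7 instead of 0 to 700.<br />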
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
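<br />
If you would like to check the equivalence for yourself, here is a minimal sketch. To make the two forms comparable as plain values, it swaps <tt>IO</tt> for the pair monad from <tt>base</tt> (the first component of the pair collects the "printed" lines), which assumes a reasonably recent GHC; the names <tt>say</tt>, <tt>withDo</tt> and <tt>withBind</tt> are made up for this example:<br />
<br />
<haskell><br />
-- A tiny logging action: the first component of the pair collects output.<br />
say :: String -> ([String], ())<br />
say msg = ([msg], ())<br />
<br />
-- Written with do-notation...<br />
withDo :: ([String], ())<br />
withDo = do<br />
  a <- ([], 1 :: Int)<br />
  b <- ([], 2 :: Int)<br />
  say (show (b * 10))<br />
  say (show (a + 1))<br />
  say "done"<br />
<br />
-- ...and the same thing spelled out with >>= and >>.<br />
withBind :: ([String], ())<br />
withBind =<br />
  ([], 1 :: Int) >>= \a -><br />
  ([], 2 :: Int) >>= \b -><br />
  say (show (b * 10)) >><br />
  say (show (a + 1)) >><br />
  say "done"<br />
</haskell><br />
<br />
Evaluating <tt>withDo == withBind</tt> in ghci should give <tt>True</tt>, since both produce the same log in the same order.<br />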
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Books&diff=3971Books2006-05-03T07:13:05Z<p>Adept: Added link to "Hitchhikers guide to Haskell"</p>
<hr />
<div>Books and tutorials covering many aspects of Haskell.<br />
<br />
=Language Definition=<br />
<br />
<DL><br />
<DT>Simon Peyton Jones: [http://titles.cambridge.org/catalogue.asp?isbn=0521826144 <EM>Haskell 98 Language and Libraries</EM>], Cambridge University Press, 2003, Hardback, 272 pages, ISBN: 0521826144, £35.00<br />
<BR><br />
<BLOCKQUOTE><br />
<B>Book Description</B><BR> <br />
Haskell is the world's leading lazy functional programming language,<br />
widely used for teaching, research, and applications. The language<br />
continues to develop rapidly, but in 1998 the community decided to<br />
capture a stable snapshot of the language: Haskell 98. All Haskell<br />
compilers support Haskell 98, so practitioners and educators alike<br />
have a stable base for their work. This book constitutes the agreed<br />
definition of the Haskell 98, both the language itself and its<br />
supporting libraries. It has been considerably revised and refined<br />
since the original definition, and appears in print for the first<br />
time. It should be a standard reference work for anyone involved in<br />
research, teaching, or application of Haskell.<br />
</BLOCKQUOTE> <br />
The entire language definition is also available online:<br />
[[Language_and_library_specification|Language and library specification]]<br />
</DT><br />
</DL><br />
<br />
= Textbooks=<br />
<br />
<DL><br />
<DT>Paul Hudak: [http://www.haskell.org/soe <EM>The Haskell School of Expression: Learning Functional Programming through Multimedia</EM>], Cambridge University Press, New York, 2000, 416<br />
pp, 15 line diagrams, 75 exercises, Paperback $29.95, ISBN:<br />
0521644089, Hardback $74.95, ISBN: 0521643384<br />
<blockquote><br />
<B>Book Description</B><BR> <br />
This book teaches functional programming as a way of thinking and<br />
problem solving, using Haskell, the most popular purely functional<br />
language. Rather than using the conventional mathematical examples<br />
commonly found in other programming language textbooks, the author<br />
draws examples from multimedia applications, including graphics,<br />
animation, and computer music, thus rewarding the reader with working<br />
programs for inherently more interesting applications. Aimed at both<br />
beginning and advanced programmers, this tutorial begins with a gentle<br />
introduction to functional programming and moves rapidly on to more<br />
advanced topics. An underlying theme is the design and implementation<br />
of domain specific languages, using three examples: FAL (a Functional<br />
Animation Language), IRL (an Imperative Robot Language), and MDL (a<br />
Music Description Language). Details about programming in Haskell<br />
are presented in boxes throughout the text so they can be easily<br />
referred to and found quickly.<br />
<br />
The book's Web Site contains source files for all programs in the<br />
text, as well as the graphics libraries to run them under Windows and<br />
Linux platforms. It also contains PowerPoint slides useful for<br />
teaching a course using the textbook.<br />
</blockquote><br />
<DT>Simon Thompson: [http://www.cs.ukc.ac.uk/people/staff/sjt/craft2e/ <EM>Haskell: The Craft of Functional Programming</EM>], Second Edition,<br />
Addison-Wesley, 507&nbsp;pages, paperback, 1999. ISBN<br />
0-201-34275-8.<br />
<blockquote><br />
<B>Book Description</B><BR> <br />
The second edition of Haskell: The Craft of Functional Programming is essential reading for beginners to functional programming and newcomers to the Haskell programming language. The emphasis is on the process of crafting programs and the text contains many examples and running case studies, as well as advice on program design, testing, problem solving and how to avoid common pitfalls. <br />
<br />
Building on the strengths of the first edition, the book includes many new and improved features: <br />
*Complete coverage of Haskell 98, the standard version of Haskell which will be stable and supported by implementations for years to come. <br />
*An emphasis on software engineering principles, encouraging a disciplined approach to building reusable libraries of software components. <br />
*Detailed coverage of the Hugs interpreter with an appendix covering other implementations. <br />
*A running case study of pictures emphasizes the built-in functions which appear in the standard prelude and libraries. It is also used to give an early preview of some of the more complex language features, such as high-order functions. <br />
*List comprehensions and the standard functions over lists are covered before recursion. <br />
*Early coverage of polymorphism supporting the "toolkit" approach and encouraging the reuse of built-in functions and types. <br />
*Extensive reference material containing details of further reading in books, journals and on the World Wide Web. <br />
*Accompanying Web Site supporting the book, containing all the program code, further teaching materials and other useful resources. <br />
<B>Synopsis</B><BR> <br />
This book introduces Haskell at a level appropriate for those with little or no prior experience of functional programming. The emphasis is on the process of crafting programs, solving problems, and avoiding common errors.<br />
</blockquote><br />
<br />
<DT>Richard Bird: [http://www.prenhall.com/allbooks/ptr_0134843460.html <EM>Introduction to Functional Programming using Haskell</EM>], 2nd edition, Prentice Hall Press, 1998, 460 pp., ISBN: 0-13-484346-0.<br />
<blockquote><br />
From the cover:<br />
<br />
After the success of the first edition, Introduction to Functional Programming using Haskell has been thoroughly updated and revised to provide a complete grounding in the principles and techniques of programming with functions.<br />
<br />
The second edition uses the popular language Haskell to express functional programs. There are new chapters on program optimisation, abstract datatypes in a functional setting, and programming in a monadic style. There are completely new case studies, and many new exercises.<br />
<br />
As in the first edition, there is an emphasis on the fundamental techniques for reasoning about functional programs, and for deriving them systematically from their specifications.<br />
<br />
The book is self-contained, assuming no prior knowledge of programming, and is suitable as an introductory undergraduate text for first- or second-year students.<br />
</blockquote><br />
<br />
<DT>Antony Davie: [http://www.cup.org/Titles/25/0521258308.html <EM>An Introduction to Functional Programming Systems Using Haskell</EM>], Cambridge University Press, 1992. ISBN 0-521-25830-8 (hardback). ISBN 0-521-27724-8 (paperback).<br />
<blockquote><br />
Cover:<br />
<br />
Functional programming is a style of programming that has become increasingly popular during the past few years.<br />
Applicative programs have the advantage of being almost immediately expressible as functional descriptions; they can<br />
be proved correct and transformed through the referential transparency property.<br />
<br />
This book presents the basic concepts of functional programming, using the language Haskell for examples. The author<br />
incorporates a discussion of lambda calculus and its relationship with Haskell, exploring the implications for<br />
parallelism. Contents: SASL for Beginners / Examples of SASL Programming / More Advanced Applicative Programming<br />
Techniques / Lambda Calculus / The Relationship Between Lambda Calculus and SASL / Program Transformation and<br />
Efficiency / Correctness, Equivalence and Program Verification / Landin's SECD Machine and Related<br />
Implementations / Further Implementation Techniques / Special Purpose Hardware / The Applicative Style of<br />
Semantics / Other Applicative Languages / Implications for Parallelism / Functional Programming in Von Neumann<br />
Languages <br />
</blockquote><br />
<br />
<DT>Fethi Rabhi and Guy Lapalme: [http://www.iro.umontreal.ca/~lapalme/Algorithms-functional.html <EM> Algorithms: A functional programming approach</EM>], <br />
Addison-Wesley, 235&nbsp;pages, paperback, 1999. ISBN<br />
0-201-59604-0<BR><br />
<BLOCKQUOTE><br />
<B>Book Description</B><BR> <br />
The authors challenge more traditional methods of teaching algorithms<br />
by using a functional programming context, with Haskell as an<br />
implementation language. This leads to smaller, clearer and more<br />
elegant programs which enable the programmer to understand the<br />
algorithm more quickly and to use that understanding to explore<br />
alternative solutions. <br><br />
<b>Key features:</b><br />
*Most chapters are self-contained and can be taught independently from each other.<br />
*All programs are in Haskell'98 and provided on a WWW site.<br />
*End of chapter exercises throughout.<br />
*Comprehensive index and bibliographical notes.<br />
<B>Synopsis</B><BR> <br />
The book is organised as a classic algorithms book according to topics<br />
such as Abstract Data Types, sorting and searching. It uses a<br />
succession of practical programming examples to develop in the reader<br />
problem-solving skills which can be easily transferred to other<br />
language paradigms. It also introduces the idea of capturing<br />
algorithmic design strategies (e.g. Divide-and-Conquer, Dynamic<br />
Programming) through higher-order functions.<br><br />
<b>Target audience</b><br><br />
The book is intended for computer science students taking algorithms<br />
and/or (basic or advanced) functional programming courses.<br />
</BLOCKQUOTE><br />
<br />
<dt>Jeremy Gibbons and Oege de Moor (eds.): [http://www.palgrave.com/catalogue/catalogue.asp?Title_Id=0333992857 <em>The Fun of Programming</em>],Palgrave, 2002, 288 pages. ISBN 0333992857.<br />
<blockquote><br />
<b>Book description:</b><br><br />
In this textbook, leading researchers give tutorial expositions on the current state of the art of functional<br />
programming. The text is suitable for an undergraduate course immediately following an introduction to<br />
functional programming, and also for self-study. All new concepts are illustrated by plentiful examples,<br />
as well as exercises. A website gives access to accompanying software.<br />
</blockquote><br />
<br />
<dt> Cordelia Hall and John O'Donnell: [http://www.dcs.gla.ac.uk/~jtod/discrete-mathematics/ <em>Discrete Mathematics Using a Computer</em>],<br />
Springer, 2000, 360 pages. ISBN 1-85233-089-9.<br />
<blockquote><br />
<b>Book description:</b><br><br />
This book introduces the main topics of discrete mathematics with a strong emphasis on<br />
applications to computer science. It uses computer programs to implement and illustrate<br />
the mathematical ideas, helping the reader to gain a concrete understanding of the<br />
abstract mathematics. The programs are also useful for practical calculations, and they<br />
can serve as a foundation for larger software packages. <br />
<br />
Designed for first and second year undergraduate students, the book is also ideally suited<br />
to self-study. No prior knowledge of functional programming is required; the book and<br />
the online documentation provide everything you will need. <br />
</blockquote><br />
<br />
<dt>Kees Doets and Jan van Eijck: [http://www.cwi.nl/~jve/HR <em>The Haskell Road to Logic, Maths and Programming</em>]. King's College Publications, London, 2004. ISBN 0-9543006-9-6 (14.00 pounds, $25.00).<br />
<blockquote><br />
<b>Book description:</b><br><br />
The purpose of this book is to teach logic and mathematical reasoning<br />
in practice, and to connect logical reasoning with computer<br />
programming. Throughout the text, abstract concepts are linked to<br />
concrete representations in Haskell. Everything one has to know about<br />
programming in Haskell to understand the examples in the book is<br />
explained as we go along, but we do not cover every aspect of the<br />
language. Haskell is a marvelous demonstration tool for logic and<br />
maths because its functional character allows implementations to<br />
remain very close to the concepts that get implemented, while the<br />
laziness permits smooth handling of infinite data structures.<br />
<br />
We do not assume that our readers have previous experience with either<br />
programming or construction of formal proofs. We do assume previous<br />
acquaintance with mathematical notation, at the level of secondary<br />
school mathematics. Wherever necessary, we will recall relevant <br />
facts. Everything one needs to know about mathematical<br />
reasoning or programming is explained as we go along. We do assume<br />
that our readers are able to retrieve software from the Internet and<br />
install it, and that they know how to use an editor for constructing<br />
program texts.<br />
<br />
After having worked through the material in the book, i.e., after<br />
having digested the text and having carried out a substantial number<br />
of the exercises, the reader will be able to write interesting<br />
programs, reason about their correctness, and document them in a clear<br />
fashion. The reader will also have learned how to set up mathematical<br />
proofs in a structured way, and how to read and digest mathematical<br />
proofs written by others.<br />
<br />
The book can be used as a course textbook, but since it comes with<br />
solutions to all exercises (electronically available from the authors<br />
upon request) it is also well suited for private study. The source<br />
code of all programs discussed in the text, a list of errata, <br />
further relevant material and an email link to the authors<br />
can be found [http://www.cwi.nl/~jve/HR here].<br />
</blockquote><br />
<br />
<dt>Simon Peyton Jones: <em>Implementation of Functional Programming Language</em>,Prentice-Hall, 1987. ISBN 0134533259.<br />
<br />
<dt>Simon Peyton Jones, David Lester: <em>Implementing Functional Languages</em>, 1992.<br><br />
<blockquote><br />
The book is out of print. The full sources and a postscript version are <br />
[http://research.microsoft.com/Users/simonpj/Papers/papers.html available for free].<br />
</blockquote><br />
<br />
<dt>Simon Thompson: <em>Type Theory and Functional Programming</em>, Addison-Wesley, 1991. ISBN 0-201-41667-0.<br />
<blockquote><br />
Now out of print, the original version is available [http://www.cs.kent.ac.uk/people/staff/sjt/TTFP/ here].<br />
<br />
<em>Preface</em>:<br />
Constructive Type theory has been a topic of research interest to computer scientists,<br />
mathematicians, logicians and philosophers for a number of years. For computer scientists it provides<br />
a framework which brings together logic and programming languages in a most elegant and fertile way:<br />
program development and verification can proceed within a single system. Viewed in a different way,<br />
type theory is a functional programming language with some novel features, such as the totality of<br />
all its functions, its expressive type system allowing functions whose result type depends upon the<br />
value of its input, and sophisticated modules and abstract types whose interfaces can contain logical<br />
assertions as well as signature information. A third point of view emphasizes that programs (or<br />
functions) can be extracted from proofs in the logic.<br />
</blockquote><br />
<br />
</DL><br />
<br />
=Tutorials and books=<br />
<br />
==Introductions to Haskell==<br />
<br />
;[http://www.haskell.org/tutorial/ A Gentle Introduction to Haskell] <br />
:By Paul Hudak, John Peterson, and Joseph H. Fasel. The title is a bit misleading. Some knowledge of another functional programming language is expected. The emphasis is on the type system and those features which are really new in Haskell (compared to other functional programming languages). A classic.<br />
<br />
;[http://www.isi.edu/~hdaume/htut/ Yet Another Haskell Tutorial] <br />
:By Hal Daume III et al. A recommended tutorial for Haskell that is still under construction but already covers much ground. Also a classic text.<br />
<br />
;[http://www.haskell.org/~pairwise/intro/intro.html Haskell Tutorial for C Programmers]<br />
:By Eric Etheridge. From the intro: "This tutorial assumes that the reader is familiar with C/C++, Python, Java, or Pascal. I am writing for you because it seems that no other tutorial was written to help students overcome the difficulty of moving from C/C++, Java, and the like to Haskell."<br />
<br />
;[http://www-106.ibm.com/developerworks/edu/os-dw-linuxhask-i.html Beginning Haskell] <br />
:From IBM developerWorks. This tutorial targets programmers of imperative languages wanting to learn about functional programming in the language Haskell. If you have programmed in languages such as C, Pascal, Fortran, C++, Java, Cobol, Ada, Perl, TCL, REXX, JavaScript, Visual Basic, or many others, you have been using an imperative paradigm. This tutorial provides a gentle introduction to the paradigm of functional programming, with specific illustrations in the Haskell 98 language. (Free registration required.)<br />
<br />
;[http://www.informatik.uni-bonn.de/~ralf/teaching/Hskurs_toc.html Online Haskell Course] <br />
:By Ralf Hinze (in German).<br />
<br />
;[http://www.cs.uu.nl/people/jeroen/courses/fp-eng.pdf Functional Programming]<br />
:By Jeroen Fokker, 1995. (153 pages, 600 KB). Textbook for learning functional programming with Gofer (an older implementation of Haskell). Here without Chapters&nbsp;6 and&nbsp;7. <br />
<br />
;[http://www.cs.chalmers.se/~rjmh/tutorials.html Tutorial Papers in Functional Programming].<br />
:A collection of links to other Haskell tutorials, from John Hughes.<br />
<br />
;[http://www.cs.ou.edu/cs1323h/textbook/haskell.shtml Two Dozen Short Lessons in Haskell] <br />
:By Rex Page. A draft of a textbook on functional programming, available by ftp. It calls for active participation from readers by omitting material at certain points and asking the reader to attempt to fill in the missing information based on knowledge they have already acquired. The missing information is then supplied on the reverse side of the page. <br />
<br />
;[http://www.cs.chalmers.se/~augustss/AFP/manuals/haskeller.dvi.gz The Little Haskeller] <br />
:By Cordelia Hall and John Hughes. 9 November 1993, 26 pages. An introduction using the Chalmers Haskell B interpreter (hbi). Beware that it relies very much on the user interface of hbi, which is quite different from that of other Haskell systems, and the tutorials cover Haskell 1.2, not Haskell 98.<br />
<br />
;[http://pleac.sourceforge.net/pleac_haskell/t1.html PLEAC-Haskell]<br />
:Following the Perl Cookbook (by Tom Christiansen and Nathan Torkington, published by O'Reilly) spirit, the PLEAC Project aims to gather fans of programming, in order to implement the solutions in other programming languages.<br />
<br />
;[ftp://ftp.geoinfo.tuwien.ac.at/navratil/HaskellTutorial.pdf Haskell-Tutorial] <br />
:By Damir Medak and Gerhard Navratil. The fundamentals of functional languages for beginners. <br />
<br />
;[http://www.reid-consulting-uk.ltd.uk/docs/ffi.html A Guide to Haskell's Foreign Function Interface]<br />
:A guide to using the foreign function interface extension, using the rich set of functions in the Foreign libraries, design issues, and FFI preprocessors.<br />
<br />
;[http://en.wikibooks.org/wiki/Programming:Haskell Programming Haskell Wikibook] <br />
:A communal effort by several authors to produce the definitive Haskell textbook. It's very much a work in progress at the moment, and contributions are welcome.<br />
<br />
;[http://video.s-inf.de/#FP.2005-SS-Giesl.(COt).HD_Videoaufzeichnung Video Lectures] <br />
:Lectures (in English) by Jürgen Giesl. About 30 hours in total, and great for learning Haskell. The lectures are 2005-SS-FP.V01 through 2005-SS-FP.V26. Videos 2005-SS-FP.U01 through 2005-SS-FP.U11 are exercise answer sessions, so you probably don't want those.<br />
<br />
;[http://www.cs.utoronto.ca/~trebla/fp/ Albert's Functional Programming Course] <br />
:A 15 lesson introduction to most aspects of Haskell.<br />
<br />
;[http://www.iceteks.com/articles.php/haskell/1 Introduction to Haskell]<br />
:By Chris Dutton. An "attempt to bring the ideas of functional programming to the masses here, and an experiment in finding ways to make it easy and interesting to follow".<br />
<br />
;[http://www.csc.depauw.edu/~bhoward/courses/0203Spring/csc122/haskintro/ An Introduction to Haskell]<br />
:A brief introduction, by Brian Howard.<br />
<br />
;[http://web.syntaxpolice.org/lectures/haskellTalk/slides/index.html Introduction to Haskell]<br />
:By Isaac Jones (2003).<br />
<br />
;[[Hitchhikers Guide to the Haskell]]<br />
: Tutorial for C/Java/OCaml/... programmers by Dmitry Astapov. From the intro: "This text intends to introduce the reader to the practical aspects of Haskell from the very beginning (plans for the first chapters include: I/O, darcs, Parsec, QuickCheck, profiling and debugging, to mention a few)".<br />
<br />
==Reference material==<br />
<br />
;[http://www.cs.uu.nl/~afie/haskell/tourofsyntax.html Tour of the Haskell Syntax] <br />
:By Arjan van IJzendoorn.<br />
<br />
;[http://zvon.org/other/haskell/Outputglobal/index.html Haskell Reference] <br />
:By Miloslav Nic.<br />
<br />
;[http://www.cs.uu.nl/~afie/haskell/tourofprelude.html A Tour of the Haskell Prelude] <br />
:By Bernie Pope and Arjan van IJzendoorn.<br />
<br />
;[http://members.chello.nl/hjgtuyl/tourdemonad.html A tour of the Haskell Monad functions]<br />
:By Henk-Jan van Tuyl.<br />
<br />
;[http://www.cse.unsw.edu.au/~en1000/haskell/inbuilt.html Useful Haskell functions]<br />
:An explanation for beginners of many Haskell functions that are predefined in the Haskell Prelude.<br />
<br />
;[http://haskell.org/ghc/docs/latest/html/libraries/ Documentation for the standard libraries]<br />
:Complete documentation of the standard Haskell libraries.<br />
<br />
;[http://www.haskell.org/haskellwiki/Category:Idioms Haskell idioms]<br />
:A collection of articles describing some common Haskell idioms. Often quite advanced.<br />
<br />
;[http://www.haskell.org/haskellwiki/Blow_your_mind Useful idioms]<br />
:A collection of short, useful Haskell idioms.<br />
<br />
;[http://www.haskell.org/haskellwiki/Programming_guidelines Programming guidelines]<br />
:Some Haskell programming and style conventions.<br />
<br />
== Motivation for Using Haskell ==<br />
<br />
;[http://www.md.chalmers.se/~rjmh/Papers/whyfp.html Why Functional Programming Matters] <br />
:By [http://www.md.chalmers.se/~rjmh/ John Hughes], The Computer Journal, Vol. 32, No. 2, 1989, pp. 98 - 107. Also in: David A. Turner (ed.): Research Topics in Functional Programming, Addison-Wesley, 1990, pp. 17 - 42.<BR> Exposes the advantages of functional programming languages. Demonstrates how higher-order functions and lazy evaluation enable new forms of modularization of programs.<br />
<br />
;[[Why Haskell matters]] <br />
:Discussion of the advantages of using Haskell in particular. An excellent article.<br />
<br />
;[http://www.cs.ukc.ac.uk/pubs/1997/224/index.html Higher-order + Polymorphic = Reusable] <br />
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson]. Unpublished, May 1997.<BR> <STRONG>Abstract:</STRONG> This paper explores how certain ideas in object oriented languages have their correspondents in functional languages. In particular we look at the analogue of the iterators of the C++ standard template library. We also give an example of the use of constructor classes which feature in Haskell 1.3 and Gofer.<br />
<br />
==Analysis and Design Methods==<br />
<br />
See [[Analysis and design]].<br />
<br />
== Teaching Haskell == <br />
<br />
;[http://www.cs.ukc.ac.uk/pubs/1997/208/index.html Where do I begin? A problem solving approach to teaching functional programming]<br />
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson]. In Krzysztof Apt, Pieter Hartel, and Paul Klint, editors, First International Conference on Declarative Programming Languages in Education. Springer-Verlag, September 1997. <br> <STRONG>Abstract:</STRONG> This paper introduces a problem solving method for teaching functional programming, based on Polya's `How To Solve It', an introductory investigation of mathematical method. We first present the language independent version, and then show in particular how it applies to the development of programs in Haskell. The method is illustrated by a sequence of examples and a larger case study. <br />
<br />
;[http://www.cs.ukc.ac.uk/pubs/1995/214/index.html Functional programming through the curriculum]<br />
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson] and Steve Hill. In Pieter H. Hartel and Rinus Plasmeijer, editors, Functional Programming Languages in Education, LNCS 1022, pages 85-102. Springer-Verlag, December 1995. <br> <STRONG>Abstract:</STRONG> This paper discusses our experience in using a functional language in topics across the computer science curriculum. After examining the arguments for taking a functional approach, we look in detail at four case studies from different areas: programming language semantics, machine architectures, graphics and formal languages. <br />
<br />
;[http://www.cse.unsw.edu.au/~chak/papers/CK02a.html The Risks and Benefits of Teaching Purely Functional Programming in First Year]<br />
:By [http://www.cse.unsw.edu.au/~chak Manuel M. T. Chakravarty] and [http://www.cse.unsw.edu.au/~keller Gabriele Keller]. Journal of Functional Programming 14(1), pp 113-123, 2004. An earlier version of this paper was presented at Functional and Declarative Programming in Education (FDPE02). <br> <strong>Abstract</strong> We argue that teaching purely functional programming as such in freshman courses is detrimental to both the curriculum as well as to promoting the paradigm. Instead, we need to focus on the more general aims of teaching elementary techniques of programming and essential concepts of computing. We support this viewpoint with experience gained during several semesters of teaching large first-year classes (up to 600 students) in Haskell. These classes consisted of computer science students as well as students from other disciplines. We have systematically gathered student feedback by conducting surveys after each semester. This article contributes an approach to the use of modern functional languages in first year courses and, based on this, advocates the use of functional languages in this setting.<br />
<br />
==Using Monads==<br />
<br />
;[http://research.microsoft.com/%7Esimonpj/Papers/marktoberdorf Tackling the awkward squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell]<br />
:Simon Peyton Jones. Presented at the 2000 Marktoberdorf Summer School. In "Engineering theories of software construction", ed Tony Hoare, Manfred Broy, Ralf Steinbruggen, IOS Press, ISBN 1 58603 1724, 2001, pp47-96. The standard reference for monadic IO in GHC/Haskell. <br><strong>Abstract:</strong> Functional programming may be beautiful, but to write real applications we must grapple with awkward real-world issues: input/output, robustness, concurrency, and interfacing to programs written in other languages.<br />
<br />
;[http://db.ewi.utwente.nl/Publications/PaperStore/db-utwente-0000003696.pdf The Haskell Programmer's Guide to the IO Monad - Don't Panic.] <br />
:Stefan Klinger. This report scratches the surface of category theory, an abstract branch of algebra, just deep enough to find the monad structure. It seems well written.<br />
<br />
;[http://www.nomaware.com/monads/html/ All About Monads] <br />
:By Jeff Newbern. This tutorial aims to explain the concept of a monad and its application to functional programming in a way that is easy to understand and useful to beginning and intermediate Haskell programmers. Familiarity with the Haskell language is assumed, but no prior experience with monads is required. <br />
<br />
;[http://www.dcs.gla.ac.uk/~nww/Monad.html What the hell are Monads?] <br />
:By Noel Winstanley. A basic introduction to monads, monadic programming and IO. This introduction is presented by means of examples rather than theory, and assumes a little knowledge of Haskell. <br />
<br />
;[http://www.engr.mun.ca/~theo/Misc/haskell_and_monads.htm Monads for the Working Haskell Programmer -- a short tutorial]<br />
:By Theodore Norvell. <br />
<br />
See also [[Research papers/Monads and arrows]]<br />
<br />
==Using Arrows==<br />
<br />
;[http://www.haskell.org/arrows/ Arrows: A General Interface to Computation]<br />
:Ross Paterson's page on arrows.<br />
<br />
;[http://haskell.org/hawiki/UnderstandingArrows UnderstandingArrows]<br />
<br />
See also [[Research papers/Monads and arrows]]<br />
<br />
== Attribute Grammars ==<br />
<br />
Wouter Swierstra's [http://www.haskell.org/tmrwiki/WhyAttributeGrammarsMatter WhyAttributeGrammarsMatter].<br />
<br />
Utrecht University's [http://www.cs.uu.nl/wiki/HUT/AttributeGrammarSystem Attribute Grammar System] tools also include an attribute grammar compiler, UUAGC. The concept of attribute grammar was used in their [http://www.cs.uu.nl/wiki/Ehc/WebHome Essential Haskell Compiler] project, which gives us not only a working programming language, but also good didactic material about using attribute grammars, e.g. in writing compilers.<br />
<br />
Although these materials are self-contained, they reveal that the theory of attribute grammars is related to these concepts:<br />
* circular programming<br />
* catamorphism<br />
Here is a HaWiki page on [http://haskell.org/hawiki/CircularProgramming CircularProgramming].<br />
<br />
== Categorical Programming ==<br />
<br />
Catamorphisms and related concepts, categorical approach to functional programming, categorical programming. Many materials cited here refer to category theory, so as an introduction to this discipline see the ''Foundations'' section at the end of this page.<br />
* Erik Meijer, Maarten Fokkinga, Ross Paterson: [http://citeseer.ist.psu.edu/meijer91functional.html Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire]. See also related documents (in the CiteSeer page). Understanding the article does not require knowledge of category theory: it is self-contained material on catamorphisms, anamorphisms and other related concepts.<br />
* Varmo Vene and Tarmo Uustalu: [http://citeseer.ist.psu.edu/vene98functional.html Functional Programming with Apomorphisms / Corecursion]<br />
* Varmo Vene: [http://www.cs.ut.ee/~varmo/papers/thesis.pdf Categorical Programming with Inductive and Coinductive Types]. The book accompanies the deep categorical theory topic with Haskell examples.<br />
* Tatsuya Hagino: [http://www.tom.sfc.keio.ac.jp/~hagino/thesis.pdf A Categorical Programming Language]<br />
* [http://pll.cpsc.ucalgary.ca/charity1/www/home.html Charity], a categorical programming language implementation.<br />
* [http://okmij.org/ftp/Haskell/categorical-maxn.lhs Deeply uncurried products, as categorists might like them] article mentions a conjecture: relatedness to [[Combinatory logic]]<br />
<br />
== Data Structures ==<br />
<br />
;[http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521663504 Purely Functional Data Structures]<br />
:[http://www.cs.columbia.edu/~cdo/ Chris Okasaki], 232 pp., Cambridge University Press, 1998. ISBN 0-521-63124-6<BR> From the cover: <BLOCKQUOTE> Most books on data structures assume an imperative language like C or C++. However, data structures for these languages do not always translate well to functional languages such as Standard ML, Haskell, or Scheme. This book describes data structures and data structure design techniques from the point of view of functional languages. It includes code for a wide assortment both of classical data structures and of data structures developed exclusively for functional languages. This handy reference for professional programmers working with functional languages can also be used as a tutorial or for self-study. [http://www.cs.columbia.edu/~cdo/pfds-haskell.tar.gz Haskell source code for the book] </BLOCKQUOTE><br />
<br />
See [[Research papers/Data structures]]<br />
<br />
==Schools on Advanced Functional Programming==<br />
<br />
<EM>Advanced Functional Programming</EM>, First International Spring<br />
School on Advanced Functional Programming Techniques, Bastad, Sweden, LNCS 925, Springer-Verlag, 1995 (editors: J. Jeuring, E. Meijer).<br />
*<EM>Functional Parsers</EM> by Jeroen Fokker, p.&nbsp;1-23.<br />
*<EM>Monads for functional programming</EM> by Philip Wadler, p.&nbsp;24-52.<br />
*<EM>The Design of a Pretty-printing Library</EM> by John Hughes, p.&nbsp;52-96.<br />
*<EM>Functional Programming with Overloading and Higher-Order Polymorphism</EM>, Mark P. Jones, p.&nbsp;97-136.<br />
*<EM>Programming with Fudgets</EM> by Thomas Hallgren and Magnus Carlsson, p.&nbsp;137-182.<br />
*<EM>Constructing Medium Sized Efficient Functional Programs in Clean</EM> by Marko C.J.D. van Eekelen and Rinus J. Plasmeijer, p.&nbsp;183-227.<br />
*<EM>Merging Monads and Folds for Functional Programming</EM> by Erik Meijer and Johan Jeuring, p.&nbsp;228-266.<br />
*<EM>Programming with Algebras</EM> by Richard B. Kieburtz and Jeffrey Lewis, p.&nbsp;267-307.<br />
*<EM>Graph Algorithms with a Functional Flavour</EM> by John Launchbury, p.&nbsp;308-331.<br />
<br />
[http://www.cse.ogi.edu/PacSoft/conf/summerschool96.html <EM>Advanced Functional Programming</EM>], Second International Summer School on Advanced Functional Programming Techniques, Evergreen State College, WA, USA, LNCS 1126, Springer-Verlag, 1996 (editors: J. Launchbury, E. Meijer, T. Sheard).<br />
*<EM>Composing the User Interface with Haggis</EM> by Sigbjorn Finne and Simon Peyton Jones, p.&nbsp;1-37.<br />
*<EM>Haskore Music Tutorial</EM> by Paul Hudak, p.&nbsp;38-67.<br />
*<EM>Polytypic Programming</EM> by Johan Jeuring and Patrick Jansson, p.&nbsp;68-114.<br />
*<EM>Implementing Threads in Standard ML</EM> by Peter Lee, p.&nbsp;115-130.<br />
*<EM>Functional Data Structures</EM> by Chris Okasaki, p.&nbsp;131-158.<br />
*<EM>Heap Profiling for Space Efficiency</EM> by Colin Runciman and Niklas R&ouml;jemo, p.&nbsp;159-183.<br />
*<EM>Deterministic, Error-Correcting Combinator Parsers</EM> by S. Doaitse Swierstra and Luc Duponcheel, p.&nbsp;184-207.<br />
*<EM>Essentials of Standard ML Modules</EM> by Mads Tofte, p.&nbsp;208-238.<br />
<br />
[http://alfa.di.uminho.pt/~afp98/ Advanced Functional Programming, Third International School, AFP'98], <br />
in Braga, Portugal from 12th to 19th September 1998, LNCS 1608, Springer-Verlag, 1999<br />
(editors: D. Swierstra, P. Henriques and J. Oliveira).<BR><br />
All lecture notes and further material are available from the web site.<br />
<br />
= Foundations =<br />
<br />
;[http://www.dcs.qmul.ac.uk/~pt/Practical_Foundations/ Practical Foundations of Mathematics]<br />
:Paul Taylor. Cambridge University Press, ISBN: 0-521-63107-6, xii+576 pages, September 2000.<br />
<br />
;[http://www.cwru.edu/artsci/math/wells/pub/ttt.html Toposes, Triples and Theories]<br />
:Michael Barr and Charles Wells. The revised version of their formerly Springer Verlag published book is online for free download. Note that they use the name ''triple'' instead of ''monad''.<br />
<br />
=[[Research papers]]=<br />
<br />
A large collection of research papers published on various aspects of Haskell.</div>Adepthttps://wiki.haskell.org/index.php?title=User:Adept&diff=3920User:Adept2006-04-30T23:04:22Z<p>Adept: </p>
<hr />
<div>== Personal trivia ==<br />
I am known as adept (or ADEpt) at #haskell<br />
<br />
You can reach me via dastapov-at-gmail-dot-com, UIN 18-22-53-38 or JID adept-at-jabber-dot-kiev-dot-ua<br />
<br />
== Texts and articles ==<br />
[[QuickCheck as Test Set Generator]]<br />
<br />
[[Hitchhikers Guide to the Haskell]]<br />
<br />
Source for both of them are available in my [http://adept.linux.kiev.ua/repos darcs repo]</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=3919Hitchhikers guide to Haskell2006-04-30T22:55:49Z<p>Adept: Added a small bit about darcs</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create a subdirectory "_darcs" to store<br />
all version-control-related stuff.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The "IO" prefix means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
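As a minimal sketch of that idea (it is not part of cd-fit), here is a list of IO actions being run with the Prelude function "sequence_", which executes the actions of a list from left to right. The names "describe" and "actions" are invented for this illustration:<br />
<br />
```haskell
module Main where

-- Pure helper (hypothetical, for illustration only): the lines to print.
describe :: Int -> [String]
describe n = [ "step " ++ show i | i <- [1 .. n] ]

-- A list of IO actions. Building the list performs no I/O at all:
-- the actions are ordinary values until something executes them.
actions :: [IO ()]
actions = map putStrLn (describe 3)

main :: IO ()
main = do
  putStrLn "Nothing from 'actions' has been printed yet"
  -- sequence_ executes the actions one by one, left to right.
  sequence_ actions
  -- The list can be rearranged like any other list before running it.
  sequence_ (reverse actions)
```
<br />
Since the list is an ordinary value, we can reverse it, filter it or take a part of it before anything gets executed.<br />
<br />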
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use a "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
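The fragments above leave "someAction", "someOtherAction", "foo" and "bar" unspecified, so they cannot be run as they stand. Here is one way to fill the gaps. The four definitions at the top are stand-ins chosen only for this sketch, while "c" and "process" are the actions from the text:<br />
<br />
```haskell
module Main where

-- Stand-in definitions (assumptions made for this sketch only).
someAction :: IO Int
someAction = return 20

someOtherAction :: IO String
someOtherAction = return "towel"

foo :: Int -> Int
foo n = n * 2 + 2

bar :: String -> Int
bar = length

-- "c" is built from smaller actions with "do" ...
c :: IO ()
c = do a <- someAction
       b <- someOtherAction
       print (bar b)
       print (foo a)
       putStrLn "done"

-- ... and is itself a building block of "process".
process :: IO ()
process = do putStrLn "Will do some processing"
             c
             putStrLn "Done"

-- Nothing is executed until "main" is run.
main :: IO ()
main = process
```
<br />
Note that "c" appears in "process" exactly like "putStrLn" does: one action built with "do" is used as a single step of another.<br />
<br />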
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composition pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
See how we construct code out of thin air? Try to imagine what this code will<br />
do, then run it and check yourself. <br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
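<br />
To see the "type of the last action" rule at work, compare two "do" blocks in the sketch below. It uses the Prelude function "return", which has not been introduced in the text so far: it wraps a plain value into an action that performs no I/O. The names "loudly" and "shout" are made up for this example:<br />
<br />
```haskell
module Main where

import Data.Char (toUpper)

-- Pure helper (hypothetical name): upper-case a string.
loudly :: String -> String
loudly = map toUpper

-- The last action is "return ...", so the whole block has type IO String.
shout :: String -> IO String
shout s = do putStrLn ("about to shout: " ++ s)
             return (loudly s)

-- The last action is "putStrLn ...", so the whole block has type IO ().
main :: IO ()
main = do loud <- shout "don't panic"
          putStrLn loud
```
<br />
Loading this file in ghci and asking for ":type shout" and ":type main" should show the two different result types.<br />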
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
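<br />
To see that "do" is not tied to IO, here is a small sketch that uses the standard "Maybe" type, which also happens to be a monad (take my word for it for now). The names "safeDiv" and "calc" are invented for this example and have nothing to do with Parsec:<br />
<br />
<haskell><br />
-- Division that yields Nothing instead of crashing on zero<br />
safeDiv :: Int -> Int -> Maybe Int<br />
safeDiv _ 0 = Nothing<br />
safeDiv x y = Just (x `div` y)<br />
<br />
-- The same "do" notation, but no IO is involved: if any step yields<br />
-- Nothing, the whole block yields Nothing<br />
calc :: Int -> Int -> Int -> Maybe Int<br />
calc a b c = do q <- safeDiv a b<br />
                r <- safeDiv q c<br />
                return (q + r)<br />
<br />
main :: IO ()<br />
main = do print (calc 100 5 2)   -- Just 30<br />
          print (calc 100 0 2)   -- Nothing<br />
</haskell><br />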
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing a remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad"), which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use the ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
, which would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
Exercises: <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar: backticks let us write a two-argument function between its arguments. Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
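<br />
If backticks and partial application feel too abstract, here is a tiny sketch that uses only standard Prelude functions instead of parsers. The names "prefixForm", "infixForm" and "firstThree" are made up for the illustration:<br />
<br />
<haskell><br />
-- "div" written in prefix form and in infix form (with backticks)<br />
prefixForm, infixForm :: Int<br />
prefixForm = div 7 2<br />
infixForm  = 7 `div` 2<br />
<br />
-- "take" needs two arguments; giving it only one yields a new function<br />
firstThree :: [a] -> [a]<br />
firstThree = take 3<br />
<br />
main :: IO ()<br />
main = do print (prefixForm == infixForm)   -- True<br />
          print (firstThree "foobar")       -- "foo"<br />
</haskell><br />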
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you didn't have to specify the types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or give us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike with a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
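<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch follows; the function name "describe" is invented for this example:<br />
<br />
<haskell><br />
-- Inspect the constructor to find out which alternative we got<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ show c<br />
<br />
main :: IO ()<br />
main = do putStrLn (describe (Left 42))    -- an Int: 42<br />
          putStrLn (describe (Right 'x'))  -- a Char: 'x'<br />
</haskell><br />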
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
Exercise:<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about a single directory is a size (number), some spaces,<br />
-- then directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
If you followed the advice to put your code under version control, you<br />
could now use "darcs whatsnew" or "darcs diff -u" to examine your<br />
changes to the previous version. Use "darcs record" to commit them. As<br />
an exercise, first record the changes "outside" of function "main" and<br />
then record the changes in "main". Do "darcs changes" to examine a<br />
list of changes you've recorded so far.<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it] if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
Exercise: examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
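A short sketch of how the accessor functions behave. The sample values are made up, and the block repeats the "Dir" definition only so that it can be run on its own:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
main :: IO ()<br />
main = do let d = Dir 65572 "/home/adept/photos/raw-to-burn/dir1"<br />
          print (dir_size d)      -- the Int component<br />
          putStrLn (dir_name d)   -- the String component<br />
          -- Fields can also be given by name, in any order<br />
          print (Dir {dir_name = "x", dir_size = 1})<br />
</haskell><br />
<br />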
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on a single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort a list of "Dir" by size only, we use a custom comparison function and a parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs".<br />
----<br />
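<br />
Before you go exploring, here is a small sketch of "foldl" and "sortBy" on plain lists, so that you have something to poke at in ghci. The sample data is made up:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
main :: IO ()<br />
main = do -- foldl threads an accumulator through the list, left to right<br />
          print (foldl (+) 0 [1,2,3,4])              -- 10<br />
          print (foldl (\acc x -> x:acc) [] "abc")   -- "cba"<br />
          -- sortBy takes the comparison function as an argument<br />
          print (sortBy (\a b -> compare (snd a) (snd b))<br />
                        [("x",3),("y",1),("z",2)])   -- [("y",1),("z",2),("x",3)]<br />
</haskell><br />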
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add these lines:<br />
<br />
<haskell><br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a reusable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of n*10+1 chars (n between 1 and 300), consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
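<br />
Here is a minimal sketch of such a contract. Everything in it ("Sized", "sizeOf", "File", "Folder", "fitsOn") is hypothetical and invented for this illustration; none of it comes from QuickCheck or from "cd-fit.hs":<br />
<br />
<haskell><br />
-- A hypothetical contract: anything that can report its size in bytes<br />
class Sized a where<br />
  sizeOf :: a -> Int<br />
<br />
data File   = File String Int        -- name and size<br />
data Folder = Folder String [File]   -- name and contents<br />
<br />
instance Sized File where<br />
  sizeOf (File _ s) = s<br />
<br />
instance Sized Folder where<br />
  sizeOf (Folder _ fs) = sum (map sizeOf fs)<br />
<br />
-- Works for any type that fulfils the contract<br />
fitsOn :: Sized a => Int -> a -> Bool<br />
fitsOn capacity x = sizeOf x <= capacity<br />
<br />
readme :: File<br />
readme = File "a.txt" 100<br />
<br />
docs :: Folder<br />
docs = Folder "docs" [readme, File "b.txt" 250]<br />
<br />
main :: IO ()<br />
main = do print (fitsOn 300 readme)   -- True<br />
          print (fitsOn 300 docs)     -- False, the folder holds 350 bytes<br />
</haskell><br />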
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it: 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (an API or library)<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
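<br />
As a sketch of how little is needed, here is a made-up type "Size" that becomes an instance of "Ord" by defining "compare" alone. The helper "rank" is also invented for the example, and "Eq" is simply derived, which is explained in a moment:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
data Size = Small | Medium | Large deriving (Eq, Show)<br />
<br />
-- Map each constructor to a number, so we can reuse the ordering of Int<br />
rank :: Size -> Int<br />
rank Small  = 0<br />
rank Medium = 1<br />
rank Large  = 2<br />
<br />
-- We define only "compare"; (<), (>), max, min etc. come from the defaults<br />
instance Ord Size where<br />
  compare a b = compare (rank a) (rank b)<br />
<br />
main :: IO ()<br />
main = do print (Small < Large)                 -- True<br />
          print (max Medium Small)              -- Medium<br />
          print (sort [Large, Small, Medium])   -- [Small,Medium,Large]<br />
</haskell><br />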
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
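<br />
A small sketch of what you get from such a clause, using a made-up type "Version" rather than "Dir" (so as not to spoil the exercises below). The derived "Ord" compares the fields from left to right:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- Hypothetical type: major and minor version numbers<br />
data Version = Version Int Int deriving (Eq,Ord,Show)<br />
<br />
main :: IO ()<br />
main = do print (Version 1 2 == Version 1 2)   -- True<br />
          print (Version 1 9 < Version 2 0)    -- True<br />
          print (sort [Version 2 0, Version 1 9, Version 1 2])<br />
          -- prints [Version 1 2,Version 1 9,Version 2 0]<br />
</haskell><br />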
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
Exercises:<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] standing for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of n*10+1 chars (n between 1 and 300), consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
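To see that "liftM2" really is just a shorthand for that "do" block, here is a sketch that uses "Maybe" as the monad instead of "Gen", so that it runs without QuickCheck. The names "viaDo" and "viaLift" are made up, and the simple positional "Dir" is repeated to keep the block self-contained:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- Explicit "do" block<br />
viaDo :: Maybe Int -> Maybe String -> Maybe Dir<br />
viaDo ms mn = do s <- ms<br />
                 n <- mn<br />
                 return (Dir s n)<br />
<br />
-- The same thing with liftM2<br />
viaLift :: Maybe Int -> Maybe String -> Maybe Dir<br />
viaLift = liftM2 Dir<br />
<br />
main :: IO ()<br />
main = do print (viaDo   (Just 1) (Just "foo"))   -- Just (Dir 1 "foo")<br />
          print (viaLift (Just 1) (Just "foo"))   -- Just (Dir 1 "foo")<br />
          print (viaLift Nothing  (Just "foo"))   -- Nothing<br />
</haskell><br />
<br />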
Hopefully, this will all make sense after you read it for the third time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
                -- We either fail to add any dirs (probably, because all of them are too big).<br />
                -- Well, just report that disk must be left empty:<br />
                [] -> DirPack 0 []<br />
                -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write an implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
Exercises:<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what does this sample do: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could write (with help of decent tutorial) write de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to build the complex condition that filters the list of dirs<br />
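<br />
As a rough illustration of what happens to a list comprehension, here is a small sketch that de-sugars a ''different'' comprehension by hand (so the exercise above stays yours to solve). The names <tt>sugared</tt> and <tt>desugared</tt> are made up for this example, and the translation shown is just one common way to spell it, not necessarily what GHC produces internally:<br />
<br />
<haskell><br />
-- A comprehension with a generator, a guard and a local binding ...<br />
sugared :: [Int] -> [Int]<br />
sugared xs = [ y + 1 | x <- xs, x > 0, let y = x * x ]<br />
<br />
-- ... and the same thing written by hand: the generator becomes a<br />
-- concatMap, the guard an if/else that yields [] on failure, and the<br />
-- "let" an ordinary let-expression.<br />
desugared :: [Int] -> [Int]<br />
desugared xs =<br />
  concatMap (\x -> if x > 0<br />
                     then let y = x * x in [y + 1]<br />
                     else [])<br />
            xs<br />
<br />
main :: IO ()<br />
main = do<br />
  let input = [-2, 0, 1, 3]<br />
  print (sugared input)                      -- [2,10]<br />
  print (sugared input == desugared input)   -- True<br />
</haskell><br />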
<br />
---- <br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds<br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
 *Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You did take my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them yet - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds<br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (newer GHC releases spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First of all, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                     MODULE  %time %alloc<br />
<br />
 precomputeDisksFor              Main     88.1   99.8<br />
 dynamic_pack                    Main     11.0    0.0<br />
<br />
                                                           individual     inherited<br />
 COST CENTRE                     MODULE   no.  entries  %time %alloc   %time %alloc<br />
<br />
 MAIN                            MAIN       1        0    0.0    0.0   100.0  100.0<br />
  CAF                            Main     174       11    0.9    0.2   100.0  100.0<br />
   prop_dynamic_pack_small_disk  Main     181      100    0.0    0.0    99.1   99.8<br />
    dynamic_pack                 Main     182      200   11.0    0.0    99.1   99.8<br />
     precomputeDisksFor          Main     183      200   88.1   99.8    88.1   99.8<br />
   main                          Main     180        1    0.0    0.0     0.0    0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 88% of the run<br />
time and 99.8% of the allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word is "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element from "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
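<br />
To see the trade-off in isolation, here is a minimal sketch of the same "table of lazily computed answers" trick that <tt>precomputeDisksFor</tt> relies on, applied to Fibonacci numbers instead of disks. The names <tt>fibTable</tt> and <tt>memoFib</tt> are illustrative only and not part of cd-fit. Every lookup with <tt>!!</tt> walks the list from its head, and the whole prefix stays reachable for as long as the table itself is:<br />
<br />
<haskell><br />
-- A lazily built table: element n is computed at most once, on demand,<br />
-- and may refer to earlier elements of the same table.<br />
fibTable :: [Integer]<br />
fibTable = map fib [0 ..]<br />
  where<br />
    fib :: Int -> Integer<br />
    fib 0 = 0<br />
    fib 1 = 1<br />
    fib n = fibTable !! (n - 1) + fibTable !! (n - 2)<br />
<br />
-- Looking up index n walks n cells of the list and keeps them alive.<br />
memoFib :: Int -> Integer<br />
memoFib n = fibTable !! n<br />
<br />
main :: IO ()<br />
main = print (memoFib 50)   -- 12586269025<br />
</haskell><br />
<br />
For 50 entries this is harmless; for a table indexed by every byte count up to the size of a CD it is exactly the kind of memory hog we have just observed.<br />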
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                          , dir_size d > 0<br />
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
                                          , d `notElem` ds<br />
        ] of<br />
     -- We either fail to add any dirs (probably, because all of them are too big).<br />
     -- Well, just report that disk must be left empty:<br />
     [] -> DirPack 0 []<br />
     -- Or we produce some alternative packings. Let's choose the best of them all:<br />
     packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                     MODULE     %time %alloc<br />
<br />
 CAF                             GHC.Float    0.0    4.4<br />
 main                            Main         0.0   93.9<br />
<br />
                                                           individual     inherited<br />
 COST CENTRE                     MODULE   no.  entries  %time %alloc   %time %alloc<br />
 MAIN                            MAIN       1        0    0.0    0.0     0.0  100.0<br />
  main                           Main     180        1    0.0   93.9     0.0   94.2<br />
   prop_dynamic_pack_small_disk  Main     181      100    0.0    0.0     0.0    0.3<br />
    dynamic_pack                 Main     182      200    0.0    0.2     0.0    0.3<br />
     bestDisk                    Main     183      200    0.0    0.1     0.0    0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we can test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could still be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
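<br />
For reference, here is one way the amended function might look as a complete, stand-alone file. This is a sketch under assumptions, not the canonical cd-fit source: the record field names are taken from the ghci output shown earlier, <tt>deriving Eq</tt> is just one possible way to make <tt>delete</tt> work on <tt>Dir</tt>, <tt>comparing pack_size</tt> stands in for <tt>cmpSize</tt>, and the range <tt>(1,limit)</tt> replaces the separate <tt>dir_size d > 0</tt> check:<br />
<br />
<haskell><br />
import Data.Ix (inRange)<br />
import Data.List (delete, maximumBy)<br />
import Data.Ord (comparing)<br />
<br />
-- Assumed definitions, modelled on the ghci output earlier in this chapter<br />
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Show, Eq)<br />
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving Show<br />
<br />
bestDisk :: Int -> [Dir] -> DirPack<br />
bestDisk 0 _  = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
bestDisk limit candidates =<br />
  case [ DirPack (dir_size d + s) (d : rest)<br />
       | let smallEnough = filter (inRange (1, limit) . dir_size) candidates<br />
       , d <- smallEnough<br />
         -- pack the remaining space using only the dirs not yet chosen<br />
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d smallEnough)<br />
       ] of<br />
    []    -> DirPack 0 []<br />
    packs -> maximumBy (comparing pack_size) packs<br />
<br />
main :: IO ()<br />
main = print (pack_size (bestDisk 10 [Dir 4 "a", Dir 5 "b", Dir 7 "c"]))  -- 9<br />
</haskell><br />
<br />
With a disk of size 10, the best choice is the pair of directories of sizes 4 and 5; the directory of size 7 cannot be combined with either of them.<br />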
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
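<br />
A possible starting point for such scaling is sketched below. The helper <tt>scaleSizes</tt> is hypothetical (it is not part of cd-fit), and the rounding policy is my own assumption: directory sizes are rounded up and the medium size is rounded down, so that any packing that fits in scaled units is guaranteed to fit in bytes as well, at the price of possibly missing the tightest packing:<br />
<br />
<haskell><br />
-- Hypothetical helper: express all sizes in units of `unit` bytes.<br />
-- Directory sizes are rounded up and the medium size is rounded down, so<br />
-- anything that fits in scaled units also fits in bytes.<br />
scaleSizes :: Int -> Int -> [Int] -> (Int, [Int])<br />
scaleSizes unit media sizes =<br />
  (media `div` unit, [ (s + unit - 1) `div` unit | s <- sizes ])<br />
<br />
main :: IO ()<br />
main = do<br />
  let sizes = [65572, 68268, 53372, 713124, 437952]<br />
      unit  = minimum sizes   -- the smallest directory becomes "1"<br />
  print (scaleSizes unit (700 * 1024 * 1024) sizes)<br />
</haskell><br />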
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
is really just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
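<br />
If you want to check the claim yourself, the following sketch puts both spellings into one runnable file. The definitions of <tt>someAction</tt>, <tt>someOtherAction</tt>, <tt>foo</tt> and <tt>bar</tt> are placeholders invented for this example; any actions and functions of matching types would do:<br />
<br />
<haskell><br />
-- Stand-ins for the undefined names used in the text (assumptions):<br />
someAction :: IO Int<br />
someAction = return 20<br />
<br />
someOtherAction :: IO Int<br />
someOtherAction = return 1<br />
<br />
foo, bar :: Int -> Int<br />
foo = (* 2)<br />
bar = (+ 1)<br />
<br />
-- "do" notation ...<br />
sugared :: IO ()<br />
sugared = do a <- someAction<br />
             b <- someOtherAction<br />
             print (bar b)<br />
             print (foo a)<br />
             print "done"<br />
<br />
-- ... and the same action spelled out with >>= and >><br />
desugared :: IO ()<br />
desugared = someAction >>= \a -><br />
            someOtherAction >>= \b -><br />
            print (bar b) >><br />
            print (foo a) >><br />
            print "done"<br />
<br />
-- Both actions print the same three lines.<br />
main :: IO ()<br />
main = sugared >> desugared<br />
</haskell><br />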
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=3918Hitchhikers guide to Haskell2006-04-30T22:04:13Z<p>Adept: changed all examples to use syntax highlighting</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", '''that layout is 2-dimensional''', etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
'''In case you've skipped over the previous paragraph''', I would like<br />
to stress once again that Haskell is sensitive to indentation and<br />
spacing, so pay attention to that during cut-n-pastes or manual<br />
alignment of code in a text editor with proportional fonts.<br />
<br />
Oh, almost forgot: author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
<haskell><br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
</haskell><br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
<haskell><br />
main = read list of directories and their sizes<br />
decide how to fit them on CD-Rs<br />
print solution<br />
</haskell><br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
<haskell><br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
<haskell><br />
input <- getContents<br />
</haskell><br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
    ___         ___ _<br />
   / _ \ /\  /\/ __(_)<br />
  / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
 / /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
 \____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return<br />
a String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign value to variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
<haskell><br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
</haskell><br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
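<br />
A small sketch of that idea follows. It assumes nothing beyond the Prelude; "sequence_" is the standard function that runs a list of actions one after another, and the name "actions" is made up for this example:<br />
<br />
<haskell><br />
-- A list of IO actions is just a list of values; nothing is printed<br />
-- while the list is being built.<br />
actions :: [IO ()]<br />
actions = [putStrLn "first", putStrLn "second", putStrLn "third"]<br />
<br />
main :: IO ()<br />
main = do<br />
  let x = putStrLn "I am an action stored in x"<br />
  x                            -- executed here ...<br />
  x                            -- ... and here again<br />
  sequence_ (reverse actions)  -- run the stored actions, last one first<br />
</haskell><br />
<br />
Since the actions are ordinary values, we were free to reverse the list before running it.<br />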
<br />
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
</haskell><br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
<haskell><br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
</haskell><br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
<haskell><br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
</haskell><br />
<br />
See how we construct code out of thin air? Try to imagine what this code will<br />
do, then run it and check yourself. <br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
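<br />
To make the rule about "the type of the last action" concrete, here is a minimal sketch with a made-up action "nameLength". It ends with "return", the standard function that wraps a plain value into an action, so the whole block has type "IO Int" rather than "IO ()":<br />
<br />
<haskell><br />
-- The last action is "return (length name)", so the whole block is IO Int.<br />
nameLength :: IO Int<br />
nameLength = do<br />
  putStrLn "Measuring..."<br />
  let name = "cd-fit"<br />
  return (length name)<br />
<br />
main :: IO ()<br />
main = do<br />
  n <- nameLength<br />
  print n          -- 6<br />
</haskell><br />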
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return function as a result of function and thus<br />
construct functions "from the thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
<haskell><br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
</haskell><br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several of those that we know already. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think about [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
[[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the [http://en.wikipedia.org/wiki/Application_programming_interface API] of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as doing remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
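<br />
To show that "do" really is not tied to IO, here is a small sketch that uses it with a different "crew": the standard Maybe monad, where the hidden machinery is "stop at the first failure". The functions "readNat" and "addSizes" are invented for this illustration and have nothing to do with Parsec:<br />
<br />
<haskell><br />
-- Parse a non-negative number, failing with Nothing on bad input.<br />
readNat :: String -> Maybe Int<br />
readNat s<br />
  | not (null s) && all (`elem` ['0' .. '9']) s = Just (read s)<br />
  | otherwise                                  = Nothing<br />
<br />
-- The same "do" notation, but in the Maybe monad: the first Nothing<br />
-- aborts the rest of the block, with no explicit error checks.<br />
addSizes :: String -> String -> Maybe Int<br />
addSizes a b = do<br />
  x <- readNat a<br />
  y <- readNat b<br />
  return (x + y)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (addSizes "65572" "68268")   -- Just 133840<br />
  print (addSizes "65572" "oops")    -- Nothing<br />
</haskell><br />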
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
<haskell><br />
data Dir = Dir Int String deriving Show<br />
</haskell><br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
which would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, when a data[[type]] has a single [[constructor]], the two are<br />
traditionally given the same name.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later; for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
Exercises: <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar for "manyTill anyChar newline". Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
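<br />
As a hedged illustration of partial application and backtick syntax outside of Parsec (all function names below are invented for this sketch), consider:<br />
<br />
<haskell><br />
-- A function of three arguments<br />
addThree :: Int -> Int -> Int -> Int<br />
addThree x y z = x + y + z<br />
<br />
-- Supplying only two of them gives a new function of one argument<br />
addToTen :: Int -> Int<br />
addToTen = addThree 4 6<br />
<br />
-- Backticks turn any two-argument function into an infix operator,<br />
-- so these two definitions mean exactly the same thing<br />
firstTwo, firstTwo' :: [a] -> [a]<br />
firstTwo  xs = take 2 xs<br />
firstTwo' xs = 2 `take` xs<br />
</haskell><br />
<br />
In ghci, ":t addThree 4 6" reports the type "Int -> Int", and "addToTen 5" evaluates to 15.<br />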
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you never had to specify the types of arguments and return values of methods.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
<haskell><br />
data Either a b = Left a | Right b<br />
</haskell><br />
<br />
To understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
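<br />
A small sketch (with the made-up function name "describe") shows how code inspects the tag by pattern matching on the constructors:<br />
<br />
<haskell><br />
-- One equation per constructor: the "tag" tells us which type is inside<br />
describe :: Either Int Char -> String<br />
describe (Left n)  = "an Int: " ++ show n<br />
describe (Right c) = "a Char: " ++ [c]<br />
</haskell><br />
<br />
For example, "describe (Left 42)" yields "an Int: 42" and "describe (Right 'x')" yields "a Char: x".<br />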
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
<haskell><br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
</haskell><br />
<br />
Exercise:<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
<haskell><br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about a single directory is a size (number), some spaces,<br />
-- then the directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++ <br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
</haskell><br />
<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
----<br />
Exercise: examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on a single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- the amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- Arguments of "compare" are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
</haskell><br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
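<br />
If "foldl" and "sortBy" are new to you, the following sketch shows the same two tools on plain numbers. The names "descending" and "fill" are invented for this example; only "foldl" and "sortBy" come from the standard libraries:<br />
<br />
<haskell><br />
import Data.List (sortBy)<br />
<br />
-- Sort from the biggest down by swapping the arguments of "compare"<br />
descending :: [Int] -> [Int]<br />
descending = sortBy (\a b -> compare b a)<br />
<br />
-- Walk the list left to right, keeping a running total that never<br />
-- exceeds "limit": the accumulator plays the role of our "DirPack"<br />
fill :: Int -> [Int] -> Int<br />
fill limit = foldl step 0<br />
  where step total x = if total + x > limit then total else total + x<br />
</haskell><br />
<br />
With these definitions "descending [3,1,2]" is "[3,2,1]", and "fill 10 (descending [4,5,2,7])" is 9: 7 is taken, 5 and 4 do not fit, 2 is taken.<br />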
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a couple of lines:<br />
<br />
<haskell><br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
</haskell><br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool for automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to re-pack the directories chosen by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
<haskell><br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  -- (In QuickCheck 2 and later "coarbitrary" lives in a separate class,<br />
  -- "CoArbitrary", so there this line must simply be left out.)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of 11 to 3001 chars, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
</haskell><br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
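<br />
Properties do not have to involve our own types. As a minimal sketch (the property name is my own), here is a classic property over lists of Int, for which QuickCheck already knows how to generate random data:<br />
<br />
<haskell><br />
-- Reversing a list twice must give back the original list.<br />
-- The type signature tells QuickCheck which kind of random data to generate.<br />
prop_reverse_twice :: [Int] -> Bool<br />
prop_reverse_twice xs = reverse (reverse xs) == xs<br />
</haskell><br />
<br />
Assuming Test.QuickCheck is imported as in our program, running "quickCheck prop_reverse_twice" in ghci should report that all tests passed.<br />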
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of the '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on the one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your functions must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') can be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of the class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes code (an API or a library)<br />
which requires that input values implement a certain ''interface'', which is<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'', you are free to use the API or library.<br />
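<br />
To make the idea of a contract concrete, here is a sketch of a tiny typeclass of our own. Everything in it ("HasSize", "sizeOf", "totalSize", "File") is hypothetical and exists only for this illustration:<br />
<br />
<haskell><br />
-- The contract: a type is admitted once it can report its size<br />
class HasSize a where<br />
  sizeOf :: a -> Int<br />
<br />
-- A made-up type that fulfils the contract...<br />
data File = File Int String<br />
<br />
instance HasSize File where<br />
  sizeOf (File bytes _) = bytes<br />
<br />
-- ...and a standard one that fulfils it too<br />
instance HasSize Bool where<br />
  sizeOf _ = 1<br />
<br />
-- A "library" function that relies on nothing but the contract<br />
totalSize :: HasSize a => [a] -> Int<br />
totalSize xs = sum (map sizeOf xs)<br />
</haskell><br />
<br />
"totalSize [File 10 "a", File 32 "b"]" gives 42 and "totalSize [True, False]" gives 2, although "totalSize" was written without any knowledge of either type.<br />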
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of the typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for some of its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
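<br />
As a sketch of such a minimal definition, consider a made-up "Version" type (not part of our program) for which we write only "(==)" and "compare" by hand:<br />
<br />
<haskell><br />
data Version = Version Int Int<br />
<br />
instance Eq Version where<br />
  Version a b == Version c d = a == c && b == d<br />
<br />
-- Only "compare" is defined; (<), (<=), (>), (>=), max and min<br />
-- are filled in by the default implementations of the class<br />
instance Ord Version where<br />
  compare (Version a b) (Version c d) = compare (a, b) (c, d)<br />
</haskell><br />
<br />
Now "Version 1 2 < Version 1 10" is True, even though we never defined "(<)" ourselves.<br />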
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's look back at the definition of "Dir":<br />
<br />
<haskell><br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
</haskell><br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler....<br />
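<br />
A short sketch of deriving several instances at once, on a made-up "Priority" type:<br />
<br />
<haskell><br />
import Data.List (sort)<br />
<br />
-- The compiler writes the Eq, Ord, Show, Enum and Bounded instances for us;<br />
-- the derived ordering follows the order of the constructors<br />
data Priority = Low | Medium | High<br />
  deriving (Eq, Ord, Show, Enum, Bounded)<br />
<br />
sorted :: [Priority]<br />
sorted = sort [High, Low, Medium]<br />
<br />
everything :: [Priority]<br />
everything = [minBound .. maxBound]<br />
</haskell><br />
<br />
Both "sorted" and "everything" evaluate to "[Low,Medium,High]", and ghci can print them thanks to the derived Show instance.<br />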
<br />
----<br />
Exercises:<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
<haskell><br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
          -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name of 11 to 3001 chars, consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
</haskell><br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none; there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you have already heard that "Gen" is a "Monad", you can substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be the "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
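<br />
Since "liftM2" works in any monad, it can be tried out without QuickCheck. The sketch below uses the "Maybe" monad and a stand-in record type "Item" invented for this example:<br />
<br />
<haskell><br />
import Control.Monad (liftM2)<br />
<br />
data Item = Item Int String deriving (Eq, Show)<br />
<br />
-- Both components are present, so we get an Item wrapped in Just<br />
good :: Maybe Item<br />
good = liftM2 Item (Just 1) (Just "foo")<br />
<br />
-- One component is missing, so the whole result is missing<br />
bad :: Maybe Item<br />
bad = liftM2 Item Nothing (Just "foo")<br />
</haskell><br />
<br />
Here "good" is "Just (Item 1 "foo")" and "bad" is "Nothing", exactly what the "do"-based definition of "liftM2" would produce.<br />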
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
<haskell><br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
      -- By calculating `bestDisk' for all possible disk sizes, we could<br />
      -- obtain a solution for a particular case by simple lookup in our list of<br />
      -- solutions :)<br />
  let precomp = map bestDisk [0..] <br />
<br />
      -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
      -- Recursion base: best packed disk of size 0 is empty<br />
      bestDisk 0 = DirPack 0 []<br />
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
      -- computed as follows:<br />
      bestDisk limit = <br />
         -- 1. Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
         --    Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
         --    producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
         --    are not yet on our disk:<br />
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                , dir_size d > 0<br />
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                , d `notElem` ds<br />
              ] of<br />
              -- We either fail to add any dirs (probably, because all of them are too big).<br />
              -- Well, just report that the disk must be left empty:<br />
              [] -> DirPack 0 []<br />
              -- Or we produce some alternative packings. Let's choose the best of them all:<br />
              packs -> maximumBy cmpSize packs<br />
<br />
      cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
  in precomp<br />
<br />
-- When we have precomputed disks of all possible sizes for the given set of dirs, the solution to a <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
</haskell><br />
<br />
Notice that it took almost the same amount of text to describe algorithm and to write implementation for it. Nice, eh?<br />
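<br />
The trick of defining a lazy list of all answers in terms of itself is worth a closer look. Here is a minimal sketch of the same technique on a more familiar problem, Fibonacci numbers (the names "fibs" and "fib" are my own):<br />
<br />
<haskell><br />
-- Lazy, infinite list of all answers: element n is computed only<br />
-- when someone asks for it, and then remembered<br />
fibs :: [Integer]<br />
fibs = map fib [0..]<br />
<br />
-- Each answer is built from smaller answers looked up in the list,<br />
-- just like "bestDisk" looks up smaller disks in "precomp"<br />
fib :: Int -> Integer<br />
fib 0 = 0<br />
fib 1 = 1<br />
fib n = fibs !! (n - 1) + fibs !! (n - 2)<br />
</haskell><br />
<br />
"fibs !! 30" gives 832040 without recomputing smaller answers, whereas a naive doubly recursive definition would repeat the same work over and over. Note that "!!" still has to walk the list from its head on every lookup.<br />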
<br />
----<br />
<br />
Exercises:<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which can lead to nicely readable code, but, when abused, can lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose a complex condition to filter the list of dirs<br />
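<br />
For reference, here is a small list comprehension with a guard and a local binding, as a sketch on plain numbers (the name "oddSquares" is invented for this example):<br />
<br />
<haskell><br />
-- For every x from the list that passes the guard "odd x",<br />
-- bind sq locally and emit the pair (x, sq)<br />
oddSquares :: [(Int, Int)]<br />
oddSquares = [ (x, sq) | x <- [1..6], odd x, let sq = x * x ]<br />
</haskell><br />
<br />
"oddSquares" evaluates to "[(1,1),(3,9),(5,25)]".<br />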
<br />
---- <br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
<haskell><br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack ds <br />
  in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
</haskell><br />
<br />
Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>prop_dynamic_pack_is_fixpoint</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
<haskell><br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
  let pack = dynamic_pack media_size ds <br />
  in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
  let pack = dynamic_pack 50000 ds<br />
  in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
</haskell><br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (newer versions of GHC call the <tt>-auto-all</tt> flag <tt>-fprof-auto</tt>) and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the "individual %alloc" column. As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". It seems that our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 MB in size,<br />
onto imaginary disks of 50000 bytes.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but the code nevertheless spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for about 90% of the total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume yours as well) shows that VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We can<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>).<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element of "precomp" in this piece of code:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
</haskell><br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we cannot know which elements will<br />
never be needed again and could therefore be garbage collected.<br />
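<br />
The same effect can be reproduced on a much smaller problem. The following is only a sketch: the names <tt>fibs</tt> and <tt>fib</tt> are made up for this illustration and are not part of cd-fit, but the table is built and consulted in the same style as "precomp":<br />
<br />
```haskell
-- A lazily built table whose elements refer back to earlier elements,
-- in the same style as the "precomp" list discussed above.
-- (Illustrative names only; not part of cd-fit.)
fibs :: [Integer]
fibs = map fib [0 ..]

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
-- Every element looks up two earlier ones with (!!), so the whole
-- prefix of the list has to stay reachable while we walk it.
fib n = fibs !! (n - 1) + fibs !! (n - 2)
```
<br />
Evaluating <tt>fibs !! 30</tt> in ghci is quick because each element is computed only once, but for the same reason no element can be freed while <tt>fibs</tt> is still referenced. Note also that every <tt>!!</tt> walks the list from its head, so a lookup at index n costs n steps.<br />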
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
<haskell><br />
-- Let `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: the best-packed disk of size 0 is empty, and the best-packed<br />
-- disk for an empty list of directories is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, the best-packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
   -- Take all non-empty dirs that could possibly fit on that disk by themselves.<br />
   -- Consider them one by one. Let the size of a particular dir be `dir_size d'.<br />
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
   -- producing a disk of size <= limit. Let's do that for all "candidate" dirs that<br />
   -- are not yet on our disk:<br />
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
        , d `notElem` ds<br />
        ] of<br />
          -- We either fail to add any dirs (probably because all of them are too big).<br />
          -- Well, just report that the disk must be left empty:<br />
          [] -> DirPack 0 []<br />
          -- Or we produce some alternative packings. Let's choose the best of them all:<br />
          packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
</haskell><br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of more than 600 (from about 721 MB to about 1.1 MB)! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
<haskell><br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
</haskell><br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
run, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
<haskell><br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
</haskell><br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
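<br />
For readers who want to play with the amended algorithm in isolation, here is a self-contained sketch. The record definitions of <tt>Dir</tt> and <tt>DirPack</tt> are assumptions reconstructed from the field names used in this chapter (with an <tt>Eq</tt> instance added, because <tt>delete</tt> needs one), and <tt>(<= limit)</tt> stands in for the <tt>inRange</tt> test, which gives the same result once the <tt>dir_size d > 0</tt> guard is applied:<br />
<br />
```haskell
import Data.List (delete, maximumBy)
import Data.Ord (comparing)

-- Simplified stand-ins for the cd-fit types (assumed here, so that the
-- sketch compiles on its own).
data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving Show

-- Best packing of the given directories onto a disk of size `limit`.
bestDisk :: Int -> [Dir] -> DirPack
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
bestDisk limit ds =
  case [ DirPack (dir_size d + s) (d : rest)
       | let small_enough = filter ((<= limit) . dir_size) ds
       , d <- small_enough
       , dir_size d > 0
         -- d is removed before recursing, so it cannot be packed twice
       , let DirPack s rest = bestDisk (limit - dir_size d) (delete d small_enough)
       ] of
    []    -> DirPack 0 []
    packs -> maximumBy (comparing pack_size) packs
```
<br />
In ghci, <tt>bestDisk 10 [Dir 6 "a", Dir 5 "b", Dir 5 "c"]</tt> should pack the two directories of size 5 and leave the directory of size 6 out.<br />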
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
<haskell><br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
</haskell><br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
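<br />
A tiny hand-made case shows why a greedy strategy can lose. This is only an illustration working on plain sizes: <tt>greedyTotal</tt> is a hypothetical largest-first strategy and not necessarily the <tt>greedy_pack</tt> from this tutorial, and <tt>bestTotal</tt> simply tries every subset:<br />
<br />
```haskell
import Data.List (sortBy, subsequences)

-- Hypothetical greedy strategy: try sizes from largest to smallest and
-- keep each one that still fits into the remaining space.
greedyTotal :: Int -> [Int] -> Int
greedyTotal limit sizes = go limit (sortBy (flip compare) sizes)
  where
    go _ [] = 0
    go free (s : rest)
      | s <= free = s + go (free - s) rest
      | otherwise = go free rest

-- Exhaustive optimum: try every subset. Exponential, so only usable
-- for tiny inputs, but easy to trust.
bestTotal :: Int -> [Int] -> Int
bestTotal limit sizes =
  maximum (0 : [ t | sub <- subsequences sizes, let t = sum sub, t <= limit ])
```
<br />
With a limit of 10 and sizes 6, 5 and 5, the greedy sketch grabs the 6 first and then has no room for anything else, while the exhaustive search finds the two 5s.<br />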
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
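<br />
One possible starting point for such scaling is sketched below. The function name <tt>scaleProblem</tt> and the rounding choices are my assumptions, not something prescribed by this tutorial: directory sizes are rounded up and the medium size is rounded down, so that any packing that fits in scaled units also fits in real bytes (at the price of possibly missing the very best packing):<br />
<br />
```haskell
-- Scale a packing problem down by the size of the smallest non-empty
-- directory. Returns the scaled limit and the scaled sizes.
-- (Hypothetical helper; works on plain sizes for simplicity.)
scaleProblem :: Int -> [Int] -> (Int, [Int])
scaleProblem limit sizes
  | null positive = (limit, sizes)             -- nothing to scale by
  | otherwise     = (limit `div` unit, map scaleUp sizes)
  where
    positive  = filter (> 0) sizes
    unit      = minimum positive
    -- integer division rounding up
    scaleUp s = (s + unit - 1) `div` unit
```
<br />
For example, a limit of 700 with sizes 100, 250 and 300 becomes a limit of 7 with sizes 1, 3 and 3.<br />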
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
<haskell><br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
</haskell><br />
<br />
really is just syntactic sugar for:<br />
<br />
<haskell><br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
</haskell><br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
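<br />
The equivalence is easy to check on a monad whose results can be compared. The sketch below uses <tt>Maybe</tt> instead of IO purely for that reason; <tt>safeDiv</tt>, <tt>withDo</tt> and <tt>withBind</tt> are names invented for this example:<br />
<br />
```haskell
-- Any monad will do for this comparison; Maybe is used instead of IO
-- only because its results can be compared with (==).
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- Written with "do" ...
withDo :: Int -> Int -> Maybe Int
withDo x y = do
  a <- safeDiv 100 x
  b <- safeDiv 100 y
  return (a + b)

-- ... and the same thing written with (>>=) and lambdas.
withBind :: Int -> Int -> Maybe Int
withBind x y =
  safeDiv 100 x >>= \a ->
  safeDiv 100 y >>= \b ->
  return (a + b)
```
<br />
Both functions give the same answer for every pair of arguments, including those where one of the divisions fails.<br />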
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=3889Hitchhikers guide to Haskell2006-04-27T07:32:22Z<p>Adept: Fixed path to repo</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
main = read list of directories and their sizes<br />
decide how to fit them on CD-Rs<br />
print solution<br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
 module Main where<br />
 <br />
 main = do input <- getContents<br />
           putStrLn ("DEBUG: got input " ++ input)<br />
           -- compute solution and print it<br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
input <- getContents<br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from the stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
___ ___ _<br />
/ _ \ /\ /\/ __(_)<br />
/ /_\// /_/ / / | | GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_| Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
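The list-of-actions idea can be sketched with nothing but the standard libraries. In this illustration (the names <tt>steps</tt> and <tt>runSteps</tt> are made up, and an <tt>IORef</tt> is used only so that the order of execution can be observed), <tt>sequence_</tt> is the "way to execute them one by one":<br />
<br />
```haskell
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

-- Build a list of actions. Nothing is executed at this point: the list
-- only holds descriptions of what to do.
steps :: IORef [String] -> [IO ()]
steps logRef =
  [ modifyIORef logRef (++ [name]) | name <- ["first", "second", "third"] ]

-- Execute the actions one by one, left to right, then return the log.
runSteps :: IO [String]
runSteps = do
  logRef <- newIORef []
  sequence_ (steps logRef)
  readIORef logRef
```
<br />
Running <tt>runSteps</tt> in ghci returns the names in the order in which the actions appear in the list.<br />
<br />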
Haskell allows you to do better than that. <br />
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
 c = do a <- someAction<br />
        b <- someOtherAction<br />
        print (bar b)<br />
        print (foo a)<br />
        putStrLn "done"<br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using the "<-" (if it returns result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
 process = do putStrLn "Will do some processing"<br />
              c<br />
              putStrLn "Done"<br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of these functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and a "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
 module Main where<br />
 c = putStrLn "C!"<br />
 <br />
 combine before after =<br />
   do before<br />
      putStrLn "In the middle"<br />
      after<br />
 <br />
 main = do combine c c<br />
           let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
           let d = combine (b) (combine c c)<br />
           putStrLn "So long!"<br />
<br />
See how we construct code out of thin air? Try to imagine what this code will<br />
do, then run it and check yourself. <br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
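<br />
The rule about the type of the last action can be seen in a small sketch. The function <tt>countChars</tt> is invented for this illustration; its last action is a <tt>return</tt> of an Int, so the whole "do" block has type <tt>IO Int</tt> rather than <tt>IO ()</tt>:<br />
<br />
```haskell
-- The last action is `return (length line)`, of type IO Int,
-- so the whole block has type IO Int as well.
countChars :: String -> IO Int
countChars line = do
  putStrLn ("DEBUG: got input " ++ line)
  return (length line)
```
<br />
Try <tt>:type countChars "foo"</tt> in ghci, then swap the two lines of the "do" block and see how the compiler reacts.<br />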
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
 import Text.ParserCombinators.Parsec<br />
 <br />
 -- parseInput parses output of "du -sb", which consists of many lines,<br />
 -- each of which describes single directory<br />
 parseInput = <br />
   do dirs <- many dirAndSize<br />
      eof<br />
      return dirs<br />
 <br />
 -- Datatype Dir holds information about single directory - its size and name<br />
 data Dir = Dir Int String deriving Show<br />
 <br />
 -- `dirAndSize` parses information about single directory, which is:<br />
 -- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
 dirAndSize = <br />
   do size <- many1 digit<br />
      spaces<br />
      dir_name <- anyChar `manyTill` newline<br />
      return (Dir (read size) dir_name)<br />
<br />
Just add those lines to the top of "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we already know. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
A [[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all the underlying complexities from us, exposing the API of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
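<br />
To make the "hidden machinery" a little more concrete, here is a toy parsing monad. It is emphatically '''not''' how Parsec is implemented (there is no error reporting or position tracking here); it is a minimal sketch, assuming a reasonably recent GHC, in which all names (<tt>Parser</tt>, <tt>runParser</tt>, <tt>takeWhile1</tt>, <tt>sizeAndName</tt>) are invented for this example:<br />
<br />
```haskell
import Data.Char (isDigit, isSpace)

-- A toy parser: a function from the remaining input to either a failure
-- (Nothing) or a result paired with the input that is left over.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad Parser where
  -- (>>=) is where the plumbing lives: it threads the leftover input
  -- from one parser to the next and stops at the first failure.
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> runParser (f a) rest

-- Consume a non-empty run of characters satisfying the predicate.
takeWhile1 :: (Char -> Bool) -> Parser String
takeWhile1 ok = Parser $ \s -> case span ok s of
  ("", _)    -> Nothing
  (xs, rest) -> Just (xs, rest)

-- One line of "du -sb" output without the trailing newline:
-- digits, whitespace, then the rest of the line as the name.
sizeAndName :: Parser (Int, String)
sizeAndName = do
  size <- takeWhile1 isDigit
  _    <- takeWhile1 isSpace
  name <- takeWhile1 (/= '\n')
  return (read size, name)
```
<br />
Notice that <tt>sizeAndName</tt> never mentions the input string or the possibility of failure: both are handled inside <tt>>>=</tt>. Try <tt>runParser sizeAndName "65572 /home/adept/photos"</tt> and <tt>runParser sizeAndName "foo"</tt> in ghci.<br />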
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
which would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of a data[[type]] and the name of its [[constructor]] are<br />
chosen to be the same.<br />
<br />
The clause "[[deriving]] Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
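As a small illustration of what "deriving Show" buys us (the helper <tt>describe</tt> is made up for this example), the derived <tt>show</tt> renders a value the way we would write it in source code:<br />
<br />
```haskell
data Dir = Dir Int String deriving Show

-- `show` comes from the Show class; `print x` is `putStrLn (show x)`.
describe :: Dir -> String
describe d = "parsed: " ++ show d
```
<br />
Without the "deriving Show" clause this definition of <tt>describe</tt> would be rejected by the compiler, because it would not know how to turn a "Dir" into a String.<br />
<br />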
Exercises: <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value, but a new function. This is called ''partial application''.<br />
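<br />
Partial application is not specific to parsers. Here is a short sketch with ordinary Prelude functions (the names <tt>firstThree</tt> and <tt>increment</tt> are invented for the example):<br />
<br />
```haskell
-- `take` needs two arguments; giving it only one yields a new function.
firstThree :: [a] -> [a]
firstThree = take 3

-- Operator sections work the same way: (+ 1) still waits for a number.
increment :: Int -> Int
increment = (+ 1)
```
<br />
Compare <tt>:type take</tt>, <tt>:type take 3</tt> and <tt>:type take 3 "foobar"</tt> in ghci, just as the exercise suggests for "manyTill".<br />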
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
the source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
data Either a b = Left a | Right b<br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
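<br />
"Looking at the constructor" is done with pattern matching. A minimal sketch (the function <tt>describeEither</tt> is invented for this example):<br />
<br />
```haskell
-- Pattern matching on the constructor tells us which alternative we hold.
describeEither :: Either Int Char -> String
describeEither (Left n)  = "an Int: " ++ show n
describeEither (Right c) = "a Char: " ++ [c]
```
<br />
The compiler knows that "Left" and "Right" are the only two constructors, so a function with these two clauses covers every possible value of the type.<br />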
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
 main = do input <- getContents<br />
           putStrLn ("DEBUG: got input " ++ input)<br />
           let dirs = case parse parseInput "stdin" input of<br />
                           Left err -> error $ "Input:\n" ++ show input ++ <br />
                                               "\nError:\n" ++ show err<br />
                           Right result -> result<br />
           putStrLn "DEBUG: parsed:"; print dirs<br />
<br />
Exercise:<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
 module Main where<br />
<br />
 import Text.ParserCombinators.Parsec<br />
<br />
 -- Output of "du -sb" -- which is our input -- consists of many lines,<br />
 -- each of which describes a single directory<br />
 parseInput =<br />
   do dirs <- many dirAndSize<br />
      eof<br />
      return dirs<br />
<br />
 -- Information about a single directory is a size (number), some spaces,<br />
 -- then the directory name, which extends till newline<br />
 data Dir = Dir Int String deriving Show<br />
 dirAndSize =<br />
   do size <- many1 digit<br />
      spaces<br />
      dir_name <- anyChar `manyTill` newline<br />
      return $ Dir (read size) dir_name<br />
<br />
 main = do input <- getContents<br />
           putStrLn ("DEBUG: got input " ++ input)<br />
           let dirs = case parse parseInput "stdin" input of<br />
                           Left err -> error $ "Input:\n" ++ show input ++<br />
                                               "\nError:\n" ++ show err<br />
                           Right result -> result<br />
           putStrLn "DEBUG: parsed:"; print dirs<br />
           -- compute solution and print it<br />
<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called a "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
----<br />
Exercise: examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
 import Data.List (sortBy)<br />
<br />
 -- DirPack holds a set of directories which are to be stored on single CD.<br />
 -- 'pack_size' could be calculated, but we will store it separately to reduce<br />
 -- amount of calculation<br />
 data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
 -- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
 media_size = 700*1024*1024<br />
<br />
 -- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
 greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
   where<br />
   -- Arguments are swapped in "compare" so that the biggest directories come first<br />
   cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
 -- Helper function, which only adds directory "d" to the pack "p" when new<br />
 -- total size does not exceed media_size<br />
 maybe_add_dir p d =<br />
   let new_size = pack_size p + dir_size d<br />
       new_dirs = d:(dirs p)<br />
   in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
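<br />
If "foldl" and "sortBy" still feel abstract, here is a small sketch on plain numbers and pairs. The names "bySecondDesc" and "sumUpTo" are invented for this illustration. The second function has the same shape as our packer: an accumulator that only grows while it stays within a limit.<br />

```haskell
import Data.List (sortBy)

-- Sort pairs by their second component, largest first
-- (the arguments of "compare" are swapped to reverse the order).
bySecondDesc :: [(String, Int)] -> [(String, Int)]
bySecondDesc = sortBy (\(_, a) (_, b) -> compare b a)

-- foldl threads an accumulator through the list from left to right;
-- an element is added only if the running total stays within the limit.
sumUpTo :: Int -> [Int] -> Int
sumUpTo limit = foldl step 0
  where
    step acc x = if acc + x > limit then acc else acc + x
```

For example, "sumUpTo 10 [6,5,3,2]" takes 6, skips 5, takes 3 and skips 2.<br />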
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a line:<br />
<br />
 main = do ...<br />
           -- compute solution and print it<br />
           putStrLn "Solution:" ; print (greedy_pack dirs)<br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
 import Test.QuickCheck<br />
 import Control.Monad (liftM2)<br />
<br />
 -- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
 instance Arbitrary Dir where<br />
   -- Let's just skip "coarbitrary" for now, ok?<br />
   -- I promise, we will get back to it later :)<br />
   coarbitrary = undefined<br />
   -- We generate arbitrary "Dir" by generating random size and random name<br />
   -- and stuffing them inside "Dir"<br />
   arbitrary = liftM2 Dir gen_size gen_name<br />
           -- Generate random size between 10 and 1400 Mb<br />
     where gen_size = do s <- choose (10,1400)<br />
                         return (s*1024*1024)<br />
           -- Generate random name 11 to 3001 chars long (n*10+1 for n from 1 to 300),<br />
           -- consisting of symbols "fubar/"<br />
           gen_name = do n <- choose (1,300)<br />
                         sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
 -- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
 -- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
 prop_greedy_pack_is_fixpoint ds =<br />
   let pack = greedy_pack ds<br />
   in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
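<br />
If you wonder what a single test run boils down to, here is a sketch that checks the same property by hand, without QuickCheck. The "samples" list and the helper "mb" are invented for this illustration, and the packer is assumed to sort the biggest directories first, as described in the text:<br />

```haskell
import Data.List (sortBy)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show
data DirPack = DirPack {pack_size :: Int, dirs :: [Dir]} deriving Show

media_size :: Int
media_size = 700 * 1024 * 1024

-- Assumption: biggest directories are tried first
greedy_pack :: [Dir] -> DirPack
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) (sortBy cmpSize ds)
  where
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)

maybe_add_dir :: DirPack -> Dir -> DirPack
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
  in if new_size > media_size then p else DirPack new_size (d : dirs p)

-- The property itself: packing an already packed set changes nothing
prop_greedy_pack_is_fixpoint :: [Dir] -> Bool
prop_greedy_pack_is_fixpoint ds =
  let pack = greedy_pack ds
  in pack_size pack == pack_size (greedy_pack (dirs pack))

-- Hand-picked inputs (hypothetical), sizes given in Mb
samples :: [[Dir]]
samples = [ [], [mb 800 "too-big"], [mb 400 "a", mb 350 "b", mb 300 "c"] ]
  where
    mb n name = Dir (n * 1024 * 1024) name

-- What QuickCheck does for us, minus the random generation
checkSamples :: Bool
checkSamples = all prop_greedy_pack_is_fixpoint samples
```

QuickCheck saves us from inventing "samples" ourselves, and it tends to find the inputs we would never think of.<br />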
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit.<br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy.<br />
<br />
Think of typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: 'arbitrary' and 'coarbitrary', with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
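<br />
Here is a minimal sketch of such a contract. The class "Sized" and the types "File" and "Blob" are hypothetical, invented just for this illustration. Note that "totalSize" is written once, solely in terms of the class, and works for every instance:<br />

```haskell
-- Hypothetical class: anything that can report its size
class Sized a where
  sizeOf :: a -> Int

data File = File String Int   -- a name and a size
newtype Blob = Blob [Int]     -- just a bunch of numbers

-- Each type shows how it fulfills the contract
instance Sized File where
  sizeOf (File _ n) = n

instance Sized Blob where
  sizeOf (Blob xs) = length xs

-- The "library" function: it knows nothing about File or Blob,
-- only that its argument type is an instance of Sized
totalSize :: Sized a => [a] -> Int
totalSize = sum . map sizeOf
```

Ask ghci for ":t totalSize" and compare the result with the type of "Data.List.sort".<br />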
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or (<=)) and the others<br />
will be automatically derived. If you supplied fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
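<br />
As a small sketch of a minimal definition at work, consider the type "Priority" and the helper "rank" below (both are made up for this illustration). We write only "compare", and the comparison operators, "max", "min" and even "sort" work right away:<br />

```haskell
import Data.List (sort)

-- Hypothetical type, not part of cd-fit.hs
data Priority = Low | Medium | High deriving (Eq, Show)

-- Helper that maps each priority to a number
rank :: Priority -> Int
rank Low    = 0
rank Medium = 1
rank High   = 2

-- Only "compare" is defined here; (<), (>=), max, min and the rest
-- come from the default implementations of the class
instance Ord Priority where
  compare a b = compare (rank a) (rank b)
```

Note that "Eq" had to be there first (we derived it), exactly as the "class Eq a => Ord a" line demands.<br />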
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
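<br />
A quick sketch of what you get for free (the type "Track" is hypothetical, chosen so as not to spoil the exercise below). The derived "Ord" compares the fields from left to right, and the derived "Show" prints the record syntax back:<br />

```haskell
import Data.List (sort)

-- Hypothetical record; all three instances are written by the compiler
data Track = Track {track_no :: Int, title :: String} deriving (Eq, Ord, Show)

-- Sample data, invented for illustration
tracks :: [Track]
tracks = [Track 2 "b-side", Track 1 "z-side"]

-- Sorting uses the derived Ord: track_no first, then title
sortedTracks :: [Track]
sortedTracks = sort tracks
```

Try "print sortedTracks" in ghci to see the derived "Show" at work.<br />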
<br />
Sidenote for Java programmers: just imagine java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by compiler....<br />
<br />
----<br />
Exercises:<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which stands for any instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
   arbitrary = liftM2 Dir gen_size gen_name<br />
           -- Generate random size between 10 and 1400 Mb<br />
     where gen_size = do s <- choose (10,1400)<br />
                         return (s*1024*1024)<br />
           -- Generate random name 11 to 3001 chars long (n*10+1 for n from 1 to 300),<br />
           -- consisting of symbols "fubar/"<br />
           gen_name = do n <- choose (1,300)<br />
                         sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we sure must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2".<br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is a number of arguments for data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
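<br />
If the third reading does not help, try "liftM2" on a monad that is easier to poke at. In the sketch below "Maybe" stands in for "Gen"; the names "maybeDir" and "liftM2'" are invented for this illustration, and the second one simply spells out the definition quoted above:<br />

```haskell
import Control.Monad (liftM2)

data Dir = Dir {dir_size :: Int, dir_name :: String} deriving (Eq, Show)

-- Same shape as "liftM2 Dir gen_size gen_name", with Maybe in place of Gen
maybeDir :: Maybe Int -> Maybe String -> Maybe Dir
maybeDir = liftM2 Dir

-- The definition from the list above, written out with "do"
liftM2' :: Monad m => (a -> b -> r) -> m a -> m b -> m r
liftM2' f a1 a2 = do
  x <- a1
  y <- a2
  return (f x y)
```

Evaluate "maybeDir (Just 1) (Just "a")" and "maybeDir Nothing (Just "a")" in ghci: the constructor "Dir" is applied only when both "actions" deliver a value.<br />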
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare packing methods efficiency, and learn something new<br />
about debugging and profiling of the Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if it is, in which particular way. Are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
 ----------------------------------------------------------------------------------<br />
 -- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
 --<br />
 -- Let the `bestDisk x' be the "most tightly packed" disk of total<br />
 -- size no more than `x'.<br />
 precomputeDisksFor :: [Dir] -> [DirPack]<br />
 precomputeDisksFor dirs =<br />
       -- By calculating `bestDisk' for all possible disk sizes, we could<br />
       -- obtain a solution for particular case by simple lookup in our list of<br />
       -- solutions :)<br />
   let precomp = map bestDisk [0..]<br />
<br />
       -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
       -- Recursion base: best packed disk of size 0 is empty<br />
       bestDisk 0 = DirPack 0 []<br />
       -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
       -- computed as follows:<br />
       bestDisk limit =<br />
          -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
          --    Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
          --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
          --    producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
          --    are not yet on our disk:<br />
          case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                 , dir_size d > 0<br />
                                                 , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                 , d `notElem` ds<br />
               ] of<br />
                 -- We either fail to add any dirs (probably, because all of them too big).<br />
                 -- Well, just report that disk must be left empty:<br />
                 [] -> DirPack 0 []<br />
                 -- Or we produce some alternative packings. Let's choose the best of them all:<br />
                 packs -> maximumBy cmpSize packs<br />
<br />
       cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
   in precomp<br />
<br />
 -- When we precomputed disk of all possible sizes for the given set of dirs, solution to<br />
 -- particular problem is simple: just take the solution for the required 'media_size' and<br />
 -- that's it!<br />
 dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
<br />
Notice that it took almost the same amount of text to describe algorithm and to write implementation for it. Nice, eh?<br />
<br />
----<br />
<br />
Exercises:<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define some new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose complex condition to filter the list of dirs<br />
<br />
---- <br />
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
 prop_dynamic_pack_is_fixpoint ds =<br />
   let pack = dynamic_pack ds<br />
   in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
<br />
Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):<br />
<br />
 *Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
You did take my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input is most probably not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be the problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow the disk size to be tweaked:<br />
<br />
 dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
 prop_dynamic_pack_is_fixpoint ds =<br />
   let pack = dynamic_pack media_size ds<br />
   in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
 prop_dynamic_pack_small_disk ds =<br />
   let pack = dynamic_pack 50000 ds<br />
   in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
 -- rename "old" main to "moin"<br />
 main = quickCheck prop_dynamic_pack_small_disk<br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> (recent GHC releases spell <tt>-auto-all</tt> as <tt>-fprof-auto</tt>) and run it like this:<br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine profile. Look into file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        2.18 secs   (109 ticks @ 20 ms)<br />
 total alloc = 721,433,008 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                    MODULE    %time %alloc<br />
<br />
 precomputeDisksFor             Main       88.1   99.8<br />
 dynamic_pack                   Main       11.0    0.0<br />
                                                         individual      inherited<br />
 COST CENTRE                    MODULE    no.  entries  %time %alloc   %time %alloc<br />
<br />
 MAIN                           MAIN        1        0    0.0    0.0   100.0  100.0<br />
 CAF                            Main      174       11    0.9    0.2   100.0  100.0<br />
 prop_dynamic_pack_small_disk   Main      181      100    0.0    0.0    99.1   99.8<br />
 dynamic_pack                   Main      182      200   11.0    0.0    99.1   99.8<br />
 precomputeDisksFor             Main      183      200   88.1   99.8    88.1   99.8<br />
 main                           Main      180        1    0.0    0.0     0.0    0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID".<br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
to the imaginary disks of 50000 bytes those randomly generated<br />
directories which are 10 to 1400 Mb in size.... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which is responsible for 90% of total<br />
memory consumption and run time) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that the memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks with Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected.<br />
<br />
Note how we look up element from "precomp" in this piece of code:<br />
<br />
          case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                                 , dir_size d > 0<br />
                                                 , let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
                                                 , d `notElem` ds<br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that some element<br />
could be garbage collected and will not be needed again.<br />
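<br />
The same pattern in miniature, as a toy analogue rather than the cd-fit code (the names "table" and "step" are invented for this sketch): every element is defined through lookups into the very list being built, so no prefix of the list can ever be thrown away.<br />

```haskell
-- A list whose elements are computed from earlier elements of the same list.
-- Since any later element may ask for "table !! k", the whole evaluated
-- prefix has to stay in memory.
table :: [Integer]
table = map step [0 ..]
  where
    step :: Int -> Integer
    step 0 = 0
    step 1 = 1
    step n = table !! (n - 1) + table !! (n - 2)
```

Here the retained list is a blessing: "table !! 50" is instant, because each element is computed only once. In "precomputeDisksFor" the list simply has far too many elements for this trick to pay off.<br />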
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
 -- Let the `bestDisk x' be the "most tightly packed" disk of total<br />
 -- size no more than `x'.<br />
 -- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
 -- Recursion base: best packed disk of size 0 is empty and best-packed<br />
 -- disk for empty list of directories on it is also empty.<br />
 bestDisk 0 _ = DirPack 0 []<br />
 bestDisk _ [] = DirPack 0 []<br />
 -- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
 -- computed as follows:<br />
 bestDisk limit dirs =<br />
    -- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
    -- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
    -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
    -- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
    -- are not yet on our disk:<br />
    case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
                                           , dir_size d > 0<br />
                                           , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs<br />
                                           , d `notElem` ds<br />
         ] of<br />
           -- We either fail to add any dirs (probably, because all of them too big).<br />
           -- Well, just report that disk must be left empty:<br />
           [] -> DirPack 0 []<br />
           -- Or we produce some alternative packings. Let's choose the best of them all:<br />
           packs -> maximumBy cmpSize packs<br />
<br />
 cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
 dynamic_pack limit dirs = bestDisk limit dirs<br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
 cd-fit +RTS -p -RTS<br />
<br />
 total time  =        0.00 secs   (0 ticks @ 20 ms)<br />
 total alloc =   1,129,520 bytes  (excludes profiling overheads)<br />
<br />
 COST CENTRE                    MODULE       %time %alloc<br />
<br />
 CAF                            GHC.Float      0.0    4.4<br />
 main                           Main           0.0   93.9<br />
<br />
                                                         individual      inherited<br />
 COST CENTRE                    MODULE    no.  entries  %time %alloc   %time %alloc<br />
 MAIN                           MAIN        1        0    0.0    0.0     0.0  100.0<br />
 main                           Main      180        1    0.0   93.9     0.0   94.2<br />
 prop_dynamic_pack_small_disk   Main      181      100    0.0    0.0     0.0    0.3<br />
 dynamic_pack                   Main      182      200    0.0    0.2     0.0    0.3<br />
 bestDisk                       Main      183      200    0.0    0.1     0.0    0.1<br />
<br />
We achieved a major improvement: memory consumption is reduced by a factor<br />
of 700! Now we can test the code on the "real task" - change the<br />
code to run the test for packing a full-sized disk:<br />
<br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word for it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it does so rather inefficiently:<br />
each time, we pass to <tt>bestDisk</tt> the exact same set of<br />
directories it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
   case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (1,limit)) . dir_size ) dirs<br />
        , d <- small_enough<br />
        , dir_size d > 0<br />
        , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
        ] of<br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly.<br />
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises.<br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open to proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       print "done"<br />
<br />
really is just syntactic sugar for:<br />
<br />
c = someAction >>= \a -><br />
    someOtherAction >>= \b -><br />
    print (bar b) >><br />
    print (foo a) >><br />
    print "done"<br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
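<br />
To see the equivalence for yourself, here is a minimal sketch. It is not part of cd-fit: the functions "foo" and "bar" are hypothetical stand-ins, and the printing action is passed in as a parameter named "say" (also an assumption of this sketch) so that both versions can be run in any monad, not just IO:<br />

```haskell
-- Hypothetical stand-ins for the "foo" and "bar" of the text.
foo, bar :: Int -> Int
foo = (+ 1)
bar = (* 2)

-- The version written with "do". The printing action "say" and the two
-- input actions are parameters, so the code works in any monad.
sugared :: Monad m => (String -> m ()) -> m Int -> m Int -> m ()
sugared say someAction someOtherAction =
  do a <- someAction
     b <- someOtherAction
     say (show (bar b))
     say (show (foo a))
     say "done"

-- The same scenario written by hand with ">>=" and ">>".
desugared :: Monad m => (String -> m ()) -> m Int -> m Int -> m ()
desugared say someAction someOtherAction =
  someAction >>= \a ->
  someOtherAction >>= \b ->
  say (show (bar b)) >>
  say (show (foo a)) >>
  say "done"
```

Load this into ghci and try both "sugared putStrLn (return 1) (return 10)" and "desugared putStrLn (return 1) (return 10)": the two calls should behave identically.<br />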
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=3549Hitchhikers guide to Haskell2006-04-09T00:16:33Z<p>Adept: Chapter 4 :)</p>
<hr />
<div>== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
Oh, almost forgot: the author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[User:Adept|Adept]] for contact info) or submit<br />
patches to the tutorial via darcs (<br />
[http://adept.linux.kiev.ua/~adept/repos/hhgtth/ repository is here]) or directly to this<br />
Wiki. <br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tightly as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create subdirectory "_darcs" to store<br />
all version-control-related stuff there.<br />
<br />
Fire up your favorite editor and create a new file called "cd-fit.hs"<br />
in our working directory. Now let's think for a moment about how our<br />
program will operate and express it in pseudocode:<br />
<br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look more closely at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
input <- getContents<br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what a function will do.<br />
<br />
Let's fire up an interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments, that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign a value to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words - to actually do some I/O and<br />
return its result (if any).<br />
<br />
We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:<br />
<br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and have found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
Haskell allows you to do better than that. <br />
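<br />
Before we move on, the "list of IO actions" idea from the previous paragraph can be sketched as follows. The names "runAll" and "greetings" are made up for this illustration and are not part of cd-fit:<br />

```haskell
-- Run a list of IO actions one by one, in list order, collecting results.
runAll :: [IO a] -> IO [a]
runAll []            = return []
runAll (action:rest) =
  do x  <- action
     xs <- runAll rest
     return (x:xs)

-- Building this list prints nothing: the actions are just values
-- until something evaluates them.
greetings :: [IO ()]
greetings = [putStrLn "one", putStrLn "two", putStrLn "three"]
```

Typing "runAll greetings" in ghci executes the three actions in list order. The Prelude function "sequence" does the same job for lists of actions, so you rarely need to write such a helper by hand.<br />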
<br />
The standard language library (named "Prelude", by the way) provides<br />
us with lots of functions that return useful primitive IO actions. In<br />
order to combine them to produce even more complex actions, we use "do":<br />
<br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and using "do" created from them a new function, which we<br />
bound to symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of these functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will the "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''The execution of a Haskell program is an evaluation of the symbol "main" to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and that arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially a "Command<br />
pattern" and a "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
<br />
See how we construct code out of thin air? Try to imagine what this code will<br />
do, then run it and check yourself. <br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide a commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"[[Parsec]]" which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC to name a few),<br />
[[Parsec]] parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes single directory<br />
parseInput =<br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize =<br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
<br />
Just add those lines into "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we know already.<br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. "Do" is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
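<br />
To make this concrete, here is a small sketch of "do" used in a monad that has nothing to do with IO or parsing: the standard "Maybe" monad, where a failing step aborts the rest of the block. The functions "safeDiv" and "compute" are invented for this illustration:<br />

```haskell
-- Division that reports failure (Nothing) instead of crashing on zero.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- "do" in the Maybe monad: no IO anywhere in sight.
-- If any step yields Nothing, the whole block yields Nothing.
compute :: Int -> Int -> Int -> Maybe Int
compute a b c =
  do q <- safeDiv a b
     r <- safeDiv q c
     return (q + r)
```

In ghci, "compute 100 5 2" gives "Just 30", while "compute 100 0 2" gives "Nothing". The "do" block looks exactly like the ones we wrote for IO; only the monad underneath is different.<br />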
<br />
Think about [[monad]] as a "[[:Category:Idioms|design pattern]]" in the functional world.<br />
[[Monad]] is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the API of their<br />
libraries (IO and parsing) in the form of "monadic action" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we could see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce "Dir". Assuming that "Dir" somehow represents information<br />
about single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''data[[type]]'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
In order to construct such records, we must use ''data [[constructor]]''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
which would define the ''data[[type]]'' "Dir" with the ''data [[constructor]]'' "D".<br />
However, traditionally the name of the data[[type]] and its [[constructor]] are<br />
chosen to be the same.<br />
<br />
Clause "[[deriving]] Show" instructs the compiler to make enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type [[class]]'' Show. We will explain ''type [[class]]es'' later, for<br />
now let's just say that this will allow us to "print" instances of<br />
"Dir".<br />
<br />
Exercises: <br />
* examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
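<br />
If partial application and the backtick syntax are new to you, this small sketch may help before you tackle the exercises. All the names in it ("add", "addFive", "thirty", "dropFirst") are made up for illustration:<br />

```haskell
-- A function of two arguments.
add :: Int -> Int -> Int
add x y = x + y

-- Supplying only the first argument yields a new function of one argument.
addFive :: Int -> Int
addFive = add 5

-- Backticks turn a two-argument function into an infix operator:
-- "10 `add` 20" means the same as "add 10 20".
thirty :: Int
thirty = 10 `add` 20

-- Partial application works for library functions as well.
dropFirst :: [a] -> [a]
dropFirst = drop 1
```

Ask ghci for ":t add", ":t add 5" and ":t add 5 1" and watch the type shrink by one argument at each step.<br />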
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for the output of "du -sb". How can we actually parse something? The [[Parsec]] library supplies us with the function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
         -> SourceName<br />
         -> [tok]<br />
         -> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
At first the [[type]] might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise [[type]].<br />
<br />
Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
Datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
data Either a b = Left a | Right b<br />
<br />
To better understand what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another [[:Category:Idioms|power tool]] in the Haskell toolset.<br />
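<br />
"Looking at the constructor" is done with pattern matching. Here is a minimal sketch; the function "describe" is invented for this illustration:<br />

```haskell
-- Pattern matching inspects the constructor (the "tag") to find out
-- which of the two alternatives a value holds.
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ show c
```

Try "describe (Left 42)" and "describe (Right 'x')" in ghci. We will use exactly the same technique a bit later to tell a parse error from a successful parse.<br />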
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the [[GHC]] or [[Hugs]] runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++<br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
<br />
Exercise:<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes single directory<br />
parseInput =<br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about single directory is a size (number), some spaces,<br />
-- then directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize =<br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                          Left err -> error $ "Input:\n" ++ show input ++<br />
                                              "\nError:\n" ++ show err<br />
                          Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" ([http://www.google.com/search?q=knapsack+problem google it up] if you don't already know what it<br />
is; there are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
----<br />
Exercise: examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
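<br />
If record syntax is new to you, here is a small sketch of what the new declaration gives us. The values "photos", "photos'" and "renamed" and the function "totalSize" are made up for this illustration; only the "Dir" declaration comes from our program:<br />

```haskell
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show

-- The constructor can still be used positionally, as before ...
photos :: Dir
photos = Dir 65572 "dir1"

-- ... or with named fields, in any order.
photos' :: Dir
photos' = Dir {dir_name = "dir1", dir_size = 65572}

-- Record update syntax builds a new value with some fields changed.
renamed :: Dir
renamed = photos {dir_name = "dir1-backup"}

-- Field names double as accessor functions.
totalSize :: [Dir] -> Int
totalSize ds = sum (map dir_size ds)
```

Note that "renamed" is a new value: "photos" itself is not modified, because nothing in Haskell ever is.<br />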
<br />
From now on, we could use "dir_size d" to get a size of directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype, and code this<br />
simple packing algorithm:<br />
<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
    -- Compare in reverse order, so that the biggest directories come first<br />
    cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from a module [[Data.List]], not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
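<br />
As a warm-up for that exploration, here is a small sketch of the "...By" functions and "foldl" at work on something simpler than directories. The names "byLength", "shortestFirst", "longest" and "totalLength" are invented for this illustration:<br />

```haskell
import Data.List (sortBy, maximumBy)

-- A custom comparison: compare strings by length only.
byLength :: String -> String -> Ordering
byLength a b = compare (length a) (length b)

-- "sortBy" and "maximumBy" take the comparison as their first argument.
shortestFirst :: [String] -> [String]
shortestFirst = sortBy byLength

-- Note: like "maximum", this fails on an empty list.
longest :: [String] -> String
longest = maximumBy byLength

-- "foldl" threads an accumulator through the list, left to right.
totalLength :: [String] -> Int
totalLength = foldl (\acc w -> acc + length w) 0
```

Compare "totalLength" with "greedy_pack": in both cases "foldl" starts with an initial accumulator (0 here, an empty "DirPack" there) and updates it with each list element in turn.<br />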
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add these lines:<br />
<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, in parts<br />
and as a whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit are probably thinking of screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''[[QuickCheck]]'''.<br />
<br />
[[QuickCheck]] is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to re-pack the directories returned by "greedy_pack" should return a "DirPack" of exactly the same size:<br />
<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
-- Let's just skip "coarbitrary" for now, ok? <br />
-- I promise, we will get back to it later :)<br />
coarbitrary = undefined<br />
-- We generate arbitrary "Dir" by generating random size and random name<br />
-- and stuffing them inside "Dir"<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
-- Generate random size between 10 and 1400 Mb<br />
where gen_size = do s <- choose (10,1400)<br />
return (s*1024*1024)<br />
-- Generate random name 11 to 3001 chars long (n*10+1, with n from 1 to 300), consisting of symbols "fubar/" <br />
gen_name = do n <- choose (1,300)<br />
sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
let pack = greedy_pack ds <br />
in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
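<br />
A note for readers on newer libraries: the listing above is written against the QuickCheck 1 API. In QuickCheck 2, "coarbitrary" is no longer a method of "Arbitrary" (it lives in a separate "CoArbitrary" class), so the instance only needs "arbitrary". Below is a hedged, self-contained sketch of the same test for QuickCheck 2. The definition of "greedy_pack" is reconstructed from the description in this chapter and is an assumption, and "vectorOf n" is used here to produce names of exactly n characters, which differs slightly from the original generator.<br />
<br />
```haskell
import Control.Monad (liftM2)
import Data.List (sortBy)
import Test.QuickCheck

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving Show
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] } deriving Show

media_size :: Int
media_size = 700 * 1024 * 1024

-- Assumed reconstruction of the greedy packer: sort by size, then fold.
greedy_pack :: [Dir] -> DirPack
greedy_pack ds = foldl maybe_add_dir (DirPack 0 []) (sortBy cmpSize ds)
  where cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)

maybe_add_dir :: DirPack -> Dir -> DirPack
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
      new_dirs = d : dirs p
  in if new_size > media_size then p else DirPack new_size new_dirs

-- With QuickCheck 2 only "arbitrary" has to be defined.
instance Arbitrary Dir where
  arbitrary = liftM2 Dir gen_size gen_name
    where
      -- Random size between 10 and 1400 Mb
      gen_size = do s <- choose (10, 1400)
                    return (s * 1024 * 1024)
      -- Random name of 1 to 300 chars, consisting of symbols "fubar/"
      gen_name = do n <- choose (1, 300)
                    vectorOf n (elements "fubar/")

prop_greedy_pack_is_fixpoint :: [Dir] -> Bool
prop_greedy_pack_is_fixpoint ds =
  let pack = greedy_pack ds
  in pack_size pack == pack_size (greedy_pack (dirs pack))
```
<br />
Load it in ghci and run "quickCheck prop_greedy_pack_is_fixpoint" as before.<br />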
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''[[instance]]''' of '''type[[class]]''' "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''type[[class]]'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to certain<br />
functions. <br />
<br />
Let's examine the typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any [[type]] (let's name it 'a') could be a member of the [[class]] Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instances of class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
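<br />
To make the "contract" idea concrete, here is a minimal sketch of a home-made typeclass. Everything in it ("HasSize", "sizeOf", "File", "Folder", "totalSize") is hypothetical and invented for this illustration only.<br />
<br />
```haskell
-- Hypothetical contract: anything that can report its size in bytes.
class HasSize a where
  sizeOf :: a -> Int

data File = File String Int
newtype Folder = Folder [File]

-- Each instance shows how one particular type fulfils the contract.
instance HasSize File where
  sizeOf (File _ bytes) = bytes

instance HasSize Folder where
  sizeOf (Folder files) = sum (map sizeOf files)

-- "Library" code: it relies only on the contract, so it works for
-- any type that is an instance of HasSize.
totalSize :: HasSize a => [a] -> Int
totalSize = sum . map sizeOf
```
<br />
Note that "totalSize" never mentions "File" or "Folder": the constraint "HasSize a" is all it needs to know.<br />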
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", a type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via the other? <br />
<br />
Right you are! Usually, a typeclass contains several "default" implementations<br />
for its functions, when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)") and the others<br />
will be automatically derived. If you supply fewer functions than are required<br />
for the minimal implementation, the compiler/interpreter will say so and<br />
explain which functions you still have to define.<br />
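<br />
A small sketch of such a minimal definition: only "compare" is written by hand, and the remaining "Ord" functions come from the class defaults. The type "Priority" and the helper "rank" are hypothetical names used just for this example.<br />
<br />
```haskell
import Data.List (sort)

data Priority = Low | Medium | High deriving (Eq, Show)

-- Hypothetical helper that maps each constructor to a number.
rank :: Priority -> Int
rank Low    = 0
rank Medium = 1
rank High   = 2

-- Only "compare" is defined; (<), (<=), (>), (>=), max and min
-- are filled in by the default implementations of the class.
instance Ord Priority where
  compare a b = compare (rank a) (rank b)

sorted :: [Priority]
sorted = sort [High, Low, Medium]
```
<br />
Try "Low < High" and "max Low Medium" in ghci: both work although we never defined them ourselves.<br />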
<br />
Once again, we see that a lot of [[type]]s are already instances of typeclass Ord, and thus we are able to sort them.<br />
<br />
Now, let's take a look back to the definition of "Dir":<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
See that "[[deriving]]" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
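<br />
As a quick illustration of what a "deriving" clause buys you, here is a sketch with a made-up record type "Disk" (it is not part of "cd-fit"). The derived "Ord" instance compares the fields in the order in which they are declared.<br />
<br />
```haskell
import Data.List (sort)

-- Hypothetical type; all three instances are generated by the compiler.
data Disk = Disk { capacity :: Int, label :: String }
  deriving (Eq, Ord, Show)

disks :: [Disk]
disks = [Disk 700 "cd", Disk 4700 "dvd", Disk 700 "a"]

-- Derived Ord: first by capacity, then by label.
sortedDisks :: [Disk]
sortedDisks = sort disks
```
<br />
With these instances in place, "print sortedDisks" and "Disk 700 "cd" == Disk 700 "cd"" work without any hand-written code.<br />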
<br />
----<br />
Exercises:<br />
* Examine typeclasses Eq and Show<br />
* Examine types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of action-returning<br />
function, which could be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "[[Monad]]" and we will talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a [[type variable]] which is an instance of "Arbitrary", we could substitute "Dir" here. So, how can we make and return an action of type "Gen Dir"?<br />
<br />
Let's look at the code:<br />
<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
-- Generate random size between 10 and 1400 Mb<br />
where gen_size = do s <- choose (10,1400)<br />
return (s*1024*1024)<br />
-- Generate random name 11 to 3001 chars long (n*10+1, with n from 1 to 300), consisting of symbols "fubar/" <br />
gen_name = do n <- choose (1,300)<br />
sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be an "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
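<br />
If the third reading does not help, it may be easier to watch "liftM2" at work in a simpler monad. The sketch below uses "Maybe" instead of "Gen" (my choice for illustration, not something the tutorial does) and spells out the "do" version next to it; "viaLift" and "viaDo" are hypothetical names.<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir { dir_size :: Int, dir_name :: String } deriving (Eq, Show)

-- "liftM2 Dir" specialised to the Maybe monad.
viaLift :: Maybe Int -> Maybe String -> Maybe Dir
viaLift = liftM2 Dir

-- The same computation written out with "do", following the
-- definition of liftM2 quoted above.
viaDo :: Maybe Int -> Maybe String -> Maybe Dir
viaDo a1 a2 = do
  x <- a1
  y <- a2
  return (Dir x y)
```
<br />
Both functions give "Just (Dir 1 "a")" for "(Just 1) (Just "a")" and "Nothing" as soon as either argument is "Nothing".<br />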
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write another not-so-trivial packing<br />
method, compare the efficiency of the packing methods, and learn something new<br />
about debugging and profiling of Haskell programs along the way.<br />
<br />
It might not be immediately obvious whether our packing algorithm is<br />
effective, and if so, in which particular way: are its runtime,<br />
memory consumption and results of sufficient quality? Are there any<br />
alternative algorithms, and how do they compare to each other?<br />
<br />
Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.<br />
<br />
This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:<br />
<br />
----------------------------------------------------------------------------------<br />
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem<br />
--<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
precomputeDisksFor :: [Dir] -> [DirPack]<br />
precomputeDisksFor dirs = <br />
-- By calculating `bestDisk' for all possible disk sizes, we could<br />
-- obtain a solution for particular case by simple lookup in our list of<br />
-- solutions :)<br />
let precomp = map bestDisk [0..] <br />
<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty<br />
bestDisk 0 = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit = <br />
-- 1. Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
-- are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
in precomp<br />
<br />
-- When we precomputed disk of all possible sizes for the given set of dirs, solution to <br />
-- particular problem is simple: just take the solution for the required 'media_size' and<br />
-- that's it!<br />
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size<br />
<br />
Notice that it took almost the same amount of text to describe the algorithm and to write an implementation for it. Nice, eh?<br />
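<br />
The core trick of the listing (a lazily built, infinite list of solutions that refers to itself) may be easier to see on a stripped-down problem. The following sketch is not the tutorial's algorithm: it works on plain sizes, allows each size to be used any number of times, and the name "bestTotals" is hypothetical.<br />
<br />
```haskell
-- For every limit 0, 1, 2, ... the best total that does not exceed it,
-- when each of the given sizes may be used any number of times.
bestTotals :: [Int] -> [Int]
bestTotals sizes = table
  where
    -- Infinite table of answers; each element is computed on demand.
    table = map best [0 ..]
    best 0 = 0
    best limit =
      -- Try every size that fits and reuse the answer for the rest.
      maximum (0 : [ s + table !! (limit - s) | s <- sizes, s > 0, s <= limit ])
```
<br />
For example, "bestTotals [3,5] !! 7" considers "3 + table !! 4" and "5 + table !! 2". Keep in mind that "!!" walks the list from its head, so every lookup costs time proportional to the index.<br />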
<br />
----<br />
<br />
Exercises:<br />
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Did you have to define new instances of some classes? How did you do that?<br />
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you (with the help of a decent tutorial) write a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)<br />
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.<br />
* Notice that we use '''pattern matching''' both to define <tt>bestDisk</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line<br />
* Notice how we use function composition to compose complex condition to filter the list of dirs<br />
<br />
---- <br />
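<br />
As a warm-up for the de-sugaring exercise, here is a sketch of a different, much simpler comprehension together with one possible hand-written equivalent. Both function names are hypothetical, and "concatMap" is just one of several ways to express the translation.<br />
<br />
```haskell
-- A comprehension with a generator, a guard and a local binding ...
squaresOfEvens :: [Int] -> [Int]
squaresOfEvens xs = [ sq | x <- xs, even x, let sq = x * x ]

-- ... and an equivalent without the sugar: every element of the
-- generator yields either a one-element list or an empty list.
squaresOfEvens' :: [Int] -> [Int]
squaresOfEvens' xs =
  concatMap (\x -> if even x then (let sq = x * x in [sq]) else []) xs
```
<br />
A second generator in a comprehension simply becomes a nested "concatMap".<br />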
<br />
Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack ds <br />
in pack_size pack == pack_size (dynamic_pack (dirs pack))<br />
<br />
Now, let's try to run (DON'T PANIC and save all your work in other applications first!):<br />
<br />
*Main> quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.<br />
<br />
What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.<br />
<br />
Let's see. Since we have called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.<br />
<br />
Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?<br />
<br />
Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:<br />
<br />
Prelude> :l cd-fit.hs<br />
Compiling Main ( cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :set +s<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.06 secs, 1277972 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.00 secs, 0 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.01 secs, 1519064 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000<br />
DirPack {pack_size = 0, dirs = []}<br />
(0.03 secs, 1081808 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000<br />
DirPack {pack_size = 0, dirs = []}<br />
(1.39 secs, 12714088 bytes)<br />
*Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000<br />
Interrupted.<br />
<br />
Aha! This seems to be the problem, since computation of the 100000th element fails to terminate in "reasonable" time, and to think that we have tried to compute the <tt>700*1024*1024</tt>th element...<br />
<br />
Let's modify our code a bit, to allow disk size to be tweaked:<br />
<br />
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!limit<br />
<br />
prop_dynamic_pack_is_fixpoint ds =<br />
let pack = dynamic_pack media_size ds <br />
in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))<br />
<br />
prop_dynamic_pack_small_disk ds =<br />
let pack = dynamic_pack 50000 ds<br />
in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))<br />
<br />
-- rename "old" main to "moin"<br />
main = quickCheck prop_dynamic_pack_small_disk<br />
<br />
Compile a profiling version of your code with <tt>ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs</tt> and run it like this: <br />
<br />
$ ./cd-fit +RTS -p<br />
OK, passed 100 tests.<br />
<br />
First thing, note that our code satisfies at least one simple property. Good. Now let's examine the profile. Look into the file "cd-fit.prof", which was produced in your current directory. <br />
<br />
Most probably, you'll see something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 2.18 secs (109 ticks @ 20 ms)<br />
total alloc = 721,433,008 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
precomputeDisksFor Main 88.1 99.8<br />
dynamic_pack Main 11.0 0.0<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
<br />
MAIN MAIN 1 0 0.0 0.0 100.0 100.0<br />
CAF Main 174 11 0.9 0.2 100.0 100.0<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 99.1 99.8<br />
dynamic_pack Main 182 200 11.0 0.0 99.1 99.8<br />
precomputeDisksFor Main 183 200 88.1 99.8 88.1 99.8<br />
main Main 180 1 0.0 0.0 0.0 0.0<br />
<br />
Examine the column "individual %alloc". As we thought, all memory was<br />
allocated within <tt>precomputeDisksFor</tt>. However, the amount of<br />
memory allocated (more than 700 MB, according to the line "total<br />
alloc") seems to be a little too much for our simple task. We will dig<br />
deeper and find where we are wasting it.<br />
<br />
Let's examine memory consumption a little closer via so-called "heap<br />
profiles". Run <tt>./cd-fit +RTS -hb</tt>. This produces a "biographical<br />
heap profile", which tells us how various parts of the memory were<br />
used during the program run time. The heap profile is saved to<br />
"cd-fit.hp". It is next to impossible to read and comprehend it as is,<br />
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which<br />
is worth a thousand words. View it with "gv" or "ghostview" or "full<br />
Adobe Acrobat (not Reader)". (This and subsequent pictures are<br />
'''not''' attached here).<br />
<br />
Notice that most of the graph is taken up by the region marked as "VOID". <br />
This means that the memory allocated was never used. Notice that there are<br />
'''no''' areas marked as "USE", "LAG" or "DRAG". Seems like our<br />
program hardly uses '''any''' of the allocated memory at all. Wait a<br />
minute! How could that be? Surely it must use something when it packs<br />
those randomly generated directories, which are 10 to 1400 Mb in size,<br />
onto imaginary disks of 50000 bytes... Oops. Severe size<br />
mismatch. We should have spotted it earlier, when we were timing<br />
<tt>precomputeDisksFor</tt>. Scroll back and observe how each run<br />
returned the very same result - an empty directory set.<br />
<br />
Our random directories are too big, but nevertheless the code spends time<br />
and memory trying to "pack" them. Obviously,<br />
<tt>precomputeDisksFor</tt> (which, according to the profile, accounts for about 88% of the run time<br />
and nearly all of the memory allocation) is flawed in some way.<br />
<br />
Let's take a closer look at what takes up so much memory. Run<br />
<tt>./cd-fit +RTS -h -hbvoid</tt> and produce a PostScript picture for<br />
this memory profile. This will give us a detailed breakdown of all<br />
memory whose "biography" shows that it's been "VOID" (unused). My<br />
picture (and I presume that yours as well) shows that the VOID memory<br />
consists of "thunks" labeled "precomputeDisksFor/pre...". We could<br />
safely assume that the second word would be "precomp" (You wonder why?<br />
Look again at the code and try to find a function named "pre.*" which is<br />
called from inside <tt>precomputeDisksFor</tt>)<br />
<br />
This means that memory has been taken by the list generated inside<br />
"precomp". Rumor has it that memory leaks in Haskell are caused by<br />
either too little laziness or too much laziness. It seems like we have<br />
too little laziness here: we evaluate more elements of the list than<br />
we actually need and keep them from being garbage-collected. <br />
<br />
Note how we look up an element from "precomp" in this piece of code:<br />
<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)=precomp!!(limit - dir_size d)<br />
, d `notElem` ds<br />
<br />
<br />
Obviously, the whole list generated by "precomp" must be kept in<br />
memory for such lookups, since we can't be sure that any given element<br />
will not be needed again and could therefore be garbage collected.<br />
<br />
Let's rewrite the code to eliminate the list:<br />
<br />
-- Let the `bestDisk x' be the "most tightly packed" disk of total <br />
-- size no more than `x'.<br />
-- How to calculate `bestDisk'? Let's opt for a recursive definition:<br />
-- Recursion base: best packed disk of size 0 is empty and best-packed<br />
-- disk for empty list of directories on it is also empty.<br />
bestDisk 0 _ = DirPack 0 []<br />
bestDisk _ [] = DirPack 0 []<br />
-- Recursion step: for size `limit`, bigger than 0, best packed disk is<br />
-- computed as follows:<br />
bestDisk limit dirs =<br />
-- Take all non-empty dirs that could possibly fit to that disk by itself.<br />
-- Consider them one by one. Let the size of particular dir be `dir_size d'.<br />
-- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus<br />
-- producing the disk of size <= limit. Let's do that for all "candidate" dirs that<br />
-- are not yet on our disk:<br />
case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) dirs <br />
, d `notElem` ds<br />
] of<br />
-- We either fail to add any dirs (probably, because all of them too big).<br />
-- Well, just report that disk must be left empty:<br />
[] -> DirPack 0 []<br />
-- Or we produce some alternative packings. Let's choose the best of them all:<br />
packs -> maximumBy cmpSize packs<br />
<br />
cmpSize a b = compare (pack_size a) (pack_size b)<br />
<br />
dynamic_pack limit dirs = bestDisk limit dirs<br />
<br />
<br />
Compile the profiling version of this code and obtain the overall<br />
execution profile (with "+RTS -p"). You'll get something like this:<br />
<br />
cd-fit +RTS -p -RTS<br />
<br />
total time = 0.00 secs (0 ticks @ 20 ms)<br />
total alloc = 1,129,520 bytes (excludes profiling overheads)<br />
<br />
COST CENTRE MODULE %time %alloc<br />
<br />
CAF GHC.Float 0.0 4.4<br />
main Main 0.0 93.9<br />
<br />
individual inherited<br />
COST CENTRE MODULE no. entries %time %alloc %time %alloc<br />
MAIN MAIN 1 0 0.0 0.0 0.0 100.0<br />
main Main 180 1 0.0 93.9 0.0 94.2<br />
prop_dynamic_pack_small_disk Main 181 100 0.0 0.0 0.0 0.3<br />
dynamic_pack Main 182 200 0.0 0.2 0.0 0.3<br />
bestDisk Main 183 200 0.0 0.1 0.0 0.1<br />
<br />
We achieved a major improvement: memory allocation is reduced by a factor<br />
of more than 600! Now we could test the code on the "real task" - change the<br />
code to run the test for packing the full-sized disk:<br />
<br />
main = quickCheck prop_dynamic_pack_is_fixpoint<br />
<br />
Compile with profiling and run (with "+RTS -p"). If you are unlucky<br />
and a considerably big test set is randomly generated for your<br />
runs, you'll have to wait. And wait even more. And more.<br />
<br />
Go make some tea. Drink it. Read some Tolstoy (Do you have "War and<br />
Peace" handy?). Chances are that by the time you are done with<br />
Tolstoy, the program will still be running (just take my word on it, don't<br />
check).<br />
<br />
If you are lucky, your program will finish fast enough and leave you<br />
with a profile. According to the profile, the program spends 99% of its time<br />
inside <tt>bestDisk</tt>. Could we speed up <tt>bestDisk</tt> somehow?<br />
<br />
Note that <tt>bestDisk</tt> performs several simple calculations for<br />
which it must call itself. However, it is done rather inefficiently -<br />
each time we pass to <tt>bestDisk</tt> the exact same set of<br />
directories as it was called with, even if we have already "packed"<br />
some of them. Let's amend this:<br />
<br />
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (1,limit)).dir_size ) dirs<br />
, d <- small_enough<br />
, dir_size d > 0<br />
, let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)<br />
] of<br />
<br />
Recompile and run again. Runtimes could be lengthy, but bearable, and<br />
the number of times <tt>bestDisk</tt> is called (according to the profile)<br />
should decrease significantly. <br />
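<br />
The effect of shrinking the candidate list on every recursive call can be seen in a stripped-down sketch that works on plain sizes instead of "Dir"s. The name "bestFill" is hypothetical, and the function returns only the best total rather than a full "DirPack".<br />
<br />
```haskell
import Data.List (delete)

-- Best total not exceeding "limit", using each size at most once.
bestFill :: Int -> [Int] -> Int
bestFill limit sizes =
  -- For every candidate that fits, recurse with a smaller limit and
  -- with that candidate removed from the list of remaining sizes.
  maximum (0 : [ s + bestFill (limit - s) (delete s fitting) | s <- fitting ])
  where
    -- Sizes that are non-empty and could fit on the disk by themselves.
    fitting = filter (\s -> s > 0 && s <= limit) sizes
```
<br />
Because "delete" removes only the first matching element, two directories of equal size are still treated as two separate candidates. This is still an exhaustive search, so it is only practical for short lists.<br />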
<br />
Finally, let's compare both packing algorithms. Intuitively, we feel<br />
that the greedy algorithm should produce worse results, don't we? Let's put<br />
this feeling to the test:<br />
<br />
prop_greedy_pack_is_no_better_than_dynamic_pack ds =<br />
pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)<br />
<br />
Verify that it is indeed so by running <tt>quickCheck</tt> for this<br />
test several times. I feel that this concludes our knapsacking<br />
exercises. <br />
<br />
Adventurous readers could continue further by implementing so-called<br />
"scaling" for <tt>dynamic_pack</tt> where we divide all directory<br />
sizes and medium size by the size of the smallest directory to proceed<br />
with smaller numbers (which promises faster runtimes). <br />
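<br />
One possible starting point for the scaling exercise is sketched below, again on plain sizes. The rounding policy is my own assumption and not something the text prescribes: directory sizes are rounded up and the medium size is rounded down, so that any packing found for the scaled numbers still fits on the real medium (at the price of possibly missing a tighter packing). The name "scaleProblem" is hypothetical.<br />
<br />
```haskell
-- Scale the medium size and all directory sizes by the smallest
-- positive directory size.
scaleProblem :: Int -> [Int] -> (Int, [Int])
scaleProblem limit sizes =
  case filter (> 0) sizes of
    []  -> (limit, sizes)              -- nothing to scale by
    pos ->
      let unit = minimum pos
          -- Round directory sizes up to whole units ...
          up s = (s + unit - 1) `div` unit
      -- ... and round the medium size down.
      in (limit `div` unit, map up sizes)
```
<br />
For example, "scaleProblem 700 [100, 250, 300]" gives "(7, [1, 3, 3])".<br />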
<br />
== Chapter 5: Where do you want to go tomorrow? ==<br />
<br />
As the name implies, the author is open for proposals - where should<br />
we go next? I had networking + xml/xmpp in mind, but it might be too<br />
heavy and too narrow for most of the readers.<br />
<br />
What do you think? Drop me a line.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about [[monad]]s" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
<br />
really is just syntactic sugar for:<br />
<br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
<br />
and explains about ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
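<br />
Until those chapters get written, here is a small runnable sketch of the same equivalence. It uses the "Either String" monad instead of "IO" so that the two results can be compared directly; the concrete values of "someAction" and "someOtherAction" are made up for the example.<br />
<br />
```haskell
-- Hypothetical stand-ins for the two actions.
someAction :: Either String Int
someAction = Right 20

someOtherAction :: Either String Int
someOtherAction = Right 22

-- Written with "do" ...
withDo :: Either String Int
withDo = do
  a <- someAction
  b <- someOtherAction
  return (a + b)

-- ... and with explicit ">>=", as the translation above suggests.
withBind :: Either String Int
withBind =
  someAction >>= \a ->
  someOtherAction >>= \b ->
  return (a + b)
```
<br />
Both "withDo" and "withBind" evaluate to "Right 42".<br />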
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
Plenty of material on this on the web and this wiki. Just go get<br />
yourself an installation of [[GHC]] (6.4 or above) or [[Hugs]] (v200311 or<br />
above) and "[[darcs]]", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman.<br />
If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=Hitchhikers_guide_to_Haskell&diff=2206Hitchhikers guide to Haskell2006-02-06T20:33:32Z<p>Adept: Migrated from hawiki</p>
<hr />
<div>= Hitchhikers Guide To The Haskell =<br />
<br />
== Preface: DON'T PANIC! ==<br />
<br />
Recent experiences from a few of my fellow C++/Java programmers<br />
indicate that they read various Haskell tutorials with "exponential<br />
speedup" (think about how TCP/IP session starts up). They start slow<br />
and cautious, but when they see that the first 3-5 pages do not<br />
contain "anything interesting" in terms of code and examples, they<br />
begin skipping paragraphs, then chapters, then whole pages, only to<br />
slow down - often to a complete halt - somewhere on page 50, finding<br />
themselves in the thick of concepts like "type classes", "type<br />
constructors", "monadic IO", at which point they usually panic, think<br />
of a perfectly rational excuse not to read further anymore, and<br />
happily forget this sad and scary encounter with Haskell (as human<br />
beings usually tend to forget sad and scary things).<br />
<br />
This text intends to introduce the reader to the practical aspects of Haskell<br />
from the very beginning (plans for the first chapters include: I/O, darcs,<br />
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader<br />
is expected to know (where to find) at least the basics of Haskell: how to run<br />
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do<br />
not plan to take radical leaps, and will go one step at a time in order not to<br />
lose the reader along the way. So DON'T PANIC, take your towel with you and<br />
read along.<br />
<br />
Oh, almost forgot: author is very interested in ANY feedback. Drop him a line<br />
or a word (see [[Adept]] for contact info)<br />
<br />
== Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell ==<br />
<br />
Each chapter will be dedicated to one small real-life task which we will<br />
complete from the ground up.<br />
<br />
So here is the task for this chapter: in order to free up space on<br />
your hard drive for all the Haskell code you are going to write in the<br />
near future, you are going to archive some of the old and dusty<br />
information on CDs and DVDs. While CD (or DVD) burning itself is easy<br />
these days, it usually takes some (or quite a lot of) time to decide<br />
how to put several GB of digital photos on CD-Rs, when directories<br />
with images range from 10 to 300 MB in size, and you don't want to<br />
burn half-full (or half-empty) CD-Rs.<br />
<br />
So, the task is to write a program which will help us to put a given<br />
collection of directories on the minimum possible amount of media,<br />
while packing the media as tight as possible. Let's name this program<br />
"cd-fit".<br />
<br />
Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,<br />
and then move on to more interesting things:<br />
<br />
-- put this in hello.hs<br />
module Main where<br />
main = putStrLn "Hello world!"<br />
<br />
Run it:<br />
<br />
$ runhaskell ./hello.hs<br />
Hello world!<br />
<br />
OK, we've done it. Move along now, nothing interesting here :)<br />
<br />
Any serious development must be done with the help of a version control<br />
system, and we will not make an exception. We will use the modern<br />
distributed version control system "darcs". "Modern" means that it is<br />
written in Haskell, "distributed" means that each working copy is<br />
a repository in itself.<br />
<br />
First, let's create an empty directory for all our code, and invoke<br />
"darcs init" there, which will create the subdirectory "_darcs" to store<br />
all version-control-related stuff.<br />
<br />
Fire up your favorite editor and create a new file "cd-fit.hs" in our<br />
working directory. Now let's think for a moment about how our program<br />
will operate and express it in pseudocode:<br />
<br />
main = read list of directories and their sizes<br />
       decide how to fit them on CD-Rs<br />
       print solution<br />
<br />
Sounds reasonable? I thought so.<br />
<br />
Let's simplify our life a little and assume for now that we will<br />
compute directory sizes somewhere outside our program (for example,<br />
with "du -sb *") and read this information from stdin.<br />
Now let me convert all this to Haskell:<br />
<br />
module Main where<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          -- compute solution and print it<br />
<br />
Not really working, but pretty close to plain English, eh? Let's stop<br />
for a moment and look closer at what's written here, line by line.<br />
<br />
Let's begin from the top:<br />
<br />
input <- getContents<br />
<br />
This is an example of the Haskell syntax for doing IO (namely, input). This<br />
line is an instruction to read all the information available from stdin,<br />
return it as a single string, and bind it to the symbol "input", so we can<br />
process this string any way we want.<br />
<br />
How did I know that? Did I memorize all the functions by heart? Of course not!<br />
Each function has a type, which, along with the function's name, usually tells a<br />
lot about what the function will do.<br />
<br />
Let's fire up the interactive Haskell environment and examine this function<br />
up close:<br />
<br />
$ ghci<br />
   ___         ___ _<br />
  / _ \ /\  /\/ __(_)<br />
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.<br />
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/<br />
\____/\/ /_/\____/|_|      Type :? for help.<br />
<br />
Loading package base-1.0 ... linking ... done.<br />
Prelude> :type getContents<br />
getContents :: IO String<br />
Prelude> <br />
<br />
We see that "getContents" is a function without arguments that will return<br />
"IO String". The prefix "IO" means that this is an IO action. It will return a<br />
String when evaluated. The action will be evaluated as soon as we use "<-" to<br />
bind its result to some symbol.<br />
<br />
Note that "<-" is not a fancy way to assign to a variable. It is a way to<br />
evaluate (execute) IO actions, in other words, to actually do some I/O and<br />
return its result (if any). <br />
<br />
We can choose not to evaluate the action obtained from "getContents", but to carry<br />
it around a bit and evaluate it later:<br />
<br />
let x = getContents<br />
-- 300 lines of code here<br />
input <- x<br />
<br />
So, as you see, IO actions can act like ordinary values. Suppose that we<br />
have built a list of IO actions and found a way to execute them one by one.<br />
This would be a way to simulate imperative programming with its notion of<br />
"order of execution".<br />
<br />
Haskell allows you to do better than that. <br />
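<br />
Before we get there, here is a minimal sketch of the list-of-actions idea, just to<br />
show that it really works. The names "greetings", "runAll" and "backwards" are made up<br />
for this illustration; the Prelude function "sequence_" already does the job of "runAll":<br />
<br />

```haskell
-- A list of IO actions is an ordinary value; nothing runs when it is built.
greetings :: [IO ()]
greetings = [putStrLn "one", putStrLn "two", putStrLn "three"]

-- Run the actions of a list one by one, in the given order.
-- (Prelude's sequence_ does the same job.)
runAll :: [IO ()] -> IO ()
runAll []            = return ()
runAll (act : rest) = do
  act
  runAll rest

-- Since the list is plain data, we can rearrange it before running it.
backwards :: IO ()
backwards = runAll (reverse greetings)
```

<br />
Try "runAll greetings" and "backwards" in ghci: the same three actions run, only in a different order.<br />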
<br />
The standard language library (named "Prelude", by the way) provides us with lots<br />
of functions that return useful primitive IO actions. In order to combine them<br />
to produce more complex actions, we use "do":<br />
<br />
c = do a <- someAction<br />
       b <- someOtherAction<br />
       print (bar b)<br />
       print (foo a)<br />
       putStrLn "done"<br />
<br />
Here we '''bind''' "c" to an action with the following "scenario":<br />
* '''evaluate''' action "someAction" and '''bind''' its result to "a"<br />
* then, '''evaluate''' "someOtherAction" and '''bind''' its result to "b"<br />
* then, process "b" with function "bar" and print result<br />
* then, process "a" with function "foo" and print result<br />
* then, print the word "done"<br />
<br />
When will all this actually be executed? Answer: as soon as we evaluate "c"<br />
using "<-" (if it returns a result, as "getContents" does) or just<br />
by using it as a function name (if it does not return a result, as "print"<br />
does):<br />
<br />
process = do putStrLn "Will do some processing"<br />
             c<br />
             putStrLn "Done"<br />
<br />
Notice that we took a bunch of functions ("someAction", "someOtherAction",<br />
"print", "putStrLn") and, using "do", created from them a new function, which we<br />
bound to the symbol "c". Now we could use "c" as a building block to produce an even<br />
more complex function, "process", and we could carry this on and on.<br />
Eventually, some of the functions will be mentioned in the code of the function<br />
"main", to which the ultimate topmost IO action of any Haskell program is bound.<br />
<br />
When will "main" be executed/evaluated/forced? As soon as we run the<br />
program. Read this twice and try to comprehend: <br />
<br />
''Execution of a Haskell program is an evaluation of the symbol "main", to<br />
which we have bound an IO action. Via evaluation we obtain the result of that<br />
action''. <br />
<br />
Readers familiar with advanced C++ or Java programming and the arcane body of<br />
knowledge named "OOP Design Patterns" might note that "build actions from<br />
actions" and "evaluate actions to get result" is essentially the "Command<br />
pattern" and the "Composite pattern" combined. Good news: in Haskell you get them<br />
for all your IO, and get them '''for free''' :)<br />
<br />
----<br />
'''Exercise:'''<br />
Consider the following code:<br />
<br />
module Main where<br />
c = putStrLn "C!"<br />
<br />
combine before after =<br />
  do before<br />
     putStrLn "In the middle"<br />
     after<br />
<br />
main = do combine c c<br />
          let b = combine (putStrLn "Hello!") (putStrLn "Bye!")<br />
          let d = combine (b) (combine c c)<br />
          putStrLn "So long!"<br />
<br />
See how we construct code out of thin air? Try to imagine what this code will<br />
do, then run it and check yourself. <br />
<br />
Do you understand why "Hello!" and "Bye!" are not printed?<br />
----<br />
<br />
Let's examine our "main" function closer:<br />
<br />
Prelude> :load cd-fit.hs<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> :type main<br />
main :: IO ()<br />
*Main> <br />
<br />
We see that "main" is indeed an IO action which will return nothing<br />
when evaluated. When combining actions with "do", the type of the<br />
result will be the type of the last action, and "putStrLn something" has type<br />
"IO ()": <br />
<br />
*Main> :type putStrLn "Hello world!"<br />
putStrLn "Hello world!" :: IO ()<br />
*Main> <br />
<br />
Oh, by the way: have you noticed that we actually compiled our first<br />
Haskell program in order to examine "main"? :)<br />
<br />
Let's celebrate that by putting it under version control: execute<br />
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions<br />
and provide the commit comment "Skeleton of cd-fit.hs".<br />
<br />
Let's try to run it:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
----<br />
'''Exercises''':<br />
<br />
* Try to write a program that takes your name from stdin and greets you (keywords: getLine, putStrLn);<br />
<br />
* Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).<br />
<br />
== Chapter 2: Parsing the input ==<br />
<br />
OK, now that we have a proper understanding of the powers of Haskell IO<br />
(and are awed by them, I hope), let's forget about IO and actually do<br />
some useful work. <br />
<br />
As you remember, we set forth to pack some CD-Rs as tightly as<br />
possible with data scattered in several input directories. We assume<br />
that "du -sb" will compute the sizes of input directories and output<br />
something like:<br />
<br />
65572 /home/adept/photos/raw-to-burn/dir1<br />
68268 /home/adept/photos/raw-to-burn/dir2<br />
53372 /home/adept/photos/raw-to-burn/dir3<br />
713124 /home/adept/photos/raw-to-burn/dir4<br />
437952 /home/adept/photos/raw-to-burn/dir5<br />
<br />
Our next task is to parse that input into some suitable internal<br />
representation.<br />
<br />
For that we will use a powerful library of '''parsing combinators''' named<br />
"Parsec", which ships with most Haskell implementations.<br />
<br />
Much like the IO facilities we have seen in the first chapter, this<br />
library provides a set of basic parsers and the means to combine them into more<br />
complex parsing constructs.<br />
<br />
Unlike other tools in this area (lex/yacc or JavaCC, to name a few),<br />
Parsec parsers do not require a separate preprocessing stage. Since in<br />
Haskell we can return a function as the result of a function and thus<br />
construct functions "out of thin air", there is no need for a separate<br />
syntax for parser description. But enough advertisements, let's actually<br />
do some parsing:<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- parseInput parses output of "du -sb", which consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Datatype Dir holds information about a single directory - its size and name<br />
data Dir = Dir Int String deriving Show<br />
<br />
-- `dirAndSize` parses information about a single directory, which is:<br />
-- a size in bytes (number), some spaces, then directory name, which extends till newline<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return (Dir (read size) dir_name)<br />
<br />
Just add those lines to "cd-fit.hs". Here we see quite a lot of new<br />
things, and several that we know already. <br />
<br />
First of all, note the familiar "do" construct, which, as we know, is<br />
used to combine IO actions to produce new IO actions. Here we use it<br />
to combine "parsing" actions into new "parsing" actions. Does this<br />
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must<br />
admit that I lied to you - "do" is used not only to combine IO<br />
actions. It is used to combine any kind of so-called ''monadic<br />
actions'' or ''monadic values'' together.<br />
<br />
Think of a monad as a "design pattern" in the functional world.<br />
A monad is a way to hide from the user (programmer) all the machinery<br />
required for complex functionality to operate.<br />
<br />
As you might have heard, Haskell has no notion of "assignment",<br />
"mutable state", "variables", and is a "pure functional language",<br />
which means that every function called with the same input parameters<br />
will return exactly the same result. Meanwhile "doing IO" requires<br />
hauling around file handles and their states and dealing with IO<br />
errors. "Parsing" requires tracking the position in the input and dealing<br />
with parsing errors.<br />
<br />
In both cases the Wise Men Who Wrote Libraries cared for our needs and<br />
hid all underlying complexities from us, exposing the API of their<br />
libraries (IO and parsing) in the form of "monadic actions" which we<br />
are free to combine as we see fit. <br />
<br />
Think of programming with monads as remodelling with the<br />
help of a professional remodelling crew. You describe a sequence of<br />
actions on a piece of paper (that's us writing in "do" notation),<br />
and then, when required, that sequence will be evaluated by the<br />
remodelling crew ("in the monad") which will provide you with the end<br />
result, hiding all the underlying complexity (how to prepare the<br />
paint, which nails to choose, etc) from you.<br />
<br />
Let's use the interactive Haskell environment to decipher all the<br />
instructions we've written for the parsing library. As usual, we'll<br />
go top-down:<br />
<br />
*Main> :reload<br />
Ok, modules loaded: Main.<br />
*Main> :t parseInput<br />
parseInput :: GenParser Char st [Dir]<br />
*Main> :t dirAndSize<br />
dirAndSize :: GenParser Char st Dir<br />
*Main> <br />
<br />
Assuming (well, take my word for it) that "GenParser Char st" is our<br />
parsing monad, we can see that "parseInput", when evaluated, will<br />
produce a list of "Dir", and "dirAndSize", when evaluated, will<br />
produce a "Dir". Assuming that "Dir" somehow represents information<br />
about a single directory, that is pretty much what we wanted, isn't it?<br />
<br />
Let's see what a "Dir" means. We defined ''datatype'' Dir as a record,<br />
which holds an Int and a String:<br />
<br />
data Dir = Dir Int String deriving Show<br />
<br />
In order to construct such records, we must use ''data constructor''<br />
Dir:<br />
<br />
*Main> :t Dir 1 "foo"<br />
Dir 1 "foo" :: Dir<br />
<br />
In order to reduce confusion for newbies, we could have written:<br />
<br />
data Dir = D Int String deriving Show<br />
<br />
which would define the ''datatype'' "Dir" with the ''data constructor'' "D".<br />
However, traditionally the name of the datatype and its constructor are<br />
chosen to be the same.<br />
<br />
The clause "deriving Show" instructs the compiler to generate enough code "behind<br />
the curtains" to make this ''datatype'' conform to the interface of<br />
the ''type class'' Show. We will explain ''type classes'' later, for<br />
now let's just say that this will allow us to "print" values of type<br />
"Dir".<br />
<br />
Exercises: <br />
* examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.<br />
<br />
* compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just syntactic sugar for the last of these. Note that when a function is supplied with fewer arguments than it actually needs, we get not a value but a new function; this is called ''partial application''.<br />
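<br />
If you want to see partial application and the backtick syntax without Parsec getting in the way,<br />
here is a small sketch using only the Prelude. The function "addDir" and the other names are<br />
invented for this illustration and are not part of "cd-fit.hs":<br />
<br />

```haskell
-- A function of two arguments.
addDir :: String -> String -> String
addDir base name = base ++ "/" ++ name

-- Supplying only the first argument yields a new function of one argument.
inPhotos :: String -> String
inPhotos = addDir "/home/adept/photos"

-- Backticks turn a two-argument function into an infix operator:
-- both definitions below mean exactly the same thing.
prefixForm, infixForm :: String
prefixForm = addDir "a" "b"
infixForm  = "a" `addDir` "b"
```

<br />
Ask ghci for ":t addDir" and ":t inPhotos" and compare the two types.<br />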
<br />
<br />
OK. So, we combined a lot of primitive parsing actions to get ourselves a<br />
parser for output of "du -sb". How can we actually parse something? Parsec<br />
library supplies us with function "parse":<br />
<br />
*Main> :t parse<br />
parse :: GenParser tok () a<br />
-> SourceName<br />
-> [tok]<br />
-> Either ParseError a<br />
*Main> :t parse parseInput<br />
parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]<br />
*Main> <br />
<br />
The first type might be a bit cryptic, but once we supply "parse" with the parser we<br />
made, the compiler gets more information and presents us with a more concise type.<br />
<br />
Stop and consider this for a moment. The compiler figured out the type of the function<br />
without a single type annotation supplied by us! Imagine a Java compiler that<br />
deduces types for you, so that you don't have to specify the types of arguments and<br />
return values of methods, ever.<br />
<br />
OK, back to the code. We can observe that "parse" is a function which,<br />
given a parser, the name of the source file or channel (e.g. "stdin"), and<br />
source data (a String, which is a list of "Char"s, written "[Char]"),<br />
will either produce a parse error or parse us a list of "Dir".<br />
<br />
The datatype "Either" is an example of a datatype whose constructors have names different<br />
from the name of the datatype. In fact, "Either" has two constructors:<br />
<br />
data Either a b = Left a | Right b<br />
<br />
In order to understand better what this means, consider the following<br />
example:<br />
<br />
*Main> :t Left 'a'<br />
Left 'a' :: Either Char b<br />
*Main> :t Right "aaa"<br />
Right "aaa" :: Either a [Char]<br />
*Main> <br />
<br />
You see that "Either" is a ''union'' (much like the C/C++ "union") which can<br />
hold a value of one of two distinct types. However, unlike a C/C++ "union",<br />
when presented with a value of type "Either Int Char" we can immediately see<br />
whether it's an Int or a Char - by looking at the constructor which was used to<br />
produce the value. Such datatypes are called "tagged unions", and they are<br />
another power tool in the Haskell toolset.<br />
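<br />
"Looking at the constructor" is done with pattern matching. The sketch below is only an<br />
illustration: "describe" and "safeDiv" are hypothetical helpers, not part of Parsec or of our program:<br />
<br />

```haskell
-- Pattern matching on the constructor tells us which alternative we hold.
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ [c]

-- A typical use: Left carries an error message, Right carries a result.
safeDiv :: Int -> Int -> Either String Int
safeDiv _ 0 = Left "division by zero"
safeDiv x y = Right (x `div` y)
```

<br />
This is the same convention that "parse" follows: "Left" for the error, "Right" for the parsed value.<br />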
<br />
Did you also notice that we provide "parse" with a parser, which is a monadic<br />
value, but receive not a new monadic value, but a parsing result? That is<br />
because "parse" is an evaluator for the "Parser" monad, much like the GHC or Hugs<br />
runtime is an evaluator for the IO monad. The function "parse" implements all the<br />
monadic machinery: it tracks errors and positions in the input, implements<br />
backtracking and lookahead, etc.<br />
<br />
Let's extend our "main" function to use "parse" and actually parse the input<br />
and show us the parsed data structures:<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                       Left err -> error $ "Input:\n" ++ show input ++ <br />
                                           "\nError:\n" ++ show err<br />
                       Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
<br />
Exercise:<br />
<br />
* In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).<br />
* Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.<br />
* Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.<br />
<br />
Let's try to run what we have so far:<br />
<br />
$ du -sb * | runhaskell ./cd-fit.hs<br />
<br />
DEBUG: got input 22325 Article.txt<br />
18928 Article.txt~<br />
1706 cd-fit.hs<br />
964 cd-fit.hs~<br />
61609 _darcs<br />
<br />
DEBUG: parsed:<br />
[Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]<br />
<br />
Seems to be doing exactly as planned. Now let's try some erroneous<br />
input:<br />
<br />
$ echo "foo" | runhaskell cd-fit.hs<br />
DEBUG: got input foo<br />
<br />
DEBUG: parsed:<br />
*** Exception: Input:<br />
"foo\n"<br />
Error:<br />
"stdin" (line 1, column 1):<br />
unexpected "f"<br />
expecting digit or end of input<br />
<br />
Seems to be doing fine. Let's "darcs record" it, giving the commit comment "Implemented parsing of input".<br />
<br />
----<br />
Here is the complete "cd-fit.hs" that we should have written so far:<br />
<br />
module Main where<br />
<br />
import Text.ParserCombinators.Parsec<br />
<br />
-- Output of "du -sb" -- which is our input -- consists of many lines,<br />
-- each of which describes a single directory<br />
parseInput = <br />
  do dirs <- many dirAndSize<br />
     eof<br />
     return dirs<br />
<br />
-- Information about a single directory is a size (number), some spaces,<br />
-- then directory name, which extends till newline<br />
data Dir = Dir Int String deriving Show<br />
dirAndSize = <br />
  do size <- many1 digit<br />
     spaces<br />
     dir_name <- anyChar `manyTill` newline<br />
     return $ Dir (read size) dir_name<br />
<br />
main = do input <- getContents<br />
          putStrLn ("DEBUG: got input " ++ input)<br />
          let dirs = case parse parseInput "stdin" input of<br />
                       Left err -> error $ "Input:\n" ++ show input ++ <br />
                                           "\nError:\n" ++ show err<br />
                       Right result -> result<br />
          putStrLn "DEBUG: parsed:"; print dirs<br />
          -- compute solution and print it<br />
<br />
<br />
== Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!) ==<br />
<br />
Enough preliminaries already. Let's go pack some CDs.<br />
<br />
As you might already have recognized, our problem is a classical one. It is<br />
called the "knapsack problem" (google it up if you don't already know what<br />
it is. There are more than 100000 links).<br />
<br />
Let's start with the greedy solution, but first let's slightly modify our "Dir"<br />
datatype to allow easy extraction of its components:<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
----<br />
Exercise: examine types of "Dir", "dir_size" and "dir_name"<br />
----<br />
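<br />
Here is a small sketch of what the record syntax buys us. The values "photos", "music" and<br />
"renamed" and the function "totalSize" are made up for this illustration; only the "Dir"<br />
declaration comes from our program:<br />
<br />

```haskell
data Dir = Dir {dir_size :: Int, dir_name :: String} deriving Show

-- Positional and record-style construction build the same kind of value.
photos, music :: Dir
photos = Dir 65572 "photos"
music  = Dir {dir_size = 437952, dir_name = "music"}

-- Field names double as accessor functions (dir_size :: Dir -> Int).
totalSize :: [Dir] -> Int
totalSize ds = sum (map dir_size ds)

-- Record update syntax copies a value, replacing only the named fields.
renamed :: Dir
renamed = photos {dir_name = "photos-2005"}
```

<br />
Note that "renamed" is a new value; "photos" itself is not changed by the update.<br />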
<br />
From now on, we can use "dir_size d" to get the size of a directory, and<br />
"dir_name d" to get its name, provided that "d" is of type "Dir".<br />
<br />
The greedy algorithm sorts directories from the biggest down, and tries to put<br />
them on the CD one by one, until there is no room for more. We will need to track<br />
which directories we added to the CD, so let's add another datatype and code this<br />
simple packing algorithm:<br />
<br />
import Data.List (sortBy)<br />
<br />
-- DirPack holds a set of directories which are to be stored on single CD.<br />
-- 'pack_size' could be calculated, but we will store it separately to reduce<br />
-- amount of calculation<br />
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show<br />
<br />
-- For simplicity, let's assume that we deal with standard 700 MB CDs for now<br />
media_size = 700*1024*1024<br />
<br />
-- Greedy packer tries to add directories one by one to initially empty 'DirPack'<br />
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs<br />
  where<br />
  -- arguments of "compare" are swapped so that the biggest directories come first<br />
  cmpSize d1 d2 = compare (dir_size d2) (dir_size d1)<br />
<br />
-- Helper function, which only adds directory "d" to the pack "p" when new<br />
-- total size does not exceed media_size<br />
maybe_add_dir p d =<br />
  let new_size = pack_size p + dir_size d<br />
      new_dirs = d:(dirs p)<br />
  in if new_size > media_size then p else DirPack new_size new_dirs<br />
<br />
----<br />
I'll highlight the areas which you could explore on your own (using other nice<br />
tutorials out there, of which I especially recommend "Yet Another Haskell<br />
Tutorial" by Hal Daume):<br />
* We choose to import a single function "sortBy" from the module Data.List, not the whole thing.<br />
* Instead of coding a case-by-case recursive definition of "greedy_pack", we go with the higher-order approach, choosing "foldl" as the vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!<br />
* To sort the list of "Dir" by size only, we use a custom comparison function and the parameterized sort "sortBy". This sort of setup, where the user can provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".<br />
* To code the quite complex function "maybe_add_dir", we introduced several '''local definitions''' in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same. Read about "let" and "where" clauses and the difference between them.<br />
* Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"<br />
----<br />
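<br />
If "foldl" and "sortBy" are new to you, the following sketch shows them at work on plain<br />
numbers. The functions "sumSizes", "fillUpTo" and "biggestFirst" are invented for this<br />
illustration; "fillUpTo" imitates the skipping behaviour of "maybe_add_dir":<br />
<br />

```haskell
import Data.List (sortBy)

-- foldl threads an accumulator through the list, left to right:
-- sumSizes [1,2,3] is computed as ((0 + 1) + 2) + 3
sumSizes :: [Int] -> Int
sumSizes = foldl (+) 0

-- The same fold with a conditional step: an element is added only
-- when the running total stays within the limit, otherwise it is skipped.
fillUpTo :: Int -> [Int] -> Int
fillUpTo limit = foldl step 0
  where step acc x = if acc + x > limit then acc else acc + x

-- sortBy takes the comparison as an argument; swapping the arguments
-- of compare sorts from the biggest down.
biggestFirst :: [Int] -> [Int]
biggestFirst = sortBy (\a b -> compare b a)
```

<br />
For example, "fillUpTo 10 [6,5,3,2]" takes 6, skips 5, takes 3 and skips 2.<br />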
<br />
In order to actually use our greedy packer we must call it from our "main"<br />
function, so let's add a line:<br />
<br />
main = do ...<br />
          -- compute solution and print it<br />
          putStrLn "Solution:" ; print (greedy_pack dirs)<br />
<br />
Verify integrity of our definitions by (re)loading our code in ghci. Compiles?<br />
Thought so :)<br />
<br />
Now it is time to test our creation. We could do it by actually running it in<br />
the wild like this:<br />
<br />
$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs<br />
<br />
This will prove that our code seems to be working. At least, this once. How<br />
about establishing with a reasonable degree of certainty that our code, parts<br />
and the whole, works properly, and doing so in a re-usable manner? In other<br />
words, how about writing some tests?<br />
<br />
Java programmers used to JUnit probably thought about screens of boiler-plate<br />
code and hand-coded method invocations. Never fear, we will not do anything as<br />
silly :)<br />
<br />
Enter '''QuickCheck'''.<br />
<br />
QuickCheck is a tool to do automated testing of your functions using<br />
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of<br />
praise" let's show the code for testing the following ''property'': an attempt to pack<br />
the directories returned by "greedy_pack" should return a "DirPack" of exactly the<br />
same size:<br />
<br />
import Test.QuickCheck<br />
import Control.Monad (liftM2)<br />
<br />
-- We must teach QuickCheck how to generate arbitrary "Dir"s<br />
instance Arbitrary Dir where<br />
  -- Let's just skip "coarbitrary" for now, ok? <br />
  -- I promise, we will get back to it later :)<br />
  coarbitrary = undefined<br />
  -- We generate arbitrary "Dir" by generating random size and random name<br />
  -- and stuffing them inside "Dir"<br />
  arbitrary = liftM2 Dir gen_size gen_name<br />
    -- Generate random size between 10 and 1400 Mb<br />
    where gen_size = do s <- choose (10,1400)<br />
                        return (s*1024*1024)<br />
          -- Generate random name 11 to 3001 chars long (n*10+1), consisting of symbols "fubar/" <br />
          gen_name = do n <- choose (1,300)<br />
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".<br />
-- Assume that "ds" will be a random list of "Dir"s and code your test.<br />
prop_greedy_pack_is_fixpoint ds =<br />
  let pack = greedy_pack ds <br />
  in pack_size pack == pack_size (greedy_pack (dirs pack))<br />
<br />
Let's run the test, after which I'll explain how it all works:<br />
<br />
Prelude> :r<br />
Compiling Main ( ./cd-fit.hs, interpreted )<br />
Ok, modules loaded: Main.<br />
*Main> quickCheck prop_greedy_pack_is_fixpoint<br />
[numbers spinning]<br />
OK, passed 100 tests.<br />
*Main> <br />
<br />
We've just seen our "greedy_pack" run on 100 completely (well, almost<br />
completely) random lists of "Dir"s, and it seems that the property indeed holds.<br />
<br />
Let's dissect the code. The most intriguing part is "instance Arbitrary Dir<br />
where", which declares that "Dir" is an '''instance''' of the '''typeclass'''<br />
"Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a<br />
bit. <br />
<br />
What is a '''typeclass'''? A typeclass is the Haskell way of dealing with the<br />
following situation: suppose that you are writing a library of useful<br />
functions and you don't know in advance how exactly they will be used, so you<br />
want to make them generic. Now, on one hand you don't want to restrict your<br />
users to a certain type (e.g. String). On the other hand, you want to enforce<br />
the convention that arguments for your function must satisfy a certain set of<br />
constraints. That is where a '''typeclass''' comes in handy. <br />
<br />
Think of a typeclass as a '''contract''' (or "interface", in Java terms) that<br />
your type must fulfill in order to be admitted as an argument to a certain<br />
function. <br />
<br />
Let's examine typeclass "Arbitrary":<br />
<br />
*Main> :i Arbitrary<br />
class Arbitrary a where<br />
arbitrary :: Gen a<br />
coarbitrary :: a -> Gen b -> Gen b<br />
-- Imported from Test.QuickCheck<br />
instance Arbitrary Dir<br />
-- Defined at ./cd-fit.hs:61:0<br />
instance Arbitrary Bool -- Imported from Test.QuickCheck<br />
instance Arbitrary Double -- Imported from Test.QuickCheck<br />
instance Arbitrary Float -- Imported from Test.QuickCheck<br />
instance Arbitrary Int -- Imported from Test.QuickCheck<br />
instance Arbitrary Integer -- Imported from Test.QuickCheck<br />
-- rest skipped --<br />
<br />
It could be read this way: "Any type (let's name it 'a') could be a member of the<br />
class Arbitrary as soon as we define two functions for it: "arbitrary" and<br />
"coarbitrary", with the signatures shown. For types Dir, Bool, Double, Float, Int and<br />
Integer such definitions were provided, so all those types are instances of the<br />
class Arbitrary".<br />
<br />
Now, if you write a function which operates on its arguments solely by means<br />
of "arbitrary" and "coarbitrary", you can be sure that this function will work<br />
on any type which is an instance of "Arbitrary"!<br />
<br />
Let's say it again. Someone (maybe even you) writes the code (API or library),<br />
which requires that input values implement certain ''interfaces'', which are<br />
described in terms of functions. Once you show how your type implements this<br />
''interface'' you are free to use the API or library.<br />
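<br />
To make the "contract" idea tangible, here is a minimal sketch with a typeclass of our own.<br />
The class "Sized", its function "size" and the function "totalSize" are made up for this<br />
example and do not come from any library:<br />
<br />

```haskell
-- The contract: a type is "Sized" once it says how to measure its values.
class Sized a where
  size :: a -> Int

data Dir = Dir {dir_size :: Int, dir_name :: String}

-- Two unrelated types sign the contract.
instance Sized Dir where
  size = dir_size

instance Sized Bool where
  size _ = 1

-- "Library" code written against the contract alone: it works for
-- any type that is an instance of Sized.
totalSize :: Sized a => [a] -> Int
totalSize xs = sum (map size xs)
```

<br />
The author of "totalSize" never needs to know that "Dir" exists; the "Sized a =>" part of its type is the whole agreement.<br />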
<br />
Consider the function "sort" from the standard library:<br />
<br />
*Main> :t Data.List.sort<br />
Data.List.sort :: (Ord a) => [a] -> [a]<br />
<br />
We see that it sorts lists of any values which are instances of the typeclass<br />
"Ord". Let's examine that class:<br />
<br />
*Main> :i Ord<br />
class Eq a => Ord a where<br />
compare :: a -> a -> Ordering<br />
(<) :: a -> a -> Bool<br />
(>=) :: a -> a -> Bool<br />
(>) :: a -> a -> Bool<br />
(<=) :: a -> a -> Bool<br />
max :: a -> a -> a<br />
min :: a -> a -> a<br />
-- skip<br />
instance Ord Double -- Imported from GHC.Float<br />
instance Ord Float -- Imported from GHC.Float<br />
instance Ord Bool -- Imported from GHC.Base<br />
instance Ord Char -- Imported from GHC.Base<br />
instance Ord Integer -- Imported from GHC.Num<br />
instance Ord Int -- Imported from GHC.Base<br />
-- skip<br />
*Main> <br />
<br />
We see a couple of interesting things: first, there is an additional<br />
requirement listed: in order to be an instance of "Ord", type must first be an<br />
instance of typeclass "Eq". Then, we see that there is an awful lot of<br />
functions to define in order to be an instance of "Ord". Wait a second, isn't<br />
it silly to define both (<) and (>) when one could be expressed via another? <br />
<br />
Right you are! Usually, a typeclass contains "default" implementations<br />
for its functions when it is possible to express them through each other (as<br />
it is with "Ord"). In this case it is possible to supply only a minimal<br />
definition (which in the case of "Ord" consists of either "compare" or "(<=)"), and the others<br />
will be filled in automatically from the defaults. If you supply fewer functions than required<br />
for the minimal definition, the compiler/interpreter will surely say so and<br />
explain which functions you still have to define.<br />
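<br />
As a small illustration of a minimal definition, here is a sketch of an "Ord"<br />
instance that supplies only "compare". The type "Priority" and the helper "rank"<br />
are made up for this example and are not part of the tutorial's code:<br />
<br />
```haskell
import Data.List (sort)

-- A hypothetical type, used only for this illustration.
data Priority = Low | Medium | High
  deriving (Eq, Show)

-- Only "compare" is written by hand; (<), (<=), (>), (>=), max and min
-- come from the default implementations in class Ord.
instance Ord Priority where
  compare a b = compare (rank a) (rank b)
    where
      rank :: Priority -> Int
      rank Low    = 0
      rank Medium = 1
      rank High   = 2

main :: IO ()
main = do
  print (Low < High)               -- True
  print (max Medium Low)           -- Medium
  print (sort [High, Low, Medium]) -- [Low,Medium,High]
```
<br />
The comparison operators, "max" and "sort" all work here even though only one function of the class was defined.<br />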
<br />
Once again, we see that a lot of types are already instances of the typeclass Ord,<br />
and thus we are able to sort them.<br />
<br />
Now, let's take a look back at the definition of "Dir":<br />
<br />
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show<br />
<br />
See that "deriving" clause? It instructs the compiler to automatically derive code<br />
to make "Dir" an instance of the typeclass Show. The compiler knows about a bunch of<br />
standard typeclasses (Eq, Ord, Show, Enum, Bounded, Typeable to name a few) and<br />
knows how to make a type into a "suitably good" instance of any of them. If you<br />
want to derive instances of more than one typeclass, say it this way:<br />
"deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of<br />
that type!<br />
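<br />
To see the derived instances at work, here is a minimal sketch that reuses the<br />
"Dir" type with a longer "deriving" clause. The sample directory names and sizes<br />
are invented for the illustration:<br />
<br />
```haskell
import Data.List (sort)

data Dir = Dir {dir_size::Int, dir_name::String} deriving (Eq, Ord, Show)

main :: IO ()
main = do
  let dirs = [Dir 300 "music", Dir 100 "docs", Dir 200 "src"]
  -- Derived Eq compares the fields one by one.
  print (Dir 100 "docs" == Dir 100 "docs")
  -- Derived Ord compares fields left to right, so dir_size decides first.
  mapM_ print (sort dirs)
```
<br />
Because the derived "Ord" looks at the fields in declaration order, the directories come out sorted by size.<br />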
<br />
Sidenote for Java programmers: just imagine a Java compiler which derives code<br />
for "implements Storable" for you...<br />
<br />
Sidenote for C++ programmers: just imagine that deep copy constructors are<br />
being written for you by the compiler...<br />
<br />
----<br />
Exercises:<br />
* Examine the typeclasses Eq and Show<br />
* Examine the types of (==) and "print"<br />
* Try to make "Dir" an instance of "Eq"<br />
----<br />
<br />
OK, back to our tests. So, what did we have to do in order to make "Dir" an<br />
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's<br />
examine it up close:<br />
<br />
*Main> :t arbitrary<br />
arbitrary :: (Arbitrary a) => Gen a<br />
<br />
See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser<br />
a" which we've seen already. This is yet another example of an action-returning<br />
function, which can be used inside "do"-notation. (You might ask yourself,<br />
wouldn't it be useful to generalize that convenient concept of actions and<br />
"do"? Of course! It is already done, the concept is called "Monad" and we will<br />
talk about it in Chapter 400 :) )<br />
<br />
Since 'a' here is a type which is an instance of "Arbitrary", we can<br />
substitute "Dir" for it. So, how can we make and return an action of type "Gen<br />
Dir"?<br />
<br />
Let's look at the code:<br />
<br />
arbitrary = liftM2 Dir gen_size gen_name<br />
        -- Generate random size between 10 and 1400 Mb<br />
  where gen_size = do s <- choose (10,1400)<br />
                      return (s*1024*1024)<br />
        -- Generate random name 11 to 3001 chars long (n*10+1 for n between 1 and 300), consisting of symbols "fubar/" <br />
        gen_name = do n <- choose (1,300)<br />
                      sequence $ take (n*10+1) $ repeat (elements "fubar/")<br />
<br />
We have used the library-provided functions "choose" and "elements" to build up<br />
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my<br />
word on that. Find a way to check the types of "gen_name" and "gen_size"). Since<br />
"Int" and "String" are components of "Dir", we surely must be able to use "Gen<br />
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for<br />
that? There is none, and there is only a single call to "liftM2". <br />
<br />
Let's examine it:<br />
<br />
*Main> :t liftM2<br />
liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r<br />
<br />
Kind of scary, right? Let's provide the typechecker with more context:<br />
<br />
*Main> :t liftM2 Dir<br />
liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir<br />
<br />
Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for<br />
"m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -><br />
Gen Dir". Exactly what we wanted!<br />
<br />
Consider "liftM2" to be an "advanced topic" of this chapter (which we will cover<br />
later) and just note for now that:<br />
* "2" is the number of arguments of the data constructor "Dir", and we have used "liftM2" to construct "Gen Dir" out of "Dir"<br />
* There are also "liftM", "liftM3", "liftM4", "liftM5"<br />
* "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"<br />
<br />
Hopefully, this will all make sense after you read it for the third time ;)<br />
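<br />
If you want to experiment without QuickCheck, the definition of "liftM2" quoted<br />
above can be tried in any monad. The sketch below uses "Maybe" as a stand-in for<br />
"Gen"; the name "viaDo" is made up for this example:<br />
<br />
```haskell
import Control.Monad (liftM2)

data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

-- The same combination spelled out with "do"-notation,
-- following the definition of liftM2 quoted in the text.
viaDo :: Monad m => m Int -> m String -> m Dir
viaDo a1 a2 = do x <- a1; y <- a2; return (Dir x y)

main :: IO ()
main = do
  print (liftM2 Dir (Just 42) (Just "home"))  -- both arguments present
  print (viaDo      (Just 42) (Just "home"))  -- same result as the line above
  print (liftM2 Dir Nothing   (Just "home"))  -- Nothing: one argument is missing
```
<br />
Both spellings build a "Dir" inside the monad from two monadic arguments, which is what the "Arbitrary" instance does with "Gen".<br />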
<br />
== Chapter 4: REALLY packing the knapsack this time == <br />
<br />
In this chapter we are going to write several not-so-trivial packing methods,<br />
compare their efficiency, and learn something new about debugging and<br />
profiling Haskell programs along the way.<br />
<br />
== Chapter 400: Monads up close ==<br />
<br />
Google "All about monads" and read it. 'Nuff said :)<br />
<br />
== Chapter 500: IO up close ==<br />
<br />
Shows that:<br />
<br />
c = do a <- someAction<br />
b <- someOtherAction<br />
print (bar b)<br />
print (foo a)<br />
print "done"<br />
<br />
really is just syntactic sugar for:<br />
<br />
c = someAction >>= \a -><br />
someOtherAction >>= \b -><br />
print (bar b) >><br />
print (foo a) >><br />
print "done"<br />
<br />
and explains ">>=" and ">>". Oh wait. This was already explained<br />
in Chapter 400 :)<br />
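<br />
Until that chapter is written, here is a runnable sketch of the two forms side by<br />
side. The bodies of "someAction", "someOtherAction", "foo" and "bar" are<br />
placeholders invented for this example, since the text does not define them:<br />
<br />
```haskell
-- Hypothetical stand-ins for the names used in the text.
someAction :: IO Int
someAction = return 1

someOtherAction :: IO Int
someOtherAction = return 2

foo, bar :: Int -> Int
foo = (+ 10)
bar = (* 10)

-- The "do" version from the text.
sugared :: IO ()
sugared = do a <- someAction
             b <- someOtherAction
             print (bar b)
             print (foo a)
             print "done"

-- The same program written with (>>=) and (>>).
desugared :: IO ()
desugared = someAction >>= \a ->
            someOtherAction >>= \b ->
            print (bar b) >>
            print (foo a) >>
            print "done"

main :: IO ()
main = sugared >> desugared
```
<br />
Running it prints the same three lines twice, once per version.<br />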
<br />
== Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software ==<br />
<br />
There is plenty of material on this on the web and on this wiki. Just go get<br />
yourself an installation of GHC (6.4 or above) or Hugs (v200311 or<br />
above) and "darcs", which we will use for version control.<br />
<br />
== Chapter 10000: Thanks! ==<br />
<br />
Thanks for comments, proofreading, good advice and kind words go to: Helge,<br />
alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, avalez, Martin<br />
Percossi. If I should have mentioned YOU and forgot - tell me so.<br />
<br />
Without you I would have stopped after Chapter 1 :)</div>Adepthttps://wiki.haskell.org/index.php?title=QuickCheck_as_a_test_set_generator&diff=2205QuickCheck as a test set generator2006-02-06T19:35:14Z<p>Adept: Migrated from hawiki</p>
<hr />
<div>= <center>Haskell as an ultimate "smoke testing" tool<p>OR</p><p>Using QuickCheck as DIY test data generator</p></center> =<br />
<br />
== Preface ==<br />
<br />
Recently, my wife approached me with the following problem: they had to<br />
test their re-implementation (in Java) of a part of a huge<br />
software system previously written in C++. The original system is poorly<br />
documented and only a small part of the sources was available.<br />
<br />
Among other things, they had to write a parser for a home-brewed DSL<br />
designed to describe data structures. The DSL is a mix of ASN.1 and BNF<br />
grammars; it describes the structure of some data records and simple<br />
business rules relevant to the processing of said records. The DSL is not<br />
Turing-complete, but allows users to define their own functions and<br />
specify math and boolean expressions on fields, and was designed as<br />
"ASN.1 on steroids".<br />
<br />
The problem is that their implementation (in JavaCC) of this DSL parser<br />
was based on the single available description of the DSL grammar,<br />
which was presumably incomplete. They tested the implementation on the several<br />
examples available, but the question remained how to test the parser on a<br />
large set of data in order to be fairly sure that "everything<br />
works".<br />
<br />
== The fame of Quick Check ==<br />
<br />
My wife observed me during the last (2005) ICFP contest and was amazed<br />
at the ease with which our team tested our protocol parser and<br />
printer using QuickCheck. So, she asked me whether it is possible to<br />
generate pseudo-random test data in a similar manner for use<br />
"outside" of Haskell.<br />
<br />
"Why not?" I thought. After all, I found it quite easy to generate<br />
instances of 'Arbitrary' for quite complex data structures.<br />
<br />
== Concept of the '''Variant''' ==<br />
<br />
The task was formulated as follows:<br />
<br />
* The task is to generate test datasets for the external program. Each dataset consists of several files, each containing 1 "record"<br />
<br />
* A "record" is essentially a Haskell data type<br />
<br />
* We must be able to generate pseudo-random "valid" and "invalid" data, to test that the external program consumes all "valid" samples and fails to consume all "invalid" ones. Deviation from this behavior signifies an error in the external program.<br />
<br />
Let's capture this notion of "valid" and "invalid" data in a type<br />
class:<br />
<br />
module Variant where<br />
<br />
import Control.Monad<br />
import Test.QuickCheck<br />
<br />
class Variant a where<br />
valid :: Gen a<br />
invalid :: Gen a<br />
<br />
So, in order to make a set of test data of some type, the user must<br />
provide means to generate "valid" and "invalid" data of this type.<br />
<br />
If we can make a "valid" Foo (for suitable "data Foo = ...") and<br />
"invalid" Foo, then we should also be able to make a "random" Foo:<br />
<br />
instance Variant a => Arbitrary a where<br />
coarbitrary = undefined -- Not needed, Easily fixable<br />
arbitrary = oneof [valid, invalid]<br />
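<br />
A note for readers on current tools: an instance whose head is a bare type<br />
variable is accepted by GHC only with the FlexibleInstances and<br />
UndecidableInstances extensions, and in QuickCheck 2 "coarbitrary" lives in a<br />
separate class. One alternative, sketched here under the assumption of<br />
QuickCheck 2, is a wrapper type. The names "AnyVariant" and the sample "Name"<br />
instance are made up for this illustration:<br />
<br />
```haskell
import Test.QuickCheck (Arbitrary(..), Gen, elements, generate, oneof)

class Variant a where
  valid   :: Gen a
  invalid :: Gen a

-- Hypothetical wrapper: "a value that may be valid or invalid".
newtype AnyVariant a = AnyVariant a deriving Show

-- No language extensions are needed for this instance head.
instance Variant a => Arbitrary (AnyVariant a) where
  arbitrary = fmap AnyVariant (oneof [valid, invalid])

-- A toy Variant instance, only to make the sketch runnable.
newtype Name = Name String deriving Show

instance Variant Name where
  valid   = fmap Name (elements ["foo", "bar", "baz"])
  invalid = fmap Name (elements ["", "!!!"])

main :: IO ()
main = do
  AnyVariant n <- generate arbitrary :: IO (AnyVariant Name)
  print n
```
<br />
The rest of this text keeps the original, shorter formulation.<br />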
<br />
Thus, taking for example the following definition for our<br />
"data-to-test":<br />
<br />
data Record = InputRecord Name Number<br />
| OutputRecord Name Number OutputType<br />
data Number = Number String<br />
data Name = Name String <br />
data OutputType = OutputType String<br />
<br />
we could produce the following instances of the class "Variant":<br />
<br />
-- `neStringOf` is a helper whose definition is not shown in this text; it is<br />
-- sufficient to say that `neStringOf first next` produces a non-empty string whose<br />
-- first character is taken from `first` and all subsequent ones from<br />
-- `next`<br />
garbledString = neStringOf ".,_+-" "abc0!@#$%^&*()."<br />
instance Variant Number where<br />
  valid   = liftM Number $ resize 4 $ neStringOf "123456789" "0123456789"<br />
  invalid = liftM Number $ resize 4 $ garbledString<br />
instance Variant Name where<br />
  valid   = liftM Name $ elements [ "foo", "bar", "baz" ]<br />
  invalid = liftM Name garbledString<br />
instance Variant OutputType where<br />
  valid   = liftM OutputType $ elements [ "Binary", "Ascii" ]<br />
  invalid = liftM OutputType garbledString<br />
<br />
instance Variant Record where<br />
valid = oneof [ liftM2 InputRecord valid valid<br />
, liftM3 OutputRecord valid valid valid ]<br />
invalid = oneof [ liftM2 InputRecord valid invalid<br />
, liftM2 InputRecord invalid valid<br />
, liftM2 InputRecord invalid invalid<br />
, liftM3 OutputRecord invalid valid valid <br />
, liftM3 OutputRecord valid invalid valid <br />
, liftM3 OutputRecord valid valid invalid<br />
, liftM3 OutputRecord invalid invalid valid <br />
, liftM3 OutputRecord valid invalid invalid <br />
, liftM3 OutputRecord invalid valid invalid<br />
, liftM3 OutputRecord invalid invalid invalid<br />
]<br />
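<br />
The helper "neStringOf" used in the basic instances is not defined anywhere in<br />
this text. One possible definition, given purely as an assumption and written<br />
against the QuickCheck 2 combinators "elements" and "listOf", could look like this:<br />
<br />
```haskell
import Test.QuickCheck (Gen, elements, generate, listOf, resize)

-- Assumed definition: one character from `first`, followed by a
-- (possibly empty) run of characters from `next`. The length of the
-- run is bounded by QuickCheck's size parameter, so `resize` controls it.
neStringOf :: [Char] -> [Char] -> Gen String
neStringOf first next = do
  c  <- elements first
  cs <- listOf (elements next)
  return (c : cs)

main :: IO ()
main = do
  -- Something that looks like a positive number with at most 5 digits.
  s <- generate (resize 4 (neStringOf "123456789" "0123456789"))
  putStrLn s
```
<br />
With this reading, "resize 4" in the "Number" instance caps the generated strings at five characters.<br />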
<br />
<br />
The careful reader will have already spotted that once we hand-coded the instances of 'Variant' for a few "basic" types (like 'Name', 'Number', 'OutputType' etc), defining instances of Variant for more complex datatypes becomes easy, though quite tedious. We call to the rescue a set of simple helpers to facilitate this task.<br />
<br />
== Helper tools ==<br />
<br />
It can easily be seen that we consider a value of a data type to be "invalid" if at least one of the arguments to its constructor is "invalid", whereas a "valid" value must have all of its constructor arguments "valid". This calls for some permutations:<br />
<br />
proper1 f = liftM f valid<br />
proper2 f = liftM2 f valid valid<br />
proper3 f = liftM3 f valid valid valid<br />
proper4 f = liftM4 f valid valid valid valid<br />
proper5 f = liftM5 f valid valid valid valid valid<br />
<br />
bad1 f = liftM f invalid<br />
bad2 f = oneof $ tail [ liftM2 f g1 g2 | g1<-[valid, invalid], g2<-[valid, invalid] ]<br />
bad3 f = oneof $ tail [ liftM3 f g1 g2 g3 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid] ]<br />
bad4 f = oneof $ tail [ liftM4 f g1 g2 g3 g4 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid] ]<br />
bad5 f = oneof $ tail [ liftM5 f g1 g2 g3 g4 g5 | g1<-[valid, invalid], g2<-[valid, invalid], g3<-[valid, invalid], g4<-[valid, invalid], g5<-[valid, invalid] ]<br />
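<br />
The "tail" in the "bad" helpers works because the list comprehension enumerates<br />
the all-valid combination first. A tiny sketch with strings in place of<br />
generators (the name "combos" is made up for this example) shows the order:<br />
<br />
```haskell
-- Same shape as the comprehension in bad2, with labels instead of generators.
combos :: [(String, String)]
combos = [ (g1, g2) | g1 <- ["valid", "invalid"], g2 <- ["valid", "invalid"] ]

main :: IO ()
main = do
  print (head combos)        -- ("valid","valid"), the one combination to drop
  mapM_ print (tail combos)  -- the three combinations with at least one "invalid"
```
<br />
Dropping the first element therefore leaves exactly the combinations in which at least one argument is invalid.<br />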
<br />
With those helper definitions we could rewrite our Record instance as follows:<br />
<br />
instance Variant Record where<br />
valid = oneof [ proper2 InputRecord<br />
, proper3 OutputRecord ]<br />
invalid = oneof [ bad2 InputRecord<br />
, bad3 OutputRecord ]<br />
<br />
Note the drastic decrease in the size of the declaration!<br />
<br />
== Producing test data ==<br />
<br />
OK, but how do we use all those fancy declarations to actually produce some test data?<br />
<br />
Let's take a look at the following code:<br />
<br />
import System.IO     (IOMode(WriteMode), openFile, hPutStrLn, hClose)<br />
import System.Random (newStdGen)<br />
<br />
data DataDefinition = DataDefinition Name Record<br />
<br />
main = <br />
  do let num = 200     -- Number of test cases in each dataset.<br />
     let config =      -- Describe several test datasets for "DataDefinition"<br />
                       -- by defining how we want each component of DataDefinition<br />
                       -- for each particular dataset - valid, invalid or random.<br />
                       -- The "txt" file extension is a placeholder.<br />
           [ ("All_Valid",      "txt", num, (valid,     valid    ))<br />
           , ("Invalid_Name",   "txt", num, (invalid,   valid    ))<br />
           , ("Invalid_Record", "txt", num, (valid,     invalid  ))<br />
           , ("Random",         "txt", num, (arbitrary, arbitrary))<br />
           ]<br />
     mapM_ create_test_set config<br />
<br />
-- NB: vectorOf' is a helper that is not defined in this text<br />
create_test_set (fname, ext, count, gens) =<br />
  do rnd <- newStdGen <br />
     let test_set = generate 100 rnd $ vectorOf' count (mkDataDef gens)<br />
     sequence_ $ zipWith (writeToFile fname ext) [1..] test_set <br />
  where<br />
    mkDataDef (gen_name, gen_rec) = liftM2 DataDefinition gen_name gen_rec<br />
<br />
writeToFile name_prefix suffix n x =<br />
  do h <- openFile (name_prefix ++ "_" ++ pad n ++ "." ++ suffix) WriteMode <br />
     hPutStrLn h $ show x<br />
     hClose h <br />
  where pad n = reverse $ take 4 $ (reverse $ show n) ++ (repeat '0')<br />
<br />
You see that we can control the size, nature and destination of each test dataset. This approach was taken to produce test datasets for the task I described earlier. The final Haskell module had definitions for 40 Haskell datatypes, and the topmost datatype had a single constructor with 9 fields. <br />
<br />
This proved to be A Whole Lot Of Code(tm), and the declarations of "instance Variant ..." proved to be a good 30% of the total. Since most of them were variations of the "oneof [proper1 Foo, proper2 Bar, proper4 Baz]" theme, I started looking for a way to simplify/automate generation of such instances.<br />
<br />
== Deriving Variant instances automagically ==<br />
<br />
I took a post made by Bulat Ziganshin on the TemplateHaskell mailing list, which showed how to derive instances of 'Show' automatically, and hacked it to be able to derive instances of "Variant" in much the same way:<br />
<br />
import Language.Haskell.TH<br />
import Language.Haskell.TH.Syntax<br />
<br />
data T3 = T3 String<br />
<br />
deriveVariant t = do<br />
-- Get list of constructors for type t<br />
TyConI (DataD _ _ _ constructors _) <- reify t<br />
<br />
-- Make `valid` or `invalid` clause for one constructor:<br />
-- for "(A x1 x2)" makes "Variant.proper2 A"<br />
let mkClause f (NormalC name fields) = <br />
appE (varE (mkName ("Variant."++f++show(length fields)))) (conE name)<br />
<br />
-- Make body for functions `valid` and `invalid`:<br />
-- valid = oneof [ proper2 A | proper1 C]<br />
-- or<br />
-- valid = proper3 B, depending on the number of constructors<br />
validBody <- case constructors of<br />
[c] -> normalB [| $(mkClause "proper" c) |]<br />
cs -> normalB [| oneof $(listE (map (mkClause "proper") cs)) |]<br />
invalidBody <- case constructors of<br />
[c] -> normalB [| $(mkClause "bad" c) |]<br />
cs -> normalB [| oneof $(listE (map (mkClause "bad") cs)) |]<br />
<br />
  -- Generate template instance declaration and replace type name (T3)<br />
  -- and function bodies (for `valid` and `invalid`) with our data<br />
d <- [d| instance Variant T3 where<br />
valid = liftM T3 valid<br />
invalid = liftM T3 invalid<br />
|]<br />
let [InstanceD [] (AppT showt (ConT _T3)) [ ValD validf _valid [], ValD invalidf _invalid [] ]] = d<br />
return [InstanceD [] (AppT showt (ConT t )) [ ValD validf validBody [], ValD invalidf invalidBody [] ]]<br />
<br />
-- Usage:<br />
$(deriveVariant ''Record)<br />
<br />
[[Adept]]</div>Adepthttps://wiki.haskell.org/index.php?title=User:Adept&diff=2204User:Adept2006-02-06T19:05:25Z<p>Adept: Moving from hawiki</p>
<hr />
<div>== Personal trivia ==<br />
I am known as adept (or ADEpt) at #haskell<br />
<br />
You can reach me via dastapov-at-gmail-dot-com, UIN 18-22-53-38 or JID adept-at-jabber-dot-kiev-dot-ua<br />
<br />
== Texts and articles ==<br />
[[QuickCheck as Test Set Generator]]<br />
<br />
[[Hitchhikers Guide to the Haskell]]</div>Adept