[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Andrew Coppin andrewcoppin at btinternet.com
Mon Jul 9 16:15:07 EDT 2007


Bulat Ziganshin wrote:
> Hello Andrew,
>
> Sunday, July 8, 2007, 7:07:59 PM, you wrote:
>
> I don't think PPM is so complex - it's just the probability of a
> symbol in some context. It's just too slow in a naive implementation.
>
>   

Oh, sure, the *idea* is simple enough. Trying to actually *implement* it 
correctly is something else... ;-)

(The same statement goes for arithmetic coding, really.)
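
For what it's worth, here's a minimal sketch of the "probability of a
symbol in some context" idea - an order-1 frequency model only, with
names I've just made up, and none of the escape/fallback machinery that
makes real PPM (and the arithmetic coder behind it) actually hard:

import qualified Data.Map as M

-- Toy illustration only (my own names), not real PPM: for each
-- one-symbol context, count how often each following symbol occurs.
type Model = M.Map Char (M.Map Char Int)

buildModel :: String -> Model
buildModel xs = M.fromListWith (M.unionWith (+))
    [ (ctx, M.singleton sym 1) | (ctx, sym) <- zip xs (drop 1 xs) ]

-- Estimated probability of 'sym' occurring right after 'ctx'.
probability :: Model -> Char -> Char -> Double
probability m ctx sym =
    case M.lookup ctx m of
      Nothing     -> 0
      Just counts ->
        let total = fromIntegral (sum (M.elems counts))
            n     = fromIntegral (M.findWithDefault 0 sym counts)
        in  if total == 0 then 0 else n / total

Feeding those probabilities into an arithmetic coder - and falling back
to shorter contexts when a symbol hasn't been seen yet - is where all
the real work hides.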

>> (The downside of course is that now we need a Huffman table in the 
>> output - and any rare symbols might end up with rather long codewords.
>> But remember: Huffman *guarantees* 0% compression or higher on the 
>> payload itself. A Huffman-compressed payload can *never* be bigger, only
>> smaller or same size. So long as you can encode the Huffman table 
>> efficiently, you should be fine...)
>>     
>
> The devil is in the details. Just imagine the size of a Huffman table
> with 64k entries :)  Huffman encoding is inappropriate for LZW output
> simply because most words will have only a few occurrences, and the
> economy of their optimal encoding doesn't justify the price of their
> entries in the table.
>   

...which is why you need to "encode the Huffman table efficiently", to 
quote myself. ;-)

Using canonical Huffman, you only actually need to know how many bits
were assigned to each symbol. This information is probably very
amenable to RLE. (Which, incidentally, is why I started this whole
"parser on top of a phaser" craziness in the first place.) So, yeah,
there may be 64k symbols - but if only 1k of them are ever *used*... ;-)
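
Just to make the canonical-Huffman point concrete, here's a rough
sketch (all the names below are mine, purely illustrative): given only
the bit length assigned to each symbol, the decoder can reconstruct
exactly the same codewords, so the "table" is really just the length
list - and with ~63k of those lengths being zero, it's an obvious RLE
target.

import Data.Bits (shiftL)
import Data.List (group, sortBy)
import Data.Ord  (comparing)

-- Illustrative sketch, not library code. Canonical Huffman: sort
-- symbols by (length, symbol) and hand out consecutive code values,
-- left-shifting whenever the length grows. Length 0 means "unused".
canonicalCodes :: [(Int, Int)]        -- (symbol, code length)
               -> [(Int, Int, Int)]   -- (symbol, code length, codeword)
canonicalCodes lengths = go 0 0 sorted
  where
    sorted = sortBy (comparing (\(s, l) -> (l, s)))
                    [ (s, l) | (s, l) <- lengths, l > 0 ]
    go _    _       []              = []
    go code prevLen ((s, l) : rest) =
      let code' = code `shiftL` (l - prevLen)
      in  (s, l, code') : go (code' + 1) l rest

-- The length table itself: long runs of zeros for unused symbols,
-- so run-length encoding it should do nicely.
rle :: Eq a => [a] -> [(a, Int)]
rle = map (\run -> (head run, length run)) . group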

>> .ru = Russia?
>>     
>
> of course
>   

My Russian is very rusty. ;-)

>> Oh hey, I think GHC is already pretty smart. But no optimiser can ever
>> hope to cover *every* possible case. And transforming [Bool] -> [Bool]
>> into UArray Word8 -> UArray Word8 just seems a liiiiittle bit
>> optimistic, methinks. ;-)
>>     
>
> 15 years ago I wrote a very smart asm program (btw, it was an ARJ
> unpacker) with handmade function inlining, loop unrolling, register
> allocation, CPU recognition and so on. Now most of these tricks are
> standard for C compilers. Times change, and it's hard to imagine which
> optimizations will be available 10 years from now.
>   

Yes, but there are limits to what an optimiser can hope to accomplish.

For example, you wouldn't implement a bubble sort and seriously expect 
the compiler to be able to "optimise" that into a merge sort, would you? ;-)

> GHC's native and via-C modes are blind vs. lame. In native mode, its
> code generator is comparable with 20-year-old C code generators. See
> above how much modern C compilers have changed in those years. In
> via-C mode it generates unnatural C code which is hard for any C
> compiler to optimize.

I'll take your word for it. ;-)

(I have made cursory attempts to comprehend the inner workings of GHC - 
but this is apparently drastically beyond my powers of comprehension.)

> jhc is a very different story

Yes - last I heard, it's an experimental research project rather than a 
production-ready compiler...


