Difference between revisions of "FreeArc/Universal Archive Format"

From HaskellWiki

Revision as of 14:41, 8 July 2008

This page describes the FreeArc archive format and ideas for improving it further.

Archive block structure

An archive consists of blocks, which are divided into DATA BLOCKS (one data block contains the compressed data of one solid block) and CONTROL BLOCKS, which store archive meta-info (directories, comments, compression methods, recovery records...). Every block can be described by the following info:

  • block type (0 - data block, 1.. - various control blocks)
  • its position in archive (number of first byte)
  • original size
  • compressed size
  • compression algorithm used to compress this block (usually all blocks are compressed; data block compression is controlled by the -m option, control block compression by the -dm option)
  • CRC32 of original data - used to check block consistency
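
The six fields above can be modelled as a small record. This is only a sketch: the field names, the string used for the compression method, and the use of the standard zlib CRC-32 are assumptions made for illustration, and the on-disk encoding of these fields is not described here.

```python
from dataclasses import dataclass
import zlib

DATA_BLOCK = 0  # per the list above: 0 is a data block, 1 and up are control blocks


@dataclass(frozen=True)
class BlockDescriptor:
    # All field names are hypothetical; only their meaning comes from the list above.
    block_type: int       # 0 = data block, 1.. = various control blocks
    position: int         # number of the block's first byte in the archive
    original_size: int    # size of the data before compression
    compressed_size: int  # size of the data as stored in the archive
    compressor: str       # compression method (string representation is an assumption)
    crc32: int            # CRC32 of the original (uncompressed) data

    def is_control(self) -> bool:
        return self.block_type != DATA_BLOCK

    def matches(self, original: bytes) -> bool:
        """Check decompressed data against the recorded size and CRC32.

        Assumes the checksum is the common zlib/PKZIP CRC-32.
        """
        return (len(original) == self.original_size
                and zlib.crc32(original) & 0xFFFFFFFF == self.crc32)
```

A consistency check then amounts to decompressing a block and calling `matches` on the result.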

Usually an archive consists of blocks in the following order:

  • HEADER BLOCK: contains the archive signature ("ArC\1") and archiver version. It is the only block whose contents are never compressed, so the first 4 bytes of a FreeArc archive are always "ArC\1", which can be used for file type detection
  • One or more DATA BLOCKS, containing compressed data for one or more solid blocks
  • DIRECTORY BLOCK containing info about files compressed in previous data blocks (filename, size, datetime, attributes plus its CRC32 for consistency checking) plus descriptors of data blocks plus info about which files are stored in which data blocks
  • Then one or more DATA BLOCKS followed by the corresponding DIRECTORY BLOCK may follow again, any number of times. Unlike in other archivers, the FreeArc archive directory may be split into many parts, each containing info about only a part of the archive; this simplifies archive recovery. Directory splitting is controlled by the -s option
  • FOOTER BLOCK: contains info about all other control blocks plus archive-wide information - archive comment, amount of recovery info and so on
  • optionally, TWO RECOVERY RECORD BLOCKS, followed by a second copy of the FOOTER BLOCK; these are added when the -rr option is enabled
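
Because the header block is never compressed, file type detection only needs the first four bytes. A minimal sketch, assuming that "\1" in the signature denotes the single byte 0x01; the function names are made up for this example.

```python
SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text means the byte 0x01


def looks_like_freearc(data: bytes) -> bool:
    """Return True if the buffer starts with the FreeArc signature."""
    return data[:len(SIGNATURE)] == SIGNATURE


def file_looks_like_freearc(path: str) -> bool:
    """Read only the first 4 bytes of a file and compare them to the signature."""
    with open(path, "rb") as f:
        return looks_like_freearc(f.read(len(SIGNATURE)))
```

This only identifies the file type; it says nothing about whether the rest of the archive is intact.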

To allow recovery of broken archives, a descriptor is written after every CONTROL BLOCK, containing the following info:

  1. FreeArc signature ("ArC\1")
  2. above-mentioned block description info

In the descriptor, the control block position is written relative to the block descriptor itself, so even splitting the archive into several chunks doesn't prevent archive recovery. Usually we get info about all control blocks from the FOOTER BLOCK, but if it's broken, we scan the archive for FreeArc signatures, parse the descriptors that follow them, and verify that each one indeed describes a control block by checking the CRC32 of the decompressed data.
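
The recovery scan can be sketched roughly as follows. The binary layout of a descriptor is not given here, so this example only locates candidate signatures, resolves a relative position, and checks a CRC32. It assumes that "\1" means the byte 0x01, that the relative position is stored as a distance back from the descriptor to the block's first byte, and that the checksum is the standard zlib CRC-32; all function names are hypothetical.

```python
import zlib
from typing import List

SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text means the byte 0x01


def signature_offsets(data: bytes) -> List[int]:
    """Offsets of every occurrence of the signature in the buffer."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + 1)
    return offsets


def block_start(descriptor_offset: int, distance_back: int) -> int:
    """Resolve a block position stored relative to its descriptor.

    Assumes the stored value is the distance from the descriptor
    back to the first byte of the block it describes.
    """
    return descriptor_offset - distance_back


def crc_matches(original: bytes, expected_crc: int) -> bool:
    """Accept a candidate block only if its decompressed data has the recorded CRC32."""
    return zlib.crc32(original) & 0xFFFFFFFF == expected_crc
```

Since the position is relative, the same arithmetic works on a chunk cut out of the middle of an archive: only the distance between the descriptor and its block matters, not the absolute offset.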

So, archive decompression proceeds as follows:

  1. Read the last 4096 bytes of the archive and find the last occurrence of the archive signature in this data
  2. Read the block descriptor after the signature found and ensure that it describes a footer block
  3. Decompress and parse the footer block to get info about directory blocks
  4. Decompress and parse the directory blocks to get info about the files contained in the archive and the data blocks containing their compressed data
  5. Decompress the data blocks, writing the decompressed data into files
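
Step 1 is simple enough to sketch directly. The example below assumes that "\1" means the byte 0x01 and works on any seekable binary file object; the function name is hypothetical, and parsing the descriptor that follows the signature is left out because its layout is not specified here.

```python
import io

SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text means the byte 0x01
TAIL_SIZE = 4096        # number of trailing bytes to inspect, as in step 1


def find_last_signature(f) -> int:
    """Return the absolute offset of the last signature found in the
    final 4096 bytes of a seekable binary file, or -1 if there is none."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    start = max(0, size - TAIL_SIZE)  # archives shorter than 4096 bytes are read whole
    f.seek(start)
    tail = f.read(size - start)
    idx = tail.rfind(SIGNATURE)
    return -1 if idx == -1 else start + idx
```

The returned offset is where the footer block's descriptor is expected to begin; a result of -1 would be the point at which to fall back to scanning the whole archive for signatures.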