FreeArc/Universal Archive Format

From HaskellWiki

Latest revision as of 20:56, 2 September 2008

This is a description of the FreeArc archive format and ideas on how it can be further improved. ArcStructure.h from the FreeArc sources may be used to find more precise info about the FreeArc archive format.

Archive blocks

An archive consists of blocks, which divide into DATA BLOCKS (one data block contains the compressed data of one solid block) and CONTROL BLOCKS, which store archive meta-info (directories, comments, compression methods, recovery records...). Each block is described by the following info:

  • block type (0 - data block, 1.. - various control blocks)
  • its position in archive (offset of the first byte)
  • original size
  • compressed size
  • compression algorithm used to compress this block (usually all blocks are compressed; data block compression is controlled by the -m option, control block compression by the -dm option)
  • CRC32 of original data - used to check block consistency
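
The fields listed above can be pictured as a small record. The following is only a sketch: the class name, field names, and the string representation of the compression method are hypothetical, and it assumes the CRC32 is the standard CRC-32 as computed by zlib. ArcStructure.h remains the authoritative source for the real layout.

```python
from dataclasses import dataclass
import zlib


@dataclass
class BlockDescriptor:
    """Hypothetical in-memory view of the per-block info listed above."""
    block_type: int       # 0 = data block, 1.. = various control blocks
    position: int         # offset of the first byte of the block
    original_size: int    # size before compression
    compressed_size: int  # size as stored in the archive
    compressor: str       # compression method (representation is an assumption)
    crc32: int            # CRC32 of the original (uncompressed) data

    def is_data_block(self) -> bool:
        return self.block_type == 0

    def matches(self, original: bytes) -> bool:
        """Consistency check: size and CRC32 of the decompressed data."""
        if len(original) != self.original_size:
            return False
        return (zlib.crc32(original) & 0xFFFFFFFF) == self.crc32
```

A reader would call `matches` on the decompressed bytes of a block before trusting its contents.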

Blocks appear in an archive in the following order:

  • HEADER BLOCK: contains the archive signature ("ArC\1") and the archiver version. It is the only block whose contents are never compressed, so the first 4 bytes of a FreeArc archive are always "ArC\1", which may be used for file type detection
  • One or more DATA BLOCKS, containing compressed data for one or more solid blocks
  • DIRECTORY BLOCK containing info about the files compressed in the previous data blocks (filename, size, datetime, attributes plus CRC32 for consistency checking), plus descriptors of the data blocks, plus info about which files are stored in which data blocks
  • Then one or more DATA BLOCKS followed by the corresponding DIRECTORY BLOCK may repeat any number of times. Unlike in other archivers, the FreeArc archive directory may be split into many parts, each containing info about only part of the archive; this simplifies archive recovery. Directory splitting is controlled by the -s option
  • FOOTER BLOCK: contains info about all other control blocks plus archive-wide information - archive comment, amount of recovery info and so on
  • optionally, TWO RECOVERY RECORD BLOCKS, followed by a second copy of the FOOTER BLOCK - these are added when the -rr option is enabled
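
Since the header block starts with the uncompressed signature, file type detection reduces to comparing the first four bytes. A minimal sketch, where the function name is hypothetical and "\1" is read as the single byte 0x01:

```python
SIGNATURE = b"ArC\x01"  # "ArC\1": three letters followed by the byte 0x01


def looks_like_freearc(data: bytes) -> bool:
    """Return True if the buffer starts with the FreeArc signature."""
    return data[:4] == SIGNATURE
```

In practice the four bytes would come from something like `open(path, "rb").read(4)`.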

To allow recovery of broken archives, every CONTROL BLOCK is followed by its descriptor, which contains the following info:

  1. FreeArc signature ("ArC\1")
  2. above-mentioned block description info

In the descriptor, the control block position is written relative to the block descriptor itself, so even splitting the archive into several chunks doesn't prevent archive recovery. Usually, we get info about all control blocks from the FOOTER BLOCK, but if it's broken, we scan the archive for FreeArc signatures, parse the descriptor that follows each one, and confirm that it really describes a control block by checking the CRC32 of the decompressed data.
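
The recovery scan described above might start with something like the sketch below. Both function names are hypothetical. The second function assumes the relative position is stored as a non-negative distance back from the descriptor to the start of the block; this text does not state the actual sign convention, so check ArcStructure.h before relying on it.

```python
from typing import List

SIGNATURE = b"ArC\x01"


def find_signatures(data: bytes) -> List[int]:
    """Return the offsets of every occurrence of the FreeArc signature."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + 1)
    return offsets


def resolve_block_start(descriptor_offset: int, distance_back: int) -> int:
    """Turn a descriptor-relative position into an offset within this chunk.

    Assumption: the stored value is the distance from the descriptor back
    to the first byte of the control block.
    """
    return descriptor_offset - distance_back
```

Each offset is only a candidate: the signature bytes can also occur by chance inside compressed data, which is why the CRC32 check on the decompressed block is needed. Because the position is relative, the result stays valid even when the chunk is not at its original offset in the archive.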

So, archive decompression proceeds as follows:

  1. Read the last 4096 bytes of the archive and find the last occurrence of the archive signature in this data
  2. Read the block descriptor that follows the signature and ensure that it describes a footer block
  3. Decompress and parse the footer block and get info about the directory blocks
  4. Decompress and parse the directory blocks and get info about the files contained in the archive and the data blocks containing their compressed data
  5. Decompress the data blocks, writing the decompressed data into files
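
Step 1 of the procedure above can be sketched as follows. The function name is hypothetical, and the archive is taken as an in-memory byte string for simplicity; a real reader would seek to the end of the file and read only the tail.

```python
from typing import Optional

SIGNATURE = b"ArC\x01"


def find_last_signature(archive: bytes, tail_size: int = 4096) -> Optional[int]:
    """Find the last signature within the final tail_size bytes.

    Returns the absolute offset of the signature, or None if the tail
    contains no signature (for example, when the footer is damaged).
    """
    tail_start = max(0, len(archive) - tail_size)
    idx = archive.rfind(SIGNATURE, tail_start)
    return idx if idx != -1 else None
```

The descriptor to parse in step 2 would then begin right after the signature, at the returned offset plus `len(SIGNATURE)`. A `None` result is the cue to fall back to scanning the whole archive for signatures.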


Footer block

The footer block includes the following info:

  • Descriptors of all previous control blocks in the archive (block positions are written relative to the position of the footer block itself)
  • Archive comment
  • Archive recovery settings

Directory block

The directory block includes the following info:

  • Number of directories and their names
  • Number of files and info about them, stored field by field: first the directory number for every file, then the rest of the filename for every file, then the size of every file, then the Unix datetime stamp for every file, then the CRC32 for every file, and last the IsDirectory? flag for every file
  • Number of solid blocks, number of files in every solid block, compression algorithm for every solid block, descriptor for every solid block (aka DATA BLOCK)
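
Because file info is stored field by field rather than file by file, a reader has to recombine the columns into per-file records once they are parsed. The sketch below shows only that recombination step: the columns are assumed to be already decoded into Python lists, all names are hypothetical, and joining the directory name and filename with "/" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileEntry:
    path: str
    size: int
    mtime: int    # Unix datetime stamp
    crc32: int
    is_dir: bool


def assemble_entries(dir_names: List[str], dir_numbers: List[int],
                     names: List[str], sizes: List[int], times: List[int],
                     crcs: List[int], is_dir_flags: List[bool]) -> List[FileEntry]:
    """Zip the per-field columns of a directory block into per-file records."""
    columns = (dir_numbers, names, sizes, times, crcs, is_dir_flags)
    if len({len(c) for c in columns}) > 1:
        raise ValueError("all per-file columns must have the same length")
    entries = []
    for d, n, s, t, c, f in zip(*columns):
        directory = dir_names[d]
        # Assumption: directory and filename are joined with a forward slash.
        path = f"{directory}/{n}" if directory else n
        entries.append(FileEntry(path, s, t, c, f))
    return entries
```

Storing each directory name once and referring to it by number keeps repeated path prefixes out of the per-file data.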

Data formats

Strings are stored in UTF8Z format (UTF-8 with a terminating zero byte). Numbers are stored in a variable-size format inspired by 7-zip. Fixed-size numbers (CRC32, datetime) are stored using a fixed number of bytes.
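
A minimal sketch of reading two of these formats from a byte buffer. The function names are hypothetical. The fixed-size reader assumes a 4-byte little-endian value, which is not stated in this text and should be checked against the FreeArc sources. The variable-size number encoding is not specified here, so it is left out.

```python
import struct
from typing import Tuple


def read_utf8z(buf: bytes, pos: int = 0) -> Tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return (string, next position)."""
    end = buf.index(b"\x00", pos)  # raises ValueError if no terminator is found
    return buf[pos:end].decode("utf-8"), end + 1


def read_u32(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Read a fixed-size 32-bit number such as a CRC32.

    Assumption: the value is stored little-endian.
    """
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4
```

Both functions return the position just past what they consumed, so successive fields can be read by passing that position into the next call.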