Difference between revisions of "FreeArc/Universal Archive Format"

From HaskellWiki

Revision as of 15:07, 8 July 2008

This page describes the FreeArc archive format and ideas for improving it further. ArcStructure.h from the FreeArc sources may be used to find more precise info about the FreeArc archive format.

Archive blocks

An archive consists of blocks, which are divided into DATA BLOCKS (one data block contains the compressed data of one solid block) and CONTROL BLOCKS, which store archive meta-info (directories, comments, compression methods, recovery records...). Every block may be described by the following info:

  • block type (0 - data block, 1.. - various control blocks)
  • its position in archive (number of first byte)
  • original size
  • compressed size
  • compression algorithm used to compress this block (usually all blocks are compressed; data block compression is controlled by the -m option, control block compression by the -dm option)
  • CRC32 of original data - used to check block consistency
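
The fields listed above can be pictured as a small record. The following is only a sketch: the class name, the field names and the use of a plain string for the compression algorithm are illustrative and not taken from ArcStructure.h, and it assumes that the CRC32 is the common variant computed by zlib.

```python
from dataclasses import dataclass
import zlib

DATA_BLOCK = 0  # block type 0 is a data block; 1 and above are control blocks


@dataclass
class BlockInfo:
    """Hypothetical in-memory form of the block description (names are illustrative)."""
    block_type: int       # 0 = data block, 1.. = various control blocks
    position: int         # number of the block's first byte in the archive
    original_size: int    # size before compression
    compressed_size: int  # size as stored in the archive
    compressor: str       # compression algorithm, kept here as a plain string
    crc32: int            # CRC32 of the original (uncompressed) data

    def is_data_block(self) -> bool:
        return self.block_type == DATA_BLOCK

    def matches(self, original: bytes) -> bool:
        """Check decompressed bytes against the recorded size and CRC32.

        Assumption: the archive uses the same CRC32 variant as zlib.crc32.
        """
        return (len(original) == self.original_size
                and zlib.crc32(original) & 0xFFFFFFFF == self.crc32)
```

A reader would call matches() on the decompressed contents of a block to check its consistency before trusting it.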

An archive consists of blocks in the following order:

  • HEADER BLOCK: contains the archive signature ("ArC\1") and the archiver version. It is the only block whose contents are never compressed, so the first 4 bytes of a FreeArc archive are always "ArC\1", which may be used for file type detection
  • One or more DATA BLOCKS, containing compressed data for one or more solid blocks
  • DIRECTORY BLOCK containing info about the files compressed in the previous data blocks (filename, size, datetime, attributes plus CRC32 for consistency checking), plus descriptors of the data blocks, plus info about which files are stored in which data blocks
  • One or more DATA BLOCKS followed by their corresponding DIRECTORY BLOCK may then repeat any number of times. Unlike in other archivers, the FreeArc archive directory may be split into many parts, each containing info about only part of the archive; this simplifies archive recovery. Directory splitting is controlled by the -s option
  • FOOTER BLOCK: contains info about all other control blocks plus archive-wide information: archive comment, amount of recovery info and so on
  • optionally, TWO RECOVERY RECORD BLOCKS, followed by a second copy of the FOOTER BLOCK; these are added when the -rr option is enabled
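
Since the header block is never compressed, file type detection reduces to comparing the first 4 bytes. A minimal sketch, assuming that "\1" in the signature denotes the single byte 0x01 (C-style escape); the function names are made up for this example:

```python
SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text is the single byte 0x01


def has_freearc_signature(header: bytes) -> bool:
    """Return True if the given leading bytes start with the FreeArc signature."""
    return header[:len(SIGNATURE)] == SIGNATURE


def is_freearc_file(path: str) -> bool:
    """Read the first 4 bytes of a file and compare them with the signature."""
    with open(path, "rb") as f:
        return has_freearc_signature(f.read(len(SIGNATURE)))
```

This check only looks at the signature; it says nothing about whether the rest of the archive is intact.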

To allow recovery of broken archives, every CONTROL BLOCK is followed by its descriptor, which contains the following info:

  1. FreeArc signature ("ArC\1")
  2. above-mentioned block description info

In the descriptor, the control block position is written relative to the block descriptor itself, so even splitting the archive into several chunks doesn't prevent archive recovery. Usually, we get info about all control blocks from the FOOTER BLOCK, but if it's broken, we scan the archive for FreeArc signatures, parse the descriptors that follow them, and confirm that each one indeed describes a control block by checking the CRC32 of the decompressed data.
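
The recovery scan described above could start from something like the sketch below. It only finds candidate signature positions and offers a CRC32 check; parsing the descriptor itself is left out because its byte layout is not given on this page. The function names are hypothetical, "\1" is assumed to be the byte 0x01, and the CRC32 is assumed to be the variant computed by zlib.

```python
import zlib
from typing import List

SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text is the single byte 0x01


def find_signatures(data: bytes) -> List[int]:
    """Return the offsets of every occurrence of the signature in data."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + 1)
    return offsets


def crc_matches(decompressed: bytes, expected_crc: int) -> bool:
    """Accept a candidate block only if its decompressed data has the expected CRC32.

    Assumption: the archive uses the same CRC32 variant as zlib.crc32.
    """
    return zlib.crc32(decompressed) & 0xFFFFFFFF == expected_crc
```

A signature can also occur by chance inside compressed data, which is why each candidate still has to pass the CRC32 check before it is treated as a control block.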

Archive decompression therefore proceeds as follows:

  1. Read the last 4096 bytes of the archive and find the last occurrence of the archive signature in this data
  2. Read the block descriptor that follows the signature and ensure that it describes a footer block
  3. Decompress and parse the footer block to get info about the directory blocks
  4. Decompress and parse the directory blocks to get info about the files contained in the archive and the data blocks containing their compressed data
  5. Decompress the data blocks, writing the decompressed data into files
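
Step 1 of the procedure above can be sketched as follows. The function name is hypothetical, and "\1" is again assumed to be the byte 0x01; the remaining steps depend on the descriptor and block layouts, which are not detailed here.

```python
import os
from typing import BinaryIO, Optional

SIGNATURE = b"ArC\x01"  # assumption: "\1" in the text is the single byte 0x01
TAIL_SIZE = 4096        # number of trailing bytes searched for the signature


def find_footer_signature(f: BinaryIO) -> Optional[int]:
    """Return the absolute offset of the last signature in the final 4096 bytes.

    Returns None when no signature lies entirely within that tail.
    """
    f.seek(0, os.SEEK_END)
    size = f.tell()
    start = max(0, size - TAIL_SIZE)  # archives shorter than 4096 bytes are read whole
    f.seek(start)
    tail = f.read(size - start)
    pos = tail.rfind(SIGNATURE)
    return None if pos == -1 else start + pos
```

The descriptor to be checked in step 2 would then be read starting right after the returned offset plus the 4 signature bytes.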


Footer block

Footer block includes the following info:

  • Descriptors of all previous control blocks in archive (block positions are written relative to position of footer block itself)
  • Archive comment
  • Archive recovery settings

Directory block

Directory block includes the following info:

  • Number of directories and their names
  • Number of files and info about them, stored field by field: first, the directory number for every file; then the rest of the filename for every file; then the size of every file; then the Unix datetime stamp for every file; then the CRC32 for every file; and last, the IsDirectory? flag for every file.
  • Number of solid blocks, number of files in every solid block, compression algorithm for every solid block, descriptor for every solid block (aka DATA BLOCK)
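
Because the directory block stores each field for all files before moving on to the next field, a reader has to combine these columns into per-file records. The sketch below works on already decoded lists and is illustrative only. It assumes that directory numbers are zero-based indexes into the list of directory names, that "/" joins a directory and a filename, and that files appear in the same order as the solid blocks that hold them; none of this is confirmed by the text above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FileEntry:
    path: str
    size: int
    mtime: int    # Unix datetime stamp
    crc32: int
    is_dir: bool


def build_entries(dir_names: Sequence[str], dir_numbers: Sequence[int],
                  names: Sequence[str], sizes: Sequence[int],
                  times: Sequence[int], crcs: Sequence[int],
                  is_dir_flags: Sequence[bool]) -> List[FileEntry]:
    """Combine the per-field columns into one record per file."""
    entries = []
    for d, name, size, mtime, crc, flag in zip(dir_numbers, names, sizes,
                                               times, crcs, is_dir_flags):
        directory = dir_names[d]  # assumption: zero-based index into dir_names
        path = f"{directory}/{name}" if directory else name
        entries.append(FileEntry(path, size, mtime, crc, flag))
    return entries


def split_by_solid_block(entries: Sequence[FileEntry],
                         files_per_block: Sequence[int]) -> List[List[FileEntry]]:
    """Group consecutive files using the number of files in every solid block."""
    groups, start = [], 0
    for count in files_per_block:
        groups.append(list(entries[start:start + count]))
        start += count
    return groups
```

Storing each field as its own column keeps similar values next to each other, which usually helps the compressor applied to the directory block.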

Data formats

Strings are stored in UTF8Z format. Numbers are stored in a variable-size format inspired by 7-zip. Fixed-size numbers (CRC32, datetime) are stored using a fixed number of bytes.
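
A minimal sketch of reading two of these formats from a byte buffer. It assumes that UTF8Z means UTF-8 text terminated by a single zero byte, and that a CRC32 occupies 4 bytes in little-endian order; the byte order is an assumption, not something stated on this page. The variable-size number format is not shown because its exact encoding is not described here. The function names are hypothetical.

```python
import struct
from typing import Tuple


def read_utf8z(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return it and the offset just past the terminator.

    Raises ValueError if no terminating zero byte is found.
    """
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("utf-8"), end + 1


def read_uint32_le(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Read a fixed-size 4-byte unsigned number (assumed little-endian)."""
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4
```

Both functions return the next offset, so successive fields can be read by passing that offset to the following call.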