Update format docs

2025-09-09 12:28:13 -04:00 · 2021-03-23 12:46:31 +01:00 · 2021-03-23 12:46:31 +01:00 · df5de1f486
commit df5de1f486
parent 88d684379e
1 changed files with 97 additions and 13 deletions
--- a/doc/dwarfs-format.md
+++ b/doc/dwarfs-format.md
@@ -81,6 +81,8 @@ a collection of bit-packed arrays and structures. The exact layout of
 each list and structure depends on the actual data and is stored
 separately in `METADATA_V2_SCHEMA`.

+## Metadata Format
+
 Here is a high-level overview of how all the bits and pieces relate
 to each other:

@@ -152,7 +154,10 @@ to each other:

 Thanks to the bit-packing, fields that are unused or only contain a
 single (zero) value, e.g. a `group_index` that's always zero because
-all files belong to the same group, do not occupy any space.
+all files belong to the same group, do not occupy any space in the
+metadata block.
+
+### Determining Inode Offsets

 Before you can start traversing the metadata, you need to determine
 the offsets for symlinks, regular files, devices etc. in the `inodes`
@@ -172,29 +177,31 @@ The `inodes` list is strictly in the following order:

 * socket/pipe inodes (`S_IFSOCK`, `S_IFIFO`)

-The offsets can thus be found using a simple binary search.
+The offsets can thus be found by using a binary search with a
+predicate on the inode mode. The shared file offset can be found
+by subtracting the length of `shared_files_table` from the total
+number of regular files.
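
The offset search described above could look roughly like the following Python sketch. It assumes the inode modes are available as a plain list `modes`, ordered as directories, symlinks, regular files, devices, then sockets/pipes (the start of that ordering list falls outside this hunk, so the rank order is an assumption), and that `num_shared_inodes` is the length of the unpacked `shared_files_table`. All function names are hypothetical.

```python
import stat

def mode_rank(mode):
    """Map an inode mode to its position in the assumed `inodes` ordering."""
    if stat.S_ISDIR(mode):
        return 0
    if stat.S_ISLNK(mode):
        return 1
    if stat.S_ISREG(mode):
        return 2
    if stat.S_ISBLK(mode) or stat.S_ISCHR(mode):
        return 3
    return 4  # sockets and pipes

def first_with_rank(modes, rank):
    """Binary search for the first inode whose rank is >= `rank`."""
    lo, hi = 0, len(modes)
    while lo < hi:
        mid = (lo + hi) // 2
        if mode_rank(modes[mid]) < rank:
            lo = mid + 1
        else:
            hi = mid
    return lo

def inode_offsets(modes, num_shared_inodes):
    """Return the start offset of each inode class plus the unique file count."""
    file_offset = first_with_rank(modes, 2)
    device_offset = first_with_rank(modes, 3)
    num_regular = device_offset - file_offset
    return {
        "symlink": first_with_rank(modes, 1),
        "file": file_offset,
        # shared file inodes start this many entries after `file`
        "num_unique_files": num_regular - num_shared_inodes,
        "device": device_offset,
        "other": first_with_rank(modes, 4),
    }
```

Because the list is sorted by inode class, each boundary costs one binary search instead of a linear scan over all inodes.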
+
+### Unique and Shared File Inodes

 The difference between *unique* and *shared* file inodes is that
 there is only one *unique* file inode that references a particular
 index in the `chunk_table`, whereas there are multiple *shared*
 file inodes that will reference the same index. This is how DwarFS
 implements file-level de-duplication beyond hardlinks. Hardlinks
-share the same inode. Duplicate files that are not hardlinked all
+share the same inode. Duplicate files that are not hardlinked each
 have a unique inode, but still reference the same content through
 the `chunk_table`.

 The `shared_files_table` provides the necessary indirection that
-maps a *shared* file inode to a `chunk_table` index. However, the
-`shared_files_table` is stored in a packed format that only encodes
-the number of shared links to a `chunk_table` index, so it must be
-unpacked first.
+maps a *shared* file inode to a `chunk_table` index.

-Once the offsets have been determined and the `shared_files_table`
-is unpacked, you can start traversing the metadata. Typically, you
-would start a the root directory which is at `dir_entries[0]`,
+### Traversing the Metadata
+
+You typically start at the root directory, which is at `dir_entries[0]`,
 `inodes[0]` and `directories[0]`. Note that the root directory
 implicitly has no name, so that `dir_entries[0].name_index`
-shouldn't be used.
+should not be used.

 To determine the contents of a directory, we determine the range
 of entries from `directories[inode_num].first_entry` to
@@ -219,12 +226,12 @@ after adjusting the index:

    chunk_index = inode_num - file_inode_offset

-For *shared* regular file inodes, you can index into the unpacked
+For *shared* regular file inodes, you can index into the (unpacked)
 `shared_files_table`:

    shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]

-The, you can index into `chunk_table`, but you need to adjust the
+Then, you can index into `chunk_table`, but you need to adjust the
 index once more:

    chunk_index = shared_index + num_unique_files
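
The unique and shared cases above can be folded into a single lookup. This is only a sketch: `chunk_index_for_inode` is a hypothetical helper, and `shared_files` is assumed to be the already unpacked `shared_files_table`.

```python
def chunk_index_for_inode(inode_num, file_inode_offset, num_unique_files, shared_files):
    """Return the `chunk_table` index for a regular file inode."""
    index = inode_num - file_inode_offset
    if index < num_unique_files:
        # unique file inode: the adjusted index is used directly
        return index
    # shared file inode: go through the shared files indirection
    shared_index = shared_files[index - num_unique_files]
    return shared_index + num_unique_files
```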
@@ -244,3 +251,80 @@ Last but not least, to read the device id for a device inode, you
 can index into `devices`:

    device_id = devices[inode_num - device_inode_offset]
+
+### Optionally Packed Structures
+
+The overview above assumes metadata without any additional packing,
+which can be produced using:
+
+    mkdwarfs --pack-metadata=none --plain-string-tables
+
+However, this isn't the default, and parts of the metadata are
+likely stored in a packed format. These are mostly easy to unpack.
+
+#### Shared Files Table Packing
+
+The `shared_files_table` can be stored in a packed format that
+only encodes, for each shared `chunk_table` index, the number of
+inodes that reference it. As that number is always at least 2
+(otherwise the file wouldn't be shared), each stored value is the
+actual count minus 2. So for example, a packed table like
+
+    [0, 3, 1, 0, 1]
+
+would unpack to:
+
+    [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
+
+The packed format is used when `options.packed_shared_files_table`
+is true.
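
A minimal sketch of the unpacking step, written to match the example above; `unpack_shared_files` is a hypothetical name, not a function from the DwarFS code base.

```python
def unpack_shared_files(packed):
    """Expand per-index link counts (stored minus 2) into one index per inode."""
    unpacked = []
    for index, count in enumerate(packed):
        # every shared file has at least two inodes, hence the +2
        unpacked.extend([index] * (count + 2))
    return unpacked
```

The length of the unpacked list equals the number of shared file inodes, which is the value needed when determining the shared file offset.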
+
+#### Directories Packing
+
+The `directories` table, when stored in packed format, omits
+all `parent_entry` fields and uses delta compression for the
+`first_entry` fields.
+
+In order to unpack all information, you first have to delta-
+decompress the `first_entry` fields, then traverse the whole
+directory tree once to fill in the `parent_entry` fields.
+This sounds like a lot of work, but it's actually reasonably
+fast. For example, for a file system with 15 million entries
+in 90,000 directories, reconstructing the `directories` takes
+only about 50 milliseconds.
+
+The packed format is used when `options.packed_directories`
+is true.
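
The two unpacking steps might be sketched as follows. Several things here are assumptions rather than facts from this document: each packed `first_entry` is taken to be the difference from its predecessor (with the first value stored as-is), `directories` is taken to have one trailing sentinel element, the entries of directory inode `i` are taken to span `first_entry[i]` up to `first_entry[i + 1]`, and `parent_entry` is taken to hold the `dir_entries` index of the parent directory. The flat lists `packed_first_entry` and `entry_inode` (the inode number of each `dir_entries` element) and the function name are hypothetical.

```python
from collections import deque
from itertools import accumulate

def unpack_directories(packed_first_entry, entry_inode):
    """Rebuild `first_entry` and `parent_entry` lists under the stated assumptions."""
    # step 1: delta-decode the first_entry fields (running sum)
    first_entry = list(accumulate(packed_first_entry))
    num_dirs = len(first_entry) - 1  # last element is the assumed sentinel
    parent_entry = [0] * len(first_entry)

    # step 2: breadth-first walk from the root entry to recover parents
    queue = deque([0])
    while queue:
        parent = queue.popleft()
        p_ino = entry_inode[parent]
        for e in range(first_entry[p_ino], first_entry[p_ino + 1]):
            e_ino = entry_inode[e]
            if e_ino < num_dirs:  # directory inodes come first in `inodes`
                parent_entry[e_ino] = parent
                queue.append(e)
    return first_entry, parent_entry
```

Each directory entry is visited exactly once, so the cost is linear in the number of entries.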
+
+#### Chunk Table Packing
+
+The `chunk_table` can also be stored delta-compressed and
+must be unpacked accordingly.
+
+The packed format is used when `options.packed_chunk_table`
+is true.
+
+#### Names and Symlinks String Table Packing
+
+Both the `names` and `symlinks` tables can be stored in a
+packed format in `compact_names` and `compact_symlinks`.
+
+There are two separate packing schemes that can be combined.
+If none of these schemes is active, the difference between
+e.g. `names` and `compact_names` is that the former is stored
+as a "proper" list, whereas the latter is stored as a single
+string plus an index of offsets. As lists of strings store
+both offset and length for each element, this already saves
+the storage for the length fields, which can easily be
+determined from the offsets at run-time.
+
+If the `packed_index` scheme is used in addition, the index
+is stored delta-compressed.
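
As an illustration of the offset-based layout, the sketch below looks up one string from a single buffer plus an index. It assumes the index holds `n + 1` offsets for `n` strings, and that a delta-compressed index stores one length per string; both are assumptions about the layout, and the function names are hypothetical. String compression via a `symtab` is not covered.

```python
from itertools import accumulate

def unpack_index(packed_index):
    """Turn per-string lengths into n + 1 offsets (assumed delta scheme)."""
    return [0] + list(accumulate(packed_index))

def string_at(buffer, index, i):
    """Return string `i`; its length follows from two adjacent offsets."""
    return buffer[index[i]:index[i + 1]]
```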
+
+Last but not least, the individual strings can be compressed
+as well. The [fsst library](https://github.com/cwida/fsst)
+allows for compression of short strings with random access
+and is typically able to reduce the overall size of the
+string tables by 50%, using a dictionary that is only a few
+hundred bytes long. If a `symtab` is set for the string table,
+this compression is used.