diff --git a/doc/dwarfs-format.md b/doc/dwarfs-format.md
index e5684d58..54bf6e19 100644
--- a/doc/dwarfs-format.md
+++ b/doc/dwarfs-format.md
@@ -81,6 +81,8 @@ a collection of bit-packed arrays and structures. The exact layout of
 each list and structure depends on the actual data and is stored
 separately in `METADATA_V2_SCHEMA`.
 
+## Metadata Format
+
 Here is a high-level overview of how all the bits and pieces relate
 to each other:
 
@@ -152,7 +154,10 @@ to each other:
 
 Thanks to the bit-packing, fields that are unused or only contain a
 single (zero) value, e.g. a `group_index` that's always zero because
-all files belong to the same group, do not occupy any space.
+all files belong to the same group, do not occupy any space in the
+metadata block.
+
+### Determining Inode Offsets
 
 Before you can start traversing the metadata, you need to determine
 the offsets for symlinks, regular files, devices etc. in the `inodes`
@@ -172,29 +177,31 @@ The `inodes` list is strictly in the following order:
 
 * socket/pipe inodes (`S_IFSOCK`, `S_IFIFO`)
 
-The offsets can thus be found using a simple binary search.
+The offsets can thus be found by using a binary search with a
+predicate on the inode mode. The shared file offset can be found
+by subtracting the length of `shared_files_table` from the total
+number of regular files.
+
+### Unique and Shared File Inodes
 
 The difference between *unique* and *shared* file inodes is that
 there is only one *unique* file inode that references a particular
 index in the `chunk_table`, whereas there are multiple *shared*
 file inodes that will reference the same index. This is how DwarFS
 implements file-level de-duplication beyond hardlinks. Hardlinks
-share the same inode. Duplicate files that are not hardlinked all
+share the same inode. Duplicate files that are not hardlinked each
 have a unique inode, but still reference the same content through
 the `chunk_table`.
 
 The `shared_files_table` provides the necessary indirection that
-maps a *shared* file inode to a `chunk_table` index. However, the
-`shared_files_table` is stored in a packed format that only encodes
-the number of shared links to a `chunk_table` index, so it must be
-unpacked first.
+maps a *shared* file inode to a `chunk_table` index.
 
-Once the offsets have been determined and the `shared_files_table`
-is unpacked, you can start traversing the metadata. Typically, you
-would start a the root directory which is at `dir_entries[0]`,
+### Traversing the Metadata
+
+You typically start at the root directory which is at `dir_entries[0]`,
 `inodes[0]` and `directories[0]`. Note that the root directory
 implicitly has no name, so that `dir_entries[0].name_index`
-shouldn't be used.
+should not be used.
 
 To determine the contents of a directory, we determine the range
 of entries from `directories[inode_num].first_entry` to
@@ -219,12 +226,12 @@ after adjusting the index:
 
     chunk_index = inode_num - file_inode_offset
 
-For *shared* regular file inodes, you can index into the unpacked
+For *shared* regular file inodes, you can index into the (unpacked)
 `shared_files_table`:
 
     shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]
 
-The, you can index into `chunk_table`, but you need to adjust the
+Then, you can index into `chunk_table`, but you need to adjust the
 index once more:
 
     chunk_index = shared_index + num_unique_files
@@ -244,3 +251,80 @@ Last but not least, to read the device id for a device inode, you
 can index into `devices`:
 
     device_id = devices[inode_num - device_inode_offset]
+
+### Optionally Packed Structures
+
+The overview above assumes metadata without any additional packing,
+which can be produced using:
+
+    mkdwarfs --pack-metadata=none --plain-string-tables
+
+However, this isn't the default, and parts of the metadata are
+likely stored in a packed format. These are mostly easy to unpack.
+
+#### Shared Files Table Packing
+
+The `shared_files_table` can be stored in a packed format that
+only encodes the number of shared links to a `chunk_table` index.
+As the minimum number of links is always 2 (otherwise it wouldn't
+be shared), the numbers in the packed format are additionally
+offset by 2. So for example, a packed table like
+
+    [0, 3, 1, 0, 1]
+
+would unpack to:
+
+    [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
+
+The packed format is used when `options.packed_shared_files_table`
+is true.
+
+#### Directories Packing
+
+The `directories` table, when stored in packed format, omits
+all `parent_entry` fields and uses delta compression for the
+`first_entry` fields.
+
+In order to unpack all information, you first have to delta-
+decompress the `first_entry` fields, then traverse the whole
+directory tree once to fill in the `parent_entry` fields.
+This sounds like a lot of work, but it's actually reasonably
+fast. For example, for a file system with 15 million entries
+in 90,000 directories, reconstructing the `directories` takes
+only about 50 milliseconds.
+
+The packed format is used when `options.packed_directories`
+is true.
+
+#### Chunk Table Packing
+
+The `chunk_table` can also be stored delta-compressed and
+must be unpacked accordingly.
+
+The packed format is used when `options.packed_chunk_table`
+is true.
+
+#### Names and Symlinks String Table Packing
+
+Both the `names` and `symlinks` tables can be stored in a
+packed format in `compact_names` and `compact_symlinks`.
+
+There are two separate packing schemes that can be combined.
+If none of these schemes is active, the difference between
+e.g. `names` and `compact_names` is that the former is stored
+as a "proper" list, whereas the latter is stored as a single
+string plus an index of offsets. As lists of strings store
+both offset and length for each element, this already saves
+the storage for the length fields, which can easily be
+determined from the offsets at run-time.
+
+If the `packed_index` scheme is used in addition, the index
+is stored delta-compressed.
+
+Last but not least, the individual strings can be compressed
+as well. The [fsst library](https://github.com/cwida/fsst)
+allows for compression of short strings with random access
+and is typically able to reduce the overall size of the
+string tables by 50%, using a dictionary that is only a few
+hundred bytes long. If a `symtab` is set for the string table,
+this compression is used.
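
The unpacking rule from the new "Shared Files Table Packing" section can be sketched in a few lines of Python. This is a minimal illustration of the rule as the text states it (each packed value is a link count offset by 2), not code taken from DwarFS; the function name `unpack_shared_files` is hypothetical.

```python
def unpack_shared_files(packed):
    """Expand packed link counts into one table entry per shared file inode.

    Assumption (from the documented rule): packed[i] holds the number of
    shared links to the i-th shared chunk_table index, minus 2.
    """
    unpacked = []
    for index, count_minus_two in enumerate(packed):
        # Every shared index has at least 2 links, so add the offset back.
        unpacked.extend([index] * (count_minus_two + 2))
    return unpacked


# The example table from the documentation text.
print(unpack_shared_files([0, 3, 1, 0, 1]))
# [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
```

The resulting list is what the `shared_files[...]` lookup in the traversal formulas indexes into; `num_unique_files` still has to be added to the looked-up value to obtain the final `chunk_index`.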