Update format docs

Marcus Holland-Moritz 2021-03-23 12:46:31 +01:00
parent 88d684379e
commit df5de1f486


@ -81,6 +81,8 @@ a collection of bit-packed arrays and structures. The exact layout of
each list and structure depends on the actual data and is stored
separately in `METADATA_V2_SCHEMA`.
## Metadata Format
Here is a high-level overview of how all the bits and pieces relate
to each other:
@ -152,7 +154,10 @@ to each other:
Thanks to the bit-packing, fields that are unused or only contain a
single (zero) value, e.g. a `group_index` that's always zero because
all files belong to the same group, do not occupy any space.
all files belong to the same group, do not occupy any space in the
metadata block.
### Determining Inode Offsets
Before you can start traversing the metadata, you need to determine
the offsets for symlinks, regular files, devices etc. in the `inodes`
@ -172,29 +177,31 @@ The `inodes` list is strictly in the following order:
* socket/pipe inodes (`S_IFSOCK`, `S_IFIFO`)
The offsets can thus be found using a simple binary search.
The offsets can thus be found by using a binary search with a
predicate on the inode mode. The shared file offset can be found
by subtracting the length of `shared_files_table` from the total
number of regular files.
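
As a rough illustration of the binary search, here is a minimal Python sketch. It assumes a flat list of `st_mode` values, one per inode, already sorted in the order directories, symlinks, regular files, devices, sockets/pipes. It also assumes that the length of the unpacked `shared_files_table` is known. The helper names `inode_rank`, `partition_point` and `find_inode_offsets` are made up for this example and are not part of DwarFS.

```python
import stat


def inode_rank(mode):
    """Map a file mode to its position in the assumed inode ordering."""
    if stat.S_ISDIR(mode):
        return 0
    if stat.S_ISLNK(mode):
        return 1
    if stat.S_ISREG(mode):
        return 2
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return 3
    return 4  # sockets and pipes


def partition_point(count, pred):
    """Return the first index in range(count) for which pred is false.

    pred must be true for a (possibly empty) prefix and false afterwards.
    """
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_inode_offsets(modes, shared_files_len):
    """Find the first inode of each type with one binary search per type."""
    def first_with_rank(rank):
        return partition_point(len(modes),
                               lambda i: inode_rank(modes[i]) < rank)

    offsets = {
        "symlink": first_with_rank(1),
        "file": first_with_rank(2),
        "device": first_with_rank(3),
        "ipc": first_with_rank(4),
    }
    # Unique files come first, so their count is the number of regular
    # files minus the number of shared file inodes.
    num_files = offsets["device"] - offsets["file"]
    offsets["num_unique_files"] = num_files - shared_files_len
    return offsets
```

Each lookup only inspects a logarithmic number of inodes, which is why the predicate is evaluated lazily instead of classifying every inode up front.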
### Unique and Shared File Inodes
The difference between *unique* and *shared* file inodes is that
there is only one *unique* file inode that references a particular
index in the `chunk_table`, whereas there are multiple *shared*
file inodes that will reference the same index. This is how DwarFS
implements file-level de-duplication beyond hardlinks. Hardlinks
share the same inode. Duplicate files that are not hardlinked all
share the same inode. Duplicate files that are not hardlinked each
have a unique inode, but still reference the same content through
the `chunk_table`.
The `shared_files_table` provides the necessary indirection that
maps a *shared* file inode to a `chunk_table` index. However, the
`shared_files_table` is stored in a packed format that only encodes
the number of shared links to a `chunk_table` index, so it must be
unpacked first.
maps a *shared* file inode to a `chunk_table` index.
Once the offsets have been determined and the `shared_files_table`
is unpacked, you can start traversing the metadata. Typically, you
would start a the root directory which is at `dir_entries[0]`,
### Traversing the Metadata
You typically start at the root directory which is at `dir_entries[0]`,
`inodes[0]` and `directories[0]`. Note that the root directory
implicitly has no name, so that `dir_entries[0].name_index`
shouldn't be used.
should not be used.
To list the contents of a directory, you look up the range
of entries from `directories[inode_num].first_entry` to
@ -219,12 +226,12 @@ after adjusting the index:
chunk_index = inode_num - file_inode_offset
For *shared* regular file inodes, you can index into the unpacked
For *shared* regular file inodes, you can index into the (unpacked)
`shared_files_table`:
shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]
The, you can index into `chunk_table`, but you need to adjust the
Then, you can index into `chunk_table`, but you need to adjust the
index once more:
chunk_index = shared_index + num_unique_files
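
The two cases can be combined into one lookup. The following Python sketch is only an illustration: `chunk_index_for_inode` is a hypothetical helper, and `shared_files` is assumed to be the already unpacked `shared_files_table` as a plain list.

```python
def chunk_index_for_inode(inode_num, file_inode_offset, num_unique_files,
                          shared_files):
    """Return the chunk_table index for a regular file inode."""
    file_index = inode_num - file_inode_offset
    if file_index < 0:
        raise ValueError("inode is not a regular file inode")
    if file_index < num_unique_files:
        # Unique file inode: the adjusted inode number is the index.
        return file_index
    # Shared file inode: go through the (unpacked) shared files table,
    # then skip past the entries used by the unique files.
    shared_index = shared_files[file_index - num_unique_files]
    return shared_index + num_unique_files
```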
@ -244,3 +251,80 @@ Last but not least, to read the device id for a device inode, you
can index into `devices`:
device_id = devices[inode_num - device_inode_offset]
### Optionally Packed Structures
The overview above assumes metadata without any additional packing,
which can be produced using:
mkdwarfs --pack-metadata=none --plain-string-tables
However, this isn't the default, and parts of the metadata are
likely stored in a packed format. These are mostly easy to unpack.
#### Shared Files Table Packing
The `shared_files_table` can be stored in a packed format that
only encodes the number of shared links to a `chunk_table` index.
As the minimum number of links is always 2 (otherwise it wouldn't
be shared), the numbers in the packed format are additionally
offset by 2. So for example, a packed table like
[0, 3, 1, 0, 1]
would unpack to:
[0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
The packed format is used when `options.packed_shared_files_table`
is true.
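
The unpacking step can be sketched in a few lines of Python. The function name `unpack_shared_files` is made up for this example. The input is assumed to be the packed table as a plain list of integers.

```python
def unpack_shared_files(packed):
    """Expand per-index link counts (stored minus 2) into a flat table."""
    unpacked = []
    for index, extra_links in enumerate(packed):
        # Each packed value is the number of links minus 2.
        unpacked.extend([index] * (extra_links + 2))
    return unpacked
```

Applied to the packed table `[0, 3, 1, 0, 1]` from the example above, this reproduces the 15-element unpacked table.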
#### Directories Packing
The `directories` table, when stored in packed format, omits
all `parent_entry` fields and uses delta compression for the
`first_entry` fields.
In order to unpack all information, you first have to delta-
decompress the `first_entry` fields, then traverse the whole
directory tree once to fill in the `parent_entry` fields.
This sounds like a lot of work, but it's actually reasonably
fast. For example, for a file system with 15 million entries
in 90,000 directories, reconstructing the `directories` takes
only about 50 milliseconds.
The packed format is used when `options.packed_directories`
is true.
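
To make the two unpacking steps more concrete, here is a hedged Python sketch. It rests on several assumptions that are not spelled out above:

* delta compression means each stored `first_entry` is the difference to the previous one
* the `directories` table ends with a sentinel element, so the entries of directory `i` span `first_entry[i]` up to `first_entry[i + 1]`
* directory inodes are numbered `0` to `num_dirs - 1`
* `parent_entry` is an index into `dir_entries`

The function name and the flat-list inputs are hypothetical.

```python
from collections import deque
from itertools import accumulate


def unpack_directories(packed_first_entries, dir_entry_inodes):
    """Rebuild first_entry and parent_entry from a packed directories table.

    packed_first_entries: delta-encoded first_entry values (with sentinel)
    dir_entry_inodes:     inode number of each element of dir_entries
    """
    # Step 1: undo the delta encoding with a running sum.
    first_entry = list(accumulate(packed_first_entries))
    num_dirs = len(first_entry) - 1  # last element is the assumed sentinel

    # Step 2: walk the tree breadth-first, starting at the root entry,
    # and record for each directory the entry of its parent directory.
    parent_entry = [0] * num_dirs
    queue = deque([0])
    while queue:
        parent = queue.popleft()
        parent_inode = dir_entry_inodes[parent]
        begin = first_entry[parent_inode]
        end = first_entry[parent_inode + 1]
        for entry in range(begin, end):
            inode = dir_entry_inodes[entry]
            if inode < num_dirs:  # only directories have children
                parent_entry[inode] = parent
                queue.append(entry)
    return first_entry, parent_entry
```

Every directory entry is visited exactly once, so the cost is linear in the number of entries.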
#### Chunk Table Packing
The `chunk_table` can also be stored delta-compressed and
must be unpacked accordingly.
The packed format is used when `options.packed_chunk_table`
is true.
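
A minimal Python sketch of the delta decoding follows. It assumes that the first value is stored as is and that every following value is the difference to its predecessor. The function name is made up for this example.

```python
from itertools import accumulate


def unpack_chunk_table(packed):
    """Turn a delta-encoded chunk table back into absolute indices."""
    # A running sum restores the original, monotonically growing values.
    return list(accumulate(packed))
```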
#### Names and Symlinks String Table Packing
Both the `names` and `symlinks` tables can be stored in a
packed format in `compact_names` and `compact_symlinks`.
There are two separate packing schemes that can be combined.
If none of these schemes is active, the difference between
e.g. `names` and `compact_names` is that the former is stored
as a "proper" list, whereas the latter is stored as a single
string plus an index of offsets. As lists of strings store
both offset and length for each element, this already saves
the storage for the length fields, which can easily be
determined from the offsets at run-time.
If the `packed_index` scheme is used in addition, the index
is stored delta-compressed.
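
The lookup in such a compact table can be sketched in Python as follows. This is an illustration under stated assumptions. The strings are assumed not to be compressed any further. The unpacked index is assumed to hold one offset per string plus a final end offset. A packed index is assumed to store one delta (the string length) per string, with an implicit starting offset of zero. Both function names are hypothetical.

```python
from itertools import accumulate


def unpack_index(packed_index):
    """Rebuild absolute offsets from a delta-compressed index."""
    # Assumption: the deltas are the string lengths and offsets start at 0.
    return [0] + list(accumulate(packed_index))


def lookup(buffer, index, i):
    """Return string i; its length follows from two adjacent offsets."""
    return buffer[index[i]:index[i + 1]]
```

For instance, with `buffer = b"binetcusr"` and `index = [0, 3, 6, 9]`, `lookup(buffer, index, 1)` yields `b"etc"`.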
Last but not least, the individual strings can be compressed
as well. The [fsst library](https://github.com/cwida/fsst)
allows for compression of short strings with random access
and is typically able to reduce the overall size of the
string tables by 50%, using a dictionary that is only a few
hundred bytes long. If a `symtab` is set for the string table,
this compression is used.