mirror of
https://github.com/mhx/dwarfs.git
synced 2025-08-04 02:06:22 -04:00
376 lines
22 KiB
Markdown
376 lines
22 KiB
Markdown
# dwarfs-format(5) -- DwarFS File System Format v2.5
|
|
|
|
## DESCRIPTION
|
|
|
|
This document describes the DwarFS file system format, version 2.5.
|
|
|
|
## FILE STRUCTURE
|
|
|
|
A DwarFS file system image is just a sequence of blocks, optionally
|
|
prefixed by a "header", which is typically some sort of shell script.
|
|
Each block has the following format:
|
|
|
|
┌───┬───┬───┬───┬───┬───┬───┬───┐
|
|
0x00 │'D'│'W'│'A'│'R'│'F'│'S'│MAJ│MIN│ MAJ=0x02, MIN=0x05 for v2.5
|
|
├───┴───┴───┴───┴───┴───┴───┴───┤
|
|
0x08 │ │ Used for full (slow) integrity
|
|
├─ SHA-512/256 integrity hash ─┤ check with `dwarfsck`.
|
|
0x10 │ over the remainder of the │
|
|
├─ block data, starting at ─┤
|
|
0x18 │ offset 0x28. │
|
|
├─ ─┤
|
|
0x20 │ │
|
|
├───────────────────────────────┤
|
|
0x28 │ XXH3-64 hash over remainder │ Used for fast integrity check.
|
|
├───────────────┬───────┬───────┤
|
|
0x30 │Section Number │SecType│CompAlg│ All integer fields are in LE
|
|
├───────────────┴───────┴───────┤ byte order.
|
|
0x38 │ Length of remaining data │
|
|
├───────────────────────────────┤
|
|
0x40 │ │
|
|
│ Section data compressed using │
|
|
│ CompAlg algorithm. │
|
|
│ │
|
|
│ │
|
|
│ │
|
|
└───────────────────────────────┘
|
|
|
|
A couple of notes:
|
|
|
|
- No padding is added between blocks.
|
|
|
|
- The list of blocks can easily be traversed by using the length field
|
|
to skip to the start of the next section.
|
|
|
|
- Corruption can easily be detected using the XXH3-64 hash. Computation
|
|
of this hash is so fast that it is in fact checked every single time a
|
|
file system block is loaded.
|
|
|
|
- Integrity can furthermore be checked using the SHA-512/256 hash. This
|
|
is much slower, but should rarely be needed.
|
|
|
|
- All header fields, except for the magic and version number, are
|
|
protected by the hashes.
|
|
|
|
- In case of corruption, sections can easily be retrieved by scanning
|
|
for the magic. The version number can be recovered by looking at all
|
|
sections and choosing the majority. The explicit section number helps
|
|
to recover data if multiple sections are missing.
|
|
|
|
- A major version number change will render the format incompatible.
|
|
|
|
- A minor version number change will be backwards compatible, i.e. an
|
|
old program will refuse to read a file system with a minor version
|
|
larger than the one it supports. However, a new program will still
|
|
read all file systems with a smaller minor version number.
|
|
|
|
### Header Detection
|
|
|
|
In order to access the file system data when it is prefixed by a header,
|
|
the size of the header must be known. It can either be given to the
|
|
tools or the FUSE driver explicitly (using e.g. the `--image-offset` or
|
|
`-o offset` options), or it can be determined automatically (by passing
|
|
`auto` as the argument to the aforementioned options).
|
|
|
|
Automatic detection works by scanning the file for the section header
|
|
magic (`DWARFS`) and validating the match by looking up the second
|
|
section header using the length of the first section and also checking
|
|
its magic. It is rather unlikely that a file is created accidentally
|
|
that would pass this check, although one could be crafted manually
|
|
without any problems.
|
|
|
|
### Section Types
|
|
|
|
There are currently 4 different section types.
|
|
|
|
- `BLOCK` (0):
|
|
A block of data. This is where all file data is stored. There can be
|
|
an arbitrary number of blocks of this type.
|
|
|
|
- `METADATA_V2_SCHEMA` (7):
|
|
The schema used to layout the `METADATA_V2` block contents. This is
|
|
stored in "compact" thrift encoding.
|
|
|
|
- `METADATA_V2` (8):
|
|
This section contains the bulk of the metadata. It's essentially just
|
|
a collection of bit-packed arrays and structures. The exact layout of
|
|
each list and structure depends on the actual data and is stored
|
|
separately in `METADATA_V2_SCHEMA`.
|
|
|
|
- `SECTION_INDEX` (9):
|
|
The section index is, well, an index of all sections in the file
|
|
system. If present (creation of the index can be suppressed with
|
|
`--no-section-index`), this is *required* to be the last section.
|
|
Each entry in the section index is a 64-bit value with the upper
|
|
16 bits being the section type and the lower 48 bits being the
|
|
offset relative to the first section. That is, the section index
|
|
is independent of whether or not a header is present before the
|
|
first section. The whole point of the section index is to avoid
|
|
having to build an index by visiting all section headers.
|
|
In order to find the start of the section index, you only have
|
|
to read the last 64-bit value from the file, check if the upper
|
|
16 bits match the `SECTION_INDEX`, then add the image offset
|
|
(header size) to the lower 48 bits. At that position in the
|
|
file, you should find a valid section header for the section
|
|
index.
|
|
|
|
- `HISTORY` (10):
|
|
File system history information as defined `thrift/history.thrift`.
|
|
This is stored in "compact" thrift encoding. Zero or more history
|
|
sections are supported.
|
|
|
|
## METADATA FORMAT
|
|
|
|
Here is a high-level overview of how all the bits and pieces relate
|
|
to each other:
|
|
|
|
═════════════ ┌─────────────────────────────────────────────────────────────────────────┐
|
|
DwarFS v2.5 │ │
|
|
═════════════ │ ┌───────────────────────────────────────────┐ │
|
|
│ │ │ │
|
|
dir_entries[] ▼ │ inodes[] │ directories[] │
|
|
╔════╗ ┌────────────────┐ │ S_IFDIR ──►┌───────────────────┐ │ ┌────────────────┴─┐
|
|
║root╟──►│ name_index: 0 │ │ │ mode_index: 0 ├──────┐ └─►│ parent_entry: 0 │
|
|
╚════╝ │ inode_num: 0 ├───────┴────────────►│ owner_index: 0 │ │ │ first_entry: 1 │
|
|
├────────────────┤ │ group_index: 0 │ │ ├──────────────────┤
|
|
┌───┤ name_index: 2 │ │ atime_offset: 0 │ │ │ parent_entry: 0 │
|
|
┌────┼───┤ inode_num: 5 ├───────┐ │ mtime_offset: 417 │ │ │ first_entry: 11 │
|
|
│ │ ├────────────────┤ │ │ ctime_offset: 0 │ │ ├──────────────────┤
|
|
│ ┌──┼───┤ name_index: 3 │ │ ├───────────────────┤ │ │ parent_entry: 5 │
|
|
│ │ │ │ inode_num: 9 ├────┐ │ │ ... │ │ │ first_entry: 12 │
|
|
│ │ │ ├────────────────┤ │ │ S_IFLNK ──►├───────────────────┤ │ ├──────────────────┤
|
|
│ │ │ │ │ │ │ │ mode_index: 2 │ │ │ │
|
|
│ │ │ │ ... │ │ └────────────►│ owner_index: 2 │ │ │ ... │
|
|
│ │ │ │ │ │ │ group_index: 0 │ │ │ │
|
|
│ │ │ └────────────────┘ │ │ atime_offset: 0 │ │ └──────────────────┘
|
|
│ │ │ │ │ mtime_offset: 298 │ │
|
|
│ │ │ │ │ ctime_offset: 0 │ │
|
|
│ │ │ names[] │ ├───────────────────┤ │ modes[]
|
|
│ │ │ ┌────────────┐ │ │ ... │ │ ┌─────────────┐
|
|
│ │ │ │ "usr" │ │ S_IFREG ──►├───────────────────┤ └────►│ 0040775 │
|
|
│ │ │ ├────────────┤ │ (unique) │ mode_index: 1 │ ├─────────────┤
|
|
│ │ │ │ "share" │ ├───────────────►│ owner_index: 0 ├──────┐ │ 0100644 │
|
|
│ │ │ ├────────────┤ │ │ group_index: 0 │ │ ├─────────────┤
|
|
│ │ └──►│ "words" │ │ │ atime_offset: 0 │ │ │ ... │
|
|
│ │ ├────────────┤ │ │ mtime_offset: 298 │ │ └─────────────┘
|
|
│ └─────►│ "lib" │ │ │ ctime_offset: 0 │ │
|
|
│ ├────────────┤ │ ├───────────────────┤ │ uids[]
|
|
│ │ "ls" │ │ │ ... │ │ ┌─────────────┐
|
|
│ ├────────────┤ │ S_IFREG ──►├───────────────────┤ └────►│ 0 │
|
|
│ │ ... │ │ ┌──(shared) │ mode_index: 4 │ ├─────────────┤
|
|
▼ └────────────┘ │ │ │ owner_index: 2 │ │ 1000 │
|
|
(inode-off) │ │ │ group_index: 1 ├──────┐ ├─────────────┤
|
|
│ │ │ │ atime_offset: 0 │ │ │ ... │
|
|
│ symlink_table[] │ │ │ mtime_offset: 298 │ │ └─────────────┘
|
|
│ ┌────────────┐ │ │ │ ctime_offset: 0 │ │
|
|
│ │ 1 ├───┐ │ │ ├───────────────────┤ │ gids[]
|
|
│ ├────────────┤ │ │ │ │ ... │ │ ┌─────────────┐
|
|
└───────►│ 0 │ │ │ │ S_IFBLK ──►├───────────────────┤ │ │ 0 │
|
|
├────────────┤ │ │ │ S_IFCHR │ │ │ ├─────────────┤
|
|
│ ... │ │ ┌─┼──┼─────────────┤ ... │ └────►│ 100 │
|
|
└────────────┘ │ │ │ │ │ │ ├─────────────┤
|
|
│ │ │ │ S_IFSOCK ──►├───────────────────┤ │ ... │
|
|
│ │ │ │ S_IFIFO │ │ └─────────────┘
|
|
symlinks[] │ │ │ │ │ ... │
|
|
┌────────────┐ │ │ │ │ │ │
|
|
│ "../foo" │ │ │ │ │ └───────────────────┘ chunks[]
|
|
├────────────┤ │ │ │ │ ┌──────────────┐
|
|
│ "foo/bar" │◄──┘ │ │ │ ┌────►│ block: 0 │
|
|
├────────────┤ │ └──┼──────────►(inode-off) │ │ offset: 1698 │
|
|
│ ... │ │ │ │ chunk_table[] │ │ size: 1012 │
|
|
└────────────┘ ▼ ▼ │ ┌─────────────┐ │ ├──────────────┤
|
|
(inode-off) (inode-off) └──────────►│ 0 ├─┘ ┌──►│ block: 0 │
|
|
│ │ ├─────────────┤ │ │ offset: 1604 │
|
|
devices[] │ │ shared_files_table[] │ 1 ├───┘ │ size: 94 │
|
|
┌────────────┐ │ │ ┌───────────┐ ├─────────────┤ ├──────────────┤
|
|
│ 0x0107 │ │ └────►│ 0 ├───┬─────►│ 2 ├───┬──►│ block: 0 │
|
|
├────────────┤ │ ├───────────┤ │ ├─────────────┤ │ │ offset: 0 │
|
|
│ 0x0502 │◄─────┘ │ 0 ├───┘ │ 2 ├───┘ │ size: 1517 │
|
|
├────────────┤ ├───────────┤ ├─────────────┤ ├──────────────┤
|
|
│ ... │ │ ... │ │ ... │ │ ... │
|
|
└────────────┘ └───────────┘ └─────────────┘ └──────────────┘
|
|
|
|
Thanks to the bit-packing, fields that are unused or only contain a
|
|
single (zero) value, e.g. a `group_index` that's always zero because
|
|
all files belong to the same group, does not occupy any space in the
|
|
metadata block.
|
|
|
|
### Determining Inode Offsets
|
|
|
|
Before you can start traversing the metadata, you need to determine
|
|
the offsets for symlinks, regular files, devices etc. in the `inodes`
|
|
list. The index into this list is the `inode_num` from `dir_entries`,
|
|
but you can perform direct lookups based on the inode number as well.
|
|
The `inodes` list is strictly in the following order:
|
|
|
|
- directory inodes (`S_IFDIR`)
|
|
- symlink inodes (`S_IFLNK`)
|
|
- regular *unique* file inodes (`S_IREG`)
|
|
- regular *shared* file inodes (`S_IREG`)
|
|
- character/block device inodes (`S_IFCHR`, `S_IFBLK`)
|
|
- socket/pipe inodes (`S_IFSOCK`, `S_IFIFO`)
|
|
|
|
The offsets can thus be found by using a binary search with a
|
|
predicate on the inode mode. The shared file offset can be found
|
|
by subtracting the length of `shared_files_table` from the total
|
|
number of regular files.
|
|
|
|
### Unique and Shared File Inodes
|
|
|
|
The difference between *unique* and *shared* file inodes is that
|
|
there is only one *unique* file inode that references a particular
|
|
index in the `chunk_table`, whereas there are multiple *shared*
|
|
file inodes that will reference the same index. This is how DwarFS
|
|
implements file-level de-duplication beyond hardlinks. Hardlinks
|
|
share the same inode. Duplicate files that are not hardlinked each
|
|
have a unique inode, but still reference the same content through
|
|
the `chunk_table`.
|
|
|
|
The `shared_files_table` provides the necessary indirection that
|
|
maps a *shared* file inode to a `chunk_table` index.
|
|
|
|
### Traversing the Metadata
|
|
|
|
You typically start at the root directory which is at `dir_entries[0]`,
|
|
`inodes[0]` and `directories[0]`. Note that the root directory
|
|
implicitly has no name, so that `dir_entries[0].name_index`
|
|
should not be used.
|
|
|
|
To determine the contents of a directory, we determine the range
|
|
of entries from `directories[inode_num].first_entry` to
|
|
`directories[inode_num + 1].first_entry`. If both values are equal,
|
|
the directory is empty. Otherwise, we can look up the entries in
|
|
`dir_entries[]`.
|
|
|
|
So for directory inodes, you can directly index into `directories`
|
|
using the inode number.
|
|
|
|
For link inodes, you can index into `symlink_table`, but you have
|
|
to adjust the index for the link inode offset determined before:
|
|
|
|
link_index = symlink_table[inode_num - link_inode_offset]
|
|
|
|
With that, you can look up the contents of the symlink:
|
|
|
|
contents = symlinks[link_index]
|
|
|
|
For *unique* regular file inodes, you can index into `chunk_table`
|
|
after adjusting the index:
|
|
|
|
chunk_index = inode_num - file_inode_offset
|
|
|
|
For *shared* regular file inodes, you can index into the (unpacked)
|
|
`shared_files_table`:
|
|
|
|
shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]
|
|
|
|
Then, you can index into `chunk_table`, but you need to adjust the
|
|
index once more:
|
|
|
|
chunk_index = shared_index + num_unique_files
|
|
|
|
The range of chunks that make up a regular file inode is
|
|
`chunk_table[chunk_index]` to `chunk_table[chunk_index + 1]`. If
|
|
these values are equal, the file is empty. Otherwise, you need
|
|
to look up the range of chunks in `chunks`.
|
|
|
|
Each chunk references a range of bytes in one file system `BLOCK`.
|
|
These need to be concatenated to produce the file contents.
|
|
|
|
Both `chunk_table` and `directories` have a sentinel entry at the
|
|
end to make sure you can perform range lookups for all indices.
|
|
|
|
Last but not least, to read the device id for a device inode, you
|
|
can index into `devices`:
|
|
|
|
device_id = devices[inode_num - device_inode_offset]
|
|
|
|
## OPTIONALLY PACKED STRUCTURES
|
|
|
|
The overview above assumes metadata without any additional packing,
|
|
which can be produced using:
|
|
|
|
mkdwarfs --pack-metadata=none,plain
|
|
|
|
However, this isn't the default, and parts of the metadata are
|
|
likely stored in a packed format. These are mostly easy to unpack.
|
|
|
|
### Shared Files Table Packing
|
|
|
|
The `shared_files_table` can be stored in a packed format that
|
|
only encodes the number of shared links to a `chunk_table` index.
|
|
As the minimum number of links is always 2 (otherwise it wouldn't
|
|
be shared), the numbers in the packed format are additionally
|
|
offset by 2. So for example, a packed table like
|
|
|
|
[0, 3, 1, 0, 1]
|
|
|
|
would unpack to:
|
|
|
|
[0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
|
|
|
|
The packed format is used when `options.packed_shared_files_table`
|
|
is true.
|
|
|
|
### Directories Packing
|
|
|
|
The `directories` table, when stored in packed format, omits
|
|
all `parent_entry` fields and uses delta compression for the
|
|
`first_entry` fields.
|
|
|
|
In order to unpack all information, you first have to delta-
|
|
decompress the `first_entry` fields, then traverse the whole
|
|
directory tree once to fill in the `parent_entry` fields.
|
|
This sounds like a lot of work, but it's actually reasonably
|
|
fast. For example, for a file system with 15 million entries
|
|
in 90,000 directories, reconstructing the `directories` takes
|
|
only about 50 milliseconds.
|
|
|
|
The packed format is used when `options.packed_directories`
|
|
is true.
|
|
|
|
### Chunk Table Packing
|
|
|
|
The `chunk_table` can also be stored delta-compressed and
|
|
must be unpacked accordingly.
|
|
|
|
The packed format is used when `options.packed_chunk_table`
|
|
is true.
|
|
|
|
### Names and Symlinks String Table Packing
|
|
|
|
Both the `names` and `symlinks` tables can be stored in a
|
|
packed format in `compact_names` and `compact_symlinks`.
|
|
|
|
There are two separate packing schemes which can be combined.
|
|
If none of these schemes is active, the difference between
|
|
e.g. `names` and `compact_names` is that the former is stored
|
|
as a "proper" list, whereas the latter is stored as a single
|
|
string plus an index of offsets. As lists of strings store
|
|
both offset and length for each element, this already saves
|
|
the storage for the length fields, which can easily be
|
|
determined from the offsets at run-time.
|
|
|
|
If the `packed_index` scheme is used in addition, the index
|
|
is stored delta-compressed.
|
|
|
|
Last but not least, the individual strings can be compressed
|
|
as well. The [fsst library](https://github.com/cwida/fsst)
|
|
allows for compression of short strings with random access
|
|
and is typically able to reduce the overall size of the
|
|
string tables by 50%, using a dictionary that is only a few
|
|
hundred bytes long. If a `symtab` is set for the string table,
|
|
this compression is used.
|
|
|
|
## AUTHOR
|
|
|
|
Written by Marcus Holland-Moritz.
|
|
|
|
## COPYRIGHT
|
|
|
|
Copyright (C) Marcus Holland-Moritz.
|
|
|
|
## SEE ALSO
|
|
|
|
mkdwarfs(1), dwarfs(1), dwarfsextract(1), dwarfsck(1)
|