Add file system format documentation

2025-09-13 14:27:30 -04:00 · 2021-03-17 18:06:01 +01:00 · 2021-03-17 18:06:01 +01:00 · 282fc33ca3
commit 282fc33ca3
parent 6ef5361fc5
1 changed files with 246 additions and 0 deletions
--- a/doc/dwarfs-format.md
+++ b/doc/dwarfs-format.md
@@ -0,0 +1,246 @@
+# DwarFS File System Format v2.3
+
+## File Structure
+
+A DwarFS file system image is just a sequence of blocks. Each block has the
+following format:
+
+         ┌───┬───┬───┬───┬───┬───┬───┬───┐
+    0x00 │'D'│'W'│'A'│'R'│'F'│'S'│MAJ│MIN│  MAJ=0x02, MIN=0x03 for v2.3
+         ├───┴───┴───┴───┴───┴───┴───┴───┤
+    0x08 │                               │  Used for full (slow) integrity
+         ├─ SHA-512/256 integrity hash  ─┤  check with `dwarfsck`.
+    0x10 │  over the remainder of the    │
+         ├─ block data, starting at     ─┤
+    0x18 │  offset 0x28.                 │
+         ├─                             ─┤
+    0x20 │                               │
+         ├───────────────────────────────┤
+    0x28 │  XXH3-64 hash over remainder  │  Used for fast integrity check.
+         ├───────────────┬───────┬───────┤
+    0x30 │Section Number │SecType│CompAlg│  All integer fields are in LE
+         ├───────────────┴───────┴───────┤  byte order.
+    0x38 │   Length of remaining data    │
+         ├───────────────────────────────┤
+    0x40 │                               │
+         │ Section data compressed using │
+         │ CompAlg algorithm.            │
+         │                               │
+         │                               │
+         │                               │
+         └───────────────────────────────┘
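+
+As a rough illustration of the layout above, here is a minimal Python
+sketch that unpacks the 64-byte header and walks an image by following
+the length field. The field widths (32-bit section number, 16-bit
+section type and compression algorithm, 64-bit length) are read off the
+cell widths in the diagram and are an assumption, as are the names
+`SectionHeader` and `iter_sections`. The sketch neither verifies the
+hashes nor decompresses the section data.
+

```python
import struct
from typing import Iterator, NamedTuple, Tuple

# Assumed field layout, little-endian:
# magic(6) major(1) minor(1) sha512_256(32) xxh3_64(8)
# section_number(u32) section_type(u16) compression(u16) length(u64)
HEADER = struct.Struct("<6sBB32s8sIHHQ")
assert HEADER.size == 0x40  # section data starts at offset 0x40


class SectionHeader(NamedTuple):
    magic: bytes
    major: int
    minor: int
    sha512_256: bytes
    xxh3_64: bytes
    number: int
    section_type: int
    compression: int
    length: int


def iter_sections(image: bytes) -> Iterator[Tuple[int, SectionHeader]]:
    """Yield (data_offset, header) for every section in the image."""
    pos = 0
    while pos < len(image):
        header = SectionHeader._make(HEADER.unpack_from(image, pos))
        if header.magic != b"DWARFS":
            raise ValueError(f"bad magic at offset {pos:#x}")
        data_offset = pos + HEADER.size
        yield data_offset, header
        # No padding between blocks: the next header follows the data.
        pos = data_offset + header.length
```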
+
+A couple of notes:
+
+- No padding is added between blocks.
+
+- The list of blocks can easily be traversed by using the length field
+  to skip to the start of the next section.
+
+- Corruption can easily be detected using the XXH3-64 hash. Computation
+  of this hash is so fast that it is in fact checked every single time a
+  file system block is loaded.
+
+- Integrity can furthermore be checked using the SHA-512/256 hash. This
+  is much slower, but should rarely be needed.
+
+- All header fields, except for the magic and version number, are
+  protected by the hashes.
+
+- In case of corruption, sections can easily be retrieved by scanning
+  for the magic. The version number can be recovered by looking at all
+  sections and choosing the majority. The explicit section number helps
+  to recover data if multiple sections are missing.
+
+- A major version number change will render the format incompatible.
+
+- A minor version number change is backwards compatible: a new program
+  will still read all file systems with a smaller minor version number.
+  However, an old program will refuse to read a file system with a minor
+  version larger than the one it supports.
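+
+The recovery and compatibility notes above could be sketched in Python
+roughly as follows. The function names are hypothetical and this is not
+how `dwarfsck` is implemented. A real recovery tool would also have to
+validate each candidate with the hashes, since the magic bytes can occur
+by chance inside compressed data.
+

```python
from collections import Counter

MAGIC = b"DWARFS"


def find_section_candidates(image: bytes) -> list:
    """Return every offset at which the magic occurs."""
    offsets = []
    pos = image.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = image.find(MAGIC, pos + 1)
    return offsets


def majority_version(image: bytes, offsets: list) -> tuple:
    """Pick the (major, minor) pair that follows the magic most often."""
    votes = Counter(
        (image[o + 6], image[o + 7]) for o in offsets if o + 8 <= len(image)
    )
    if not votes:
        raise ValueError("no section candidates found")
    return votes.most_common(1)[0][0]


def can_read(image_version: tuple, supported_version: tuple) -> bool:
    """Same major version, and a minor version no newer than supported."""
    image_major, image_minor = image_version
    major, minor = supported_version
    return image_major == major and image_minor <= minor
```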
+
+
+### Section Types
+
+There are currently 3 different section types.
+
+#### `BLOCK` (0)
+
+A block of data. This is where all file data is stored. There can be
+an arbitrary number of blocks of this type.
+
+#### `METADATA_V2_SCHEMA` (7)
+
+The schema used to lay out the `METADATA_V2` block contents. This is
+stored in "compact" thrift encoding.
+
+#### `METADATA_V2` (8)
+
+This section contains the bulk of the metadata. It's essentially just
+a collection of bit-packed arrays and structures. The exact layout of
+each list and structure depends on the actual data and is stored
+separately in `METADATA_V2_SCHEMA`.
+
+Here is a high-level overview of how all the bits and pieces relate
+to each other:
+
+    ═════════════           ┌─────────────────────────────────────────────────────────────────────────┐
+     DwarFS v2.3            │                                                                         │
+    ═════════════           │         ┌───────────────────────────────────────────┐                   │
+                            │         │                                           │                   │
+              dir_entries[] ▼         │              inodes[]                     │   directories[]   │
+    ╔════╗   ┌────────────────┐       │  S_IFDIR ──►┌───────────────────┐         │  ┌────────────────┴─┐
+    ║root╟──►│ name_index:  0 │       │             │ mode_index:     0 ├──────┐  └─►│ parent_entry:  0 │
+    ╚════╝   │ inode_num:   0 ├───────┴────────────►│ owner_index:    0 │      │     │ first_entry:   1 │
+             ├────────────────┤                     │ group_index:    0 │      │     ├──────────────────┤
+         ┌───┤ name_index:  2 │                     │ atime_offset:   0 │      │     │ parent_entry:  0 │
+    ┌────┼───┤ inode_num:   5 ├───────┐             │ mtime_offset: 417 │      │     │ first_entry:  11 │
+    │    │   ├────────────────┤       │             │ ctime_offset:   0 │      │     ├──────────────────┤
+    │ ┌──┼───┤ name_index:  3 │       │             ├───────────────────┤      │     │ parent_entry:  5 │
+    │ │  │   │ inode_num:   9 ├────┐  │             │        ...        │      │     │ first_entry:  12 │
+    │ │  │   ├────────────────┤    │  │  S_IFLNK ──►├───────────────────┤      │     ├──────────────────┤
+    │ │  │   │                │    │  │             │ mode_index:     2 │      │     │                  │
+    │ │  │   │      ...       │    │  └────────────►│ owner_index:    2 │      │     │       ...        │
+    │ │  │   │                │    │                │ group_index:    0 │      │     │                  │
+    │ │  │   └────────────────┘    │                │ atime_offset:   0 │      │     └──────────────────┘
+    │ │  │                         │                │ mtime_offset: 298 │      │
+    │ │  │                         │                │ ctime_offset:   0 │      │
+    │ │  │    names[]              │                ├───────────────────┤      │      modes[]
+    │ │  │   ┌────────────┐        │                │        ...        │      │     ┌─────────────┐
+    │ │  │   │ "usr"      │        │     S_IFREG ──►├───────────────────┤      └────►│   0040775   │
+    │ │  │   ├────────────┤        │     (unique)   │ mode_index:     1 │            ├─────────────┤
+    │ │  │   │ "share"    │        ├───────────────►│ owner_index:    0 ├──────┐     │   0100644   │
+    │ │  │   ├────────────┤        │                │ group_index:    0 │      │     ├─────────────┤
+    │ │  └──►│ "words"    │        │                │ atime_offset:   0 │      │     │     ...     │
+    │ │      ├────────────┤        │                │ mtime_offset: 298 │      │     └─────────────┘
+    │ └─────►│ "lib"      │        │                │ ctime_offset:   0 │      │
+    │        ├────────────┤        │                ├───────────────────┤      │      uids[]
+    │        │ "ls"       │        │                │        ...        │      │     ┌─────────────┐
+    │        ├────────────┤        │     S_IFREG ──►├───────────────────┤      └────►│       0     │
+    │        │    ...     │        │  ┌──(shared)   │ mode_index:     4 │            ├─────────────┤
+    ▼        └────────────┘        │  │             │ owner_index:    2 │            │    1000     │
+    (inode-off)                    │  │             │ group_index:    1 ├──────┐     ├─────────────┤
+    │                              │  │             │ atime_offset:   0 │      │     │     ...     │
+    │         symlink_table[]      │  │             │ mtime_offset: 298 │      │     └─────────────┘
+    │        ┌────────────┐        │  │             │ ctime_offset:   0 │      │
+    │        │      1     ├───┐    │  │             ├───────────────────┤      │      gids[]
+    │        ├────────────┤   │    │  │             │        ...        │      │     ┌─────────────┐
+    └───────►│      0     │   │    │  │  S_IFBLK ──►├───────────────────┤      │     │       0     │
+             ├────────────┤   │    │  │  S_IFCHR    │                   │      │     ├─────────────┤
+             │    ...     │   │  ┌─┼──┼─────────────┤        ...        │      └────►│     100     │
+             └────────────┘   │  │ │  │             │                   │            ├─────────────┤
+                              │  │ │  │ S_IFSOCK ──►├───────────────────┤            │     ...     │
+                              │  │ │  │  S_IFIFO    │                   │            └─────────────┘
+              symlinks[]      │  │ │  │             │        ...        │
+             ┌────────────┐   │  │ │  │             │                   │
+             │ "../foo"   │   │  │ │  │             └───────────────────┘                 chunks[]
+             ├────────────┤   │  │ │  │                                                  ┌──────────────┐
+             │ "foo/bar"  │◄──┘  │ │  │                                            ┌────►│ block:     0 │
+             ├────────────┤      │ └──┼──────────►(inode-off)                      │     │ offset: 1698 │
+             │    ...     │      │    │                │            chunk_table[]  │     │ size:   1012 │
+             └────────────┘      │    ▼                │           ┌─────────────┐ │     ├──────────────┤
+                       (inode-off)    (inode-off)      └──────────►│      0      ├─┘ ┌──►│ block:     0 │
+                                 │    │                            ├─────────────┤   │   │ offset: 1604 │
+              devices[]          │    │      shared_files_table[]  │      1      ├───┘   │ size:     94 │
+             ┌────────────┐      │    │     ┌───────────┐          ├─────────────┤       ├──────────────┤
+             │   0x0107   │      │    └────►│     0     ├───┬─────►│      2      ├───┬──►│ block:     0 │
+             ├────────────┤      │          ├───────────┤   │      ├─────────────┤   │   │ offset:    0 │
+             │   0x0502   │◄─────┘          │     0     ├───┘      │      2      ├───┘   │ size:   1517 │
+             ├────────────┤                 ├───────────┤          ├─────────────┤       ├──────────────┤
+             │    ...     │                 │    ...    │          │     ...     │       │     ...      │
+             └────────────┘                 └───────────┘          └─────────────┘       └──────────────┘
+
+Thanks to the bit-packing, fields that are unused or only contain a
+single (zero) value, e.g. a `group_index` that's always zero because
+all files belong to the same group, do not occupy any space.
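+
+To make the idea concrete, here is a purely illustrative Python sketch
+of reading records from a bit-packed array in which a field may have a
+width of zero bits. The bit order (least significant bit first), the
+record layout and the names `read_bits` and `read_record` are all
+assumptions made for this example. The real layout is defined by the
+schema and is not reproduced here.
+

```python
def read_bits(buf: bytes, bit_offset: int, width: int) -> int:
    """Read `width` bits starting at `bit_offset`, LSB first."""
    if width == 0:
        return 0  # a zero-width field occupies no space and reads as 0
    first = bit_offset // 8
    last = (bit_offset + width + 7) // 8
    word = int.from_bytes(buf[first:last], "little")
    return (word >> (bit_offset % 8)) & ((1 << width) - 1)


def read_record(buf: bytes, index: int, widths: dict) -> dict:
    """Decode record `index`; `widths` maps field name to bit width."""
    record_bits = sum(widths.values())
    bit_offset = index * record_bits
    record = {}
    for name, width in widths.items():
        record[name] = read_bits(buf, bit_offset, width)
        bit_offset += width
    return record
```

+
+With widths such as `{"mode_index": 2, "group_index": 0,
+"mtime_offset": 9}`, each record takes 11 bits, and `group_index`
+always decodes to zero without using any storage.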
+
+Before you can start traversing the metadata, you need to determine
+the offsets for symlinks, regular files, devices etc. in the `inodes`
+list. The index into this list is the `inode_num` from `dir_entries`,
+but you can perform direct lookups based on the inode number as well.
+The `inodes` list is strictly in the following order:
+
+* directory inodes (`S_IFDIR`)
+
+* symlink inodes (`S_IFLNK`)
+
+* regular *unique* file inodes (`S_IFREG`)
+
+* regular *shared* file inodes (`S_IFREG`)
+
+* character/block device inodes
+
+* socket/pipe inodes
+
+The offsets can thus be found using a simple binary search.
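+
+One way to sketch that binary search in Python, assuming the metadata
+has already been decoded into plain lists where each inode is a dict
+with a `mode_index` key (a hypothetical representation chosen for this
+example). Note that the mode alone cannot distinguish *unique* from
+*shared* regular files, so this sketch only finds the start of the
+regular file range as a whole.
+

```python
import stat


def inode_rank(mode: int) -> int:
    """Position of an inode's type in the documented `inodes` ordering."""
    if stat.S_ISDIR(mode):
        return 0
    if stat.S_ISLNK(mode):
        return 1
    if stat.S_ISREG(mode):
        return 2  # unique and shared files have the same file type
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return 3
    return 4  # sockets and pipes


def first_inode_with_rank(inodes: list, modes: list, rank: int) -> int:
    """Lower-bound binary search: first index whose rank is >= `rank`."""
    lo, hi = 0, len(inodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if inode_rank(modes[inodes[mid]["mode_index"]]) < rank:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

+
+Calling `first_inode_with_rank(inodes, modes, 1)` would then give the
+link inode offset, rank 2 the file inode offset, and rank 3 the device
+inode offset.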
+
+The difference between *unique* and *shared* file inodes is that
+there is only one *unique* file inode that references a particular
+index in the `chunk_table`, whereas there are multiple *shared*
+file inodes that will reference the same index. This is how DwarFS
+implements file-level de-duplication beyond hardlinks. Hardlinks
+share the same inode. Duplicate files that are not hardlinked each
+have their own inode, but still reference the same content through
+the `chunk_table`.
+
+The `shared_files_table` provides the necessary indirection that
+maps a *shared* file inode to a `chunk_table` index. However, the
+`shared_files_table` is stored in a packed format that only encodes
+the number of shared links to a `chunk_table` index, so it must be
+unpacked first.
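+
+A minimal Python sketch of the unpacking step, assuming each packed
+entry is a repeat count for one shared `chunk_table` index. The text
+above does not say whether the stored count is biased (for instance
+stored as the count minus some minimum), so the `bias` parameter is a
+placeholder for that unknown and the function name is hypothetical.
+

```python
def unpack_shared_files(packed: list, bias: int = 0) -> list:
    """Expand per-index link counts into one entry per shared inode."""
    unpacked = []
    for index, count in enumerate(packed):
        # `count + bias` shared inodes all refer to the same index.
        unpacked.extend([index] * (count + bias))
    return unpacked
```

+
+The result has one element per *shared* file inode, in inode order, and
+each element is the index shared by that inode.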
+
+Once the offsets have been determined and the `shared_files_table`
+is unpacked, you can start traversing the metadata. Typically, you
+would start at the root directory, which is at `dir_entries[0]`,
+`inodes[0]` and `directories[0]`. Note that the root directory
+implicitly has no name, so that `dir_entries[0].name_index`
+shouldn't be used.
+
+To determine the contents of a directory, we determine the range
+of entries from `directories[inode_num].first_entry` to
+`directories[inode_num + 1].first_entry`. If both values are equal,
+the directory is empty. Otherwise, we can look up the entries in
+`dir_entries[]`.
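+
+The range lookup can be sketched in Python as follows, assuming the
+metadata has been decoded into a dict of plain lists named after the
+tables in the overview. This `meta` representation and the function
+name are hypothetical.
+

```python
def list_directory(meta: dict, inode_num: int) -> list:
    """Return (name, inode_num) pairs for the entries of a directory."""
    directories = meta["directories"]
    first = directories[inode_num]["first_entry"]
    last = directories[inode_num + 1]["first_entry"]
    # An empty range (first == last) means the directory is empty.
    return [
        (meta["names"][entry["name_index"]], entry["inode_num"])
        for entry in meta["dir_entries"][first:last]
    ]
```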
+
+So for directory inodes, you can directly index into `directories`
+using the inode number.
+
+For link inodes, you can index into `symlink_table`, but you have
+to adjust the index for the link inode offset determined before:
+
+    link_index = symlink_table[inode_num - link_inode_offset]
+
+With that, you can look up the contents of the symlink:
+
+    contents = symlinks[link_index]
+
+For *unique* regular file inodes, you can index into `chunk_table`
+after adjusting the index:
+
+    chunk_index = inode_num - file_inode_offset
+
+For *shared* regular file inodes, you can index into the unpacked
+`shared_files_table`:
+
+    shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]
+
+Then, you can index into `chunk_table`, but you need to adjust the
+index once more:
+
+    chunk_index = shared_index + num_unique_files
+
+The range of chunks that make up a regular file inode is
+`chunk_table[chunk_index]` to `chunk_table[chunk_index + 1]`. If
+these values are equal, the file is empty. Otherwise, you need
+to look up the range of chunks in `chunks`.
+
+Each chunk references a range of bytes in one file system `BLOCK`.
+These need to be concatenated to produce the file contents.
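+
+Putting the file lookups together, a minimal Python sketch might look
+like this. It assumes the metadata is available as a dict of plain
+lists, that `shared_files` is the already unpacked table, and that
+`blocks` is a list of decompressed `BLOCK` section payloads indexed by
+block number. These representations and both function names are
+hypothetical.
+

```python
def chunk_index_for(inode_num: int, file_inode_offset: int,
                    num_unique_files: int, shared_files: list) -> int:
    """Map a regular file inode number to its `chunk_table` index."""
    index = inode_num - file_inode_offset
    if index < num_unique_files:
        return index  # unique file: direct lookup
    # Shared file: go through the unpacked table, then adjust once more.
    return shared_files[index - num_unique_files] + num_unique_files


def read_file(meta: dict, blocks: list, chunk_index: int) -> bytes:
    """Concatenate the byte ranges of all chunks belonging to a file."""
    first = meta["chunk_table"][chunk_index]
    last = meta["chunk_table"][chunk_index + 1]
    return b"".join(
        blocks[c["block"]][c["offset"]:c["offset"] + c["size"]]
        for c in meta["chunks"][first:last]
    )
```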
+
+Both `chunk_table` and `directories` have a sentinel entry at the
+end to make sure you can perform range lookups for all indices.
+
+Last but not least, to read the device id for a device inode, you
+can index into `devices`:
+
+    device_id = devices[inode_num - device_inode_offset]