Add file system format documentation

2025-09-14 06:48:39 -04:00 · 2021-03-17 18:06:01 +01:00 · 2021-03-17 18:06:01 +01:00 · 282fc33ca3
commit 282fc33ca3
parent 6ef5361fc5
1 changed files with 246 additions and 0 deletions
--- a/doc/dwarfs-format.md
+++ b/doc/dwarfs-format.md
@ -0,0 +1,246 @@
 # DwarFS File System Format v2.3
 ## File Structure
 A DwarFS file system image is just a sequence of blocks. Each block has the
 following format:
         ┌───┬───┬───┬───┬───┬───┬───┬───┐
    0x00 │'D'│'W'│'A'│'R'│'F'│'S'│MAJ│MIN│  MAJ=0x02, MIN=0x03 for v2.3
         ├───┴───┴───┴───┴───┴───┴───┴───┤
    0x08 │                               │  Used for full (slow) integrity
         ├─ SHA-512/256 integrity hash  ─┤  check with `dwarfsck`.
    0x10 │  over the remainder of the    │
         ├─ block data, starting at     ─┤
    0x18 │  offset 0x28.                 │
         ├─                             ─┤
    0x20 │                               │
         ├───────────────────────────────┤
    0x28 │  XXH3-64 hash over remainder  │  Used for fast integrity check.
         ├───────────────┬───────┬───────┤
    0x30 │Section Number │SecType│CompAlg│  All integer fields are in LE
         ├───────────────┴───────┴───────┤  byte order.
    0x38 │   Length of remaining data    │
         ├───────────────────────────────┤
    0x40 │                               │
         │ Section data compressed using │
         │ CompAlg algorithm.            │
         │                               │
         │                               │
         │                               │
         └───────────────────────────────┘
 A couple of notes:
 - No padding is added between blocks.
 - The list of blocks can easily be traversed by using the length field
  to skip to the start of the next section.
 - Corruption can easily be detected using the XXH3-64 hash. Computation
  of this hash is so fast that it is in fact checked every single time a
  file system block is loaded.
 - Integrity can furthermore be checked using the SHA-512/256 hash. This
  is much slower, but should rarely be needed.
 - All header fields, except for the magic and version number, are
  protected by the hashes.
 - In case of corruption, sections can easily be retrieved by scanning
  for the magic. The version number can be recovered by looking at all
  sections and choosing the majority. The explicit section number helps
  to recover data if multiple sections are missing.
 - A major version number change will render the format incompatible.
 - A minor version number change will be backwards compatible, i.e. an
  old program will refuse to read a file system with a minor version
  larger than the one it supports. However, a new program will still
  read all file systems with a smaller minor version number.
 ### Section Types
 There are currently 3 different section types.
 #### `BLOCK` (0)
 A block of data. This is where all file data is stored. There can be
 an arbitrary number of blocks of this type.
 #### `METADATA_V2_SCHEMA` (7)
 The schema used to layout the `METADATA_V2` block contents. This is
 stored in "compact" thrift encoding.
 #### `METADATA_V2` (8)
 This section contains the bulk of the metadata. It's essentially just
 a collection of bit-packed arrays and structures. The exact layout of
 each list and structure depends on the actual data and is stored
 separately in `METADATA_V2_SCHEMA`.
 Here is a high-level overview of how all the bits and pieces relate
 to each other:
    ═════════════           ┌─────────────────────────────────────────────────────────────────────────┐
     DwarFS v2.3            │                                                                         │
    ═════════════           │         ┌───────────────────────────────────────────┐                   │
                            │         │                                           │                   │
              dir_entries[] ▼         │              inodes[]                     │   directories[]   │
    ╔════╗   ┌────────────────┐       │  S_IFDIR ──►┌───────────────────┐         │  ┌────────────────┴─┐
    ║root╟──►│ name_index:  0 │       │             │ mode_index:     0 ├──────┐  └─►│ parent_entry:  0 │
    ╚════╝   │ inode_num:   0 ├───────┴────────────►│ owner_index:    0 │      │     │ first_entry:   1 │
             ├────────────────┤                     │ group_index:    0 │      │     ├──────────────────┤
         ┌───┤ name_index:  2 │                     │ atime_offset:   0 │      │     │ parent_entry:  0 │
    ┌────┼───┤ inode_num:   5 ├───────┐             │ mtime_offset: 417 │      │     │ first_entry:  11 │
    │    │   ├────────────────┤       │             │ ctime_offset:   0 │      │     ├──────────────────┤
    │ ┌──┼───┤ name_index:  3 │       │             ├───────────────────┤      │     │ parent_entry:  5 │
    │ │  │   │ inode_num:   9 ├────┐  │             │        ...        │      │     │ first_entry:  12 │
    │ │  │   ├────────────────┤    │  │  S_IFLNK ──►├───────────────────┤      │     ├──────────────────┤
    │ │  │   │                │    │  │             │ mode_index:     2 │      │     │                  │
    │ │  │   │      ...       │    │  └────────────►│ owner_index:    2 │      │     │       ...        │
    │ │  │   │                │    │                │ group_index:    0 │      │     │                  │
    │ │  │   └────────────────┘    │                │ atime_offset:   0 │      │     └──────────────────┘
    │ │  │                         │                │ mtime_offset: 298 │      │
    │ │  │                         │                │ ctime_offset:   0 │      │
    │ │  │    names[]              │                ├───────────────────┤      │      modes[]
    │ │  │   ┌────────────┐        │                │        ...        │      │     ┌─────────────┐
    │ │  │   │ "usr"      │        │     S_IFREG ──►├───────────────────┤      └────►│   0040775   │
    │ │  │   ├────────────┤        │     (unique)   │ mode_index:     1 │            ├─────────────┤
    │ │  │   │ "share"    │        ├───────────────►│ owner_index:    0 ├──────┐     │   0100644   │
    │ │  │   ├────────────┤        │                │ group_index:    0 │      │     ├─────────────┤
    │ │  └──►│ "words"    │        │                │ atime_offset:   0 │      │     │     ...     │
    │ │      ├────────────┤        │                │ mtime_offset: 298 │      │     └─────────────┘
    │ └─────►│ "lib"      │        │                │ ctime_offset:   0 │      │
    │        ├────────────┤        │                ├───────────────────┤      │      uids[]
    │        │ "ls"       │        │                │        ...        │      │     ┌─────────────┐
    │        ├────────────┤        │     S_IFREG ──►├───────────────────┤      └────►│       0     │
    │        │    ...     │        │  ┌──(shared)   │ mode_index:     4 │            ├─────────────┤
    ▼        └────────────┘        │  │             │ owner_index:    2 │            │    1000     │
    (inode-off)                    │  │             │ group_index:    1 ├──────┐     ├─────────────┤
    │                              │  │             │ atime_offset:   0 │      │     │     ...     │
    │         symlink_table[]      │  │             │ mtime_offset: 298 │      │     └─────────────┘
    │        ┌────────────┐        │  │             │ ctime_offset:   0 │      │
    │        │      1     ├───┐    │  │             ├───────────────────┤      │      gids[]
    │        ├────────────┤   │    │  │             │        ...        │      │     ┌─────────────┐
    └───────►│      0     │   │    │  │  S_IFBLK ──►├───────────────────┤      │     │       0     │
             ├────────────┤   │    │  │  S_IFCHR    │                   │      │     ├─────────────┤
             │    ...     │   │  ┌─┼──┼─────────────┤        ...        │      └────►│     100     │
             └────────────┘   │  │ │  │             │                   │            ├─────────────┤
                              │  │ │  │ S_IFSOCK ──►├───────────────────┤            │     ...     │
                              │  │ │  │  S_IFIFO    │                   │            └─────────────┘
              symlinks[]      │  │ │  │             │        ...        │
             ┌────────────┐   │  │ │  │             │                   │
             │ "../foo"   │   │  │ │  │             └───────────────────┘                 chunks[]
             ├────────────┤   │  │ │  │                                                  ┌──────────────┐
             │ "foo/bar"  │◄──┘  │ │  │                                            ┌────►│ block:     0 │
             ├────────────┤      │ └──┼──────────►(inode-off)                      │     │ offset: 1698 │
             │    ...     │      │    │                │            chunk_table[]  │     │ size:   1012 │
             └────────────┘      │    ▼                │           ┌─────────────┐ │     ├──────────────┤
                       (inode-off)    (inode-off)      └──────────►│      0      ├─┘ ┌──►│ block:     0 │
                                 │    │                            ├─────────────┤   │   │ offset: 1604 │
              devices[]          │    │      shared_files_table[]  │      1      ├───┘   │ size:     94 │
             ┌────────────┐      │    │     ┌───────────┐          ├─────────────┤       ├──────────────┤
             │   0x0107   │      │    └────►│     0     ├───┬─────►│      2      ├───┬──►│ block:     0 │
             ├────────────┤      │          ├───────────┤   │      ├─────────────┤   │   │ offset:    0 │
             │   0x0502   │◄─────┘          │     0     ├───┘      │      2      ├───┘   │ size:   1517 │
             ├────────────┤                 ├───────────┤          ├─────────────┤       ├──────────────┤
             │    ...     │                 │    ...    │          │     ...     │       │     ...      │
             └────────────┘                 └───────────┘          └─────────────┘       └──────────────┘
 Thanks to the bit-packing, fields that are unused or only contain a
 single (zero) value, e.g. a `group_index` that's always zero because
 all files belong to the same group, do not occupy any space.
 Before you can start traversing the metadata, you need to determine
 the offsets for symlinks, regular files, devices etc. in the `inodes`
 list. The index into this list is the `inode_num` from `dir_entries`,
 but you can perform direct lookups based on the inode number as well.
 The `inodes` list is strictly in the following order:
 * directory inodes (`S_IFDIR`)
 * symlink inodes (`S_IFLNK`)
 * regular *unique* file inodes (`S_IREG`)
 * regular *shared* file inodes (`S_IREG`)
 * character/block device inodes
 * socket/pipe inodes
 The offsets can thus be found using a simple binary search.
 The difference between *unique* and *shared* file inodes is that
 there is only one *unique* file inode that references a particular
 index in the `chunk_table`, whereas there are multiple *shared*
 file inodes that will reference the same index. This is how DwarFS
 implements file-level de-duplication beyond hardlinks. Hardlinks
 share the same inode. Duplicate files that are not hardlinked all
 have a unique inode, but still reference the same content through
 the `chunk_table`.
 The `shared_files_table` provides the necessary indirection that
 maps a *shared* file inode to a `chunk_table` index. However, the
 `shared_files_table` is stored in a packed format that only encodes
 the number of shared links to a `chunk_table` index, so it must be
 unpacked first.
 Once the offsets have been determined and the `shared_files_table`
 is unpacked, you can start traversing the metadata. Typically, you
 would start a the root directory which is at `dir_entries[0]`,
 `inodes[0]` and `directories[0]`. Note that the root directory
 implicitly has no name, so that `dir_entries[0].name_index`
 shouldn't be used.
 To determine the contents of a directory, we determine the range
 of entries from `directories[inode_num].first_entry` to
 `directories[inode_num + 1].first_entry`. If both values are equal,
 the directory is empty. Otherwise, we can look up the entries in
 `dir_entries[]`.
 So for directory inodes, you can directly index into `directories`
 using the inode number.
 For link inodes, you can index into `symlink_table`, but you have
 to adjust the index for the link inode offset determined before:
    link_index = symlink_table[inode_num - link_inode_offset]
 With that, you can look up the contents of the symlink:
    contents = symlinks[link_index]
 For *unique* regular file inodes, you can index into `chunk_table`
 after adjusting the index:
    chunk_index = inode_num - file_inode_offset
 For *shared* regular file inodes, you can index into the unpacked
 `shared_files_table`:
    shared_index = shared_files[inode_num - file_inode_offset - num_unique_files]
 The, you can index into `chunk_table`, but you need to adjust the
 index once more:
    chunk_index = shared_index + num_unique_files
 The range of chunks that make up a regular file inode is
 `chunk_table[chunk_index]` to `chunk_table[chunk_index + 1]`. If
 these values are equal, the file is empty. Otherwise, you need
 to look up the range of chunks in `chunks`.
 Each chunk references a range of bytes in one file system `BLOCK`.
 These need to be concatenated to produce the file contents.
 Both `chunk_table` and `directories` have a sentinel entry at the
 end to make sure you can perform range lookups for all indices.
 Last but not least, to read the device id for a device inode, you
 can index into `devices`:
    device_id = devices[inode_num - device_inode_offset]