Document mkdwarfs internals

2025-09-14 06:48:39 -04:00 · 2021-04-01 17:52:43 +02:00 · 2021-04-01 17:52:43 +02:00 · 8e376d77ad
commit 8e376d77ad
parent 89df6add69
1 changed files with 50 additions and 0 deletions
--- a/doc/mkdwarfs.md
+++ b/doc/mkdwarfs.md
@@ -321,6 +321,56 @@ is left uncompressed. This can be useful if mounting speed of the file
 system is important, as the uncompressed metadata part of the file can
 then simply be mapped into memory.
 ## INTERNAL OPERATION
 Internally, `mkdwarfs` runs in two completely separate phases. The first
 phase is scanning the input data, the second phase is building the file
 system.
 ### Scanning
 The scanning process is driven by the main thread which traverses the
 input directory recursively and builds an internal representation of the
 directory structure. Traversal is breadth-first and single-threaded.
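
The traversal described above can be sketched roughly as follows. This is a minimal Python illustration, not the `mkdwarfs` implementation: the function name `scan_breadth_first` and the `(path, kind)` result shape are assumptions made for the example.

```python
import os
from collections import deque


def scan_breadth_first(root):
    """Return (path, kind) pairs for everything under root, level by level."""
    entries = []
    queue = deque([root])  # directories still waiting to be read
    while queue:
        directory = queue.popleft()  # FIFO order gives breadth-first traversal
        with os.scandir(directory) as it:
            children = sorted(it, key=lambda e: e.name)
        for child in children:
            if child.is_dir(follow_symlinks=False):
                entries.append((child.path, "dir"))
                queue.append(child.path)  # visited after the current level
            elif child.is_file(follow_symlinks=False):
                entries.append((child.path, "file"))
            else:
                entries.append((child.path, "other"))  # symlinks, devices, ...
    return entries
```

A single queue on a single thread is enough here, since only the file contents (not the directory walk) are handed to worker threads.
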
 When a regular file is discovered, its hardlink count is checked and
 if greater than one, its inode is looked up in a hardlink cache. If the inode
 has not been scanned yet, a scanning job will be added to a pool of
 `--num-workers` worker threads. These will perform a SHA1 checksum scan
 first, which is then used to determine duplicate files, as these will
 share the same data in the final DwarFS image. If a file is found not
 to be a duplicate, it will now potentially be scanned again (by the
 same worker threads and using the same memory mapping) to generate a
 similarity hash value. This only happens if `--order` is set to one
 of the two similarity order modes.
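
As a rough illustration of the hardlink cache and the SHA1-based duplicate detection, consider the Python sketch below. It makes several assumptions: the names `sha1_of_file` and `find_unique_files` are hypothetical, a link count above one is taken as the sign of a hardlinked file, and plain file reads stand in for the memory mapping mentioned above. The similarity hash pass is not shown.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unique_files(paths, num_workers=4):
    """Map each SHA1 digest to the list of paths sharing that content."""
    hardlink_cache = {}  # (st_dev, st_ino) -> first path seen for that inode
    aliases = {}         # representative path -> all paths for that inode
    to_hash = []         # one representative path per on-disk inode
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if st.st_nlink > 1:
            if key in hardlink_cache:
                # Already scheduled via another name: no second scan needed.
                aliases[hardlink_cache[key]].append(path)
                continue
            hardlink_cache[key] = path
        aliases[path] = [path]
        to_hash.append(path)

    by_digest = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() hashes concurrently but yields digests in input order.
        for path, digest in zip(to_hash, pool.map(sha1_of_file, to_hash)):
            by_digest.setdefault(digest, []).extend(aliases[path])
    return by_digest
```

Each key of the returned dictionary corresponds to one unique file content; all paths listed under it would share the same data in the image.
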
 Once all file contents have been scanned by the worker threads, all
 unique files will be assigned an internal inode number.
 ### Building
 Building the filesystem image uses a number of separate threads. If
 `nilsimsa` ordering is selected, the ordering algorithm runs in its
 own thread and continuously emits file inodes. These will be picked
 up by the segmenter thread, which scans the inode contents using a
 cyclic hash and determines overlapping segments between previously
 written data and new incoming data. The segmenter can look at up to
 `--max-lookback-blocks` previous filesystem blocks to find overlaps.
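
To make the idea of overlap detection with a cyclic hash more concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the window size, the random lookup table, the 32-bit hash width and the names `window_hashes` and `find_overlaps` are all made up for the example and do not reflect the actual segmenter. A single `written` buffer stands in for the previous blocks the segmenter is allowed to look back at.

```python
import random

WINDOW = 16          # assumed window size in bytes
MASK = 0xFFFFFFFF    # hashes are kept to 32 bits
_rng = random.Random(0)  # fixed seed so the table is reproducible
TABLE = [_rng.getrandbits(32) for _ in range(256)]


def rotl(value, count):
    """Rotate a 32-bit value left by count bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK


def window_hashes(data):
    """Yield (offset, hash) for every WINDOW-byte window in data."""
    if len(data) < WINDOW:
        return
    h = 0
    for byte in data[:WINDOW]:
        h = rotl(h, 1) ^ TABLE[byte]
    yield 0, h
    for i in range(WINDOW, len(data)):
        # Rotate, cancel the byte leaving the window, mix in the new byte.
        h = rotl(h, 1) ^ rotl(TABLE[data[i - WINDOW]], WINDOW) ^ TABLE[data[i]]
        yield i - WINDOW + 1, h


def find_overlaps(written, incoming):
    """Return (incoming_offset, written_offset, length) for matched segments."""
    index = {}
    for offset, h in window_hashes(written):
        index.setdefault(h, []).append(offset)
    hashes = dict(window_hashes(incoming))  # offset -> hash
    matches = []
    i = 0
    while i + WINDOW <= len(incoming):
        hit = None
        for candidate in index.get(hashes[i], []):
            # Compare bytes to rule out hash collisions.
            if written[candidate:candidate + WINDOW] == incoming[i:i + WINDOW]:
                hit = candidate
                break
        if hit is None:
            i += 1
            continue
        length = WINDOW
        while (i + length < len(incoming) and hit + length < len(written)
               and incoming[i + length] == written[hit + length]):
            length += 1  # extend the match as far as the data agrees
        matches.append((i, hit, length))
        i += length
    return matches
```

Because the hash can be updated in constant time per byte, every window of the incoming data can be checked against the index of previously written data without rehashing the whole window.
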
 Once the segmenter has produced enough data to fill a filesystem
 block, the block is added to a queue from which the blocks
 will be picked up by a pool of `--num-workers` worker threads whose
 only job is to compress the block using the `--compression` algorithm.
 Blocks that have been compressed will be added to the next queue,
 in the original order, and will be picked up by the filesystem writer
 thread that will ultimately produce the final filesystem image.
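
The hand-off between the compression workers and the writer might be pictured as in the following Python sketch. It rests on assumptions: `zlib` is merely a stand-in for whatever `--compression` selects, the names `compress_block` and `write_image` are hypothetical, and a byte string stands in for the image file. The point it illustrates is that blocks are compressed concurrently yet reach the writer in their original order.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_block(block):
    """Compress one filesystem block (zlib used purely as a placeholder)."""
    return zlib.compress(block, 6)


def write_image(blocks, num_workers=4):
    """Compress blocks in parallel and concatenate them in input order."""
    image = bytearray()
    sizes = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map runs the calls concurrently but yields the results
        # in the order of the inputs, so the writer sees blocks in sequence.
        for compressed in pool.map(compress_block, blocks):
            sizes.append(len(compressed))
            image += compressed
    return bytes(image), sizes
```

Keeping the writer on a single thread means the output is identical no matter which worker finishes first.
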
 When all data has been segmented, the filesystem metadata is
 finalized and frozen into a compact representation. If metadata
 compression is enabled, the metadata is sent to the worker thread
 pool.
 ## AUTHOR
 Written by Marcus Holland-Moritz.