Update internal operation section of mkdwarfs manpage

2025-09-08 11:59:48 -04:00 · 2022-11-08 21:54:36 +01:00 · 2022-11-08 21:54:36 +01:00 · ee39c3eef7
commit ee39c3eef7
parent f95844a35d
1 changed file with 24 additions and 11 deletions
--- a/doc/mkdwarfs.md
+++ b/doc/mkdwarfs.md
@@ -558,23 +558,32 @@ input directory recursively and builds an internal representation of the
 directory structure. Traversal is breadth-first and single-threaded.

 When a regular file is discovered, its hardlink count is checked and
-if non-zero, its inode is looked up in a hardlink cache. If the inode
-has not been scanned yet, a scanning job will be added to a pool of
-`--num-workers` worker threads. These will perform a SHA1 checksum scan
-first, which is then used to determine duplicate files, as these will
-share the same data in the final DwarFS image. If a file is found not
-to be a duplicate, it will now potentially be scanned again (by the
-same worker threads and using the same memory mapping) to generate a
-similarity hash value. This only happens if `--order` is set to one
-of the two similarity order modes.
+if greater than one, its inode is looked up in a hardlink cache. Another
+lookup is performed to see if this is the first file/inode of a particular
+size. If it's the first file, we just keep track of the file. If it's not
+the first file, we add a job to a pool of `--num-scanner-workers` worker
+threads to compute a hash (determined by the `--file-hash` option)
+of the file. We also add a hash-computing job for the first file. These
+hashes will be used for de-duplicating files. If `--order` is set to one
+of the similarity order modes, for each unique file, a further job is
+added to the pool to compute a similarity hash. This happens immediately
+for each inode of a unique size, but it is guaranteed that duplicates
+don't trigger another similarity hash scan (the implementation for this
+is indeed a bit tricky).

 Once all file contents have been scanned by the worker threads, all
 unique files will be assigned an internal inode number.

+This behaviour can be customized. When using `--file-hash=none`,
+de-duplication is completely disabled. Using `--max-similarity-size`,
+it is possible to prevent computation of similarity hashes for huge
+files. These huge files will then be stored separately before all other
+files in the image.
+
 ### Building

-Building the filesystem image uses a number of separate threads. If
-`nilsimsa` ordering is selected, the ordering algorithm runs in its
+Building the filesystem image uses `--num-workers` separate threads.
+If `nilsimsa` ordering is selected, the ordering algorithm runs in its
 own thread and continuously emits file inodes. These will be picked
 up by the segmenter thread, which scans the inode contents using a
 cyclic hash and determines overlapping segments between previously
@@ -595,6 +604,10 @@ finalized and frozen into a compact representation. If metadata
 compression is enabled, the metadata is sent to the worker thread
 pool.

+When using different ordering schemes, the file inodes will be
+either sorted upfront, or just sent to the segmenter in the order
+in which they were discovered.
+
 ## AUTHOR

 Written by Marcus Holland-Moritz.
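
The size-first de-duplication described in the updated scanning text can be illustrated with a short sketch. This is not the mkdwarfs implementation: it is a minimal, single-threaded Python model, and the function name `group_by_content`, its `hash_name` parameter, and the in-memory `(name, data)` input are all hypothetical. It only shows the idea that the first file of a given size is merely remembered, and that hashing (of both that first file and the newcomer) starts once a second file of the same size appears.

```python
import hashlib
from collections import defaultdict


def group_by_content(files, hash_name="sha256"):
    """Group (name, data) pairs by content, hashing only on size collisions.

    Hypothetical model of the scheme described above, not mkdwarfs code.
    Returns (groups, hashed): the list of name groups, and the names that
    actually had to be hashed, in the order they were hashed.
    """
    sizes_seen = set()
    pending = {}                # size -> first (name, data) seen, hash deferred
    groups = defaultdict(list)  # (size, digest) -> names sharing that content
    hashed = []

    def add_hashed(name, data):
        key = (len(data), hashlib.new(hash_name, data).hexdigest())
        groups[key].append(name)
        hashed.append(name)

    for name, data in files:
        size = len(data)
        if size not in sizes_seen:
            # First file of this size: just keep track of it.
            sizes_seen.add(size)
            pending[size] = (name, data)
            continue
        # Not the first file of this size: hash the deferred first file
        # (once), then hash the current file.
        first = pending.pop(size, None)
        if first is not None:
            add_hashed(*first)
        add_hashed(name, data)

    # Files whose size stayed unique were never hashed.
    result = [[name] for name, _ in pending.values()]
    result.extend(groups.values())
    return result, hashed


files = [("a", b"hello"), ("b", b"xy"), ("c", b"hello"),
         ("d", b"world"), ("e", b"z")]
groups, hashed = group_by_content(files)
```

In this example only `a`, `c` and `d` share a size, so only those three are hashed; `a` and `c` end up in the same group, while `b` and `e` are treated as unique without reading their contents. The real tool distributes this work across worker threads and additionally schedules similarity hashing, which the sketch leaves out.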