Update internal operation section of mkdwarfs manpage

This commit is contained in:
Marcus Holland-Moritz 2022-11-08 21:54:36 +01:00
parent f95844a35d
commit ee39c3eef7

View File

@ -558,23 +558,32 @@ input directory recursively and builds an internal representation of the
directory structure. Traversal is breadth-first and single-threaded. directory structure. Traversal is breadth-first and single-threaded.
When a regular file is discovered, its hardlink count is checked and When a regular file is discovered, its hardlink count is checked and
if non-zero, its inode is looked up in a hardlink cache. If the inode if greater than one, its inode is looked up in a hardlink cache. Another
has not been scanned yet, a scanning job will be added to a pool of lookup is performed to see if this is the first file/inode of a particular
`--num-workers` worker threads. These will perform a SHA1 checksum scan size. If it's the first file, we just keep track of the file. If it's not
first, which is then used to determine duplicate files, as these will the first file, we add a jobs to a pool of `--num-scanner-workers` worker
share the same data in the final DwarFS image. If a file is found not threads to compute a hash (determined by the the `--file-hash` option)
to be a duplicate, it will now potentially be scanned again (by the of the file. We also add a hash-computing job for the first file. These
same worker threads and using the same memory mapping) to generate a hashes will be used for de-duplicating files. If `--order` is set to one
similarity hash value. This only happens if `--order` is set to one of the similarity order modes, for each unique file, a further job is
of the two similarity order modes. added to the pool to compute a similarity hash. This happens immediately
for each inode of a unique size, but it is guaranteed that duplicates
don't trigger another similarity hash scan (the implementation for this
is indeed a bit tricky).
Once all file contents have been scanned by the worker threads, all Once all file contents have been scanned by the worker threads, all
unique files will be assigned an internal inode number. unique files will be assigned an internal inode number.
This behaviour can be customized. When using `--file-hash=none`,
de-duplication is completely disabled. Using `--max-similarity-size`,
it is possible to prevent computation of similarity hashes for huge
files. These huge files will then be stored separately before all other
files in the image.
### Building ### Building
Building the filesystem image uses a number of separate threads. If Building the filesystem image uses a `--num-workers` separate threads.
`nilsimsa` ordering is selected, the ordering algorithm runs in its If `nilsimsa` ordering is selected, the ordering algorithm runs in its
own thread and continuously emits file inodes. These will be picked own thread and continuously emits file inodes. These will be picked
up by the segmenter thread, which scans the inode contents using a up by the segmenter thread, which scans the inode contents using a
cyclic hash and determines overlapping segments between previously cyclic hash and determines overlapping segments between previously
@ -595,6 +604,10 @@ finalized and frozen into a compact representation. If metadata
compression is enabled, the metadata is sent to the worker thread compression is enabled, the metadata is sent to the worker thread
pool. pool.
When using different ordering schemes, the file inodes will be
either sorted upfront, or just sent to the segmenter in the order
in which they were discovered.
## AUTHOR ## AUTHOR
Written by Marcus Holland-Moritz. Written by Marcus Holland-Moritz.