Tweak the internal operation documentation

Marcus Holland-Moritz 2023-07-06 21:06:38 +02:00
parent 7c1eee8129
commit 8fcb03e8b7


@@ -586,27 +586,33 @@ and excluded files without building an actual file system.

Internally, `mkdwarfs` runs in two completely separate phases. The first
phase is scanning the input data, the second phase is building the file
system. Both phases try to do as little work as possible, and try to run
as much of the remaining work as possible in parallel, while still making
sure that the file system images produced are reproducible (see the
`--order` option documentation for details on reproducible images).

### Scanning

The scanning process is driven by the main thread which traverses the
input directory recursively and builds an internal representation of the
directory structure. Traversal is breadth-first and single-threaded.
Filter rules as specified by `--filter` are handled immediately during
traversal.
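
As a rough illustration of this traversal model (not the actual
`mkdwarfs` code; `scan_tree` and the `filter` callback are hypothetical
stand-ins for the real traversal and the compiled `--filter` rules), a
breadth-first, single-threaded scan with inline filtering might look
like this:

```cpp
#include <filesystem>
#include <functional>
#include <iostream>
#include <queue>

namespace fs = std::filesystem;

// Breadth-first, single-threaded traversal; `filter` stands in for the
// compiled `--filter` rules and is consulted as soon as an entry is seen.
void scan_tree(fs::path const& root,
               std::function<bool(fs::path const&)> const& filter) {
  std::queue<fs::path> pending;
  pending.push(root);

  while (!pending.empty()) {
    fs::path dir = std::move(pending.front());
    pending.pop();

    for (auto const& entry : fs::directory_iterator(dir)) {
      if (!filter(entry.path())) {
        continue; // excluded entries are never descended into
      }
      if (entry.is_directory()) {
        pending.push(entry.path()); // visited after all of its siblings
      } else {
        std::cout << "found: " << entry.path() << '\n';
      }
    }
  }
}
```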

When a regular file is discovered, its hardlink count is checked and
if greater than one, its inode is looked up in a hardlink cache. Another
lookup is performed to see if this is the first file/inode of a particular
size. If it's the first file, we just keep track of the file. If it's not
the first file, we add a job to a pool of `--num-scanner-workers` worker
threads to compute a hash of the file (the hash function is determined
by the `--file-hash` option). We also add a hash-computing job for the
first file we found with this size earlier. These hashes will then be
used for de-duplicating files. If `--order` is set to one of the
similarity order modes, for each unique file, a further job is added to
the pool to compute a similarity hash. This happens immediately for each
inode of a unique size, but it is guaranteed that duplicates don't
trigger another similarity hash scan (the implementation for this is
actually a bit tricky).
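
The size-based scheduling of hash jobs can be sketched roughly as
follows. This is a simplified, hypothetical rendering of the logic
described above, not the real implementation (which also deals with
hardlinks, errors and thread synchronization); `first_by_size`,
`queue_hash_job` and `on_regular_file` are made-up names:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct file_entry {
  std::string path;
  uint64_t size;
};

// Stand-ins: `hash_jobs` plays the role of the scanner worker pool's
// queue, `first_by_size` remembers the first file seen for each size.
std::vector<std::function<void()>> hash_jobs;
std::unordered_map<uint64_t, std::optional<file_entry>> first_by_size;

void queue_hash_job(file_entry const& f) {
  hash_jobs.push_back([f] {
    std::cout << "hashing " << f.path << '\n'; // real code: --file-hash
  });
}

void on_regular_file(file_entry const& f) {
  auto [it, is_first] = first_by_size.try_emplace(f.size, f);
  if (is_first) {
    return; // first file of this size: just keep track of it
  }
  if (it->second) {
    // A second file of this size has shown up, so the file we only
    // remembered earlier needs a hash job as well -- but just this once.
    queue_hash_job(*it->second);
    it->second.reset();
  }
  queue_hash_job(f); // every further file of a known size gets hashed
}
```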

Once all file contents have been scanned by the worker threads, all
unique files will be assigned an internal inode number.

@@ -620,11 +626,12 @@ files in the image.

### Building

Building the filesystem image uses `--num-workers` separate threads.
If `nilsimsa` ordering is selected, the ordering algorithm runs in its
own thread and continuously emits file inodes. These will be picked up
by the segmenter thread, which scans the inode contents using a cyclic
hash and determines overlapping segments between previously written
data and new incoming data. The segmenter will look at up to
`--max-lookback-block` previous filesystem blocks to find overlaps.
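
The cyclic hash is what makes this scan cheap: the hash of a fixed-size
window can be slid forward one byte at a time in constant time. Here is
a minimal buzhash-style sketch of the idea; the window size, hash table
and candidate lookup are simplified assumptions, and a real segmenter
would verify every candidate match byte-by-byte:

```cpp
#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

constexpr std::size_t kWindow = 4096; // stand-in for the real window size

// Buzhash-style cyclic hash: a window's hash can be slid forward by one
// byte in O(1), independent of the window size.
struct cyclic_hasher {
  std::array<uint32_t, 256> table{};
  uint32_t value{0};

  cyclic_hasher() {
    std::mt19937 rng(42); // fixed seed keeps the hashes reproducible
    for (auto& t : table) { t = static_cast<uint32_t>(rng()); }
  }

  void push(uint8_t in) { value = std::rotl(value, 1) ^ table[in]; }

  void slide(uint8_t in, uint8_t out) {
    // drop the byte leaving the window, mix in the byte entering it
    value = std::rotl(value, 1) ^
            std::rotl(table[out], static_cast<int>(kWindow % 32)) ^ table[in];
  }
};

// `seen` maps window hashes from previously written blocks to offsets; a
// hit is only a candidate and would still need byte-by-byte verification.
std::vector<std::size_t>
find_overlap_candidates(std::vector<uint8_t> const& data,
                        std::unordered_map<uint32_t, std::size_t> const& seen) {
  std::vector<std::size_t> hits;
  if (data.size() < kWindow) { return hits; }
  cyclic_hasher h;
  for (std::size_t i = 0; i < kWindow; ++i) { h.push(data[i]); }
  for (std::size_t i = kWindow;; ++i) {
    if (seen.count(h.value) != 0) { hits.push_back(i - kWindow); }
    if (i == data.size()) { break; }
    h.slide(data[i], data[i - kWindow]);
  }
  return hits;
}
```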

Once the segmenter has produced enough data to fill a filesystem

@@ -639,7 +646,7 @@ thread that will ultimately produce the final filesystem image.

When all data has been segmented, the filesystem metadata is
finalized and frozen into a compact representation. If metadata
compression is enabled, the metadata is sent to the worker thread
pool for compression.
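
A hypothetical sketch of that hand-off, with `schedule_metadata` and the
`compress` callback made up for illustration and `std::async` standing
in for the shared worker pool:

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <utility>
#include <vector>

using buffer = std::vector<uint8_t>;

// The frozen metadata becomes just one more compression job, so writing
// the image does not stall while the metadata is compressed.
std::future<buffer>
schedule_metadata(buffer frozen, bool compress_metadata,
                  std::function<buffer(buffer)> compress) {
  if (!compress_metadata) {
    std::promise<buffer> done;
    done.set_value(std::move(frozen)); // stored uncompressed as-is
    return done.get_future();
  }
  // std::async stands in for queueing on the shared worker thread pool
  return std::async(std::launch::async, std::move(compress),
                    std::move(frozen));
}
```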

When using different ordering schemes, the file inodes will be
either sorted upfront, or just sent to the segmenter in the order