mirror of
https://github.com/mhx/dwarfs.git
synced 2025-09-14 06:48:39 -04:00
Document mkdwarfs internals
This commit is contained in:
parent
89df6add69
commit
8e376d77ad
@ -321,6 +321,56 @@ is left uncompressed. This can be useful if mounting speed of the file
|
|||||||
system is important, as the uncompressed metadata part of the file can
|
system is important, as the uncompressed metadata part of the file can
|
||||||
then simply be mapped into memory.
|
then simply be mapped into memory.
|
||||||
|
|
||||||
|
## INTERNAL OPERATION
|
||||||
|
|
||||||
|
Internally, `mkdwarfs` run in two completely separate phases. The first
|
||||||
|
phase is scanning the input data, the second phase is building the file
|
||||||
|
system.
|
||||||
|
|
||||||
|
### Scanning
|
||||||
|
|
||||||
|
The scanning process is driven by the main thread which traverses the
|
||||||
|
input directory recursively and builds an internal representation of the
|
||||||
|
directory structure. Traversal is breadth-first and single-threaded.
|
||||||
|
|
||||||
|
When a regular file is discovered, its hardlink count is checked and
|
||||||
|
if non-zero, its inode is looked up in a hardlink cache. If the inode
|
||||||
|
has not been scanned yet, a scanning job will be added to a pool of
|
||||||
|
`--num-workers` worker threads. These will perform a SHA1 checksum scan
|
||||||
|
first, which is then used to determine duplicate files, as these will
|
||||||
|
share the same data in the final DwarFS image. If a file is found not
|
||||||
|
to be a duplicate, it will now potentially be scanned again (by the
|
||||||
|
same worker threads and using the same memory mapping) to generate a
|
||||||
|
similarity hash value. This only happens if `--order` is set to one
|
||||||
|
of the two similary order modes.
|
||||||
|
|
||||||
|
Once all file contents have been scanned by the worker threads, all
|
||||||
|
unique files will be assigned an internal inode number.
|
||||||
|
|
||||||
|
### Building
|
||||||
|
|
||||||
|
Building the filesystem image uses a number of separate threads. If
|
||||||
|
`nilsimsa` ordering is selected, the ordering algorithm runs in its
|
||||||
|
own thread and continuously emits file inodes. These will be picked
|
||||||
|
up by the segmenter thread, which scans the inode contents using a
|
||||||
|
cyclic hash and determines overlapping segments between previously
|
||||||
|
written data and new incoming data. The segmenter can look at up to
|
||||||
|
`--max-lookback-block` previous filesystem blocks to find overlaps.
|
||||||
|
|
||||||
|
Once the segmenter has produced enough data to fill a filesystem
|
||||||
|
block, the block is added to a queue where from which the blocks
|
||||||
|
will be picked up by a pool of `--num-workers` worker threads whose
|
||||||
|
only job is to compress the block using the `--compression` algorithm.
|
||||||
|
|
||||||
|
Blocks that have been compressed will be added to the next queue,
|
||||||
|
in the original order, and will be picked up by the filesystem writer
|
||||||
|
thread that will ultimately produce the final filesystem image.
|
||||||
|
|
||||||
|
When all data has been segmented, the filesystem metadata is being
|
||||||
|
finalized and frozen into a compact representation. If metadata
|
||||||
|
compression is enabled, the metadata is sent to the worker thread
|
||||||
|
pool.
|
||||||
|
|
||||||
## AUTHOR
|
## AUTHOR
|
||||||
|
|
||||||
Written by Marcus Holland-Moritz.
|
Written by Marcus Holland-Moritz.
|
||||||
|
Loading…
x
Reference in New Issue
Block a user