mirror of
https://github.com/mhx/dwarfs.git
synced 2025-09-10 21:10:02 -04:00
docs(mkdwarfs): update man page with all new options
This commit is contained in:
parent
79d3302990
commit
dd58e74594
138
doc/mkdwarfs.md
138
doc/mkdwarfs.md
@ -65,6 +65,10 @@ Most other options are concerned with compression tuning:
|
|||||||
`--window-step` and `--order`. See the output of `mkdwarfs --help` for
|
`--window-step` and `--order`. See the output of `mkdwarfs --help` for
|
||||||
a table listing the exact defaults used for each compression level.
|
a table listing the exact defaults used for each compression level.
|
||||||
|
|
||||||
|
- `--categorize`[`=`*categorizer*[`,`...]]:
|
||||||
|
Enable one or more categorizers in the given order.
|
||||||
|
See [CATEGORIZERS](#categorizers) for more details.
|
||||||
|
|
||||||
- `-S`, `--block-size-bits=`*value*:
|
- `-S`, `--block-size-bits=`*value*:
|
||||||
The block size used for the compressed filesystem. The actual block size
|
The block size used for the compressed filesystem. The actual block size
|
||||||
is two to the power of this value. Larger block sizes will offer better
|
is two to the power of this value. Larger block sizes will offer better
|
||||||
@ -91,9 +95,8 @@ Most other options are concerned with compression tuning:
|
|||||||
and values greater than 10 are mapped to "background" priority.
|
and values greater than 10 are mapped to "background" priority.
|
||||||
|
|
||||||
- `--num-scanner-workers=`*value*:
|
- `--num-scanner-workers=`*value*:
|
||||||
Number of worker threads used for building the filesystem. This defaults
|
Number of worker threads used for scanning the filesystem. Use this option
|
||||||
to the number of processors available on your system. Use this option if
|
if you want to limit the resources used by `mkdwarfs` or to optimize build
|
||||||
you want to limit the resources used by `mkdwarfs` or to optimize build
|
|
||||||
speed. This option affects only the scanning phase. By default, the same
|
speed. This option affects only the scanning phase. By default, the same
|
||||||
value is used as for `--num-workers`.
|
value is used as for `--num-workers`.
|
||||||
In the scanning phase, the worker threads are used to scan files in the
|
In the scanning phase, the worker threads are used to scan files in the
|
||||||
@ -102,7 +105,20 @@ Most other options are concerned with compression tuning:
|
|||||||
computation, depending on the `--order` option. File discovery itself
|
computation, depending on the `--order` option. File discovery itself
|
||||||
is single-threaded and runs independently from the scanning threads.
|
is single-threaded and runs independently from the scanning threads.
|
||||||
|
|
||||||
- `-B`, `--max-lookback-blocks=`*value*:
|
- `--num-segmenter-workers=`*value*:
|
||||||
|
Number of worker threads used for segmenting the input data. By default,
|
||||||
|
the same value is used as for `--num-workers`.
|
||||||
|
Segmenting the input data is one of the most time consuming tasks when
|
||||||
|
building a file system, and cannot easily be parallelized. However, when
|
||||||
|
using the categorizer, a separate segmenter will be used for each category
|
||||||
|
(and subcategory, if present). This option controls how many segmenters
|
||||||
|
can run simultaneously. When `--compress-niceness` is set to the default,
|
||||||
|
segmenter threads will always have a higher priority than compression
|
||||||
|
threads, making sure that compression doesn't slow down segmentation.
|
||||||
|
This option also controls the number of threads used for ordering the
|
||||||
|
input to the segmenter.
|
||||||
|
|
||||||
|
- `-B`, `--max-lookback-blocks=[*category*`::`]`*value*:
|
||||||
Specify how many of the most recent blocks to scan for duplicate segments.
|
Specify how many of the most recent blocks to scan for duplicate segments.
|
||||||
By default, only the current block will be scanned. The larger this number,
|
By default, only the current block will be scanned. The larger this number,
|
||||||
the more duplicate segments will likely be found, which may further improve
|
the more duplicate segments will likely be found, which may further improve
|
||||||
@ -111,7 +127,7 @@ Most other options are concerned with compression tuning:
|
|||||||
files can now potentially span multiple filesystem blocks. Passing `-B0`
|
files can now potentially span multiple filesystem blocks. Passing `-B0`
|
||||||
will completely disable duplicate segment search.
|
will completely disable duplicate segment search.
|
||||||
|
|
||||||
- `-W`, `--window-size=`*value*:
|
- `-W`, `--window-size=[*category*`::`]`*value*:
|
||||||
Window size of cyclic hash used for segmenting. This is an exponent
|
Window size of cyclic hash used for segmenting. This is an exponent
|
||||||
to a base of two. Cyclic hashes are used by `mkdwarfs` for finding
|
to a base of two. Cyclic hashes are used by `mkdwarfs` for finding
|
||||||
identical segments across multiple files. This is done on top of duplicate
|
identical segments across multiple files. This is done on top of duplicate
|
||||||
@ -128,7 +144,7 @@ Most other options are concerned with compression tuning:
|
|||||||
size will grow. Passing `-W0` will completely disable duplicate segment
|
size will grow. Passing `-W0` will completely disable duplicate segment
|
||||||
search.
|
search.
|
||||||
|
|
||||||
- `-w`, `--window-step=`*value*:
|
- `-w`, `--window-step=[*category*`::`]`*value*:
|
||||||
This option specifies how often cyclic hash values are stored for lookup.
|
This option specifies how often cyclic hash values are stored for lookup.
|
||||||
It is specified relative to the window size, as a base-2 exponent that
|
It is specified relative to the window size, as a base-2 exponent that
|
||||||
divides the window size. As a concrete example, if `--window-size=16`
|
divides the window size. As a concrete example, if `--window-size=16`
|
||||||
@ -141,7 +157,7 @@ Most other options are concerned with compression tuning:
|
|||||||
If you use a larger value for this option, the increments become *smaller*,
|
If you use a larger value for this option, the increments become *smaller*,
|
||||||
and `mkdwarfs` will be slightly slower and use more memory.
|
and `mkdwarfs` will be slightly slower and use more memory.
|
||||||
|
|
||||||
- `--bloom-filter-size`=*value*:
|
- `--bloom-filter-size`=[*category*`::`]*value*:
|
||||||
The segmenting algorithm uses a bloom filter to determine quickly if
|
The segmenting algorithm uses a bloom filter to determine quickly if
|
||||||
there is *no* match at a given position. This will filter out more than
|
there is *no* match at a given position. This will filter out more than
|
||||||
90% of bad matches quickly with the default bloom filter size. The default
|
90% of bad matches quickly with the default bloom filter size. The default
|
||||||
@ -161,7 +177,7 @@ Most other options are concerned with compression tuning:
|
|||||||
algorithms, so if you're short on memory it might be worth tweaking the
|
algorithms, so if you're short on memory it might be worth tweaking the
|
||||||
compression options.
|
compression options.
|
||||||
|
|
||||||
- `-C`, `--compression=`*algorithm*[`:`*algopt*[`=`*value*][`:`...]]:
|
- `-C`, `--compression=`[*category*`::`]*algorithm*[`:`*algopt*[`=`*value*][`:`...]]:
|
||||||
The compression algorithm and configuration used for file system data.
|
The compression algorithm and configuration used for file system data.
|
||||||
The value for this option is a colon-separated list. The first item is
|
The value for this option is a colon-separated list. The first item is
|
||||||
the compression algorithm, the remaining item are its options. Options
|
the compression algorithm, the remaining item are its options. Options
|
||||||
@ -211,6 +227,12 @@ Most other options are concerned with compression tuning:
|
|||||||
metadata to uncompressed metadata without having to rebuild or recompress
|
metadata to uncompressed metadata without having to rebuild or recompress
|
||||||
all the other data.
|
all the other data.
|
||||||
|
|
||||||
|
- `--recompress-categories=`[`!`]*category*[`,`...]:
|
||||||
|
When `--recompress` is set to `all` or `block`, this option controls
|
||||||
|
which categories of blocks will be recompressed. Adding a `!` in front
|
||||||
|
of the list allows you to specify which categories will *not* be
|
||||||
|
recompressed.
|
||||||
|
|
||||||
- `-P`, `--pack-metadata=auto`|`none`|[`all`|`chunk_table`|`directories`|`shared_files`|`names`|`names_index`|`symlinks`|`symlinks_index`|`force`|`plain`[`,`...]]:
|
- `-P`, `--pack-metadata=auto`|`none`|[`all`|`chunk_table`|`directories`|`shared_files`|`names`|`names_index`|`symlinks`|`symlinks_index`|`force`|`plain`[`,`...]]:
|
||||||
Which metadata information to store in packed format. This is primarily
|
Which metadata information to store in packed format. This is primarily
|
||||||
useful when storing metadata uncompressed, as it allows for smaller
|
useful when storing metadata uncompressed, as it allows for smaller
|
||||||
@ -258,7 +280,7 @@ Most other options are concerned with compression tuning:
|
|||||||
"normalize" the permissions across the file system; this is equivalent to
|
"normalize" the permissions across the file system; this is equivalent to
|
||||||
using `--chmod=ug-st,=Xr`.
|
using `--chmod=ug-st,=Xr`.
|
||||||
|
|
||||||
- `--order=none`|`path`|`revpath`|`similarity`|`nilsimsa`[`:`*max-children*[`:`*max-cluster-size*]]:
|
- `--order=`[*category*`::`]`none`|`path`|`revpath`|`similarity`|`nilsimsa`[`:`*max-children*[`:`*max-cluster-size*]]:
|
||||||
The order in which inodes will be written to the file system. Choosing `none`,
|
The order in which inodes will be written to the file system. Choosing `none`,
|
||||||
the inodes will be stored in the order in which they are discovered. With
|
the inodes will be stored in the order in which they are discovered. With
|
||||||
`path`, they will be sorted asciibetically by path name of the first file
|
`path`, they will be sorted asciibetically by path name of the first file
|
||||||
@ -367,6 +389,24 @@ Most other options are concerned with compression tuning:
|
|||||||
you can switch to `ascii`, which is like `unicode`, but looks less
|
you can switch to `ascii`, which is like `unicode`, but looks less
|
||||||
fancy.
|
fancy.
|
||||||
|
|
||||||
|
- `--incompressible-min-input-size=`*value*
|
||||||
|
The minimum size of a file to be checked for incompressibility when
|
||||||
|
the `incompressible` categorizer is active.
|
||||||
|
|
||||||
|
- `--incompressible-block-size=`*value*
|
||||||
|
The block size used to test data for compressibility. This will also
|
||||||
|
be the size of the fragments when `--incompressible-fragments` is used.
|
||||||
|
|
||||||
|
- `--incompressible-fragments`
|
||||||
|
Categorize individual fragments of a file as incompressible instead of
|
||||||
|
only the file as a whole.
|
||||||
|
|
||||||
|
- `--incompressible-ratio=`*value*
|
||||||
|
The ratio above which a file or fragment is categorized as `incompressible`.
|
||||||
|
|
||||||
|
- `--incompressible-zstd-level=`*value*
|
||||||
|
The ZSTD compression level used for incompressible categorization.
|
||||||
|
|
||||||
- `-h`, `--help`:
|
- `-h`, `--help`:
|
||||||
Show usage and the most common basic options.
|
Show usage and the most common basic options.
|
||||||
|
|
||||||
@ -374,8 +414,82 @@ Most other options are concerned with compression tuning:
|
|||||||
Show full usage with all options, including defaults, compression level
|
Show full usage with all options, including defaults, compression level
|
||||||
detail and supported compression algorithms.
|
detail and supported compression algorithms.
|
||||||
|
|
||||||
If experimental Python support was compiled into `mkdwarfs`, you can use the
|
## CATEGORIZERS
|
||||||
following option to enable customizations via the scripting interface:
|
|
||||||
|
Categorizers will inspect the input files in the scanning phase and try to
|
||||||
|
assign them a category. Each categorizer can define a set of categories,
|
||||||
|
and each of these categories can optionally support subcategories.
|
||||||
|
|
||||||
|
Running `mkdwarfs` with the `-H` or `--long-help` option will display the
|
||||||
|
list of available categorizers and the categories they emit. At the moment,
|
||||||
|
`mkdwarfs` supports two categorizers, `incompressible` and `pcmaudio`. The
|
||||||
|
`incompressible` categorizer comes with its own set of options while the
|
||||||
|
`pcmaudio` categorizer doesn't need any further configuration.
|
||||||
|
|
||||||
|
Categorizers are only useful if at least some of the `mkdwarfs` configuration
|
||||||
|
is category-dependent. The options that can be configured per category are
|
||||||
|
`--compression`, `--order`, `--max-lookback-blocks`, `--window-size`,
|
||||||
|
`--window-step`, and `--bloom-filter-size`.
|
||||||
|
|
||||||
|
The resulting configuration matrix can be quite overwhelming, which is why
|
||||||
|
`mkdwarfs` will run with a reasonable set of defaults if you specify the
|
||||||
|
`--categorize` option with no arguments. These defaults are currently:
|
||||||
|
|
||||||
|
--categorize=pcmaudio,incompressible
|
||||||
|
--compression incompressible::null
|
||||||
|
--compression pcmaudio/waveform::flac
|
||||||
|
--order pcmaudio/waveform::revpath
|
||||||
|
--max-lookback-blocks pcmaudio/waveform::0
|
||||||
|
--window-size pcmaudio/waveform::0
|
||||||
|
|
||||||
|
Note that in case of the `pcmaudio` categorizer, you can override each
|
||||||
|
option per category (in this case `pcmaudio/waveform`).
|
||||||
|
|
||||||
|
It's also worth noting that the order in which the categorizers are given
|
||||||
|
is important. The first categorizer that will successfully categorize a
|
||||||
|
file wins and, if possible, no other categorizers will run on the same
|
||||||
|
file.
|
||||||
|
|
||||||
|
### `incompressible` Categorizer
|
||||||
|
|
||||||
|
The `incompressible` categorizer will try to compress each input with a
|
||||||
|
very fast compression algorithm (`zstd` using a negative compression level
|
||||||
|
by default). If it turns out that this doesn't reduce the size of the input
|
||||||
|
significantly, the input will be categorized as `incompressible`.
|
||||||
|
|
||||||
|
You can use the `incompressible` categorizer in two modes: whole-file or
|
||||||
|
fragmented categorization. In the former, the whole input file will be
|
||||||
|
categorized, whereas in the latter, each file can be further broken down
|
||||||
|
into compressible and incompressible fragments. The size of these fragments
|
||||||
|
will be equal to the block size used for categorization.
|
||||||
|
|
||||||
|
It makes sense to use this categorizer as the last in a list of multiple
|
||||||
|
categorizers. Not only because it'll likely have the biggest overhead, but
|
||||||
|
also because it can wrongly classify data as incompressible that can be
|
||||||
|
compressed properly with a specialized algorithm (e.g. audio data).
|
||||||
|
|
||||||
|
### `pcmaudio` Categorizer
|
||||||
|
|
||||||
|
The `pcmaudio` categorizer can identify and categorize a wide range of
|
||||||
|
uncompressed audio data such as `.wav`, `.aiff` and more obscure formats.
|
||||||
|
|
||||||
|
It produces two different categories: `pcmaudio/waveform` for the actual
|
||||||
|
waveform data, and `pcmaudio/metadata` for all other data in the file such
|
||||||
|
as the file header. The `pcmaudio/waveform` category is again divided into
|
||||||
|
many subcategories depending on the type of waveform data (e.g. number of
|
||||||
|
channels, bit depth, byte order, etc.).
|
||||||
|
|
||||||
|
In order to efficiently compress `pcmaudio/waveform` data, a suitable
|
||||||
|
compression algorithm must be selected for this category. `mkdwarfs`
|
||||||
|
currently supports `flac` compression, which offers the best ratio of
|
||||||
|
compression speed and achievable compression ratio.
|
||||||
|
|
||||||
|
It is worth noting that options such as `--window-size` will operate on
|
||||||
|
*sample* granularity instead of *byte* granularity when processing
|
||||||
|
`pcmaudio/waveform` data, where *sample* granularity means one sample
|
||||||
|
for each channel. For example, a 16-bit stereo file would have a
|
||||||
|
granularity of 4 bytes and thus `--window-size=10` would refer to a
|
||||||
|
4 KiB window instead of a 1 KiB windows.
|
||||||
|
|
||||||
## TIPS & TRICKS
|
## TIPS & TRICKS
|
||||||
|
|
||||||
@ -642,7 +756,7 @@ own thread and continuously emits file inodes. These will be picked up
|
|||||||
by the segmenter thread, which scans the inode contents using a cyclic
|
by the segmenter thread, which scans the inode contents using a cyclic
|
||||||
hash and determines overlapping segments between previously written
|
hash and determines overlapping segments between previously written
|
||||||
data and new incoming data. The segmenter will look at up to
|
data and new incoming data. The segmenter will look at up to
|
||||||
`--max-lookback-block` previous filesystem blocks to find overlaps.
|
`--max-lookback-blocks` previous filesystem blocks to find overlaps.
|
||||||
|
|
||||||
Once the segmenter has produced enough data to fill a filesystem
|
Once the segmenter has produced enough data to fill a filesystem
|
||||||
block, the block is added to a queue where from which the blocks
|
block, the block is added to a queue where from which the blocks
|
||||||
|
Loading…
x
Reference in New Issue
Block a user