- More "real" benchmarks

- Entry v2

- Mounting lots of images with shared cache?
  - Communication could be implemented using xattrs (see the sketch below)
  - Does WinFsp support multiple mountpoints in a single process?
    - Seems it does!
    - https://github.com/winfsp/winfsp/commit/ae8e4e61f77f24c6267d6fc5fa14bd567c7a88ea
  - Second option would be handling the images internally
    - However, inode numbers aren't straightforward in case of
      adding/removing images.
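
  A minimal sketch of the xattr idea (the xattr name and protocol are
  made up for illustration; dwarfs defines no such interface today):

    // A control client asks the running daemon to attach another image
    // by setting a user xattr on the mountpoint; the daemon would
    // intercept this in its setxattr handler.
    #include <sys/xattr.h>
    #include <cstring>

    int request_mount(char const* mountpoint, char const* image_path) {
      return ::setxattr(mountpoint, "user.dwarfs.mount", image_path,
                        std::strlen(image_path), 0);
    }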

- Metadata rebuilding (see also below). It should be possible to rebuild
  the metadata block and change features like packing without having to
  rebuild the whole image.

- Allow changing the block size during recompression. This should be
  possible, along with updating the chunks in the metadata, and the
  operation should be perfectly reversible. At block boundaries, chunks
  must be split or merged, but that should be easily doable as well.

- Sparse files: implement by adding a "chunk type" field to each chunk;
  for now, there are only "normal" and "sequence" chunks, with sequence
  chunks encoded as follows:

    block:  byte value
    offset: sequence_length / block_size
    size:   sequence_length % block_size

  Will need a feature bit. Also, we probably want a sparse file
  categorizer? (see further below)

  We'll probably have to do this along with refactoring the code to
  support both mmap- and read-based APIs.
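
  A minimal round-trip sketch of the encoding (chunk_type and the field
  reuse are illustrative assumptions, not the actual dwarfs metadata
  schema):

    #include <cstdint>

    enum class chunk_type : uint8_t { NORMAL, SEQUENCE };

    struct chunk {
      chunk_type type;
      uint64_t block;  // NORMAL: block index;  SEQUENCE: repeated byte value
      uint64_t offset; // NORMAL: block offset; SEQUENCE: length / block_size
      uint64_t size;   // NORMAL: chunk size;   SEQUENCE: length % block_size
    };

    // encode a run of `length` copies of `byte` as a sequence chunk
    chunk make_sequence_chunk(uint8_t byte, uint64_t length,
                              uint64_t block_size) {
      return {chunk_type::SEQUENCE, byte, length / block_size,
              length % block_size};
    }

    // recover the run length from a sequence chunk
    uint64_t sequence_length(chunk const& c, uint64_t block_size) {
      return c.offset * block_size + c.size;
    }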

- Add support for logging to a file (with a different level?)

- Add support for libarchive filters in dwarfsextract.

- When hashing, start by only hashing the first, say, 4 KiB, and only
  if the prefix hashes are identical, hash the whole file
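
  A hedged sketch of the two stages (hash_bytes() is a placeholder, not
  the hash function dwarfs actually uses):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string_view>
    #include <unordered_map>
    #include <vector>

    constexpr size_t kPrefixSize = 4096;

    uint64_t hash_bytes(std::string_view data) {
      return std::hash<std::string_view>{}(data); // placeholder hash
    }

    struct file_entry { std::string_view contents; };

    // stage 1: group candidate duplicates by prefix hash
    std::unordered_map<uint64_t, std::vector<file_entry*>>
    bucket_by_prefix(std::vector<file_entry>& files) {
      std::unordered_map<uint64_t, std::vector<file_entry*>> buckets;
      for (auto& f : files) {
        auto n = std::min(kPrefixSize, f.contents.size());
        buckets[hash_bytes(f.contents.substr(0, n))].push_back(&f);
      }
      return buckets;
    }

    // stage 2: only files sharing a bucket ever get fully hashed
    bool probably_identical(file_entry const& a, file_entry const& b) {
      return a.contents.size() == b.contents.size() &&
             hash_bytes(a.contents) == hash_bytes(b.contents);
    }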

- Add support for compressing uncompressed image formats, currently
  primarily FITS, using different formats (e.g. JPEG2K, Rice, Hcompress).

- Fragment counts don't necessarily match up in the presence of errors.

- When opening a file again, check that its timestamp, size, and
  potentially its checksum did not change from when we first saw it.
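
  A sketch using POSIX stat() (checksums could be layered on top):

    #include <sys/stat.h>

    struct file_snapshot {
      off_t size;
      timespec mtime;
    };

    bool snapshot(char const* path, file_snapshot& out) {
      struct stat st;
      if (::stat(path, &st) != 0) return false;
      out.size = st.st_size;
      out.mtime = st.st_mtim; // POSIX.1-2008 nanosecond timestamp
      return true;
    }

    // true if the file still looks like it did when first scanned
    bool unchanged(char const* path, file_snapshot const& seen) {
      file_snapshot now;
      return snapshot(path, now) && now.size == seen.size &&
             now.mtime.tv_sec == seen.mtime.tv_sec &&
             now.mtime.tv_nsec == seen.mtime.tv_nsec;
    }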

- Option to log to a file instead of stderr?

- Use thrift definitions for all options to make them easily
  printable/storable?

- Use Elias-Fano for delta-encoded lists in metadata?
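
  For reference, a compact sketch of the technique (sorted, non-decreasing
  values bounded by `universe`; a real implementation would add a select
  index for random access instead of the sequential decode shown here):

    #include <bit>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct elias_fano {
      unsigned low_bits = 0;
      std::vector<bool> high; // unary-coded high parts
      std::vector<bool> low;  // packed low parts, low_bits each
      size_t n = 0;

      static elias_fano encode(std::vector<uint64_t> const& v,
                               uint64_t universe) {
        elias_fano ef;
        ef.n = v.size();
        if (ef.n && universe / ef.n > 1) // low_bits = floor(log2(u/n))
          ef.low_bits = std::bit_width(universe / ef.n) - 1;
        ef.high.assign(ef.n + (universe >> ef.low_bits) + 1, false);
        for (size_t i = 0; i < ef.n; ++i) {
          ef.high[(v[i] >> ef.low_bits) + i] = true; // i-th high part, unary
          for (unsigned b = ef.low_bits; b-- > 0;)   // low bits, MSB first
            ef.low.push_back((v[i] >> b) & 1);
        }
        return ef;
      }

      std::vector<uint64_t> decode() const {
        std::vector<uint64_t> out;
        size_t i = 0;
        for (size_t pos = 0; pos < high.size() && i < n; ++pos) {
          if (!high[pos]) continue;
          uint64_t val = uint64_t(pos - i) << low_bits;
          for (unsigned b = 0; b < low_bits; ++b)
            val |= uint64_t(low[i * low_bits + b]) << (low_bits - 1 - b);
          out.push_back(val);
          ++i;
        }
        return out;
      }
    };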

- Implement rewriting properly; keep order of blocks etc.; ability to
  remove history?; ability to re-pack metadata?; ability to change other
  metadata properties (e.g. stuff like --set-owner, --set-group, --chmod,
  --no-create-ts, --set-time could all be done on existing metadata, but
  obviously wouldn't be undo-able)

- Packaging of libs added via FetchContent

- Re-assemble global bloom filter rather than merging?
- Use smaller bloom filters for individual blocks?
- Use bigger (non-resettable?) global bloom filter?

- file discovery progress?

- show defaults for categorized options

- take a look at the CPU measurements; those for nilsimsa ordering are
  probably wrong

- segmenter tests with different granularities, block sizes, and any
  other options

- Bloom filters can be wasteful if the lookback gets really long. Maybe
  we can use smaller bloom filters for individual blocks and one or two
  larger "global" bloom filters? It's going to be impossible to rebuild
  those from the smaller filters, though.
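
  Merging only works between filters of identical geometry, which is
  exactly why a larger global filter can't be rebuilt from the per-block
  ones (a minimal sketch, not the dwarfs implementation):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct bloom_filter {
      std::vector<uint64_t> bits; // fixed word count, shared hash functions

      // valid only if both filters have the same size and hash functions;
      // the original hash positions can't be recovered, so bits can't be
      // re-distributed into a filter of a different size
      void merge(bloom_filter const& other) {
        for (size_t i = 0; i < bits.size(); ++i)
          bits[i] |= other.bits[i];
      }
    };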

- Compress long repetitions of the same byte more efficiently. Currently,
  segmentation finds an overlap after about one window size; this goes on
  and on repeatedly, so we end up with a *lot* of chunks pointing to the
  same segment. The smaller the window size, the larger the number of
  chunks. It's definitely a trade-off, as storing large segments of
  repeating bytes is wasteful when mounting the image.

  Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF) hash
  values for window_size bytes to detect long sequences of identical
  bytes.

  OTHER intriguing idea: let a categorizer (could even be the
  incompressible categorizer, but also a "sparse file" categorizer or
  something like that) detect these repetitions up front so the segmenter
  doesn't have to do it (and it can be optional). Then, we can customize
  the segmenter to run *extremely* fast in this case.
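
  A sketch of the pre-computed hash idea using a generic polynomial
  rolling hash (the actual cyclic hash in dwarfs differs); a hash match
  only flags a candidate run, which still needs verifying against the
  actual bytes:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr uint64_t kMul = 0x9E3779B97F4A7C15ULL;

    // hash of an arbitrary window, as the segmenter would compute it
    uint64_t hash_window(uint8_t const* p, size_t n) {
      uint64_t h = 0;
      for (size_t i = 0; i < n; ++i) h = h * kMul + p[i];
      return h;
    }

    // pre-compute, for every byte value, the hash of window_size copies;
    // comparing the current window hash against these detects candidate
    // runs in O(1) per position
    std::array<uint64_t, 256> make_run_hashes(size_t window_size) {
      std::array<uint64_t, 256> hashes{};
      for (unsigned b = 0; b < 256; ++b) {
        uint64_t h = 0;
        for (size_t i = 0; i < window_size; ++i) h = h * kMul + b;
        hashes[b] = h;
      }
      return hashes;
    }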

- Wiki with use cases
  - Perl releases
  - Videos with shared streams
  - Backups of audio files
  - Compression of filesystem images for forensic purposes

- different scenarios for categorized files / chunks:

  - Video files
    - just store without compression, but perform segmentation; nothing special
    - keep in the lookback buffer for longer, as it doesn't cost memory
    - needs a parallelized segmenter (see below)

  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression; see the sketch after
      this list)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store the original size and other information

    This is actually quite easy:

    - Identify PCM audio files (libmagic?)
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to a larger window size for segmentation
      and use a larger lookback
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much
      header data and/or seek points etc. gets stored) or maybe even
      WAVPACK if we don't need perf
    - I *think* this can be done even with the current metadata format
      without any additions
    - The features needed should be largely orthogonal to the features
      needed for the scenarios below
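
    A sketch of the per-chunk layout (flac_compress() and the container
    types are hypothetical stand-ins, not a real libFLAC binding):

      #include <algorithm>
      #include <cstddef>
      #include <cstdint>
      #include <span>
      #include <utility>
      #include <vector>

      // assumed stand-in for a real FLAC encoder invocation
      std::vector<uint8_t> flac_compress(std::span<int16_t const> samples);

      struct pcm_chunk {
        uint64_t original_size;        // decoded size in bytes
        std::vector<uint8_t> payload;  // FLAC-compressed samples
      };

      struct pcm_file {
        std::vector<uint8_t> header;   // e.g. WAV header, stored separately
        std::vector<pcm_chunk> chunks; // independently decompressible
      };

      // split samples into fixed-size chunks so decompression parallelizes
      pcm_file encode_pcm(std::vector<uint8_t> header,
                          std::span<int16_t const> samples,
                          size_t samples_per_chunk) {
        pcm_file out{std::move(header), {}};
        for (size_t i = 0; i < samples.size(); i += samples_per_chunk) {
          auto part = samples.subspan(
              i, std::min(samples_per_chunk, samples.size() - i));
          out.chunks.push_back({part.size() * sizeof(int16_t),
                                flac_compress(part)});
        }
        return out;
      }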

  - Executables, Shared Libs, ...
    - run a filter over blocks that contain a certain kind of binary data
      before compression
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different
      blocks
    - actually quite similar to audio files above, except for the
      additional filter used during compression/decompression

  - JPG
    - just store a recompressed version
    - need to store the original size
    - no need for segmentation, except for exact duplicates

  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)

  - Other compressed formats (gz, xz, ...)
    - decompress
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)

  - It should be possible to treat individual chunks differently, e.g. a
    WAV header should be stored independently from the contents; at some
    point, we might look deeper into tar files and compress individual
    contents differently.

  - in the metadata, we need to know:

    - the fact that a stored inode is "special" (can be reflected in a single bit)
    - the type of "specialness"
    - the original file size

- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change, so we can do this with very little synchronization
  - there are two possible strategies (see the sketch after this list):
    - split the input stream into chunks and then process each chunk in
      a separate thread, checking all n blocks
    - process the input stream in each thread and then only check a
      subset of past blocks (this seems more wasteful, but each thread
      would only operate on a few instead of all bloom filters, which
      could be better from a cache-locality POV)
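
  A sketch of the first strategy (the match layout and the `matcher`
  callback are stand-ins for the real cyclic-hash scan):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <future>
    #include <span>
    #include <vector>

    struct match { size_t input_pos; size_t block_no; size_t block_pos; };

    // split the input into per-thread chunks; each thread checks its
    // chunk against all completed blocks via `matcher`. Real code would
    // overlap chunks by window_size - 1 bytes so matches straddling a
    // boundary aren't missed.
    template <typename Matcher>
    std::vector<match> parallel_prematch(std::span<uint8_t const> input,
                                         size_t num_blocks,
                                         size_t num_threads,
                                         Matcher matcher) {
      size_t const chunk = (input.size() + num_threads - 1) / num_threads;
      std::vector<std::future<std::vector<match>>> futs;
      for (size_t t = 0; t < num_threads; ++t) {
        size_t const beg = t * chunk;
        size_t const end = std::min(beg + chunk, input.size());
        if (beg >= end) break;
        futs.push_back(std::async(std::launch::async, [&, beg, end] {
          std::vector<match> out;
          for (size_t blk = 0; blk < num_blocks; ++blk) {
            auto m = matcher(input.subspan(beg, end - beg), blk, beg);
            out.insert(out.end(), m.begin(), m.end());
          }
          return out;
        }));
      }
      std::vector<match> all;
      for (auto& f : futs) {
        auto part = f.get();
        all.insert(all.end(), part.begin(), part.end());
      }
      return all;
    }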

- similarity size limit to avoid similarity computation for huge files
  - store files without similarity hash first, sorted descending by size

- use streaming interface for zstd decompressor
- json metadata recovery
- handle sparse files?
- try to be more resilient to modifications of the input while creating fs

- dwarfsck:
  - show which entries a block references
  - show partial metadata dumps at lower detail levels

- make dwarfsck more usable
- cleanup TODOs

- folly: dynamic should support string_view
- frozen: ViewBase.getPosition() should be const

- docs, moar tests

- extended attributes:
  - number of blocks
  - number of chunks
  - number of times opened?

- per-file "hotness" (how often was a file opened); dump to file upon umount

- --unpack option

- readahead?

- window-increment-shift seems silly to configure?

- identify blocks that contain mostly binary data and adjust compressor?

- metadata stripping (i.e. re-write metadata without owner/time info)

- metadata repacking (e.g. just recompress/decompress the metadata block)

/*

scanner:
bhw= -   388.3s   13.07 GiB
bhw= 8   812.9s   7.559 GiB
bhw= 9   693.1s   7.565 GiB
bhw=10   651.8s   7.617 GiB
bhw=11   618.7s   7.313 GiB
bhw=12   603.6s   7.625 GiB
bhw=13   591.2s   7.858 GiB
bhw=14   574.1s   8.306 GiB
bhw=15   553.8s   8.869 GiB
bhw=16   541.9s   9.529 GiB

lz4:
                          <---- 1m29.535s / 9m31.212s

lz4hc:
 1 -  20.94s - 2546 MiB
 2 -  21.67s - 2441 MiB
 3 -  24.19s - 2377 MiB
 4 -  27.29s - 2337 MiB
 5 -  31.49s - 2311 MiB
 6 -  36.39s - 2294 MiB
 7 -  42.04s - 2284 MiB
 8 -  48.67s - 2277 MiB
 9 -  56.94s - 2273 MiB  <---- 1m27.979s / 9m20.637s
10 -  68.03s - 2271 MiB
11 -  79.54s - 2269 MiB
12 -  94.84s - 2268 MiB

zstd:
 1 -  11.42s - 1667 MiB
 2 -  12.95s - 1591 MiB  <---- 2m8.351s / 15m25.752s
 3 -  22.03s - 1454 MiB
 4 -  25.64s - 1398 MiB
 5 -  32.34s - 1383 MiB
 6 -  41.45s - 1118 MiB  <---- 2m4.258s / 14m28.627s
 7 -  46.26s - 1104 MiB
 8 -  53.34s - 1077 MiB
 9 -  59.99s - 1066 MiB
10 -  63.3s  - 1066 MiB
11 -  66.97s -  956 MiB  <---- 2m3.496s / 14m17.862s
12 -  79.89s -  953 MiB
13 -  89.8s  -  943 MiB
14 - 118.1s  -  941 MiB
15 - 230s    -  951 MiB
16 - 247.4s  -  863 MiB  <---- 2m11.202s / 14m57.245s
17 - 294.5s  -  854 MiB
18 - 634s    -  806 MiB
19 - 762.5s  -  780 MiB
20 - 776.8s  -  718 MiB  <---- 2m16.448s / 15m43.923s
21 - 990.4s  -  716 MiB
22 - 984.3s  -  715 MiB  <---- 2m18.133s / 15m55.263s

lzma:
level=6:dict_size=21  921.9s - 838.8 MiB  <---- 5m11.219s / 37m36.002s

*/

Perl:
542 versions of perl
found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files
original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting
filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written

bench
                                                                build   real   user
------------------------------------------------------------------------------------
-rw-r--r-- 1 mhx users  14G Jul 27 23:11 perl-install-0.dwarfs    8:05   0:38   0:45
-rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs    6:34   0:14   1:24
-rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs    7:31   0:17   1:11
-rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs   10:11   0:11   0:59
-rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs   11:05   0:14   1:24
-rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs   11:53   0:13   1:15
-rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs   17:42   0:14   1:25
-rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs   20:52   0:20   2:14
-rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs   24:13   0:50   6:02
-rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs   34:40   0:51   5:50

-rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs  17:20
-rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs       71:49

soak:

-7 (cache=1g)

Passed with 542 of 542 combinations.

real    75m21.191s
user    68m3.903s
sys     6m21.020s

-9 (cache=1g)

Passed with 542 of 542 combinations.

real    118m48.371s
user    107m35.685s
sys     7m16.438s

squashfs-opt

real    81m36.957s
user    62m37.369s
sys     20m52.367s

-1 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    2m19.927s
user    0m16.813s
sys     2m4.293s

-7 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    2m24.346s
user    0m17.007s
sys     1m59.823s

squash-default
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    8m41.594s
user    1m25.346s
sys     19m12.036s

squash-opt
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    141m41.092s
user    1m12.650s
sys     59m18.194s