- More "real" benchmarks

- Entry v2

- Mounting lots of images with shared cache?
  - Communication could be implemented using xattrs (see the sketch below)
  - Does WinFsp support multiple mountpoints in a single process?
    - Seems it does!
    - https://github.com/winfsp/winfsp/commit/ae8e4e61f77f24c6267d6fc5fa14bd567c7a88ea
  - Second option would be handling the images internally
    - However, inode numbers aren't straightforward in case of
      adding/removing images.
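
  A minimal sketch of the xattr idea (the xattr name and protocol are
  made up for illustration; dwarfs defines no such interface today):

    // A control client asks the running daemon to attach another image
    // by setting a user xattr on the mountpoint; the daemon would
    // intercept this in its setxattr handler.
    #include <sys/xattr.h>
    #include <cstring>

    int request_mount(char const* mountpoint, char const* image_path) {
      return ::setxattr(mountpoint, "user.dwarfs.mount", image_path,
                        std::strlen(image_path), 0);
    }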

- Metadata rebuilding (see also below). It should be possible to rebuild
  the metadata block and change features like packing without having to
  rebuild the whole image.

- Allow changing the block size during recompression. This should be
  possible, along with updating the chunks in the metadata, and the
  operation should be perfectly reversible. At block boundaries, chunks
  must be split or merged, but that should be easily doable as well.

- Sparse files: implement by adding a "chunk type" field to each chunk;
  for now, there are only "normal" and "sequence" chunks, with sequence
  chunks encoded as follows:

    block:  byte value
    offset: sequence_length / block_size
    size:   sequence_length % block_size

  Will need a feature bit. Also, we probably want a sparse file
  categorizer? (see further below)

  We'll probably have to do this along with refactoring the code to
  support both mmap- and read-based APIs.
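
  A minimal round-trip sketch of the encoding (chunk_type and the field
  reuse are illustrative assumptions, not the actual dwarfs metadata
  schema):

    #include <cstdint>

    enum class chunk_type : uint8_t { NORMAL, SEQUENCE };

    struct chunk {
      chunk_type type;
      uint64_t block;  // NORMAL: block index;  SEQUENCE: repeated byte value
      uint64_t offset; // NORMAL: block offset; SEQUENCE: length / block_size
      uint64_t size;   // NORMAL: chunk size;   SEQUENCE: length % block_size
    };

    // encode a run of `length` copies of `byte` as a sequence chunk
    chunk make_sequence_chunk(uint8_t byte, uint64_t length,
                              uint64_t block_size) {
      return {chunk_type::SEQUENCE, byte, length / block_size,
              length % block_size};
    }

    // recover the run length from a sequence chunk
    uint64_t sequence_length(chunk const& c, uint64_t block_size) {
      return c.offset * block_size + c.size;
    }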

- Add support for logging to a file (with a different level?)

- Add support for libarchive filters in dwarfsextract.

- When hashing, start by only hashing the first, say, 4 KiB, and only
  if the prefix hashes are identical, hash the whole file
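
  A hedged sketch of the two stages (hash_bytes() is a placeholder, not
  the hash function dwarfs actually uses):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string_view>
    #include <unordered_map>
    #include <vector>

    constexpr size_t kPrefixSize = 4096;

    uint64_t hash_bytes(std::string_view data) {
      return std::hash<std::string_view>{}(data); // placeholder hash
    }

    struct file_entry { std::string_view contents; };

    // stage 1: group candidate duplicates by prefix hash
    std::unordered_map<uint64_t, std::vector<file_entry*>>
    bucket_by_prefix(std::vector<file_entry>& files) {
      std::unordered_map<uint64_t, std::vector<file_entry*>> buckets;
      for (auto& f : files) {
        auto n = std::min(kPrefixSize, f.contents.size());
        buckets[hash_bytes(f.contents.substr(0, n))].push_back(&f);
      }
      return buckets;
    }

    // stage 2: only files sharing a bucket ever get fully hashed
    bool probably_identical(file_entry const& a, file_entry const& b) {
      return a.contents.size() == b.contents.size() &&
             hash_bytes(a.contents) == hash_bytes(b.contents);
    }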

- Add support for compressing uncompressed image formats, currently
  primarily FITS, using different formats (e.g. JPEG2K, Rice, Hcompress).

- Fragment counts don't necessarily match up in the presence of errors.

- When opening a file again, check that its timestamp, size, and
  potentially its checksum did not change from when we first saw it.
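
  A sketch using POSIX stat() (checksums could be layered on top):

    #include <sys/stat.h>

    struct file_snapshot {
      off_t size;
      timespec mtime;
    };

    bool snapshot(char const* path, file_snapshot& out) {
      struct stat st;
      if (::stat(path, &st) != 0) return false;
      out.size = st.st_size;
      out.mtime = st.st_mtim; // POSIX.1-2008 nanosecond timestamp
      return true;
    }

    // true if the file still looks like it did when first scanned
    bool unchanged(char const* path, file_snapshot const& seen) {
      file_snapshot now;
      return snapshot(path, now) && now.size == seen.size &&
             now.mtime.tv_sec == seen.mtime.tv_sec &&
             now.mtime.tv_nsec == seen.mtime.tv_nsec;
    }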

- Option to log to a file instead of stderr?

- Use thrift definitions for all options to make them easily
  printable/storable?

- Use Elias-Fano for delta-encoded lists in metadata?
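
  For reference, a compact sketch of the technique (sorted, non-decreasing
  values bounded by `universe`; a real implementation would add a select
  index for random access instead of the sequential decode shown here):

    #include <bit>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct elias_fano {
      unsigned low_bits = 0;
      std::vector<bool> high; // unary-coded high parts
      std::vector<bool> low;  // packed low parts, low_bits each
      size_t n = 0;

      static elias_fano encode(std::vector<uint64_t> const& v,
                               uint64_t universe) {
        elias_fano ef;
        ef.n = v.size();
        if (ef.n && universe / ef.n > 1) // low_bits = floor(log2(u/n))
          ef.low_bits = std::bit_width(universe / ef.n) - 1;
        ef.high.assign(ef.n + (universe >> ef.low_bits) + 1, false);
        for (size_t i = 0; i < ef.n; ++i) {
          ef.high[(v[i] >> ef.low_bits) + i] = true; // i-th high part, unary
          for (unsigned b = ef.low_bits; b-- > 0;)   // low bits, MSB first
            ef.low.push_back((v[i] >> b) & 1);
        }
        return ef;
      }

      std::vector<uint64_t> decode() const {
        std::vector<uint64_t> out;
        size_t i = 0;
        for (size_t pos = 0; pos < high.size() && i < n; ++pos) {
          if (!high[pos]) continue;
          uint64_t val = uint64_t(pos - i) << low_bits;
          for (unsigned b = 0; b < low_bits; ++b)
            val |= uint64_t(low[i * low_bits + b]) << (low_bits - 1 - b);
          out.push_back(val);
          ++i;
        }
        return out;
      }
    };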

- Implement rewriting properly; keep order of blocks etc.; ability to
  remove history?; ability to re-pack metadata?; ability to change other
  metadata properties (e.g. stuff like --set-owner, --set-group, --chmod,
  --no-create-ts, --set-time could all be done on existing metadata, but
  obviously wouldn't be undo-able)

- Packaging of libs added via FetchContent

- Re-assemble global bloom filter rather than merging?
- Use smaller bloom filters for individual blocks?
- Use bigger (non-resettable?) global bloom filter?

- file discovery progress?

- show defaults for categorized options

- take a look at the CPU measurements; those for nilsimsa ordering are
  probably wrong

- segmenter tests with different granularities, block sizes, and any
  other options

- Bloom filters can be wasteful if the lookback gets really long. Maybe
  we can use smaller bloom filters for individual blocks and one or two
  larger "global" bloom filters? It's going to be impossible to rebuild
  those from the smaller filters, though.
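
  Merging only works between filters of identical geometry, which is
  exactly why a larger global filter can't be rebuilt from the per-block
  ones (a minimal sketch, not the dwarfs implementation):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct bloom_filter {
      std::vector<uint64_t> bits; // fixed word count, shared hash functions

      // valid only if both filters have the same size and hash functions;
      // the original hash positions can't be recovered, so bits can't be
      // re-distributed into a filter of a different size
      void merge(bloom_filter const& other) {
        for (size_t i = 0; i < bits.size(); ++i)
          bits[i] |= other.bits[i];
      }
    };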

- Compress long repetitions of the same byte more efficiently. Currently,
  segmentation finds an overlap after about one window size; this goes on
  and on repeatedly, so we end up with a *lot* of chunks pointing to the
  same segment. The smaller the window size, the larger the number of
  chunks. It's definitely a trade-off, as storing large segments of
  repeating bytes is wasteful when mounting the image.

  Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF) hash
  values for window_size bytes to detect long sequences of identical
  bytes.

  OTHER intriguing idea: let a categorizer (could even be the
  incompressible categorizer, but also a "sparse file" categorizer or
  something like that) detect these repetitions up front so the segmenter
  doesn't have to do it (and it can be optional). Then, we can customize
  the segmenter to run *extremely* fast in this case.
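
  A sketch of the pre-computed hash idea using a generic polynomial
  rolling hash (the actual cyclic hash in dwarfs differs); a hash match
  only flags a candidate run, which still needs verifying against the
  actual bytes:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr uint64_t kMul = 0x9E3779B97F4A7C15ULL;

    // hash of an arbitrary window, as the segmenter would compute it
    uint64_t hash_window(uint8_t const* p, size_t n) {
      uint64_t h = 0;
      for (size_t i = 0; i < n; ++i) h = h * kMul + p[i];
      return h;
    }

    // pre-compute, for every byte value, the hash of window_size copies;
    // comparing the current window hash against these detects candidate
    // runs in O(1) per position
    std::array<uint64_t, 256> make_run_hashes(size_t window_size) {
      std::array<uint64_t, 256> hashes{};
      for (unsigned b = 0; b < 256; ++b) {
        uint64_t h = 0;
        for (size_t i = 0; i < window_size; ++i) h = h * kMul + b;
        hashes[b] = h;
      }
      return hashes;
    }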

- Wiki with use cases
  - Perl releases
  - Videos with shared streams
  - Backups of audio files
  - Compression of filesystem images for forensic purposes

- different scenarios for categorized files / chunks:

  - Video files
    - just store without compression, but perform segmentation; nothing special
    - keep in the lookback buffer for longer, as it doesn't cost memory
    - needs a parallelized segmenter (see below)

  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression; see the sketch after
      this list)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store the original size and other information

    This is actually quite easy:

    - Identify PCM audio files (libmagic?)
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to a larger window size for segmentation
      and use a larger lookback
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much
      header data and/or seek points etc. gets stored) or maybe even
      WAVPACK if we don't need perf
    - I *think* this can be done even with the current metadata format
      without any additions
    - The features needed should be largely orthogonal to the features
      needed for the scenarios below
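
    A sketch of the per-chunk layout (flac_compress() and the container
    types are hypothetical stand-ins, not a real libFLAC binding):

      #include <algorithm>
      #include <cstddef>
      #include <cstdint>
      #include <span>
      #include <utility>
      #include <vector>

      // assumed stand-in for a real FLAC encoder invocation
      std::vector<uint8_t> flac_compress(std::span<int16_t const> samples);

      struct pcm_chunk {
        uint64_t original_size;        // decoded size in bytes
        std::vector<uint8_t> payload;  // FLAC-compressed samples
      };

      struct pcm_file {
        std::vector<uint8_t> header;   // e.g. WAV header, stored separately
        std::vector<pcm_chunk> chunks; // independently decompressible
      };

      // split samples into fixed-size chunks so decompression parallelizes
      pcm_file encode_pcm(std::vector<uint8_t> header,
                          std::span<int16_t const> samples,
                          size_t samples_per_chunk) {
        pcm_file out{std::move(header), {}};
        for (size_t i = 0; i < samples.size(); i += samples_per_chunk) {
          auto part = samples.subspan(
              i, std::min(samples_per_chunk, samples.size() - i));
          out.chunks.push_back({part.size() * sizeof(int16_t),
                                flac_compress(part)});
        }
        return out;
      }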

  - Executables, Shared Libs, ...
    - run a filter over blocks that contain a certain kind of binary data
      before compression
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different
      blocks
    - actually quite similar to audio files above, except for the
      additional filter used during compression/decompression

  - JPG
    - just store a recompressed version
    - need to store the original size
    - no need for segmentation, except for exact duplicates

  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)

  - Other compressed formats (gz, xz, ...)
    - decompress
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)

  - It should be possible to treat individual chunks differently, e.g. a
    WAV header should be stored independently from the contents; at some
    point, we might look deeper into tar files and compress individual
    contents differently.

  - in the metadata, we need to know:

    - the fact that a stored inode is "special" (can be reflected in a single bit)
    - the type of "specialness"
    - the original file size

- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change, so we can do this with very little synchronization
  - there are two possible strategies (see the sketch after this list):
    - split the input stream into chunks and then process each chunk in
      a separate thread, checking all n blocks
    - process the input stream in each thread and then only check a
      subset of past blocks (this seems more wasteful, but each thread
      would only operate on a few instead of all bloom filters, which
      could be better from a cache-locality POV)
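
  A sketch of the first strategy (the match layout and the `matcher`
  callback are stand-ins for the real cyclic-hash scan):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <future>
    #include <span>
    #include <vector>

    struct match { size_t input_pos; size_t block_no; size_t block_pos; };

    // split the input into per-thread chunks; each thread checks its
    // chunk against all completed blocks via `matcher`. Real code would
    // overlap chunks by window_size - 1 bytes so matches straddling a
    // boundary aren't missed.
    template <typename Matcher>
    std::vector<match> parallel_prematch(std::span<uint8_t const> input,
                                         size_t num_blocks,
                                         size_t num_threads,
                                         Matcher matcher) {
      size_t const chunk = (input.size() + num_threads - 1) / num_threads;
      std::vector<std::future<std::vector<match>>> futs;
      for (size_t t = 0; t < num_threads; ++t) {
        size_t const beg = t * chunk;
        size_t const end = std::min(beg + chunk, input.size());
        if (beg >= end) break;
        futs.push_back(std::async(std::launch::async, [&, beg, end] {
          std::vector<match> out;
          for (size_t blk = 0; blk < num_blocks; ++blk) {
            auto m = matcher(input.subspan(beg, end - beg), blk, beg);
            out.insert(out.end(), m.begin(), m.end());
          }
          return out;
        }));
      }
      std::vector<match> all;
      for (auto& f : futs) {
        auto part = f.get();
        all.insert(all.end(), part.begin(), part.end());
      }
      return all;
    }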

- similarity size limit to avoid similarity computation for huge files
  - store files without similarity hash first, sorted descending by size

- use streaming interface for zstd decompressor
- json metadata recovery
- handle sparse files?
- try to be more resilient to modifications of the input while creating fs

- dwarfsck:
  - show which entries a block references
  - show partial metadata dumps at lower detail levels

- make dwarfsck more usable
- cleanup TODOs

- folly: dynamic should support string_view
- frozen: ViewBase.getPosition() should be const

- docs, moar tests

- extended attributes:
  - number of blocks
  - number of chunks
  - number of times opened?

- per-file "hotness" (how often was a file opened); dump to file upon umount

- --unpack option

- readahead?

- window-increment-shift seems silly to configure?

- identify blocks that contain mostly binary data and adjust compressor?

- metadata stripping (i.e. re-write metadata without owner/time info)

- metadata repacking (e.g. just recompress/decompress the metadata block)

/*

scanner:
bhw= -   388.3s   13.07 GiB
bhw= 8   812.9s   7.559 GiB
bhw= 9   693.1s   7.565 GiB
bhw=10   651.8s   7.617 GiB
bhw=11   618.7s   7.313 GiB
bhw=12   603.6s   7.625 GiB
bhw=13   591.2s   7.858 GiB
bhw=14   574.1s   8.306 GiB
bhw=15   553.8s   8.869 GiB
bhw=16   541.9s   9.529 GiB

lz4:
                          <---- 1m29.535s / 9m31.212s

lz4hc:
 1 -  20.94s - 2546 MiB
 2 -  21.67s - 2441 MiB
 3 -  24.19s - 2377 MiB
 4 -  27.29s - 2337 MiB
 5 -  31.49s - 2311 MiB
 6 -  36.39s - 2294 MiB
 7 -  42.04s - 2284 MiB
 8 -  48.67s - 2277 MiB
 9 -  56.94s - 2273 MiB  <---- 1m27.979s / 9m20.637s
10 -  68.03s - 2271 MiB
11 -  79.54s - 2269 MiB
12 -  94.84s - 2268 MiB

zstd:
 1 -  11.42s - 1667 MiB
 2 -  12.95s - 1591 MiB  <---- 2m8.351s / 15m25.752s
 3 -  22.03s - 1454 MiB
 4 -  25.64s - 1398 MiB
 5 -  32.34s - 1383 MiB
 6 -  41.45s - 1118 MiB  <---- 2m4.258s / 14m28.627s
 7 -  46.26s - 1104 MiB
 8 -  53.34s - 1077 MiB
 9 -  59.99s - 1066 MiB
10 -  63.3s  - 1066 MiB
11 -  66.97s -  956 MiB  <---- 2m3.496s / 14m17.862s
12 -  79.89s -  953 MiB
13 -  89.8s  -  943 MiB
14 - 118.1s  -  941 MiB
15 - 230s    -  951 MiB
16 - 247.4s  -  863 MiB  <---- 2m11.202s / 14m57.245s
17 - 294.5s  -  854 MiB
18 - 634s    -  806 MiB
19 - 762.5s  -  780 MiB
20 - 776.8s  -  718 MiB  <---- 2m16.448s / 15m43.923s
21 - 990.4s  -  716 MiB
22 - 984.3s  -  715 MiB  <---- 2m18.133s / 15m55.263s

lzma:
level=6:dict_size=21  921.9s - 838.8 MiB  <---- 5m11.219s / 37m36.002s

*/

Perl:
542 versions of perl
found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files
original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting
filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written

bench
                                                                build   real   user
------------------------------------------------------------------------------------
-rw-r--r-- 1 mhx users  14G Jul 27 23:11 perl-install-0.dwarfs    8:05   0:38   0:45
-rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs    6:34   0:14   1:24
-rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs    7:31   0:17   1:11
-rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs   10:11   0:11   0:59
-rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs   11:05   0:14   1:24
-rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs   11:53   0:13   1:15
-rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs   17:42   0:14   1:25
-rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs   20:52   0:20   2:14
-rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs   24:13   0:50   6:02
-rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs   34:40   0:51   5:50

-rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs  17:20
-rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs       71:49

soak:

-7 (cache=1g)

Passed with 542 of 542 combinations.

real    75m21.191s
user    68m3.903s
sys     6m21.020s

-9 (cache=1g)

Passed with 542 of 542 combinations.

real    118m48.371s
user    107m35.685s
sys     7m16.438s

squashfs-opt

real    81m36.957s
user    62m37.369s
sys     20m52.367s

-1 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    2m19.927s
user    0m16.813s
sys     2m4.293s

-7 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    2m24.346s
user    0m17.007s
sys     1m59.823s

squash-default
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    8m41.594s
user    1m25.346s
sys     19m12.036s

squash-opt
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none

real    141m41.092s
user    1m12.650s
sys     59m18.194s