Mirror of https://github.com/mhx/dwarfs.git (synced 2025-09-10 13:04:15 -04:00)
doc: update TODOs

commit f2249f3b6c, parent d59ff62ad7
TODO (+54 lines)
@@ -1,3 +1,46 @@
- Use Elias-Fano for delta-encoded lists in metadata?
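  A possible shape for this (sketch only, not dwarfs code; assumes the
  list is strictly increasing with an upper bound u on its values):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Elias-Fano: split each value into low bits (stored verbatim)
    // and high bits (unary-coded in a bit vector). Takes roughly
    // n * (2 + log2(u / n)) bits for n values; random access needs a
    // select structure on `upper` (omitted here).
    struct elias_fano {
      std::vector<bool> upper;      // unary-coded high parts
      std::vector<uint64_t> lower;  // low parts (a real impl packs them)
      unsigned low_bits{0};

      static elias_fano encode(std::vector<uint64_t> const& v, uint64_t u) {
        elias_fano ef;
        size_t n = v.size();
        if (n > 0 && u > n) {
          ef.low_bits = static_cast<unsigned>(
              std::floor(std::log2(double(u) / double(n))));
        }
        ef.upper.assign(n + (u >> ef.low_bits) + 1, false);
        uint64_t lo_mask = (uint64_t(1) << ef.low_bits) - 1;
        for (size_t i = 0; i < n; ++i) {
          ef.lower.push_back(v[i] & lo_mask);
          ef.upper[(v[i] >> ef.low_bits) + i] = true;  // unary bucket
        }
        return ef;
      }
    };

  A delta-encoded list would presumably be turned back into its prefix
  sums first, since Elias-Fano wants a monotone sequence.
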
- Packaging of libs added via FetchContent

- Remove [ MiB, MiB, MiB ]

- Generic hashing / scanning / categorizing progress?

- Re-assemble global bloom filter rather than merging?

- Use smaller bloom filters for individual blocks?

- Use bigger (non-resettable?) global bloom filter?

- hashing progress? => yes

- file discovery progress?

- reasonable defaults when `--categorize` is given without
  any arguments

- show defaults for categorized options

- scanner / compressor progress contexts?

- file system rewriting with categories :-)

- file system block reordering for bit-identical images
  (does this require a new section type containing categories?)

- take a look at CPU measurements; those for nilsimsa
  ordering are probably wrong

- segmenter tests with different granularities, block sizes,
  and any other options

- configurable number of threads for ordering/segmenting

- Bloom filters can be wasteful if lookback gets really long.
  Maybe we can use smaller bloom filters for individual blocks
  and one or two larger "global" bloom filters? It's going to
  be impossible to rebuild those from the smaller filters,
  though.
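  A minimal sketch of that two-level idea (the single-hash filter and
  the sizes below are made-up assumptions, not dwarfs' implementation):

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Single-hash bloom filter; nbits is always a multiple of 64.
    struct bloom {
      std::vector<uint64_t> bits;
      explicit bloom(size_t nbits) : bits((nbits + 63) / 64, 0) {}
      size_t nbits() const { return bits.size() * 64; }
      void add(uint64_t h) {
        bits[(h % nbits()) >> 6] |= uint64_t(1) << (h & 63);
      }
      bool maybe_contains(uint64_t h) const {
        return (bits[(h % nbits()) >> 6] & (uint64_t(1) << (h & 63))) != 0;
      }
    };

    // One small filter per block plus a large global filter. OR-merging
    // only works between equally sized filters, so the global filter is
    // re-assembled by re-inserting each block's hash values instead.
    struct two_level_filters {
      std::vector<bloom> per_block;
      bloom global{size_t(1) << 26};  // 2^26 bits, arbitrary choice

      void add_block(std::vector<uint64_t> const& hashes) {
        bloom b(size_t(1) << 18);     // 2^18 bits per block, arbitrary
        for (auto h : hashes) {
          b.add(h);
          global.add(h);
        }
        per_block.push_back(std::move(b));
      }
    };

  Re-assembling means keeping the raw hash values around; that costs
  memory, but it lets the global filter be resized or rebuilt at any
  time, which merging the small filters cannot do.
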
- Compress long repetitions of the same byte more efficiently.
  Currently, segmentation finds an overlap after about one
  window size. This goes on and on repeatedly. So we end up
@@ -6,6 +49,17 @@
  It's definitely a trade-off, as storing large segments of
  repeating bytes is wasteful when mounting the image.

  Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF)
  hash values for window_size bytes to detect long sequences of
  identical bytes (see the sketch after this item).

  OTHER intriguing idea: let a categorizer (could even be the
  incompressible categorizer, but also a "sparse file" categorizer
  or something like that) detect these repetitions up front so
  the segmenter doesn't have to do it (and it can be optional).
  Then, we can customize the segmenter to run *extremely* fast
  in this case.
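  A sketch of the pre-computed hash idea (assuming a simple polynomial
  rolling hash as a stand-in for the segmenter's actual hash function):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // For a window of n identical bytes c, a polynomial rolling hash
    //   h = c*B^(n-1) + ... + c*B + c
    // depends only on c and n, so it can be pre-computed once per byte
    // value and compared against the live hash in O(1) to spot runs.
    constexpr uint64_t B = 0x100000001b3ULL;  // arbitrary odd multiplier

    uint64_t same_byte_window_hash(uint8_t c, size_t window_size) {
      uint64_t h = 0;
      for (size_t i = 0; i < window_size; ++i) {
        h = h * B + c;
      }
      return h;
    }

    // All 256 values (or just the two entries for 0x00 and 0xFF).
    std::array<uint64_t, 256> make_run_hashes(size_t window_size) {
      std::array<uint64_t, 256> v{};
      for (int c = 0; c < 256; ++c) {
        v[c] = same_byte_window_hash(uint8_t(c), window_size);
      }
      return v;
    }

  On a match the segmenter would still verify the window bytes (hash
  collisions are possible) and could then skip ahead over the whole run
  instead of re-discovering one window-sized overlap at a time.
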
- Forward compatibility