mirror of https://github.com/mhx/dwarfs.git

doc: update TODOs

parent d59ff62ad7
commit f2249f3b6c

TODO: 54 additions
@@ -1,3 +1,46 @@
- Use Elias-Fano for delta-encoded lists in metadata?
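  A minimal sketch of the technique for reference (illustrative only; the
  struct name and layout are made up, not the dwarfs metadata format).
  Elias-Fano stores the low bits of each value verbatim and the high bits
  in unary, which is compact for dense monotone lists and still allows
  random access:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative Elias-Fano encoder for a sorted, non-decreasing list of
// n values in [0, universe).
struct elias_fano {
  unsigned low_width;            // number of low bits stored verbatim
  std::vector<uint64_t> lows;    // low bits (one word each, for clarity)
  std::vector<bool> highs;       // high bits in unary: bit (x >> l) + i

  elias_fano(std::vector<uint64_t> const& xs, uint64_t universe) {
    size_t n = xs.size();
    low_width = n ? static_cast<unsigned>(std::max(
                        0.0, std::floor(std::log2(double(universe) / double(n)))))
                  : 0;
    highs.assign(n + (universe >> low_width) + 1, false);
    uint64_t mask = low_width ? (uint64_t(1) << low_width) - 1 : 0;
    for (size_t i = 0; i < n; ++i) {
      lows.push_back(xs[i] & mask);
      highs[(xs[i] >> low_width) + i] = true;
    }
  }

  // i-th value = ((position of i-th set bit) - i) << low_width | lows[i].
  uint64_t get(size_t i) const {
    size_t ones = 0;
    for (size_t pos = 0; pos < highs.size(); ++pos) {
      if (highs[pos] && ones++ == i) {
        return ((uint64_t(pos) - i) << low_width) | lows[i];
      }
    }
    assert(!"index out of range");
    return 0;
  }
};
```

  Real codecs pack the low bits tightly and answer get() with an O(1)
  select structure; the linear scan above is only for clarity.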
- Packaging of libs added via FetchContent
- Remove [ MiB, MiB, MiB ]
- Generic hashing / scanning / categorizing progress?
- Re-assemble global bloom filter rather than merging?
- Use smaller bloom filters for individual blocks?
- Use bigger (non-resettable?) global bloom filter?
- hashing progress? => yes
- file discovery progress?
- reasonable defaults when `--categorize` is given without
  any arguments
- show defaults for categorized options
- scanner / compressor progress contexts?
- file system rewriting with categories :-)
- file system block reordering for bit-identical images
  (does this require a new section type containing categories?)
- take a look at CPU measurements, those for nilsimsa
  ordering are probably wrong
- segmenter tests with different granularities, block sizes,
  any other options
- configurable number of threads for ordering/segmenting
- Bloom filters can be wasteful if lookback gets really long.
  Maybe we can use smaller bloom filters for individual blocks
  and one or two larger "global" bloom filters? It's going to
  be impossible to rebuild those from the smaller filters,
  though.
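  A toy sketch of why merging vs. re-assembling matters (size, probe
  count and mixing are all hypothetical, not the dwarfs implementation):
  OR-merging only works between filters of identical geometry, so a
  differently-sized global filter cannot be derived from the per-block
  ones.

```cpp
#include <cstdint>
#include <vector>

// Toy Bloom filter illustrating the merge-vs-reassemble question.
struct bloom_filter {
  std::vector<uint64_t> words;
  uint64_t nbits;

  explicit bloom_filter(uint64_t nbits)
      : words((nbits + 63) / 64), nbits(nbits) {}

  // Derive k probe positions from one 64-bit hash (double hashing).
  static uint64_t probe(uint64_t h, unsigned k) {
    return h + k * ((h >> 17) | (h << 47));
  }

  void add(uint64_t h) {
    for (unsigned k = 0; k < 4; ++k) {
      uint64_t b = probe(h, k) % nbits;
      words[b / 64] |= uint64_t(1) << (b % 64);
    }
  }

  bool maybe_contains(uint64_t h) const {
    for (unsigned k = 0; k < 4; ++k) {
      uint64_t b = probe(h, k) % nbits;
      if (!(words[b / 64] & (uint64_t(1) << (b % 64)))) {
        return false;
      }
    }
    return true;
  }

  // Merging is a bitwise OR -- but only between filters of identical
  // size and probe functions. Positions are reduced mod nbits, so a
  // bigger (or smaller) filter cannot be derived from this one; it
  // has to be re-assembled by re-adding the original hash values.
  void merge(bloom_filter const& other) {
    for (size_t i = 0; i < words.size(); ++i) {
      words[i] |= other.words[i];
    }
  }
};
```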
- Compress long repetitions of the same byte more efficiently.
  Currently, segmentation finds an overlap after about one
  window size. This goes on and on repeatedly. So we end up
@@ -6,6 +49,17 @@
  It's definitely a trade-off, as storing large segments of
  repeating bytes is wasteful when mounting the image.

  Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF)
  hash values for window_size bytes to detect long sequences of
  identical bytes.
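  A sketch of that idea, using an FNV-style hash purely as a stand-in
  for the segmenter's actual rolling hash:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in window hash (FNV-1a); the segmenter's real rolling hash
// would be used here instead.
uint64_t window_hash(uint8_t const* p, size_t w) {
  uint64_t h = 14695981039346656037ull;
  for (size_t i = 0; i < w; ++i) {
    h ^= p[i];
    h *= 1099511628211ull;
  }
  return h;
}

// Once at startup, hash a window_size-byte buffer filled with each of
// the 256 byte values (or just 0x00 and 0xFF if that's all we care
// about).
std::array<uint64_t, 256> precompute_run_hashes(size_t window_size) {
  std::array<uint64_t, 256> table{};
  std::vector<uint8_t> buf(window_size);
  for (unsigned c = 0; c < 256; ++c) {
    std::fill(buf.begin(), buf.end(), static_cast<uint8_t>(c));
    table[c] = window_hash(buf.data(), window_size);
  }
  return table;
}
```

  The segmenter then compares each incoming window hash against the
  table; on a match (verified against the actual bytes), it has found a
  run of at least window_size identical bytes without any lookback.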
  OTHER intriguing idea: let a categorizer (could even be the
  incompressible categorizer, but also a "sparse file" categorizer
  or something like that) detect these repetitions up front so
  the segmenter doesn't have to do it (and it can be optional).
  Then, we can customize the segmenter to run *extremely* fast
  in this case.
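  A sketch of what such an up-front scan could look like (hypothetical
  interface, not an existing dwarfs categorizer API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct byte_run {
  size_t offset;
  size_t length;
  uint8_t value;
};

// One linear pass over the input, as a hypothetical "sparse file"
// style categorizer might do it.
std::vector<byte_run> find_byte_runs(uint8_t const* data, size_t size,
                                     size_t min_length) {
  std::vector<byte_run> runs;
  size_t i = 0;
  while (i < size) {
    size_t j = i + 1;
    while (j < size && data[j] == data[i]) {
      ++j;
    }
    if (j - i >= min_length) {
      runs.push_back({i, j - i, data[i]});
    }
    i = j;
  }
  return runs;
}
```

  With the runs known ahead of time, the segmenter can skip them
  wholesale instead of discovering the repetition one window at a time.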
- Forward compatibility