From f2249f3b6cb0a70c3b4f30f5882205a9e0283db0 Mon Sep 17 00:00:00 2001
From: Marcus Holland-Moritz
Date: Wed, 8 Nov 2023 22:33:15 +0100
Subject: [PATCH] doc: update TODOs

---
 TODO | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/TODO b/TODO
index d32d61fc..20c08f05 100644
--- a/TODO
+++ b/TODO
@@ -1,3 +1,46 @@
+- Use Elias-Fano for delta-encoded lists in metadata?
+
+- Packaging of libs added via FetchContent
+- Remove [ MiB, MiB, MiB ]
+- Generic hashing / scanning / categorizing progress?
+
+- Re-assemble global bloom filter rather than merging?
+- Use smaller bloom filters for individual blocks?
+- Use bigger (non-resettable?) global bloom filter?
+
+
+
+- hashing progress? => yes
+
+- file discovery progress?
+
+- reasonable defaults when `--categorize` is given without
+  any arguments
+
+- show defaults for categorized options
+
+- scanner / compressor progress contexts?
+
+- file system rewriting with categories :-)
+
+- file system block reordering for bit-identical images
+  (does this require a new section type containing categories?)
+
+- take a look at CPU measurements, those for nilsimsa
+  ordering are probably wrong
+
+- segmenter tests with different granularities, block sizes,
+  any other options
+
+- configurable number of threads for ordering/segmenting
+
+
+- Bloom filters can be wasteful if lookback gets really long.
+  Maybe we can use smaller bloom filters for individual blocks
+  and one or two larger "global" bloom filters? It's going to
+  be impossible to rebuild those from the smaller filters,
+  though.
+
 - Compress long repetitions of the same byte more efficiently.
   Currently, segmentation finds an overlap after about one window
   size. This goes on and on repeatedly. So we end up