Leave a few notes in the TODO file

2025-08-04 02:06:22 -04:00 · 2023-07-15 15:54:10 +02:00 · 2023-07-15 15:54:10 +02:00 · 276de32042
commit 276de32042
parent 6de6479ca1
1 changed file with 67 additions and 0 deletions
--- a/TODO
+++ b/TODO
@@ -1,3 +1,70 @@
+- different scenarios for categorized files / chunks:
+
+  - Video files
+    - just store without compression, but perform segmentation, nothing special
+    - keep in lookback buffer for longer, as it doesn't cost memory
+    - needs parallelized segmenter (see below)
+
+  - PCM audio (see also github #95)
+    - segment first in case of e.g. different trims or other types of overlap
+    - split into chunks (for parallel decompression)
+    - compress each chunk as flac
+    - headers to be saved separately
+    - need to store original size and other information
+
+    This is actually quite easy:
+
+    - Identify PCM audio files (libmagic?)
+    - Use libsndfile for parsing
+    - Nilsimsa similarity works surprisingly well
+    - We can potentially switch to larger window size for segmentation and use
+      larger lookback
+    - Group by format (# of channels, resolution, endian-ness, signedness, sample rate)
+    - Run segmentation as usual
+    - Compress each block using FLAC (hopefully we can configure how much header data
+      and/or seek points etc. gets stored) or maybe even WAVPACK if we don't need perf
+    - I *think* this can be done even with the current metadata format without any
+      additions
+    - The features needed should be largely orthogonal to the features needed for the
+      scenarios below
+
+  - Executables, Shared Libs, ...
+    - run filter over blocks that contain a certain kind of binary data before
+      compression
+    - a single binary may contain machine code for different architectures,
+      so we may have to store different parts of the binary in different blocks
+    - actually quite similar to audio files above, except for the additional
+      filter used during compression/decompression
+
+  - JPG
+    - just store a recompressed version
+    - need to store original size
+    - no need for segmentation except for exact 
+
+  - PDF
+    - decompress contents
+    - then segment
+    - then compress along with other PDFs (or documents in general)
+
+  - Other compressed formats (gz, xz, ...)
+    - decompress
+    - segment
+    - compress
+    - essentially like PDF
+    - maybe only do this for small files? (option for size limit?)
+
+  - It should be possible to treat individual chunks differently, e.g.
+    WAV-header should be stored independently from contents; at some
+    point, we might look deeper into tar files and compress individual
+    contents differently.
+
+- in the metadata, we need to know:
+
+  - the fact that a stored inode is "special" (can be reflected in a single bit)
+  - the type of "specialness"
+  - the original file size
+
+
 - multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change and so we can do this with very little synchronization
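
The notes added in this commit ask for three pieces of per-inode metadata (a "special" flag, the type of specialness, the original size) and for PCM files to be grouped by format before segmentation. The following is only a rough illustration of how those two ideas could fit together, not the project's actual metadata format. All names here (`Category`, `PcmFormat`, `InodeInfo`, `group_pcm_by_format`) are hypothetical, and the format fields are assumed to have already been extracted by some parser such as libsndfile.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import NamedTuple, Optional


class Category(Enum):
    # Hypothetical labels for the "type of specialness" listed in the notes.
    PCM_AUDIO = 1
    EXECUTABLE = 2
    JPG = 3
    PDF = 4
    COMPRESSED = 5


class PcmFormat(NamedTuple):
    # The grouping key from the notes: # of channels, resolution,
    # endian-ness, signedness, sample rate.
    channels: int
    bits_per_sample: int
    big_endian: bool
    signed: bool
    sample_rate: int


@dataclass(frozen=True)
class InodeInfo:
    path: str
    original_size: int                     # size before any transformation
    category: Optional[Category] = None    # None means an ordinary inode
    pcm_format: Optional[PcmFormat] = None

    @property
    def is_special(self) -> bool:
        # The "single bit": derived here from whether a category is set.
        return self.category is not None


def group_pcm_by_format(inodes):
    """Collect paths of PCM inodes under their shared format key."""
    groups = defaultdict(list)
    for inode in inodes:
        if inode.category is Category.PCM_AUDIO and inode.pcm_format is not None:
            groups[inode.pcm_format].append(inode.path)
    return dict(groups)
```

Each resulting group could then be segmented "as usual" and compressed block by block, while inodes with no category take the ordinary path. Whether the flag is stored explicitly or derived as above is a design choice the notes leave open.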