mirror of
https://github.com/mhx/dwarfs.git
synced 2025-08-04 02:06:22 -04:00
Leave a few notes in the TODO file
commit 276de32042
parent 6de6479ca1
TODO | 67
@@ -1,3 +1,70 @@
- different scenarios for categorized files / chunks:

  - Video files
    - just store without compression, but perform segmentation, nothing special
    - keep in lookback buffer for longer, as it doesn't cost memory
    - needs parallelized segmenter (see below)
  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store original size and other information
    This is actually quite easy:

    - Identify PCM audio files (libmagic?)
    - Use libsndfile for parsing
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to a larger window size for segmentation
      and use a larger lookback
    - Group by format (# of channels, resolution, endianness, signedness,
      sample rate)
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much
      header data and/or seek points etc. gets stored), or maybe even
      WavPack if we don't need perf
    - I *think* this can be done even with the current metadata format
      without any additions
    - The features needed should be largely orthogonal to the features
      needed for the scenarios below
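The grouping step could be sketched like this. A minimal sketch in Python; `PcmFormat` and its fields are hypothetical stand-ins for whatever libsndfile reports, not dwarfs code:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical format descriptor; in practice these fields would come
# from parsing the file headers with libsndfile.
@dataclass(frozen=True)
class PcmFormat:
    channels: int
    bits_per_sample: int
    big_endian: bool
    signed: bool
    sample_rate: int

def group_by_format(files):
    """Group (path, PcmFormat) pairs so that each group can be segmented
    and FLAC-compressed with one consistent set of parameters."""
    groups = defaultdict(list)
    for path, fmt in files:
        groups[fmt].append(path)
    return dict(groups)

files = [
    ("a.wav", PcmFormat(2, 16, False, True, 44100)),
    ("b.wav", PcmFormat(2, 16, False, True, 44100)),
    ("c.wav", PcmFormat(1, 8, False, False, 8000)),
]
groups = group_by_format(files)
# a.wav and b.wav share a format, so they land in the same group
```

Using a frozen dataclass makes the format tuple hashable, so it can serve directly as the group key.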
  - Executables, Shared Libs, ...
    - run filter over blocks that contain a certain kind of binary data
      before compression
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different
      blocks
    - actually quite similar to audio files above, except for the additional
      filter used during compression/decompression
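The kind of filter meant here can be illustrated with a simplified CALL-displacement rewrite in the spirit of xz's BCJ filters. This is only a sketch under assumed simplifications, not dwarfs code; real filters handle more opcodes, alignment, and validity checks:

```python
import struct

def e8_filter(data: bytes, encode: bool) -> bytes:
    """Rewrite the 4-byte displacement after each 0xE8 (x86 CALL) opcode
    from call-site-relative to absolute (encode=True) or back. Two calls
    to the same function then produce identical byte sequences, which
    segmentation and compression can exploit."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:
            (disp,) = struct.unpack_from("<I", out, i + 1)
            disp = (disp + i) & 0xFFFFFFFF if encode else (disp - i) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, disp)
            i += 5  # skip the operand we just rewrote
        else:
            i += 1
    return bytes(out)

# Two call sites at different offsets targeting the same address.
code = bytearray(b"\x90" * 64)
for pos in (5, 30):
    code[pos] = 0xE8
    struct.pack_into("<I", code, pos + 1, (0x1234 - pos) & 0xFFFFFFFF)
code = bytes(code)

filtered = e8_filter(code, True)
# After filtering, both displacements equal the absolute target 0x1234,
# and the transform is exactly reversible.
```

Both directions scan the stream identically and skip the operand bytes they touch, which is what makes the transform lossless.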
  - JPG
    - just store a recompressed version
    - need to store original size
    - no need for segmentation except for exact
  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)
  - Other compressed formats (gz, xz, ...)
    - decompress
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)
- It should be possible to treat individual chunks differently, e.g.
  WAV-header should be stored independently from contents; at some
  point, we might look deeper into tar files and compress individual
  contents differently.
- in the metadata, we need to know:

  - the fact that a stored inode is "special" (can be reflected in a single bit)
  - the type of "specialness"
  - the original file size
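A minimal sketch of such an encoding; every constant and the layout here are hypothetical illustrations, not the actual dwarfs metadata format:

```python
import struct

# Hypothetical layout: one type byte whose top bit is the "special"
# flag, followed by the original (uncompressed) file size as uint64.
SPECIAL_FLAG = 0x80
TYPE_FLAC = 0x01            # example "specialness": FLAC-compressed PCM
TYPE_RECOMPRESSED_JPG = 0x02

def encode_inode(special_type, original_size):
    type_byte = SPECIAL_FLAG | special_type if special_type else 0
    return struct.pack("<BQ", type_byte, original_size)

def decode_inode(blob):
    type_byte, original_size = struct.unpack("<BQ", blob)
    is_special = bool(type_byte & SPECIAL_FLAG)
    return is_special, type_byte & ~SPECIAL_FLAG, original_size
```

The single flag bit is what lets ordinary inodes stay on the fast path: readers only look at the type and original size when the bit is set.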
- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change and so we can do this with very little synchronization
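The pre-computation idea can be sketched with a toy polynomial rolling hash (not dwarfs' actual cyclic hash): once a block is complete, its window hashes never change, so the per-block index can be built once and then probed by matcher threads read-only, without locks:

```python
# Toy polynomial rolling hash; assumed parameters, not dwarfs' implementation.
BASE = 257
MOD = (1 << 61) - 1

def window_hashes(block: bytes, window: int):
    """All rolling hashes of `window`-sized windows, in O(len(block))."""
    if len(block) < window:
        return []
    h = 0
    for b in block[:window]:
        h = (h * BASE + b) % MOD
    out = [h]
    top = pow(BASE, window - 1, MOD)  # weight of the outgoing byte
    for i in range(window, len(block)):
        h = (h - block[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + block[i]) % MOD          # pull in the new byte
        out.append(h)
    return out

def build_match_index(block: bytes, window: int):
    """Map hash -> window offsets for one completed, immutable block.
    Matchers can probe this concurrently with no synchronization."""
    index = {}
    for off, h in enumerate(window_hashes(block, window)):
        index.setdefault(h, []).append(off)
    return index
```

Because completed blocks are immutable, the only synchronization needed is the hand-off of the finished index from the builder thread to the matchers.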