Leave a few notes in the TODO file
parent 6de6479ca1
commit 276de32042

TODO | 67

@@ -1,3 +1,70 @@
- different scenarios for categorized files / chunks:

  - Video files
    - just store without compression, but perform segmentation, nothing special
    - keep in lookback buffer for longer, as it doesn't cost memory
    - needs parallelized segmenter (see below)

  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store original size and other information

    This is actually quite easy (rough sketch below):

    - Identify PCM audio files (libmagic?)
    - Use libsndfile for parsing
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to larger window size for segmentation and use
      larger lookback
    - Group by format (# of channels, resolution, endian-ness, signedness, sample rate)
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much header data
      and/or seek points etc. gets stored) or maybe even WAVPACK if we don't need perf
    - I *think* this can be done even with the current metadata format without any
      additions
    - The features needed should be largely orthogonal to the features needed for the
      scenarios below

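    A rough sketch of the detection / grouping step, using libmagic and
    libsndfile as suggested above (the function names are made up for this
    note, they are not existing dwarfs code):

      // Sketch only: cheap PCM detection plus a grouping key built from the
      // parameters that must match for data to end up in the same block.
      #include <magic.h>
      #include <sndfile.h>
      #include <optional>
      #include <string>

      // Cheap pre-filter: does libmagic think this is an audio file?
      bool is_probably_audio(char const* path) {
        magic_t m = magic_open(MAGIC_MIME_TYPE);
        if (!m) { return false; }
        magic_load(m, nullptr);  // default magic database
        char const* mime = magic_file(m, path);
        bool audio = mime && std::string(mime).rfind("audio/", 0) == 0;
        magic_close(m);
        return audio;
      }

      // Grouping key: channels / sample rate / sample format; the format
      // field encodes resolution, signedness and endian-ness.
      std::optional<std::string> pcm_group_key(char const* path) {
        SF_INFO info{};  // must be zero-initialized before sf_open()
        SNDFILE* sf = sf_open(path, SFM_READ, &info);
        if (!sf) { return std::nullopt; }  // libsndfile can't parse it
        sf_close(sf);
        return std::to_string(info.channels) + "/" +
               std::to_string(info.samplerate) + "/" +
               std::to_string(info.format & SF_FORMAT_SUBMASK) + "/" +
               std::to_string(info.format & SF_FORMAT_ENDMASK);
      }
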
  - Executables, Shared Libs, ...
    - run filter over blocks that contain a certain kind of binary data before
      compression (see the sketch below)
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different blocks
    - actually quite similar to audio files above, except for the additional
      filter used during compression/decompression

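    One way to get such a filter without writing it ourselves is liblzma's
    built-in BCJ filters; the sketch below only illustrates the idea and is
    not how the dwarfs block compressors are actually wired up:

      // Compress a block of x86 machine code with liblzma's BCJ
      // (branch/call/jump) filter in front of LZMA2. The filter rewrites
      // relative branch targets so the code compresses much better.
      #include <lzma.h>
      #include <cstdint>
      #include <stdexcept>
      #include <vector>

      std::vector<uint8_t> compress_x86_block(uint8_t const* data, size_t size) {
        lzma_options_lzma opt;
        if (lzma_lzma_preset(&opt, 6)) {  // preset level 6
          throw std::runtime_error("lzma_lzma_preset failed");
        }
        lzma_filter filters[] = {
            {LZMA_FILTER_X86, nullptr},    // BCJ filter for x86 code
            {LZMA_FILTER_LZMA2, &opt},
            {LZMA_VLI_UNKNOWN, nullptr},   // terminator
        };
        std::vector<uint8_t> out(lzma_stream_buffer_bound(size));
        size_t out_pos = 0;
        if (lzma_stream_buffer_encode(filters, LZMA_CHECK_CRC32, nullptr, data,
                                      size, out.data(), &out_pos,
                                      out.size()) != LZMA_OK) {
          throw std::runtime_error("lzma_stream_buffer_encode failed");
        }
        out.resize(out_pos);
        return out;
      }

    liblzma also ships equivalent filters for ARM, ARM-Thumb, PowerPC, SPARC
    and IA64, which is one reason the per-architecture block split mentioned
    above matters.
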
  - JPG
    - just store a recompressed version
    - need to store original size
    - no need for segmentation except for exact

  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)

  - Other compressed formats (gz, xz, ...)
    - decompress (rough sketch below)
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)

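    The decompression step itself is the easy part; here is a sketch for
    gzip/zlib input using zlib's automatic header detection (PDF stream
    objects are zlib-compressed as well, so the same call covers that case).
    The hard part, not shown here, is remembering enough to re-pack the file
    bit-exactly on extraction:

      // Sketch: inflate a gzip/zlib buffer so the segmenter sees the
      // uncompressed payload. windowBits = 15 + 32 enables automatic
      // zlib/gzip header detection.
      #include <zlib.h>
      #include <cstdint>
      #include <stdexcept>
      #include <vector>

      std::vector<uint8_t> inflate_for_segmenter(std::vector<uint8_t> const& in) {
        z_stream zs{};
        if (inflateInit2(&zs, 15 + 32) != Z_OK) {
          throw std::runtime_error("inflateInit2 failed");
        }
        std::vector<uint8_t> out;
        std::vector<uint8_t> buf(64 * 1024);
        zs.next_in = const_cast<Bytef*>(in.data());
        zs.avail_in = static_cast<uInt>(in.size());
        int rv = Z_OK;
        do {
          zs.next_out = buf.data();
          zs.avail_out = static_cast<uInt>(buf.size());
          rv = inflate(&zs, Z_NO_FLUSH);
          if (rv != Z_OK && rv != Z_STREAM_END) {
            inflateEnd(&zs);
            throw std::runtime_error("inflate failed");
          }
          out.insert(out.end(), buf.data(),
                     buf.data() + (buf.size() - zs.avail_out));
        } while (rv != Z_STREAM_END);
        inflateEnd(&zs);
        return out;
      }
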
- It should be possible to treat individual chunks differently, e.g.
  WAV-header should be stored independently from contents; at some
  point, we might look deeper into tar files and compress individual
  contents differently.

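  For the WAV case, splitting header from contents is just a small RIFF
  chunk walk; rough sketch (not dwarfs code, assumes a well-formed file on a
  little-endian host):

    // Find offset/length of the 'data' chunk in a RIFF/WAVE file so that
    // the header (everything before it) and the PCM payload can be stored
    // as separate chunks.
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <vector>

    struct wav_split { size_t data_offset; size_t data_size; };

    std::optional<wav_split> find_wav_data(std::vector<uint8_t> const& file) {
      if (file.size() < 12 || std::memcmp(file.data(), "RIFF", 4) != 0 ||
          std::memcmp(file.data() + 8, "WAVE", 4) != 0) {
        return std::nullopt;
      }
      size_t pos = 12;  // first sub-chunk starts right after the RIFF header
      while (pos + 8 <= file.size()) {
        uint32_t size;
        std::memcpy(&size, file.data() + pos + 4, 4);  // little-endian size
        if (std::memcmp(file.data() + pos, "data", 4) == 0) {
          return wav_split{pos + 8, size};
        }
        pos += 8 + size + (size & 1);  // chunks are 16-bit aligned
      }
      return std::nullopt;
    }
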
- in the metadata, we need to know:

  - the fact that a stored inode is "special" (can be reflected in a single bit)
  - the type of "specialness"
  - the original file size

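  To make the shape of this concrete, the per-inode extension could be as
  small as the following (purely hypothetical layout, not the actual frozen
  metadata schema):

    // Hypothetical per-inode extension; not the real dwarfs metadata.
    // The single "special" bit can live in existing flag storage, and the
    // rest only needs to exist for inodes that have the bit set.
    #include <cstdint>

    enum class special_kind : uint8_t {
      pcm_audio,     // chunks are FLAC-compressed PCM
      recompressed,  // e.g. JPEG stored in recompressed form
      transformed,   // e.g. gz/xz/PDF contents stored decompressed
    };

    struct special_inode_info {
      special_kind kind;       // the type of "specialness"
      uint64_t original_size;  // original (on-disk) file size
    };
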
- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change and so we can do this with very little synchronization

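  Rough sketch of why this needs almost no synchronization: each completed
  block is immutable, so a worker can fill a hash table for it without any
  locking, and the only synchronization is publishing the finished table to
  the matcher (window size and hash below are placeholders, not the actual
  segmenter parameters):

    // Per-block pre-computation of (cyclic hash -> offsets) tables for
    // completed, immutable blocks.
    #include <cstdint>
    #include <future>
    #include <memory>
    #include <unordered_map>
    #include <vector>

    constexpr size_t kWindowSize = 4096;  // placeholder match window

    // Placeholder hash; the real thing would be the segmenter's cyclic
    // hash, updated incrementally instead of recomputed per window.
    uint32_t window_hash(uint8_t const* p) {
      uint32_t h = 2166136261u;
      for (size_t i = 0; i < kWindowSize; ++i) { h = (h ^ p[i]) * 16777619u; }
      return h;
    }

    using offset_table = std::unordered_map<uint32_t, std::vector<uint32_t>>;

    // Build the table for one completed block; no shared state is touched.
    std::shared_ptr<offset_table const>
    build_table(std::vector<uint8_t> const& block) {
      auto table = std::make_shared<offset_table>();
      for (size_t off = 0; off + kWindowSize <= block.size(); ++off) {
        (*table)[window_hash(block.data() + off)].push_back(
            static_cast<uint32_t>(off));
      }
      return table;
    }

    // The only synchronization point: futures publishing finished tables.
    std::vector<std::shared_future<std::shared_ptr<offset_table const>>>
    precompute(std::vector<std::vector<uint8_t>> const& completed_blocks) {
      std::vector<std::shared_future<std::shared_ptr<offset_table const>>> fs;
      for (auto const& blk : completed_blocks) {
        fs.push_back(
            std::async(std::launch::async, build_table, std::cref(blk)).share());
      }
      return fs;
    }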