GET_NUM_COUNTERS() actually returns a value about 4 times higher than
intended, due to missing parentheses; ((num_syms) + 3 / 4) is supposed
to be (((num_syms) + 3) / 4), or equivalently DIV_ROUND_UP(num_syms, 4)
as it was in my original code from wimlib commit 394751ae1302. This
value affects the performance of Huffman code generation.
But oddly enough, I'm measuring that the actual behavior results in
better performance than the intended behavior. This could be due to
differences between DEFLATE and LZX (the compression format I had
originally tuned this value to), or a different CPU, or both. Moreover,
I can't make it significantly faster by trying other values.
So, make GET_NUM_COUNTERS() intentionally expand to simply num_syms.
Omit the rounding up to a multiple of 4, which doesn't seem helpful.
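In code form (the first two fragments are quoted from the discussion
above; the last line is the new definition):

    /* Buggy: integer division makes 3 / 4 == 0, so this reduces to
     * just num_syms -- about 4x the intended value. */
    ((num_syms) + 3 / 4)

    /* Intended: divide by 4, rounding up, i.e.
     * DIV_ROUND_UP(num_syms, 4). */
    (((num_syms) + 3) / 4)

    /* New: keep the measured-faster behavior, made explicit. */
    #define GET_NUM_COUNTERS(num_syms)  (num_syms)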
Solaris already defines 'len_t' in <sys/types.h>, which causes a build
error.
This typedef isn't important, so just remove it and use u8 directly.
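A minimal illustration of the conflict (the array name and size below
are placeholders, not the actual code):

    #include <sys/types.h>  /* on Solaris, this already defines len_t */

    typedef u8 len_t;       /* error: redefinition of 'len_t' */

    /* Fix: drop the typedef and use u8 directly, e.g.: */
    u8 lens[MAX_SYMS];      /* 'MAX_SYMS' is a placeholder */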
Fixes https://github.com/ebiggers/libdeflate/issues/159
Currently, if the decompressor runs out of input, it may spin until it
completely fills the output buffer, at which point it returns
LIBDEFLATE_INSUFFICIENT_SPACE. This occurs because the decompressor
implicitly appends zeroes to the input. It's designed this way because
precisely detecting overreads in the inner loop would hurt performance,
as normally it's hard to distinguish between real overreads and the
expected overreads that occur due to reading variable-length codewords.
This behavior is fine for decompressing with a known output size, which
is the recommended use case of libdeflate. However, it's not so great
for decompressing with an unknown output size, as it may cause the user
to allocate an increasingly large buffer until they run out of memory.
Be more friendly to this use case by checking for excessive overreads
more often, and returning LIBDEFLATE_BAD_DATA when one is detected. In theory
the new check won't hurt performance, as it only happens when the end of
the input has been reached, which was already being checked for.
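A rough sketch of the idea (names are hypothetical; the real code
differs):

    /* The refill path already detects reaching the end of the real
     * input, at which point it starts supplying implicit zeroes.
     * Count those and fail once they are clearly excessive, instead
     * of spinning until the output buffer fills up. */
    if (in_next == in_end) {
        if (++overread_count > MAX_OVERREAD_BYTES)  /* hypothetical */
            return LIBDEFLATE_BAD_DATA;
    }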
Fixes https://github.com/ebiggers/libdeflate/issues/157
Don't allocate an obviously-too-large buffer for uncompressed data if
ISIZE becomes corrupted in a small file. This isn't too effective, as
we still must allow over 1000x expansion, but we might as well do this.
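A sketch of the kind of check involved (hypothetical names; DEFLATE's
theoretical maximum expansion is a bit over 1000x, about 1032x):

    /* Don't trust an ISIZE that implies more expansion than DEFLATE
     * could possibly produce from this much compressed input. */
    if (isize_hint / 1032 > in_nbytes)
        isize_hint = 0;   /* ignore the corrupted hint */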
Update https://github.com/ebiggers/libdeflate/issues/157
The block splitting algorithm works by examining successive chunks of
data and ending the block when a chunk differs significantly from the
rest of the block. Currently, the data chunk where the change is
detected is included in the block, which is suboptimal -- after all, we
know that it's different from the rest of the block. Better results
could be achieved by ending the block just before the chunk.
Implement this in the near-optimal compressor. This slightly improves
its compression ratio.
Note: I also tested an implementation of this for the lazy compressor.
It improves compression ratio too, but it doesn't seem worthwhile there
from a performance and complexity standpoint.
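In the near-optimal compressor, the change looks roughly like this
(hypothetical sketch; variable names simplified):

    /* When the latest chunk's statistics differ significantly from
     * the rest of the block, end the block where that chunk began
     * instead of after it. */
    if (should_end_block(&c->split_stats, in_block_begin, in_next, in_end))
        in_block_end = chunk_begin;   /* exclude the differing chunk */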
The MIN_BLOCK_LENGTH of 10000 is too high; shorter blocks are sometimes
worthwhile. Reduce MIN_BLOCK_LENGTH to 5000, but penalize very short
blocks so they are only chosen when they clearly seem worthwhile.
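One way such a penalty can be applied in the end-of-block decision
(hypothetical names and formula; the real tuning differs):

    /* Require a proportionally larger statistical difference to end
     * a block that is still short, so very short blocks are chosen
     * only when the difference is unmistakable. */
    cutoff = base_cutoff(stats);                      /* hypothetical */
    if (block_length < 2 * MIN_BLOCK_LENGTH)
        cutoff += cutoff * (2 * MIN_BLOCK_LENGTH - block_length) /
                  (2 * MIN_BLOCK_LENGTH);
    end_block = (total_delta >= cutoff);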
When choosing which literals and matches to use for the block split
statistics in deflate_compress_near_optimal(), take into account the
min_len heuristic. This should give slightly more accurate results.
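Roughly, the idea (hypothetical sketch):

    /* Only tally a match that the parser could actually use, i.e.
     * one meeting the chosen minimum match length; otherwise tally
     * the byte as a literal. */
    if (best_len >= min_len)
        observe_match(&c->split_stats, best_len);
    else
        observe_literal(&c->split_stats, *in_next);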
It no longer seems worthwhile to have bt_matchfinder_get_matches()
return best_len separately, especially since it didn't work as
expected: it didn't handle length-3 matches.
Ensure that whenever deflate_choose_*() is called to choose a literal or
match, observe_*() is also called to tally it in the block split
statistics (except in deflate_compress_fastest(), which doesn't use the
block split statistics). Previously, the lazy compressor contained an
optimization that made the block split statistics differ from the actual
lazy parse. But that optimization no longer seems to be worthwhile.
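With that optimization removed, each parse decision pairs the two
calls, roughly like this (sketch; arguments simplified):

    if (length >= min_len) {
        observe_match(&c->split_stats, length);
        deflate_choose_match(c, length, offset, &seq);   /* simplified */
    } else {
        observe_literal(&c->split_stats, *in_next);
        deflate_choose_literal(c, *in_next, &litrunlen); /* simplified */
    }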
For fastest, greedy, lazy, and lazy2: save memory by reducing the length
of the sequence store and forcing a split when it fills up (see the
sketch below).
For fastest: increase the max block length, but use a relatively short
sequence store that will cause shorter blocks to be used often.
For all: allow the final block to exceed the soft maximum length if it
avoids having to create a block below the minimum length.
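With a shorter sequence store, "store full" simply becomes one more
reason to end the current block (hypothetical sketch):

    if (seq_store_full(c) ||                          /* hypothetical */
        should_end_block(&c->split_stats, in_block_begin, in_next, in_end))
        break;   /* flush the current block */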
Introduce deflate_compress_fastest(), and use it for level 1. It uses
the ht_matchfinder instead of the hc_matchfinder, and it skips the block
splitting algorithm. This speeds up level 1, without changing the
compression ratio too much (relative to what is expected for level 1).
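Conceptually, the ht_matchfinder replaces hash chains with a small
hash table that remembers only the most recent position(s) per hash,
so there is no chain to walk (hypothetical sketch; helper names are
placeholders):

    u32 candidate = hash_tab[hash];                  /* one lookup */
    hash_tab[hash] = cur_pos;                        /* one insertion */
    u32 len = match_len_at(in, candidate, cur_pos);  /* hypothetical */
    if (len >= MIN_MATCH_LEN)
        emit_match(len, cur_pos - candidate);        /* hypothetical */
    else
        emit_literal(in[cur_pos]);                   /* hypothetical */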
Further improve the way the near-optimal parser estimates symbol costs:
- When setting a block's initial costs, weigh the default costs and the
  previous block's costs differently, depending on how different the
  current block seems to be from the previous block (see the sketch
  after this list).
- When determining the "default" costs, take into account how many
literals appear in the block and how frequent matches seem to be.
- Increase BIT_COST from 8 to 16, to increase precision in calculations.
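A rough sketch of the first point (hypothetical names; the real
computation differs):

    /* Blend the previous block's final costs with the default costs;
     * 'w' grows with how similar the new block's statistics look to
     * the previous block's. */
    for (sym = 0; sym < num_syms; sym++) {
        costs[sym] = (prev_costs[sym] * w +
                      default_costs[sym] * (W_MAX - w)) / W_MAX;
    }
    /* Costs are fixed-point, in units of 1/BIT_COST of a bit, so
     * raising BIT_COST from 8 to 16 doubles the precision of such
     * blends. */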
When the near-optimal parser sets the initial costs for a block, it
takes into account the costs of the previous block (if there is one).
However, the costs for the previous block were not updated after the
final codes were built, so the costs from the last optimization pass
were used instead of the final costs.
Make it so that the final costs are used, as intended.
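Conceptually, the fix refreshes the saved costs once more after the
final codes are built (hypothetical sketch; function names may differ
from the actual code):

    deflate_make_huffman_codes(&c->freqs, &c->codes);  /* final codes */
    /* Previously missing: recompute the saved costs from the final
     * codeword lengths, so the next block's initial costs start from
     * the final codes rather than the last optimization pass. */
    set_costs_from_codes(c, &c->codes.lens);           /* hypothetical */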
Also, include the DEFLATE_END_OF_BLOCK symbol in the non-final codes.
In general, these changes improve the compression ratio slightly.
With deflate_compress_near_optimal(), some data benefits more than
originally thought from larger values of max_search_depth and
nice_match_length. Some data even needs these parameters to be fairly
high for deflate_compress_near_optimal() to compress more than
deflate_compress_lazy2(). Bump these parameters up a bit.
In the greedy and lazy compressors, automatically increase the minimum
match length from the default of 3 if the data doesn't contain many
different literals. This greatly improves the compression ratio of
levels 1-9 on certain types of data, such as DNA sequencing data, while
not worsening the ratio on other types of data.
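The heuristic can be sketched like this (illustrative thresholds and
sample size only; the real tuning differs):

    static unsigned
    choose_min_match_len(const u8 *data, size_t n)    /* hypothetical */
    {
        bool used[256] = { false };
        unsigned num_used = 0;
        size_t i;

        /* Sample the data and count distinct byte values. */
        for (i = 0; i < n && i < 4096; i++) {
            if (!used[data[i]]) {
                used[data[i]] = true;
                num_used++;
            }
        }
        /* Few distinct literals => literals are cheap to encode, so
         * require longer matches. */
        if (num_used <= 4)       /* e.g. DNA: A, C, G, T */
            return 5;
        if (num_used <= 16)
            return 4;
        return 3;                /* DEFLATE's usual minimum */
    }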
The near-optimal compressor (used by compression levels 10-12) continues
to use a minimum match length of 3, since it already does a better job
of deciding when short matches are worthwhile. (The method for setting the
initial costs needs improvement; later commits address that.)
Resolves https://github.com/ebiggers/libdeflate/issues/57
Instead of switching directly from the lazy compressor at level 7 to the
near-optimal compressor at level 8, use the lazy2 compressor at levels
8-9 and don't switch to near-optimal until level 10.
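Simplified sketch of the revised mapping from compression level to
algorithm (per-level parameter tuning not shown):

    if (level == 1)
        impl = deflate_compress_fastest;
    else if (level <= 7)
        impl = deflate_compress_lazy;   /* lower levels here use greedy */
    else if (level <= 9)
        impl = deflate_compress_lazy2;
    else
        impl = deflate_compress_near_optimal;   /* levels 10-12 */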
This avoids poor compression ratio and bad performance (both
significantly worse than level 7, and significantly worse than zlib) at
levels 8-9 on data where the near-optimal compressor doesn't do well
until the parameters are cranked up.
On data where the near-optimal compressor *does* do well, this change
worsens the compression ratio of levels 8-9, but also speeds them up a
lot, thus positioning them relative to zlib similarly to the lower levels (i.e.
much faster and slightly stronger, rather than slightly faster and much
stronger). The difference between levels 9 and 10 is increased, but
that's perhaps the least bad place to have a discontinuity.
Resolves https://github.com/ebiggers/libdeflate/issues/85