Move program utility functions that are used only by "test programs"
(i.e. not by gzip/gunzip) from prog_util.{c,h} into test_util.{c,h}.
This reduces the code that is compiled for the default build target,
which excludes the test programs.
Another build_decode_table() optimization: rather than filling all the
entries for each codeword using strided stores, just fill one initially
and fill the rest by memcpy()s as the table is incrementally expanded.
Also make some other cleanups and small optimizations.
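
A minimal sketch of the doubling idea, not the actual libdeflate code.
The function names, the 16-bit entry layout ((sym << 4) | len), and the
restriction to codeword lengths <= table_bits (so no subtables) are all
assumptions made for illustration. Each codeword is stored once, at its
bit-reversed index; every time the table doubles, one memcpy() replicates
all shorter codewords into the new half.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the low 'len' bits of 'v' (DEFLATE reads codewords LSB-first). */
uint32_t reverse_bits(uint32_t v, unsigned len)
{
	uint32_t r = 0;

	for (unsigned i = 0; i < len; i++) {
		r = (r << 1) | (v & 1);
		v >>= 1;
	}
	return r;
}

/*
 * Hypothetical helper: build a direct-lookup table of (1 << table_bits)
 * entries for a canonical Huffman code.  lens[sym] is the codeword length
 * of 'sym' (0 = unused); all lengths are assumed to be <= table_bits.
 */
void build_table_by_doubling(uint16_t *table, unsigned table_bits,
			     const uint8_t *lens, unsigned num_syms)
{
	uint32_t codeword = 0;

	/* Zero everything so entries of an incomplete code stay defined. */
	memset(table, 0, sizeof(table[0]) << table_bits);

	for (unsigned len = 1; len <= table_bits; len++) {
		size_t half = (size_t)1 << (len - 1);

		/* Double the table: copy the lower half into the upper half.
		 * This replicates every shorter codeword in one memcpy(). */
		if (len > 1)
			memcpy(&table[half], table, half * sizeof(table[0]));

		/* Store exactly one entry per codeword of this length. */
		for (unsigned sym = 0; sym < num_syms; sym++) {
			if (lens[sym] != len)
				continue;
			table[reverse_bits(codeword, len)] =
				(uint16_t)((sym << 4) | len);
			codeword++;
		}
		codeword <<= 1;	/* canonical code: advance to next length */
	}
}
```

With lengths {1, 2, 3, 3} and table_bits = 3, this performs four stores
and two memcpy() calls rather than eight strided stores.
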
Further improve build_decode_table() performance by splitting the "fill
direct entries" and "fill subtable pointers and subtables" steps into
separate loops and making some other optimizations.
Improve libdeflate's worst-case performance decompressing malicious
DEFLATE streams by about 14x, bringing it within about a factor of 2 of
zlib, by skipping the rebuild of the decode tables for the static
Huffman codes when they're already loaded into the decompressor.
This improves performance decompressing a stream of all empty static
Huffman blocks from about 0.36 MB/s to 175 MB/s, and decompressing the
original reproducer given in the GitHub issue from about 3.3 MB/s to
219 MB/s.
A regression test is added for these cases, as well as for the empty
dynamic Huffman blocks case, to verify that worst-case performance is
comparable to zlib's.
Resolves https://github.com/ebiggers/libdeflate/issues/33
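
A minimal sketch of the caching idea, assuming a simplified decompressor
state. The struct, the function names, and the table_builds counter are
hypothetical and exist only for illustration; build_tables() is a
stand-in for the real, expensive table construction. The static
literal/length codeword lengths are the ones given in RFC 1951.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_LITLEN_SYMS 288

/* Hypothetical, simplified decompressor state. */
struct decompressor {
	uint8_t litlen_lens[NUM_LITLEN_SYMS];
	bool static_codes_loaded;	/* tables currently hold static codes */
	unsigned table_builds;		/* instrumentation for this example */
};

/* Stand-in for the expensive decode table construction. */
void build_tables(struct decompressor *d)
{
	d->table_builds++;
}

/* Called at the start of each static Huffman block. */
void prepare_static_block(struct decompressor *d)
{
	unsigned i;

	/* The static codes never change, so skip the rebuild entirely
	 * when the tables already describe them. */
	if (d->static_codes_loaded)
		return;

	/* Static literal/length codeword lengths from RFC 1951. */
	for (i = 0; i < 144; i++)
		d->litlen_lens[i] = 8;
	for (; i < 256; i++)
		d->litlen_lens[i] = 9;
	for (; i < 280; i++)
		d->litlen_lens[i] = 7;
	for (; i < NUM_LITLEN_SYMS; i++)
		d->litlen_lens[i] = 8;

	build_tables(d);
	d->static_codes_loaded = true;
}

/* Called at the start of each dynamic Huffman block. */
void prepare_dynamic_block(struct decompressor *d, const uint8_t *lens)
{
	memcpy(d->litlen_lens, lens, NUM_LITLEN_SYMS);
	build_tables(d);
	/* The tables no longer match the static codes. */
	d->static_codes_loaded = false;
}
```

In this sketch, a stream consisting only of static Huffman blocks builds
the tables once, no matter how many blocks it contains.
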
NEON intrinsics cannot be used when compiling for an ARM CPU without
hardware floating point support, e.g. the Debian armel port. In this
case arm_neon.h cannot even be included as it causes an #error.
[Based on a patch by Adrian Bunk <bunk@debian.org>, but changed to check
for __ARM_FP instead of !__SOFTFP__ to be consistent with arm_neon.h,
and added a comment.]
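
A minimal sketch of this kind of guard, not the exact libdeflate
condition. HAVE_NEON_INTRINSICS and sum_bytes() are hypothetical names,
and additionally requiring __ARM_NEON is an assumption made so that the
example also compiles on non-ARM targets. arm_neon.h is included only
when the compiler reports hardware floating point via __ARM_FP.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Only touch arm_neon.h when hardware floating point is available.
 * On soft-float targets the header itself triggers an #error.
 */
#if defined(__ARM_NEON) && defined(__ARM_FP)
#  include <arm_neon.h>
#  define HAVE_NEON_INTRINSICS 1
#else
#  define HAVE_NEON_INTRINSICS 0
#endif

/* Sum the bytes of a buffer, using NEON when it is usable. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
	uint32_t sum = 0;

#if HAVE_NEON_INTRINSICS
	while (n >= 16) {
		uint8x16_t v = vld1q_u8(p);
		/* Pairwise widening adds: 16 x u8 -> 2 x u64. */
		uint64x2_t s = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(v)));

		sum += (uint32_t)(vgetq_lane_u64(s, 0) +
				  vgetq_lane_u64(s, 1));
		p += 16;
		n -= 16;
	}
#endif
	/* Portable tail loop; also the whole implementation without NEON. */
	while (n--)
		sum += *p++;
	return sum;
}
```
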
To match common shared library packaging conventions, name the shared
library libdeflate.so.0, with a matching soname, and make libdeflate.so
a symlink that points to it.
All 64-bit PowerPC CPUs handle unaligned accesses reasonably fast, so
set UNALIGNED_ACCESS_IS_FAST.
Decompression of the snappy html test case is almost 50% faster on
POWER9 with this patch applied.
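
A minimal sketch of how a flag like this can gate a word-at-a-time code
path. The architecture list (other than __powerpc64__) and the
match_length() helper are assumptions for illustration, not the actual
libdeflate definitions. The loads go through memcpy(), which compilers
typically turn into a single load on targets where unaligned access is
cheap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative architecture list; only __powerpc64__ is from this patch. */
#if defined(__x86_64__) || defined(__i386__) || defined(__powerpc64__)
#  define UNALIGNED_ACCESS_IS_FAST 1
#else
#  define UNALIGNED_ACCESS_IS_FAST 0
#endif

/* Return the number of leading bytes on which 'a' and 'b' agree. */
size_t match_length(const uint8_t *a, const uint8_t *b, size_t max_len)
{
	size_t len = 0;

#if UNALIGNED_ACCESS_IS_FAST
	/* Compare 8 bytes per iteration using unaligned loads. */
	while (len + sizeof(uint64_t) <= max_len) {
		uint64_t x, y;

		memcpy(&x, a + len, sizeof(x));
		memcpy(&y, b + len, sizeof(y));
		if (x != y)
			break;
		len += sizeof(uint64_t);
	}
#endif
	/* Finish (or do everything) one byte at a time. */
	while (len < max_len && a[len] == b[len])
		len++;
	return len;
}
```
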
Make it compatible with the new code organization, make it run the
test_checksums program for each implementation, and run each
implementation in both 64-bit and 32-bit modes.
Now that we detect CPU features on 32-bit x86, allow the SSE2
implementation of Adler-32 to be selected at runtime based on the
presence of the SSE2 feature.
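
A minimal sketch of runtime selection, not the libdeflate dispatch code.
All names are hypothetical, adler32_sse2_placeholder() merely forwards
to the generic code instead of using real SSE2 instructions, and feature
detection assumes the GCC/Clang builtin __builtin_cpu_supports(). The
lazy initialization is not thread-safe as written.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u

typedef uint32_t (*adler32_func_t)(uint32_t adler, const uint8_t *p,
				   size_t n);

/* Portable byte-at-a-time Adler-32. */
uint32_t adler32_generic(uint32_t adler, const uint8_t *p, size_t n)
{
	uint32_t s1 = adler & 0xFFFF;
	uint32_t s2 = adler >> 16;

	for (size_t i = 0; i < n; i++) {
		s1 = (s1 + p[i]) % ADLER_MOD;
		s2 = (s2 + s1) % ADLER_MOD;
	}
	return (s2 << 16) | s1;
}

/* Placeholder for a vectorized version; forwards to the generic code. */
uint32_t adler32_sse2_placeholder(uint32_t adler, const uint8_t *p, size_t n)
{
	return adler32_generic(adler, p, n);
}

/* Report whether the running CPU has SSE2 (x86 with GCC/Clang only). */
int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
	(defined(__i386__) || defined(__x86_64__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("sse2") != 0;
#else
	return 0;
#endif
}

static adler32_func_t adler32_impl;	/* NULL until the first call */

/* Public entry point: pick an implementation once, then reuse it. */
uint32_t adler32(uint32_t adler, const uint8_t *p, size_t n)
{
	if (adler32_impl == NULL)
		adler32_impl = cpu_has_sse2() ? adler32_sse2_placeholder
					      : adler32_generic;
	return adler32_impl(adler, p, n);
}
```

Callers pass 1 as the initial value, for example
adler32(1, (const uint8_t *)"Wikipedia", 9), which yields 0x11E60398
whichever implementation is selected.
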