A lot of the internal library headers don't have include guards because
they aren't needed. It might look like a bug, though, and it doesn't
hurt to add them. So do this.
Update https://github.com/ebiggers/libdeflate/issues/117
Remove the ability of matchfinder_init() and matchfinder_rebase() to
fail due to the matchfinder memory size being misaligned. Instead,
require that the size always be 128-byte aligned -- which is already the
case. Also, make the matchfinder memory always be 32-byte aligned --
which doesn't really have any downside.
The cutoff for outputting uncompressed data is currently < 16 bytes for
all compression levels. That isn't ideal, since the higher the
compression level, the more we should bother with very small inputs; and
the lower the compression level, the less we should bother.
Use a formula that produces the following cutoffs:
Level Cutoff
----- ------
0 56
1 52
2 48
3 44
4 40
5 36
6 32
7 28
8 24
9 20
10 16
11 12
12 8
Update https://github.com/ebiggers/libdeflate/issues/67
This is needed to avoid the following error when using
-fsanitize=undefined with gcc:
lib/x86/adler32_impl.h:214:2: runtime error: signed integer overflow:
1951294680 + 1956941400 cannot be represented in type 'int'
Note that this isn't seen when using -fsanitize=undefined with clang.
Old compilers don't have unsigned vector types, so work around that.
Add a CRC32 implementation that uses the ARM CRC32 instructions.
This is simpler and faster than the PMULL implementation. On AWS
Graviton2, the performance improvement is about 70%. On Hikey960, the
performance improvement is about 30% for the Cortex-A53 cores or about
5% for the Cortex-A73 cores.
Based on work by Greg V <greg@unrelenting.technology>
(https://github.com/ebiggers/libdeflate/pull/45)
and Andrew Steinborn <git@steinborn.me>
(https://github.com/ebiggers/libdeflate/pull/76).
If support for CRC32 instructions is detected, set
ARM_CPU_FEATURE_CRC32. Also define
COMPILER_SUPPORTS_CRC32_TARGET_INTRINSICS when appropriate, and update
run_tests.sh to toggle the crc32 feature for testing.
Some users may require a valid DEFLATE, zlib, or gzip stream but know
ahead of time that particular inputs are not compressible. zlib
supports "level 0" for this use case. Support this in libdeflate too.
Resolves https://github.com/ebiggers/libdeflate/issues/86
Make test-only builds of libdeflate support an environmental variable
LIBDEFLATE_DISABLE_CPU_FEATURES that contains a list of CPU features to
disable like "avx512bw,avx2,sse2".
This makes it possible to test all the variants of dynamically
dispatched code without editing the source code.
Note, this environmental variable is not a stable interface, so put the
support for it behind a scary-looking option TEST_SUPPORT__DO_NOT_USE.
In cpuid() in the '__i386__ && __PIC__' case, the second output operand
is written to before the input operands are used. So the second output
operand needs the earlyclobber constraint.
Don't assume that lib_common.h and libdeflate.h don't include
<stdlib.h>. Currently this change doesn't matter unless someone uses
-DFREESTANDING for a Windows build, which isn't supported anyway, but we
might as well clean this up.
Update https://github.com/ebiggers/libdeflate/pull/68
gcc 10 is miscompiling libdeflate on x86_64 at -O3 due to a regression
in how packed structs are handled
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994).
Work around this by just always using memcpy() for unaligned accesses.
It's unclear that the "packed struct" approach is worthwhile to maintain
anymore. Currently I'm only aware that it's useful with old gcc's on
arm32. Hopefully, compilers are good enough now that we can simply use
memcpy() everywhere.
Update https://github.com/ebiggers/libdeflate/issues/64
For consistency, make the implementations of libdeflate_gzip_compress()
and libdeflate_zlib_compress() use the same parameter name that their
declarations and everywhere else use.
Allow building libdeflate without linking to any libc functions by using
'make FREESTANDING=1'. When using such a library build, the user will
need to call libdeflate_set_memory_allocator() before anything else,
since malloc() and free() will be unavailable.
[Folded in fix from Ingvar Stepanyan to use -nostdlib, and made
freestanding_tests() check that no libs are linked to.]
Update https://github.com/ebiggers/libdeflate/issues/62
In preparation for adding custom memory allocator support, don't call
the standard memory allocation functions directly but rather wrap them
with libdeflate_malloc() and libdeflate_free().
Additionally, libdeflate_zlib_decompress now returns successfully in
case there are additional trailing bytes in the input buffer after the
compressed stream.
Unfortunately, MSVC only accepts __stdcall after the return type, while
gcc only accepts __attribute__((visibility("default"))) before the
return type. So we need a macro in each location.
Also, MSVC doesn't define __i386__; that's gcc specific. So instead use
'_WIN32 && !_WIN64' to detect 32-bit Windows.
The API returns the compressed size as a size_t, so
deflate_flush_output() needs to return size_t as well, not u32.
Otherwise sizes >= 4 GiB are truncated.
This bug has been there since the beginning.
(This only matters if you compress a buffer that large in a single go,
which obviously is not a good idea, but no matter -- it's a bug.)
Resolves https://github.com/ebiggers/libdeflate/issues/44
Another build_decode_table() optimization: rather than filling all the
entries for each codeword using strided stores, just fill one initially
and fill the rest by memcpy()s as the table is incrementally expanded.
Also make some other cleanups and small optimizations.
Further improve build_decode_table() performance by splitting the "fill
direct entries" and "fill subtable pointers and subtables" steps into
separate loops and making some other optimizations.
Improve libdeflate's worst-case performance decompressing malicious
DEFLATE streams by about 14x, bringing it within a factor of about 2x of
zlib, by skipping rebuilding the decode tables for the static Huffman
codes when they're already loaded into the decompressor.
This improves performance decompressing a stream of all empty static
Huffman blocks from about 0.36 MB/s to 175 MB/s, or the original
reproducer given on the Github issue from about 3.3 MB/s to 219 MB/s.
A regression test is added for these cases as well as the empty dynamic
Huffman blocks case to verify worst-case performance comparable to zlib.
Resolves https://github.com/ebiggers/libdeflate/issues/33
Now that we detect CPU features on 32-bit x86, allow the SSE2
implementation of Adler-32 to be selected at runtime based on the
presence of the SSE2 feature.