Currently, the optimized implementations of matchfinder_init() and
matchfinder_rebase() are chosen via static dispatch, i.e. at compile
time. That means the AVX2 implementations are only used when the build
itself targets AVX2, so usually they aren't used.
Fix this by using dynamic dispatch, as libdeflate already does for the
Adler-32 and CRC-32 checksums and for DEFLATE decompression.
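
For illustration, the general dynamic dispatch pattern can be sketched
as below. This is a minimal sketch, not the code in this commit: every
name in it (mf_pos_t, MF_INITVAL, mf_init_*, cpu_has_avx2) is
hypothetical, the "AVX2" function is a portable stand-in, and the CPU
check assumes GCC or Clang on x86, falling back to the generic path
elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t mf_pos_t;                 /* hypothetical position type */
#define MF_INITVAL ((mf_pos_t)INT16_MIN)  /* hypothetical "empty" marker */

typedef void (*mf_init_func_t)(mf_pos_t *data, size_t count);

/* Portable fallback: works on every CPU. */
static void mf_init_generic(mf_pos_t *data, size_t count)
{
    for (size_t i = 0; i < count; i++)
        data[i] = MF_INITVAL;
}

/*
 * Stand-in for an AVX2 version.  A real one would use 256-bit stores and
 * be compiled with AVX2 enabled for this function only; here it simply
 * reuses the generic loop so that the sketch runs anywhere.
 */
static void mf_init_avx2_standin(mf_pos_t *data, size_t count)
{
    mf_init_generic(data, count);
}

/* Runtime CPU check (assumption: GCC/Clang builtin, x86 only). */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

static void mf_init_dispatch(mf_pos_t *data, size_t count);

/* Starts out pointing at the dispatcher; replaced on the first call. */
static mf_init_func_t mf_init_impl = mf_init_dispatch;

/* First call: pick an implementation, cache it, then run it. */
static void mf_init_dispatch(mf_pos_t *data, size_t count)
{
    mf_init_func_t f = cpu_has_avx2() ? mf_init_avx2_standin
                                      : mf_init_generic;
    /* Plain store; a real library must consider concurrent first calls. */
    mf_init_impl = f;
    f(data, count);
}

/* Entry point used by callers; costs one indirect call after setup. */
static void mf_init(mf_pos_t *data, size_t count)
{
    mf_init_impl(data, count);
}
```

Callers only ever use mf_init(); the choice of implementation is made
once, on the first call, from what the running CPU reports rather than
from the compiler flags.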
Based on work by Andrew Steinborn <git@steinborn.me>
(https://github.com/ebiggers/libdeflate/pull/77). He wrote:
"The main impact is on x86: the AVX2 matchfinder can now be properly
dynamically dispatched at runtime and if -mavx2 is included in CFLAGS
(or -march set to any platform with AVX2 support). On my Ryzen 9 3900X,
I got an approximately 1% boost in deflate time (measured with a
uncompressed tarball of the Silesia corpus) using just the changes in
this PR and the regular CFLAGS, and a 2.7% boost when specifying -mavx2
as CFLAGS. (I also tested with an Intel Xeon Skylake c5.large EC2
instance, and did not see any performance regression)."
Remove the ability of matchfinder_init() and matchfinder_rebase() to
fail due to the matchfinder memory size being misaligned. Instead,
require that the size always be a multiple of 128 bytes -- which is
already the case. Also, always align the start of the matchfinder
memory to 32 bytes -- which has no real downside.
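
The two alignment requirements can be illustrated with the sketch
below. It rests on assumptions: the struct layout and all names
(struct mf_example, mf_alloc, MF_*_ALIGNMENT) are hypothetical, the
compile-time assertion is just one way to express the size
requirement, and C11 aligned_alloc() stands in for whatever allocator
the library really uses (it is not available with every toolchain).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MF_SIZE_ALIGNMENT 128  /* total size must be a multiple of this */
#define MF_MEM_ALIGNMENT   32  /* start address must be a multiple of this */

/* Hypothetical matchfinder layout: two tables of 16-bit positions. */
struct mf_example {
    int16_t hash_tab[1 << 15];
    int16_t next_tab[1 << 15];
};

/*
 * Check the size requirement at compile time, so that the init and
 * rebase routines have no runtime failure path for a misaligned size.
 */
_Static_assert(sizeof(struct mf_example) % MF_SIZE_ALIGNMENT == 0,
               "matchfinder size must be a multiple of 128 bytes");

/*
 * Allocate the structure on a 32-byte boundary.  aligned_alloc() wants
 * the size to be a multiple of the alignment; a multiple of 128 always
 * satisfies that for an alignment of 32.  Release with free().
 */
static struct mf_example *mf_alloc(void)
{
    return aligned_alloc(MF_MEM_ALIGNMENT, sizeof(struct mf_example));
}

/* Returns nonzero if p meets the start-address requirement. */
static int mf_is_aligned(const void *p)
{
    return ((uintptr_t)p % MF_MEM_ALIGNMENT) == 0;
}
```

With both properties guaranteed, a vectorized loop can process the
whole region in fixed-size aligned chunks without handling a
misaligned head or a partial tail.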
Move the x86 and ARM-specific code into their own directories to prevent
it from cluttering up the main library. This will make it a bit easier
to add new architecture-specific code.
But to avoid complicating things too much for people who aren't using
the provided Makefile, we still just compile all .c files for all
architectures (irrelevant ones end up #ifdef'ed out), and the headers
are included explicitly for each architecture so that an
architecture-specific include path isn't needed. So the only change
for those people is that they now need to compile both lib/*.c and
lib/*/*.c instead of only lib/*.c.