Add a CRC32 implementation that uses the ARM CRC32 instructions.
This is simpler and faster than the PMULL implementation. On AWS
Graviton2, the performance improvement is about 70%. On Hikey960, the
performance improvement is about 30% for the Cortex-A53 cores or about
5% for the Cortex-A73 cores.
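
For illustration only, here is a minimal sketch of the general technique,
not the code added by this commit. The function name crc32_sketch is
hypothetical. It assumes a little-endian target and uses the ACLE
intrinsics __crc32d and __crc32b from <arm_acle.h> when the compiler
defines __ARM_FEATURE_CRC32, and otherwise falls back to a portable
bitwise loop so that it builds anywhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

/*
 * Hypothetical sketch of a zlib-style CRC-32 update; not libdeflate's
 * actual function.  Pass 0 as 'crc' to start a new checksum.
 */
static uint32_t crc32_sketch(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
#if defined(__ARM_FEATURE_CRC32)
	/* Consume 8 bytes per CRC32X instruction (assumes little-endian). */
	while (len >= 8) {
		uint64_t v;

		memcpy(&v, p, sizeof(v));	/* unaligned-safe load */
		crc = __crc32d(crc, v);
		p += 8;
		len -= 8;
	}
	/* Consume the remaining bytes one at a time. */
	while (len--)
		crc = __crc32b(crc, *p++);
#else
	/* Portable fallback: bit-at-a-time, reflected polynomial. */
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
#endif
	return ~crc;
}
```

On an AArch64 compiler, the intrinsic path is typically enabled with an
option such as -march=armv8-a+crc. The loop needs no lookup tables or
folding constants, which is where the simplicity relative to a
PMULL-based approach comes from.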

Based on work by Greg V <greg@unrelenting.technology>
(https://github.com/ebiggers/libdeflate/pull/45)
and Andrew Steinborn <git@steinborn.me>
(https://github.com/ebiggers/libdeflate/pull/76).