deflate_compress: fix up GET_NUM_COUNTERS()

GET_NUM_COUNTERS() actually returns a value about 4 times higher than
intended, due to missing parentheses; ((num_syms) + 3 / 4) is supposed
to be (((num_syms) + 3) / 4), or equivalently DIV_ROUND_UP(num_syms, 4)
as it was in my original code from wimlib commit 394751ae1302.  This
value affects the performance of Huffman code generation.

But oddly enough, I'm measuring that the actual behavior results in
better performance than the intended behavior.  This could be due to
differences between DEFLATE and LZX (the compression format I had
originally tuned this value to), or a different CPU, or both.  Moreover,
I can't make it significantly faster by trying other values.

So, make GET_NUM_COUNTERS() intentionally expand to simply num_syms.
Omit the rounding up to a multiple of 4, which doesn't seem helpful.
This commit is contained in:
Eric Biggers 2022-01-20 23:36:02 -08:00
parent dc64ccfb25
commit 241091e8a7

View File

@ -805,7 +805,7 @@ heap_sort(u32 A[], unsigned length)
#define NUM_SYMBOL_BITS 10
#define SYMBOL_MASK ((1 << NUM_SYMBOL_BITS) - 1)
#define GET_NUM_COUNTERS(num_syms) ((((num_syms) + 3 / 4) + 3) & ~3)
#define GET_NUM_COUNTERS(num_syms) (num_syms)
/*
* Sort the symbols primarily by frequency and secondarily by symbol value.
@ -843,26 +843,13 @@ sort_symbols(unsigned num_syms, const u32 freqs[restrict],
unsigned counters[GET_NUM_COUNTERS(DEFLATE_MAX_NUM_SYMS)];
/*
* We rely on heapsort, but with an added optimization. Since it's
* common for most symbol frequencies to be low, we first do a count
* sort using a limited number of counters. High frequencies will be
* counted in the last counter, and only they will be sorted with
* heapsort.
* We use heapsort, but with an added optimization. Since often most
* symbol frequencies are low, we first do a count sort using a limited
* number of counters. High frequencies are counted in the last
* counter, and only they will be sorted with heapsort.
*
* Note: with more symbols, it is generally beneficial to have more
* counters. About 1 counter per 4 symbols seems fast.
*
* Note: I also tested radix sort, but even for large symbol counts (>
* 255) and frequencies bounded at 16 bits (enabling radix sort by just
* two base-256 digits), it didn't seem any faster than the method
* implemented here.
*
* Note: I tested the optimized quicksort implementation from glibc
* (with indirection overhead removed), but it was only marginally
* faster than the simple heapsort implemented here.
*
* Tests were done with building the codes for LZX. Results may vary
* for different compression algorithms...!
* counters. About 1 counter per symbol seems fastest.
*/
num_counters = GET_NUM_COUNTERS(num_syms);