Agner wrote:
Regarding coding style,
I don't like the many helper #defines: they pollute
the global name space.
You mean HILO64 / 32 / 16? No they don't: I #undef them at the end of the hsums block for that reason.
I want to avoid optimizations that are tuned
specifically to a particular CPU model, because the
market develops so fast that CPU-specific code soon
becomes obsolete, and a burden to maintain.
Most of the tuning is for code-size and avoiding movdqa without AVX. This is good whether it's a Penryn, Haswell, or Silvermont running the SSE4 version. This was the main motivation for making HILO64 a macro, instead of using an ifdef in every block of code where it was used. There is a concrete gain in instruction bytes from using punpckhqdq for AVX, but pshufd for non-AVX. I think this is worth the readability impact of using a macro to wrap one instruction.
I don't expect any new CPUs to have slow shuffles like K8/PM/Merom. Using pshuflw is better for Merom, and it's reasonable to expect it to be the same as pshufd on every future CPU. So this uarch-specific tuning is NOT done at the expense of others, or likely to need any future maintenance. (And if it ever does, it's factored out into the HILO32 macro, so tuning is really easy. I also wanted to make it easier for users of the VCL to tune instruction choices for special cases.)
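To make the macro idea concrete, here is a minimal sketch of how such helpers could look for a 4 x int32 horizontal sum. It is an illustration written for this post, not the actual VCL code: the function name hsum_epi32 and the exact macro bodies are assumptions, and only the HILO64 / HILO32 names and the instruction choices (punpckhqdq with AVX, pshufd without, pshuflw for the last step) come from the discussion above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)
#include <cstdint>
#include <cassert>      // only needed if you add your own checks

// Hypothetical helper macros, one instruction each.
#ifdef __AVX__
// vpunpckhqdq is 3-operand under AVX and needs no immediate byte.
#define HILO64(a) _mm_unpackhi_epi64((a), (a))
#else
// pshufd copies and shuffles in one instruction, so no movdqa is needed.
#define HILO64(a) _mm_shuffle_epi32((a), _MM_SHUFFLE(3, 2, 3, 2))
#endif
// pshuflw swaps the two low 32-bit elements; the upper half is a don't-care here.
#define HILO32(a) _mm_shufflelo_epi16((a), _MM_SHUFFLE(1, 0, 3, 2))

// Horizontal sum of four 32-bit integers (hypothetical name).
static inline int32_t hsum_epi32(__m128i v) {
    __m128i hi64  = HILO64(v);                  // bring elements 2,3 down to 0,1
    __m128i sum64 = _mm_add_epi32(v, hi64);     // [v0+v2, v1+v3, x, x]
    __m128i hi32  = HILO32(sum64);              // bring element 1 down to 0
    __m128i sum32 = _mm_add_epi32(sum64, hi32); // element 0 holds the total
    return _mm_cvtsi128_si32(sum32);            // movd to an integer register
}

// Drop the helpers again so they don't leak into the global name space.
#undef HILO64
#undef HILO32
```

Retuning the instruction choice for some special case then means editing one macro body, not every hsum function.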
I cluttered things up with comments evaluating whether there was anything to gain from different choices on different CPUs, to decide which strategy should be the best choice for all CPUs. We don't even have a way to determine what uarch to tune for (separate from the target instruction set), so I didn't try to do that.
Those design notes and extra complexity should probably be removed as much as possible to make the code easier to follow, once I'm done choosing the best sequences for SSE2, SSE4, and AVX.
hadd is slow on every current CPU, and IDK if it will ever become faster. So avoiding it seems reasonable across the board, even though hadd would be an obvious saving in code-size. Using fewer uops also takes up less space in the out-of-order window.