Vector Class Discussion

 
Test suite for VCL? And how to submit patches - Peter Cordes - 2016-05-31
    Test suite for VCL? And how to submit patches - Agner - 2016-05-31
        Factoring out boilerplate for integer vectors - Peter Cordes - 2016-05-31
            Factoring out boilerplate for integer vectors - Agner - 2016-06-01
        Test suite for VCL? And how to submit patches - Peter Cordes - 2016-06-05
    Test suite for VCL? And how to submit patches - Peter Cordes - 2016-06-02
        Test suite for VCL? And how to submit patches - Agner - 2016-06-03
            Test suite for VCL? And how to submit patches - Peter Cordes - 2016-06-03
 
Test suite for VCL? And how to submit patches
Author: Peter Cordes Date: 2016-05-31 15:46
I'm working on some improvements for VCL functions. I started out just looking at horizontal sums, using single-uop instructions instead of the slow haddps. I found some other things to improve, too (most dramatically, testing for all-ones or all-zero without PTEST).
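
Roughly the kind of PTEST-free test I mean, as a simplified sketch (function names here are made up for the example, not the actual VCL names):

#include <emmintrin.h>   // SSE2

// Sketch only: test whether all 128 bits of x are zero, without PTEST.
// pcmpeqb against zero sets a byte to 0xFF iff that byte was zero;
// pmovmskb collects the byte sign bits, so 0xFFFF means every byte was zero.
static inline bool is_all_zero_sse2(__m128i x) {
    __m128i byte_is_zero = _mm_cmpeq_epi8(x, _mm_setzero_si128());
    return _mm_movemask_epi8(byte_is_zero) == 0xFFFF;
}

// Sketch only: test whether all 128 bits of x are set.
// Every byte equals 0xFF iff every bit of the vector is 1.
static inline bool is_all_ones_sse2(__m128i x) {
    __m128i byte_is_ones = _mm_cmpeq_epi8(x, _mm_set1_epi8(-1));
    return _mm_movemask_epi8(byte_is_ones) == 0xFFFF;
}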

Obviously every change needs to be tested to make sure I didn't break anything, either correctness-wise or in portability to different compilers. I'm not looking forward to writing my own tests for the functions I want to change, so is there an existing test suite? Or should I just send code that compiles and looks right, and let you test it on your private test suite?

I put the changes I'm working on up on github so everyone can easily take a look at what I'm talking about. (The unfinished hsum commit has a descriptive commit message explaining some code re-arrangement to factor out repeated code; the f256 hsum commit shows the kind of change away from hadd that I mean.)

   
Test suite for VCL? And how to submit patches
Author: Agner Date: 2016-05-31 23:48
Thank you very much for your contributions to improving the vector class library.

Peter Cordes wrote:

is there an existing test suite?
Not a reliable one. I have a primitive test suite that somebody else has made, but it doesn't cover the necessary number of cases for data values, etc. For example, your 64-bit multiplication would certainly have to be tested with more than one pair of data values to make sure it handles all cases correctly.

I am mostly relying on manual testing based on the philosophy of white box testing where I make sure that all branches, special cases and boundary cases are covered.

   
Factoring out boilerplate for integer vectors
Author: Peter Cordes Date: 2016-05-31 23:57
I'm considering factoring out a lot of the boilerplate that's repeated for Vec16c, Vec8s, etc., either with templates or with CPP macros like
#define BOILERPLATE_ASSIGNMENTS(V_T)					\
/* vector operator += : add */					        \
static inline V_T & operator += (V_T & a, V_T const & b) {	        \
    a = a + b;								\
    return a;								\
}									\
/* vector operator -= : sub */						\
static inline V_T & operator -= (V_T & a, V_T const & b) {		\
    a = a - b;								\
    return a;								\
}									\
...

Either way might give users of the library worse error messages, but I suspect the template version would be even worse than the macros in that respect.
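
For comparison, the template version would look something like this (a rough sketch, not committed code):

// Rough sketch of the template alternative: one definition of += / -=
// shared by all the integer vector types.  An unconstrained template like
// this would also match unrelated types unless it is restricted with
// SFINAE or a trait, which is part of why I expect the error messages to
// be worse than with the macro.
template <typename V_T>
static inline V_T & operator += (V_T & a, V_T const & b) {
    a = a + b;          // reuse the existing operator+
    return a;
}

template <typename V_T>
static inline V_T & operator -= (V_T & a, V_T const & b) {
    a = a - b;          // reuse the existing operator-
    return a;
}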

Did you purposely avoid factoring out functions that are just defined in terms of other functions, and are the same except for the type names? Or is that something you'd like to see done, but didn't have time for?

   
Factoring out boilerplate for integer vectors
Author: Agner Date: 2016-06-01 08:48
Peter Cordes wrote:
Did you purposely avoid factoring out functions that are just defined in terms of other functions, and are the same except for the type names?
Yes. I don't want to add more complexity than necessary. Defining each operator and function explicitly makes it easier for the user to debug. I am using templates only for long functions, such as the mathematical functions.
   
Test suite for VCL? And how to submit patches
Author: Peter Cordes Date: 2016-06-05 11:08
Agner wrote:

Peter Cordes wrote:

is there an existing test suite?
Not a reliable one. I have a primitive test suite that somebody else has made, but it doesn't cover the necessary number of cases for data values, etc.

Can I have a copy of the test suite? It might be easier to start from your primitive one than from scratch. I definitely need to test my changes for SSE2, SSE4, AVX, and AVX2 (including operator >>(Vec2q) and the Vec4q version, which I pushed to github yesterday).

Also, have you looked into using immediate blends (pblendw, blendps, vpblendd) instead of variable blends with an _mm_setr mask? clang-3.8 manages to optimize that to vpblendd, but gcc loads the 32B mask from memory and uses the 2-uop pblendvb. (Using constant8i<0,-1,0,-1,0,-1,0,-1>() instead of _mm_setr also defeats clang.)
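
To make that concrete with plain intrinsics (a sketch, not VCL code; function names are mine):

#include <immintrin.h>

// Sketch: blend the odd 32-bit elements of b into a (pattern 0,-1,0,-1).

// What the current pattern compiles to with gcc: a variable blend.  The
// mask has to be materialized (gcc loads it from memory) and pblendvb is
// 2 uops on many CPUs.
static inline __m128i blend_variable(__m128i a, __m128i b) {
    __m128i mask = _mm_setr_epi32(0, -1, 0, -1);
    return _mm_blendv_epi8(a, b, mask);          // SSE4.1 pblendvb
}

// The immediate form: the pattern is baked into the instruction.
// pblendw (SSE4.1) is a single uop; with AVX2, vpblendd can run on more
// execution ports on recent Intel CPUs.
static inline __m128i blend_immediate(__m128i a, __m128i b) {
#ifdef __AVX2__
    return _mm_blend_epi32(a, b, 0x0A);          // vpblendd, take dwords 1,3 from b
#else
    return _mm_blend_epi16(a, b, 0xCC);          // pblendw, 2 mask bits per dword
#endif
}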

I'm planning to have a look at the blend4q function template (and the other blend functions) to add detection of patterns that can use vpblendd or pblendw. However, some implementations of other functions use select, which can't notice compile-time-constant blends. It's probably not worth trying to change select to use gcc's __builtin_constant_p, because we'd still have to pick apart the vector mask, which might not optimize away at compile time. It's probably better to simply change the other functions to use pblendw, or vpblendd when AVX2 is available.
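
The kind of detection I have in mind would look something like this (my own naming and a cut-down sketch, assuming AVX2; the real blend4q is more general):

#include <immintrin.h>

// Sketch: index n means a[n], index n+4 means b[n].  When every element
// stays in its own lane, the whole blend is one immediate vpblendd.
template <int i0, int i1, int i2, int i3>
static inline __m128i blend4i_in_place(__m128i a, __m128i b) {
    static_assert((i0 == 0 || i0 == 4) && (i1 == 1 || i1 == 5) &&
                  (i2 == 2 || i2 == 6) && (i3 == 3 || i3 == 7),
                  "this sketch only handles in-place blends");
    enum {
        imm = (i0 == 4 ? 1 : 0) | (i1 == 5 ? 2 : 0) |
              (i2 == 6 ? 4 : 0) | (i3 == 7 ? 8 : 0)
    };
    return _mm_blend_epi32(a, b, imm);   // single vpblendd
}
// The real function would fall back to its existing permute/select code
// when the condition in the static_assert doesn't hold.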

   
Test suite for VCL? And how to submit patches
Author: Peter Cordes Date: 2016-06-02 16:23
I'm mostly finished tuning hsums for __m128i vectors. For horizontal_add_x(Vec16c), we can range-shift to unsigned and use psadbw, so that's a huge improvement.
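
The trick is roughly this (a simplified sketch with plain intrinsics, not the exact committed code):

#include <emmintrin.h>   // SSE2

// Sketch: sum of 16 signed bytes, using psadbw instead of a shuffle/add tree.
// XOR with 0x80 range-shifts each signed byte to unsigned (adds 128 mod 256),
// psadbw against zero produces two 64-bit sums of 8 bytes each,
// and subtracting 16*128 at the end undoes the range shift.
static inline int hsum_epi8_sse2(__m128i v) {
    __m128i shifted = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
    __m128i sad     = _mm_sad_epu8(shifted, _mm_setzero_si128());
    __m128i hi      = _mm_shuffle_epi32(sad, 0xEE);   // high 64 bits down to low
    __m128i sum     = _mm_add_epi32(sad, hi);         // partial sums fit in 32 bits
    return _mm_cvtsi128_si32(sum) - 16 * 128;
}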

Many of the _x functions do one step of extend/add and then just call the normal horizontal_add function for the next wider width.

I removed all the slow phadd code. In some cases, I changed things to avoid movdqa in the SSE2 / SSE4 versions without AVX. With AVX, it mostly just saves code-size, and maybe increases ILP.

For CPUs with slow shuffles (like Merom), there should be nice improvements from using pshuflw instead of pshufd when possible.
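
For example, the tail end of an 8 x 16-bit sum can stay within the low qword, roughly like this (a sketch, not the exact committed code):

#include <emmintrin.h>

// Sketch: the last two steps of an 8 x 16-bit sum, once an earlier step has
// already folded the high 64 bits into the low 64.  pshuflw only permutes the
// low four words, which was cheap even on the old slow-shuffle CPUs, while
// pshufd permutes the whole register.
static inline int hsum_low4_epi16(__m128i x) {
    __m128i t = _mm_shufflelo_epi16(x, 0x4E);   // low words [2,3,0,1]
    x = _mm_add_epi16(x, t);                    // w0+w2, w1+w3, ...
    t = _mm_shufflelo_epi16(x, 0xB1);           // low words [1,0,3,2]
    x = _mm_add_epi16(x, t);                    // total in the low word
    return (short)_mm_cvtsi128_si32(x);         // sign-extend the 16-bit sum
}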

Anyway, I pushed stuff up to github. I have *not* turned my changes into a nice patch-series, so all the mess of development is there. I can re-factor the commits into a series of clean commits if that's useful, but you don't use public version-control for the library so IDK if it would benefit anything long-term.

I still haven't really looked at float or 256b vectors yet, but I'd like your comments on coding-style and how much detail to put in comments before I start on those.

   
Test suite for VCL? And how to submit patches
Author: Agner Date: 2016-06-03 11:47
Peter Cordes wrote:
Anyway, I pushed stuff up to github. I have *not* turned my changes into a nice patch-series, so all the mess of development is there. I can re-factor the commits into a series of clean commits if that's useful, but you don't use public version-control for the library so IDK if it would benefit anything long-term.

I still haven't really looked at float or 256b vectors yet, but I'd like your comments on coding-style and how much detail to put in comments before I start on those.

Thank you for your contributions. I don't have the time to test and commit everything right now, but I will look at it before the next update. You don't have to make nice patch series, as I will review it anyway before I put it into my code. Regarding coding style, I don't like the many helper #defines: they pollute the global name space and they don't improve the readability in my opinion. It is better to have the code in a logical order so that it is easier to find things.

I want to avoid optimizations that are tuned specifically to a particular CPU model, because the market develops so fast that CPU-specific code soon becomes obsolete, and a burden to maintain.

   
Test suite for VCL? And how to submit patches
Author: Peter Cordes Date: 2016-06-03 12:59
Agner wrote:
Regarding coding style, I don't like the many helper #defines: they pollute the global name space.

You mean HILO64 / 32 / 16? No, they don't: I #undef them at the end of the hsums block for exactly that reason.

I want to avoid optimizations that are tuned specifically to a particular CPU model, because the market develops so fast that CPU-specific code soon becomes obsolete, and a burden to maintain.

Most of the tuning is for code-size and for avoiding movdqa without AVX. That's good whether it's a Penryn, Haswell, or Silvermont running the SSE4 version, and it was the main motivation for making HILO64 a macro instead of putting an ifdef in every block of code where it's used. There is a concrete gain in instruction bytes from using punpckhqdq with AVX but pshufd without AVX, and I think that's worth the readability impact of using a macro to wrap one instruction.
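
Something along these lines (a sketch; the committed macro may differ in details):

#include <emmintrin.h>

// Sketch of a HILO64-style helper: bring the high 64 bits of x down to the
// low 64 bits.
#ifdef __AVX__
// The VEX encoding is 3-operand, so punpckhqdq never needs a vmovdqa, and
// it has no immediate byte, making it one byte shorter than vpshufd.
#define HILO64(x)  _mm_unpackhi_epi64((x), (x))
#else
// Legacy SSE: pshufd can write a different destination register, which
// avoids a movdqa; punpckhqdq would clobber its source.
#define HILO64(x)  _mm_shuffle_epi32((x), 0xEE)
#endif
// (and #undef HILO64 at the end of the block that uses it, as mentioned above)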

I don't expect any new CPUs to have slow shuffles like K8/PM/Merom. Using pshuflw is better for Merom, and it's reasonable to expect it to perform at least as well as pshufd on every future CPU. So this uarch-specific tuning is NOT done at the expense of other CPUs, and it isn't likely to need any future maintenance. (And if it ever does, it's factored out into the HILO32 macro, so re-tuning is easy. I also wanted to make it easier for users of the VCL to tune instruction choices for special cases.)

I cluttered things up with comments evaluating whether there was anything to gain from different choices on different CPUs, to decide which strategy should be the best choice for all CPUs. We don't even have a way to determine what uarch to tune for (separate from the target instruction set), so I didn't try to do that.

Those design notes and extra complexity should probably be removed as much as possible to make the code easier to follow, once I'm done choosing the best sequences for SSE2, SSE4, and AVX.

hadd is slow on every current CPU, and IDK if it will ever get faster. So that tuning decision seems reasonable across the board, even though hadd is an obvious code-size saving: saving uops means taking less space in the out-of-order window.