The aligned vs. unaligned results make intuitive sense. On recent processors the penalty for unaligned access has steadily shrunk: it went to zero on Sandy Bridge (and perhaps earlier), at least for loads that didn't cross a 64B cache-line boundary. On Haswell, even the 64B line-crossing latency penalty disappeared, although only for loads, not stores. You can see this all graphically here: blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ - the 2D charts there are aimed at the penalty of store-to-load forwarding, but the cells off the main diagonal do a great job of showing the unaligned load/store penalties as well.

So you are finding that unaligned loads *still* have a penalty, even on Skylake - right? The key is loads that cross a 64B boundary. Fundamentally, such a load requires bringing in two different lines from the L1 and merging the results, so you get a word composed of some of one line and some of another. The improvements culminating in Haswell reduced the latency of this operation to the point where it fits inside the standard 4-cycle latency for ideal L1 access, but they can't avoid the doubled bandwidth usage of the unaligned loads. In many algorithms the maximum bandwidth of the L1 isn't approached (i.e., the loads per cycle are 1 or fewer), so unaligned access ends up the same as aligned. In your loop, however, you do saturate the load bandwidth, so loads that cross a 64B boundary will cut your throughput in half, or worse.

That doesn't explain the results you got by inverting the load order, but perhaps some of it can be explained by how the loads "pair up". That is, two aligned loads can pair up in the same cycle, since each needs only 1 of the 2 "load paths" from L1; an unaligned load needs both. So if you have a load pattern like AAUUAAUU (where A is an aligned load and U is an unaligned one), you get:

cycle  loads
0      AA
1      U
2      U
3      AA
4      U
5      U
...

So you get 4 loads every 3 cycles, because the aligned loads are always able to pair. On the other hand, if you have a load pattern like AUAUAUAU, you get the following:

cycle  loads
0      A
1      U
2      A
3      U
...

I.e., only 3 loads every 3 cycles, a 25% penalty to throughput, because the aligned loads end up issuing as singletons as well. You might ask why OoO wouldn't solve this - well, OoO is based on the scheduler, which understands instruction dependencies and has a few other special-case tricks to reorder things (e.g., to avoid port retirement conflicts), but otherwise it still does stuff in order, so it likely can't figure out that it should reorder the loads to pair up the aligned ones. Furthermore, the memory model imposes restrictions on reordering loads (though I don't fully grok how this actually falls out in practice once you consider load buffers, the coherency protocol, and so on). All that to say: reordering the loads in the source might easily swap the behavior from an AAUU pattern to an AUAU one.