I have done testing with random permutations and with the hardware prefetchers disabled (and with both at the same time), and the simple stride results with no HW PF match the permuted results with HW PF enabled once the permutation block gets big enough.
I did these tests back in July, and we have changed a number of aspects of the system configuration since then, but I think that
Transparent Huge Pages were enabled when I did these tests. I don't recall if this was before or after we disabled some of the C states. The "untile" frequency may also make a difference -- it automatically ramps up to full speed when running bandwidth tests, but when running latency tests the Power Control Unit may not think that the "untile" is busy enough to justify ramping up the frequency.

Without knowledge of the tag directory hash, the processor placement, the MCDRAM hash, etc., it is challenging to make a lot of sense of the results. On KNC the RDTSC instruction had about a 5-cycle latency, so I was able to do a lot more with timing individual loads, and the single-ring topology made the analysis easier. There are more performance counters in the "untile" on KNL, but there is no documentation on where the various box numbers are located on the mesh. There is some evidence that the CHA box numbers are re-mapped -- on a 32-tile/64-core Xeon Phi 7210 all 38 CHAs are active, but the six CHAs with anomalous readings are numbered 32-37. The missing core APIC IDs are not bunched up in this way.
The stacked memory modules have slightly higher latency because they are typically run in "closed page" mode, and because there is an extra set of chip-to-chip crossings. HMC (and Intel's MCDRAM) have an extra SERDES step between the memory stack and the processor chip. There are many different approaches to error-checking on SERDES links, but it is probably safe to expect that error-checking will require at least some added latency.