I just ran across some performance counter bugs on Haswell that may influence one's interpretation of instruction retirement rates and may bias measurements of uops per instruction. I put performance counters around 100 (outer) iterations of a simple 10-instruction loop that executed 1000 times. According to Agner's instruction tables this loop should have 12 uops. Both the fixed-function "instructions retired" event and the programmable "INST_RETIRED.ANY_P" event reported 12 instructions per loop iteration (not 10), while the programmable UOPS_RETIRED.ALL event reported 14 uops per loop iteration (not 12). While I could be misinterpreting the uop counts, there is no way that I could have mis-counted the instructions: it took all of my fingers, but did not generate an overflow condition. ;-)
It turns out that there are a number of errata for both the instructions retired events and the uops retired event on all Intel Haswell processors. Somewhat perversely, the different Haswell products have different errata listed, even though they have the same DISPLAYFAMILY_DISPLAYMODEL designation, but all of the ones that I checked (Xeon E5 v3 (HSE71 in doc 330785), Xeon E3 v3 (HSW141 in doc 328908), and 4th Generation Core Desktop (HSD140 in doc 328899)) include an erratum to the effect that the "instructions retired" counts may overcount or undercount. This erratum is also listed for the 5th Generation Core (Broadwell) processors (BDM61 in doc 330836), but is not listed in the "specification update" document for the Skylake processors (doc 332689).
For this particular loop the counts are completely stable with respect to variations in loop length (e.g., from 500 to 11000 shows no effect other than asymptotically decreasing overhead).
The machine is running with HyperThreading enabled, but there are no other users or non-OS tasks and this job was pinned to (local) core 4 on socket 1, so there is no way that interference with another thread (mentioned in several other errata) could account for seeing identical behavior over several hundred trials. Reading between the lines, the language that Intel uses in the descriptions of these performance counter errata seems consistent with the language used in other cases for which the errors are not "large" (not approaching 100%), but are also not "small" (not limited to single-digit percentages). It is very hard to decide whether I want to take the time to try to characterize or bound this particular performance counter error. It may end up having a simple explanation, or it may end up being completely inexplicable without inspection of the processor RTL.
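To put rough numbers on the discrepancy described above, here is a minimal Python sketch that does nothing more than the arithmetic on the per-iteration counts quoted in this post. The constant and function names are made up for illustration, and the "expected" uop count is the estimate from Agner's tables, not a measured value.

```python
# Per-iteration counts quoted in the post.
EXPECTED_INSTRUCTIONS = 10  # counted by hand from the loop body
REPORTED_INSTRUCTIONS = 12  # fixed-function counter and INST_RETIRED.ANY_P
EXPECTED_UOPS = 12          # estimate from Agner's instruction tables
REPORTED_UOPS = 14          # UOPS_RETIRED.ALL


def relative_error(reported, expected):
    """Signed fractional error of a reported value against its expected value."""
    return (reported - expected) / expected


def summarize():
    """Return the two overcounts and the resulting bias in uops per instruction."""
    expected_ratio = EXPECTED_UOPS / EXPECTED_INSTRUCTIONS
    reported_ratio = REPORTED_UOPS / REPORTED_INSTRUCTIONS
    return {
        "instruction_error": relative_error(REPORTED_INSTRUCTIONS, EXPECTED_INSTRUCTIONS),
        "uop_error": relative_error(REPORTED_UOPS, EXPECTED_UOPS),
        "expected_uops_per_instruction": expected_ratio,
        "reported_uops_per_instruction": reported_ratio,
        "ratio_error": relative_error(reported_ratio, expected_ratio),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

With these inputs the instruction count comes out 20% high and the uop count about 16.7% high, both in the "not small, not large" range. Because both events are high by the same two counts per iteration, the derived uops-per-instruction ratio moves much less: from 1.2 to about 1.167, or roughly 2.8% low. That is only an observation about these particular numbers, not an explanation of the erratum.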