Agner's CPU blog

andreas

With 8000 8-byte NOPs, the bottleneck is not the decoders but instruction cache misses. You can see this by looking at the L2_RQSTS.CODE_RD_HIT counter (24.C4):
sudo ./nanoBench.sh -f -conf configs/cfg_AlderLakeP_all.txt -cpu 0 -basic -unroll 1000 -loop 1000 -asm "|8|8|8|8|8|8|8|8" | grep -v 0.00 ...
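As a rough sanity check on the cache-miss explanation, the code footprint of this benchmark can be estimated from the command line. The sketch below reads the -asm string as eight 8-byte NOPs per unrolled copy, which matches the 8000 figure in the post. The 32 KiB L1 instruction cache size is an assumption that is not stated in the post, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope code footprint for the benchmark command above.
NOPS_PER_COPY = 8        # "|8|8|8|8|8|8|8|8" read as eight NOPs per unrolled copy
NOP_LENGTH_BYTES = 8     # each NOP is 8 bytes long
UNROLL_COUNT = 1000      # from "-unroll 1000"

# ASSUMPTION: 32 KiB L1 instruction cache; this size is not given in the post.
ASSUMED_L1I_BYTES = 32 * 1024


def code_footprint_bytes(nops_per_copy: int, nop_length: int, unroll: int) -> int:
    """Total size in bytes of the unrolled NOP sequence."""
    return nops_per_copy * nop_length * unroll


total_nops = NOPS_PER_COPY * UNROLL_COUNT
footprint = code_footprint_bytes(NOPS_PER_COPY, NOP_LENGTH_BYTES, UNROLL_COUNT)
exceeds_l1i = footprint > ASSUMED_L1I_BYTES

print(f"NOPs in loop body: {total_nops}")
print(f"Code footprint:    {footprint} bytes")
print(f"Exceeds assumed L1I ({ASSUMED_L1I_BYTES} bytes): {exceeds_l1i}")
```

Under that assumption the loop body is roughly twice the size of the L1 instruction cache, which would be consistent with code reads being served from L2.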

andreas

agner wrote (2022-05-16, 4:47:49):
This is when your code is running out of the µop cache. The µops have already been decoded. The decoder throughput can only be measured when the loop is bigger than the µop cache.

My code is not running out of the µop cache. This can be seen from the UOPS_MITE count that is shown in the output.

andreas

"The decoders can deliver a maximum of 4 µops per clock for a single thread"
According to my tests, the decoders on the P cores can decode 6 instructions per cycle. Here is an example for a sequence of NOP instructions that require, on average, 0.17 cycles each: https://uops.info/html-tp/ADL-P/NOP-Measure...
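The step from 0.17 cycles per NOP to 6 instructions per cycle is just a reciprocal. A minimal sketch of that arithmetic follows; the function and variable names are hypothetical, and only the 0.17, 4, and 6 figures come from the posts.

```python
def instructions_per_cycle(cycles_per_instruction: float) -> float:
    """Convert a measured per-instruction cost into sustained throughput."""
    return 1.0 / cycles_per_instruction


measured_cycles_per_nop = 0.17   # average cost per NOP quoted in the post
claimed_limit = 4                # "maximum of 4 µops per clock" from the quoted text

ipc = instructions_per_cycle(measured_cycles_per_nop)

print(f"Throughput: {ipc:.2f} instructions per cycle")
print(f"Above the quoted limit of {claimed_limit}: {ipc > claimed_limit}")
```

A value of 0.17 is what a rounded 1/6 looks like, so the measurement fits a decode rate of 6 per cycle and cannot be explained by a limit of 4.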

andreas

According to your optimization guide, inc and dec cannot be macro-fused on Tiger Lake. What do your tests for this look like? According to my tests (which are available here: https://www.uops.info/html-tp/TGL/DEC_R64-Measurements.html#macroFusion), they do macro-fuse in the same way as on previous mi...

Search found 4 matches

Re: Intel's new Chimera: Alder Lake

Re: Intel's new Chimera: Alder Lake

Re: Intel's new Chimera: Alder Lake

Re: Intel Sunny Cove