News and research about CPU microarchitecture and software optimization
-
andreas
- Posts: 4
- Joined: 2021-04-06, 18:25:56
by andreas » 2022-05-17, 15:54:54
With 8000 8-byte NOPs, the bottleneck is not the decoders but instruction cache misses. You can see this by looking at the L2_RQSTS.CODE_RD_HIT counter (24.C4):
Code:
sudo ./nanoBench.sh -f -conf configs/cfg_AlderLakeP_all.txt -cpu 0 -basic -unroll 1000 -loop 1000 -asm "|8|8|8|8|8|8|8|8" | grep -v 0.00
RDTSC: 3.01
Instructions retired: 8.00
Core cycles: 4.00
Reference cycles: 3.01
L2_RQSTS.CODE_RD_HIT: 1.00
L2_RQSTS.ALL_CODE_RD: 1.00
L2_REQUEST.ALL: 1.00
L2_REQUEST.ALL: 1.00
...
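The arithmetic behind this can be sketched as follows. This is only a back-of-the-envelope check, assuming a 32 KiB L1 instruction cache with 64-byte lines (neither figure is stated in the post) and assuming each `|8` in the command stands for one 8-byte NOP, as the text says. Under those assumptions the unrolled code does not fit in the L1 instruction cache, and each group of 8 NOPs covers exactly one cache line, which matches the one L2 code read per group in the output.

```python
# Assumptions (not from the post): 32 KiB L1 instruction cache, 64-byte lines.
L1I_BYTES = 32 * 1024
LINE_BYTES = 64

# Values taken from the command line and output above.
NOP_BYTES = 8          # each NOP is 8 bytes long
NOPS_PER_GROUP = 8     # -asm "|8|8|8|8|8|8|8|8"
UNROLL = 1000          # -unroll 1000
CORE_CYCLES = 4.00     # "Core cycles" per group of 8 NOPs

group_bytes = NOP_BYTES * NOPS_PER_GROUP
code_bytes = group_bytes * UNROLL
fits_in_l1i = code_bytes <= L1I_BYTES
lines_per_group = group_bytes / LINE_BYTES
fetch_bytes_per_cycle = group_bytes / CORE_CYCLES

print(f"unrolled code size: {code_bytes} bytes")
print(f"fits in L1I: {fits_in_l1i}")
print(f"cache lines per 8-NOP group: {lines_per_group}")
print(f"code bytes consumed per core cycle: {fetch_bytes_per_cycle}")
```

The `-loop 1000` option repeats the same code, so it does not change the code footprint.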
-
agner
- Site Admin
- Posts: 76
- Joined: 2019-12-27, 18:56:25
by agner » 2022-05-19, 5:41:49
We have now made some more tests on the P core after fixing the problem with the CPU overheating. The results are more stable now and basically confirm what Andreas wrote:
- The decoders can handle up to 6 µops per clock.
- Simple integer instructions have a maximum throughput of 5 instructions per clock.
- Integer additions with a small immediate constant have latencies near zero.
- Floating point addition has a latency of 2 clock cycles in chains of similar instructions, otherwise the latency is 3.
- Cache read throughput: 3 reads per clock with sizes ≤ 256 bits, 2 reads per clock with 512 bits.
- Cache write throughput: 2 writes per clock with sizes ≤ 256 bits, 1 write per clock with 512 bits.
- Mixed read/write throughput: 3 reads and 1 write per clock with sizes ≤ 128 bits.
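To put the cache throughput figures in the list above on a common scale, they can be converted to bytes per clock. The sketch below only multiplies accesses per clock by the largest access width quoted for each case; the labels and the helper name `bytes_per_clock` are mine, and these are upper bounds implied by the numbers above, not additional measurements.

```python
# (label, accesses per clock, access width in bits), taken from the list above.
measurements = [
    ("read, 256-bit", 3, 256),
    ("read, 512-bit", 2, 512),
    ("write, 256-bit", 2, 256),
    ("write, 512-bit", 1, 512),
    ("mixed 3 reads + 1 write, 128-bit", 4, 128),
]

def bytes_per_clock(accesses_per_clock, width_bits):
    """Bytes moved per clock for a given access rate and width."""
    return accesses_per_clock * width_bits // 8

peak = {label: bytes_per_clock(n, bits) for label, n, bits in measurements}

for label, value in peak.items():
    print(f"{label}: {value} bytes/clock")
```

By this measure, 512-bit reads move the most data per clock even though fewer of them can issue per clock, while writes reach the same number of bytes per clock at 256 and 512 bits.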
-
alfred0809
- Posts: 1
- Joined: 2022-05-22, 6:15:46
by alfred0809 » 2022-05-22, 6:19:37
I have tested an Alder Lake, but I have not been able to get access to a setup that makes it possible to enable the AVX512 instructions. The performance of the P cores is improved somewhat over the Intel Ice Lake. The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread.
-
agner
- Site Admin
- Posts: 76
- Joined: 2019-12-27, 18:56:25
by agner » 2022-05-22, 10:06:47
alfred0809 wrote: "The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread."
This agrees with my measurements.
-
agner
- Site Admin
- Posts: 76
- Joined: 2019-12-27, 18:56:25
by agner » 2022-06-28, 6:35:28
This Reddit post reports experiments on how to make sure that heavy tasks run on the P cores:
https://www.reddit.com/r/XMG_gg/comment ... ores_when/.
I still think it is unreasonable to expect ordinary computer users to attend to processor-specific performance tuning details. Intel's confusing product names make it difficult to even know what microarchitecture your computer is based on.
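For readers who do want to try this kind of tuning, here is a minimal sketch of one way to restrict a process to the P cores. It is not the method from the Reddit post. It assumes Linux, and it assumes the kernel exposes the P-core logical CPUs in `/sys/devices/cpu_core/cpus`; the function names are mine. If either assumption fails, it does nothing.

```python
import os

# Assumption: on Linux hybrid CPUs this file lists the P-core logical CPUs.
P_CORE_LIST = "/sys/devices/cpu_core/cpus"

def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_p_cores(path=P_CORE_LIST):
    """Restrict the current process to P cores; return the set used, or None."""
    if not hasattr(os, "sched_setaffinity") or not os.path.exists(path):
        return None  # not Linux, or no hybrid topology information available
    with open(path) as f:
        cpus = parse_cpu_list(f.read())
    if not cpus:
        return None
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Calling `pin_to_p_cores()` at the start of a heavy task returns the set of logical CPUs it pinned to, or `None` if nothing was changed.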