Intel's new Chimera: Alder Lake

Post by **agner** » 2022-04-24, 16:56:59

A chimera is a monster combining parts from different animals, or an organism containing multiple different sets of DNA. I am calling Intel's new Alder Lake processor a chimera because it is a hybrid containing two different kinds of CPU cores with very different designs.

The Alder Lake processor contains from 2 to 8 cores of the 'Golden Cove' architecture, called P cores, and from 0 to 8 cores of the 'Gracemont' architecture, called E cores. The P cores (Performance cores) are high-performance CPU cores using the latest state-of-the-art technology to get maximum performance. The E cores (Efficiency cores) use the same technology as the 'Atom' series with low power consumption and lower performance. The idea behind this design is that the P cores can give high performance for a limited number of threads, while the E cores allow the CPU to run many threads and still limit the power consumption. This may sound like a nice compromise in theory, but it involves a lot of problems when the same program or the same thread can jump arbitrarily between two very different kinds of cores.

The initial Alder Lake design had different CPUID numbers for the two kinds of cores. This gave problems with DRM software. If a program using DRM detects that the CPUID has changed, it will assume that the program has been moved to a different computer in violation of the license. This, of course, will stop the execution. Intel had to modify the Alder Lake and give it the same CPUID for all cores in order to fix this problem[1]. Now, it is difficult for a running program to detect what kind of core it is running on.

Another problem is that the P cores are designed for the latest instruction set extensions, including AVX512 and a new set of half-precision floating point instructions (AVX-512 FP16) that are useful for neural networks. The E cores only support AVX2, not the later instruction set extensions, such as AVX512. What would happen if a program that starts executing in a P core and detects that AVX512 instructions are available is moved by the operating system to an E core that doesn't support this instruction set? A smart operating system might catch the error when the program attempts to execute an AVX512 instruction and move it back to a P core. But this requires that the operating system is designed with special support for the Alder Lake. If the program is running on an older operating system, it will crash in this situation. Therefore, Intel had to disable all instructions that are not supported by the E cores. The AVX512 instructions are actually implemented in the hardware, but they are disabled. Some motherboards have a BIOS feature that makes it possible to disable the E cores and enable the AVX512 instructions[2]. This feature is not endorsed by Intel, and it has now been disabled in a microcode update, even for the i3 models that have no E cores[3]. Intel have actually sacrificed their flagship 512-bit instructions in order to run multiple threads in low-power cores.
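The migration hazard described above can be illustrated with a toy model. The Python sketch below does not perform any real CPU detection. The feature sets, the kernel names, and the helpers `choose_kernel` and `run_on` are all hypothetical and exist only for illustration. It models a program that checks the instruction set once at startup, caches the decision, and is then moved to the other kind of core.

```python
# Toy model of runtime dispatch on a hybrid CPU. All names are hypothetical.
P_CORE = frozenset({"AVX2", "AVX512F", "AVX512_FP16"})  # P core, nothing disabled
E_CORE = frozenset({"AVX2"})                            # E core

# Instruction set that each kernel version needs
KERNEL_NEEDS = {"avx512": "AVX512F", "avx2": "AVX2"}


def choose_kernel(visible_features):
    """Pick the widest kernel that the visible feature set allows."""
    return "avx512" if "AVX512F" in visible_features else "avx2"


def run_on(kernel, core_features):
    """Return an outcome string; a missing instruction set means a crash."""
    if KERNEL_NEEDS[kernel] not in core_features:
        return "illegal instruction"
    if choose_kernel(core_features) != kernel:
        return "ok, but suboptimal"
    return "ok"


# Case 1: detect on a P core, then get migrated to an E core.
print(run_on(choose_kernel(P_CORE), E_CORE))   # illegal instruction

# Case 2: detect on an E core, then get migrated to a P core.
print(run_on(choose_kernel(E_CORE), P_CORE))   # ok, but suboptimal

# With the fix described above, every core exposes only the common subset.
COMMON = P_CORE & E_CORE
print(run_on(choose_kernel(COMMON), COMMON))   # ok
```

The last case never crashes, but only because the AVX512 hardware in the P cores goes unused.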

It is very difficult to optimize the software execution for this hybrid system. A further complication is that a P core can run two threads in the same core so that each thread gets half of the resources. This is what Intel call hyperthreading. A program thread may run in three different configurations with different performance parameters:

1. Running alone in a P core with maximum performance
2. Sharing a P core with another thread, giving half the resources
3. Running in a low-power E core

It is completely unrealistic that an application program can handle this situation in a reasonable manner and optimally allocate different threads to the different cores. Hardly any software application company can afford to make different versions of their code for every new microprocessor model and verify, maintain, and support all these versions. The Alder Lake has implemented a special hardware solution to this problem called the 'Intel Thread Director'. The Intel Thread Director is an embedded microcontroller that monitors all threads and measures the resource use of each thread. The operating system can use this information to calculate the optimal allocation of P cores and E cores to the different threads[4]. Windows 11 has support for the Intel Thread Director. Future versions of Linux are planned to support it too[5], while there are no known plans to support it in macOS[6].

The way that Windows 11 handles this problem is still flawed, however. The system is giving high priority only to the thread that has the user focus. This ignores the behavior of many users. A user who is waiting for the computer to finish a heavy duty task is typically not just sitting and waiting. He/she is more likely to do something else during the waiting time, for example checking email[2]. There are various technical options that the user can use to control the prioritization of threads, but it is unreasonable to require that the user understands and masters such options when the user's attention is on a complicated calculation task rather than on the hardware details of a specific computer. It is already quite difficult to optimize for hyperthreading, as I have argued before[7]. The hybrid design of the Alder Lake just makes the optimization an order of magnitude more complicated. It looks like the hardware designers have unrealistic expectations of how much software designs can be attuned to processor-specific peculiarities.

I have tested an Alder Lake, but I have not been able to get access to a setup that makes it possible to enable the AVX512 instructions. The performance of the P cores is improved somewhat over the Intel Ice Lake. The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread or 3 µops per thread when running two threads. This throughput is not limited by code cache lines. The decoders can deliver a maximum of 4 µops per clock for a single thread or 2 µops per thread when running two threads. The decoders can handle a maximum of 16 bytes per clock, or 2x16 bytes when running two threads. The figures of 6 decoders and 8 µops per clock published elsewhere[4] are not confirmed by my measurements.

Instruction latencies and throughputs are similar to the Ice Lake for most instructions, but the latency for floating point addition is reduced from 4 to 2 clock cycles. I have not published instruction tables for the Alder Lake. I prefer to wait until a pure Golden Cove with all instructions enabled becomes available.

References:

1. Kyle Orland: Faulty DRM breaks dozens of games on Intel’s Alder Lake CPUs. Ars Technica, 2021

2. Ian Cutress and Andrei Frumusanu: The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity. Anandtech, 2021

3. Xaver Amberger: Intel completely disables AVX-512 on Alder Lake after all – Questionable interpretation of “efficiency”. IgorsLab, 2021

4. Ian Cutress and Andrei Frumusanu: Intel Architecture Day 2021: Alder Lake, Golden Cove, and Gracemont Detailed. Anandtech, 2021

5. Michael Larabel: Intel HFI To Premiere In Linux 5.18 For Improving Hybrid CPU Performance/Efficiency. Phoronix, 2022

6. Andrew Cunningham: Apple may be done with Intel Macs, but Hackintoshes can still use the newest CPUs. Ars Technica, 2022

7. Agner Fog: How good is hyperthreading? Agner's CPU blog, 2009

... · Post by **...** » 2022-04-25, 6:48:11

Thanks for the writeup.
If it helps, I have an AVX512 enabled 12700K if there's some program/code you want me to run on it.

If you want to get such a chip yourself, I wrote an article regarding requirements.

Post by **agner** » 2022-04-25, 13:00:47

Thank you for your proposal. The easiest way is if I can test it by remote access (Linux). Please send me an email agner_at-agner.org

davidbak · Post by **davidbak** » 2022-04-26, 1:33:59

Isn't there also software that would have been pessimized in the other direction: by starting on an E-core without instruction set extensions like AVX-512 and moving to a P-core (after having looked to see what instructions were available)? (There is linear algebra software that tunes itself by choosing code to run at runtime based on processor capabilities, right? Perhaps other software too?)

Post by **agner** » 2022-04-26, 4:26:47

davidbak wrote:

Isn't there also software that would have been pessimized in the other direction

Yes

Post by **agner** » 2022-05-13, 9:01:33

I have now been able to test the Alder Lake P cores with full access to the AVX512 instructions thanks to a lot of help from Zingaburga.

I can confirm that the full AVX512 instruction set is working on the Alder Lake with an early microcode and a certain BIOS when only the P cores are enabled.

The instruction timings are very similar to the Ice Lake and Tiger Lake processors. Most floating point vector instructions have a latency of 4 clock cycles and a maximum throughput of 2 vectors per clock cycle for vector sizes up to 256 bits and 1 vector per clock for 512 bits. The Alder Lake was aggressively saving power during my tests. The throughput was 1/4 of the maximum most of the time. I am not going to publish the complete timing tables because this has little use when the P cores and the E cores have different timings and many of the instructions are only available under very special conditions.

The Alder Lake P cores have support for a new instruction set extension named AVX512_FP16. This includes 110 new instructions for calculations with half precision floating point vectors (Description, Specification).

Half precision is useful for media applications and artificial intelligence applications where the lower precision is acceptable. The use of half precision is doubling the throughput of most vector operations. Half precision is already supported by some ARM processors.

The latencies and throughput for half precision vector add, subtract, and multiply are the same as for single precision. Division, however, has poor performance. The latencies for half precision division, reciprocal, and square root are about double the latencies for single precision.
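The doubling of throughput can be checked with a little arithmetic. The sketch below uses only the peak vector throughputs quoted above (2 vectors per clock up to 256 bits, 1 vector per clock at 512 bits) and assumes they apply equally to add, subtract, and multiply at all three precisions. The function name `elements_per_clock` is made up for this illustration.

```python
def elements_per_clock(vector_bits, element_bits):
    """Peak elements per clock from the vector throughputs quoted above:
    2 vectors per clock up to 256 bits, 1 vector per clock at 512 bits."""
    vectors_per_clock = 1 if vector_bits == 512 else 2
    return (vector_bits // element_bits) * vectors_per_clock


for name, bits in (("double", 64), ("single", 32), ("half", 16)):
    print(name, elements_per_clock(256, bits), elements_per_clock(512, bits))
# double 8 8
# single 16 16
# half 32 32
```

These are peak figures. The measured throughput was often lower because of the aggressive power saving mentioned above.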

The AVX512_FP16 instruction set extension also includes new instructions for calculations with complex numbers in half precision. These instructions are intended mainly for Fourier analysis according to an Intel technology guide. Complex number multiplication and fused multiply-and-add are implemented as single instructions, while complex number division requires a series of instructions.

Instructions for complex number multiplication have a latency of approximately 8 clock cycles. These instructions are using the multiplication hardware twice, yet they are issued as a single µop. The complex number multiplication instructions have the weird restriction that the destination register must be different from the source registers. This restriction makes no sense to me since the processor has register renaming so that the destination register is always renamed to a different physical register. No other x86 instructions have this restriction.

The precision of complex number multiplication is reduced. A complex number multiplication is calculated with the formula: (Ar, Ai) * (Br, Bi) = (Ar*Br-Ai*Bi, Ar*Bi+Ai*Br). Each part of the result can be calculated as a multiplication followed by a fused multiply-and-add (FMA) operation. FMA operations are normally calculated with extended precision on the intermediate multiplication result according to the floating point standard. However, this would give rise to an asymmetry in complex number multiplication because the two intermediate multiplications would have different precisions. This would have the unfortunate consequence that A*B and B*A might give slightly different results. The hardware implementation avoids this asymmetry by rounding all intermediate results to half precision. It would be better to do all intermediate calculations with extended precision, but this would make the hardware implementation more complicated. It is necessary to take the loss of precision in complex number multiplication into account when deciding whether half precision is sufficient.
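The asymmetry can be reproduced in a small emulation. The sketch below is not the hardware algorithm. It emulates half precision in Python by rounding through the `struct` format `'e'`, and it looks only at the imaginary part Ar*Bi + Ai*Br. The helper names and the input values are chosen for illustration, and the intermediate sums happen to be exact in double precision for these inputs.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def imag_with_fma(a, b):
    """Imaginary part Ar*Bi + Ai*Br: first product rounded, second one exact."""
    (ar, ai), (br, bi) = a, b
    t = to_half(ar * bi)          # ordinary multiplication, rounded to half
    return to_half(t + ai * br)   # FMA: product kept exact, one final rounding


def imag_all_rounded(a, b):
    """Same part with every intermediate result rounded to half precision."""
    (ar, ai), (br, bi) = a, b
    return to_half(to_half(ar * bi) + to_half(ai * br))


eps = 2.0 ** -10                  # spacing of half-precision numbers in [1, 2)
A = (1.0 + eps, 3.0)
B = (1.0, 1.0 + eps)

print(imag_with_fma(A, B), imag_with_fma(B, A))        # 4.0 4.00390625
print(imag_all_rounded(A, B), imag_all_rounded(B, A))  # 4.0 4.0
```

With these inputs the FMA variant gives different results for A*B and B*A, while the fully rounded variant is symmetric. The exact imaginary part is slightly above 4.001953125, so its correctly rounded value is 4.00390625. The symmetric result 4.0 is the less accurate one, which illustrates the loss of precision.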

The AVX512_FP16 instructions are supported by GNU, Clang, and Intel compilers, Intel's emulator, and my disassembler. I have plans to support half precision in my vector class library.

The AVX512_FP16 extension adds 110 new instructions to the already bloated x86 instruction set. There are now more than 2000 instructions in the x86 instruction set with all its extensions. Not all 2000 instructions are supported in all processors, but the hardware complexity needed to support all these instructions must be enormous. Historically, every major update to the x86 instruction set has involved new patches to the way instructions are coded and new complications, as I have discussed elsewhere. The AVX512_FP16 extension now adds yet another novelty. A hitherto unused m-bit in the EVEX prefix is used for indicating half precision. This opens up two new code pages with space for 2048 new instructions where only a few of these are actually used. Instructions with an immediate operand still use the legacy coding scheme with the m-bits equal to 3. The restriction that source and destination registers must be different for a few of the new instructions, as mentioned above, is yet another new complication.

andreas · Post by **andreas** » 2022-05-15, 21:04:11

agner wrote: ↑
2022-04-24, 16:56:59
The decoders can deliver a maximum of 4 µops per clock for a single thread

According to my tests, the decoders on the P cores can decode 6 instructions per cycle. Here is an example for a sequence of NOP instructions that require, on average, 0.17 cycles: https://uops.info/html-tp/ADL-P/NOP-Measurements.html

agner wrote: ↑
2022-04-24, 16:56:59
The decoders can handle a maximum of 16 bytes per clock, or 2x16 bytes when running two threads.

According to my test, they can handle 32 bytes per clock. A sequence of two 15-byte NOPs followed by a 2-byte NOP can be executed in a single cycle:

> sudo ./nanoBench.sh -conf configs/cfg_AlderLakeP_common.txt -cpu 0 -df -unroll 100 -asm "|15|15|2"

CORE_CYCLES: 1.00
INST_RETIRED: 3.00
IDQ.MITE_UOPS: 3.00
IDQ.DSB_UOPS: 0.02
IDQ.MS_UOPS: 0.00
LSD.UOPS: 0.00
UOPS_ISSUED: 3.00
UOPS_EXECUTED: 0.00
UOPS_RETIRED.SLOTS: 3.00
...

agner wrote: ↑
2022-04-24, 16:56:59
Instruction latencies and throughputs are similar to the Ice Lake for most instructions

There are quite a few differences.

Due to the additional ALU, many integer instructions now have a throughput of 0.2 cycles instead of 0.25 cycles, for example https://uops.info/html-instr/ADD_03_R64_R64.html#ADL-P.

Due to the additional load port, three loads can now be performed per cycle, for example: https://uops.info/html-instr/VMOVSD_XMM_M64.html#ADL-P.

There is also an interesting new optimization that can perform additions with small immediates with zero latency: https://twitter.com/uops_info/status/14 ... 4490672130

Post by **agner** » 2022-05-16, 4:47:49

Thank you for contributing with your measurements.

Andreas wrote:
The decoders on the P cores can decode 6 instructions per cycle.
..they can handle 32 bytes per clock

This is when your code is running out of the µop cache. The µops have already been decoded. The decoder throughput can only be measured when the loop is bigger than the µop cache.

There is also an interesting new optimization that can perform additions with small immediates with zero latency

I can confirm this.

The ALU and load throughputs in my tests were inferior to your measurements. I had problems with poor performance in many of my measurements. Perhaps the CPU was overheated or running in power-saving mode most of the time. I didn't care to refine my measurements because I had no direct access to the machine. I had to send test scripts to Zingaburga and ask him to do the tests for me. I will wait till the Golden Cove becomes available.

andreas · Post by **andreas** » 2022-05-16, 14:06:34

agner wrote: ↑
2022-05-16, 4:47:49
This is when your code is running out of the µop cache. The µops have already been decoded. The decoder throughput can only be measured when the loop is bigger than the µop cache.

My code is not running out of the µop cache. This can be seen from the UOPS_MITE count that is shown in the output.

Post by **agner** » 2022-05-17, 4:18:07

Andreas wrote:
This can be seen from the UOPS_MITE count that is shown in the output

With 4000 8-byte NOPs I see 32 bytes per clock. With 8000 8-byte NOPs I see 16 bytes per clock.
MITE_UOPS keeps counting (event 0x79, umask=4). DSB_UOPS (umask=8) stops counting.
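The arithmetic behind these two loop sizes can be sketched as follows. This is a rough illustration only. It assumes that "4k µops" means 4096 entries and that each NOP occupies one µop, and it ignores the set-associative organization of the real µop cache, so a loop close to the limit may not actually fit. The helper name `describe_nop_loop` is made up for this example.

```python
UOP_CACHE_CAPACITY = 4 * 1024  # "4k µops", assumed here to mean 4096


def describe_nop_loop(n_nops, nop_bytes=8, capacity=UOP_CACHE_CAPACITY):
    """Return (code bytes, True if the loop could fit in the µop cache)."""
    uops = n_nops              # assumption: one µop per NOP
    return n_nops * nop_bytes, uops <= capacity


# Pair each loop size with the bytes per clock observed above.
for n, bytes_per_clock in ((4000, 32), (8000, 16)):
    size, fits = describe_nop_loop(n)
    print(n, size, fits, size / bytes_per_clock)
# 4000 32000 True 1000.0
# 8000 64000 False 4000.0
```

Under these assumptions, only the larger loop is certain to exceed the µop cache, which is the condition stated above for measuring decoder throughput.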

Agner's CPU blog
