Agner`s CPU blog

Stop the instruction set war

Author: Agner Fog Date: 2009-12-05 10:43

There is an almost invisible war going on between Intel and AMD. It's the game of who is defining the new additions to the x86 instruction set. This war has been going on behind the scenes for years without being noticed by the majority IT professionals. Most programmers don't care what is going on at the machine code level, so they can't see all the ridiculous consequences that this war has. Those working with virtualization may have noticed that Intel and AMD processors are incompatible when it comes to virtualization software, but this is only one of the more visible consequences of the conflict.

Some important battles

Traditionally, Intel has been the market leader, defining the instruction set for each new generation of microprocessors: 8086, 80186, 80286, 80386, etc. Each new instruction set is a superset of the previous one so that the backwards compatibility is maintained.

Intel's main competitor, AMD, has tried several times to gain the lead by defining their own extensions to the x86 instruction set. In 1998, AMD was the first to introduce Single-Instruction-Multiple-Data (SIMD) instructions in their so-called 3DNow instruction set. Intel never supported the 3DNow instructions. Instead, they introduced the SSE instruction set a few years later. SSE does essentially the same thing as 3DNow, but with a larger register size. Clearly, Intel had won and AMD had to support SSE because it was better than 3DNow.

In 2001, Intel launched their first 64-bit processor named Itanium with a new parallel instruction set. Instead of accepting the new Itanium instruction set, AMD developed their own 64-bit instruction set which - unlike the Itanium - was backwards compatible with the x86 instruction set. The market favored the backwards compatibility so AMD won this time and Intel had to support the AMD64, or x86-64, instruction set in their next processor.

The next important battle is going on right now. It's about instructions with more than two operands. The industry has recognized a need for fused multiply-and-add instructions (e.g.: D=A*B+C) and several other instructions with more than two operands. The current coding scheme supports only instructions with two operands, so a new coding scheme has to be invented in order to support instructions with more than two operands. AMD came first with a proposal. In August 2007, AMD announced a future instruction set called SSE5 with a new coding scheme. The early disclosure of AMD's intentions was a break with the previous policy where both companies had kept their intentions secret as long as possible. Intel's reply came in April 2008 with an early (probably premature) disclosure of their planned AVX instruction set. Intel's AVX coding scheme was much more flexible and future-oriented than AMD's SSE5 scheme, as I argued in a public discussion forum. Most importantly, the AVX scheme has room for future extensions of the size of the SIMD vector registers, while the SSE5 scheme has little room for any future extensions. It was pretty obvious that Intel had won this time, and thanks to the early disclosure of Intel's AVX instructions, it was not too late for AMD to change their plans. In May 2009, AMD published a revision of their plans where they modified the coding scheme for better compatibility with AVX. In addition to a full support of AVX, the revised AMD plan contains most of the original SSE5 instructions under the new name XOP and with the new coding scheme. Unfortunately, Intel had changed their plans in the meantime! In December 2008, Intel published a revision of their plans which involved a change of the coding of the fused multiply-and-add (FMA) instructions. Now it was too late for AMD to change their design once more, so the first AMD processors with FMA will follow the premature Intel specification rather than Intel's later revision. It is difficult to obtain compatibility when you are following a moving target.

Can our software deal with incompatible CPUs?

Software programmers may expect the compilers and software libraries to take care of all the intricacies of instruction sets for them. And the obvious way to deal with incompatible instruction sets is to make multiple branches of the code. Ideally, you would have one branch of code optimized for the latest Intel instruction set, another branch for the latest AMD instruction set, and one or more branches for older CPUs with older instruction sets. The software should detect which CPU it is running on and then choose the appropriate version of the code. This is called CPU dispatching. If the compiler can put a CPU dispatching mechanism into your code then you don't have to care about incompatible instruction sets - or do you?

The only compiler I have found that has such a feature for automatic CPU dispatching is Intel's compiler. The Intel compiler can put a CPU dispatcher into your code so that it checks which instruction set (SSE, SSE2, SSE3, etc.) is supported by the CPU and chooses a branch of code that is optimized for that instruction set - but only as long as it is running on an Intel CPU! It refuses to choose the optimal branch if the CPU doesn't have the "GenuineIntel" mark, even if the non-Intel CPU if fully compatible with the optimized code. And who would want to sell a software package that works poorly on AMD and VIA processors?

The situation is only slightly better when it comes to software libraries. Most compilers are equipped with libraries of standard functions, or you can use third party libraries. Some of the best optimized software libraries are published by Intel, but again they are optimized for Intel processors, and some of the functions work sub-optimally or not at all on non-Intel processors. AMD also publishes software libraries, and the AMD libraries work well on Intel processors, but of course the AMD libraries don't have a code branch that is optimized for instructions that are only available on Intel processors. There are many other libraries available, but they are typically less optimized and have little or no CPU dispatching. The GNU people are beginning to build a - long overdue - CPU dispatch mechanism into the GNU C library. The GNU library is open source, and of course it must support all x86 CPUs. But this work is done mostly by an Intel guy who has his natural focus on the latest Intel instruction sets and who has so far tested his improvements mainly on Intel processors. The best optimized code branches will work on AMD and VIA processors only with a few years delay when AMD and VIA have copied the Intel instruction sets into their processors. I am not aware of any AMD people contributing the GNU C library.

Of course, a programmer can make his own CPU dispatching, but this is a lot of work. The programmer would have to identify the most critical part of his program and divide it into multiple branches. There is no AMD compiler for Windows, so we would have to use assembly code or intrinsic functions to take advantage of AMD-specific instructions in Windows software. Each branch has to be tested separately on different computers. And the maintenance of the code will be a nightmare. Every change in the code has to be implemented in each branch separately and tested on a separate computer.

The disadvantages of CPU dispatching are clear. It makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs.

The convoluted evolution of the x86 instruction set

Historically, AMD and other companies have copied almost all instructions that Intel have invented in order to maintain compatibility, but they have always lagged a few years behind because of the long development process. On the other side, Intel have never copied the instructions of other companies, except for the x86-64 instructions. For example, AMD were the first to make a prefetch instruction. When Intel made a prefetch instruction shortly after, they used a different code for essentially the same instruction, and AMD had to support the Intel code as well. Likewise, VIA/Centaur were first to make an x86 instruction for AES encryption. Several years later, Intel made a different instruction for the same purpose.

This asymmetry, which is due to Intel's market dominance, has forced software developers to use Intel instructions rather than AMD or VIA instructions when they want compatibility.

The current x86 instruction set is the result of a long evolution which has involved many short-sighted decisions and patches. An instruction is coded as one or more bytes of eight bits each. On the original 8086 processor, all instructions had a single byte indicating the type of instruction, possibly followed by one or more bytes indicating the operands (registers, memory operands, or constants). There are 2⁸ = 256 possible single-byte codes, which soon turned out to be insufficient. When all 256 byte codes were used up, Intel had to discard a never-used instruction code (0F = POP CS) and use it as an escape code for 256 new two-byte codes of 0F followed by another byte (A byte is written as two hexadecimal digits, i.e. 00 - FF).

As you may already have predicted, this new space of 256 two-byte codes eventually became filled up too. The logical thing to do now would be to sacrifice another unused code to open up another page of 256 two-byte codes. In fact, there are three undocumented instruction codes that could have been sacrificed for this purpose, but this never happened. Instead they started to make three-byte codes. The problem with discarding the undocumented codes is that these codes actually do something. Not anything important that can't be done just as well with other codes, but at least it is possible to make a program that uses the undocumented instructions. From a technical point of view, it would have been perfectly acceptable to discard the undocumented codes. These codes are not supported by any compiler or assembler. If any programmer is stupid enough to use an undocumented code, which he has no good reason to do, then he cannot expect his program to work on future processors. But the marketing logic is different. If company X makes a CPU that doesn't support the undocumented instruction codes, then company Y could make an advertising campaign saying that Y CPUs are compatible with all legacy software, X CPUs are not. The incompatible software might be old, obscure and useless pieces of code written by reckless programmers with no respect for compatibility issues, but the marketing argument would still be theoretically true.

The problem with the overcrowded instruction code space has been dealt with from time to time by several workarounds and patches. Today, there are far more than a thousand different instruction codes, and many of them use complicated combinations of escape codes, prefix bytes, and postfix bytes to distinguish the different instructions. This makes instructions longer than necessary and, more importantly, it makes the decoding of the instructions complicated.

To understand why instruction decoding is critical, we have to look at how superscalar processors are working today. A modern microprocessor can execute several instructions simultaneously if it has enough execution units and if it can find enough logically independent instructions in the instruction queue. Executing three, four or five instructions simultaneously is not unusual. The limit is not the execution units, which we have plenty of, but the instruction decoder. The length of an instruction can be anywhere from one to fifteen bytes. If we want to decode several instructions simultaneously, then we have a serious problem. We have to know the length of the first instruction before we know where the second instruction begins. So we can't decode the second instruction before we have decoded the first instruction. The decoding is a serial process by nature, and it takes a lot of hardware to be able to decode multiple instructions per clock cycle. In other words, the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are. The new VEX scheme makes the process a little simpler, but we still have to maintain compatibility with the complicated legacy code schemes with all their escape sequences and prefix bytes.

Who owns the codes that are available for future instructions?

As explained above, there is a limited number of unused code bytes available for new instructions. Both Intel, AMD and VIA want to use some of these codes for their new instructions. How is this conflict handled, and how are the vacant codes divided between the competing vendors? We may assume that there are negotiations going on about this, but no public information is available. We can only look at the results and try to guess what has been going on behind the scenes. Judging from which codes are actually used by each company, it looks like Intel has the upper hand in this conflict.

The 256 possible codes of the two-byte instruction code space (0F xx) is divided as follows between the three vendors:

Number of codes	Value after 0F	Assigned to	Used for	Subdivided
2	0D, 0E	AMD	3DNow
1	0F	AMD	3DNow	by suffix byte
4	24, 25, 7A, 7B	AMD	SSE5	by another escape byte
2	A6, A7	VIA	Instructions	by reg bits
2	38, 3A	Intel	SSSE3, SSE4	by another escape byte
2	39, 3B	Intel	for future use	by another escape byte
6	19 - 1E	reserved	hint instructions
11	04, 0A, 0C, 26, 27, 36, 3C, 3D, 3E, 3F, FF		unused
226	All other	Intel	used

As you can see, only a small fraction of the code space is used for instructions introduced by AMD and VIA.

It gets worse when we look at the code space defined by the VEX coding scheme. This scheme has room for 2¹⁶ = 65536 instructions, so there is plenty of room for future instructions without adding extra prefix or suffix bytes. Yet, AMD has not used any of this code space for their new XOP instruction set. Instead, they have made another coding scheme which is very similar to the VEX scheme, but beginning with the byte 8F, where the VEX code begins with C4 or C5. We can only speculate whether the AMD engineers have asked Intel for permission to use part of the huge VEX space and got a no, or whether they have given up beforehand. All we know is that there are disadvantages to using a different coding scheme.

The bytes that follow after C4 or C5 in the VEX scheme are coded in a special ingenious way in order to avoid clashing with existing instructions. It is not possible to use exactly the same method with the XOP scheme beginning with 8F, hence there are small differences between the XOP scheme and the VEX scheme. It would have been possible to make the two schemes identical if AMD had used the initial byte 62 instead of 8F for the XOP scheme, but perhaps Intel have reserved the 62 code for future use. Arguably, it would be possible to use the codes D4 and D5 as well, though with some extra complications.

The small differences between Intel's VEX scheme and AMD's XOP scheme adds an extra complication to the instruction decoder in the CPU. This reduces the likelihood that Intel will copy any of the XOP instructions. If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's.

The free competition

The x86 instruction set reflects a mechanism that is typical for technical evolution in a free market. One company makes one solution, another company makes another solution, and the market forces decide which solution will be most popular. A de facto standard evolves when one solution goes out of the market and everybody adopts the other solution.

So far, so good. But the "market" for x86 instructions differs from other technical markets by the fact that all inventions are irreversible. We have seen that the microprocessor vendors keep supporting even the oldest obsolete or undocumented instructions for marketing reasons, even when the technical advantage of backwards compatibility is negligible compared to the costs. Intel keeps supporting the old undocumented instructions of the original 8086 processor, and AMD keeps supporting the 3DNow instructions that hardly any programmer uses because the market forces have replaced them with the better SSE instructions.

The costs of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.

The total number of x86 instructions is well above one thousand. One may ask whether there is a technical need for such a large number of instructions or if some instructions have been added more for marketing reasons than for technical utility.

We need an open standardization process

The free competition on the microprocessor market has certainly been good for the price and performance of CPUs, but it has not been good for the compatibility. We are in a situation where different companies are competing to invent new instructions and keeping their ideas secret from each other and from their costumers as long as possible. It is clear that the problems discussed above cannot be solved optimally without some kind of regulation and coordination. We need an open standardization committee or at least some form of public deliberation to define new instructions and decide how they are coded.

The current situation with unregulated competition and secret development fails to address the following issues:

Unfair competition. The market often favors Intel instructions rather than AMD or VIA instructions for compatibility reasons. The latter companies can only copy new Intel instructions with a delay of a few years.
AMD does not have access to a fair share of the opcode space to use for their innovations. Historically, AMD has used small corners of the opcode space to avoid the risk that Intel might assign another instruction to the same code. There is no part of the huge VEX opcode space that AMD can safely use without permission from Intel.
Technical incompatibility. AMD, Intel and VIA are assigning different codes to identical or equivalent instructions because each keep their innovations secret for as long as possible. It is so expensive for the software industry to make multiple versions of their software that hardly anybody does so.
Short-sighted solutions. The history of the evolution of the x86 instruction set is full of shortsighted decisions that are sub-optimal in a long term perspective. For example, when the vector registers were extended from MMX to XMM, there was no plan for how to handle the predictable future extension to YMM. If such a plan had been made then we wouldn't need the complexity today of having two versions of every XMM instruction (one that zero-extends into the YMM register and one that leaves the upper part of the register unchanged). A standardization committee or public forum would be more likely to include long-term planning.
Sub-optimal solutions. Some instructions could be implemented better at no extra costs. For example, the PANDN and PALIGNR instructions would be more efficient if the two operands were swapped. A public discussion would have corrected such lapses before it was too late.
Feedback from users is always too late. When a new instruction set is published, there is often public criticism, but then it is too late to change anything. The secrecy around innovations makes it impossible to involve the larger software community in the decision making process.
PR considerations often have more weight than technical considerations. Currently, we have far more than a thousand instructions in the x86 instruction set. This is more than any programmer can memorize. It would be better to have fewer instructions and make each instruction more flexible so that it would cover more applications. But there is an obvious PR value in announcing that the newest processor has a bazillion new instructions. The weird and sometimes deliberately misleading names of the instruction set extensions are obviously decided by PR people rather than by technicians.
Backwards compatibility is taken too far. Today's microprocessors are still supporting even the most obscure undocumented instructions of the first 8086 processor from thirty years ago, while operating systems sometimes fail to support software that is five years old. There is no technical reason for this, only a PR reason. The cost of supporting undocumented and obsolete instructions is actually quite high because they take up space in the overcrowded opcode map. If the undocumented codes had been eliminated then all instructions in the SSSE3 and SSE4 instruction sets would have a one-byte escape code rather than a two-bytes escape code.
Inability to declare anything obsolete. There are many things in the x86 instruction set that needs to be cleaned up and sanitized, which an unregulated market is unable to do. A standardization committee could declare that standards-compliant software should not use a certain feature. Support for this feature could then be removed after e.g. ten years. For example, the x87 register stack is clearly obsolete. If the standard says, don't use x87 and MMX registers, then we could replace all x87 instructions by emulation after a number of years. It is quite costly in terms of silicon space and performance to support the x87 instructions. Some processors even have an extra stage in the pipeline only for rotating the x87 register stack.
The evolution of the x86 instruction set has many dead ends which are never eliminated. When two different companies invent two different solutions to the same problem, then the market is likely to favor one solution while the other becomes a dead end. However, the company that introduced the solution that happened to become a dead end will keep supporting it in all future for marketing reasons, regardless of the technical costs.

My conclusion is that we need an open standardization committee or a public forum to discuss proposed additions and changes to the x86 instruction set and define an open standard. This committee or forum should of course involve representatives from the hardware vendors as well as the software industry, engineering organizations, standardization organizations, university scientists and consumer organizations.

I think it is unlikely that Intel will voluntarily submit to such a standardization initiative because they have a competitive advantage in the current situation. A considerable pressure from outside is needed. This pressure could come from the software industry, from governments, political organizations, legal rulings, academic organizations, or from debates in public media. As a beginning, I hereby invite all interested persons to discuss these issues in various media and public forums.