Software optimization resources | E-mail subscription to this blog | www.agner.org
Threaded View | Search | List | List Messageboards | Help |
Stop the instruction set war - Agner Fog - 2009-12-05 |
Stop the instruction set war - Agner Fog - 2009-12-06 |
The instruction set war's effect on virtualization - Yuhong Bao - 2009-12-28 |
Stop the instruction set war - Agner Fog - 2009-12-15 |
Stop the instruction set war - Norman Yarvin - 2010-01-09 |
Stop the instruction set war - Agner Fog - 2010-01-10 |
Stop the instruction set war - bitRAKE - 2010-01-12 |
Stop the instruction set war - Agner Fog - 2010-01-13 |
Pentium Appendix H - Yuhong Bao - 2010-02-10 |
Stop the instruction set war - Agner Fog - 2010-09-25 |
Stop the instruction set war - Agner - 2011-08-28 |
Stop the instruction set war - Ruslan - 2016-04-17 |
Stop the instruction set war - Agner - 2016-04-17 |
Stop the instruction set war - Agner - 2020-11-01 |
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2009-12-05 10:43 |
There is an almost invisible war going on between Intel and AMD. It's the game of who gets to define the new additions to the x86 instruction set. This war has been going on behind the scenes for years without being noticed by the majority of IT professionals. Most programmers don't care what is going on at the machine code level, so they can't see all the ridiculous consequences that this war has. Those working with virtualization may have noticed that Intel and AMD processors are incompatible when it comes to virtualization software, but this is only one of the more visible consequences of the conflict.

Some important battles

Traditionally, Intel has been the market leader, defining the instruction set for each new generation of microprocessors: 8086, 80186, 80286, 80386, etc. Each new instruction set is a superset of the previous one so that backwards compatibility is maintained. Intel's main competitor, AMD, has tried several times to gain the lead by defining their own extensions to the x86 instruction set.

In 1998, AMD was the first to introduce Single-Instruction-Multiple-Data (SIMD) instructions in their so-called 3DNow instruction set. Intel never supported the 3DNow instructions. Instead, they introduced the SSE instruction set a few years later. SSE does essentially the same thing as 3DNow, but with a larger register size. Clearly, Intel had won, and AMD had to support SSE because it was better than 3DNow.

In 2001, Intel launched their first 64-bit processor, named Itanium, with a new parallel instruction set. Instead of accepting the new Itanium instruction set, AMD developed their own 64-bit instruction set which - unlike the Itanium - was backwards compatible with the x86 instruction set. The market favored the backwards compatibility, so AMD won this time, and Intel had to support the AMD64, or x86-64, instruction set in their next processor.

The next important battle is going on right now. It's about instructions with more than two operands. The industry has recognized a need for fused multiply-and-add instructions (e.g. d = a * b + c).
Can our software deal with incompatible CPUs?

Software programmers may expect the compilers and software libraries to take care of all the intricacies of instruction sets for them. And the obvious way to deal with incompatible instruction sets is to make multiple branches of the code. Ideally, you would have one branch of code optimized for the latest Intel instruction set, another branch for the latest AMD instruction set, and one or more branches for older CPUs with older instruction sets. The software should detect which CPU it is running on and then choose the appropriate version of the code. This is called CPU dispatching.

If the compiler can put a CPU dispatching mechanism into your code then you don't have to care about incompatible instruction sets - or do you? The only compiler I have found that has such a feature for automatic CPU dispatching is Intel's compiler. The Intel compiler can put a CPU dispatcher into your code so that it checks which instruction set (SSE, SSE2, SSE3, etc.) is supported by the CPU and chooses a branch of code that is optimized for that instruction set - but only as long as it is running on an Intel CPU! It refuses to choose the optimal branch if the CPU doesn't have the "GenuineIntel" mark, even if the non-Intel CPU is fully compatible with the optimized code. And who would want to sell a software package that works poorly on AMD and VIA processors?

The situation is only slightly better when it comes to software libraries. Most compilers are equipped with libraries of standard functions, or you can use third party libraries. Some of the best optimized software libraries are published by Intel, but again they are optimized for Intel processors, and some of the functions work sub-optimally or not at all on non-Intel processors.
AMD also publishes software libraries, and the AMD libraries work well on Intel processors, but of course the AMD libraries don't have a code branch that is optimized for instructions that are only available on Intel processors. There are many other libraries available, but they are typically less optimized and have little or no CPU dispatching.

The GNU people are beginning to build a - long overdue - CPU dispatch mechanism into the GNU C library. The GNU library is open source, and of course it must support all x86 CPUs. But this work is done mostly by an Intel guy who has his natural focus on the latest Intel instruction sets and who has so far tested his improvements mainly on Intel processors. The best optimized code branches will work on AMD and VIA processors only with a few years' delay, when AMD and VIA have copied the Intel instruction sets into their processors. I am not aware of any AMD people contributing to the GNU C library.

Of course, a programmer can make his own CPU dispatching, but this is a lot of work. The programmer would have to identify the most critical part of his program and divide it into multiple branches. There is no AMD compiler for Windows, so we would have to use assembly code or intrinsic functions to take advantage of AMD-specific instructions in Windows software. Each branch has to be tested separately on different computers. And the maintenance of the code will be a nightmare. Every change in the code has to be implemented in each branch separately and tested on a separate computer. The disadvantages of CPU dispatching are clear. It makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs.
The convoluted evolution of the x86 instruction set

Historically, AMD and other companies have copied almost all instructions that Intel have invented in order to maintain compatibility, but they have always lagged a few years behind because of the long development process. On the other side, Intel have never copied the instructions of other companies, except for the x86-64 instructions. For example, AMD were the first to make a prefetch instruction. When Intel made a prefetch instruction shortly after, they used a different code for essentially the same instruction, and AMD had to support the Intel code as well. Likewise, VIA/Centaur were first to make an x86 instruction for AES encryption. Several years later, Intel made a different instruction for the same purpose. This asymmetry, which is due to Intel's market dominance, has forced software developers to use Intel instructions rather than AMD or VIA instructions when they want compatibility.

The current x86 instruction set is the result of a long evolution which has involved many short-sighted decisions and patches. An instruction is coded as one or more bytes of eight bits each. On the original 8086 processor, all instructions had a single byte indicating the type of instruction, possibly followed by one or more bytes indicating the operands (registers, memory operands, or constants). There are 2^8 = 256 possible single-byte codes, which soon turned out to be insufficient. When all 256 byte codes were used up, Intel had to discard a never-used instruction code (0F = POP CS) and use it as an escape code for 256 new two-byte codes of 0F followed by another byte (a byte is written as two hexadecimal digits, i.e. 00 - FF). As you may already have predicted, this new space of 256 two-byte codes eventually became filled up too. The logical thing to do now would be to sacrifice another unused code to open up another page of 256 two-byte codes.
In fact, there are three undocumented instruction codes that could have been sacrificed for this purpose, but this never happened. Instead, they started making three-byte codes. The problem with discarding the undocumented codes is that these codes actually do something. Not anything important that can't be done just as well with other codes, but at least it is possible to make a program that uses the undocumented instructions. From a technical point of view, it would have been perfectly acceptable to discard the undocumented codes. These codes are not supported by any compiler or assembler. If any programmer is stupid enough to use an undocumented code, which he has no good reason to do, then he cannot expect his program to work on future processors.

But the marketing logic is different. If company X makes a CPU that doesn't support the undocumented instruction codes, then company Y could make an advertising campaign saying that Y CPUs are compatible with all legacy software and X CPUs are not. The incompatible software might be old, obscure and useless pieces of code written by reckless programmers with no respect for compatibility issues, but the marketing argument would still be theoretically true.

The problem with the overcrowded instruction code space has been dealt with from time to time by several workarounds and patches. Today, there are far more than a thousand different instruction codes, and many of them use complicated combinations of escape codes, prefix bytes, and postfix bytes to distinguish the different instructions. This makes instructions longer than necessary and, more importantly, it makes the decoding of the instructions complicated. To understand why instruction decoding is critical, we have to look at how superscalar processors work today. A modern microprocessor can execute several instructions simultaneously if it has enough execution units and if it can find enough logically independent instructions in the instruction queue.
Executing three, four or five instructions simultaneously is not unusual. The limit is not the execution units, of which we have plenty, but the instruction decoder. The length of an instruction can be anywhere from one to fifteen bytes. If we want to decode several instructions simultaneously, then we have a serious problem: we have to know the length of the first instruction before we know where the second instruction begins, so we can't decode the second instruction before we have decoded the first one. The decoding is a serial process by nature, and it takes a lot of hardware to be able to decode multiple instructions per clock cycle. In other words, the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are. The new VEX scheme makes the process a little simpler, but we still have to maintain compatibility with the complicated legacy coding schemes with all their escape sequences and prefix bytes.

Who owns the codes that are available for future instructions?

As explained above, there is a limited number of unused code bytes available for new instructions. Intel, AMD, and VIA all want to use some of these codes for their new instructions. How is this conflict handled, and how are the vacant codes divided between the competing vendors? We may assume that there are negotiations going on about this, but no public information is available. We can only look at the results and try to guess what has been going on behind the scenes. Judging from which codes are actually used by each company, it looks like Intel has the upper hand in this conflict. The 256 possible codes of the two-byte instruction code space (0F xx) are divided as follows between the three vendors:
As you can see, only a small fraction of the code space is used for instructions introduced by AMD and VIA. It gets worse when we look at the code space defined by the VEX coding scheme. This scheme has room for 2^16 = 65536 instructions, so there is plenty of room for future instructions without adding extra prefix or suffix bytes. Yet, AMD has not used any of this code space for their new XOP instruction set. Instead, they have made another coding scheme which is very similar to the VEX scheme, but beginning with the byte 8F, where the VEX code begins with C4 or C5. We can only speculate whether the AMD engineers have asked Intel for permission to use part of the huge VEX space and got a no, or whether they have given up beforehand.

All we know is that there are disadvantages to using a different coding scheme. The bytes that follow after C4 or C5 in the VEX scheme are coded in a special ingenious way in order to avoid clashing with existing instructions. It is not possible to use exactly the same method with the XOP scheme beginning with 8F, hence there are small differences between the XOP scheme and the VEX scheme. It would have been possible to make the two schemes identical if AMD had used the initial byte 62 instead of 8F for the XOP scheme, but perhaps Intel have reserved the 62 code for future use. Arguably, it would be possible to use the codes D4 and D5 as well, though with some extra complications.

The small differences between Intel's VEX scheme and AMD's XOP scheme add an extra complication to the instruction decoder in the CPU. This reduces the likelihood that Intel will copy any of the XOP instructions. If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's.
The free competition

The x86 instruction set reflects a mechanism that is typical for technical evolution in a free market. One company makes one solution, another company makes another solution, and the market forces decide which solution will be most popular. A de facto standard evolves when one solution goes out of the market and everybody adopts the other solution. So far, so good. But the "market" for x86 instructions differs from other technical markets by the fact that all inventions are irreversible. We have seen that the microprocessor vendors keep supporting even the oldest obsolete or undocumented instructions for marketing reasons, even when the technical advantage of backwards compatibility is negligible compared to the costs. Intel keeps supporting the old undocumented instructions of the original 8086 processor, and AMD keeps supporting the 3DNow instructions that hardly any programmer uses because the market forces have replaced them with the better SSE instructions.

The costs of supporting obsolete instructions are not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution. The total number of x86 instructions is well above one thousand. One may ask whether there is a technical need for such a large number of instructions, or if some instructions have been added more for marketing reasons than for technical utility.

We need an open standardization process

The free competition on the microprocessor market has certainly been good for the price and performance of CPUs, but it has not been good for compatibility. We are in a situation where different companies are competing to invent new instructions and keeping their ideas secret from each other and from their customers as long as possible. It is clear that the problems discussed above cannot be solved optimally without some kind of regulation and coordination.
We need an open standardization committee or at least some form of public deliberation to define new instructions and decide how they are coded. The current situation with unregulated competition and secret development fails to address the following issues:
My conclusion is that we need an open standardization committee or a public forum to discuss proposed additions and changes to the x86 instruction set and define an open standard. This committee or forum should of course involve representatives from the hardware vendors as well as the software industry, engineering organizations, standardization organizations, university scientists and consumer organizations.

I think it is unlikely that Intel will voluntarily submit to such a standardization initiative because they have a competitive advantage in the current situation. Considerable pressure from outside is needed. This pressure could come from the software industry, from governments, political organizations, legal rulings, academic organizations, or from debates in public media. As a beginning, I hereby invite all interested persons to discuss these issues in various media and public forums.
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2009-12-06 04:28 |
Thank you to Yuhong Bao and others for sending me information about more conflicts over the instruction code map.
The company Cyrix has used many codes on the
Vacant codes are also needed by software producers for virtual instructions that can be emulated. Microsoft is using the code

The instructions POPCNT and RDTSCP were implemented first by AMD and later copied by Intel.

[Corrections made 2009-12-07 and later thanks to Yuhong Bao and others] |
The instruction set war's effect on virtualization |
---|
Author: Yuhong Bao | Date: 2009-12-28 03:34 |
BTW, AnandTech mention live VM migration as another example of where x86 CPU extensions can cause a lot of hassle. You often have to fiddle with CPU masks to migrate across CPU generations, and even then it isn't always possible to mask all features. But even more important is the effect on cross-vendor live VM migrations. In fact, Red Hat and AMD demoed cross-vendor live VM migration back in 2008: linux.slashdot.org/article.pl?sid=08/11/07/1535235

It isn't mentioned very often in the discussions, but it is important. You see, back before AMD adopted AVX, AMD was going with SSE5 (in fact SSE4a is already available on today's Family 10h AMD processors) and Intel was going with AVX. If cross-vendor live VM migration was to work properly, the VM would have to be crippled all the way back down to SSE3. Even now, the FMA4 vs FMA3 war means that VMs that have to migrate between Intel Ivy Bridge processors and AMD Bulldozer processors would have no access to FMA at all. |
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2009-12-15 05:53 |
My blog post has caused a lot of discussion on the following messageboards:

Thanks to everybody who has contributed. |
Stop the instruction set war |
---|
Author: Norman Yarvin | Date: 2010-01-09 13:01 |
The x87 instruction set won't truly be obsolete until SSEx has support for floating-point formats of >64 bits. As it is, using those old instructions is a good way to get some extra precision (which can be quite valuable: rather than having to analyze a program in detail to see whether it is numerically stable, one can just re-run it with higher precision and see if the results change much.) |
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2010-01-10 01:41 |
Norman Yarvin wrote:
The x87 instruction set won't truly be obsolete until SSEx has support for floating-point formats of >64 bits.

I agree. We need XMM instructions with 80-bit, or better 128-bit, extended floating point precision before we can eliminate x87 completely. This feature should be optional because it is expensive to implement and few users would need it. There are also a few MMX conversion instructions that we need to implement as XMM instructions before we can eliminate the MMX registers.

But Microsoft has never supported the 80-bit (long double) precision in their compiler. And the first preliminary specification for x64 Windows banned x87 and MMX. For some reason, they changed their mind and allowed x87/MMX (see my manual on calling conventions). All this just shows that we need coordination and planning rather than each company making its own decisions. |
Stop the instruction set war |
---|
Author: bitRAKE | Date: 2010-01-12 11:49 |
Couldn't the instruction cache store an efficient post-decode encoding for instructions? IIRC, Intel already has a patent for doing this. Another possibility would be to completely remap the instructions to favor parallel decoding. This would support backward compatibility through a processor-external translator. Getting the cooperation (from both producers and consumers) might require a demonstration of what can be gained.

Why couldn't Transmeta succeed? Does hyper-threading make the decoder a greater target? Has Intel used instruction set changes to negatively impact competitors (i.e. publishing the second best while secretly working on the target design)? Can a fair market exist with Intel's obvious advantage (both capital and market share)? Should something so one-sided be called a war? |
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2010-01-13 01:37 |
bitRAKE wrote:
Couldn't the instruction cache store an efficient post-decode encoding for instructions?

AMD stores instruction boundaries in the code cache in order to make decoding easier. Intel did the same in the old Pentium MMX, IIRC. I don't know why they are not doing this any more.

Another possibility would be to completely remap the instructions to favor parallel decoding.

They did that in Itanium. But emulation of x86 is too slow. The CISC instruction set, while difficult to decode, has the advantage that it takes less space in the code cache.

Has Intel used instruction set changes to negatively impact competitors (i.e. publishing the second best while secretly working on the target design)?

I don't think they have ever deliberately published suboptimal instructions. They have failed to support AMD instructions, and they have changed from FMA4 to FMA3 for unknown reasons. FMA4 is obviously better than FMA3 from the programmer's point of view. There may be technical limitations that made them change to FMA3. |
Pentium Appendix H |
---|
Author: Yuhong Bao | Date: 2010-02-10 11:46 |
Intel once tried to hide some of the new features of the Pentium from x86 competitors by requiring an NDA to be signed in order for the info to be disclosed. It was nicknamed Appendix H because it was mentioned in Appendix H of the Pentium processor family developer's manuals. AMD was able to reverse-engineer the Pentium and offer the K5 with all of the features except APIC, but Cyrix cheated and only implemented the 486 instruction set in its 6x86, and disabled the CPUID instruction by default. In the 6x86L, DE and CX8 were implemented, and in the 6x86MX, they implemented the features TSC and MSR from the Pentium and CMOV and PGE from the P6, but no PSE or VME.

Centaur, when it released the WinChip, decided to again not implement PSE or VME. They also did not implement CMOV or PGE, unlike the Cyrix 6x86MX. They did implement MCE, unlike Cyrix. The WinChip 2 added 3DNow!. Eventually Centaur was sold to VIA Technologies, which retargeted the core to Socket 370 and the P6 bus and marketed it as the VIA C3, but the core was still virtually the same as before in features, the only differences being that Intel's MTRRs replaced Centaur's MCRs and that PGE was added. Even worse, by then Windows 2000 had been released, in which the NTVDM crashed without VME, forcing VIA to provide a patch to NTVDM. It was only with Nehemiah that VIA finally began to really improve the core, with SSE replacing 3DNow!, and PSE and CMOV being implemented. With stepping 8 of Nehemiah, VIA finally added VME, SEP, and PAT, catching up with the Pentium III.

The Rise mP6 was even worse, implementing only TSC, CX8, and MMX. The Cyrix MediaGX implemented only 486-level features, like the 5x86 and 6x86, and the MediaGXm implemented CX8, TSC, MSR, CMOV, and MMX. Later processors in that series of course added more features. Transmeta was better, with the Crusoe implementing the Pentium MMX features (I think) plus CMOV and later SEP.
You can see here also that the 586/686 distinction can be quite blurry, with lots of processors implementing only some 686 features. Even Intel's own Pentium M did not support PAE at all in the original version (luckily the option of using PAE is separate from the option of using i686 instructions in most OSes). The long NOPs that were introduced with the P6 were troublesome too, with even the VIA Nehemiah not implementing them.

By now, it should be clear that Appendix H did a lot more harm than good, and it is only because of the CPU feature bits that were invented with the CPUID instruction that software can wade through the mess. Before then, software just tested for the CPU generation; for example, the 386 and 486 were distinguished by the test of EFLAGS.AC. (Unfortunately, I read that the IBM 386SLC CPU was really a relabeled 486 with all 486 instructions, but modified so that this test detects a 386, for reasons relating to Intel licensing. And the NexGen Nx586 originally implemented only 386 features; later, a hypercode update allowed user-mode 486 instructions to be supported if an option was enabled, but not the kernel-mode instructions, which were used by NT 4.0 and later, preventing it from running.) This kind of generation testing has been considered dead since the introduction of CPUID.

In fact, Intel did not bother creating a feature bit for the long NOPs, which means that they have to be manually tested via software using the illegal opcode exception. This is even harder in kernel mode, because Connectix/Microsoft Virtual PC, when encountering them in kernel-mode code, pops up a fatal error that forces a reset of the virtual machine!

I wrote this from the research I did, and I got most of the CPU features mentioned above from datasheets at datasheets.chipdb.org. If there are any errors please correct! |
Stop the instruction set war |
---|
Author: Agner Fog | Date: 2010-09-25 10:47 |
Back in December 2009 I wrote
If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's.

Now they are doing exactly this. When AMD announced their planned XOP instruction set, they also announced the "CVT16" instructions for supporting floating point numbers with half precision, using their XOP code prefix. The names of these instructions were VCVTPH2PS and VCVTPS2PH. Now Intel have announced two almost identical instructions with the same names, but using their own VEX code prefix. Furthermore, AMD have postponed the implementation of these instructions. Whether they have done so for the sake of compatibility with Intel's instructions, we don't know. If Intel had allowed AMD to use part of the huge VEX opcode space then this would not have happened. We can only speculate what is going on behind the scenes...

Link: Intel Advanced Vector Extensions Programming Reference, Aug 2010. |
Stop the instruction set war |
---|
Author: Agner | Date: 2011-08-28 08:30 |
Here is an update of instructions that were first announced by AMD and later copied by Intel:
While AMD keeps copying almost all Intel instructions (except virtualization instructions) for the sake of compatibility, only a few of AMD's instructions are copied by Intel. And in those cases where Intel have copied an AMD instruction that AMD encoded with the XOP coding scheme, they have given it an incompatible encoding using the VEX coding scheme. |
Stop the instruction set war |
---|
Author: Ruslan | Date: 2016-04-17 03:31 |
Why do different vendors have to maintain nonoverlapping encodings of instructions? Shouldn't support for a particular instruction set be queried via CPUID before the code using it is executed? It seems it'd be quite OK to have a particular opcode mean one thing on a CPU supporting feature X and not Y, and another thing on a CPU supporting feature Y and not X, if these features are from different vendors. Why the battle for opcode space then? |
Stop the instruction set war |
---|
Author: Agner | Date: 2016-04-17 07:21 |
Ruslan wrote:
It seems it'd be quite OK to have a particular opcode mean one thing on a CPU supporting feature X and not Y, and another thing on a CPU supporting feature Y and not X if these features are from different vendors. Why the battle for opcode space then?

If features X and Y are both useful then somebody might want to support both on a later design. It is too expensive for the software industry to support different mutually incompatible microprocessors. |
Stop the instruction set war |
---|
Author: Agner | Date: 2020-11-01 05:47 |
The latest updates to the situation:
Due to the problems discussed in this thread, I have taken the initiative to develop a completely new instruction set that is open and license-free. This new instruction set is called Forwardcom. It has variable-length vector registers, and it is forward compatible in the sense that software written for one implementation of a Forwardcom processor will run optimally, without recompilation, on later versions with longer vector registers. Forwardcom is neither RISC nor CISC but combines the best of both technologies. It has relatively few instructions but many variants of each instruction. See www.forwardcom.info for details. |