My manuals have now finally been updated with a test of Intel's Silvermont (Bay Trail) processor. Intel's old low-power processor named Atom has now finally got a major update after several years in service. The new design called Silvermont is a small low-power processor intended mainly for mobile devices and as a competitor for ARM machines. The Silvermont still contains traces of the old Atom design, but almost everything has been improved or redesigned. The chip has one or more units with two cores each. The two cores in a unit share the same level-2 cache but they have separate execution resources. Thus, there is no competition for execution resources between threads. The Silvermont supports the SSE4.2 instruction set, but not AVX and AVX2. It has a throughput of two instructions per clock cycle. There are two execution pipes for integer instructions, two for floating point and vector instructions, and one for memory read and write. Internal buses and execution units are 128 bits wide. Most execution units are pipelined, but some operations are staying in the same pipeline stage for two (rarely four) clock cycles in the cases of large data sizes or high precision. The high end processors from Intel and AMD have powerful capabilities for out-of-order execution, while the old Atom executes all instructions in program order. The Silvermont is a compromise between these two. It has some out-of-order execution, but not much. Integer instructions in general purpose registers can execute out of order with a depth of at most eight instructions. Floating point and vector instructions cannot execute out of order with other instructions in the same of the two floating point pipelines. There is full register renaming. The cache size is reasonable: 32kB level-1 code, 24kB level-1 data, 1MB level-2. Cache latencies were 3 and 19 clock cycles in my measurements, and the cache performance is generally good. The whole design seems well proportioned with reasonable capacities for a low-power chip in all stages of the pipeline - except for one very big bottleneck: the decoders. Simple instructions can decode at a rate of two instructions per clock cycle, but there are quite a lot of instructions that the decoders cannot handle so smoothly. Instructions that generate more than one micro-operation, as well as instructions with certain combinations of prefixes and escape codes, take four, six or even more clock cycles to decode. In many of my test cases I was unable to determine the latency and throughput of the execution units for certain instructions because the decoders were far behind the execution units. The designers have already removed the common bottleneck of instruction-length decoding by marking instruction boundaries in the code cache (a technique that Intel haven't used since the Pentium MMX seventeen years earlier). It should be possible to remove the unfortunate bottleneck in the decoders without sacrificing too much power consumption. Let's hope that Intel will have solved this problem in the next version of the Silvermont, as well as in the forthcoming Knights Landing coprocessor, which is rumored to be based on the Silvermont architecture. Other news in my manuals include calling conventions for the forthcoming AVX512 instruction set, and an update on how to circumvent Intel's CPU dispatcher for Intel compiler version 14. |