Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS). As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice. The Knights line of chips already open the book with "we assume you have 70+ threads of stuff to do", and while getting from 1 thread to 4 is agony, and 4 threads to 16 is hard, getting from 70 threads to 280 is actually pretty simple. |