The Xeon Phi x200 (Knights Landing) has many modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

For operation in "Flat" mode (MCDRAM as memory, located in the upper 16 GiB of the physical address space), with the coherence agent mapping in "Quadrant" mode (addresses are hashed to coherence agents spread across the entire chip, but each cache line is assigned to an MCDRAM controller in the same "quadrant" as the CHA responsible for coherence), my preferred latency tester gives values of 154 ns +/- 1 ns (1 standard deviation) for MCDRAM. These values are averaged over many addresses, with the variation mostly from core to core (with a few ns of random variability). My latency tester uses permutations of even-numbered cache lines in address-range blocks of various sizes, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
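To make the "permutations of even-numbered cache lines" idea concrete, here is a minimal sketch of how such a pointer-chasing order could be built. It is not the actual tester: the function names, the 64-byte line size constant, and the use of Sattolo's algorithm (chosen here so that the permutation forms a single cycle visiting every selected line once) are all assumptions made for illustration. Python is used only to show the access order; a real measurement would run the dependent-load loop in compiled code, since interpreter overhead would swamp the memory latency.

```python
import random

CACHE_LINE_BYTES = 64  # assumed cache line size


def build_chase_ring(block_bytes, seed=0):
    """Return (lines, next_line) for one address-range block.

    lines     : the even-numbered cache line indices in the block
    next_line : dict mapping each line to the next line to load,
                arranged as one single cycle over all selected lines
    """
    n_lines = block_bytes // CACHE_LINE_BYTES
    lines = list(range(0, n_lines, 2))  # even-numbered cache lines only
    rng = random.Random(seed)

    # Sattolo's algorithm: a random permutation consisting of exactly
    # one cycle, so the chase cannot get stuck in a short loop.
    perm = list(range(len(lines)))
    for i in range(len(perm) - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        perm[i], perm[j] = perm[j], perm[i]

    next_line = {lines[i]: lines[perm[i]] for i in range(len(lines))}
    return lines, next_line


def chase(next_line, start, steps):
    """Follow the chain for `steps` dependent hops; return the lines visited."""
    visited = []
    line = start
    for _ in range(steps):
        line = next_line[line]  # each hop depends on the previous one
        visited.append(line)
    return visited


if __name__ == "__main__":
    lines, ring = build_chase_ring(block_bytes=4096)
    order = chase(ring, lines[0], len(lines))
    print("byte offsets visited:", [ln * CACHE_LINE_BYTES for ln in order])
```

Because only every other cache line is touched, the set of addresses sampled depends on the (undisclosed) address hash, which is why the resulting average need not weight all coherence agents equally.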
For the same system in "Flat" "All-to-All" mode (addresses are hashed to coherence agents spread across the entire chip, with no special correlation between the location of coherence agents and the MCDRAM controller owning an address), the corresponding value is 156 ns +/- 1 ns (1 standard deviation).

For the same system in "Flat" "Sub-NUMA Cluster 4" (SNC-4) mode, the corresponding values are 150.5 ns +/- 0.9 ns (1 standard deviation) for "local" accesses, and 156.8 ns +/- 3.1 ns for "remote" accesses.

Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles. (Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.) Run-to-run variability is typically small when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of "hops" required for coherence transactions in "Quadrant" and "SNC-4" modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation).

I will post more details to my blog as they become available...
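For readers who prefer to think in core cycles, the small sketch below converts the MCDRAM latencies quoted above and compares them against the Flat-Quadrant figure. The latencies are the ones given in this post; the conversion assumes the 1.4 GHz nominal frequency, which may not match the frequency the cores actually ran at during the measurements, and the helper names are invented for illustration.

```python
NOMINAL_GHZ = 1.4  # Xeon Phi 7250 nominal frequency (assumed for conversion)

# Mean MCDRAM latencies in ns, as quoted in the text above.
LATENCY_NS = {
    "Flat-Quadrant": 154.0,
    "Flat-All-to-All": 156.0,
    "Flat-SNC-4 (local)": 150.5,
    "Flat-SNC-4 (remote)": 156.8,
}


def ns_to_cycles(ns, ghz=NOMINAL_GHZ):
    """Convert nanoseconds to core cycles at the given clock frequency."""
    return ns * ghz


def percent_vs(baseline, table=LATENCY_NS):
    """Percent difference of each mode's latency relative to `baseline`."""
    base = table[baseline]
    return {mode: 100.0 * (ns - base) / base for mode, ns in table.items()}


if __name__ == "__main__":
    rel = percent_vs("Flat-Quadrant")
    for mode, ns in LATENCY_NS.items():
        print(f"{mode:22s} {ns:6.1f} ns  {ns_to_cycles(ns):6.1f} cycles  "
              f"{rel[mode]:+5.1f}%")
```

Every mode lands within roughly 2.5% of the Flat-Quadrant latency, which is the sense in which the latency differences are small compared with the bandwidth differences.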