The L1D cache in BD is probably also underperforming on its own merits compared to the K10 implementation. True, its associativity is doubled, but its size is 1/4 that of the previous architecture (16 KB 4-way), which yields a lower overall hit rate than the old 64 KB 2-way solution. This, combined with the write-through (WT) policy, which relies too heavily on the anemic, high-latency L2 cache, makes the whole memory pipeline quite inefficient and chokes the data flow in many corner cases. The sheer size of the caches in BD is simply not enough to compensate for the poor overall design. I think the L2 caches are the main stumbling block of the BD architecture; they are additionally burdened with handling all the snoop traffic, since the L2-to-L3 relation is [mostly] exclusive. The good thing is that HW prefetching in BD is more flexible now and can fetch data directly into the L2 (probably one of the reasons AMD made them so large). Still, all this is a far cry from what Intel has achieved over the years in both efficiency and scaling across the product range. Bulldozer is simply a chaotic patchwork of counterintuitive ideas with no clear prospects.
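To put some rough numbers behind the size-versus-associativity point about the L1D, here is a minimal sketch of a set-associative cache model. It is a toy, not a model of either real core: it assumes 64-byte lines, strict LRU replacement and purely synthetic address traces, and the names hit_rate and looping_trace are made up for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, size_bytes, ways, line_bytes=64):
    """Return the hit fraction of a set-associative LRU cache on an address trace.

    Assumptions (illustrative only): 64-byte lines, strict LRU, no prefetching.
    """
    num_sets = size_bytes // (line_bytes * ways)
    # One OrderedDict per set; insertion order doubles as LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)  # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
        total += 1
    return hits / total if total else 0.0


def looping_trace(working_set_bytes, passes, stride=64):
    """Hypothetical workload: walk a buffer linearly, several times over."""
    for _ in range(passes):
        for addr in range(0, working_set_bytes, stride):
            yield addr


if __name__ == "__main__":
    # Capacity-bound case: a 32 KB buffer walked 10 times.
    for name, size, ways in (("16 KB 4-way", 16 * 1024, 4),
                             ("64 KB 2-way", 64 * 1024, 2)):
        rate = hit_rate(looping_trace(32 * 1024, 10), size, ways)
        print(f"{name}: {rate:.2f}")

    # Conflict-bound case: three lines spaced 32 KB apart, touched 10 times.
    conflict = [0, 32 * 1024, 64 * 1024] * 10
    for name, size, ways in (("16 KB 4-way", 16 * 1024, 4),
                             ("64 KB 2-way", 64 * 1024, 2)):
        print(f"{name} (conflict): {hit_rate(conflict, size, ways):.2f}")
```

In this toy model the 32 KB loop thrashes the 16 KB 4-way cache completely (hit rate 0.00) while the 64 KB 2-way cache hits on every pass after the first (0.90). The conflict trace shows the opposite: the extra ways rescue the small cache (0.90) and the 2-way cache misses every time (0.00). Real workloads sit somewhere in between, so treat this only as an illustration of why doubling the associativity does not automatically make up for a 4x smaller cache.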