Agner's CPU blog

SB's L1D banks
Author: John D. McCalpin Date: 2013-11-07 16:40
Thanks to Tacit Murky for the comments. I like the reordering trick, but it only works if there are accesses to different banks that can be reordered. In my original analysis I did not make this assumption. Consider, for example, performing a dot product on two vectors of 32-bit values, each accessed with a stride of 64 Bytes, and with a modulo-64 offset of 4 Bytes between the two vectors. Every load will access the same bank, so I think this case will have lots of conflicts. However, the two loads in every pair differ in bit 2, so the pairs do not match in bits 5:2 and therefore (according to the wording of the optimization reference manual, section 3.6.1.3, page 3-43) should *not* experience bank conflicts.
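To make the address arithmetic in this example concrete, here is a minimal sketch. The base addresses and helper names are hypothetical, chosen only so that the two vectors sit 4 bytes apart modulo 64; the code simply extracts bits 5:2 of each load address, which is the field the manual's wording refers to. It says nothing about what the hardware actually does with those bits.

```python
def bits_5_2(addr: int) -> int:
    """Return address bits 5:2 (the 4-byte slot within a 64-byte line)."""
    return (addr >> 2) & 0xF


def dot_product_pairs(base_a: int, base_b: int, stride: int, n: int):
    """Yield the (a, b) load addresses of each dot-product iteration."""
    for i in range(n):
        yield base_a + i * stride, base_b + i * stride


# Hypothetical base addresses (assumption for illustration only):
BASE_A = 0x10000  # 64-byte aligned
BASE_B = 0x20004  # 4 bytes past a 64-byte boundary

pairs = list(dot_product_pairs(BASE_A, BASE_B, stride=64, n=8))
fields = [(bits_5_2(a), bits_5_2(b)) for a, b in pairs]

for (a, b), (fa, fb) in zip(pairs, fields):
    print(f"a={a:#x} bits5:2={fa}   b={b:#x} bits5:2={fb}")
```

With these assumed bases, every pair comes out as (0, 1): the two loads never match in bits 5:2, even though all of them fall in the first 8 bytes of their cache lines.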

I had intended to test this particular case, but now that I look at my code, I see that the version with offsets does roll over all of the banks (using a stride of 68 Bytes), so the reordering trick is sufficient to explain the observed drop in bank conflicts.
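The difference between the two strides can be sketched as follows. This is only an illustration of the address arithmetic, assuming that bits 5:2 are the relevant field (as in the manual's wording) and using a hypothetical base address of 0; the function names are mine, not taken from the test code.

```python
def bits_5_2(addr: int) -> int:
    """Return address bits 5:2 (the 4-byte slot within a 64-byte line)."""
    return (addr >> 2) & 0xF


def slot_sequence(base: int, stride: int, n: int) -> list:
    """Bits 5:2 of the first n addresses of a strided access stream."""
    return [bits_5_2(base + i * stride) for i in range(n)]


# Stride of 64 bytes: every access lands on the same value of bits 5:2.
seq64 = slot_sequence(base=0, stride=64, n=16)

# Stride of 68 bytes (64 + 4): bits 5:2 advance by one on each access,
# so 16 consecutive accesses cover all 16 possible values.
seq68 = slot_sequence(base=0, stride=68, n=16)

print("stride 64:", seq64)
print("stride 68:", seq68)
```

A stream that walks through all values like this always has nearby accesses with differing bits 5:2, which is the situation the reordering trick needs.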

Concerning the write bandwidth on the AMD processors: Recent Intel processors (Nehalem and newer) have 10 "Line Fill Buffers" per core and use these for streaming stores. In contrast, the AMD Family 10h processors have 8 "Miss Address Buffers" that are used for cacheable L1 misses (load or store) and 4 separate "Write Combining Buffers" that are used for streaming stores. This gives the AMD Family 10h processor significantly less potential concurrency for stores.

Unfortunately, it is quite difficult to estimate how long a buffer needs to be occupied for a streaming store operation, so it is not obvious how to determine whether the streaming store performance is concurrency-limited. In both AMD and Intel systems, the buffers used by the cores to handle streaming stores will hand off the data to the memory controller at some point, so they will probably have shorter occupancy than what is required for reads (since the buffers have to track reads for the full round trip), but the specifics of the hand-off are going to be implementation-dependent, and I don't see any obvious methodology for estimating occupancy. Once the streaming stores have been handed off to the memory controller queues, things are even less clear, since the number of buffers in the memory controller does not appear to be documented, and the occupancy in those buffers will depend on details of the coherence protocol that are unlikely to be discussed in public.
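For readers who want to see what "concurrency-limited" means numerically, here is a rough sketch based on Little's Law (sustained bandwidth cannot exceed bytes in flight divided by the time each buffer stays occupied). The buffer counts come from the text above and 64-byte buffers are assumed; the occupancy values are purely hypothetical placeholders, since the real occupancy is exactly the quantity that is hard to estimate. Using the same occupancy for both vendors is also just an assumption for illustration.

```python
LINE_BYTES = 64  # assumed size of one streaming-store buffer


def concurrency_limited_bandwidth(buffers: int, occupancy_ns: float,
                                  line_bytes: int = LINE_BYTES) -> float:
    """Little's Law upper bound in GB/s (bytes per ns equals GB/s)."""
    return buffers * line_bytes / occupancy_ns


INTEL_LFB = 10   # Line Fill Buffers per core (from the text)
AMD_10H_WCB = 4  # Write Combining Buffers (from the text)

# Hypothetical occupancy values in nanoseconds, not measurements.
for occupancy_ns in (20.0, 40.0, 80.0):
    intel = concurrency_limited_bandwidth(INTEL_LFB, occupancy_ns)
    amd = concurrency_limited_bandwidth(AMD_10H_WCB, occupancy_ns)
    print(f"occupancy {occupancy_ns:5.1f} ns: "
          f"10 buffers <= {intel:5.1f} GB/s, 4 buffers <= {amd:5.1f} GB/s")
```

Whatever the true occupancy is, the bound scales linearly with the number of buffers, so at equal occupancy 4 buffers allow only 40% of the per-core store concurrency of 10 buffers.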

A brief look at the BIOS and Kernel Developer's Guide for the AMD Family 15h processors suggests that the cache miss buffer architecture has been changed significantly, but I have not worked through the details. I did find a note in AMD's Software Optimization Guide for Family 15h Processors (publication 47414, revision 3.06, January 2012) that says that Family 15h processors have about the same speed as Family 10h processors when writing a single write-combining stream, but may be slower when writing more than one write-combining stream. I have a few Family 15h boxes in my benchmarking cluster, but since our production systems are all currently Intel-based, I have not had much motivation to research the confusing bandwidth numbers that I obtained in my initial testing.

last reply SB's L1D banks - John D. McCalpin - 2013-11-07