Agner's CPU blog

Multithreads load-store throughput for bulldozer
Author: A-11  Date: 2014-07-04 09:13
In the same chapter, Agner measured the L2 write throughput as one line (64 bytes) per 12 clocks. But as we saw, my program wrote almost 64 bytes per 8 clocks (8 bytes per clock), which exceeds Agner's measurement. So the buffer must do more than coalescing, for example eliminating duplicate writes to the same address.
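To put the two figures in the same unit, here is a small sketch that only redoes the arithmetic on the numbers quoted above. The function names bytes_per_clock and print_l2_write_comparison are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: average throughput in bytes per clock. */
double bytes_per_clock(double bytes, double clocks)
{
    assert(clocks > 0.0);
    return bytes / clocks;
}

/* Prints both rates from the post side by side. */
void print_l2_write_comparison(void)
{
    double agner = bytes_per_clock(64.0, 12.0); /* 1 line per 12 clocks */
    double mine  = bytes_per_clock(64.0, 8.0);  /* almost 1 line per 8 clocks */

    printf("Agner: %.2f B/clk, this test: %.2f B/clk, ratio: %.2f\n",
           agner, mine, mine / agner);
}
```

One line per 12 clocks is a little over 5 bytes per clock, so the observed 8 bytes per clock is about one and a half times the measured L2 write rate.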

By the way, I also wrote another benchmark that has no memory writes in its main loop, so it should avoid the L2 write-through accesses entirely.


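#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

/* volatile forces a real load of a and b on every iteration */
volatile size_t a, b;
/* gcc extension: keep the two sums in fixed registers, not in memory */
register size_t sum0 asm("%r12"), sum1 asm("%r13");

/* SIGALRM handler: print the sums after one second and quit */
static void ringonger(int sig)
{
    (void)sig;
    printf("%zu+%zu=%zu\n", sum0, sum1, sum1 + sum0);
    exit(0);
}

int main(void)
{
    a = 1;
    b = 1;
    sum0 = 0;
    sum1 = 0;
    if (SIG_ERR == signal(SIGALRM, ringonger))
        perror("set signal");
    alarm(1);
    while (1) {
        sum0 = sum0 + a;
        sum1 = sum1 + b;
    }
    return 0;
}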

This needs the gcc extension for global register variables, but the gcc-4.7.3 I used generated an invalid loop that drops the additions:

.L5:
movq a(%rip), %rax
movq b(%rip), %rax
jmp .L5

So I fixed the loop by hand and unrolled it:

.L5:
addq a(%rip), %r12
addq b(%rip), %r13
addq a(%rip), %r12
addq b(%rip), %r13
addq a(%rip), %r12
addq b(%rip), %r13
addq a(%rip), %r12
addq b(%rip), %r13
addq a(%rip), %r12
addq b(%rip), %r13
jmp .L5
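A possible alternative to editing the compiler output by hand is to pin the additions down with inline assembly. The following is a minimal sketch, not the code used for the measurements in this post. It assumes gcc or clang on x86-64 (with a plain C fallback elsewhere). The names run_load_loop, on_alarm and stop are made up for illustration, and the inner loop of 1000 iterations, which keeps the polling of the stop flag rare, is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

volatile size_t a = 1, b = 1;        /* loaded on every iteration */
static volatile sig_atomic_t stop;   /* set by the SIGALRM handler */

static void on_alarm(int sig)
{
    (void)sig;
    stop = 1;
}

/* Runs the two-load loop for `seconds` and returns sum0 + sum1,
 * or 0 if the signal handler could not be installed. */
size_t run_load_loop(unsigned seconds)
{
    size_t sum0 = 0, sum1 = 0;

    stop = 0;
    if (signal(SIGALRM, on_alarm) == SIG_ERR)
        return 0;
    alarm(seconds);

    while (!stop) {
        for (int i = 0; i < 1000; i++) {
#if defined(__x86_64__) && defined(__GNUC__)
            /* Emit "add mem, reg" directly so the sums cannot be dropped. */
            __asm__ volatile("addq %1, %0" : "+r"(sum0) : "m"(a));
            __asm__ volatile("addq %1, %0" : "+r"(sum1) : "m"(b));
#else
            sum0 += a;   /* fallback: volatile still forces the loads */
            sum1 += b;
#endif
        }
    }
    return sum0 + sum1;
}
```

Calling run_load_loop(1) corresponds to the one-second run. Compile with optimization (for example -O2) so that the sums stay in registers. Unlike the hand-edited loop, this version still carries the loop counter and the occasional read of the stop flag, so its counts are not directly comparable.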

As each Bulldozer core has 2 read ports, I expected each core to execute one "add mem, reg ; add mem, reg" pair per clock. The results of my measurements are below.
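Since a and b are both 1, the printed sum equals the number of additions executed, so the expectation can be written as a count. In the sketch below the 3.4 GHz clock is an assumption: the post does not state the frequency, and this value is chosen only because it reproduces the ideal value of 6800000000 given below. The function name ideal_adds is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal number of "add mem, reg" instructions completed in the run,
 * assuming the core sustains loads_per_clock loads every cycle. */
uint64_t ideal_adds(uint64_t clock_hz, unsigned loads_per_clock,
                    unsigned seconds)
{
    assert(clock_hz > 0 && loads_per_clock > 0);
    return clock_hz * loads_per_clock * seconds;
}
```

For example, ideal_adds(3400000000ULL, 2, 1) gives the ideal count for a one-second run with two loads per clock at the assumed frequency.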

ideal value
6800000000

1 thread
6375394786 (93.8% of ideal)

4 threads (1 thread per module)
6388711814.75 (94.0% of ideal)

8 threads (1 thread per core, or, 2 threads per module)
3378318936.375 (49.7% of ideal)
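The percentages can be rechecked from the raw counts with a few lines of C. This is only a sketch over the figures listed above; percent_of_ideal and print_results are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Measured count as a percentage of the ideal count. */
double percent_of_ideal(double measured, double ideal)
{
    assert(ideal > 0.0);
    return 100.0 * measured / ideal;
}

/* Prints each configuration with its count and percentage of ideal. */
void print_results(void)
{
    static const struct { const char *label; double count; } rows[] = {
        { "1 thread",  6375394786.0   },
        { "4 threads", 6388711814.75  },
        { "8 threads", 3378318936.375 },
    };
    const double ideal = 6800000000.0;

    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-10s %16.3f  %5.1f%%\n", rows[i].label, rows[i].count,
               percent_of_ideal(rows[i].count, ideal));
}
```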

Again I saw only about half of the ideal value, even though the main loop of the benchmark contains no memory (L2) writes. This result strengthens my suspicion that the load units are shared between the two cores of a module.
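One way to read these numbers is to compare the combined count of the two threads in a module with what a single thread achieved when it had the module to itself. The sketch below assumes that the multi-thread figures are per-thread averages (the fractional values suggest this, but the post does not say so). The function name module_total_vs_single is hypothetical.

```c
#include <assert.h>

/* Combined count of all threads sharing one module, relative to the
 * count of a single thread running alone on a module. */
double module_total_vs_single(double per_thread_shared,
                              unsigned threads_per_module,
                              double single_thread)
{
    assert(single_thread > 0.0);
    return per_thread_shared * threads_per_module / single_thread;
}
```

With the figures above, module_total_vs_single(3378318936.375, 2, 6388711814.75) comes out only slightly above 1. A ratio near 2 would indicate independent load bandwidth per core, while a ratio near 1 means the second thread adds almost nothing, which is what a shared resource would look like.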
 