In the same chapter, Agner measured L2 write throughput at 1 line (64 bytes) per 12 clocks, or about 5.3 bytes per clock. But as we saw, my program wrote almost 64 bytes per 8 clocks (8 bytes per clock), which exceeds Agner's measurement. So the buffer must be doing more than coalescing, for example eliminating duplicate writes to the same address. I also wrote another benchmark whose main loop has no memory writes, so it should avoid L2 write-through accesses entirely.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
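/* volatile forces a real load of a and b on every iteration */
volatile size_t a, b;
/* gcc extension: keep both accumulators in global registers */
register size_t sum0 asm("%r12"), sum1 asm("%r13");

/* SIGALRM handler: print the counts after one second and exit */
static void ringonger(int _)
{
    printf("%zu+%zu=%zu\n", sum0, sum1, sum1 + sum0);
    exit(0);
}

int main()
{
    a = 1;
    b = 1;
    sum0 = 0;
    sum1 = 0;
    if (SIG_ERR == signal(SIGALRM, ringonger))
        perror("set signal");
    alarm(1);
    /* main loop: two loads and two adds per iteration, no stores */
    while (1) {
        sum0 = sum0 + a;
        sum1 = sum1 + b;
    }
    return 0;
}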
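It relies on the gcc extension for global register variables, but the gcc-4.7.3 I used emits an invalid loop that loads a and b and drops the additions:
.L5:
    movq a(%rip), %rax
    movq b(%rip), %rax
    jmp .L5
So I fixed the loop by hand, unrolling it five times: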
.L5:
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    jmp .L5
As each Bulldozer core has 2 read ports, I expected each "add mem, reg ; add mem, reg" pair to take 1 clock per core. The results of my measurements are below.
ideal value: 6800000000
1 thread: 6375394786 (93.9% of ideal)
4 threads (1 thread per module): 6388711814.75 (94.0% of ideal)
8 threads (1 thread per core, or 2 threads per module): 3378318936.375 (49.7% of ideal)
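As a rough check of where the ideal figure comes from, here is a minimal sketch, not part of the original benchmark. The helper names `ideal_adds` and `percent_of_ideal` are hypothetical, and the 3.4 GHz clock is an assumption inferred from the ideal value of 6800000000 (2 loads per clock for 1 second).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of additions expected in `seconds` if every
 * clock completes `loads_per_clock` "add mem, reg" instructions. */
double ideal_adds(double clock_hz, double loads_per_clock, double seconds)
{
    assert(clock_hz > 0.0 && loads_per_clock > 0.0 && seconds > 0.0);
    return clock_hz * loads_per_clock * seconds;
}

/* Hypothetical helper: measured count as a percentage of the ideal count. */
double percent_of_ideal(double measured, double ideal)
{
    assert(ideal > 0.0);
    return 100.0 * measured / ideal;
}

/* Print the 8-thread row, using the count reported in the results. */
void report_eight_threads(void)
{
    /* ASSUMPTION: 3.4 GHz clock, inferred from the ideal value 6800000000. */
    double ideal = ideal_adds(3.4e9, 2.0, 1.0);
    double measured = 3378318936.375;
    printf("ideal=%.0f measured=%.3f (%.1f%% of ideal)\n",
           ideal, measured, percent_of_ideal(measured, ideal));
}
```

Calling `report_eight_threads()` from a `main` prints the 8-thread count next to the ideal count, and the same two helpers can be applied to the other rows.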
With 8 threads (2 per module), I again saw only about half of the ideal value, even though the main loop of the benchmark contains no memory (L2) writes. This result strengthens the suspicion that the load units are shared.
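The difference between "1 thread per module" and "1 thread per core" depends on where each thread runs. The following is only a sketch of one way to pin threads on Linux, not necessarily how the measurements above were taken. The helper names `cpu_for_thread` and `pin_to_cpu` are hypothetical, and the mapping assumes that logical CPUs 2m and 2m+1 are the two cores of module m, which should be checked on the actual machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* ASSUMPTION: logical CPUs 2m and 2m+1 are the two cores of module m.
 * one_per_module != 0: thread i gets the first core of module i.
 * one_per_module == 0: thread i gets core i, so modules are shared. */
int cpu_for_thread(int thread_index, int one_per_module)
{
    assert(thread_index >= 0);
    return one_per_module ? 2 * thread_index : thread_index;
}

/* Pin the calling thread to one logical CPU; returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" for sched_setaffinity */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```

Each benchmark process would call `pin_to_cpu(cpu_for_thread(i, 1))` for the 4-thread case and `pin_to_cpu(cpu_for_thread(i, 0))` for the 8-thread case before entering its main loop.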