I found a blog post that says store forwarding on Piledriver is improved over Bulldozer.
blog.stuffedcow.net/2014/01/x86-memory-disambiguation/
Unlike the example on page 186 of "microarchitecture.pdf", which stores 32 bits and then loads the upper 16 bits of that value, the author of this blog stores 64 bits and then loads the upper 32 bits.
He reports that the load of the upper 32 bits does not stall.
I also reproduced it on an FX-8350 with the code below (GNU as).
.L5:
movq %rbx, b(%rip)
movl b+4(%rip), %eax
addq $1, a(%rip) #increase counter
jmp .L5
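
For anyone who wants to try the same access pattern from C, here is a rough sketch, not the exact harness I used. It assumes x86-64 with GCC or Clang inline asm (AT&T syntax, as in GNU as). The names store64_load_upper32 and run_loop are made up for illustration, and the bounded iteration count replaces the infinite loop above so that the function returns. Timing is not included; you would have to wrap run_loop with your own timer or cycle counter. On other targets it falls back to plain memcpy (little-endian assumed), where the compiler may not emit the same store/load pair.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t a; /* iteration counter, like `a` in the loop above */
static uint64_t b; /* 8-byte store target, like `b` in the loop above */

/* Store 64 bits to b, then immediately load the upper 32 bits from b+4. */
uint32_t store64_load_upper32(uint64_t value)
{
    uint32_t upper;
#if defined(__x86_64__) && defined(__LP64__) && defined(__GNUC__)
    __asm__ volatile(
        "movq %[val], (%[ptr])\n\t"   /* 64-bit store to b        */
        "movl 4(%[ptr]), %[out]"      /* 32-bit load from b + 4   */
        : [out] "=&r"(upper)
        : [val] "r"(value), [ptr] "r"(&b)
        : "memory");
#else
    /* Fallback: same bytes are read on a little-endian machine, but the
       compiler is free to choose different instructions here. */
    memcpy(&b, &value, sizeof b);
    memcpy(&upper, (const char *)&b + 4, sizeof upper);
#endif
    return upper;
}

/* Repeat the store/load pair `iterations` times and count the iterations. */
uint64_t run_loop(uint64_t value, uint64_t iterations)
{
    uint32_t last = (uint32_t)(value >> 32);
    a = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        last = store64_load_upper32(value);
        a += 1; /* increase counter */
    }
    /* Sanity check only: the load must see the upper half of the store. */
    assert(last == (uint32_t)(value >> 32));
    return a;
}
```

Comparing the time per iteration of run_loop against a variant that loads from b+0 instead of b+4 should show whether the offset load is forwarded or stalls on a given CPU.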