Ok, I further simplified my example (it seemed that in the non-vectorized code the compiler was able to condense multiple identical lines within the loop). Now we have the following result, which looks as expected, i.e. the vectorclass code uses only a single packed "vaddpd" whereas the non-vectorized loop does three consecutive "vaddsd"s:

non-vectorized:
------------------------------------------
double pos[3], ray_dir[3];

for(t=t0;t<t1;t+=dt)
{
for(k=0;k<3;k++) pos[k] += dt*ray_dir[k];
}

.L2:
vaddsd %xmm4, %xmm0, %xmm0
vaddsd %xmm8, %xmm3, %xmm3
vaddsd %xmm7, %xmm2, %xmm2
vucomisd %xmm0, %xmm5
vaddsd %xmm6, %xmm1, %xmm1
ja .L2

result: pos=330.082,9961.07,817.343  ----> CPU = 686.4 ms

Vectorized:
-----------------------------------
Vec3d v_pos, v_ray_dir;

for(t=t0;t<t1;t+=dt)
{
v_pos += dt*v_ray_dir;
}
.L2:
vaddsd %xmm4, %xmm0, %xmm0
vaddpd %ymm2, %ymm1, %ymm1
vucomisd %xmm0, %xmm3
ja .L2

result: v_pos=330.082,9961.07,817.343  ----> CPU = 468.0 ms
Now this looks really good, as it gives a speedup of about a factor of 1.5, nicely corresponding to the vectorized loop containing only 4 instructions vs. the 6 instructions of the scalar variant.
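In case anyone wants to reproduce the comparison, below is a minimal, self-contained sketch of the two loops. Since the actual Vec3d class isn't shown here, a small hand-rolled AVX wrapper (Vec3dSketch) stands in for it, and the loop bounds t0/t1/dt plus the initial values are made up for illustration; with the real vectorclass-based Vec3d the two inner loops should look the same. Compile with AVX enabled, e.g. g++ -O3 -mavx.

-----------------------------------
#include <immintrin.h>   // AVX intrinsics
#include <cstdio>

// Hypothetical stand-in for the Vec3d used above: 3 doubles in one 256-bit
// register, 4th lane unused -- just enough to make the example compile on its own.
struct Vec3dSketch {
    __m256d v;
    Vec3dSketch(double x, double y, double z) : v(_mm256_set_pd(0.0, z, y, x)) {}
    explicit Vec3dSketch(__m256d w) : v(w) {}
    Vec3dSketch& operator+=(const Vec3dSketch& o) { v = _mm256_add_pd(v, o.v); return *this; }
};
inline Vec3dSketch operator*(double s, const Vec3dSketch& a) {
    return Vec3dSketch(_mm256_mul_pd(_mm256_set1_pd(s), a.v));
}

int main() {
    const double t0 = 0.0, t1 = 1.0, dt = 1e-7;               // assumed loop bounds
    double pos[3] = {0, 0, 0}, ray_dir[3] = {0.1, 3.0, 0.25};  // assumed start values
    Vec3dSketch v_pos(0, 0, 0), v_ray_dir(0.1, 3.0, 0.25);

    // scalar variant: three element-wise adds per step
    for (double t = t0; t < t1; t += dt)
        for (int k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];

    // vectorized variant: one packed add per step
    for (double t = t0; t < t1; t += dt)
        v_pos += dt * v_ray_dir;

    double out[4];
    _mm256_storeu_pd(out, v_pos.v);
    std::printf("pos   = %g, %g, %g\n", pos[0], pos[1], pos[2]);
    std::printf("v_pos = %g, %g, %g\n", out[0], out[1], out[2]);
    return 0;
}
-----------------------------------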
So with this actually proving the concept of a convenient 'zero-overhead' SIMD vector class, I am going to convert parts of the 'real world' code of the main ray-marcher application. This will be interesting because it is probably going to be a tight race between RAM bandwidth and compute bandwidth (SIMD/AVX), as the ray marcher has to plough through dozens of gigabytes of voxel data (tricubically interpolated, i.e. requiring 64 memory fetches per voxel ;-).

BTW: Applications like the dozens-of-GB ray marcher are the main reason why I prefer a many-core CPU + SIMD/AVX + 100+ GB of memory architecture over GPU-based solutions, which may be even faster compute-wise but still only provide 4 GB of RAM ...
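For what it's worth, here is a rough illustration of where the 64 fetches per sample come from: tricubic interpolation reads a 4x4x4 neighbourhood of voxels around each sample point. The names (volume, nx/ny/nz, cubicWeight), the z-major layout and the Catmull-Rom weights are made up for illustration and are not the actual ray-marcher code; it also assumes the sample point lies at least one voxel away from the volume border, since bounds handling is omitted.

-----------------------------------
#include <cmath>
#include <cstddef>

// One common cubic kernel (Catmull-Rom); 't' is the fractional position in [0,1),
// 'i' the support point in {-1, 0, 1, 2}.
static inline double cubicWeight(double t, int i) {
    double x = std::fabs(t - i);
    if (x < 1.0) return  1.5*x*x*x - 2.5*x*x + 1.0;
    if (x < 2.0) return -0.5*x*x*x + 2.5*x*x - 4.0*x + 2.0;
    return 0.0;
}

// 4*4*4 = 64 voxel reads per tricubically interpolated sample
double tricubicSample(const double* volume, int nx, int ny, int nz,
                      double px, double py, double pz) {
    (void)nz;  // only needed for bounds checks, which this sketch omits
    const int ix = (int)std::floor(px), iy = (int)std::floor(py), iz = (int)std::floor(pz);
    const double fx = px - ix, fy = py - iy, fz = pz - iz;

    double sum = 0.0;
    for (int k = -1; k <= 2; k++)                // 4 slices in z
        for (int j = -1; j <= 2; j++)            // 4 rows in y
            for (int i = -1; i <= 2; i++) {      // 4 voxels in x
                double w = cubicWeight(fx, i) * cubicWeight(fy, j) * cubicWeight(fz, k);
                std::size_t idx = ((std::size_t)(iz + k) * ny + (std::size_t)(iy + j)) * nx
                                + (std::size_t)(ix + i);
                sum += w * volume[idx];
            }
    return sum;
}
-----------------------------------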