Performance of decimation v blending - Duong Tran - 2016-05-03 |
Performance of decimation v blending - Agner - 2016-05-03 |
Performance of decimation v blending |
---|
Author: Duong Tran | Date: 2016-05-03 09:38 |
Hello, I've been fighting the following problem for weeks and am about to give up. But I still hope that I am missing something. The problem originates from a fundamental signal processing task: transposing a 4x8 matrix of 16-bit unsigned short integers, and its inverse (i.e. transposing an 8x4 matrix).
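As a plain scalar reference for what the two directions compute, here is a minimal sketch. It is not the transpi()/transpo() code from this post: it does not use vectorclass, the names `Mat4x8`, `Mat8x4`, `transpose_4x8`, `transpose_8x4` and `check_round_trip` are hypothetical, and the row-major storage is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical types: 32 elements stored row-major (an assumption).
using Mat4x8 = std::array<std::uint16_t, 32>;  // 4 rows, 8 columns
using Mat8x4 = std::array<std::uint16_t, 32>;  // 8 rows, 4 columns

// Transpose 4x8 -> 8x4: out(c, r) = in(r, c).
inline Mat8x4 transpose_4x8(const Mat4x8& in) {
    Mat8x4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 8; ++c)
            out[c * 4 + r] = in[r * 8 + c];
    return out;
}

// Inverse direction, 8x4 -> 4x8: out(c, r) = in(r, c).
inline Mat4x8 transpose_8x4(const Mat8x4& in) {
    Mat4x8 out{};
    for (std::size_t r = 0; r < 8; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[c * 8 + r] = in[r * 4 + c];
    return out;
}

// Self-check: transposing there and back must reproduce the input.
inline void check_round_trip(const Mat4x8& m) {
    assert(transpose_8x4(transpose_4x8(m)) == m);
}
```

A vectorized version can be validated against this reference by comparing outputs element by element.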
Nice, but... the benchmark shows that transpo() is substantially slower than transpi(), at least for the SSSE3 instruction set. For example, when compiled with gcc -march=core2 and run on a Core 2 Duo E4500 processor, it is more than twice as slow. After many trials with this and about half a dozen other versions of transpo(), I'm about to conclude that (within the vectorclass library) one of the two fundamental operations, the _decimation_
is substantially slower than its counterpart, the _blending_
A temporary workaround for decimation is via blending:
which (of course) is only 1.5 times slower than blending. But I don't think this is the best vectorclass can do. Am I right? |
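To make the two operations concrete, here is a minimal scalar sketch. It assumes that "decimation" means picking every second element out of a pair of 8-element vectors, and that "blending" means interleaving two vectors; the names `Vec8`, `decimate2`, `interleave_low`, `interleave_high` and `check_inverse` are hypothetical and are not part of vectorclass. One possible reason for the asymmetry: the interleave pattern matches a single SSE2 unpack instruction (punpcklwd / punpckhwd), while the de-interleave pattern has no single two-source equivalent for 16-bit elements in SSSE3.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec8 = std::array<std::uint16_t, 8>;  // hypothetical 8 x 16-bit vector

// Decimation by 2: keep elements phase, phase+2, phase+4, phase+6
// of a, followed by the same elements of b. phase must be 0 or 1.
inline Vec8 decimate2(const Vec8& a, const Vec8& b, std::size_t phase) {
    Vec8 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i]     = a[2 * i + phase];
        out[i + 4] = b[2 * i + phase];
    }
    return out;
}

// Blending as interleave of the low halves: a0 b0 a1 b1 a2 b2 a3 b3.
inline Vec8 interleave_low(const Vec8& a, const Vec8& b) {
    Vec8 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
    return out;
}

// Blending as interleave of the high halves: a4 b4 a5 b5 a6 b6 a7 b7.
inline Vec8 interleave_high(const Vec8& a, const Vec8& b) {
    Vec8 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[2 * i]     = a[i + 4];
        out[2 * i + 1] = b[i + 4];
    }
    return out;
}

// Self-check: decimation undoes interleaving.
inline void check_inverse(const Vec8& a, const Vec8& b) {
    const Vec8 lo = interleave_low(a, b);
    const Vec8 hi = interleave_high(a, b);
    assert(decimate2(lo, hi, 0) == a);
    assert(decimate2(lo, hi, 1) == b);
}
```

Comparing the assembly generated for the vectorized equivalents of these two patterns should show whether the decimation compiles to a longer instruction sequence than the interleave.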
Performance of decimation v blending |
---|
Author: Agner | Date: 2016-05-03 12:47 |
Try to look at the assembly output. The permute and blend functions work differently for different patterns, depending on whether there is a suitable machine instruction that fits the pattern. Try also with higher instruction sets. |