Bilinear interpolation of images - Chad Jarvis - 2013-02-05 |
Bilinear interpolation of images - Agner - 2013-02-06 |
Bilinear interpolation of images - Chad Jarvis - 2013-02-07 |
Bilinear interpolation of images |
---|
Author: Chad Jarvis | Date: 2013-02-05 05:56 |
Hi. I have implemented bilinear interpolation of images in C++ on the CPU. I found a blog that does this with SSE2 and SSE3 instructions, which gives the fastest results:

fastcpp.blogspot.no/2011/06/bilinear-pixel-interpolation-using-sse.html#comment-form

I have tried to implement the same code using the vector class. However, the speed is lower than I had hoped for compared to the native SSE code on the blog. In fact, fixed-point math is even faster than my code with the vectorclass. I thought you might be interested to see what I have done, and perhaps you have some comments on how to improve my code.

Below follows my implementation of the GetPixelSSE and GetPixelSSE3 functions from the fastcpp blog.

inline Vec4f CalcWeights_vector(float x, float y) {
    return w_x * w_y;

int GetPixelSSE(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    p12x.load(&data[0]);
    Vec8us p12 = extend_low(p12x);
    Vec8us L1234 = w12*p12 + w34*p34;

int GetPixelSSE3(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    Vec4ui redv = extend_low(rg); //no mm_madd in vectorclass
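For comparison, the fixed-point approach mentioned above can be sketched in plain scalar C++. This is not the code from the post or from the blog: the function name bilinear_fixed_point, the 8-bit weight precision, and the requirement that (u, v) lies at least one pixel inside the right and bottom edges are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical scalar reference: blend the 2x2 neighbourhood around (u, v)
// in an image with four 8-bit channels packed into each 32-bit pixel.
// Assumes 0 <= u < width - 1 and 0 <= v < height - 1, so all four
// neighbours exist. No bounds checking is done.
inline std::uint32_t bilinear_fixed_point(const std::uint32_t* data,
                                          float u, float v, int width)
{
    const int x = static_cast<int>(u);   // truncation equals floor for u >= 0
    const int y = static_cast<int>(v);

    // Fractional parts as 8-bit fixed point, range 0..255.
    const std::uint32_t fx = static_cast<std::uint32_t>((u - x) * 256.0f);
    const std::uint32_t fy = static_cast<std::uint32_t>((v - y) * 256.0f);

    // Four weights in 16-bit fixed point; they always sum to 65536.
    const std::uint32_t w1 = (256 - fx) * (256 - fy);  // top-left
    const std::uint32_t w2 = fx * (256 - fy);          // top-right
    const std::uint32_t w3 = (256 - fx) * fy;          // bottom-left
    const std::uint32_t w4 = fx * fy;                  // bottom-right

    const std::uint32_t* p = data + y * width + x;
    const std::uint32_t p1 = p[0];
    const std::uint32_t p2 = p[1];
    const std::uint32_t p3 = p[width];
    const std::uint32_t p4 = p[width + 1];

    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        // Largest possible sum is 255 * 65536, which fits in 32 bits.
        const std::uint32_t sum = ((p1 >> shift) & 0xFF) * w1
                                + ((p2 >> shift) & 0xFF) * w2
                                + ((p3 >> shift) & 0xFF) * w3
                                + ((p4 >> shift) & 0xFF) * w4;
        result |= (sum >> 16) << shift;  // drop the 16 fractional bits
    }
    return result;
}
```

Called with a 2x2 image and (u, v) = (0.5, 0.5), each neighbour contributes a quarter of every channel. A scalar version like this is mainly useful as a correctness baseline when comparing vectorized variants.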
Bilinear interpolation of images |
---|
Author: Agner | Date: 2013-02-06 00:11 |
Chad Jarvis wrote:
I have implemented bilinear interpolation of images in C++

Your code spends most of its time moving data around and converting between float, 8-bit, 16-bit and 32-bit integers. You may want to consider whether the data can be organized differently so that you don't need so much moving around, reordering and conversion.

Your functions return a single pixel. Consider whether the data can be organized so that you keep everything in vectors and return a vector of multiple pixels, if this reduces the number of conversions.

Division by 256 can be done faster with an (unsigned) shift right by 8, or by dividing by const_uint(256).

You should add red, green and blue before doing the horizontal add, so that you only need one horizontal add.
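The equivalence behind the shift suggestion can be checked with a small scalar sketch. The helper names below are hypothetical, and the check covers unsigned 16-bit values only, which is the case named in the reply.

```cpp
#include <cstdint>

// Hypothetical helpers: divide one unsigned 16-bit lane by 256 in two ways.
inline std::uint16_t div_by_256(std::uint16_t x) {
    return static_cast<std::uint16_t>(x / 256);
}

inline std::uint16_t shift_right_8(std::uint16_t x) {
    return static_cast<std::uint16_t>(x >> 8);
}

// Exhaustively compare both forms over every unsigned 16-bit value.
inline bool shift_matches_division() {
    for (std::uint32_t i = 0; i <= 0xFFFFu; ++i) {
        const std::uint16_t x = static_cast<std::uint16_t>(i);
        if (div_by_256(x) != shift_right_8(x)) {
            return false;
        }
    }
    return true;
}
```

Because the two forms agree on every unsigned input, replacing the division with a shift changes only the cost, not the result.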
Bilinear interpolation of images |
---|
Author: Chad Jarvis | Date: 2013-02-07 04:32 |
I managed to optimize the code some more and now it's almost as fast as the intrinsic code. I changed "L1234/=256" first to "L1234/=const_uint(256)" and finally to "L1234>>8". That was the first big improvement, but the next one was unexpected. I changed most vectors from unsigned to signed and used "compress_unsaturated" instead of "compress". I figured this out by looking at "vectori128.h" and the intrinsic code.

Compress_unsaturated on signed vectors uses only one instruction (the same as the intrinsic code), whereas compress uses at least three. Compress_unsaturated on unsigned vectors uses many more. In your manual you describe the efficiency of compress_unsaturated as "medium (worse than compress in most cases)". I don't understand this, because for signed integer vectors (Vec4i, Vec8s, Vec16c) compress_unsaturated is simpler and faster.

Below is the code that gets closest to the intrinsic code.

int GetPixelSSE(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    p12x.load(&data[0]);
    Vec8s p12 = extend_low(p12x);
    Vec8s L1234 = w12*p12 + w34*p34;
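Assuming the two compress variants differ in whether out-of-range values wrap around or saturate, the distinction can be modelled per lane in scalar C++. The helper names are hypothetical and independent of the vector class; the saturating model follows the SSE2 pack-with-unsigned-saturation behaviour (_mm_packus_epi16), which treats its 16-bit inputs as signed and clamps them to 0..255.

```cpp
#include <algorithm>
#include <cstdint>

// Wrap-around narrowing: keep only the low 8 bits of the 16-bit lane.
inline std::uint8_t narrow_wrap(std::int16_t x) {
    return static_cast<std::uint8_t>(x & 0xFF);
}

// Saturating narrowing: clamp the signed 16-bit lane to 0..255.
inline std::uint8_t narrow_saturate(std::int16_t x) {
    const int clamped = std::min(255, std::max(0, static_cast<int>(x)));
    return static_cast<std::uint8_t>(clamped);
}
```

The two models agree whenever the input is already within 0..255, which is the case for a non-negative 16-bit sum after a right shift by 8. Under that condition a single saturating pack can stand in for a mask followed by a pack.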