Wednesday, July 29, 2009

SSE and auto-vectorization in g++

I have been implementing an SpMV (sparse matrix-vector multiplication) with multiple right-hand-side vectors (and hence multiple left-hand-side vectors). For each nonzero A[i][j], that translates into performing the following updates (assuming that the rhs & lhs vectors are represented as tuples of std::vectors):

get<0>(lvecs)[i] += A[i][j] * get<0>(rvecs)[j];
get<1>(lvecs)[i] += A[i][j] * get<1>(rvecs)[j];
get<2>(lvecs)[i] += A[i][j] * get<2>(rvecs)[j];
get<3>(lvecs)[i] += A[i][j] * get<3>(rvecs)[j];
...
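Spelled out in full, the inner loop looks roughly like this. This is a sketch under my own assumptions: CSR storage with hypothetical names rowptr/colind/val, and the right- and left-hand sides held as plain vectors of vectors rather than tuples:

```cpp
#include <vector>

const int NRHS = 4; // number of right-hand sides (assumed for illustration)

// Hypothetical CSR SpMV with multiple right-hand sides: for each nonzero
// A[i][j] (= val[k] at column colind[k]), update every one of the NRHS
// output vectors, exactly as in the unrolled updates above.
void spmv_multi(int n,
                const std::vector<int>& rowptr,
                const std::vector<int>& colind,
                const std::vector<double>& val,
                const std::vector<std::vector<double> >& rvecs,
                std::vector<std::vector<double> >& lvecs)
{
    for (int i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
        {
            int j = colind[k];
            double a = val[k];
            for (int r = 0; r < NRHS; ++r)  // the updates unrolled above
                lvecs[r][i] += a * rvecs[r][j];
        }
}
```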




I initially thought about three or four different ways of implementing this. One is to just rely on the automatic vectorization capabilities of g++ and hint the compiler to unroll the loop by making the trip count a template parameter, like this:

const int BETA = 16;

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for (int i = 0; i < D; ++i)
    {
        c[i] += a * b[i];
    }
}

int a = 2;
int b[BETA];
int c[BETA];
saxpy<BETA>(a, b, c);



Performing saxpy one million times with BETA=16 takes about 9.97 milliseconds on our Opteron and 14.35 milliseconds on my MacBook's Core 2 Duo. These are the best times I got, using -O3 optimization.
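For reference, a timing harness for this looks roughly as follows. This is a sketch, not the exact benchmark I ran; the volatile sink is there so that -O3 cannot delete the loop whose result is otherwise unused:

```cpp
#include <ctime>

const int BETA = 16;

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for (int i = 0; i < D; ++i)
        c[i] += a * b[i];
}

static volatile int sink; // keeps the result live under -O3

// Run `reps` saxpy calls and return elapsed CPU time in milliseconds.
// std::clock() is coarse but fine for a single-threaded microbenchmark.
double time_saxpy_ms(int reps)
{
    int b[BETA], c[BETA];
    for (int i = 0; i < BETA; ++i) { b[i] = i; c[i] = 0; }

    std::clock_t start = std::clock();
    for (int rep = 0; rep < reps; ++rep)
        saxpy<BETA>(2, b, c);
    sink = c[0];
    return 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC;
}
```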

Then I said, let's try to force some SSE through gcc's vector extensions:

typedef int vpackedsi __attribute__ ((vector_size (BETA * sizeof(int)))); // 64 bytes, a full cache line!

union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

ipackedvector av, bv, cv;
cv.v += av.v * bv.v;


Of course, the drawback here is that a also needs to be a vector, since operations between scalar and vector types are not allowed (you'll get error: invalid operands of types 'int' and 'int __vector__' to binary 'operator*'). But I assume the automatic vectorization in gcc does the same packing behind the covers (if it is using SSE/SIMD at all).
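The workaround is to splat the scalar into every lane before multiplying. A sketch (saxpy_vec is my name for illustration, not from the original code):

```cpp
const int BETA = 16;

// GCC vector extension: a 64-byte vector of 16 ints, as in the post.
typedef int vpackedsi __attribute__ ((vector_size (BETA * sizeof(int))));

union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

void saxpy_vec(int a, const ipackedvector& bv, ipackedvector& cv)
{
    ipackedvector av;
    for (int i = 0; i < BETA; ++i)  // splat the scalar into every lane
        av.f[i] = a;
    cv.v += av.v * bv.v;            // now both operands are vector types
}
```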

Unfortunately, the second approach was slower: 16.712 milliseconds on the Opteron and 24.18 milliseconds on the Core 2 Duo. That is almost exactly 67% slower on both architectures (which kind of surprised me). After taking a look at the generated assembly code, I see that the latter approach uses SSE4 instructions such as pmuldq, while the former uses floating-point operations such as addsd, even though the code is supposed to be working on integers. Maybe it is using the FPU pipelines for better performance; gcc must be smarter than I am.
