Initialization has been a vague part of my C++ knowledge. Although I always knew that most user-defined classes have appropriate constructors, so that I can expect reasonable behavior (e.g. std::string initializes to an empty string ""), this hasn't been the case for arrays of built-in types.
What do you expect from this code:
float * test = new float[10];
copy(test, test+10, ostream_iterator<float>(cout, " "));
Honestly, I expect anything to happen, though it sometimes behaves reasonably and spits out zeros. As a matter of fact, valgrind will list hundreds of errors for this tiny piece of code, most of which are of the form "Conditional jump or move depends on uninitialised value". One way to make sure everything is zero-initialized is to use std::fill or std::fill_n (but please, no memset) after the allocation. On the other hand, this is not always possible.
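For the snippet above, for instance, a single fill_n right after the allocation silences valgrind:

std::fill_n(test, 10, 0.0f); // overwrite all 10 uninitialized floats with zeros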
Assume you have a templated function which takes an array of T's, where T can be pretty much anything. If it is a non-POD (plain old data) class, then you can look at the documentation and make sure the default constructor does what you want it to do (in my case, assignment to zero). But how do you handle built-in types without writing a specialization to cover them? The std::fill trick won't work, since filling with zero might not make sense for the non-POD class you're dealing with.
Here is the loophole:
float * test = new float[10]();
copy(test, test+10, ostream_iterator<float>(cout, " "));
Note the empty set of parentheses as the initializer, which makes the array elements value-initialized, and the C++ standard says that a value-initialized POD type is zero-initialized. Neat, huh? (By the way, valgrind reports 0 errors this time.)
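This also takes care of the templated case above, since new T[n]() value-initializes for any T: zeros for PODs, the default constructor for classes. A minimal sketch (allocate_zeroed is my name, not a standard function):

template <typename T>
T * allocate_zeroed(size_t n)
{
    return new T[n](); // value-initialization: zeros for PODs, default ctor otherwise
}

float * fp = allocate_zeroed<float>(10);            // ten 0.0f's
std::string * sp = allocate_zeroed<std::string>(4); // four empty strings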
Wednesday, July 29, 2009
SSE and auto-vectorization in g++
I have been implementing an SpMV (sparse matrix-vector multiplication) with multiple right-hand-side vectors (and hence multiple left-hand-side vectors). For each nonzero A[i][j], that translates into performing (assuming that rhs & lhs are represented as tuples of std::vectors):
get<0>(lvecs)[i] += A[i][j] * get<0>(rvecs)[j];
get<1>(lvecs)[i] += A[i][j] * get<1>(rvecs)[j];
get<2>(lvecs)[i] += A[i][j] * get<2>(rvecs)[j];
get<3>(lvecs)[i] += A[i][j] * get<3>(rvecs)[j];
...
I initially thought about 3-4 different ways of implementing this. One is just to rely on the automatic vectorization capabilities of g++ and hint the compiler to unroll the loop by making the trip count a compile-time template parameter, like this:
template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for(int i = 0; i < D; ++i)
    {
        c[i] += a * b[i];
    }
}
int a = 2;
int b[BETA];
int c[BETA];
saxpy<BETA>(a, b, c);
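For reference, here is a minimal sketch of the benchmark around it (timed with gettimeofday, the same pattern I use in the post below; the input values are placeholders):

#include <sys/time.h>
#include <cstdio>

const int BETA = 16;

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for(int i = 0; i < D; ++i)
        c[i] += a * b[i];
}

int main()
{
    int a = 2;
    int b[BETA], c[BETA];
    for(int i = 0; i < BETA; ++i) { b[i] = i; c[i] = 0; }

    struct timeval tim;
    gettimeofday(&tim, NULL);
    double t1 = tim.tv_sec + tim.tv_usec/1000000.0;

    for(int rep = 0; rep < 1000000; ++rep)
        saxpy<BETA>(a, b, c);

    gettimeofday(&tim, NULL);
    double t2 = tim.tv_sec + tim.tv_usec/1000000.0;

    printf("%.6lf seconds, c[0]=%d\n", t2-t1, c[0]); // print c[0] so -O3 can't drop the loop
    return 0;
}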
Performing saxpy 1 million times with BETA=16 takes about 9.97 milliseconds on our Opteron and 14.35 milliseconds on my Macbook's Core2Duo processor. These are the best timings I got, using -O3 optimization.
Then I said, let's try to force some SSE through gcc vector extensions:
typedef int vpackedsi __attribute__ ((vector_size (BETA*sizeof(int)))); // 64 bytes, a full cache line!
union ipackedvector
{
vpackedsi v;
int f[BETA];
};
ipackedvector av, bv, cv;
cv.v += av.v * bv.v;
Of course, the drawback here is that a also needs to be packed into a vector, since operations between scalar and vector types are not allowed (you'll get "error: invalid operands of types 'int' and 'int __vector__' to binary 'operator*'"), but I assume the automatic vectorization in gcc does the same packing behind the covers (if it is using SSE/SIMD at all).
Unfortunately, the second approach was slower: 16.712 milliseconds on the Opteron and 24.18 milliseconds on the Core2Duo. That is roughly 67% slower on both architectures, which kind of surprised me. After taking a look at the generated assembly code, I see that the latter approach uses SSE4 instructions such as pmuldq, while the former uses floating-point operations such as addsd even though the code is supposed to use integers. Maybe it is using the FPU pipelines for better performance; gcc must be smarter than I am.
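For completeness, here is the vector-extension variant fleshed out into a self-contained snippet, with the scalar manually broadcast into av (the broadcast loop and the fill values are mine):

#include <cstdio>

const int BETA = 16;
typedef int vpackedsi __attribute__ ((vector_size (BETA*sizeof(int))));

union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

int main()
{
    ipackedvector av, bv, cv;
    for(int i = 0; i < BETA; ++i)
    {
        av.f[i] = 2;  // broadcast the scalar a=2 into every lane
        bv.f[i] = i;
        cv.f[i] = 0;
    }

    cv.v += av.v * bv.v; // one packed multiply-add over all BETA lanes

    for(int i = 0; i < BETA; ++i)
        printf("%d ", cv.f[i]);
    printf("\n");
    return 0;
}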
Friday, July 24, 2009
C++ Performance for HPC, part #1
I deliberately chose this controversial title, as C++ performance is one of the most debated issues. I also avoided saying C/C++, as no such language exists: C is C, and C++ is C++. I hate the famous "do one simple run, measure it, draw far-reaching conclusions from that simple experiment" cycle.
This basic example is about the debate on array of structs vs. struct of arrays. You'll hear that the array of structs approach is superior if you are accessing all the entries of the struct simultaneously. This is basically because when you access StrArray[i].first, the other entries of the struct (say StrArray[i].second and StrArray[i].third) will be fetched into the cache automatically, and you'll likely avoid a cache miss when you access them on the next line. However, this only makes sense if you're doing random access into the array. If you're simply streaming through the array, the pressure on the memory subsystem and the caches is the same either way, because nothing that you bring into the cache is wasted; spatial locality is great when you access data sequentially. In that case, the winner of the battle between array of structs and struct of arrays will be determined by lots of other issues such as array length (are there any conflict misses?), cache sizes, replacement policy, compiler, etc. Here is my test:
struct mystruct
{
double first;
double second;
double third;
};
struct StrOfArrays
{
    StrOfArrays(int size)
    {
        firsts = new double[size];
        seconds = new double[size];
        thirds = new double[size];
    }
    ~StrOfArrays()
    {
        delete [] firsts;
        delete [] seconds;
        delete [] thirds;
    }
    double * firsts;
    double * seconds;
    double * thirds;
};
// needs <ext/numeric> for __gnu_cxx::iota and <sys/time.h> for gettimeofday
int * linear = new int[TESTSIZE];
int * random = new int[TESTSIZE];
__gnu_cxx::iota(linear, linear+TESTSIZE, 0);  // linear = 0,1,2,...
__gnu_cxx::iota(random, random+TESTSIZE, 0);
random_shuffle(random, random+TESTSIZE);      // random = a random permutation
StrOfArrays strofarrays(TESTSIZE);
mystruct * arrayofstrs = new mystruct[TESTSIZE];
struct timeval tim;  // for the gettimeofday timing below
double t1, t2;
gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
arrayofstrs[linear[i]].first += 1;
arrayofstrs[linear[i]].second += 2;
arrayofstrs[linear[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming array of structs\n", t2-t1);
gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
strofarrays.firsts[linear[i]] += 1;
strofarrays.seconds[linear[i]] += 2;
strofarrays.thirds[linear[i]] += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming struct of arrays\n", t2-t1);
gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
arrayofstrs[random[i]].first += 1;
arrayofstrs[random[i]].second += 2;
arrayofstrs[random[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing array of structs\n", t2-t1);
gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
strofarrays.firsts[random[i]] += 1;
strofarrays.seconds[random[i]] += 2;
strofarrays.thirds[random[i]] += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing struct of arrays\n", t2-t1);
Results on Opteron 8378:
TESTSIZE=1 million
0.013407 seconds elapsed for streaming array of structs
0.009285 seconds elapsed for streaming struct of arrays
0.050337 seconds elapsed for randomly accessing array of structs
0.079531 seconds elapsed for randomly accessing struct of arrays
TESTSIZE=10 million
0.135137 seconds elapsed for streaming array of structs
0.129781 seconds elapsed for streaming struct of arrays
0.579303 seconds elapsed for randomly accessing array of structs
1.257762 seconds elapsed for randomly accessing struct of arrays
TESTSIZE=100 million
1.348777 seconds elapsed for streaming array of structs
0.917530 seconds elapsed for streaming struct of arrays
12.087753 seconds elapsed for randomly accessing array of structs
32.038711 seconds elapsed for randomly accessing struct of arrays
As seen, for basic streaming (sequential access), the claim of array of structs being superior is not true, at least on our Opteron. However, when we move to random access, you're better off opting for the array of structs approach. Translating this to a real application:
If you're doing a sparse matrix X dense vector multiply using triples (row_id, col_id, num_val), stick to your struct of arrays (a minimal sketch follows below).
If you're doing a sparse matrix X sparse matrix multiply, then you're better off with array of structs because this time you have a lot of random access.
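To make the first point concrete, here is a minimal sketch of a triples-based SpMV with the three coordinate arrays kept as a struct of arrays (all names are mine; nnz is the number of nonzeros, and the arrays are assumed to be filled elsewhere):

struct TriplesSoA
{
    int * row_ids;     // row index of each nonzero
    int * col_ids;     // column index of each nonzero
    double * num_vals; // numerical value of each nonzero
};

// y += A * x, streaming sequentially through all three arrays
void spmv(const TriplesSoA & A, int nnz, const double * x, double * y)
{
    for(int k = 0; k < nnz; ++k)
    {
        y[A.row_ids[k]] += A.num_vals[k] * x[A.col_ids[k]];
    }
}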