Friday, July 24, 2009

C++ Performance for HPC, part #1

I deliberately chose this controversial title as it is one of the most debated issues. I also avoided saying C/C++ as no such language exists. C is C, and C++ is C++. I hate the famous "do one simple run, measure it, draw far-reaching conclusions from that simple experiment" cycle.

This basic example is about the debate on array of structs vs. struct of arrays. You'll hear that an array of structs approach is superior if you are accessing all the entries of the struct simultaneously. This is basically because when you access StrArray[i].first, the other entries of the struct (say StrArray[i].second and StrArray[i].third) will be fetched to the cache automatically and you'll likely to avoid a cache miss when you access them in the next line. However, this only makes sense if you're doing random access to the array. On the other hand, if you're simply streaming through the array, the pressure on the memory subsystem and the caches are simply the same. Because in that case, nothing that you bring to the cache is wasted, i.e. spatial locality is great when you access data sequentially. In that case, the winner of the battle between array of structs and struct of arrays will be determined by lots of other issues such as array length (are there any conflict misses?), cache sizes, replacement policy, compiler, etc. Here is my test:


struct mystruct
{
double first;
double second;
double third;
};

struct StrOfArrays
{
StrOfArrays(int size)
{
firsts = new double[size];
seconds = new double[size];
thirds = new double[size];
}
~StrArrays()
{
DeleteAll(firsts, seconds, thirds);
}
double * firsts;
double * seconds;
double * thirds;
}

int * linear = new int[TESTSIZE];
int * random = new int[TESTSIZE];
__gnu_cxx::iota(linear, linear+TESTSIZE, 0);
__gnu_cxx::iota(random, random+TESTSIZE, 0);
random_shuffle(random, random+TESTSIZE);

StrOfArrays strofarrays (TESTSIZE);
mystruct * arrayofstrs = new mystruct[TESTSIZE];

gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
arrayofstrs[linear[i]].first += 1;
arrayofstrs[linear[i]].second += 2;
arrayofstrs[linear[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
strofarrays.firsts[linear[i]] += 1;
strofarrays.seconds[linear[i]] += 2;
strofarrays.thirds[linear[i]] += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming struct of arrays\n", t2-t1);

gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
arrayofstrs[random[i]].first += 1;
arrayofstrs[random[i]].second += 2;
arrayofstrs[random[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1=tim.tv_sec+(tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
strofarrays.firsts[random[i]] += 1;
strofarrays.seconds[random[i]] += 2;
strofarrays.thirds[random[i]] += 3;
}
gettimeofday(&tim, NULL);
t2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing struct of arrays\n", t2-t1);



Results on Opteron 8378:

TESTSIZE=1 million
0.013407 seconds elapsed for streaming array of structs
0.009285 seconds elapsed for streaming struct of arrays
0.050337 seconds elapsed for randomly accessing array of structs
0.079531 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=10 million
0.135137 seconds elapsed for streaming array of structs
0.129781 seconds elapsed for streaming struct of arrays
0.579303 seconds elapsed for randomly accessing array of structs
1.257762 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=100 million
1.348777 seconds elapsed for streaming array of structs
0.917530 seconds elapsed for streaming struct of arrays
12.087753 seconds elapsed for randomly accessing array of structs
32.038711 seconds elapsed for randomly accessing struct of arrays


As seen, for basic streaming (sequential access), the claim of array of structs being superior is not true, at least on our Opteron. However, when we move to random access, you're better of opting for an array of structs approach. Translating this to real application:
If you're doing a sparse matrix X dense vector multiply using triples (row_id, col_id, num_val), stick to your struct of arrays.
If you're doing a sparse matrix X sparse matrix multiply, then you're better off with array of structs because this time you have a lot of random access.

No comments:

Post a Comment