Wednesday, October 20, 2010

Unintended std::vector blues

std::vector sometimes has rather obscure semantics. For example, when you're writing generic code with MPI and you use vector to store some local data that needs to be sent/received by another processor, the usual trick of retrieving the beginning address of the internal array with &my_vector[0] will fail miserably when T=bool, because std::vector<bool> is a bit-packed specialization that has no contiguous array of bool underneath.

This post is about another obscurity, namely the reserve() method. Unless you follow it with push_back() calls, it is rarely useful. Here is why:

vector<T> v, w;
v.reserve(n);
for(int i=0; i<n; ++i)
    v[i] = (T) i;   // writes past size(): undefined behavior
w.swap(v);


Now, guess what? You end up with two completely empty vectors. reserve() only changes the capacity, never the size, and operator[] never grows the vector either; writing to v[i] for i >= size() is simply undefined behavior that happens not to crash here because the capacity was already there (imho, the size should grow whenever an unused slot is first touched, but that's hard to ensure in practice). Therefore v is zero-sized even after the for-loop, and after the swap so is w.
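
A minimal sketch of the two ways that actually populate the vector (using int for T purely for illustration):

#include <vector>

int main()
{
    const int n = 10;

    // Option 1: resize (or construct with a size), then operator[] is legal.
    std::vector<int> v(n);                       // size() == n, elements zero-initialized
    for(int i=0; i<n; ++i) v[i] = i;

    // Option 2: reserve, then push_back; each push_back grows size() by one.
    std::vector<int> w;
    w.reserve(n);                                // capacity n, size still 0
    for(int i=0; i<n; ++i) w.push_back(i);

    // Both v and w now really contain n elements.
    return 0;
}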

Tuesday, October 19, 2010

Intrusive Data Structures

This is about something that standard CS curricula rarely teach. We always think of data structures as synonymous with containers. We "insert" something, we "delete", or we traverse that "bag" of objects. Then at some point, we figure out that for sufficiently big elements, copying the data into the container costs more than the actual operations on the container. Later, we go even more crazy and try the discouraged route of inserting pointers into the container instead of the actual objects. This leaves us with terrible memory management and long hours of debugging.

An alternative is to think of "intrusive" data structures. As the simplest example, think about a linked list. Instead of wrapping the object in a bigger ListNode object like this:

template <typename Obj>
struct ListNode
{
    Obj object;
    ListNode * next;
};


We can embed the next pointer inside each object that goes into the list:

struct Object
{
    Object * next;
    ...
};
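
To make the idea concrete, here is a minimal sketch with made-up names (Task and IntrusiveList are hypothetical, just for illustration): the object carries its own link, so "inserting" it never copies it and never allocates a node.

#include <iostream>

// Hypothetical object that carries its own link, as described above.
struct Task
{
    Task * next;   // intrusive link: lives inside the object itself
    int id;
    explicit Task(int i) : next(0), id(i) {}
};

// The "list" is just a head pointer; it never copies or owns the objects.
struct IntrusiveList
{
    Task * head;
    IntrusiveList() : head(0) {}

    void push_front(Task & t)   // O(1), no allocation, no copy
    {
        t.next = head;
        head = &t;
    }
};

int main()
{
    Task a(1), b(2), c(3);      // the objects live wherever the caller decides
    IntrusiveList list;
    list.push_front(a);
    list.push_front(b);
    list.push_front(c);

    for (Task * cur = list.head; cur != 0; cur = cur->next)
        std::cout << cur->id << " ";   // prints: 3 2 1
    std::cout << std::endl;
    return 0;
}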



For more information, read on: http://www.codefarms.com/publications/intrusiv/Intrusive.pdf

Monday, August 30, 2010

Template specialization is explicit instantiation

Fact: as the title says, whenever you fully specialize a function template, you have explicitly instantiated it as well; the specialization's definition is emitted as a concrete, ordinary function.

What's the practical implication? You can compile the specialization separately and link it to the main program. This property was invaluable in my case, where I wanted to use different compilers for different parts of my code (hence I could not just put every template definition in headers and include them everywhere).

How do you do it?

Inside main.cpp:
template <unsigned TT>
void foo(/* bunch of parameters */) {} // dummy base template

template <>
void foo<2>(/* bunch of parameters */); // only the declaration of the specialization



Inside side.cpp:
template <unsigned TT>
void foo(/* bunch of parameters */); // declaration of the base template

template <>
void foo<2>(/* bunch of parameters */)
{ /* lots of code */ } // the actual definition of the specialization
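
For completeness, a sketch of the call site (the parameter list is whatever your foo actually takes). Since main.cpp only sees the declaration of foo<2>, the call is resolved at link time against the object file produced from side.cpp, which may well have been built with a different compiler:

// still inside main.cpp
int main()
{
    foo<2>(/* bunch of parameters */); // resolved at link time from side.cpp's object file
    return 0;
}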

Wednesday, December 30, 2009

Debugging MPI

Parallel debugging has come a long way. Today, it's quite possible to use X forwarding (ssh -X) to remotely debug a parallel MPI application running on a supercomputer. TotalView and DDT are just two example debuggers; their interfaces will feel familiar to a seasoned C/C++ programmer.

However, debugging MPI is still painful, because it rarely does what it claims to do. When there is a crash, all you get to see is the crash. The MPI function in which the crash occurred will probably never return an error code, because the default error handler aborts the application first. It is claimed that you can override this default behavior in C++ by setting the error handler to MPI::ERRORS_THROW_EXCEPTIONS, but I have yet to see that happen in practice.
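
For reference, here is roughly what that attempt looks like. This is only a sketch, using the plain C interface (MPI_ERRORS_RETURN) rather than the C++ bindings; the deliberately bogus send is just an assumed example of something the implementation's argument checking should reject.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    // Ask MPI to return error codes instead of aborting the whole job.
    // (The C++ bindings' analogue is setting MPI::ERRORS_THROW_EXCEPTIONS.)
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    // A deliberately bogus call (negative count) that argument checking
    // should reject; with ERRORS_RETURN we should get a code back instead
    // of an immediate abort.
    int dummy = 0;
    int rc = MPI_Send(&dummy, -1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        std::fprintf(stderr, "MPI reported: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}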

What's left for me is to peek into the internals of the MPI objects to see if any fields (or the whole object) are uninitialized. But sometimes there is really no such thing as an MPI object as we know it; what you hold is just a handle. For example, here is the implementation in MPICH2 and MVAPICH2:


// in file mpi.h
typedef int MPI_Win;

// in file mpiimpl.h
typedef struct MPID_Win {
    int handle;                  /* value of MPI_Win for this structure */
    volatile int ref_count;
    int fence_cnt;               /* 0 = no fence has been called; 1 = fence has been called */
    MPID_Errhandler *errhandler; /* Pointer to the error handler structure */
    void *base;
    MPI_Aint size;
    int disp_unit;               /* Displacement unit of *local* window */
    MPID_Attribute *attributes;
    MPID_Group *start_group_ptr; /* group passed in MPI_Win_start */
    int start_assert;            /* assert passed to MPI_Win_start */
    MPI_Comm comm;               /* communicator of window (dup) */
    ...
    char name[MPI_MAX_OBJECT_NAME];
} MPID_Win;


// In file mpicxx.h
class Win {
protected:
    MPI_Win the_real_win;

public:
    inline Win(MPI_Win obj) : the_real_win(obj) {}
    inline Win(void) : the_real_win(MPI_WIN_NULL) {}
    virtual ~Win() {}

    Win(const Win &obj) : the_real_win(obj.the_real_win) {}
    Win& operator=(const Win &obj) {
        the_real_win = obj.the_real_win; return *this; }

    // logical
    bool operator==(const Win &obj) {
        return (the_real_win == obj.the_real_win); }
    bool operator!=(const Win &obj) {
        return (the_real_win != obj.the_real_win); }

    // C/C++ cast and assignment
    inline operator MPI_Win*() { return &the_real_win; }
    inline operator MPI_Win() const { return the_real_win; }
    Win& operator=(const MPI_Win& obj) {
        the_real_win = obj; return *this; }
    ...
};



Prior to any operation, the real objects are retrieved via these handles:

/* Convert MPI object handles to object pointers */
MPID_Win_get_ptr( win, win_ptr );


where


#define MPID_Win_get_ptr(a,ptr) MPID_Get_ptr(Win,a,ptr)

/* Convert handles to objects for MPI types that do _not_ have any predefined objects */
#define MPID_Get_ptr(kind,a,ptr)                                          \
{                                                                         \
    switch (HANDLE_GET_KIND(a)) {                                         \
        case HANDLE_KIND_DIRECT:                                          \
            ptr=MPID_##kind##_direct+HANDLE_INDEX(a);                     \
            break;                                                        \
        case HANDLE_KIND_INDIRECT:                                        \
            ptr=((MPID_##kind*)                                           \
                 MPIU_Handle_get_ptr_indirect(a,&MPID_##kind##_mem));     \
            break;                                                        \
        case HANDLE_KIND_INVALID:                                         \
        case HANDLE_KIND_BUILTIN:                                         \
        default:                                                          \
            ptr=0;                                                        \
            break;                                                        \
    }                                                                     \
}



So what's the story behind this design?

MPI Opaque objects such as 'MPI_Comm' or 'MPI_Datatype' are specified by integers (in the MPICH2 implementation); the MPI standard calls these handles. Out of range values are invalid; the value 0 is reserved. For most (with the possible exception of 'MPI_Request' for performance reasons) MPI Opaque objects, the integer encodes both the kind of object (allowing runtime tests to detect a datatype passed where a communicator is expected) and important properties of the object. Even the 'MPI_xxx_NULL' values should be encoded so that different null handles can be distinguished. The details of the encoding of the handles is covered in more detail in the MPICH2 Design Document. For the most part, the ADI uses pointers to the underlying structures rather than the handles themselves. However, each structure contains an 'handle' field that is the corresponding integer handle for the MPI object.
MPID objects (objects used within the implementation of MPI) are not opaque.


Here is the MPICH2 design document for the interested.


Anyway, if you were desperate enough to look at the design document, you'd see what kind of black magic is necessary to make sense of it. In a way this makes sense: it is automatic encapsulation of data; one doesn't even need to use classes and declare members private, because no one can access the object without deciphering what it is first. Fortunately, there are alternative MPI implementations, such as OpenMPI, that use real (well, sort of) objects. Here is how MPI_Win is (partially) declared in OpenMPI:


struct ompi_win_t {
    opal_object_t w_base;
    opal_mutex_t w_lock;

    /* Group associated with this window. */
    ompi_group_t *w_group;

    /* Information about the state of the window. */
    uint16_t w_flags;

    /* Error handling. This field does not have the "w_" prefix so that
       the OMPI_ERRHDL_* macros can find it, regardless of whether it's a
       comm, window, or file. */
    ompi_errhandler_t *error_handler;
    ompi_errhandler_type_t errhandler_type;

    /* displacement factor */
    int w_disp_unit;

    void *w_baseptr;
    size_t w_size;

    /** Current epoch / mode (access, expose, lock, etc.). Checked by the
        argument checking code in the MPI layer, set by the OSC component.
        Modified without locking w_lock. */
    volatile uint16_t w_mode;

    /* one sided interface */
    ompi_osc_base_module_t *w_osc_module;
};




Once I call Post on a Window, its w_mode becomes 34, which is 0x22 in hexadecimal, meaning the window has been posted and the exposure epoch has actually started. And after I successfully call Start on the same window, its w_mode becomes 99, which is 0x63 in hexadecimal, telling me that ACCESS_EPOCH and STARTED are now also set.


/* mode */
#define OMPI_WIN_ACCESS_EPOCH 0x00000001
#define OMPI_WIN_EXPOSE_EPOCH 0x00000002
#define OMPI_WIN_FENCE 0x00000010
#define OMPI_WIN_POSTED 0x00000020
#define OMPI_WIN_STARTED 0x00000040
#define OMPI_WIN_LOCK_ACCESS 0x00000080
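
The arithmetic above falls straight out of these flags; a tiny sketch of the bookkeeping:

#include <cstdio>

int main()
{
    // Flag values copied from the OpenMPI header above.
    const unsigned ACCESS_EPOCH = 0x01, EXPOSE_EPOCH = 0x02;
    const unsigned POSTED = 0x20, STARTED = 0x40;

    unsigned w_mode = 0;

    w_mode |= EXPOSE_EPOCH | POSTED;   // what Post appends
    std::printf("after Post:  0x%02x = %u\n", w_mode, w_mode);  // 0x22 = 34

    w_mode |= ACCESS_EPOCH | STARTED;  // what Start appends on the same window
    std::printf("after Start: 0x%02x = %u\n", w_mode, w_mode);  // 0x63 = 99

    return 0;
}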



Just to decipher a bit more (you should never proceed this far unless you are 100% sure that it's the MPI implementation's bug, not yours):
w_osc_module contains all the necessary function pointers implemented by the OSC (One-Sided Component).
For example, the post function looks like the following:



// in file osc_pt2pt_sync.c
int ompi_osc_pt2pt_module_post(ompi_group_t *group, int assert, ompi_win_t *win)
{
    int i;
    ompi_osc_pt2pt_module_t *module = P2P_MODULE(win);

    OBJ_RETAIN(group);
    ompi_group_increment_proc_count(group);

    OPAL_THREAD_LOCK(&(module->p2p_lock));
    assert(NULL == module->p2p_pw_group);
    module->p2p_pw_group = group;

    /* Set our mode to expose w/ post */
    ompi_win_remove_mode(win, OMPI_WIN_FENCE);
    ompi_win_append_mode(win, OMPI_WIN_EXPOSE_EPOCH | OMPI_WIN_POSTED);

    /* list how many complete counters we're still waiting on */
    module->p2p_num_complete_msgs +=
        ompi_group_size(module->p2p_pw_group);
    OPAL_THREAD_UNLOCK(&(module->p2p_lock));

    /* send a hello counter to everyone in group */
    for (i = 0 ; i < ompi_group_size(module->p2p_pw_group) ; ++i) {
        ompi_osc_pt2pt_control_send(module, ompi_group_peer_lookup(group, i),
                                    OMPI_OSC_PT2PT_HDR_POST, 1, 0);
    }
    return OMPI_SUCCESS;
}



In reality, some of these functions do very minimal communication. For example, all Start does is find the rank of each process in the specified group, store the indices, and set the true/false flags in the active-ranks table.

One thing to notice is the P2P_MODULE() call, which basically casts the w_osc_module pointer to ompi_osc_pt2pt_module_t *, a type with the following interesting members (ok, interesting from a general active-target synchronization perspective):

// inside ompi/mca/osc/pt2pt/osc_pt2pt.h
struct ompi_osc_pt2pt_module_t {
    /** Extend the basic osc module interface */
    ompi_osc_base_module_t super; // ABAB: See the ad-hoc inheritance? - Lovely!

    /** pointer back to window */
    ompi_win_t *p2p_win;

    /** communicator created with this window */
    ompi_communicator_t *p2p_comm;

    /** control message receive request */
    struct ompi_request_t *p2p_cb_request;

    opal_list_t p2p_pending_control_sends;

    /** list of ompi_osc_pt2pt_sendreq_t structures, and includes all requests
        for this access epoch that have not already been started. p2p_lock
        must be held when modifying this field. */
    opal_list_t p2p_pending_sendreqs;

    /** list of unsigned int counters for the number of requests to a
        particular rank in p2p_comm for this access epoch. p2p_lock must be
        held when modifying this field */
    unsigned int *p2p_num_pending_sendreqs;

    /** For MPI_Fence synchronization, the number of messages to send in the
        epoch. For Start/Complete, the number of updates for this Complete.
        For lock, the number of messages waiting for completion on the origin
        side. Not protected by p2p_lock - must use atomic counter operations. */
    volatile int32_t p2p_num_pending_out;

    /** For MPI_Fence synchronization, the number of expected incoming
        messages. For Post/Wait, the number of expected updates from complete.
        For lock, the number of messages on the passive side we are waiting
        for. Not protected by p2p_lock - must use atomic counter operations. */
    volatile int32_t p2p_num_pending_in;

    /** Number of "ping" messages from the remote post group we've received */
    volatile int32_t p2p_num_post_msgs;

    /** Number of "count" messages from the remote complete group we've received */
    volatile int32_t p2p_num_complete_msgs;

    ...

    /********************** PWSC data *************************/
    struct ompi_group_t *p2p_pw_group;
    struct ompi_group_t *p2p_sc_group;
    bool *p2p_sc_remote_active_ranks;
    int *p2p_sc_remote_ranks;

    ...
};
typedef struct ompi_osc_pt2pt_module_t ompi_osc_pt2pt_module_t;

Wednesday, July 29, 2009

Initializations in C++

Initialization has been a vague part of my C++ knowledge. Although I always knew that most user-defined classes have appropriate constructors, so that I can expect reasonable behavior (e.g., std::string initializes to an empty string ""), this hasn't been the case for built-in arrays.

What do you expect from this code:

float * test = new float[10];
copy(test, test+10, ostream_iterator<float>(cout, " "));


Honestly, I expect anything to happen, though it sometimes behaves reasonably and spits out zeros. As a matter of fact, valgrind will list hundreds of errors for this tiny piece of code, most of which are of the form "Conditional jump or move depends on uninitialised value". The way to make sure everything is zero-initialized is to use std::fill or std::fill_n (but please, no memset) after the allocation. On the other hand, this is not always possible.
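
When the zero-fill is an option, it is a one-liner; a minimal sketch on the same array:

#include <algorithm>

int main()
{
    float * test = new float[10];
    std::fill_n(test, 10, 0.0f);   // now every element really is zero
    delete [] test;
    return 0;
}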

Assume you have a templated function which takes an array of T's, where T can be pretty much anything. If it is a non-POD (plain old data) class, then you can look at the documentation and make sure the default constructor does what you want it to do (in my case, assignment to zero). But how do you handle built-in types without writing a specialization to cover them? The std::fill trick won't work, since it might not make sense for the non-POD class you're dealing with.

Here is the loophole:


float * test = new float[10]();
copy(test, test+10, ostream_iterator<float>(cout, " "));


Note the empty set of parentheses as the initializer, which value-initializes the array elements, and the C++ standard says that a value-initialized POD type is zero-initialized. Neat, huh? (By the way, valgrind reports 0 errors this time.)
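
Since the empty-parentheses initializer works for POD and non-POD element types alike, the templated scenario above can use it directly. A minimal sketch (zero_allocate is a made-up helper name):

#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical helper: T can be a POD (zero-initialized) or a class type
// (default-constructed), thanks to the empty-parentheses initializer.
template <typename T>
T * zero_allocate(std::size_t n)
{
    return new T[n]();   // note the (): every element is value-initialized
}

int main()
{
    float * f = zero_allocate<float>(5);              // all 0.0f
    std::string * s = zero_allocate<std::string>(5);  // all ""

    for(int i=0; i<5; ++i) std::cout << f[i] << " ";
    std::cout << std::endl;

    delete [] f;
    delete [] s;
    return 0;
}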

SSE and auto-vectorization in g++

I have been implementing an SpMV with multiple right-hand-side vectors (and hence multiple left-hand-side vectors). For each A[i][j], that translates into performing the following (assuming that the rhs and lhs vectors are represented as tuples of std::vectors):

get<0>(lvecs)[i] += A[i][j] * get<0>(rvecs)[j];
get<1>(lvecs)[i] += A[i][j] * get<1>(rvecs)[j];
get<2>(lvecs)[i] += A[i][j] * get<2>(rvecs)[j];
get<3>(lvecs)[i] += A[i][j] * get<3>(rvecs)[j];
...




I initially thought about 3-4 different ways of implementing this. One is just to rely on the automatic vectorization capabilities of g++ and hint the compiler to unroll the loop using templates, like this:

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for(int i=0; i<D; ++i)
    {
        c[i] += a * b[i];
    }
}

int a = 2;
int b[BETA];
int c[BETA];
saxpy<BETA>(a, b, c);



Performing saxpy 1 million times with BETA=16 takes about 9.97 milliseconds on our Opteron and 14.35 milliseconds on my Macbook's Core2Duo processor. These are the best numbers I got, using -O3 optimization.
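
The exact driver is not shown here, but the harness was nothing fancy; a sketch of the kind of loop these numbers came from, reusing the gettimeofday pattern from a later post (the final print is an assumption, just to keep the compiler from optimizing everything away):

#include <sys/time.h>
#include <cstdio>

#define BETA 16

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for(int i=0; i<D; ++i)
        c[i] += a * b[i];
}

int main()
{
    int a = 2;
    int b[BETA], c[BETA];
    for(int i=0; i<BETA; ++i) { b[i] = i; c[i] = 0; }

    timeval tim;
    gettimeofday(&tim, NULL);
    double t1 = tim.tv_sec + (tim.tv_usec/1000000.0);

    for(int rep=0; rep<1000000; ++rep)   // the 1 million calls mentioned above
        saxpy<BETA>(a, b, c);

    gettimeofday(&tim, NULL);
    double t2 = tim.tv_sec + (tim.tv_usec/1000000.0);

    std::printf("%.3f milliseconds, c[BETA-1]=%d\n", (t2-t1)*1000.0, c[BETA-1]);
    return 0;
}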

Then I said, let's try to force some SSE through gcc vector extensions:

typedef int vpackedsi __attribute__ ((vector_size (BETA*sizeof(int)))); // 64 bytes, a full cache line!
union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

ipackedvector av, bv, cv;
cv.v += av.v * bv.v;


Of course, the drawback here is that a also needs to be packed into a vector, since operations between scalar and vector types are not allowed (you'll get: error: invalid operands of types 'int' and 'int __vector__' to binary 'operator*'), but I assume the automatic vectorization in gcc does the same packing behind the covers (if it is using SSE/SIMD at all).
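
For completeness, the scalar has to be broadcast into its own packed vector by hand before the multiply; a sketch built on the union above:

#define BETA 16

// GCC vector extension type and union, as in the snippet above.
typedef int vpackedsi __attribute__ ((vector_size (BETA*sizeof(int))));
union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

int main()
{
    int a = 2;
    ipackedvector av, bv, cv;

    for(int i=0; i<BETA; ++i)
    {
        av.f[i] = a;   // replicate the scalar across every lane
        bv.f[i] = i;   // some sample operands
        cv.f[i] = 0;
    }

    cv.v += av.v * bv.v;   // one packed multiply-add over the whole cache line

    return cv.f[0];        // use the result so it is not optimized away
}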

Unfortunately, the second approach was slower: 16.712 milliseconds on the Opteron and 24.18 milliseconds on the Core2Duo, roughly 67% slower on both architectures (which kind of surprised me). After taking a look at the generated assembly code, I see that the latter approach uses SSE4 instructions such as pmuldq, while the former uses floating-point operations such as addsd even though the code is supposed to use integers. Maybe it is using the FPU pipelines for better performance; gcc must be smarter than I am.

Friday, July 24, 2009

C++ Performance for HPC, part #1

I deliberately chose this controversial title as it is one of the most debated issues. I also avoided saying C/C++ as no such language exists. C is C, and C++ is C++. I hate the famous "do one simple run, measure it, draw far-reaching conclusions from that simple experiment" cycle.

This basic example is about the debate on array of structs vs. struct of arrays. You'll hear that an array-of-structs approach is superior if you are accessing all the entries of the struct simultaneously. This is basically because when you access StrArray[i].first, the other entries of the struct (say StrArray[i].second and StrArray[i].third) will be fetched into the cache automatically, and you're likely to avoid a cache miss when you access them in the next line. However, this only makes sense if you're doing random access into the array. If you're simply streaming through the array, the pressure on the memory subsystem and the caches is essentially the same, because in that case nothing you bring into the cache is wasted; spatial locality is great when you access data sequentially. In that case, the winner of the battle between array of structs and struct of arrays is determined by lots of other issues such as array length (are there any conflict misses?), cache sizes, replacement policy, compiler, etc. Here is my test:


struct mystruct
{
    double first;
    double second;
    double third;
};

struct StrOfArrays
{
    StrOfArrays(int size)
    {
        firsts = new double[size];
        seconds = new double[size];
        thirds = new double[size];
    }
    ~StrOfArrays()
    {
        DeleteAll(firsts, seconds, thirds);  // DeleteAll: helper that delete[]'s all three arrays
    }
    double * firsts;
    double * seconds;
    double * thirds;
};

timeval tim;     // for gettimeofday-based timing
double t1, t2;

int * linear = new int[TESTSIZE];
int * random = new int[TESTSIZE];
__gnu_cxx::iota(linear, linear+TESTSIZE, 0);
__gnu_cxx::iota(random, random+TESTSIZE, 0);
random_shuffle(random, random+TESTSIZE);

StrOfArrays strofarrays(TESTSIZE);
mystruct * arrayofstrs = new mystruct[TESTSIZE];

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    arrayofstrs[linear[i]].first += 1;
    arrayofstrs[linear[i]].second += 2;
    arrayofstrs[linear[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    strofarrays.firsts[linear[i]] += 1;
    strofarrays.seconds[linear[i]] += 2;
    strofarrays.thirds[linear[i]] += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming struct of arrays\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    arrayofstrs[random[i]].first += 1;
    arrayofstrs[random[i]].second += 2;
    arrayofstrs[random[i]].third += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    strofarrays.firsts[random[i]] += 1;
    strofarrays.seconds[random[i]] += 2;
    strofarrays.thirds[random[i]] += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing struct of arrays\n", t2-t1);



Results on Opteron 8378:

TESTSIZE=1 million
0.013407 seconds elapsed for streaming array of structs
0.009285 seconds elapsed for streaming struct of arrays
0.050337 seconds elapsed for randomly accessing array of structs
0.079531 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=10 million
0.135137 seconds elapsed for streaming array of structs
0.129781 seconds elapsed for streaming struct of arrays
0.579303 seconds elapsed for randomly accessing array of structs
1.257762 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=100 million
1.348777 seconds elapsed for streaming array of structs
0.917530 seconds elapsed for streaming struct of arrays
12.087753 seconds elapsed for randomly accessing array of structs
32.038711 seconds elapsed for randomly accessing struct of arrays


As seen, for basic streaming (sequential access), the claim that array of structs is superior does not hold, at least on our Opteron. However, when we move to random access, you're better off opting for an array-of-structs approach. Translating this to a real application:
If you're doing a sparse matrix x dense vector multiply using triples (row_id, col_id, num_val), stick to your struct of arrays.
If you're doing a sparse matrix x sparse matrix multiply, then you're better off with array of structs, because this time you have a lot of random access.
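
To tie this back to the sparse matrix case, the two layouts for triples would look roughly like this (the field names follow the ones above; the sketch only shows the declarations):

// Array of structs: one triple per nonzero; the layout of choice when the
// access pattern is mostly random, e.g. sparse matrix x sparse matrix multiply.
struct Triple
{
    int row_id;
    int col_id;
    double num_val;
};
// Triple * nonzeros = new Triple[nnz];

// Struct of arrays: three parallel arrays; the layout of choice when you
// stream through the nonzeros, e.g. sparse matrix x dense vector multiply.
struct Triples
{
    int * row_ids;
    int * col_ids;
    double * num_vals;
};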