Wednesday, December 30, 2009

Debugging MPI

Parallel debugging has come a long way. Today, it's quite possible to forward X over ssh (ssh -X) and remotely debug a parallel MPI application running on a supercomputer. TotalView and DDT are just two example debuggers; their IDEs will feel familiar to a seasoned C/C++ programmer.

However, debugging MPI is still painful because it rarely does what it claims to do. When there is a crash, you only get to see the crash. The MPI function in which the crash occurred will probably never return an error code, because the default error handler aborts the application first. It is claimed that you can override this behavior in C++ by setting the error handler to MPI::ERRORS_THROW_EXCEPTIONS, but I have yet to see that work in practice.
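For reference, what the MPI-2 C++ bindings specify looks roughly like this (a sketch; whether your implementation actually honors it is, as I said, another matter):

#include <mpi.h>
#include <iostream>

// Sketch: replace the default ERRORS_ARE_FATAL handler on the communicator
MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
try {
    // ... MPI calls that might fail ...
} catch (MPI::Exception & e) {
    std::cerr << "MPI error " << e.Get_error_code() << ": "
              << e.Get_error_string() << std::endl;
}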

What's left for me is to peek into the internals of the MPI objects to see whether any fields (or the whole object) are uninitialized. But sometimes there is really no such thing as an MPI object as we know it; what you hold is just a handle. For example, here is the implementation in MPICH2 and MVAPICH2:


// in file mpi.h
typedef int MPI_Win;

// in file mpiimpl.h
typedef struct MPID_Win {
    int handle;                   /* value of MPI_Win for this structure */
    volatile int ref_count;
    int fence_cnt;                /* 0 = no fence has been called; 1 = fence has been called */
    MPID_Errhandler *errhandler;  /* Pointer to the error handler structure */
    void *base;
    MPI_Aint size;
    int disp_unit;                /* Displacement unit of *local* window */
    MPID_Attribute *attributes;
    MPID_Group *start_group_ptr;  /* group passed in MPI_Win_start */
    int start_assert;             /* assert passed to MPI_Win_start */
    MPI_Comm comm;                /* communicator of window (dup) */
    ...
    char name[MPI_MAX_OBJECT_NAME];
} MPID_Win;


// in file mpicxx.h
class Win {
protected:
    MPI_Win the_real_win;

public:
    inline Win(MPI_Win obj) : the_real_win(obj) {}
    inline Win(void) : the_real_win(MPI_WIN_NULL) {}
    virtual ~Win() {}

    Win(const Win &obj) : the_real_win(obj.the_real_win) {}
    Win& operator=(const Win &obj) {
        the_real_win = obj.the_real_win; return *this; }

    // logical
    bool operator==(const Win &obj) {
        return (the_real_win == obj.the_real_win); }
    bool operator!=(const Win &obj) {
        return (the_real_win != obj.the_real_win); }
    // C/C++ cast and assignment
    inline operator MPI_Win*() { return &the_real_win; }
    inline operator MPI_Win() const { return the_real_win; }
    Win& operator=(const MPI_Win& obj) {
        the_real_win = obj; return *this; }
    ...
};



Prior to any operation, the real objects are retrieved via these handles:

/* Convert MPI object handles to object pointers */
MPID_Win_get_ptr( win, win_ptr );


where


#define MPID_Win_get_ptr(a,ptr) MPID_Get_ptr(Win,a,ptr)

/* Convert handles to objects for MPI types that do _not_ have any predefined objects */
#define MPID_Get_ptr(kind,a,ptr)                                          \
{                                                                         \
    switch (HANDLE_GET_KIND(a)) {                                         \
        case HANDLE_KIND_DIRECT:                                          \
            ptr = MPID_##kind##_direct + HANDLE_INDEX(a);                 \
            break;                                                        \
        case HANDLE_KIND_INDIRECT:                                        \
            ptr = ((MPID_##kind*)                                         \
                   MPIU_Handle_get_ptr_indirect(a, &MPID_##kind##_mem));  \
            break;                                                        \
        case HANDLE_KIND_INVALID:                                         \
        case HANDLE_KIND_BUILTIN:                                         \
        default:                                                          \
            ptr = 0;                                                      \
            break;                                                        \
    }                                                                     \
}



So what's the story behind this design?

MPI Opaque objects such as 'MPI_Comm' or 'MPI_Datatype' are specified by integers (in the MPICH2 implementation); the MPI standard calls these handles. Out of range values are invalid; the value 0 is reserved. For most (with the possible exception of 'MPI_Request' for performance reasons) MPI Opaque objects, the integer encodes both the kind of object (allowing runtime tests to detect a datatype passed where a communicator is expected) and important properties of the object. Even the 'MPI_xxx_NULL' values should be encoded so that different null handles can be distinguished. The details of the encoding of the handles are covered in more detail in the MPICH2 Design Document. For the most part, the ADI uses pointers to the underlying structures rather than the handles themselves. However, each structure contains a 'handle' field that is the corresponding integer handle for the MPI object.
MPID objects (objects used within the implementation of MPI) are not opaque.
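To make the idea concrete, here is a rough illustration of such an encoding; this is only a sketch of the concept, NOT the actual bit layout MPICH2 uses (that is spelled out in the design document linked below):

// Illustration only: pack an object "kind" into the high bits of the handle
// and an index into a table of real objects into the low bits.
enum { KIND_SHIFT = 28, INDEX_MASK = 0x0FFFFFFF };

inline int make_handle(int kind, int index) { return (kind << KIND_SHIFT) | (index & INDEX_MASK); }
inline int handle_kind(int handle)          { return (handle >> KIND_SHIFT) & 0xF; }
inline int handle_index(int handle)         { return handle & INDEX_MASK; }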


Here is the MPICH2 design document for the interested.


Anyway, if you were desperate enough to look at the design document, you'd see what kind of black magic is necessary to make sense of it. This might even make sense as a form of automatic encapsulation of data: one doesn't need to use classes and declare members private, because nobody can access the object without deciphering what it is first. Fortunately, there are alternative MPI implementations, such as OpenMPI, that use real (well, sort of) objects. Here is how MPI_Win is (partially) declared in OpenMPI:


struct ompi_win_t {
    opal_object_t w_base;
    opal_mutex_t  w_lock;

    /* Group associated with this window. */
    ompi_group_t *w_group;

    /* Information about the state of the window. */
    uint16_t w_flags;

    /* Error handling. This field does not have the "w_" prefix so that
       the OMPI_ERRHDL_* macros can find it, regardless of whether it's a
       comm, window, or file. */
    ompi_errhandler_t *error_handler;
    ompi_errhandler_type_t errhandler_type;

    /* displacement factor */
    int w_disp_unit;

    void  *w_baseptr;
    size_t w_size;

    /** Current epoch / mode (access, expose, lock, etc.). Checked by the
        argument checking code in the MPI layer, set by the OSC component.
        Modified without locking w_lock. */
    volatile uint16_t w_mode;

    /* one sided interface */
    ompi_osc_base_module_t *w_osc_module;
};




Once I call Post on a window, its w_mode becomes 34, which is 0x22 in hexadecimal, meaning the window has been posted and the exposure epoch has actually started. And after I successfully call Start on the same window, its w_mode becomes 99, which is 0x63 in hexadecimal, telling me that ACCESS_EPOCH and STARTED are now also set.


/* mode */
#define OMPI_WIN_ACCESS_EPOCH 0x00000001
#define OMPI_WIN_EXPOSE_EPOCH 0x00000002
#define OMPI_WIN_FENCE 0x00000010
#define OMPI_WIN_POSTED 0x00000020
#define OMPI_WIN_STARTED 0x00000040
#define OMPI_WIN_LOCK_ACCESS 0x00000080
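
With those flag values in hand, a small helper (my own, not part of Open MPI) makes the raw numbers readable:

#include <cstdio>

// Decode a w_mode value into the flag names defined above
void print_win_mode(unsigned int mode)
{
    if (mode & OMPI_WIN_ACCESS_EPOCH) std::printf("ACCESS_EPOCH ");
    if (mode & OMPI_WIN_EXPOSE_EPOCH) std::printf("EXPOSE_EPOCH ");
    if (mode & OMPI_WIN_FENCE)        std::printf("FENCE ");
    if (mode & OMPI_WIN_POSTED)       std::printf("POSTED ");
    if (mode & OMPI_WIN_STARTED)      std::printf("STARTED ");
    if (mode & OMPI_WIN_LOCK_ACCESS)  std::printf("LOCK_ACCESS ");
    std::printf("\n");
}

// print_win_mode(34) -> "EXPOSE_EPOCH POSTED"
// print_win_mode(99) -> "ACCESS_EPOCH EXPOSE_EPOCH POSTED STARTED"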



Just to decipher more... (you should never proceed this far unless you are 100% sure that it's the MPI implementation's bug, not yours)
w_osc_module contains all the necessary function pointers implemented by the OSC (one-sided communication) component.
For example, the post function looks like the following:



// in file osc_pt2pt_sync.c
int ompi_osc_pt2pt_module_post(ompi_group_t *group, int assert, ompi_win_t *win)
{
    int i;
    ompi_osc_pt2pt_module_t *module = P2P_MODULE(win);

    OBJ_RETAIN(group);
    ompi_group_increment_proc_count(group);

    OPAL_THREAD_LOCK(&(module->p2p_lock));
    assert(NULL == module->p2p_pw_group);
    module->p2p_pw_group = group;

    /* Set our mode to expose w/ post */
    ompi_win_remove_mode(win, OMPI_WIN_FENCE);
    ompi_win_append_mode(win, OMPI_WIN_EXPOSE_EPOCH | OMPI_WIN_POSTED);

    /* list how many complete counters we're still waiting on */
    module->p2p_num_complete_msgs +=
        ompi_group_size(module->p2p_pw_group);
    OPAL_THREAD_UNLOCK(&(module->p2p_lock));

    /* send a hello counter to everyone in group */
    for (i = 0; i < ompi_group_size(module->p2p_pw_group); ++i) {
        ompi_osc_pt2pt_control_send(module, ompi_group_peer_lookup(group, i),
                                    OMPI_OSC_PT2PT_HDR_POST, 1, 0);
    }
    return OMPI_SUCCESS;
}



In reality, some of these functions do very little communication. For example, all Start does is find the rank of each process within the specified group, store those indices, and set the true/false flags in the active-ranks table.

One thing to notice is the P2P_MODULE() call, which basically casts the w_osc_module pointer to type ompi_osc_pt2pt_module_t *, which has the following interesting components (ok, interesting from a general active-target synchronization perspective):

// inside ompi/mca/osc/pt2pt/osc_pt2pt.h
struct ompi_osc_pt2pt_module_t {
    /** Extend the basic osc module interface */
    ompi_osc_base_module_t super;   // ABAB: See the ad-hoc inheritance? - Lovely!

    /** pointer back to window */
    ompi_win_t *p2p_win;

    /** communicator created with this window */
    ompi_communicator_t *p2p_comm;

    /** control message receive request */
    struct ompi_request_t *p2p_cb_request;

    opal_list_t p2p_pending_control_sends;

    /** list of ompi_osc_pt2pt_sendreq_t structures, and includes all
        requests for this access epoch that have not already been started.
        p2p_lock must be held when modifying this field. */
    opal_list_t p2p_pending_sendreqs;

    /** list of unsigned int counters for the number of requests to a
        particular rank in p2p_comm for this access epoch. p2p_lock must
        be held when modifying this field. */
    unsigned int *p2p_num_pending_sendreqs;

    /** For MPI_Fence synchronization, the number of messages to send in
        epoch. For Start/Complete, the number of updates for this Complete.
        For lock, the number of messages waiting for completion on the
        origin side. Not protected by p2p_lock - must use atomic counter
        operations. */
    volatile int32_t p2p_num_pending_out;

    /** For MPI_Fence synchronization, the number of expected incoming
        messages. For Post/Wait, the number of expected updates from
        complete. For lock, the number of messages on the passive side we
        are waiting for. Not protected by p2p_lock - must use atomic
        counter operations. */
    volatile int32_t p2p_num_pending_in;

    /** Number of "ping" messages from the remote post group we've received */
    volatile int32_t p2p_num_post_msgs;

    /** Number of "count" messages from the remote complete group we've received */
    volatile int32_t p2p_num_complete_msgs;

    ...

    /********************** PWSC data *************************/
    struct ompi_group_t *p2p_pw_group;
    struct ompi_group_t *p2p_sc_group;
    bool *p2p_sc_remote_active_ranks;
    int  *p2p_sc_remote_ranks;

    ...
};
typedef struct ompi_osc_pt2pt_module_t ompi_osc_pt2pt_module_t;

Wednesday, July 29, 2009

Initializations in C++

Initialization has always been a vague part of my C++ knowledge. Although I always knew that most user-defined classes have appropriate constructors, so that I can expect reasonable behavior (e.g. std::string initializes to the empty string ""), this hasn't been the case for dynamically allocated arrays of built-in types.

What do you expect from this code:

// needs <algorithm>, <iterator> and <iostream>
float * test = new float[10];
copy(test, test+10, ostream_iterator<float>(cout, " "));


Honestly, I expect anything to happen, though it sometimes behaves reasonably and spits out zeros. As a matter of fact, valgrind will list hundreds of errors for this tiny piece of code, most of which are of the form "Conditional jump or move depends on uninitialised value". The way to make sure everything is zero-initialized is to use std::fill or std::fill_n (but please, no memset) after the allocation. On the other hand, this is not always possible.
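For the simple snippet above, the std::fill_n fix is a one-liner (a sketch; <algorithm> is the only extra include needed). The interesting case is the templated one that follows:

#include <algorithm>

float * test = new float[10];
std::fill_n(test, 10, 0.0f);   // every element is now a well-defined 0.0f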

Assume you have a templated function which takes an array of T's, where T can be pretty much anything. If it is a non-POD (plain old data) class, you can look at the documentation and make sure the default constructor does what you want it to do (in my case, assignment to zero). But how do you handle built-in types without writing a specialization to cover them? The std::fill trick won't work, since filling with zero might not make sense for the non-POD class you're dealing with.

Here is the loophole:


float * test = new float[10]();
copy(test, test+10, ostream_iterator<float>(cout, " "));


Note the empty set of parentheses as the initializer: it makes the array elements value-initialized, and the C++ standard says that a value-initialized POD type is zero-initialized. Neat, huh? (By the way, valgrind reports 0 errors this time.)
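So in generic code, a hypothetical helper (not from the original post, just a sketch) can rely on this and handle PODs and non-PODs uniformly:

#include <string>
#include <cstddef>

// Sketch: value-initialization works for both built-in and class types
template <typename T>
T * allocate_zeroed(std::size_t n)
{
    return new T[n]();   // PODs become zero, class types use their default constructors
}

// usage
float * fdata       = allocate_zeroed<float>(10);        // ten 0.0f values
std::string * sdata = allocate_zeroed<std::string>(10);  // ten empty strings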

SSE and auto-vectorization in g++

I have been implementing an SpMV with multiple right-hand-side vectors (and hence multiple left-hand-side vectors). For each A[i][j], that translates into performing the following (assuming that the rhs & lhs vectors are represented as tuples of std::vectors):

get<0>(lvecs)[i] += A[i][j] * get<0>(rvecs)[j];
get<1>(lvecs)[i] += A[i][j] * get<1>(rvecs)[j];
get<2>(lvecs)[i] += A[i][j] * get<2>(rvecs)[j];
get<3>(lvecs)[i] += A[i][j] * get<3>(rvecs)[j];
...




I initially thought about 3-4 different ways of implementing this. One is just to rely on the automatic vectorization capabilities of g++ and hint the compiler to unroll the loop using templates, like this:

const int BETA = 16;   // number of right-hand-side vectors

template <int D, typename T>
void saxpy(T a, T * __restrict b, T * __restrict c)
{
    for(int i=0; i<D; ++i)
    {
        c[i] += a * b[i];
    }
}

int a = 2;
int b[BETA];
int c[BETA];
saxpy<BETA>(a, b, c);



Performing saxpy 1 million times with BETA=16 takes about 9.97 milliseconds on our Opteron and 14.35 milliseconds on my Macbook's Core2Duo processor. These are the best numbers I got, using the -O3 optimization flag.

Then I said, let's try to force some SSE through gcc vector extensions:

typedef int vpackedsi __attribute__ ((vector_size (BETA*sizeof(int)))); // 64 bytes, a full cache line!
union ipackedvector
{
    vpackedsi v;
    int f[BETA];
};

ipackedvector av, bv, cv;
cv.v += av.v * bv.v;


Of course, the drawback here is that a also needs to be a vector, since operations between scalar and vector types are not allowed (you'll get "error: invalid operands of types 'int' and 'int __vector__' to binary 'operator*'"). But I assume the automatic vectorization in gcc does the same packing behind the covers (if it is using SSE/SIMD at all).
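For completeness, the packing I did by hand looks like this (a sketch using the union declared above):

ipackedvector av, bv, cv;
for (int i = 0; i < BETA; ++i)
    av.f[i] = a;          // broadcast the scalar a into all BETA lanes
cv.v += av.v * bv.v;      // now a genuine vector * vector multiply-add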

Unfortunately, the second approach was slower: 16.712 milliseconds on the Opteron and 24.18 milliseconds on the Core2Duo. Roughly 67% slower on both architectures (which kind of surprised me). After taking a look at the generated assembly code, I see that the latter approach uses SSE4 instructions such as pmuldq, while the former uses floating-point operations such as addsd, even though the code is supposed to be working on integers. Maybe it is using the FPU pipelines for better performance; gcc must be smarter than I am.

Friday, July 24, 2009

C++ Performance for HPC, part #1

I deliberately chose this controversial title as it is one of the most debated issues. I also avoided saying C/C++ as no such language exists. C is C, and C++ is C++. I hate the famous "do one simple run, measure it, draw far-reaching conclusions from that simple experiment" cycle.

This basic example is about the debate over array of structs vs. struct of arrays. You'll hear that an array-of-structs approach is superior if you are accessing all the entries of the struct together. This is basically because when you access StrArray[i].first, the other entries of the struct (say StrArray[i].second and StrArray[i].third) will be fetched into the cache automatically, and you'll likely avoid a cache miss when you access them in the next line. However, this only makes sense if you're doing random accesses into the array. If you're simply streaming through the array, the pressure on the memory subsystem and the caches is the same either way, because nothing that you bring into the cache is wasted: spatial locality is great when you access data sequentially. In that case, the winner of the battle between array of structs and struct of arrays will be determined by lots of other issues such as array length (are there any conflict misses?), cache sizes, replacement policy, compiler, etc. Here is my test:


struct mystruct
{
    double first;
    double second;
    double third;
};

struct StrOfArrays
{
    StrOfArrays(int size)
    {
        firsts  = new double[size];
        seconds = new double[size];
        thirds  = new double[size];
    }
    ~StrOfArrays()
    {
        DeleteAll(firsts, seconds, thirds);   // DeleteAll is a small helper that deletes all three arrays
    }
    double * firsts;
    double * seconds;
    double * thirds;
};

// needs <sys/time.h>, <cstdio>, <algorithm> and <ext/numeric> (for __gnu_cxx::iota)
struct timeval tim;
double t1, t2;

int * linear = new int[TESTSIZE];
int * random = new int[TESTSIZE];
__gnu_cxx::iota(linear, linear+TESTSIZE, 0);
__gnu_cxx::iota(random, random+TESTSIZE, 0);
random_shuffle(random, random+TESTSIZE);

StrOfArrays strofarrays(TESTSIZE);
mystruct * arrayofstrs = new mystruct[TESTSIZE];

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    arrayofstrs[linear[i]].first  += 1;
    arrayofstrs[linear[i]].second += 2;
    arrayofstrs[linear[i]].third  += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    strofarrays.firsts[linear[i]]  += 1;
    strofarrays.seconds[linear[i]] += 2;
    strofarrays.thirds[linear[i]]  += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for streaming struct of arrays\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    arrayofstrs[random[i]].first  += 1;
    arrayofstrs[random[i]].second += 2;
    arrayofstrs[random[i]].third  += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing array of structs\n", t2-t1);

gettimeofday(&tim, NULL);
t1 = tim.tv_sec + (tim.tv_usec/1000000.0);
for(int i=0; i<TESTSIZE; ++i)
{
    strofarrays.firsts[random[i]]  += 1;
    strofarrays.seconds[random[i]] += 2;
    strofarrays.thirds[random[i]]  += 3;
}
gettimeofday(&tim, NULL);
t2 = tim.tv_sec + (tim.tv_usec/1000000.0);
printf("%.6lf seconds elapsed for randomly accessing struct of arrays\n", t2-t1);



Results on Opteron 8378:

TESTSIZE=1 million
0.013407 seconds elapsed for streaming array of structs
0.009285 seconds elapsed for streaming struct of arrays
0.050337 seconds elapsed for randomly accessing array of structs
0.079531 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=10 million
0.135137 seconds elapsed for streaming array of structs
0.129781 seconds elapsed for streaming struct of arrays
0.579303 seconds elapsed for randomly accessing array of structs
1.257762 seconds elapsed for randomly accessing struct of arrays

TESTSIZE=100 million
1.348777 seconds elapsed for streaming array of structs
0.917530 seconds elapsed for streaming struct of arrays
12.087753 seconds elapsed for randomly accessing array of structs
32.038711 seconds elapsed for randomly accessing struct of arrays


As seen, for basic streaming (sequential access), the claim that array of structs is superior is not true, at least on our Opteron. However, when we move to random access, you're better off opting for an array-of-structs approach. Translating this to a real application:
If you're doing a sparse matrix x dense vector multiply using triples (row_id, col_id, num_val), stick to your struct of arrays.
If you're doing a sparse matrix x sparse matrix multiply, then you're better off with array of structs, because this time you have a lot of random access.

Monday, June 29, 2009

Accessor functions

In C++, a temporary cannot be passed by non-const reference. If you are slightly inexperienced and (for the sake of encapsulation) write your accessors like this:

class Object
{
    ....
    MPI::Intracomm GetRowWorld() { return rowWorld; }
    MPI::Intracomm GetColWorld() { return colWorld; }
    ...
};


You'll be in deep trouble when you want to pass these return values to a function that takes non-const references, such as this:

void foo(MPI::Intracomm & rowcomm, MPI::Intracomm & colcomm)
{
    ...
}

Object A;
foo(A.GetRowWorld(), A.GetColWorld());
// Won't compile (and you should be happy about it)



So, what's the fix? You can try to change the signature of foo:
void foo(const MPI::Intracomm & rowcomm, const MPI::Intracomm & colcomm);

You might think this is smarter, as it compiles, but now you are in even deeper trouble: your accessor functions return by value, so the const references inside foo are bound to temporary copies rather than to the object's own communicators. Any changes you intended to make to the original object are lost, and if foo keeps those references around past the call, they dangle once the temporaries go out of scope. You might try to change the signature to this:
void foo(MPI::Intracomm rowcomm, MPI::Intracomm colcomm);

This works, but it might be terribly inefficient if MPI::Intracomm is a big object, because you're creating temporaries every time you pass by value. Just leave the signature as is and go fix your class definition instead:

class Object
{
    ....
    MPI::Intracomm & GetRowWorld() { return rowWorld; }
    MPI::Intracomm & GetColWorld() { return colWorld; }
    MPI::Intracomm GetRowWorld() const { return rowWorld; }
    MPI::Intracomm GetColWorld() const { return colWorld; }
    ...
};
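
With the reference-returning overloads in place, the original call site works unchanged (a quick sketch):

Object A;
foo(A.GetRowWorld(), A.GetColWorld());   // compiles: the non-const references bind to A's own members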

Thursday, June 18, 2009

static_cast and CRTP

I have been using CRTP (curiously recurring template pattern) for my Sparse Matrix classes for a number of reasons:

1) Performance, as this is a high-performance computing issue.
2) Parameter type checking without the need to inspect RTTI (run-time type information).

Here is the deal:

template <class DER>
class SpMat
{ ... };

template <class DER>
template <typename SR>
void SpMat<DER>::SpGEMM(SpMat<DER> & A, SpMat<DER> & B, ...)
{
    // check for conformance, etc.
    ...
    static_cast<DER*>(this)->template PlusEq_AtXBt<SR>(static_cast<DER>(A), static_cast<DER>(B));
    ...
}

class SpDCCols: public SpMat<SpDCCols>
{
public:
    template <typename SR>
    int PlusEq_AtXBt(const SpDCCols & A, const SpDCCols & B);
    ....
};


This won't compile, because static_cast<DER>(A) fails: no conversion from SpMat<SpDCCols> to SpDCCols exists, even though the latter class is derived from the former. This also has to do with the semantics of the static_cast operator:
Stroustrup says "For a built-in type T, T(e) is equivalent to static_cast<T>(e)".

So, are we going to provide a conversion constructor? Of course not. One can convert a pointer/reference to a type A into a pointer/reference to a type B with static_cast if A is a base class of B.
Therefore, the immediate solution is to cast to a reference rather than to a concrete object, which also avoids the unnecessary conversion (and copy) and makes the code faster:
static_cast<DER &>(A)
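
Here is a stripped-down illustration of that reference cast in isolation (names are made up for the sketch, not the actual SpMat/SpDCCols code):

template <class DER>
class Base
{
public:
    void Run() { static_cast<DER &>(*this).RunImpl(); }   // reference cast: no copy, no conversion constructor needed
};

class Derived : public Base<Derived>
{
public:
    void RunImpl() { /* concrete work goes here */ }
};

int main()
{
    Derived d;
    d.Run();   // statically dispatches to Derived::RunImpl, no virtual call
    return 0;
}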

Tuesday, June 16, 2009

Template argument deduction

C++ has an excellent mechanism for deducing template arguments. However, it is not free of pitfalls. Suppose you are writing a library function to perform a matrix-vector multiplication on a particular semiring (which is defined as a class with static ::multiply and ::add functions):

// Perform y <- y + Ax
template <typename T, typename SR>
void MultiplyAdd(Matrix<T> & A, Vector<T> & x, Vector<T> & y)
{
    ....   // usual conformance checks
    for(int i=0; i < A.m; ++i) {
        for(int j=0; j < A.n; ++j) {
            T temp = SR::multiply(A[i][j], x[j]);
            y[i] = SR::add(temp, y[i]);
        }
    }
}


Here, the template argument T can be deduced from the function arguments, but SR cannot. If you release your library like this, people will need to spell out every argument explicitly:
MultiplyAdd<int, minplus>(A, x, y);

which is kind of annoying. Normally, you'd like to list the non-deducible arguments first, so that the user can get away with specifying only those and let the compiler deduce the rest:
MultiplyAdd<minplus>(A, x, y);
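
In other words, a friendlier declaration might look like this (a sketch reusing the names above):

// Non-deducible SR comes first; T is still deduced from the arguments
template <typename SR, typename T>
void MultiplyAdd(Matrix<T> & A, Vector<T> & x, Vector<T> & y);

// call site: only the semiring needs to be spelled out
MultiplyAdd<minplus>(A, x, y);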

----------------------

My second point is about using function pointers in STL algorithm calls. If you are passing a function template as a function pointer, don't expect the compiler to deduce its template arguments for you. Example:

template <typename T>
bool compare_second(pair<T,T> t1, pair<T,T> t2)
{
    return t1.second < t2.second;
}

const int N = 100;
pair<int,int> pairarray[N];
....   // somehow populate the array
sort(pairarray, pairarray + N, compare_second);   // cryptic compiler error!


The correct way to do it is obviously:
sort(pairarray, pairarray +N, compare_second<int>);
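
Alternatively, a small function object with a templated operator() sidesteps the problem entirely, since nothing needs to be deduced at the call site (a sketch, not from the original post):

struct CompareSecond
{
    template <typename T>
    bool operator()(const pair<T,T> & t1, const pair<T,T> & t2) const
    {
        return t1.second < t2.second;
    }
};

sort(pairarray, pairarray + N, CompareSecond());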