Wednesday, December 30, 2009

Debugging MPI

Parallel debugging has come a long way. Today it's quite possible to use ssh -X and remotely debug a parallel MPI application running on a supercomputer. TotalView and DDT are just two example debuggers, and their IDEs will feel familiar to a seasoned C/C++ programmer.

However, debugging MPI is still painful, because MPI rarely does what it claims to do. When there is a crash, you only get to see the crash. The MPI function in which the crash occurred will probably never return an error code, because the default error handler aborts the application first. It is claimed that you can override this default behavior in C++ by setting the error handler to MPI::ERRORS_THROW_EXCEPTIONS, but I have yet to see that happen in practice.
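
For what it's worth, here is roughly what that is supposed to look like with the MPI-2 C++ bindings. This is just a minimal sketch of mine (the bogus destination rank is only there to force an error), assuming the implementation actually honors the handler:

// Minimal sketch: ask the C++ bindings to throw MPI::Exception instead of
// aborting, then catch it. Whether your MPI actually honors this is
// another question entirely.
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);
    MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);

    try {
        // deliberately bogus destination rank to force an error
        int dummy = 0;
        MPI::COMM_WORLD.Send(&dummy, 1, MPI::INT, -42, 0);
    } catch (MPI::Exception& e) {
        std::cerr << "MPI error " << e.Get_error_code() << ": "
                  << e.Get_error_string() << std::endl;
    }

    MPI::Finalize();
    return 0;
}

Compile with mpicxx and run under mpirun; with the MPICH2 or OpenMPI builds of this era, your mileage may vary, which is rather the point.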

What's left for me is to peek into the internals of the MPI objects to see whether any fields (or the whole object) are uninitialized. But sometimes there is no such thing as an MPI object as we know it; it is just a handle. For example, here is the implementation in MPICH2 and MVAPICH2:


// in file mpi.h
typedef int MPI_Win;

// in file mpiimpl.h
typedef struct MPID_Win {
    int handle;                   /* value of MPI_Win for this structure */
    volatile int ref_count;
    int fence_cnt;                /* 0 = no fence has been called;
                                     1 = fence has been called */
    MPID_Errhandler *errhandler;  /* Pointer to the error handler structure */
    void *base;
    MPI_Aint size;
    int disp_unit;                /* Displacement unit of *local* window */
    MPID_Attribute *attributes;
    MPID_Group *start_group_ptr;  /* group passed in MPI_Win_start */
    int start_assert;             /* assert passed to MPI_Win_start */
    MPI_Comm comm;                /* communicator of window (dup) */
    ...
    char name[MPI_MAX_OBJECT_NAME];
} MPID_Win;


// in file mpicxx.h
class Win {
protected:
    MPI_Win the_real_win;

public:
    inline Win(MPI_Win obj) : the_real_win(obj) {}
    inline Win(void) : the_real_win(MPI_WIN_NULL) {}
    virtual ~Win() {}

    Win(const Win &obj) : the_real_win(obj.the_real_win) {}
    Win& operator=(const Win &obj) {
        the_real_win = obj.the_real_win; return *this; }

    // logical
    bool operator==(const Win &obj) {
        return (the_real_win == obj.the_real_win); }
    bool operator!=(const Win &obj) {
        return (the_real_win != obj.the_real_win); }

    // C/C++ cast and assignment
    inline operator MPI_Win*() { return &the_real_win; }
    inline operator MPI_Win() const { return the_real_win; }
    Win& operator=(const MPI_Win& obj) {
        the_real_win = obj; return *this; }
    ...
};



Prior to any operation, the real objects are retrieved via these handles:

/* Convert MPI object handles to object pointers */
MPID_Win_get_ptr( win, win_ptr );


where


#define MPID_Win_get_ptr(a,ptr) MPID_Get_ptr(Win,a,ptr)

/* Convert handles to objects for MPI types that do _not_ have any
   predefined objects */
#define MPID_Get_ptr(kind,a,ptr)                                          \
{                                                                         \
    switch (HANDLE_GET_KIND(a)) {                                         \
        case HANDLE_KIND_DIRECT:                                          \
            ptr = MPID_##kind##_direct + HANDLE_INDEX(a);                  \
            break;                                                        \
        case HANDLE_KIND_INDIRECT:                                        \
            ptr = ((MPID_##kind*)                                          \
                   MPIU_Handle_get_ptr_indirect(a, &MPID_##kind##_mem));   \
            break;                                                        \
        case HANDLE_KIND_INVALID:                                         \
        case HANDLE_KIND_BUILTIN:                                         \
        default:                                                          \
            ptr = 0;                                                      \
            break;                                                        \
    }                                                                     \
}



So what's the story behind this design?

MPI opaque objects such as 'MPI_Comm' or 'MPI_Datatype' are specified by integers (in the MPICH2 implementation); the MPI standard calls these handles. Out-of-range values are invalid; the value 0 is reserved. For most MPI opaque objects (with the possible exception of 'MPI_Request', for performance reasons), the integer encodes both the kind of object (allowing runtime tests to detect a datatype passed where a communicator is expected) and important properties of the object. Even the 'MPI_xxx_NULL' values should be encoded so that different null handles can be distinguished. The details of the encoding of the handles are covered in more detail in the MPICH2 Design Document. For the most part, the ADI uses pointers to the underlying structures rather than the handles themselves. However, each structure contains a 'handle' field that is the corresponding integer handle for the MPI object.
MPID objects (objects used within the implementation of MPI) are not opaque.
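
To make that a bit more concrete, here is a toy decoder in the spirit of what MPICH2 does. The shift and mask values below are illustrative, not the real layout (that lives in the MPICH2 headers):

/* Toy decoder for an MPICH2-style handle. The shift/mask values are
 * illustrative; the authoritative ones are in the MPICH2 headers. */
#include <stdio.h>

#define KIND_SHIFT 30          /* top 2 bits: invalid/builtin/direct/indirect */
#define TYPE_SHIFT 26          /* next 4 bits: comm, datatype, win, ...        */
#define INDEX_MASK 0x03ffffff  /* low bits: index into the object table        */

static void decode_handle(int handle)
{
    unsigned u = (unsigned) handle;
    printf("handle 0x%08x: kind=%u type=%u index=%u\n",
           u, u >> KIND_SHIFT, (u >> TYPE_SHIFT) & 0xf, u & INDEX_MASK);
}

int main(void)
{
    decode_handle(0x44000000);  /* a builtin-looking handle under this layout */
    return 0;
}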


Here is the MPICH2 design document for those interested.


Anyway, if you were desperate enough to look at the design document, you'd see what kind of black magic is necessary to make sense of it. This might actually make sense as automatic encapsulation of data: you don't even need classes with private members, because no one can access the object without first deciphering what it is. Fortunately, there are alternative MPI implementations, such as OpenMPI, that use real (well, sort of) objects. Here is how MPI_Win is (partially) declared in OpenMPI:


struct ompi_win_t {
    opal_object_t w_base;
    opal_mutex_t  w_lock;

    /* Group associated with this window. */
    ompi_group_t *w_group;

    /* Information about the state of the window. */
    uint16_t w_flags;

    /* Error handling. This field does not have the "w_" prefix so that
       the OMPI_ERRHDL_* macros can find it, regardless of whether it's a
       comm, window, or file. */
    ompi_errhandler_t *error_handler;
    ompi_errhandler_type_t errhandler_type;

    /* displacement factor */
    int w_disp_unit;

    void *w_baseptr;
    size_t w_size;

    /** Current epoch / mode (access, expose, lock, etc.). Checked by the
        argument checking code in the MPI layer, set by the OSC component.
        Modified without locking w_lock. */
    volatile uint16_t w_mode;

    /* one sided interface */
    ompi_osc_base_module_t *w_osc_module;
};




Once I call Post on a window, its w_mode becomes 34, which is 0x22 in hexadecimal, meaning the window has been posted and the exposure epoch has actually started. And after I successfully call Start on the same window, its w_mode becomes 99, which is 0x63 in hexadecimal, telling me that ACCESS_EPOCH and STARTED are now also set.


/* mode */
#define OMPI_WIN_ACCESS_EPOCH 0x00000001
#define OMPI_WIN_EXPOSE_EPOCH 0x00000002
#define OMPI_WIN_FENCE 0x00000010
#define OMPI_WIN_POSTED 0x00000020
#define OMPI_WIN_STARTED 0x00000040
#define OMPI_WIN_LOCK_ACCESS 0x00000080
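
For reference, here is a tiny scratch-file helper of mine (not anything in OpenMPI) that, together with the defines above, turns a w_mode value into flag names:

/* A scratch-file helper (mine, not OpenMPI API) that decodes a w_mode
 * value into the flag names defined above. */
#include <stdio.h>

static void decode_w_mode(unsigned int mode)
{
    printf("w_mode = %u (0x%x):", mode, mode);
    if (mode & OMPI_WIN_ACCESS_EPOCH) printf(" ACCESS_EPOCH");
    if (mode & OMPI_WIN_EXPOSE_EPOCH) printf(" EXPOSE_EPOCH");
    if (mode & OMPI_WIN_FENCE)        printf(" FENCE");
    if (mode & OMPI_WIN_POSTED)       printf(" POSTED");
    if (mode & OMPI_WIN_STARTED)      printf(" STARTED");
    if (mode & OMPI_WIN_LOCK_ACCESS)  printf(" LOCK_ACCESS");
    printf("\n");
}

/* decode_w_mode(34) prints: w_mode = 34 (0x22): EXPOSE_EPOCH POSTED
 * decode_w_mode(99) prints: w_mode = 99 (0x63): ACCESS_EPOCH EXPOSE_EPOCH POSTED STARTED */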



Just to decipher more... (you should never proceed this far unless you are 100% sure that it's the MPI implementation's bug, not yours)
w_osc_module contains all the necessary function pointers implemented by the OSC (one-sided communication) component.
For example, the post function looks like the following:



// in file osc_pt2pt_sync.c
int ompi_osc_pt2pt_module_post(ompi_group_t *group, int assert, ompi_win_t *win)
{
    int i;
    ompi_osc_pt2pt_module_t *module = P2P_MODULE(win);

    OBJ_RETAIN(group);
    ompi_group_increment_proc_count(group);

    OPAL_THREAD_LOCK(&(module->p2p_lock));
    assert(NULL == module->p2p_pw_group);
    module->p2p_pw_group = group;

    /* Set our mode to expose w/ post */
    ompi_win_remove_mode(win, OMPI_WIN_FENCE);
    ompi_win_append_mode(win, OMPI_WIN_EXPOSE_EPOCH | OMPI_WIN_POSTED);

    /* list how many complete counters we're still waiting on */
    module->p2p_num_complete_msgs +=
        ompi_group_size(module->p2p_pw_group);
    OPAL_THREAD_UNLOCK(&(module->p2p_lock));

    /* send a hello counter to everyone in group */
    for (i = 0 ; i < ompi_group_size(module->p2p_pw_group) ; ++i) {
        ompi_osc_pt2pt_control_send(module, ompi_group_peer_lookup(group, i),
                                    OMPI_OSC_PT2PT_HDR_POST, 1, 0);
    }

    return OMPI_SUCCESS;
}



In reality, some of these functions do very little communication. For example, all Start does is find the rank of each process within the specified group, store those indices, and set the true/false flags in the active-ranks table.
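
Sketched out, that bookkeeping looks roughly like the following. This is my paraphrase of the behavior just described, with made-up names, not the actual OpenMPI source:

/* Sketch of the bookkeeping Start performs, per the description above.
 * Names are made up; the real code lives in the osc pt2pt component. */
#include <stdbool.h>

void start_bookkeeping(const int *group_to_comm_rank, /* e.g. via MPI_Group_translate_ranks */
                       int group_size, int comm_size,
                       int *remote_ranks, bool *remote_active_ranks)
{
    for (int r = 0; r < comm_size; ++r)
        remote_active_ranks[r] = false;

    for (int i = 0; i < group_size; ++i) {
        remote_ranks[i] = group_to_comm_rank[i];            /* store the index  */
        remote_active_ranks[group_to_comm_rank[i]] = true;  /* mark rank active */
    }
}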

One thing to notice is the P2P_MODULE() call, which basically casts the w_osc_module pointer to type ompi_osc_pt2pt_module_t *, which has the following interesting components (ok, interesting from a general active-target synchronization perspective):

// inside ompi/mca/osc/pt2pt/osc_pt2pt.h
struct ompi_osc_pt2pt_module_t {
    /** Extend the basic osc module interface */
    ompi_osc_base_module_t super;   // ABAB: See the ad-hoc inheritance? - Lovely!

    /** pointer back to window */
    ompi_win_t *p2p_win;

    /** communicator created with this window */
    ompi_communicator_t *p2p_comm;

    /** control message receive request */
    struct ompi_request_t *p2p_cb_request;

    opal_list_t p2p_pending_control_sends;

    /** list of ompi_osc_pt2pt_sendreq_t structures, and includes all
        requests for this access epoch that have not already been started.
        p2p_lock must be held when modifying this field. */
    opal_list_t p2p_pending_sendreqs;

    /** list of unsigned int counters for the number of requests to a
        particular rank in p2p_comm for this access epoch. p2p_lock must
        be held when modifying this field. */
    unsigned int *p2p_num_pending_sendreqs;

    /** For MPI_Fence synchronization, the number of messages to send in
        epoch. For Start/Complete, the number of updates for this Complete.
        For lock, the number of messages waiting for completion on the
        origin side. Not protected by p2p_lock - must use atomic counter
        operations. */
    volatile int32_t p2p_num_pending_out;

    /** For MPI_Fence synchronization, the number of expected incoming
        messages. For Post/Wait, the number of expected updates from
        complete. For lock, the number of messages on the passive side we
        are waiting for. Not protected by p2p_lock - must use atomic
        counter operations. */
    volatile int32_t p2p_num_pending_in;

    /** Number of "ping" messages from the remote post group we've received */
    volatile int32_t p2p_num_post_msgs;

    /** Number of "count" messages from the remote complete group we've received */
    volatile int32_t p2p_num_complete_msgs;

    ...

    /********************** PWSC data *************************/
    struct ompi_group_t *p2p_pw_group;
    struct ompi_group_t *p2p_sc_group;
    bool *p2p_sc_remote_active_ranks;
    int *p2p_sc_remote_ranks;

    ...
};
typedef struct ompi_osc_pt2pt_module_t ompi_osc_pt2pt_module_t;
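
As a closing aside, the P2P_MODULE() macro itself is presumably nothing more exotic than a cast of w_osc_module. This is a guess at its shape, not the actual definition; the cast only works because super, the base module, is the first member of the struct (the ad-hoc inheritance noted above):

/* Presumably something along these lines (a guess, not the real macro):
 * the cast is safe because ompi_osc_base_module_t is the first member. */
#define P2P_MODULE(win) ((ompi_osc_pt2pt_module_t*) (win)->w_osc_module)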