Wednesday, January 16, 2008

Heap Memory in Multithreaded NUMA Programs

OK, here is the deal: you want to parallelize your application (say, matrix multiplication) and you want asynchronous computation. If you have a shared-memory NUMA machine, threads seem to be the obvious choice. The caveat is memory management. Calls to "new" and "delete" tend to be serialized, because the heap is shared among all threads and the allocator has to lock it. Also, and probably more costly, there is memory contention, false sharing, etc.
I tried the Hoard library; here is what happened:
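To see the allocator cost in isolation, a small stress test along these lines could help. It is only a sketch, not the code behind the numbers in this post; `alloc_worker`, `time_alloc_stress`, and the batch size of 64 are made-up names and values.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: allocates a batch of small blocks, then frees them,
// so almost all of its run time is spent inside the allocator.
void alloc_worker(std::size_t iterations, std::size_t block_bytes) {
    assert(block_bytes > 0);
    constexpr std::size_t kBatch = 64;  // arbitrary batch size (assumption)
    std::vector<char*> blocks(kBatch, nullptr);
    for (std::size_t i = 0; i < iterations; ++i) {
        for (char*& p : blocks) {
            p = new char[block_bytes];
            p[0] = 1;  // touch the block so it is really used
        }
        for (char*& p : blocks) {
            delete[] p;
            p = nullptr;
        }
    }
}

// Runs alloc_worker on num_threads threads at once and returns the
// elapsed wall-clock time in seconds (thread creation included).
double time_alloc_stress(unsigned num_threads, std::size_t iterations,
                         std::size_t block_bytes) {
    assert(num_threads > 0);
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back(alloc_worker, iterations, block_bytes);
    for (auto& th : threads) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling `time_alloc_stress` with 1 thread and then with 16, once with the default allocator and once with Hoard loaded, would show how much of the difference comes from the allocator alone.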

Without Hoard, using the plain GNU C++ library (which might be using ptmalloc?):


Loading Matrices
Loaded !
Loading took 14.019800 seconds
Multiplications started
Transposition took 0.423375 seconds
Multiplication took 4.816739 seconds
Retransposition took 0.486832 seconds
Multiplications finished
8.713002 seconds elapsed (including thread creation cost)


Using Hoard:

Loading Matrices
Loaded !
Loading took 2.489643 seconds
Multiplications started
Transposition took 0.463726 seconds
Multiplication took 5.337530 seconds
Retransposition took 0.493123 seconds
Multiplications finished
9.304911 seconds elapsed (including thread creation cost)


What's wrong here? Hoard makes loading the matrices much faster (2.5 seconds instead of 14), but it slows down the multiplication at the same time (5.3 seconds instead of 4.8), and the elapsed time reported for the multiplication run goes from 8.7 to 9.3 seconds.
Note that the code uses 16 threads on a 16-core machine.
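For reference, the phases in the output above (transpose, multiply with several threads, transpose back) have roughly the following shape. This is a minimal sketch and not the actual benchmark code: `Matrix`, `transpose`, and `multiply_threaded` are hypothetical names, and the sketch transposes into a copy, so it has no retransposition step.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n x n (assumed layout)

// Returns the transpose of a row-major n x n matrix.
Matrix transpose(const Matrix& m, std::size_t n) {
    assert(m.size() == n * n);
    Matrix t(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            t[j * n + i] = m[i * n + j];
    return t;
}

// Computes C = A * B. Each thread owns a contiguous band of rows of C.
Matrix multiply_threaded(const Matrix& a, const Matrix& b, std::size_t n,
                         unsigned num_threads) {
    assert(num_threads > 0 && a.size() == n * n && b.size() == n * n);
    const Matrix bt = transpose(b, n);  // both operands are now read row-wise
    Matrix c(n * n, 0.0);
    auto band = [&](std::size_t row_begin, std::size_t row_end) {
        for (std::size_t i = row_begin; i < row_end; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;  // accumulate locally, write C once
                for (std::size_t k = 0; k < n; ++k)
                    sum += a[i * n + k] * bt[j * n + k];
                c[i * n + j] = sum;
            }
    };
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = n * t / num_threads;
        std::size_t end = n * (t + 1) / num_threads;
        threads.emplace_back(band, begin, end);
    }
    for (auto& th : threads) th.join();
    return c;
}
```

With this layout, threads write to disjoint row bands and keep the running sum in a local variable, so the only cache lines two threads can both write are the ones straddling a band boundary. Where the allocator places the matrices in memory can still affect the multiplication phase.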
