Development/Pahole
The main documentation is available on the cluster via |
Description | Content |
---|---|
module load | devel/pahole |
Availability | bwUniCluster |
License | GPL |
Citing | n/a |
Links | Homepage | Releases |
Graphical Interface | No |
Introduction
The poke-a-hole tool, pahole for short, is part of the dwarves tool set. It dissects data structures in binary object files, showing padding (space that is otherwise wasted) and data that crosses cache line boundaries. This allows optimization and performance analysis of data structures in user or kernel code.
Any C or Fortran developer should know this tool; the article The Lost Art of Structure Packing by Eric S. Raymond explains in detail why.
Documentation
There is currently no web documentation or tutorial. After loading the module, documentation is available in the man page:
$ man pahole
Usage
Dissecting data structures
pahole reports the padding and alignment of data structures. To see the complete data layout, pass the option -E, which expands nested data structures. For example, to inspect the kernel's data structure that describes every single task, use pahole -E task_struct:
struct task_struct {
    struct thread_info {
        long unsigned int flags; /* 0 8 */
        unsigned int status; /* 8 4 */
    } thread_info; /* 0 16 */
    /* XXX last struct has 4 bytes of padding */
    volatile long int state; /* 16 8 */
    void * stack; /* 24 8 */
    struct {
        int counter; /* 32 4 */
    } usage; /* 32 4 */
    unsigned int flags; /* 36 4 */
    unsigned int ptrace; /* 40 4 */
    /* XXX 4 bytes hole, try to pack */
    struct llist_node {
        struct llist_node * next; /* 48 8 */
    } wake_entry; /* 48 8 */
    int on_cpu; /* 56 4 */
    unsigned int cpu; /* 60 4 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    unsigned int wakee_flips; /* 64 4 */
    ...
}
This shows where the compiler had to insert padding to meet the architecture's alignment requirements as specified by the ABI. It also marks where the structure crosses cache line boundaries, which can lead to false sharing of cache lines in multi-threaded programs. You can use both pieces of information to re-lay out your data structures, minimizing their size and limiting cache thrashing.
Usage in own application
To use pahole on your own application, recompile it with the compiler option -g, so that the binary contains debug information. For example, running pahole $MPI_LIB_DIR/libmpi.so yields output such as the following (part of the Open MPI implementation):
struct ompi_communicator_t {
    opal_infosubscriber_t super; /* 0 96 */
    /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
    opal_mutex_t c_lock; /* 96 64 */
    /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
    char c_name[64]; /* 160 64 */
    /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
    ompi_comm_extended_cid_t c_contextid; /* 224 16 */
    ompi_comm_extended_cid_block_t c_contextidb; /* 240 32 */
    /* --- cacheline 4 boundary (256 bytes) was 16 bytes ago --- */
    uint32_t c_index; /* 272 4 */
    int c_my_rank; /* 276 4 */
    uint32_t c_flags; /* 280 4 */
    uint32_t c_assertions; /* 284 4 */
    int c_id_available; /* 288 4 */
    int c_id_start_index; /* 292 4 */
    uint32_t c_epoch; /* 296 4 */
    /* XXX 4 bytes hole, try to pack */
    ompi_group_t * c_local_group; /* 304 8 */
    ompi_group_t * c_remote_group; /* 312 8 */
    /* --- cacheline 5 boundary (320 bytes) --- */
    struct ompi_communicator_t * c_local_comm; /* 320 8 */
    struct opal_hash_table_t * c_keyhash; /* 328 8 */
    int c_cube_dim; /* 336 4 */
    /* XXX 4 bytes hole, try to pack */
    struct mca_topo_base_module_t * c_topo; /* 344 8 */
    int c_f_to_c_index; /* 352 4 */
    /* XXX 4 bytes hole, try to pack */
    struct ompi_peruse_handle_t * * c_peruse_handles; /* 360 8 */
    ompi_errhandler_t * error_handler; /* 368 8 */
    ompi_errhandler_type_t errhandler_type; /* 376 4 */
    /* XXX 4 bytes hole, try to pack */
    /* --- cacheline 6 boundary (384 bytes) --- */
    struct mca_pml_comm_t * c_pml_comm; /* 384 8 */
    struct mca_mtl_comm_t * c_mtl_comm; /* 392 8 */
    mca_coll_base_comm_coll_t * c_coll; /* 400 8 */
    int32_t c_nbc_tag; /* 408 4 */
    /* XXX 4 bytes hole, try to pack */
    ompi_instance_t * instance; /* 416 8 */
    int any_source_offset; /* 424 4 */
    /* XXX 4 bytes hole, try to pack */
    opal_object_t * agreement_specific; /* 432 8 */
    _Bool any_source_enabled; /* 440 1 */
    _Bool comm_revoked; /* 441 1 */
    _Bool coll_revoked; /* 442 1 */
    /* size: 448, cachelines: 7, members: 32 */
    /* sum members: 419, holes: 6, sum holes: 24 */
    /* padding: 5 */
};
The output shows that a single MPI_Comm requires a total of 448 bytes, spanning 7 cache lines of 64 bytes each, with 6 holes of 4 bytes each. Rearranging the data structure would reduce the number of holes (e.g. by moving any_source_offset to after agreement_specific). Taking data access in the most common code paths into account would also optimize cache line usage. Low-hanging fruit, such as eliminating c_name (or shrinking it, or moving it to the end), which is used solely for application and ompi debugging, might help as well.