Parallel (MPI-Enabled) DIRSIG Manual

DIRSIG5 supports the automated processing of single/multi-frame simulations across multiple processes/nodes using either the OpenMPI or MPICH implementation of the Message Passing Interface (MPI).

The early releases of DIRSIG5 required the user to download a separate release that had been compiled with MPI. As of the DIRSIG 2022.12 release, MPI can be enabled in the main release using a command-line option (see the use of --exec=mpi in the examples below).

Overview

A unified build for MPI and non-MPI Linux environments is available from myDIRSIG. When running in non-MPI mode, DIRSIG uses conventional file I/O to write output files, while the MPI mode leverages MPI-IO. It is recommended that end-users running on a single workstation with a single CPU socket use non-MPI mode to reduce the number of run-time dependencies. Workstations featuring more than one CPU socket can gain a significant run-time performance benefit by using MPI mode with a proper process topology.

MPI execution is supported by the following programs:

Available Adaptors

MPI poses a challenging dependency problem, as the various implementations are not runtime-compatible with each other. As such, DIRSIG has taken the approach of distributing a number of adapter plugins to make DIRSIG compatible with various MPI implementations. These adapters, however, are simply a compatibility layer. The OpenMPI and MPICH run-time libraries are NOT distributed with DIRSIG5 because they are assumed to be installed on the host. It is also assumed that the run-time libraries for these implementations are in your system’s LD_LIBRARY_PATH. Currently, the adapters included are:

Table 1. Available MPI adaptors
Name	Version
`openmpi3`	OpenMPI 3.1.6
`openmpi4`	OpenMPI 4.1.1
`mpich3`	MPICH 3.4.3
`mpich4`	MPICH 4.1.1

The build system used to produce the release is Red Hat Enterprise Linux (RHEL) 8.
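Because the adapters expect the MPI run-time libraries to be reachable through LD_LIBRARY_PATH, a quick check before launching can save some debugging. The following is a minimal sketch, not part of DIRSIG: the function name find_mpi_libs is hypothetical, and it assumes the MPI shared library is named libmpi.so* (the usual name for both OpenMPI and MPICH installations). Libraries that the dynamic loader finds through other means will not be reported.

```shell
# List MPI shared libraries found in the directories of a colon-separated
# search path. With no argument, LD_LIBRARY_PATH is searched.
# Assumption: the library file name starts with "libmpi.so".
find_mpi_libs() {
    printf '%s\n' "${1-${LD_LIBRARY_PATH-}}" | tr ':' '\n' |
    while IFS= read -r dir; do
        # Skip empty path entries (e.g. from a leading or doubled colon).
        if [ -z "$dir" ]; then
            continue
        fi
        for lib in "$dir"/libmpi.so*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$lib" ]; then
                echo "$lib"
            fi
        done
    done
}
```

Calling find_mpi_libs with no arguments after loading your MPI environment should print at least one path. Empty output suggests that the run-time libraries are not on LD_LIBRARY_PATH.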

Process Topology and Managing the Multi-Scale Parallelization

Due to NUMA considerations, it is recommended that one runs one bound process per CPU socket. Each process should then spawn a number of threads equal to the number of CPU cores available on said socket. In general, it is recommended that one disable simultaneous multi-threading (SMT, e.g. Hyper-Threading) to better control NUMA characteristics, but the performance impact of this varies between system configurations and may not be optimal in all situations.

We recently benchmarked the interaction of macro-scale (number of processes) and micro-scale (number of threads) parallelization on a single computer that could support 40 simultaneous threads (2 CPU sockets, each populated with a 10-core Xeon that supports hyper-threading). The results were as follows:

Scaling the number of single-threaded MPI processes.

 4 processes w/ 1 thread -> 52m 47.931s
 8 processes w/ 1 thread -> 27m 37.566s
16 processes w/ 1 thread -> 14m 35.913s
32 processes w/ 1 thread -> 11m 13.620s

As we can see above, the total run time roughly halves each time the number of processes doubles, until the process count gets close to the theoretical maximum of 40 (note that hyper-threading does not guarantee 100% theoretical utilization).
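To make the scaling in the timings above concrete, the run times can be converted to seconds and compared against the 4-process run. The sketch below is only an illustration: the function name scaling_report and the input layout (process count, minutes, seconds) are assumptions made for this example, and "efficiency" here means the measured speedup divided by the ideal speedup relative to the first row.

```shell
# Print "<processes> <speedup> <efficiency>" for each timing line on stdin.
# Input lines look like: 8 27m 37.566s
# The first line is used as the baseline for speedup and efficiency.
scaling_report() {
    awk '
        {
            procs = $1
            minutes = $2; sub(/m$/, "", minutes)
            seconds = $3; sub(/s$/, "", seconds)
            t = minutes * 60 + seconds
            if (NR == 1) { base_procs = procs; base_t = t }
            speedup = base_t / t
            ideal = procs / base_procs
            printf "%d %.2f %.2f\n", procs, speedup, speedup / ideal
        }
    '
}

# Timings copied from the single-threaded benchmark above.
scaling_report <<'EOF'
4 52m 47.931s
8 27m 37.566s
16 14m 35.913s
32 11m 13.620s
EOF
```

With these numbers, efficiency relative to the 4-process run stays above 0.9 through 16 processes and drops to roughly 0.59 at 32 processes, where the run relies on hyper-threaded logical cores.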

Scaling the number of MPI processes and threads per process.

32 processes w/ 1 thread -> 11m 13.620s
16 processes w/ 2 threads -> 9m 49.638s
 8 processes w/ 4 threads -> 9m 29.620s
 4 processes w/ 8 threads -> 9m 21.366s

As we can see above, each configuration had 32 effective threads and the total run times were similar. On this machine, with a theoretical maximum of 40 simultaneous threads or processes, the configuration with 32 single-threaded processes was the least efficient.

While it is not common to run MPI on a single node (machine), it is common for MPI clusters to contain physical machines that have multiple CPUs per machine. Depending on your MPI topology, you can easily run into the same issue of having multiple DIRSIG processes trying to use the maximum number of threads. This is why we recommend running a single process per socket. The OpenMPI launcher has a convenience option for this (see the --npersocket option):

$ mpirun --npersocket 1 dirsig5 --exec=mpi MY_SIM.sim

Alternatively, you can manually control the number of processes started using the -np (number of processes) MPI option and the number of threads using the --threads DIRSIG option:

$ mpirun -np 8 dirsig5 --threads=1 --exec=mpi rgb.sim
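To follow the one-process-per-socket recommendation with an explicit thread count, the number of cores per socket can be read from lscpu. The following is a hedged sketch rather than a DIRSIG utility: the function name dirsig_mpi_command is hypothetical, it assumes lscpu reports a "Core(s) per socket" field, it assumes that --npersocket and --threads can be combined in a single launch, and MY_SIM.sim is a placeholder file name.

```shell
# Read lscpu-style text on stdin and print an mpirun command line that
# starts one process per socket with one thread per physical core.
# Assumption: the input contains a "Core(s) per socket" line.
dirsig_mpi_command() {
    awk -F: '
        {
            key = $1
            sub(/^[ \t]+/, "", key)   # newer lscpu versions indent some fields
        }
        key == "Core(s) per socket" { gsub(/[ \t]/, "", $2); cores = $2 }
        END {
            if (cores == "") exit 1   # field not found
            printf "mpirun --npersocket 1 dirsig5 --threads=%s --exec=mpi MY_SIM.sim\n", cores
        }
    '
}

# Example with sample text shaped like lscpu output for a dual 10-core machine.
printf 'Socket(s):           2\nCore(s) per socket:  10\n' | dirsig_mpi_command
```

On a real node the function would be fed directly, as in `lscpu | dirsig_mpi_command`. Review the printed command before running it, since hyper-threading settings and site-specific binding policies may call for a different thread count.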

DIRSIG Internal Scheduling Modes

DIRSIG supports two scheduling modes:

All Processes per-Frame (APF): This is the default behavior. APF distributes the processing of each frame (capture) in a simulation across all processes. It is used to maximize complete frame throughput.
Unique Frame per-Process (UFP): UFP assigns one unique frame (capture) to each process. Under certain circumstances it can achieve a lower overall simulation run-time than APF for multi-frame simulations (e.g. when the filesystem or MPI-IO implementation has poor support for writing simultaneously to the same file, and/or when the scene changes significantly between frames). It is only available for multi-frame simulations with a platform truth/image schedule of capture (i.e. one image and/or truth file per frame).

Currently, DIRSIG doesn’t transition from UFP to APF scheduling mode for remainder frames, so efficiency peaks when the number of frames is evenly divisible by the number of processes.
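The effect of remainder frames can be estimated with a little integer arithmetic. The sketch below uses a simplified model that is an assumption of this example, not a description of the DIRSIG scheduler: every frame takes the same time, and processes work through the frames in rounds of one frame each. The function name ufp_utilization is hypothetical.

```shell
# Print "<rounds> <busy_percent>" for a UFP run.
#   $1 = number of frames, $2 = number of processes
# Simplifying assumption: equal-cost frames processed in lock-step rounds.
ufp_utilization() {
    frames=$1
    procs=$2
    rounds=$(( (frames + procs - 1) / procs ))        # ceiling division
    busy_pct=$(( 100 * frames / (rounds * procs) ))   # used process slots, in percent
    echo "$rounds $busy_pct"
}

ufp_utilization 8 4     # prints "2 100": frames divide evenly, no idle slots
ufp_utilization 10 4    # prints "3 83": the last round uses 2 of 4 processes
```

Under this model, choosing a process count that divides the frame count evenly (or padding the run to such a frame count) avoids a final round in which some processes sit idle.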

Quick Start

A user can easily load the OpenMPI utilities/libraries into their shell environment using environment modules. The MPI environment module is installed with the openmpi package in RHEL-like Linux distributions when deployed via yum.

Load the OpenMPI environment

Load the environment module for OpenMPI into the user’s shell environment:

$ module load mpi/openmpi-x86_64

Run DIRSIG with 1 process per socket (default APF schedule)

Use the --npersocket option to mpirun to have OpenMPI run a single process per socket:

$ mpirun --npersocket 1 dirsig5 --exec=mpi MY_SIM.sim

It is critical to include the --exec=mpi flag on any MPI-aware DIRSIG program. DIRSIG has no reliable way of detecting if it was run using mpirun or mpiexec, so it relies on this flag to select the proper scheduler and drivers.

Same as above, but with UFP schedule

Use the DIRSIG5 --mpi_one_event_per_node option to switch the internal scheduler into the unique frame per-process (UFP) mode:

$ mpirun --npersocket 1 dirsig5 --mpi_one_event_per_node --exec=mpi MY_SIM.sim