DIRSIG5 supports the automated processing of single/multi-frame simulations across multiple processes/nodes using either the OpenMPI (1.x) or MPICH implementation of the Message Passing Interface (MPI).
The early releases of DIRSIG5 required the user to download
a separate release that had been compiled with MPI. As of the
DIRSIG 2022.12 release,
MPI can be enabled in the main release using a command-line
option (see the use of
A unified build for MPI and non-MPI Linux environments is available from myDIRSIG. When running in non-MPI mode, DIRSIG uses conventional file I/O to write output files, while the MPI mode leverages MPI-IO. It is recommended that end-users running on a single workstation with a single CPU socket use non-MPI mode to reduce the number of run-time dependencies. Workstations featuring more than one CPU socket can gain a significant run-time performance benefit by using the MPI build with a proper process topology.
MPI execution is supported by the following programs:
MPI poses an challenging dependency problem, as various implementations
are not runtime-compatible with each other. As such, DIRSIG has
taken the approach of distributing a number of adapter plugins to
make DIRSIG compatible with various MPI implementations. These
adapters, however, are simply a compatibility layer. The OpenMPI
and MPICH run-time libraries are NOT distributed with DIRSIG5
because they are assumed to be installed on the host. It is also
assumed that the runtime libraries for these implementations are
in your system’s
LD_LIBRARY_PATH. Currently, the adapters included
OpenMPI 1.x: Built with OpenMPI v1.10.7.
MPICH: Built with MPICH v3.2.
The build system used to produce the release is Redhat Enterprise Linux (RHEL) 7.
Process Topology and Managing the Multi-Scale Parallelization
Due to NUMA considerations, it is recommended that one runs one bound process per CPU socket. Each process should then spawn a number of threads equal to the number of CPU cores available on said socket. In general, it is recommended that one disable simultaneous multi-threading (SMT, e.g. Hyper-Threading) to better control NUMA characteristics, but the performance impact of this varies between system configurations and may not be optimal in all situations.
We recently benchmarked the interaction of macro-scale (number of processes) and micro-scale (number of threads) parallelization on a single computer that could support 40 simultaneous threads (2 CPU sockets each populated with a 10-core Xeon that supports hyper-threading). And this is what we saw:
4 processes w/ 1 thread -> 52m 47.931s 8 processes w/ 1 thread -> 27m 37.566s 16 processes w/ 1 thread -> 14m 35.913s 32 processes w/ 1 thread -> 11m 13.620s
As we can see above, the total run time decreases linearly with the number of processes until we get close to the theoretical maximum of 40 (note that hyper-threading does not guarantee 100% theoretical utilization).
32 processes w/ 1 thread -> 11m 13.620s 16 processes w/ 2 threads -> 9m 49.638s 8 processes w/ 4 threads -> 9m 29.620s 4 processes w/ 8 threads -> 9m 21.366s
As we can see above, each configuration had 32 effective threads and the total run-times were similar. On this machine with a theoretical maximum of 40 simultaneous threads or processes, the 32 single-threaded process was the least efficient.
While it is not common to run MPI on a single node (machine), it is
common for MPI clusters to contain physical machines that have multiple
CPUs per machine. Depending on what your MPI topology, you can easily
run into the same issue of having multiple DIRSIG processes trying to
use the maximum number of threads.
This is why we recommend running a single process per socket. The
OpenMPI launcher has a convience option for this (see the
$ mpirun --npersocket 1 dirsig5 --exec=mpi MY_SIM.sim
Alternatively, you can manually control the number of processes started
-np (number of processes) MPI option and the number of
threads using the
--threads DIRSIG option:
$ mpirun -np 8 dirsig5 --threads=1 --exec=mpi rgb.sim
DIRSIG Internal Scheduling Modes
DIRSIG supports two scheduling modes:
- All Processes per-Frame (APF)
This is the default behavior. APF distributes the processing of each frame (capture) in a simulation across all processes. It is used to maximize complete frame throughput.
- Unique Frame per-Process (UFP)
UFP assigns one unique frame (capture) to each process. Under certain circumstances it can achieve a lower overall simulation run-time than APF for multi-frame simulations (e.g. poor filesystem/MPI-IO) implementation support for writing simultaneously to the same file and/or if the scene changes significantly between frames). It only is available for multi-frame simulations with a platform truth/image schedule of capture (i.e. one image and/or truth file per frame).
|Currently, DIRSIG doesn’t transition from UFP to AFP scheduling mode for remainder frames, so efficiency peaks when the number of frames is evenly divisible by the number of processes.|
A user can easily load the OpenMPI utilities/libraries into their shell
the MPI environment environment module is installed with the openmpi
package in RHEL-like Linux distributions when deployed via
Load the OpenMPI environment
Load the environment module for OpenMPI into the user’s shell environment:
$ module load mpi/openmpi-x86_64
Run DIRSIG with 1 process per socket (default APF schedule)
--npersocket option to
mpirun to have OpenMPI run a single
process per socket:
$ mpirun --npersocket 1 dirsig5 --exec=mpi MY_SIM.sim
It is critical to include the
Same as above, but with UFP schedule
Use the DIRSIG5
--mpi_one_event_per_node option to switch the internal
scheduler into the unique frame per-process (UFP) mode:
$ mpirun --npersocket 1 dirsig5 --mpi_one_event_per_node --exec=mpi MY_SIM.sim