Example
The following example shows a generic loosely synchronous, iterative code, using MPI_WIN_FENCE for synchronization. The window at each MPI process consists of array A, which contains the origin and target buffers of the put operations.
The same code could be written with get rather than put. Note that, during the communication phase, each window is concurrently read (as origin buffer of puts) and written (as target buffer of puts). This is OK, provided that there is no overlap between the target buffer of a put and another communication buffer.
Example
Same generic example, with more computation/communication overlap. We assume that the update phase is broken into two subphases: the first, where the ``boundary,'' which is involved in communication, is updated, and the second, where the ``core,'' which neither uses nor provides communicated data, is updated.
The get communication can be concurrent with the core update, since they do not access the same locations, and the local update of the origin buffer by the get operation can be concurrent with the local update of the core by the update_core call. In order to get similar overlap with put communication we would need to use separate windows for the core and for the boundary. This is required because we do not allow local stores to be concurrent with puts on the same, or on overlapping, windows.
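A minimal C sketch of this variant, assuming helpers update_boundary() and update_core() and illustrative neighbor arrays as before:

while (!converged(A)) {
  update_boundary(A);                       /* produce the data neighbors need    */
  /* only gets are used, and no RMA preceded this fence */
  MPI_Win_fence(MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE, win);
  for (i = 0; i < fromneighbors; i++)
    MPI_Get(&A[getdisp[i]], 1, gettype[i], fromneighbor[i],
            fromdisp[i], 1, fromtype[i], win);
  update_core(A);                           /* overlaps with the pending gets     */
  MPI_Win_fence(MPI_MODE_NOSUCCEED, win);   /* no RMA follows until the next fence */
}
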
Example
Same code as in the first example above (fence synchronization with puts), rewritten using post-start-complete-wait.
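A possible C sketch, assuming from_group contains the neighbors that put into this process's window and to_group the neighbors whose windows this process puts into (group construction not shown):

while (!converged(A)) {
  update(A);
  MPI_Win_post(from_group, 0, win);         /* expose A to the putting neighbors */
  MPI_Win_start(to_group, 0, win);          /* begin access to the targets       */
  for (i = 0; i < toneighbors; i++)
    MPI_Put(&A[fromdisp[i]], 1, fromtype[i], toneighbor[i],
            todisp[i], 1, totype[i], win);
  MPI_Win_complete(win);                    /* complete the puts issued above    */
  MPI_Win_wait(win);                        /* wait for the neighbors' puts      */
}
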
Example
Same example as the second one above, with more computation/communication overlap, rewritten using post-start-complete-wait.
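A possible C sketch of this overlapping variant; the group roles are reversed relative to the put version above, because the neighbors (to_group) now read from this process's window while this process gets from from_group:

while (!converged(A)) {
  update_boundary(A);
  MPI_Win_post(to_group, MPI_MODE_NOPUT, win);  /* neighbors will only read A     */
  MPI_Win_start(from_group, 0, win);
  for (i = 0; i < fromneighbors; i++)
    MPI_Get(&A[getdisp[i]], 1, gettype[i], fromneighbor[i],
            fromdisp[i], 1, fromtype[i], win);
  update_core(A);                               /* overlaps with the pending gets */
  MPI_Win_complete(win);
  MPI_Win_wait(win);
}
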
Example
A checkerboard, or double buffer communication pattern, that allows more computation/communication overlap. Array A0 is updated using values of array A1, and vice versa. We assume that communication is symmetric: if process A gets data from process B, then process B gets data from process A. Window win0 consists of array A0, and window win1 consists of array A1.
An MPI process posts the local window associated with win0 before it completes RMA accesses to the remote windows associated with win1. When the call to MPI_WIN_WAIT on win1 returns, then all neighbors of the calling MPI process have posted the windows associated with win0. Conversely, when the call to MPI_WIN_WAIT on win0 returns, then all neighbors of the calling MPI process have posted the windows associated with win1. Therefore, the MPI_MODE_NOCHECK option can be used with the calls to MPI_WIN_START.
Put operations can be used, instead of get operations, if the area of array A0 (resp. A1) used by update(A1, A0) (resp. update(A0, A1)) is disjoint from the area modified by the RMA operation. On some systems, a put operation may be more efficient than a get operation, as it requires information exchange only in one direction.
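A condensed C sketch of one possible structure (update1(), update2(), the neighbors group with nneighbors members, and the datatype/displacement arrays are illustrative):

if (!converged(A0, A1))                 /* expose A0 for the first iteration      */
  MPI_Win_post(neighbors, MPI_MODE_NOCHECK | MPI_MODE_NOPUT, win0);
MPI_Barrier(comm);   /* needed because the start calls below use MPI_MODE_NOCHECK */

while (!converged(A0, A1)) {
  /* communication on win0 (updates A0's ghost region), computation on A1 */
  update2(A1, A0);                      /* local update of A1 that depends on A0  */
  MPI_Win_start(neighbors, MPI_MODE_NOCHECK, win0);
  for (i = 0; i < nneighbors; i++)
    MPI_Get(&A0[getdisp0[i]], 1, gettype0[i], neighbor[i],
            fromdisp0[i], 1, fromtype0[i], win0);
  update1(A1);                          /* A1-only update, concurrent with the gets */
  MPI_Win_post(neighbors, MPI_MODE_NOCHECK | MPI_MODE_NOPUT, win1);
  MPI_Win_complete(win0);
  MPI_Win_wait(win0);                   /* all neighbors have now posted win1     */

  /* communication on win1 (updates A1's ghost region), computation on A0 */
  update2(A0, A1);
  MPI_Win_start(neighbors, MPI_MODE_NOCHECK, win1);
  for (i = 0; i < nneighbors; i++)
    MPI_Get(&A1[getdisp1[i]], 1, gettype1[i], neighbor[i],
            fromdisp1[i], 1, fromtype1[i], win1);
  update1(A0);
  if (!converged(A0, A1))
    MPI_Win_post(neighbors, MPI_MODE_NOCHECK | MPI_MODE_NOPUT, win0);
  MPI_Win_complete(win1);
  MPI_Win_wait(win1);                   /* all neighbors have now posted win0     */
}
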
In the next several examples, for conciseness, the expression z = MPI_Get_accumulate(...) means to perform a get-accumulate operation with the result buffer (given by result_addr in the description of MPI_GET_ACCUMULATE) on the left side of the assignment, in this case, z. This format is also used with MPI_COMPARE_AND_SWAP and MPI_COMM_SIZE. Process B... refers to any process other than A.
Example
The following example implements a naive, nonscalable counting
semaphore. The example demonstrates the use of
MPI_WIN_SYNC to manipulate the public copy of X, as well
as MPI_WIN_FLUSH to complete operations without closing the
access epoch opened with MPI_WIN_LOCK_ALL. To avoid the
rules regarding synchronization of the public and private copies of
windows, MPI_ACCUMULATE and MPI_GET_ACCUMULATE
are used to write to or read from the local public copy.
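A C sketch, assuming the window consists of the single integer X on Process A (rank rank_A, displacement 0, disp_unit sizeof(int)) and that X is initialized to the number of decrementing processes:

/* Process A: */
int z, dummy = 0;
MPI_Win_lock_all(0, win);
X = nprocs - 1;                  /* store the initial count in the private copy */
MPI_Win_sync(win);               /* make the public copy match                  */
MPI_Barrier(comm);
do {                             /* atomically read the public copy of X        */
  MPI_Get_accumulate(&dummy, 0, MPI_INT, &z, 1, MPI_INT,
                     rank_A, 0, 1, MPI_INT, MPI_NO_OP, win);
  MPI_Win_flush(rank_A, win);    /* complete the read without closing the epoch */
} while (z != 0);
MPI_Win_unlock_all(win);

/* Each Process B: */
int minus_one = -1;
MPI_Win_lock_all(0, win);
MPI_Barrier(comm);
MPI_Accumulate(&minus_one, 1, MPI_INT, rank_A, 0, 1, MPI_INT, MPI_SUM, win);
MPI_Win_unlock_all(win);         /* completes the atomic decrement              */
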
Example
Implementing a critical region between two MPI processes (Peterson's
algorithm). Despite their appearance in the
following example, MPI_WIN_LOCK_ALL and
MPI_WIN_UNLOCK_ALL are not collective calls, but it is
frequently useful to open shared access epochs to all MPI processes from
all other MPI processes in a window. Once the access epochs are
opened, accumulate operations as well as flush and sync
synchronization can be used to read from or write to the
public copy of the window.
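One possible C sketch for each of the two processes: it assumes flag[0], flag[1], and turn are placed at displacements 0, 1, and 2 of a window on process 0 (disp_unit sizeof(int)) and have been initialized to zero, with a barrier, before the code below runs; all reads and writes go through accumulate operations, so only public copies are involved.

/* rank is 0 or 1 in a two-process communicator; the control window lives
   on process 0: disp 0 = flag[0], disp 1 = flag[1], disp 2 = turn.       */
int other = 1 - rank;
int one = 1, zero = 0, dummy = 0;
int flag_other, turn_v;

MPI_Win_lock_all(0, win);

/* flag[rank] = 1 */
MPI_Accumulate(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, MPI_REPLACE, win);
MPI_Win_flush(0, win);
/* turn = other */
MPI_Accumulate(&other, 1, MPI_INT, 0, 2, 1, MPI_INT, MPI_REPLACE, win);
MPI_Win_flush(0, win);

/* wait while (flag[other] == 1 && turn == other) */
do {
  MPI_Get_accumulate(&dummy, 0, MPI_INT, &flag_other, 1, MPI_INT,
                     0, other, 1, MPI_INT, MPI_NO_OP, win);
  MPI_Get_accumulate(&dummy, 0, MPI_INT, &turn_v, 1, MPI_INT,
                     0, 2, 1, MPI_INT, MPI_NO_OP, win);
  MPI_Win_flush(0, win);
} while (flag_other == 1 && turn_v == other);

/* ... critical region ... */

/* flag[rank] = 0 */
MPI_Accumulate(&zero, 1, MPI_INT, 0, rank, 1, MPI_INT, MPI_REPLACE, win);
MPI_Win_flush(0, win);

MPI_Win_unlock_all(win);
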
Example
Implementing a critical region between multiple MPI processes with compare and swap. The call to MPI_WIN_SYNC is necessary on Process A after local initialization of A to guarantee that the public copy has been updated with the initialization value found in the private copy. It would also be valid to call MPI_ACCUMULATE with MPI_REPLACE to directly initialize the public copy. In that case, a call to MPI_WIN_FLUSH would be necessary to ensure that A in the public copy of Process A has been updated before the barrier.
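A C sketch, assuming the lock variable A is a single integer at displacement 0 in the window of Process A (rank rank_A; my_rank, comm, and win are illustrative):

int one = 1, zero = 0, result;

MPI_Win_lock_all(0, win);
if (my_rank == rank_A) {
  A = 0;                        /* initialize the lock in the private copy  */
  MPI_Win_sync(win);            /* propagate the value to the public copy   */
}
MPI_Barrier(comm);

do {                            /* acquire: atomically swap 0 -> 1          */
  MPI_Compare_and_swap(&one, &zero, &result, MPI_INT, rank_A, 0, win);
  MPI_Win_flush(rank_A, win);
} while (result != 0);

/* ... critical region ... */

MPI_Accumulate(&zero, 1, MPI_INT, rank_A, 0, 1, MPI_INT,
               MPI_REPLACE, win);   /* release: atomically reset to 0       */
MPI_Win_flush(rank_A, win);

MPI_Win_unlock_all(win);
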
Example
The following example demonstrates the proper synchronization in the unified memory model when a data transfer is implemented with load and store accesses in the case of windows in shared memory (instead of using MPI_PUT or MPI_GET) and the synchronization between MPI processes is performed using point-to-point communication. The synchronization between MPI processes must be supplemented with a memory synchronization through calls to MPI_WIN_SYNC, which act locally as a processor-memory barrier. In Fortran, if MPI_ASYNC_PROTECTS_NONBLOCKING is .FALSE. or the variable X is not declared as ASYNCHRONOUS, reordering of the accesses to the variable X must be prevented with MPI_F_SYNC_REG operations. (No equivalent function is needed in C.)
The variable X is contained within a shared memory window and X corresponds to the same memory location at both processes. The first call to MPI_WIN_SYNC performed by process A ensures completion of the load/store accesses issued by process A. The first call to MPI_WIN_SYNC performed by process B ensures that process A's updates to X are visible to process B. Similarly, the second call to MPI_WIN_SYNC on each process ensures correct ordering of the point-to-point communication and thus that the load/store operations on process B have completed before any subsequent load/store accesses to the variable X in process A.
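A C sketch of the pattern for two MPI processes, assuming the window is created with MPI_WIN_ALLOCATE_SHARED so that X (an integer in process A's segment) can be addressed by process B through MPI_WIN_SHARED_QUERY; in C, no MPI_F_SYNC_REG calls are needed:

int       rank, *X;
MPI_Aint  sz;
int       du;
MPI_Win   win;

MPI_Comm_rank(comm, &rank);                      /* A is rank 0, B is rank 1   */
MPI_Win_allocate_shared((rank == 0) ? sizeof(int) : 0, sizeof(int),
                        MPI_INFO_NULL, comm, &X, &win);
if (rank == 1)                                   /* B addresses A's copy of X  */
  MPI_Win_shared_query(win, 0, &sz, &du, &X);
MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

if (rank == 0) {                                 /* Process A                  */
  *X = 42;                                       /* store                      */
  MPI_Win_sync(win);                             /* complete the store before notifying B */
  MPI_Send(NULL, 0, MPI_BYTE, 1, 0, comm);
  MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, comm, MPI_STATUS_IGNORE);
  MPI_Win_sync(win);                             /* order B's accesses before reusing X   */
  *X = 0;                                        /* a subsequent store is now safe        */
} else {                                         /* Process B                  */
  MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
  MPI_Win_sync(win);                             /* A's store to X is now visible         */
  printf("%d\n", *X);                            /* load                       */
  MPI_Win_sync(win);                             /* complete the load before notifying A  */
  MPI_Send(NULL, 0, MPI_BYTE, 0, 0, comm);
}

MPI_Win_unlock_all(win);
MPI_Win_free(&win);
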
Example
The following example shows how request-based operations can be used
to overlap communication with computation. Each MPI process fetches,
processes, and writes the result for NSTEPS chunks of data. Instead
of a single buffer, M local buffers are used to allow up to M
communication operations to overlap with computation.
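A C sketch of one way to structure this (M, N, NSTEPS, the target rank, and process_data() are illustrative; window creation is not shown):

#define M      4
#define N      1024
#define NSTEPS 100

double      buf[M][N];
MPI_Request put_req[M], get_req;
MPI_Win     win;                     /* window holding NSTEPS chunks of N doubles */
int         i, j;

for (j = 0; j < M; j++) put_req[j] = MPI_REQUEST_NULL;

MPI_Win_lock_all(0, win);
for (i = 0; i < NSTEPS; i++) {
  if (i < M)
    j = i;                           /* a local buffer is still free             */
  else                               /* reuse a buffer whose put has completed   */
    MPI_Waitany(M, put_req, &j, MPI_STATUS_IGNORE);

  MPI_Rget(buf[j], N, MPI_DOUBLE, target, (MPI_Aint)i * N, N, MPI_DOUBLE,
           win, &get_req);
  MPI_Wait(&get_req, MPI_STATUS_IGNORE);   /* need the data before processing    */
  process_data(buf[j], N);
  MPI_Rput(buf[j], N, MPI_DOUBLE, target, (MPI_Aint)i * N, N, MPI_DOUBLE,
           win, &put_req[j]);              /* completes while later chunks are fetched */
}
MPI_Waitall(M, put_req, MPI_STATUSES_IGNORE);
MPI_Win_unlock_all(win);
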
Example
The following example constructs a distributed shared linked list using dynamic
windows. Initially process 0 creates the head of the list, attaches it to
the window, and broadcasts the pointer to all MPI processes. All MPI processes then
concurrently append N new elements to the list. When an MPI process attempts to attach its element to the tail of the list, it may discover that its tail pointer is stale and that it must chase ahead to the new tail before the element can be attached.
This example requires some modification to
work in an environment where the layout of the structures is different on
different MPI processes.
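A condensed C sketch of the append step under these assumptions (identical structure layout everywhere, a dynamically created window win, the head element broadcast beforehand as the initial tail_ptr, and a displacement of 0 never being a valid element address):

#include <stddef.h>                  /* offsetof                                 */

typedef struct {                     /* remote "pointer": (rank, displacement)   */
  int      rank;                     /* -1 means NULL                            */
  MPI_Aint disp;
} llist_ptr_t;

typedef struct {
  int         value;
  llist_ptr_t next;
} llist_elem_t;

static const llist_ptr_t nil = { -1, 0 };

/* win was created with MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);
   tail_ptr initially refers to the broadcast head element.                 */

llist_elem_t *elem = malloc(sizeof(*elem));   /* new element, attached below */
elem->value = my_value;
elem->next  = nil;
MPI_Win_attach(win, elem, sizeof(*elem));

llist_ptr_t new_ptr;
new_ptr.rank = my_rank;
MPI_Get_address(elem, &new_ptr.disp);

MPI_Win_lock_all(0, win);
int done = 0;
while (!done) {
  llist_ptr_t next;

  /* Try to link the new element: swap the tail's next.rank from -1 to my rank. */
  MPI_Compare_and_swap(&new_ptr.rank, &nil.rank, &next.rank, MPI_INT,
                       tail_ptr.rank,
                       MPI_Aint_add(tail_ptr.disp, offsetof(llist_elem_t, next.rank)),
                       win);
  MPI_Win_flush(tail_ptr.rank, win);

  if (next.rank == nil.rank) {       /* we won: publish the element's displacement */
    MPI_Accumulate(&new_ptr.disp, 1, MPI_AINT, tail_ptr.rank,
                   MPI_Aint_add(tail_ptr.disp, offsetof(llist_elem_t, next.disp)),
                   1, MPI_AINT, MPI_REPLACE, win);
    MPI_Win_flush(tail_ptr.rank, win);
    tail_ptr = new_ptr;
    done = 1;
  } else {                           /* stale tail: chase ahead to the new tail    */
    do {                             /* the winner may not have published the disp yet */
      MPI_Get_accumulate(NULL, 0, MPI_AINT, &next.disp, 1, MPI_AINT,
                         tail_ptr.rank,
                         MPI_Aint_add(tail_ptr.disp, offsetof(llist_elem_t, next.disp)),
                         1, MPI_AINT, MPI_NO_OP, win);
      MPI_Win_flush(tail_ptr.rank, win);
    } while (next.disp == nil.disp);
    tail_ptr = next;
  }
}
MPI_Win_unlock_all(win);
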