The following rules specify the latest point in the execution of the application at which an operation must complete at the origin or the target. The update that a call to MPI_GET makes to the origin process memory is visible when the get operation is complete at the origin (or earlier); the update that a call to MPI_PUT or an accumulate procedure makes to the public copy of the target window is visible when the put or accumulate operation has completed at the target (or earlier). The rules also specify the latest point at which an update of one window copy becomes visible in another overlapping copy.
6. An update by a put or accumulate operation to a public window copy becomes
visible in the private copy in MPI process memory at the latest when an ensuing
call to MPI_WIN_WAIT, MPI_WIN_FENCE, MPI_WIN_LOCK,
MPI_WIN_LOCK_ALL, or MPI_WIN_SYNC is executed on that window by the
window owner. In the RMA unified memory model, an update by a put or
accumulate operation to a public window copy eventually becomes visible in the private
copy in MPI process memory without additional RMA calls.
The rules above also define, by implication, when an update to a public window copy becomes visible in another overlapping public window copy. Consider, for example, two overlapping windows, win1 and win2. A call to MPI_WIN_FENCE on win1 by the window owner makes visible in the target process memory previous updates to window win1 by origin processes. A subsequent call to MPI_WIN_FENCE on win2 makes these updates visible in the public copy of win2.
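The worst case permitted by rule 6 can be pictured with a small toy model. This is only a sketch: the types and functions below (`pubpriv_window`, `put_completed_at_target`, `owner_sync_call`, `owner_load`) are hypothetical and are not part of the MPI API. The model also simplifies the unified case by making the update visible immediately, whereas the text promises only that it becomes visible eventually.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types for illustration; NOT part of the MPI API. */
typedef enum { MODEL_SEPARATE, MODEL_UNIFIED } memory_model;

typedef struct {
    memory_model model;
    int public_copy;   /* copy that remote put/accumulate operations update */
    int private_copy;  /* copy that the owner's loads read */
} pubpriv_window;

/* A put that has completed at the target: it always lands in the public copy. */
void put_completed_at_target(pubpriv_window *w, int value)
{
    assert(w);
    w->public_copy = value;
    if (w->model == MODEL_UNIFIED) {
        /* Simplification (assumption): "eventually" is modeled as "now". */
        w->private_copy = value;
    }
}

/* Stands in for the owner calling MPI_WIN_WAIT, MPI_WIN_FENCE, MPI_WIN_LOCK,
 * MPI_WIN_LOCK_ALL, or MPI_WIN_SYNC: the latest point at which the put
 * becomes visible in the private copy. */
void owner_sync_call(pubpriv_window *w)
{
    assert(w);
    w->private_copy = w->public_copy;
}

/* A load by the window owner reads the private copy. */
int owner_load(const pubpriv_window *w)
{
    assert(w);
    return w->private_copy;
}
```

In this model, a load in the separate memory model may still return the old value after the put has completed, and is only guaranteed to return the new value after one of the listed synchronization calls. An actual implementation may make the update visible earlier.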
The behavior of some MPI RMA operations may be
undefined in certain situations. For example, the result of
several origin processes performing concurrent put
operations to the same target location is undefined. In addition, the
result of a single origin process performing multiple
put operations to the same target location within the
same access epoch is also undefined.
The result at the target may have all of the
data from one of the put operations (the "last" one,
in some sense), some bytes from each of the operations, or
something else. In MPI-2, such operations were erroneous.
That meant that an MPI implementation was permitted to raise an error.
Thus, user programs or tools that used MPI RMA could not
portably permit such operations, even if the application code could
function correctly with such an undefined result. Starting with MPI-3, these
operations are not erroneous, but do not have a defined behavior.
Rationale.
As discussed in [8], requiring
operations such as
overlapping puts to be erroneous makes it difficult to use MPI
RMA to implement programming models, such as Unified Parallel C (UPC) or SHMEM, that permit
these operations. Further, while MPI-2 defined these operations as
erroneous, the MPI Forum is unaware of any implementation that enforces
this rule, as it would require significant overhead. Thus, relaxing
this condition does not impact existing implementations or applications.
(End of rationale.)
Advice to implementors.
Overlapping accesses are undefined. However, to assist users in
debugging code, implementations may wish to provide a mode in which such
operations are detected and reported to the user. Note, however, that starting with MPI-3, such operations must
not raise an error.
(End of advice to implementors.)
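A debugging mode of the kind suggested above could, for instance, record the puts issued in an access epoch and report pairs that write the same bytes. The following is a minimal sketch under stated assumptions: `put_record`, `puts_overlap`, and `count_overlapping_pairs` are hypothetical names, each put is assumed to write one contiguous byte range in a single window, and the result is meant to be reported to the user, not turned into an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one put issued in the current access epoch. */
typedef struct {
    int    target_rank;  /* rank of the target process */
    size_t disp;         /* byte displacement into the target window */
    size_t len;          /* number of contiguous bytes written */
} put_record;

/* True if the two puts write at least one common byte at the same target. */
bool puts_overlap(const put_record *a, const put_record *b)
{
    assert(a && b);
    if (a->target_rank != b->target_rank) return false;
    if (a->len == 0 || b->len == 0) return false;
    /* Half-open ranges [disp, disp + len) intersect. */
    return a->disp < b->disp + b->len && b->disp < a->disp + a->len;
}

/* Counts overlapping pairs among n recorded puts.
 * Quadratic in n, which is acceptable for a debugging sketch. */
size_t count_overlapping_pairs(const put_record *puts, size_t n)
{
    size_t conflicts = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (puts_overlap(&puts[i], &puts[j]))
                conflicts++;
    return conflicts;
}
```

A tool built this way would print a diagnostic when the count is nonzero and let the program continue. Noncontiguous datatypes would need a per-block comparison, which this sketch does not attempt.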
A program with a well-defined outcome in the MPI_WIN_SEPARATE memory model
must obey the following rules.
Rationale.
The last constraint on correct RMA accesses may seem unduly
restrictive, as it forbids concurrent accesses to nonoverlapping
locations in a window. The reason for this constraint is that, on
some architectures, explicit coherence restoring operations may be
needed at synchronization points.
A different operation may be needed for locations that were
updated by stores and for locations that were remotely
updated by put or accumulate operations. Without this constraint,
the MPI library would have to track
precisely which locations in a window were updated by a put or
accumulate operation. The additional overhead of maintaining such
information is considered prohibitive.
(End of rationale.)
Note that MPI_WIN_SYNC may be used within a passive
target epoch to synchronize the private and public window copies
(that is, updates to one are made visible to the other).
In the MPI_WIN_UNIFIED memory model, the rules are simpler because the public and private windows are the same. However, there are restrictions to avoid concurrent access to the same memory locations by different MPI processes. The rules that a program with a well-defined outcome must obey in this case are:
Advice to users.
Some compiler optimizations can produce code that
maintains the sequential semantics of the program but violates
this rule by introducing temporary values into locations in memory. Most
compilers apply such transformations only at very high levels of
optimization; users should be aware that such aggressive optimization
may produce unexpected results.
(End of advice to users.)
3. Updating a location in the window with a store access
that is also the target of a remote read (but not update) is valid
(not erroneous), but the precise result will depend on the behavior
of the implementation. Store updates will appear in
memory, but there are no atomicity or ordering guarantees if
more than one byte is updated. Updates are stable in the sense that
once data appears in memory, the data remains until replaced by
another update. This permits updates to memory with store accesses
without requiring an RMA epoch. Users are cautioned that remote accesses to
a window that is updated by the local MPI process have defined
behavior only if the other rules given here and elsewhere in this chapter
are followed.
4. A location in a window must not be accessed as a target of an RMA
operation once an update to that location has started and until the
update completes at the target. There is one
exception to this rule: in the case where the same location is updated
by two concurrent accumulates with the same
predefined datatype on the same window. Additional restrictions on the
operation apply; see the info key accumulate_ops in
Section Window Creation.
5. A put or accumulate must not access a target
window once a store, put, or
accumulate update to another (overlapping) target window
has started on the same location in the target window and until the update
completes at the target window.
Conversely, a store access
to a location in a window must not be executed once a put or
accumulate update to the same location in that target window has started
and until the put or accumulate
update completes at the target.
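Rule 4 and its exception can be written as a small predicate. This is a sketch only: `rma_kind`, `rma_access`, and `concurrent_same_location_ok` are hypothetical names and not part of the MPI API. It also ignores the additional restrictions that the accumulate_ops info key places on the operation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one RMA access that targets a given location. */
typedef enum { RMA_GET, RMA_PUT, RMA_ACCUMULATE } rma_kind;

typedef struct {
    rma_kind kind;
    int window_id;           /* hypothetical identifier of the window used */
    int predefined_type_id;  /* hypothetical identifier of the predefined datatype */
} rma_access;

/* True if two concurrent RMA accesses to the SAME target location are
 * permitted under rule 4 as modeled here. */
bool concurrent_same_location_ok(const rma_access *a, const rma_access *b)
{
    assert(a && b);
    bool a_updates = (a->kind != RMA_GET);
    bool b_updates = (b->kind != RMA_GET);

    /* Two reads: no update is in flight, so rule 4 does not apply. */
    if (!a_updates && !b_updates) return true;

    /* The one exception: two accumulates with the same predefined
     * datatype on the same window. */
    if (a->kind == RMA_ACCUMULATE && b->kind == RMA_ACCUMULATE)
        return a->window_id == b->window_id &&
               a->predefined_type_id == b->predefined_type_id;

    /* Any other combination involves an update that has not completed. */
    return false;
}
```

The predicate covers RMA accesses only. The interaction with local store accesses described in rule 5 would need a separate check.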
Advice to users.
A user can write correct programs by following these rules:
With the post-start synchronization, the target process can tell the origin process that its window is now ready for RMA access; with the complete-wait synchronization, the origin process can tell the target process that it has finished its RMA accesses to the window.
The RMA synchronization operations define when updates are guaranteed
to become visible in public and private windows. Updates may become
visible earlier, but such behavior is implementation dependent.
(End of advice to users.)
The following examples illustrate these semantics.
Example
The following example demonstrates updating a memory location inside a
window for the separate memory model, according to
Rule Semantics and Correctness. The MPI_WIN_LOCK and
MPI_WIN_UNLOCK calls around the store to X in
process B are necessary
to ensure consistency between the public and private copies of the
window.
Example
In the RMA unified model, although the public and private copies
of the windows are synchronized, caution must be used when
combining load/store accesses with multi-process synchronization.
Although the following example appears correct, the compiler or
hardware may delay the store to X until after the barrier, possibly
resulting in the MPI_GET returning an incorrect value of X.
MPI_BARRIER provides process synchronization, but not memory synchronization. The example could potentially be made safe through the use of compiler- and hardware-specific notations to ensure the store to X occurs before process B enters the MPI_BARRIER. The use of one-sided synchronization calls, as shown in Example Semantics and Correctness, also ensures the correct result.
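One possible form of such a notation is a C11 fence placed directly after the store. The sketch below is an assumption and is not something the text prescribes: `store_before_sync` is a hypothetical helper, and whether a fence of this kind is sufficient depends on the compiler, the processor, and the network hardware.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: performs the store to *x, then issues a sequentially
 * consistent fence. The volatile access keeps the compiler from eliding or
 * deferring the store; the fence is intended to keep the store from being
 * reordered after code that follows the call. No guarantee is implied for
 * any particular platform. */
void store_before_sync(volatile int *x, int value)
{
    assert(x);
    *x = value;
    atomic_thread_fence(memory_order_seq_cst);
}
```

Process B would call the helper for X immediately before entering MPI_BARRIER. The one-sided synchronization calls mentioned above remain the portable way to obtain the correct result.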
Example
The following example demonstrates the reading of a memory location
updated by an origin process (Rule Semantics and Correctness) in
the RMA separate memory model. Although the call to
MPI_WIN_UNLOCK on process A and the MPI_BARRIER
ensure that the public copy on process B reflects the updated value of X,
the call to MPI_WIN_LOCK by process B is necessary to
synchronize the private copy with the public copy.
Note that in this example, the barrier is not critical to the semantic correctness. The use of exclusive locks guarantees no other MPI process will modify the public copy after MPI_WIN_LOCK synchronizes the private and public copies. A polling implementation looking for changes in X on process B would be semantically correct. The barrier is required to ensure that process A completes the put operation at the target before process B executes the load of X.
Example
Similar to Example Semantics and Correctness, the following example
is unsafe even in the unified model, because the load of X cannot be
guaranteed to occur after the MPI_BARRIER. While process B
does not need to explicitly synchronize the public and private copies
through MPI_WIN_LOCK as the MPI_PUT will update
both the public and private copies of the window, the scheduling of
the load could result in old values of X being returned. Compiler- and
hardware-specific notations could ensure the load occurs after the data is updated, or
explicit one-sided synchronization calls can be used to ensure the proper result.
Example
The following example further clarifies
Rule Semantics and Correctness. MPI_WIN_LOCK and
MPI_WIN_LOCK_ALL do not update the public copy of
a window with changes to the private copy. Therefore, there is no
guarantee that process A in the
following sequence will see the value of X as updated by the
store by process B before the lock.
The addition of a call to MPI_WIN_SYNC before the call to MPI_BARRIER by process B would guarantee process A would see the updated value of X, as the public copy of the window would be explicitly synchronized with the private copy.
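The difference between the two calls can be replayed in a toy model of process B's window. This is a sketch under stated assumptions: `owner_window` and the four functions are hypothetical names and not part of the MPI API. The model shows the worst case the text allows, in which the lock publishes nothing. An implementation may make the store visible earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one window location owned by process B. */
typedef struct {
    int  public_copy;    /* what a remote get by process A reads */
    int  private_copy;   /* what process B's loads and stores touch */
    bool store_pending;  /* a store has not yet reached the public copy */
} owner_window;

/* Local store by the owner: updates the private copy only. */
void owner_store(owner_window *w, int value)
{
    assert(w);
    w->private_copy = value;
    w->store_pending = true;
}

/* Stands in for MPI_WIN_LOCK or MPI_WIN_LOCK_ALL called by the owner:
 * in the worst case it does not update the public copy. */
void owner_lock(owner_window *w)
{
    assert(w);
    /* Deliberately leaves public_copy unchanged. */
}

/* Stands in for MPI_WIN_SYNC called by the owner: publishes the pending store. */
void owner_sync(owner_window *w)
{
    assert(w);
    if (w->store_pending) {
        w->public_copy = w->private_copy;
        w->store_pending = false;
    }
}

/* A get issued by process A reads the public copy. */
int remote_get(const owner_window *w)
{
    assert(w);
    return w->public_copy;
}
```

In the model, a get issued after the store and the lock can still return the old value of X, while a get issued after the sync returns the stored value.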
Example
Similar to the previous example, Rule Semantics and Correctness can
have unexpected implications for general active target synchronization
with the RMA separate memory model. It is not guaranteed
that process B reads the value of X as per the local update by process
A, because neither the call to MPI_WIN_WAIT nor
the call to MPI_WIN_COMPLETE by process A ensures visibility in
the public window copy.
To allow process B to read the value of X stored by A, the local store must be replaced by a local put operation that updates the public window copy. Note that with this replacement, X may become visible in the private copy of process A only after the MPI_WIN_WAIT call in process A. The update to Y made before the MPI_WIN_POST call is visible in the public window after the MPI_WIN_POST call, and therefore process B will read the proper value of Y. The get of Y could be moved to the epoch opened by MPI_WIN_START, and process B would still get the value stored by process A.
Example
The following example demonstrates the interaction of general
active target synchronization with load accesses in the
RMA separate memory model. Rules Semantics and Correctness
and Semantics and Correctness do not guarantee that the
private copy of X at process B has been updated before the load access is executed.
To ensure that the value put by process A is read, the load access must be replaced with a get operation, or must be placed after the call to MPI_WIN_WAIT.