The compiler is allowed to temporarily modify data in memory. Normally, this problem may occur only when overlapping communication and computation, as in Example Solutions, Case (b). Example Temporary Data Movement and Temporary Memory Modification also shows a possibility that could be problematic.
Example: Overlapping communication and computation.
Example: The compiler may substitute the nested loops through loop fusion.
Example: Another optimization is based on the usage of a separate memory storage area, e.g., in a GPU.
In the possible compiler-generated optimization in Example Temporary Data Movement and Temporary Memory Modification, buf(100,100) from Example Temporary Data Movement and Temporary Memory Modification is equivalenced with the 1-dimensional array buf_1dim(10000). The nonblocking receive may asynchronously deliver data into the boundary buf(1,1:100) while the fused loop is temporarily using this part of the buffer. When the tmp data is then written back to buf, the previous contents of buf(1,1:100) are restored and the received data is lost. The principle behind this optimization is that the receive buffer data buf(1,1:100) was temporarily moved to tmp.
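The lost update described above can be reproduced deterministically without any MPI library. The following is only a sketch under stated assumptions: the asynchronous receive is replaced by a plain assignment placed between the save and the write-back, the values 0.0, 2.0, and 42.0 are arbitrary, and the program name is hypothetical. In a real program the message could arrive at any time between the nonblocking receive call and the matching wait, so the loss would be timing dependent.

```fortran
program lost_receive_demo
  implicit none
  real :: buf(100,100), buf_1dim(10000), tmp(100)
  equivalence (buf(1,1), buf_1dim(1))
  integer :: h

  buf = 0.0                       ! old contents of the receive buffer

  ! Step 1: the transformed code saves the boundary row.
  tmp(1:100) = buf(1,1:100)

  ! Step 2: fused loop over the equivalenced 1-dimensional view. It also
  ! writes the boundary elements, which the nested loop would have skipped.
  do h = 1, 10000
    buf_1dim(h) = 2.0
  end do

  ! Step 3 (assumption, stand-in for MPI): the "received" data arrives now.
  buf(1,1:100) = 42.0

  ! Step 4: the write-back restores the saved boundary and discards the data.
  buf(1,1:100) = tmp(1:100)

  ! Self-checks: the boundary holds the old values, the interior the new ones.
  if (any(buf(1,1:100) /= 0.0)) error stop "boundary was not restored"
  if (any(buf(2:100,:) /= 2.0)) error stop "interior was not updated"
  print *, "boundary after write-back:", buf(1,1), "(received value was lost)"
end program lost_receive_demo
```

The same sequence of steps applies to the local_buf variant: replacing tmp by a copy of the whole array and the fused loop by a loop on that copy leads to the same overwrite at the final write-back.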
Example Temporary Data Movement and Temporary Memory Modification shows a second possible optimization. The whole array is temporarily moved to local_buf.
Storing local_buf back to the original location buf overwrites the section of buf that serves as the receive buffer in the nonblocking MPI call; this write-back is therefore likely to interfere with data received asynchronously in buf(1,1:100).
Note that this problem may also occur:
Note also that the methods
Note also that compiler optimization with temporary data movement should not be prevented by declaring buf as VOLATILE, because the VOLATILE attribute implies that every access to any storage unit (word) of buf must be made directly in main memory, exactly in the sequence defined by the application program. The VOLATILE attribute prevents all register and cache optimizations and may therefore cause a huge performance degradation.
Instead of solving the problem, it is better to prevent the problem:
when overlapping communication and computation,
the nonblocking communication (or nonblocking or split collective I/O)
and the computation should be executed on different variables,
and the communication should be protected with the
ASYNCHRONOUS attribute.
In this case, the temporary memory modifications are done
only on the variables used in the computation and cannot have any
side effect on the data used in the nonblocking MPI operations.
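A minimal sketch of this recommendation, assuming an MPI library that provides the mpi_f08 module: the array names halo, inner, and sendbuf, the message contents, and the use of self-communication on MPI_COMM_SELF (chosen only so that the sketch runs on a single process) are illustrative assumptions and are not taken from the examples of this section.

```fortran
program overlap_separate_vars
  use mpi_f08
  implicit none
  ! Buffers touched by nonblocking MPI operations carry ASYNCHRONOUS.
  real, asynchronous :: halo(100)      ! receive buffer only
  real, asynchronous :: sendbuf(100)   ! send buffer only
  real :: inner(99,100)                ! used only by the computation
  type(MPI_Request) :: reqs(2)
  integer :: i, j

  call MPI_Init()
  sendbuf = 1.0
  halo    = 0.0
  inner   = 0.0

  ! Start the communication on the dedicated buffers.
  call MPI_Irecv(halo,    100, MPI_REAL, 0, 0, MPI_COMM_SELF, reqs(1))
  call MPI_Isend(sendbuf, 100, MPI_REAL, 0, 0, MPI_COMM_SELF, reqs(2))

  ! Overlapped computation: it references inner only, so any temporary
  ! copy made by the compiler cannot cover halo or sendbuf.
  do j = 1, 100
    do i = 1, 99
      inner(i,j) = real(i + j)
    end do
  end do

  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE)

  ! Self-check: the received data is intact after the computation.
  if (any(halo /= 1.0)) error stop "received data was modified"
  print *, "halo(1) =", halo(1), " inner(99,100) =", inner(99,100)
  call MPI_Finalize()
end program overlap_separate_vars
```

Because the halo is stored in its own contiguous array here, the sketch does not depend on support for strided array sections in nonblocking calls.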
Rationale.
This is a strong restriction for application programs.
To weaken this restriction, a new or modified asynchronous feature
in the Fortran language would be necessary:
an asynchronous attribute that can be used on parts of an array
and together with asynchronous operations outside the scope of Fortran.
If such a feature becomes available in a future edition of the Fortran standard,
then this restriction also may be weakened in a later version
of the MPI standard.
(End of rationale.)
In Example Permanent Data Movement
(which is a solution for the problem shown in
Example Solutions)
and in Example Comparison with C
(which is a solution for the problem shown in
Example Temporary Data Movement and Temporary Memory Modification),
the array is split into an inner part and a halo part,
and both disjoint parts are passed to a subroutine separated_sections.
This routine overlaps the receiving of the halo data and the calculations
on the inner part of the array.
In a second step, the whole array is used to do
the calculation on the elements where inner+halo is needed.
Note that the halo and the inner area are strided arrays.
Those can be used in nonblocking communication
only with a Fortran 2018 (or TS 29113) based MPI library.