The compiler is allowed to temporarily modify data in memory. Normally, this problem may occur only when overlapping communication and computation, as in Example 0, Case (b) on page 0. Example 0 also shows a possibility that could be problematic.
Example Overlapping Communication and Computation.
USE mpi_f08
REAL :: buf(100,100)
CALL MPI_Irecv(buf(1,1:100),...req,...)
DO j=1,100
  DO i=2,100
    buf(i,j)=....
  END DO
END DO
CALL MPI_Wait(req,...)
REAL :: buf(100,100), buf_1dim(10000)
EQUIVALENCE (buf(1,1), buf_1dim(1))
CALL MPI_Irecv(buf(1,1:100),...req,...)
tmp(1:100) = buf(1,1:100)
DO j=1,10000
  buf_1dim(j)=...
END DO
buf(1,1:100) = tmp(1:100)
CALL MPI_Wait(req,...)
REAL :: buf(100,100), local_buf(100,100)
CALL MPI_Irecv(buf(1,1:100),...req,...)
local_buf = buf
DO j=1,100
  DO i=2,100
    local_buf(i,j)=....
  END DO
END DO
buf = local_buf ! may overwrite asynchronously received
                ! data in buf(1,1:100)
CALL MPI_Wait(req,...)

In the compiler-generated, possible optimization in Example 0, buf(100,100) from Example 0 is equivalenced with the 1-dimensional array buf_1dim(10000). The nonblocking receive may asynchronously receive the data into the boundary buf(1,1:100) while the fused loop is temporarily using this part of the buffer. When the tmp data is written back to buf, the previous data of buf(1,1:100) is restored and the received data is lost. The principle behind this optimization is that the receive buffer data buf(1,1:100) was temporarily moved to tmp.
Example 0 shows a second possible optimization. The whole array is temporarily moved to local_buf.
Storing local_buf back to the original location buf overwrites the section of buf that serves as the receive buffer in the nonblocking MPI call; this write-back is therefore likely to interfere with the asynchronously received data in buf(1,1:100).
Note that this problem may also occur with the buffers of other asynchronous MPI operations, for example with the window buffer of one-sided communication between two RMA synchronization calls, or with the buffer of a nonblocking or split collective I/O operation between the start of the access and the completing call.
Note also that compiler optimization with temporary data movement should not be prevented by declaring buf as VOLATILE, because the VOLATILE attribute implies that all accesses to any storage unit (word) of buf must be done directly in main memory, exactly in the sequence defined by the application program. The VOLATILE attribute prevents all register and cache optimizations and may therefore cause a huge performance degradation.
Instead of solving the problem, it is better to prevent it: when overlapping communication and computation, the nonblocking communication (or nonblocking or split collective I/O) and the computation should be executed on different variables, and the communication should be protected with the ASYNCHRONOUS attribute. In this case, the temporary memory modifications are done only on the variables used in the computation and cannot have any side effect on the data used in the nonblocking MPI operations.

Rationale. This is a strong restriction for application programs. To weaken this restriction, a new or modified asynchronous feature in the Fortran language would be necessary: an asynchronous attribute that can be used on parts of an array and together with asynchronous operations outside the scope of Fortran. If such a feature becomes available in a future edition of the Fortran standard, then this restriction may also be weakened in a later version of the MPI standard. (End of rationale.)
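The recommended prevention pattern described above can be sketched as follows. This is a non-normative sketch, not one of the numbered examples; the variable names recv_buf and work and the communication arguments are illustrative only:

USE mpi_f08
REAL, ASYNCHRONOUS :: recv_buf(100)  ! used only by the nonblocking receive
REAL :: work(100,100)                ! used only by the overlapped computation
TYPE(MPI_Request) :: req
CALL MPI_Irecv(recv_buf,...req,...)
! The computation touches only work; any temporary data movement the
! compiler applies to work cannot affect recv_buf.
DO j=1,100
  DO i=2,100
    work(i,j)=....
  END DO
END DO
CALL MPI_Wait(req,...)

Because recv_buf and work are disjoint variables and recv_buf carries the ASYNCHRONOUS attribute, the compiler must not move recv_buf temporarily while the receive may still be pending.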
In Example 0 (which is a solution for the problem shown in Example 0) and in Example 0 (which is a solution for the problem shown in Example 0), the array is split into an inner part and a halo part, and both disjoint parts are passed to a subroutine separated_sections. This routine overlaps the receiving of the halo data with the calculations on the inner part of the array. In a second step, the whole array is used to do the calculation on the elements where inner+halo is needed. Note that the halo and the inner area are strided arrays. These can be used in nonblocking communication only with a TS 29113 based MPI library.
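The shape of such a separated_sections routine can be sketched as follows. This is only an illustrative outline of the scheme described above, not the referenced example itself; the dummy-argument names and omitted arguments are assumptions:

SUBROUTINE separated_sections(halo, inner)
  USE mpi_f08
  REAL, ASYNCHRONOUS :: halo(:)  ! strided halo section: receive buffer
  REAL :: inner(:,:)             ! disjoint inner section: computation only
  TYPE(MPI_Request) :: req
  CALL MPI_Irecv(halo,...req,...)
  ! ... computation on inner only; no side effect on halo possible ...
  CALL MPI_Wait(req,...)
END SUBROUTINE separated_sections

Passing the strided halo section as an assumed-shape ASYNCHRONOUS dummy argument is exactly the case that requires a TS 29113 based MPI library, since the nonblocking call must not operate on a compiler-generated temporary copy of the strided section.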