From: R. Bader
To: J3                                                           08-294r2
Subject: Coarray transactions
Date: 2008 November 09
References: J3/08-007r2, WG5/N1753, WG5/N1754

This paper aims to understand, and perhaps also to solve the problem
discussed in section B. of N1754.
The example program in section D. of N1754 is a bit of a red herring
since deadlock situations with companion processors or I/O may also
occur if only two images are involved. However there may indeed be
issues with implementing the present specification efficiently and/or
reliably on some common platforms; hence the suggestions made in this
paper. To set the stage, some diagrams are introduced which
illustrate the situations addressed by the first two bullets of
paragraph 6 of 8.5.1:

----------------------------------------------------------------
Legend:
P, Q, R are image numbers
a is a coarray
S(XY) is a pairwise synchronization which induces segment
ordering for the two specified images.
Time goes downward ... whatever that means.
----------------------------------------------------------------

The "normal" situation in coarray communication is the one
given by


              P             Q
              |             |
              |             | a = ...
        S(PQ) ~~~~~~~~~~~~~~~                RAW
              |             |
 ... = a[Q]   |<------------|
              |             |


and further diagrams covering WAW (push instead of pull by P), WAR.
The headaches are caused by the exchanges in the lower part of
the following diagram:


              P             Q            R
              |             |            |
              |  ... = a[R] |<-----------|
        S(PQ) ~~~~~~~~~~~~~~~            |   WAR (R "passive")
              |             |            |
 a[R] = ...   |------------------------->|
              |             |            |
        S(PR) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              |       S(QR) ~~~~~~~~~~~~~~
              |             |            |
              |  ... = a[R] |<-----------|   RAW (R "passive")
              |             |            |   (or WAW, R "passive",
                                              not shown)

The present formulation of the draft standard requires S(PQ) in the
lower part of the above diagram (instead of the S(PR)+S(QR) sequence
drawn above). This implies that the run time must insert the
S(PR)+S(QR) in form of a memory fence on image R (whose executing
code may at that time indeed be busy quite elsewhere, or stalling).

[Side remark: If the suggestions in this paper are rejected,
 adding a Note clarifying the intended behaviour for the
 problematic cases should be considered.]

This is similar to a one-sided MPI call with a passive partner,
which is, by the way, not implemented e.g., in MPICH2 or Intel MPI,
indicating that there indeed may be serious reliability/performance
issues on off-the-shelf hardware with sufficiently unintelligent
interconnects.

The suggestion made here is that image R should only be passive with
respect to references to a, but not with respect to requiring a

sync images (/P,Q,R/)

(or equivalent user-defined ordering) for the RAW/WAW case, as images
P and Q do. This does imply a change in semantics, but only for the
problematic 3-way diagrams (which, as far as user education is
concerned, should be indicated as being bad programming practice in
most cases anyhow). However, it has the potential to make the
implementation much easier. In particular, an implementation based
on existing, well-defined MPI semantics should then be possible.

To make the situation more clear in the wording of the standard,
it is suggested to start out from an image "owning" a coarray,
and attend to all conceivable transactions with the owner. If this
is done, one obtains 16 elementary transactions, of which 12 involve
updates. Of these, 9 involve image communication, and of these again,
only the 2 already shown above need special treatment. The edits
supplied below describe all transactions and how they are to be
treated; for all but the 2 problematic cases there should be
no effective change in behaviour compared to the present draft.

For those vendors who feel that their run time is capable of handling
the problematic diagrams without the image owning the diagram
requiring an explicit synchronization point, a modifier

sync images (/P,Q,R/, SYNC_MODE='MEMORY_FENCE')

could be added, which should be specified on the image owning the
coarray only. The implementation would then be allowed to ignore
the sync in case the run time can deal with it automatically.
This may even be feasible on an off-the-shelf system if P,Q,R
execute on the same node.


Suggested edits to 08-007r2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[35] add two new paragraphs to 2.5.7:

4+1 A coarray update may happen on either the image owning it (Uo),
    or a remote image (Ur). Coarray updates comprise
    * the definition or undefinition of a coarray variable either
      on the image owning it (Uo), or a remote image (Ur).
    * the allocation of an allocatable subobject of a coarray or
      the change of the pointer association of a pointer subobject
      of a coarray by the image owning it (Uo).

    Note 2.17+:

    The definition or undefinition by a remote image of a noncoarray
    dummy argument whose effective argument is a coarray is not
    considered a coarray update. This is to avoid side effects
    due to presence or lack of copy-in/copy-out. Hence, this case
    receives special treatment with respect to image execution
    control (8.5.1).

4+2 A coarray reference from the image owning it is denoted by (Ro);
    a coarray reference from any other image is denoted by (Rr).

[32] add a new section:

2.4.6 Coarray transactions

1 During execution of all images of a program, the sequence of all
  references or updates of a coarray entity owned by a particular
  image (2.5.7) is composed of elementary transactions, where the
  final part of each transaction overlaps with the initial part of
  the next transaction, provided the latter exists.

2 There exist 16 elementary coarray transactions:
  * (Ro) after (Ro), (Ro) after (Rr), (Rr) after (Ro), (Rr) after (Rr)
  * (Ro) after (Uo), (Ro) after (Ur), (Rr) after (Uo), (Rr) after (Ur)
  * (Uo) after (Ro), (Uo) after (Rr), (Ur) after (Uo), (Ur) after (Rr)
  * (Uo) after (Uo), (Uo) after (Ur), (Ur) after (Uo), (Ur) after (Ur)

  Note 2.14+1:

  Each elementary coarray transaction involves 1, 2 or 3 images.
  The case of 1 image refers to serial execution. In the case of
  3 images, only two images will execute statements that actually
  reference or update the coarray.

  Note 2.14+2:

  The selection of each elementary transaction is performed by the
  programmer, based on the needs imposed by the parallel algorithm
  to be implemented.


[188] replace para6 of 8.5.1 by

6 If two or more images perform an elementary coarray transaction, and
  either no or exactly one image performs its part of the transaction
  by executing a GET_ATOMIC or SET_ATOMIC intrinsic on the coarray,

  (C848+1) the segment of the image owning the coarray shall be
           ordered with respect to the segment of the remote image
           referencing or updating the coarray for the elementary
           transactions
           (Ro) after (Ur), (Rr) after (Uo),
           (Uo) after (Rr), (Ur) after (Uo),
           (Uo) after (Ur), (Ur) after (Uo).

  (C848+2) the segments of both remote images shall be ordered
           with respect to the image owning the coarray for
           the elementary transactions (Rr) after (Ur),
           (Ur) after (Ur). If the remote images are not the
           same image, the segments of the remote images shall
           also be ordered with respect to each other.

  (C848+3) the segments of two different remote images shall be
           ordered with respect to each other for the elementary
           transaction (Ur) after (Rr).

7 For all elementary transactions by two or more images which
  contain at least one coarray update, and for which at least one
  image executes a procedure in segments Pi, Pi+1 , ..., Pk for which
  the coarray is an effective argument associated with a non-coarray
  dummy argument,

  (C848+5)  the coarray shall not be referenced or updated on the
            other image Q in a segment Qj unless Qj precedes Pi or
            succeeds Pk.

  Note 8.30:

  Any prescribed image segment ordering must be consistent with the
  semantics of the elementary transaction it is associated with.

  Note 8.31

  The processor may optimize the execution of a segment as if it were
  the only image in execution. In particular, this applies to
  coarray transactions which do not contain updates, and to
  coarray transactions which contain references and updates
  by the owning image only.

  Note 8.31+1:
  The model upon which the interpretation of a program is based is
  that there is a permanent memory location for each coarray and that
  all images can access it. In practice, an image may make a copy of
  a coarray (in cache or a register, for example) and, as an
  optimization, defer copying a changed value back to the permanent
  location while it is still being used. It is safe to defer this
  transfer until the end of the current segment and thereafter to
  reload from permanent memory any coarray that was not defined within
  the segment. It would not be safe to defer these actions beyond the
  end of the current segment since another image might reference the
  variable then.

  Note 8.31+2:
  The incorrect sequencing of image control statements can suspend
  execution indefinitely. For example, one image might be executing
  a SYNC ALL statement while another is executing an ALLOCATE
  statement for a coarray.

[189] in 8.5.3, replace R860 with

  R860 sync-images-stmt is
       SYNC IMAGES ( image-set [ , sync-stat-list ]
                     [ , SYNC_MODE='MEMORY_FENCE' ]  )

and add

  C850+ SYNC_MODE may only appear if the executing image is the
        image owning all coarrays for whose elementary transactions
        the statement imposes image ordering, and if all transactions
        ordered by the statement are either (Rr) after (Ur) or (Ur)
        after (Ur).

as well as

  Note 8.36+

  As an optimization, an implementation may ignore a SYNC IMAGES
  statement if SYNC_MODE='MEMORY_FENCE' is specified. It then must
  guarantee the correct execution of all involved elementary
  transactions, as well as avoid a deadlock with respect to
  the other images in image-set.


Further comment:
~~~~~~~~~~~~~~~~

  The above edits assume that VOLATILE coarrays have been disallowed
  and replaced by the atomic statements or intrinsics suggested in
  N1753.