From: R. Bader
To: J3                                                     08-294r2
Subject: Coarray transactions
Date: 2008 November 09
References: J3/08-007r2, WG5/N1753, WG5/N1754

This paper aims to understand, and perhaps also to solve, the problem
discussed in section B. of N1754. The example program in section D. of
N1754 is a bit of a red herring, since deadlock situations with
companion processors or I/O may also occur if only two images are
involved. However, there may indeed be issues with implementing the
present specification efficiently and/or reliably on some common
platforms; hence the suggestions made in this paper.

To set the stage, some diagrams are introduced which illustrate the
situations addressed by the first two bullets of paragraph 6 of 8.5.1:

----------------------------------------------------------------
Legend: P, Q, R are image numbers
        a is a coarray
        S(XY) is a pairwise synchronization which induces
              segment ordering for the two specified images.
        Time goes downward ... whatever that means.
----------------------------------------------------------------

The "normal" situation in coarray communication is the one given by

            P             Q
            |             |
            |             |  a = ...
S(PQ)       ~~~~~~~~~~~~~~~     RAW
            |             |
... = a[Q]  |<------------|
            |             |

with analogous diagrams covering WAW (push instead of pull by P) and
WAR.

The headaches are caused by the exchanges in the lower part of the
following diagram:

            P             Q            R
            |             |            |
            | ... = a[R]  |<-----------|
S(PQ)       ~~~~~~~~~~~~~~~            |   WAR (R "passive")
            |             |            |
a[R] = ...  |------------------------->|
            |             |            |
S(PR)       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            | S(QR)       ~~~~~~~~~~~~~~
            |             |            |
            | ... = a[R]  |<-----------|   RAW (R "passive")
            |             |            |   (or WAW, R "passive", not shown)

The present formulation of the draft standard requires S(PQ) in the
lower part of the above diagram (instead of the S(PR)+S(QR) sequence
drawn above). This implies that the run time must insert the
S(PR)+S(QR) in the form of a memory fence on image R (whose executing
code may at that time indeed be busy quite elsewhere, or stalling).
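
For concreteness, the three-image exchange discussed above might be
coded roughly as in the following sketch. It is an illustration only
and is not taken from N1754 or the draft: the image numbers 1, 2, 3
for P, Q, R, the value 42, and all program and variable names are
assumptions. The sketch spells out the S(PR) and S(QR) steps
explicitly on image R, in that order; under the present draft
formulation, P and Q would instead synchronize only with each other
at that point and R would execute no image control statement there.

```fortran
program three_image_sketch
  implicit none
  integer, parameter :: p = 1, q = 2, r = 3  ! assumed image numbers for P, Q, R
  integer :: a[*]                            ! coarray; only the copy on R is accessed
  integer :: tmp

  if (num_images() < 3) then
    if (this_image() == 1) print *, 'this sketch needs at least 3 images'
    stop
  end if

  a = 0
  sync all                     ! order the initial definition before all accesses

  select case (this_image())
  case (p)
    sync images (q)            ! S(PQ): Q's first read precedes P's write (WAR)
    a[r] = 42                  ! P updates the copy owned by R
    sync images (r)            ! S(PR)
  case (q)
    tmp = a[r]                 ! Q reads the copy owned by R
    sync images (p)            ! S(PQ)
    sync images (r)            ! S(QR)
    tmp = a[r]                 ! RAW: ordered after P's write via R's segments
    print *, 'Q read ', tmp
  case (r)
    sync images (p)            ! S(PR): R never touches a itself here,
    sync images (q)            ! S(QR): but takes part in the ordering
  end select
end program three_image_sketch
```

Note that R executes the two synchronizations as separate statements:
P's write precedes the segment between them, and that segment precedes
Q's second read, which gives the required ordering by transitivity.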

[Side remark: If the suggestions in this paper are rejected, adding a
Note clarifying the intended behaviour for the problematic cases
should be considered.]

Such a run-time-inserted fence on a passive image is similar to a
one-sided MPI call with a passive partner, which, by the way, is not
implemented in, e.g., MPICH2 or Intel MPI. This indicates that there
may indeed be serious reliability/performance issues on off-the-shelf
hardware with sufficiently unintelligent interconnects.

The suggestion made here is that image R should be passive only with
respect to references to a; for the RAW/WAW case it should, like
images P and Q, be required to execute a sync images (/P,Q,R/) (or an
equivalent user-defined ordering). This does imply a change in
semantics, but only for the problematic 3-way diagrams (which, as far
as user education is concerned, should be indicated as being bad
programming practice in most cases anyhow). However, it has the
potential to make the implementation much easier. In particular, an
implementation based on existing, well-defined MPI semantics should
then be possible.

To make the situation clearer in the wording of the standard, it is
suggested to start out from an image "owning" a coarray, and attend
to all conceivable transactions with the owner. If this is done, one
obtains 16 elementary transactions, of which 12 involve updates. Of
these, 9 involve image communication, and of these again, only the 2
already shown above need special treatment. The edits supplied below
describe all transactions and how they are to be treated; for all but
the 2 problematic cases there should be no effective change in
behaviour compared to the present draft.

For those vendors who feel that their run time is capable of handling
the problematic diagrams without the image owning the coarray
requiring an explicit synchronization point, a modifier

   sync images (/P,Q,R/, SYNC_MODE='MEMORY_FENCE')

could be added, which should be specified on the image owning the
coarray only.
The implementation would then be allowed to ignore the sync if the
run time can deal with it automatically. This may even be feasible on
an off-the-shelf system if P, Q, R execute on the same node.


Suggested edits to 08-007r2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[35] add two new paragraphs to 2.5.7:

4+1 A coarray update may happen either on the image owning it (Uo)
    or on a remote image (Ur). Coarray updates comprise
    * the definition or undefinition of a coarray variable either on
      the image owning it (Uo) or on a remote image (Ur).
    * the allocation of an allocatable subobject of a coarray or the
      change of the pointer association of a pointer subobject of a
      coarray by the image owning it (Uo).

Note 2.17+: The definition or undefinition by a remote image of a
noncoarray dummy argument whose effective argument is a coarray is
not considered a coarray update. This is to avoid side effects due to
presence or lack of copy-in/copy-out. Hence, this case receives
special treatment with respect to image execution control (8.5.1).

4+2 A coarray reference from the image owning it is denoted by (Ro);
    a coarray reference from any other image is denoted by (Rr).

[32] add a new section:

2.4.6 Coarray transactions

1 During execution of all images of a program, the sequence of all
  references or updates of a coarray entity owned by a particular
  image (2.5.7) is composed of elementary transactions, where the
  final part of each transaction overlaps with the initial part of
  the next transaction, provided the latter exists.

2 There exist 16 elementary coarray transactions:
  * (Ro) after (Ro), (Ro) after (Rr), (Rr) after (Ro), (Rr) after (Rr)
  * (Ro) after (Uo), (Ro) after (Ur), (Rr) after (Uo), (Rr) after (Ur)
  * (Uo) after (Ro), (Uo) after (Rr), (Ur) after (Ro), (Ur) after (Rr)
  * (Uo) after (Uo), (Uo) after (Ur), (Ur) after (Uo), (Ur) after (Ur)

Note 2.14+1: Each elementary coarray transaction involves 1, 2 or 3
images. The case of 1 image refers to serial execution.
In the case of 3 images, only two images will execute statements that
actually reference or update the coarray.

Note 2.14+2: The selection of each elementary transaction is
performed by the programmer, based on the needs imposed by the
parallel algorithm to be implemented.

[188] replace para6 of 8.5.1 by

6 If two or more images perform an elementary coarray transaction,
  and either no image or exactly one image performs its part of the
  transaction by executing a GET_ATOMIC or SET_ATOMIC intrinsic on
  the coarray,

  (C848+1) the segment of the image owning the coarray shall be
           ordered with respect to the segment of the remote image
           referencing or updating the coarray for the elementary
           transactions (Ro) after (Ur), (Rr) after (Uo),
           (Uo) after (Rr), (Ur) after (Ro), (Uo) after (Ur),
           (Ur) after (Uo).

  (C848+2) the segments of both remote images shall be ordered with
           respect to the image owning the coarray for the elementary
           transactions (Rr) after (Ur), (Ur) after (Ur). If the
           remote images are not the same image, the segments of the
           remote images shall also be ordered with respect to each
           other.

  (C848+3) the segments of two different remote images shall be
           ordered with respect to each other for the elementary
           transaction (Ur) after (Rr).

7 For all elementary transactions by two or more images which contain
  at least one coarray update, and for which at least one image
  executes a procedure in segments Pi, Pi+1, ..., Pk for which the
  coarray is an effective argument associated with a non-coarray
  dummy argument,

  (C848+5) the coarray shall not be referenced or updated on the
           other image Q in a segment Qj unless Qj precedes Pi or
           succeeds Pk.

Note 8.30: Any prescribed image segment ordering must be consistent
with the semantics of the elementary transaction it is associated
with.

Note 8.31: The processor may optimize the execution of a segment as
if it were the only image in execution.
In particular, this applies to coarray transactions which do not
contain updates, and to coarray transactions which contain references
and updates by the owning image only.

Note 8.31+1: The model upon which the interpretation of a program is
based is that there is a permanent memory location for each coarray
and that all images can access it. In practice, an image may make a
copy of a coarray (in cache or a register, for example) and, as an
optimization, defer copying a changed value back to the permanent
location while it is still being used. It is safe to defer this
transfer until the end of the current segment and thereafter to
reload from permanent memory any coarray that was not defined within
the segment. It would not be safe to defer these actions beyond the
end of the current segment since another image might reference the
variable then.

Note 8.31+2: The incorrect sequencing of image control statements can
suspend execution indefinitely. For example, one image might be
executing a SYNC ALL statement while another is executing an ALLOCATE
statement for a coarray.

[189] in 8.5.3, replace R860 with

R860 sync-images-stmt is SYNC IMAGES ( image-set [ , sync-stat-list ]
                                       [ , SYNC_MODE='MEMORY_FENCE' ] )

and add

C850+ SYNC_MODE may only appear if the executing image is the image
      owning all coarrays for whose elementary transactions the
      statement imposes image ordering, and if all transactions
      ordered by the statement are either (Rr) after (Ur) or
      (Ur) after (Ur).

as well as

Note 8.36+: As an optimization, an implementation may ignore a SYNC
IMAGES statement if SYNC_MODE='MEMORY_FENCE' is specified. It then
must guarantee the correct execution of all involved elementary
transactions, as well as avoid a deadlock with respect to the other
images in image-set.


Further comment:
~~~~~~~~~~~~~~~~

The above edits assume that VOLATILE coarrays have been disallowed
and replaced by the atomic statements or intrinsics suggested in
N1753.
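
As a cross-check on the counts quoted earlier in this paper (16
elementary transactions, 12 of which involve updates, 9 of those
involving image communication), the combinations can simply be
enumerated. The following is a sketch only: the program and variable
names are illustrative, and "involves image communication" is taken
here to mean that at least one of the two accesses is made by a
remote image, which is an assumption about the intended meaning.

```fortran
program count_transactions
  implicit none
  ! Access kinds as in the suggested 2.5.7 edits:
  ! R = reference, U = update; o = owning image, r = remote image.
  character(len=2), parameter :: acc(4) = ['Ro', 'Rr', 'Uo', 'Ur']
  integer :: later, earlier
  integer :: n_all, n_update, n_comm
  logical :: has_update, has_remote

  n_all = 0
  n_update = 0
  n_comm = 0
  do later = 1, 4
    do earlier = 1, 4
      ! One elementary transaction: acc(later) after acc(earlier).
      n_all = n_all + 1
      has_update = acc(later)(1:1) == 'U' .or. acc(earlier)(1:1) == 'U'
      has_remote = acc(later)(2:2) == 'r' .or. acc(earlier)(2:2) == 'r'
      if (has_update) n_update = n_update + 1
      if (has_update .and. has_remote) n_comm = n_comm + 1
    end do
  end do

  print '(a,i0)', 'elementary transactions:       ', n_all
  print '(a,i0)', 'of these, involving updates:   ', n_update
  print '(a,i0)', 'of these, with a remote image: ', n_comm
end program count_transactions
```

Under the stated assumption the three counters come out as 16, 12 and
9: the four reference-only combinations are dropped first, and then
the three update combinations in which both accesses are made by the
owning image.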