To: J3 08-167 From: Bill Long Subject: Reply to J3/08-126 Date: 2008 April 28 References: J3/08-126 At the beginnning of J3 meeting 183 (joint with WG5), a research group from Rice University submitted a paper with comments on coarrays. Given the timing, it was not practical to form a complete reply at the meeting. A subgroup was tasked with forming a reply after the meeting ended. Below is that Reply, written primarily by Bill Long, John Reid, and Aleks Donev. It was sent to the authors of 08-126 by email on March 12, 2008. =============================================================== Response to paper J3/08-126. 12-Mar-2008. The HPC subgroup of J3 appreciates the support of coarrays and critical commentary of the feature in paper J3/08-126, "A Critique of Co-array Features in Fortran 2008". We encourage continued participation by the Rice group in the Fortran standards process. At J3 meeting 183, and the concurrent and collocated WG5 meeting, held February 11-15, 2008 in Las Vegas, the coarray feature in the Fortran 2008 draft was split into two parts. The basic syntax and coarray variable declarations, the SPMD execution model, the memory model based on segments, the SYNC ALL, SYNC IMAGES, and SYNC MEMORY statements, the ALL STOP statement and program termination semantics, and the CRITICAL construct were retained in the base Fortran 2008 proposal. These features represent a minimal coarray facility and correspond, for the most part, to features that have been available and proven in existing implementations. The remaining features, mainly related to TEAMs and collective intrinsic subroutines, were moved to a proposed Technical Report (TR). The current schedule calls for the TR to be published one year after the main standard. This allows more time to develop these features, and allows vendors more time for implementation. At the meeting an editorial proposal was approved that removes the hyphen from the word "co-array" and related terms beginning with "co-". This reply uses this new naming convention. In the context of the revised strategy, following are responses to the comments made in J3 paper 08-126. Locks ----- We found the proposal to add a lock / unlock facility to the standard very compelling, and feel that this functionality is sufficiently useful that it should be added to the base language, rather than being deferred to the TR. We believe two new statements, LOCK and UNLOCK, would provide the needed capability. Statements are preferred to intrinsic subroutines because the semantics for coindexed actual arguments would not apply to a statement. Statements also avoid conflicts with existing user subroutine names, and obviate the need to include the "CALL " syntax. A LOCK statement would require specification of a lock variable of an opaque derived type, and optionally SUCCESS, STAT, and ERRMSG specifiers. If a SUCCESS specifier appears, the LOCK statement could complete even if acquisition of the lock was not successful. If the SUCCESS specifier does not appear, the statement would not complete until the specified lock was acquired. The STATUS and ERRMSG specifiers provide the same functionality as in the SYNC ALL statement. An UNLOCK statement would require specification of a lock variable, and optionally STAT and ERRMSG specifiers. Execution of the UNLOCK statement would release the specified lock. Because of the timing of paper J3/08-126, it was not possible to submit a paper to add LOCK and UNLOCK to the standard at meeting 183. Instead, we ask that your group submit a request for a lock/unlock feature when the window for submitting public comment on the standard opens in the near future. The nature of how comments are processed strongly favors narrowly focused comments in cases where a change is requested to the text of the standard. You are, of course, welcome to submit other comments separately. General reductions ------------------ We found the argument for a generalized facility for performing reductions across images to be compelling. Because this would be implemented as a collective subroutine, it would be proposed as part of the TR. We believe this capability is provided by a CO_REDUCE collective subroutine, with coarray, result, and procedure arguments. The procedure argument would specify a user-supplied function with two arguments, both of the same type and type parameters as the coarray argument, and returning a result with the same type and type parameters as the result argument. If a CO_REDUCE function is added, it is possible that some of the currently documented collective subroutine could be removed as redundant. The current feeling is that the need for CO_SUM is widespread, so the ability to optimize that operation justifies a separate subroutine. The other cases are less clear, and suggestions are welcome as part of a public comment on this issue. Atomic memory operations ------------------------ Some vendors already support atomic memory operations through added intrinsic procedures, so this area is a reasonable candidate for standardization. The common candidates are atomic compare and swap, fetch and integer add, and integer add. In cases where a separate hardware increment instruction is available, the compiler can detect a 1 in the add versions and generate the correct instruction. We also discussed, as a possible feature for f08, an ATOMIC construct that would provide a syntax for atomic operations. That proposal did not make the feature cut for this revision. We feel that the urgency for this feature is somewhat reduced if the lock/unlock statements are available. It is also the case that constructs like critical a[1] = a[1] + 1 end critical can be compiled as atomic memory operations already. Access to the functionality of the fetch and add operation is less obvious, and would benefit from standardization. As with the lock case, the current rules for coindexed arguments are problematic. We might consider exempting certain intrinsic procedures, given that the atomic operations are most naturally represented as procedure references. Atomic memory operations are not necessarily tied to coarrays. For example, they might be used in a DO CONCURRENT construct acting on variables local to the executing image. Therefore, it is not clear that they should be candidates for the TR, or better suited for the base language in this or the next revision. Critical constructs ------------------- Critical constructs protect a code sequence from concurrent execution by more than one image. Examples include the update of a shared variable (see above), or coordinating output to a shared file. They are not necessarily designed as ways to protect data objects. However, they could be extended through the addition of a lock variable on the CRITICAL statement, as in critical (lock_var) end critical that would specify a lock specific to this critical construct, and hence provide locking local to a team of images. The END CRITICAL statement would implicitly release the lock. This is effectively the same as starting the block with a LOCK statement and ending it with an UNLOCK statement, but adds the facilities of a block construct and automatic release of the lock. Assuming that the lock/unlock statements are already present in the language, there should be no extra implementation effort with the above enhancement. If you think this is a useful addition, requesting it through a public comment is appropriate. Since the critical construct is intended to be part of the base language, it would be reasonable to include this enhancement there. Split-phase barriers -------------------- The NOTIFY and QUERY statements were moved to the TR as part of the reorganization described above. As originally designed, these can be used to provide a split barrier. In addition, the SYNC IMAGES statement can be used with a similar effect, particularly for wave-front type computations. In all cases, the barriers identify image numbers. We are open to redesigning the syntax and semantics of NOTIFY and QUERY (and even changing the names) to better accomplish the goals of splitting a barrier operation between a 'signal' and 'wait' phase. For simplicity, we would prefer to have only one set of such statements. SYNC ALL -------- While SYNC ALL could be considered redundant with a special form of SYNC TEAM, we need this capability in the base standard, from which teams are initially excluded. In addition, in existing practice, the use of SYNC ALL is significantly higher than all the other synchronization primitives, so a separate statement seems appropriate. Formation of teams ------------------ The FORM_TEAM intrinsic is allowed (even expected) to perform timings to determine optimal remote reference patterns for subsequent team synchronizations or collective operations. It does not tell you which images are "close" in advance. A new intrinsic function could return an array of the distances, in time, to some or all the other images, or, alternatively, an array of the N closest images. Suggetions on the specific definition of such an intrinsic would be welcome. The result value might be used to form a locally fast team. This could be implemented using some of the same technology as in FORM_TEAM, and would not require changing the simple image number identification scheme. Multiple codimensions --------------------- We discussed the issue of multiple codimensions at meeting 183 and voted to keep them. While they are not applicable in every problem, in the fairly common cases of 2 or 3 dimension decomposition of a large array across a grid of images, they provide a natural accessing mechanism that simplifies code development and maintenance. The collection of cosubscripts directly reduces to an image number, so implementation of multiple codimensions is trivial. Thus, there seems to be little justification to deny availability to users for whom multiple codimensions are useful. Global pointers --------------- Global pointers (in the sense of UPC) or arrays (in the sense of HPF) could be added to Fortran with the addition of a GLOBAL attribute. We see some issues with this direction. One of the basic concepts in the coarray model, that remote references are graphically visible to the programmer through the [] notation, would be lost. The memory model would have to assume that any reference or definition of such an object involves remote data. In the case of global arrays, the question of how the data is distributed arises, though "processor dependent" is probably the simplest option. Given the departure from the current model, we think global pointers and arrays are not appropriate for either the base Fortran 2008 language or the associated TR. If usage experience exposes a justification for this feature, it could be a proposal for the next revision of the standard. Team memory allocations ----------------------- A coarray that is allocated on only a subset of the images (or, equivalently, with a different size on each image) has been discussed in the past, and syntax proposed. However, this capability is already available through allocatable (or pointer) components of coarray structures, so the idea was dropped in the interest of simplicity. The existing allocatable coarrays are required to be the same size on all images to allow their allocation on a symmetric heap. This is a performance advantage since the remote address of a reference can be deduced from locally available data. Allocatable coarrays that do not meet this constraint, including the allocatable components mentioned above, generally require access information from the remote image. This performance hit might be small in cases where large amounts of data are transferred, but could be significant (2x) for scalar references. One could contemplate an environment with multiple symmetric heaps, but we suspect the implementation difficulties of this would result in significant vendor resistance, and might not result in the desired performance. If you feel there is a need for the syntax of top-level asymmetric coarrays, we encourage you to submit a public comment. Given that this capability is related to teams of images, action on this comment would be taken with respect to the TR rather than the base language document. Remote procedure execution -------------------------- The ability for one image to directly initiate execution of a code sequence on a different image was intentionally omitted from the design of Fortran with coarrays. This capability dramatically complicates the memory consistency model and implementation of the language. With the current design, compilers know that the code execution sequence is visible, and hence the usual optimization techniques are available. It is expected that at least some implementations would execute multiple threads per image, but those would be for local automatic parallelism and OpenMP, under local execution control, and hence optimizable. Given Fortran's target audience, the ability to generate optimized code is an overriding concern. With the lock/unlock facility and Fortran's object oriented capabilities, a user could construct a work queue system that would allow an image to post a work request that a remote image could execute after checking the queue. Such a scheme might be reasonable if the alternative is large volumes of data transferred among images. Remote data allocation ---------------------- The ability for one image to execute a memory allocation operation on a different image was intentionally omitted from the design of Fortran with coarrays. The reasons are essentially the same as why remote procedure execution is not supported. The current design does provide a very flexible mechanism, through pointer components of coarrays, that provides access to memory on a different image. The allocation of the memory is always local, as is the pointer association. This significantly simplifies the memory model as well as the implementation. Multi-version variables ----------------------- Variables with specific producer / consumer capabilities make most sense on a hardware platform with many threads per computation core (so some other computation occurs during the wait period) and full/empty bits in the memory hardware (so the synchronization time is minimal). Adding a SYNC attribute to a variable's declaration seems sufficient to support this capability. Such an extension already exists in at least one compiler. While hardware of this nature exists, it is very uncommon. If that were to change in the future, modifying the language to incorporate this capability is straightforward. Without hardware support, software-only implementations would be likely result in disappointing performance.