J3/06-121 To: J3 From: Aleksandar Donev & Craig Rasmussen Date: January 27, 2006 Subject: Collective assignments in CAF Note: This proposal assumes that the CoArray Fortran proposal is accepted and Fortran 2008 is an explicitly parallel language. 1. Background. We discuss here the possibilities for future extensions of the base CoArray Fortran (CAF) proposal to enable efficient implementation of data transfers involving multiple images. As an example, consider a near-neighbor pattern of communication among n images where each image calculates the average of the values from its left and right neighbor (without worrying about boundaries for now): real :: a[*] ! A co-scalar a[2:n-1]=(a[1:n-2]+a[3:n])/2 The CAF proposal that is proposed for integration into Fortran 2008 does not allow co-sections, and it also does not allow defining a variable while it is being referenced by other images. Therefore, this example would be written as: me=this_image() tmp=(a[me-1]+a[me+1])/2 sync_all() a=tmp There are two aspects of CAF that make it attractive for efficient programming: 1) Explicit control over data distribution (not in OpenMP and implicit or coarse in HPF) 2) Explicit control over computation distribution (left in compiler hands in OpenMP and HPF) These are strengths of CAF. However, in many cases the appropriate distribution of computation is directly obvious from the data distribution, and thus a compiler could easily do it automatically. For example, a[1]=1 obviously should be executed by image 1 if there is a choice. In other cases, a compiler could do a better-informed (platform-specific) computation distribution or communication optimization than the programmer when collective operations are involved. The goal of this paper is to investigate the type of extensions to CAF needed to allow the programmer to more directly express collective statements, letting the compiler actually decide how exactly to execute the computation partitioning and data transfer. 1. Co-Sections of Co-Arrays The first step is the incorporation of designators involving section triplets in the co-dimension into the standard. Consider the co-array real :: b(10)[*] and the (data reference) b(2:10:2)[2:5]. There are two main options: 1) This variable is the usual kind of Fortran array section, i.e., it acts like a local array, not a co-array. Essentially, the semantics are as if there is a temporary local array holding a copy of the data from images 2-5. It is of rank two, and its shape is (5,3), i.e., the rank is the sum of the rank and co-rank and the shape is the concatenation of the two shapes. This variable can be used in expressions, as an actual argument, etc., possibly with some restrictions. 2) This variable is a new kind of array section, a co-section, which contains information about its distribution, i.e., the set of images it resides on (in this case 2-5). We adopt the first option, because it is simpler and seems to fit well into the standard. 2. Serial Assignment Statements. Assignment statements in CAF are serial, in the sense that they are executed by an image, not a collection of images, and the programmer has control over which images executes it. In fact, the programmer must limit concurrent executions so that multiple images do not define/reference a variable without suitable synchronization (ordering). For example: if(iam==1) b(:)[2:]=b(:)[2:]+b(:) Image 1 first collects from all the processors the rank-2 array a(:)[2:] into a local copy, it adds its own local values, and then writes the resulting values to a(:)[2:]. If the programmer knew that all images would execute this portion of the code, than they could write: if(iam>1) b=b+b(:)[1] which says that all images get the value of b(:)[1] from image 1, and then they increment their local array accordingly. We want to give the programmer the ability to have such complete control over the execution sequence. However, this can also be a hassle for the programmer, who instead of focusing on what happens to the value of an array (data-flow) must focus on who evaluates or changes the value (control-flow). Additionally, it does not allow for platform-specific compiler optimizations and instead forces the programmer to do the assignment of tasks to images. 3. Collective Block. We introduce the concept of a collective statement to CAF: a statement that is executed cooperatively by a set of images. For this purpose, we introduce a new block construct, the COLLECTIVE construct: COLLECTIVE() END COLLECTIVE Here the image team is an array listing a set of images, or * for all images. The block of statements should be limited to assignment statements and control-flow constructs, for now. The semantics of the collective block is such that all the images that are members of the team must execute this block. Other images that reach this block simply skip over it---only the images listed actually execute the block. The block is executed by the team of images collectively, however, the semantics is identical to only one of the images in the team actually executing the statements. In fact, that would be a valid, though not efficient, implementation of the COLLECTIVE block. The idea is to allow the compiler to actually implement this semantics more efficiently using the knowledge that all images in the team will execute this block. Shared-memory parallelism models like OpenMP follow the fork-join model, where most of the time the execution is joined (single image) and in special sections it is forked (multiple images). With the collective block proposed above we have a reverse model for CAF: Most of the time the executions on different images procede independently, but in the critical blocks they are merged into a (semantically) single execution sequence. 4. Collective assignments. We now have all the ingredients for collective assignment statements. No new syntax or change of the semantics of the assignment statement is introduced, instead, one merely puts the assignment inside a collective block. For example: real :: a[*] collective(*) a[2:n-1]=(a[1:n-2]+a[3:n])/2 a[1]=(a[n]+a[2])/2 a[n]=(a[n-1]+a[1])/2 end collective or collective(*) b(:)[2:]=b(:)[2:]+b(:)[1] end collective or other types of collective data transfer. The compiler can translate these the best it can. 5. Future Extensions. It is also important to allow the user to write collective procedures, meaning procedures that must be executed by a team of images (similar to the collective intrinsics or the synchronization routines such as SYNC_ALL). While the body of the procedure could easily be placed inside a COLLECTIVE block, it might also be useful to allow the COLLECTIVE specification to appear in the interface of the procedure in some way.