To: J3                                                     J3/19-158
From: Gary Klimowicz
Subject: Add reductions to DO CONCURRENT
Date: 2019-February-14
Reference: 18-007r1, 18-266r4

Motivation
==========
Performing a reduction on the array elements computed in DO CONCURRENT
can be inefficient for large arrays. The naive expression of a
reduction is non-conforming:

    real :: a(:)
    real :: s
    integer :: i
    ...
    ! With DO CONCURRENT, 's' is referenced in the loop
    ! below before it is defined in that iteration.
    s = 0
    do concurrent (i = 1:size(a))
        s = s + (2 * a(i) - 1)
    end do
    print *, s
    end

Potential solutions that use temporary arrays to accumulate
intermediate results, followed by REDUCE, are inefficient in their use
of data and in the data management they may require in highly threaded
environments. When the programmer is interested in performing a simple
reduction of the data in the DO CONCURRENT, this overhead can exceed
the gains from the computational parallelism.

Also, with knowledge of the reduction, a processor could optimize
DO CONCURRENT by chunking the computations to scale to the number of
threads available at runtime (a number that is not known to the
programmer).

At meeting 215, J3 approved this item for moving forward.

Simple Use Cases
================

    ! With current semantics, the code below is not
    ! conformant. 's' is not defined in the iteration
    ! before it is referenced.
    s = 1000
    DO CONCURRENT (i = 1:SIZE(a))
        s = s + (2 * a(i) - 1)
    END DO
    ! s = 1000 + SUM(2*A-1)

    ! Likewise, 'foo' is not defined before it is referenced.
    foo = a(1)
    DO CONCURRENT (i = 2:SIZE(a))
        foo = MIN(foo, a(i))
    END DO
    ! foo = MINVAL(a)

We would like programs with semantics similar to the above to be
conformant and still be able to exploit all available runtime
parallelism.

Requirements
============
Provide Fortran support for some of the features of the OpenMP
reduction clause.

The proposal should support:

1. Extend DO CONCURRENT to include the addition of scalar "reduction
   variables" and a corresponding "reduction operator" for each
   reduction variable.

2. Provide for the initialization of the reduction variable consistent
   with the reduction operator.

3. Allow multiple reduction variables to be defined for a
   DO CONCURRENT.

4. Support the following reduction operators and initial values:

       OPERATOR    INITIAL VALUE
       --------    -------------
       +           0
       *           1
       .AND.       .TRUE.
       .OR.        .FALSE.
       .EQV.       .TRUE.
       .NEQV.      .FALSE.
       MIN         Largest representable number
       MAX         Smallest representable number
       IAND        NOT(0)
       IOR         0
       IEOR        0

5. Have reduction variable semantics aligned with existing locality
   specifications for DO CONCURRENT.

6. Have behavior that is a good analog to the OpenMP reduction clause.

The proposal need not support:

1. Reduction variables with the ALLOCATABLE, INTENT (IN), or OPTIONAL
   attribute.

Proposal
========
In R1123, insert "concurrent-reduction" between "concurrent-header"
and "concurrent-locality" to provide for the definition of the
reduction variables.

R1123 loop-control is [ , ] do-variable = scalar-int-expr,
                            scalar-int-expr [ , scalar-int-expr ]
                   or [ , ] WHILE ( scalar-logical-expr )
                   or [ , ] CONCURRENT concurrent-header
                            concurrent-reduction concurrent-locality

Add a rule
R11xx concurrent-reduction is [ reduction-spec ] ...

Add a rule
R11yy reduction-spec is REDUCE (reduce-operator : variable-name-list)

Add a rule
R11zz reduce-operator is +
                      or *
                      or .AND.
                      or .OR.
                      or .EQV.
                      or .NEQV.
                      or MIN
                      or MAX
                      or IAND
                      or IOR
                      or IEOR

Add constraints:

C11aa Each variable-name in a reduction-spec shall be a scalar or
      array of intrinsic type.

C11bb Each variable-name in a reduction-spec shall not have the
      ALLOCATABLE, INTENT (IN), or OPTIONAL attribute.

C11cc Each variable-name shall be defined before execution of the
      DO CONCURRENT.

C11dd A reduction variable shall appear only in assignment statements
      of the form
          variable-name = &
              variable-name reduce-operator expression
      or
          variable-name = &
              expression reduce-operator variable-name
      or
          variable-name = &
              reduce-operator(variable-name, expression)
      or
          variable-name = &
              reduce-operator(expression, variable-name)

The variable-name is treated with semantics similar to LOCAL_INIT.
The initial value before execution of each iteration of the
DO CONCURRENT body is based on the reduce-operator as defined in the
table below.

    OPERATOR    INITIAL VALUE
    --------    -------------
    +           0
    *           1
    .AND.       .TRUE.
    .OR.        .FALSE.
    .EQV.       .TRUE.
    .NEQV.      .FALSE.
    MIN         Largest representable number
    MAX         Smallest representable number
    IAND        NOT(0)
    IOR         0
    IEOR        0

Upon completion of the DO CONCURRENT, the value of each variable in
the reduction-spec is its initial value combined (via the
reduce-operator) with the values of the variable at the end of each
iteration of the loop body, performed in any order.

The processor may make assumptions about how to parallelize the
reduction to optimize or eliminate synchronized access to each
variable-name.

Revisited Use Cases
===================

    s = 1000
    DO CONCURRENT (i = 1:size(a)) REDUCE(+:s)
        s = s + (2 * a(i) - 1)
    END DO

    foo = a(1)
    DO CONCURRENT (i = 2:size(a)) REDUCE(MIN:foo)
        foo = MIN(foo, a(i))
    END DO

Benefits
========
Adding REDUCE will:
- Increase the usefulness of DO CONCURRENT
- Increase opportunities to exploit SIMD and other parallelism
- Increase the ability to optimize parallelism according to the
  available threads
- Make it easier to write conforming programs
- Make it easier to check the programmer's intent about reductions
- Reduce memory allocation needs or explicit data management when
  offloading to target processors
- Eliminate explicit synchronization
- Reduce the need for explicit temporaries

Edits
=====
To come.

Straw Vote
==========
JoR would like to propose a straw vote:

1. Should JoR continue developing this proposal?

Assuming we continue with this proposal, where do we draw the line on
capabilities?

   A. Should we relax the constraint about intrinsic types and allow
      derived types?

   B. And therefore, should we allow the definition of reduction
      operators to include user-defined operators?

   C. Should we allow any pure procedure of two arguments to be the
      reduction operator?