To: J3 J3/19-198r2 From: Gary Klimowicz Subject: Add reductions to DO CONCURRENT (US20) Date: 2019-August-07 Reference: 18-007r1, 18-266r4, 19-158 Motivation ========== Performing a reduction on the array elements computed in DO CONCURRENT can be inefficient for large arrays. The naive expression of a reduction is non-conforming: real :: a(:) real :: s integer :: i ... ! With DO CONCURRENT, the use of 's' in the ! loop below is not defined before it is referenced. s = 0 do concurrent (i = 1:size(a)) s = s + (2 * a(i) - 1) end do print *, s end Potential solutions using temporary arrays (used to accumulate intermediate results) followed by REDUCE are inefficient in the use of data and the potential data management required in highly threaded environments. When the programmer is interested in performing a simple reduction of the data in the DO CONCURRENT, this overhead can exceed the gains from the computational parallelism. Also, with knowledge of the reduction, a processor could optimize DO CONCURRENT to be scaled to the number of threads available at runtime (the number of which may not be known to the programmer), chunking computations. At meeting 218, J3 approved continued work on this item. Note that this paper only describes REDUCE for scalar variables. Once we get the semantics of that right we can expand this to cover REDUCE on non-scalar variables. Simple Use Cases ================ ! With current semantics, the code below is not ! conformant. 's' is not defined in the iteration ! before it is referenced. s = 1000 DO CONCURRENT (i = 1:SIZE(a)) s = s + (2 * a(i) - 1) END DO ! s = 1000 + SUM(2*A-1) ! Likewise, with 'foo', not defined before used. foo = a(1) DO CONCURRENT (i = 2:SIZE(a)) foo = MIN(foo, a(i)) END DO ! foo = MINVAL(a) We would like programs with semantics similar to the above to be conformant and still be able to exploit all available runtime parallelism. Requirements ============ Provide Fortran support for some of the features of the OpenMP reduction clause on numeric and logical intrinsic types. The proposal should support: 1. Extend DO CONCURRENT to include the addition of scalar "reduction variables" and corresponding "reduction operators" on the reduction variables. 2. Allow multiple reduction variables to be defined for a DO CONCURRENT. 3. Support intrinsic binary and n-ary operators as reduction operators: + * .AND. .OR. .EQV. .NEQV. MIN MAX IAND IOR IEOR 4. Provide for the initialization of the reduction variables consistent with the reduction operators. 5. Have reduction variable semantics aligned with existing locality specifications for DO CONCURRENT. 6. Have behavior that is a good analog to the OpenMP reduction clause. The proposal need not support: 1. Reduction variables with the ALLOCATABLE, POINTER, INTENT (IN), or OPTIONAL attribute. Proposal ======== Add a new locality-spec to describe reduction variables. These reduction variables are local to each iteration of the loop, initialized with a suitable value upon entry to the loop, but are related to a defined instance of the reduction variable in the scope outside the loop. This proposal is limited to scalar reduction variables of numeric and logical type. In the future, we may consider extending this proposal to include - reduction variables of arrays of numeric and logical type; - reduction variables of derived types; - additional intrinsic operators and procedures - user-defined PURE functions as operators; - operators and procedures of derived types. Syntax change for locality-spec: R1129 locality-spec is LOCAL ( variable-name-list ) or LOCAL_INIT ( variable-name-list ) or SHARED ( variable-name-list ) or DEFAULT ( NONE ) or REDUCE (reduce-op : reduce-var-name-list) Define a limited set of binary operators and procedures that can be used to combine values of the reduction variables. Syntax for the binary operators and procedures: R1129a reduce-op is binary-reduce-op or function-reduce-op R1129b binary-reduce-op is + or * or .AND. or .OR. or .EQV. or .NEQV. R1129c function-reduce-op is MIN or MAX or IAND or IOR or IEOR Add a rule to define reduction variable names R1129d reduce-var-name is scalar-variable-name Add constraints for the reduction variables: - it must be scalar - it must not be a pointer - it must not be allocatable - it must not be an optional dummy argument that is not present - it must be permitted to appear in a variable definition context Limit the definition of reduction variables in each iteration to make the implementation less complex. Allow: var = var op expression var = expression op var var = function (var, expression) var = function (expression, var) for suitable values of op and function. Upon start of an iteration, the reduction variables are given default initial values based on the reduction operator, type and kind of the variable. Table T1 below summarizes these for the operators defined above. Upon completion of the DO CONCURRENT, the value of each reduction variable is as if the value of the reduction variable before execution of the loop combined (via the reduction operator) with the values of the reduction variable from each iteration of the DO CONCURRENT. This combination could be performed in any order. The processor may make assumptions about how to parallelize the DO CONCURRENT iterations to optimize or eliminate synchronized access to the reduction variables. Table T1. The types of the allowed operands and default initial values for each reduction operator and procedure. The types use the type designators used found in Table 10.2. OPERATOR INITIAL VALUE ALLOWED TYPES -------- ------------- ------------- + 0 I, R, Z * 1 I, R, Z .AND. .TRUE. L .OR. .FALSE. L .EQV. .TRUE. L .NEQV. .FALSE. L MIN nnlmsbtp * I, R MAX pnlmsbtp ** I, R IAND NOT(0) I IOR 0 I IEOR 0 I * nnlmsbtp = the value of the negative number of the largest magnitude supported by the processor for numbers of the type and kind of the reduction variable. ** pnlmsbtp = the value of the positive number of the largest magnitude supported by the processor for numbers of the type and kind of the reduction variable. Revisited Use Cases =================== s = 1000 DO CONCURRENT (i = 1:size(a)) REDUCE(+:s) s = s + (2 * a(i) - 1) END DO foo = a(1) DO CONCURRENT (i = 2:size(a)) REDUCE(MIN: foo) foo = MIN(foo, a(i)) END DO Benefits ======== Adding REDUCE will: - Increase the usefulness of DO CONCURRENT - Increase opportunities to exploit SIMD and other parallelism - Increase ability to optimize parallelism according to the available threads - Make it easier to write conforming programs - Make it easier to check the programmer's intent about reductions - Reduce memory allocation needs or explicit data management when offloading to target processors - Eliminate explicit synchronization - Reduce need for explicit temporaries Questions ========= 1. Should we remove the restriction on forms of use of the reduction variable in the loop iterations? var = var op expression var = expression op var var = function (var, expression) var = function (expression, var) Edits ===== To come.