To: J3 J3/18-266r2 From: Gary Klimowicz Subject: Add reductions to DO CONCURRENT Date: 2018-October-16 Reference: 18-007: 11.1.7 (DO construct) and 16.9.161 (REDUCE) I Motivation ------------ Performing a reduction on the array elements computed in DO CONCURRENT can be inefficient for large arrays. The naive expression of a reduction is non-conforming: real :: a(:) real :: s integer :: i ... ! With DO CONCURRENT, the use of 's' in the ! loop below is not defined before referenced. s = 0 do concurrent (i = 1:size(a)) s = s + (2 * a(i) - 1) end do print *, s end Potential solutions using temporary arrays (used to accumulate intermediate results) followed by REDUCE are inefficient in the use of data and potential data management require in highly threaded environments. When the programmer is interested in performing a simple reduction of the data in the DO CONCURRENT, this overhead can exceed the gains from the computational parallelism. Also, with knowledge of the reduction, a processor could optimize DO CONCURRENT to be scaled to the number of available threads available at runtime (the number of which are not known to the programmer), chunking computations. II Use Cases ------------ ! With current semantics, the code below is not ! conformant. 's' is not defined in the iteration ! before it is referenced. s = 0 do concurrent (i = 1:size(a)) s = s + (2 * a(i) - 1) end do foo = a(1) do concurrent (i = 2:size(a)) foo = MIN(foo, a(i)) end do We would like programs with semantics similar to the above to be conformant and still be able to exploit all available runtime parallelism. III What we have in mind ------------------------ The intent is to support an in-Fortran specification of some of the features of the OpenMP reduction clause. We propose a way of specifying scalar reduction variables and operators that are managed across the DO CONCURRENT block executions to accumulate scalar values from each iteration. This would be done with processor-specific synchronization. We propose something (example syntax only): do concurrent concurrent-header concurrent-locality ... end do We propose extending concurrent-locality to include REDUCE (reduce-variable-name : reduce-operator) The reduce-variable-name is treated with semantics similar to SHARED, except that the processor may make assumptions about how to parallelize the reduction to optimize or eliminate synchronized access to the reduce-variable-name. The reduce-operator is limited to the following associative intrinsic operators or functions: + * .AND. .OR. MIN MAX The reduce-variable can only appear on the left-hand-side of statements that look like reduce-variable-name = reduce-variable-name & reduce-operator expression or reduce-variable-name = & reduce-operator(reduce-variable-name, expression) This will: * Increase the usefulness of DO CONCURRENT * Increase opportunities to exploit SIMD and other parallelism * Increase ability to optimize parallelism according to the number of available threads * Make it easier to write conforming programs * Make it easier to check the programmer's intent about the reduction * Reduce memory allocation needs or explicit data management when offloading to target processors * Eliminate explicit process synchronization Straw votes or questions for discussion --------------------------------------- 1. Should we further restrict or expand the number of operators? 2. Should we have language that limits the s = s + ... form of the reduction variable in the loop body? 3. Should we allow multiple reductions per DO CONCURRENT? 4. Should we consider something a bit more general, like a SHARED_UPDATE locality specifier? 5. Should this be limited to scalars, or also allow array reductions?