To: J3                                                     J3/18-266r3
From: Gary Klimowicz
Subject: Add reductions to DO CONCURRENT
Date: 2018-October-17

Reference: 18-007: 11.1.7 (DO construct) and 16.9.161 (REDUCE)

I Motivation
------------
Performing a reduction on the array elements computed
in DO CONCURRENT can be inefficient for large arrays.
The naive expression of a reduction is non-conforming:

        real :: a(:)
        real :: s
        integer :: i

        ...

        ! With DO CONCURRENT, the use of 's' in the
        ! loop below is not defined before referenced.
        s = 0
        do concurrent (i = 1:size(a))
          s = s + (2 * a(i) - 1)
        end do

        print *, s
        end

Potential solutions using temporary arrays (used to
accumulate intermediate results) followed by REDUCE are
inefficient in the use of data and potential data
management require in highly threaded environments.

When the programmer is interested in performing a simple
reduction of the data in the DO CONCURRENT, this overhead
can exceed the gains from the computational parallelism.

Also, with knowledge of the reduction, a processor could optimize
DO CONCURRENT to be scaled to the number of available threads
available at runtime (the number of which are not known to the
programmer), chunking computations.

At meeting 215, J3 approved this item for moving forward.


II Use Cases
------------
    ! With current semantics, the code below is not
    ! conformant. 's' is not defined in the iteration
    ! before it is referenced.
    s = 1000
    DO CONCURRENT (i = 1:SIZE(a))
      s = s + (2 * a(i) - 1)
    END DO
    ! s = 1000 + SUM(2*A-1)

    ! Likewise with 'foo', not defined before used.
    foo = a(1)
    DO CONCURRENT (i = 2:SIZE(a))
      foo = MIN(foo, a(i))
    END DO
    ! foo = minimum of every element of a

We would like programs with semantics similar to the above to be
conformant and still be able to exploit all available runtime
parallelism.


III What we have in mind
------------------------
The intent is to support an in-Fortran specification of some of the
features of the OpenMP reduction clause.

We propose a way of specifying scalar and array reduction variables
and operators that are managed across the DO CONCURRENT body
executions to accumulate values from each iteration. This would be
done with processor-specific synchronization.

We propose extending the definition of concurrent-locality to include

    REDUCE (reduce-operator : reduce-variable-name-list)

The reduce-operator is limited to the associative intrinsic operators
and functions listed in the table below.

Each reduce-variable-name is restricted to be an intrinsic type or
array of intrinsic type.

The reduce-variable-names shall not have the ALLOCATABLE,
INTENT (IN), or OPTIONAL attribute.

The reduce-variable-names shall be defined before execution of the
DO CONCURRENT.

The reduce-variable-name is treated with semantics similar
to LOCAL_INIT. The initial value before execution of each
iteration of the DO CONCURRENT body is based on the reduce-operator
as defined in the table below.

  OPERATOR          INITIAL VALUE
  --------          -------------
    +                     0
    *                     1
  .AND.                .TRUE.
  .OR.                 .FALSE.
  .EQV.                .TRUE.
  .NEQV.               .FALSE.
   MIN          Largest representable number
   MAX         Smallest representable number
   IAND              All bits on
   IOR                    0
   IEOR                   0

The value of the reduce-variable upon completion of the DO CONCURRENT
is its initial value combined (via the reduce-operator) with the
values of the reduce-variable at the end of each iteration of the
loop body, performed in any order.

The processor may make assumptions about how to parallelize the
reduction to optimize or eliminate synchronized access to each
reduce-variable-name.

The reduce-variable can only appear on the left-hand-side of
statements that look like
        reduce-variable-name = &
                reduce-variable-name reduce-operator expression
    or
        reduce-variable-name = &
                expression reduce-operator reduce-variable-name
    or
        reduce-variable-name = &
              reduce-operator(reduce-variable-name, expression)
    or
        reduce-variable-name = &
              reduce-operator(expression, reduce-variable-name)

Revisited Use Cases
-------------------
    s = 1000
    DO CONCURRENT (i = 1:size(a)) REDUCE(+:s)
      s = s + (2 * a(i) - 1)
    END DO

    foo = a(1)
    DO CONCURRENT (i = 2:size(a)) REDUCE(MIN:foo)
      foo = MIN(foo, a(i))
    END DO


Benefits
--------
Adding REDUCE will:
    - Increase the usefulness of DO CONCURRENT
    - Increase opportunities to exploit SIMD and other parallelism
    - Increase ability to optimize parallelism according to the
      available threads
    - Make it easier to write conforming programs
    - Make it easier to check the programmer's intent about reductions
    - Reduce memory allocation needs or explicit data management when
      offloading to target processors
    - Eliminate explicit synchronization
    - Reduce need for explicit temporaries