To: J3                                                     J3/19-158
From: Gary Klimowicz
Subject: Add reductions to DO CONCURRENT
Date: 2019-February-14

Reference: 18-007r1, 18-266r4

Motivation
==========

Performing a reduction on the array elements computed
in DO CONCURRENT can be inefficient for large arrays.
The naive expression of a reduction is non-conforming:

        real :: a(:)
        real :: s
        integer :: i

        ...

        ! With DO CONCURRENT, the use of 's' in the
        ! loop below is not defined before referenced.
        s = 0
        do concurrent (i = 1:size(a))
          s = s + (2 * a(i) - 1)
        end do

        print *, s
        end

Potential solutions using temporary arrays (used to
accumulate intermediate results) followed by REDUCE are
inefficient in the use of data and potential data
management require in highly threaded environments.

When the programmer is interested in performing a simple
reduction of the data in the DO CONCURRENT, this overhead
can exceed the gains from the computational parallelism.

Also, with knowledge of the reduction, a processor could optimize
DO CONCURRENT to be scaled to the number of available threads
available at runtime (the number of which are not known to the
programmer), chunking computations.

At meeting 215, J3 approved this item for moving forward.


Simple Use Cases
================

    ! With current semantics, the code below is not
    ! conformant. 's' is not defined in the iteration
    ! before it is referenced.
    s = 1000
    DO CONCURRENT (i = 1:SIZE(a))
      s = s + (2 * a(i) - 1)
    END DO
    ! s = 1000 + SUM(2*A-1)

    ! Likewise, with 'foo', not defined before used.
    foo = a(1)
    DO CONCURRENT (i = 2:SIZE(a))
      foo = MIN(foo, a(i))
    END DO
    ! foo = MINVAL(a)

We would like programs with semantics similar to the above to be
conformant and still be able to exploit all available runtime
parallelism.


Requirements
============

Provide Fortran support for some of the features of the OpenMP
reduction clause.

The proposal should support:
    1. Extend DO CONCURRENT to include the addition of scalar
       "reduction variables" and a corresponding "reduction operators"
       for the reduction variable.

    2. Provide for the initialization of the reduction variable
       consistent with the reduction operator.

    3. Allow multiple reduction variables to be defined for a
       DO CONCURRENT.

    4. Support the following reduction operators and initial values:
              OPERATOR          INITIAL VALUE
              --------          -------------
                +                     0
                *                     1
              .AND.                .TRUE.
              .OR.                 .FALSE.
              .EQV.                .TRUE.
              .NEQV.               .FALSE.
               MIN          Largest representable number
               MAX         Smallest representable number
               IAND                 NOT(0)
               IOR                    0
               IEOR                   0

    5. Have reduction variable semantics aligned with existing
       locality specifications for DO CONCURRENT.

    6. Have behavior that is a good analog to the OpenMP reduction clause.


The proposal need not support:
    1. Reduction variables with the ALLOCATABLE, INTENT (IN),
       or OPTIONAL attribute.


Proposal
========

In R1123, insert "concurrent-reduction" between "concurrent-header"
and "concurrent-locality" to provide for the definition of the
reduction variables.

    R1123 loop-control
        is [ , ] do-variable = scalar-int-expr, scalar-int-expr
                    [ , scalar-int-expr ]
        or [ , ] WHILE ( scalar-logical-expr )
        or [ , ] CONCURRENT concurrent-header concurrent-reduction
                    concurrent-locality

Add a rule
    R11xx   concurrent-reduction is [ reduction-spec ] ...

Add a rule

    R11yy   reduction-spec is REDUCE (reduce-operator : variable-name-list)

Add a rule
    R11zz   reduce-operator is +
                    or *
                    or .AND.
                    or .OR.
                    or .EQV.
                    or .NEQV.
                    or MIN
                    or MAX
                    or IAND
                    or IOR
                    or IEOR

Add constraints:
    C11aa   Each variable-name in a reduction-spec is restricted to be
            an intrinsic type or array of intrinsic type.

    C11bb   Each variable-name in a reduction-spec shall not have the
            ALLOCATABLE, INTENT (IN), or OPTIONAL attribute.

    C11cc   Each variable-name shall be defined before execution of the
            DO CONCURRENT.

    C11dd   The reduction variable can only appear on the left-hand-side of
            statements that look like
                    variable-name = &
                            variable-name reduce-operator expression
                or
                    variable-name = &
                            expression reduce-operator variable-name
                or
                    variable-name = &
                          reduce-operator(variable-name, expression)
                or
                    variable-name = &
                          reduce-operator(expression, variable-name)

The variable-name is treated with semantics similar to LOCAL_INIT.
The initial value before execution of each iteration of the
DO CONCURRENT body is based on the reduce-operator as defined
in the table below.

      OPERATOR          INITIAL VALUE
      --------          -------------
        +                     0
        *                     1
      .AND.                .TRUE.
      .OR.                 .FALSE.
      .EQV.                .TRUE.
      .NEQV.               .FALSE.
       MIN          Largest representable number
       MAX         Smallest representable number
       IAND                 NOT(0)
       IOR                    0
       IEOR                   0

Upon completion of the DO CONCURRENT, the value of each variable
in the reduction-spec is its initial value combined (via the
reduce-operator) with the values of the variable at the end
of each iteration of the loop body, performed in any order.

The processor may make assumptions about how to parallelize the
reduction to optimize or eliminate synchronized access to each
variable-name.


Revisited Use Cases
===================
    s = 1000
    DO CONCURRENT (i = 1:size(a)) REDUCE(+:s)
      s = s + (2 * a(i) - 1)
    END DO

    foo = a(1)
    DO CONCURRENT (i = 2:size(a)) REDUCE(MIN:foo)
      foo = MIN(foo, a(i))
    END DO


Benefits
========
Adding REDUCE will:
    - Increase the usefulness of DO CONCURRENT
    - Increase opportunities to exploit SIMD and other parallelism
    - Increase ability to optimize parallelism according to the
      available threads
    - Make it easier to write conforming programs
    - Make it easier to check the programmer's intent about reductions
    - Reduce memory allocation needs or explicit data management when
      offloading to target processors
    - Eliminate explicit synchronization
    - Reduce need for explicit temporaries


Edits
=====
To come.


Straw Vote
==========
JoR would like to propose a straw vote:
   1. Should JoR continue developing this proposal?

Assuming we continue with this proposal, where do we draw
the line on capabilities?
    A. Should we relax the constraint about intrinsic types
       and allow derived types?

    B. And therefore, should we allow the definition of
       reduction operators to include user-defined operators?

    C. Should we allow any pure procedure of two arguments
       to be the reduction operator?