To: J3                                                     J3/18-266r1
From: Gary Klimowicz
Subject: Add reductions to DO CONCURRENT
Date: 2018-October-15
Reference: 18-007: 11.1.7 (DO construct) and 16.9.161 (REDUCE)


Motivation
----------

Performing a reduction on the array elements computed in a DO CONCURRENT
construct can be complex or inefficient for large arrays.

The naive expression of a reduction is non-conforming (taken from
comp.lang.fortran):

    real :: a(1024), s
    integer :: i

    call random_seed()
    call random_number(a)

    ! With DO CONCURRENT, the use of 's' in the
    ! loop below is not defined.
    s = 0
    do concurrent (i = 1:size(a))
       s = s + (2 * a(i) - 1)
    end do

    print *, s
    end

('s' is referenced in each iteration before its value is defined in that
iteration.)

Specifying SHARED locality does not solve the problem:

    ! other initializations and declarations as above

    ! The concurrent updates of 's' can lead to race conditions.
    s = 0
    do concurrent (i = 1:size(a)) shared(s)
       s = s + (2 * a(i) - 1)
    end do

    print *, s
    end

One potential solution is to construct a temporary array variable to hold
the values to be reduced, and then perform the reduction outside the
DO CONCURRENT:

    real :: s_temp(1024)
    ! other initializations and declarations as above

    do concurrent (i = 1:size(a))
       s_temp(i) = 2 * a(i) - 1
    end do
    s = sum(s_temp)

Another potential solution is to use critical sections around the
reduction. This would require extending the CRITICAL construct beyond its
current use with images of a team:

    s = 0
    do concurrent (i = 1:size(a)) shared(s)
       critical
          s = s + (2 * a(i) - 1)
       end critical
    end do

These solutions become more complex or inefficient as we use more
complicated concurrent-headers or wish to perform reductions on multiple
expressions within the body of the DO CONCURRENT. For computing
environments that can exploit significant SIMD parallelism, these
solutions can involve a substantial amount of data or process
synchronization. The iterations of a DO CONCURRENT may also be spread
across the available threads (a number that is not visible to the
programmer). When the user is most interested in performing a simple
reduction of the data in the DO CONCURRENT, this overhead can exceed the
gains from the computational parallelism.

We propose a way of specifying scalar reduction variables that are
managed across the executions of the DO CONCURRENT block to accumulate
the scalar values computed in each iteration, using processor-dependent
synchronization. The intent is to support an in-Fortran specification of
some of the features of the OpenMP reduction clause.

We propose something simpler (example syntax only):

    s = 0
    do concurrent (i = 1:size(a)) reduce(s)
       s = s + (2 * a(i) - 1)
    end do

This will:

  * Increase the expressiveness of DO CONCURRENT
  * Increase opportunities to exploit SIMD and other parallelism
  * Increase the ability to optimize parallelism according to the number
    of available threads
  * Make it easier to write conforming programs
  * Reduce memory allocation needs and explicit data management when
    offloading to target processors
  * Eliminate explicit process synchronization
  * Avoid redefining CRITICAL/END CRITICAL beyond their use with teams


Key questions
-------------

1. Is this use case compelling?

2. Are there ways to exploit the potential parallelism with the existing
   locality-specifiers, by relaxing the semantics of 11.1.7.5 paragraphs
   1-4?

3. Can this be done with the existing CRITICAL/END CRITICAL semantics?

4. Can this be done by extending the ATOMIC_xxx intrinsic subroutines
   beyond their current use with scalar coarrays or coindexed objects?


Straw Vote
----------

1. Is this worth continuing to investigate?
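

OpenMP comparison (informative)
-------------------------------

For reference, a rough OpenMP analogue of the proposed example is shown
below. It assumes the same declarations and initializations as in the
examples above; the directive form itself is not part of this proposal
and is given only to illustrate the reduction semantics the proposal
intends to make expressible in Fortran:

    s = 0
    ! The OpenMP reduction clause gives each thread a private copy of
    ! 's' and combines the per-thread copies with '+' when the loop
    ! completes.
    !$omp parallel do reduction(+:s)
    do i = 1, size(a)
       s = s + (2 * a(i) - 1)
    end do
    !$omp end parallel do

    print *, s

Note that the OpenMP clause names the reduction operator (+) explicitly;
whether and how an operator would be specified in the proposed reduce(s)
syntax is left open by this paper.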