To: J3                                                     J3/25-145
From: Brandon Cook & Dan Bonachea
Subject: Requirements for US20: Local Prefix Operation Intrinsics
Date: 2025-June-11

#Reference: J3/24-157, J3/23-235r2, J3/23-113, WG5/N-2239

1. Introduction
===============

Scan, or prefix reduction, operations are fundamental building blocks
in parallel algorithms and data manipulation tasks. The SCAN and
CO_SCAN proposal (J3/23-235r2) received "mixed support" at a previous
meeting, and prospective work items, including prefix reduction
operations, were "conditionally accepted" at the 2024 meeting, pending
further discussion. They were again discussed at the February 2025 WG5
meeting, and subsequently promoted to "accepted" work item US20 via
WG5 letter ballot in May 2025 (WG5/N-2239). The result of the WG5 vote
was 20 yes 5 no and 1 undecided with several informative comments.

This document focuses on requirements exclusively for the local prefix
reduction variant, refining previous concepts based on community and
WG5 feedback. By focusing exclusively on the local variant our aim is
to allow consideration independent of the closely related but distinct
collective subroutines. We do however revisit use cases briefly for
clarity as we are now addressing the two variants separately.

2. Nomenclature
================

The term "SCAN" has been identified as problematic due to potential
conflicts and historical connotations. In this document we adopt the
terminology of "prefix reduction" to reference the operation.

3. Motivation and Local Prefix Reduction Use Cases
==================================================

A local prefix reduction intrinsic offers significant benefits for
Fortran programmers by providing a standardized, optimizable way to
perform common computational patterns. There are many additional
motivating use cases available in the literature (and described in
J3/23-113).

3.1. Sparse Matrix Algorithms
-------------------------------------------

Many high-performance computing applications rely heavily on sparse
matrix operations. Prefix operations are important in these contexts,
for example:

    * Computing Offsets: In constructing sparse matrix
      representations, a prefix reduction over the number of non-zero
      elements per row/column is used to compute the offsets into the
      final data structure.  Document J3/24-157 highlights this use
      case from the MFDn application, noting that this step is
      non-trivial to implement portably and efficiently across
      different architectures (CPU and GPU).

    * Segmented prefix reductions for SpMV: Load-balancing sparse
      matrix-vector product (SpMV) kernels on GPUs often employ
      segmented prefix reduction operations.  The Ginkgo library
      utilizes such algorithms to achieve high performance on these
      kernels. It is not clearly possible to implement these
      approaches in Fortran without such an intrinsic.

3.2. Implementation of Existing Intrinsics
------------------------------------------

Efficient implementations of some existing Fortran intrinsics can be
expressed using prefix operations. J3/24-157 provides an example
implementation of the `PACK` intrinsic using an inclusive prefix
reduction to calculate the destination indices for the packed
elements.


4. Discussion of Requirements
====================================

Based on feedback and existing practice, the design space for prefix
intrinsics warrants careful consideration.

4.1. General vs. Specific Intrinsics
------------------------------------

There are three primary approaches to introducing prefix reduction
operations as local intrinsic transformational functions into the
standard.

* Option A: General `REDUCE_PREFIX`

    This approach, modeled on `REDUCE` , provides one powerful and
    flexible function that takes a user-defined `operation`.  Its
    strength is its generality and minimalism, however it can be
    verbose for common operations and may present a greater
    optimization challenge for compilers.

* Option B: Specific, High-Utility Intrinsics

    This approach follows the precedent set by High Performance
    Fortran (HPF) and is implemented in libraries like NVIDIA's
    cutensorex.  It would involve adding specific intrinsics like
    `SUM_PREFIX`, `COUNT_PREFIX` and so on.  This offers ease of use
    and maximum optimization potential for common cases but lacks
    generality. It also opens the door for ongoing future requests of
    additional operations and therefore more intrinsics and the
    corresponding effort to standardize and implement them.

* Option C: A Hybrid Approach

    This approach provides benefits of both approaches by adding a
    general `REDUCE_PREFIX` for flexibility while also providing a
    highly-optimizable `SUM_PREFIX` for the most common use
    cases. This creates a complete and consistent family of
    intrinsics, where `SUM` and `REDUCE` have `_PREFIX` variants. The
    remainder of the document assumes this option. With
    straightforward transformations PRODUCT_PREFIX and COUNT_PREFIX
    can be implemented with SUM_PREFIX and MIN_PREFIX and MAX_PREFIX
    are possible with appropriate arguments to REDUCE_PREFIX(see
    Examples).

Straw poll A:
Straw poll B:
Straw poll C:

4.2. Handling Inclusive and Exclusive Prefix Reductions
-------------------------------------------------------

Previous proposals used a single `EXCLUSIVE` logical argument to
switch between inclusive and exclusive prefix reductions.  An
alternative is to provide separate, explicitly named functions.

It is not possible to simultaneously preserve consistent argument
order both between the prefix and non-prefix versions of REDUCE and
between inclusive and exclusive definitions, with the desired
constraint that `IDENTITY` be a non-optional argument for
REDUCE_PREFIX_EXCLUSIVE.

In this paper we present both options.

5. SUM_PREFIX
==============

5.1 SUM
-------

For reference, the syntax for SUM is

SUM(ARRAY, DIM [, MASK]) or
SUM(ARRAY [, MASK])

5.2 Semantic Requirements
-------------------------

These requirements follow a straightforward extension of SUM.

ARRAY is an array of numeric type.

The type and kind parameter of the result is determined by the type of
ARRAY, as specified for the `+` operator.

The rank and shape of the result is the same as the rank and shape of
ARRAY.

MASK is of type logical and conformable with ARRAY.

The values of the result in the inclusive case with rank 1 ARRAY and
no MASK should be equivalent to the naive serial implementation as
follows

RESULT(1) = ARRAY(1)
DO i = 2, size(ARRAY)
   RESULT(i) = RESULT(i-1) + ARRAY(i)
END DO

The values of the result in the exclusive case with rank 1 ARRAY and
no MASK should be equivalent to the naive serial implementation as
follows

RESULT(1) = 0
DO i = 2, size(ARRAY)
   RESULT(i) = RESULT(i-1) + ARRAY(i-1)
END DO

In the case of a multidimensional ARRAY with no DIM argument, RESULT
should be computed analogously to the above, with the iteration
proceeding in array element order.

In the case of a specified DIM, the prefix reduction should be
performed elementwise along that dimension.

In the case of a specified MASK, elements of ARRAY contribute to the
sum if and only if the corresponding element of MASK is true.

5.2 INCLUSIVE and EXCLUSIVE
---------------------------

The ability to select between an inclusive and exclusive prefix sum
operation can be controlled either via arguments or function
definitions:

SUM_PREFIX(ARRAY [, MASK, EXCLUSIVE]) or
SUM_PREFIX(ARRAY, DIM [, MASK, EXCLUSIVE])

OR

SUM_PREFIX_INCLUSIVE(ARRAY [, MASK])
SUM_PREFIX_INCLUSIVE(ARRAY, DIM [, MASK])
SUM_PREFIX_EXCLUSIVE(ARRAY, [, MASK])
SUM_PREFIX_EXCLUSIVE(ARRAY, DIM [, MASK])

straw poll:

6. REDUCE_PREFIX
================

6.1 REDUCE
----------

For reference, the syntax of REDUCE is

REDUCE(ARRAY, OPERATION [,MASK, IDENTITY, ORDERED]) or
REDUCE(ARRAY, OPERATION, DIM [,MASK, IDENTITY, ORDERED])

6.2 Semantic Requirements
-------------------------

These requirements follow a straightforward extension of REDUCE:

OPERATION should be a pure function.

OPERATION should accept exactly two arguments; the result and each
argument must be a scalar, nonallocatable, noncoarray, nonpointer,
nonpolymorphic, nonoptional data object with the same declared type
and type parameters as the input ARRAY.

OPERATION need not be a commutative operator. The computation will not
reorder operands.

If ORDERED is absent or .false., then OPERATION should be
computationally associative. If ORDERED is present with the value
.true., then the computation is guaranteed to effectively proceed in
array element order, and operations will not be reassociated.

The rank and shape of the result is the same as the rank and shape of
ARRAY.

MASK is of type logical and conformable with ARRAY.

IDENTITY has the same type and type parameters as ARRAY and is
conformable with ARRAY.

The values of the result in the inclusive case with rank 1 ARRAY,
ORDERED=.true. and no MASK should be equivalent to the naive serial
implementation as follows

RESULT(1) = ARRAY(1)
DO i = 2, size(ARRAY)
   RESULT(i) = OPERATION(RESULT(i-1), ARRAY(i))
END DO

The values of the result in the exclusive case with rank 1 ARRAY,
ORDERED=.true. and no MASK should be equivalent to the naive serial
implementation as follows

RESULT(1) = IDENTITY(1)
DO i = 2, size(ARRAY)
   RESULT(i) = OPERATION(RESULT(i-1), ARRAY(i-1))
END DO

In the case of a multidimensional ARRAY with no DIM argument, RESULT
should be computed analogously to the above, with the iteration
proceeding in array element order.

In the case of a specified DIM, the prefix reduction should be
performed elementwise along that dimension.

In the case of a specified MASK, elements of ARRAY contribute to the
sum if and only if the corresponding element of MASK is true.

Straw vote: Generalization of IDENTITY to be conformable with ARRAY.

6.3 INCLUSIVE and EXCLUSIVE
---------------------------

It is not possible to simultaneously preserve consistent argument
order both between the prefix and non-prefix versions of REDUCE and
between inclusive and exclusive definitions, with the desired
constraint that IDENTITY be a non-optional argument for
REDUCE_PREFIX_EXCLUSIVE.


REDUCE_PREFIX(ARRAY, OPERATION [, MASK, IDENTITY, ORDERED, EXCLUSIVE])
REDUCE_PREFIX(ARRAY, OPERATION, DIM [, MASK, IDENTITY, ORDERED, EXCLUSIVE])

or

REDUCE_PREFIX_INCLUSIVE(ARRAY, OPERATION [, MASK, IDENTITY, ORDERED])
REDUCE_PREFIX_INCLUSIVE(ARRAY, OPERATION, DIM [, MASK, IDENTITY, ORDERED])
REDUCE_PREFIX_EXCLUSIVE(ARRAY, OPERATION, IDENTITY [, MASK, ORDERED])
REDUCE_PREFIX_EXCLUSIVE(ARRAY, OPERATION, IDENTITY, DIM [, MASK, ORDERED])

Straw poll:

7. SEGMENT
==========

The `SEGMENT` argument from J3/23-113 has compelling use cases. It is
worth noting that HPF also specified a `SEGMENT` argument which NVIDIA
has not yet implemented in their cutensorex library. It is also
possible to implement a segmented prefix reduction with a
non-segmented version by augmenting the definition of the binary
operation and the input array with a corresponding segment key at the
expense of additional memory and pre/post-processing.

Straw poll:


8. Examples
===========

Assume

function add(a,b) result(r)
	 integer, intent(in) :: a,b; integer :: r; r=a+b
end function add

8.1 Inclusive prefix reduction with +
--------------------------------------

A = [1, 2, 3, 4]
R = SUM_PREFIX_INCLUSIVE(A)
! R will be [1, 3, 6, 10]

R = REDUCE_PREFIX_INCLUSIVE(A, OPERATION=add)
! R will be [1, 3, 6, 10]

8.2 Exclusive prefix reduction with +
--------------------------------------

A = [1, 2, 3, 4]
R = SUM_PREFIX_EXCLUSIVE(A)
! R will be [0, 1, 3, 6]

R = REDUCE_PREFIX_EXCLUSIVE(A, OPERATION=add, IDENTITY=0)
! R will be [0, 1, 3, 6]

8.3 Exclusive Sum along dimension 2 with a mask:
------------------------------------------------

B = RESHAPE([1,2,3,4,5,6], [2,3]) ! B = | 1 3 5 |
                                  !     | 2 4 6 |
M = RESHAPE([.T., .T., .F., .T., .T., .T.], [2,3])
R = SUM_PREFIX_EXCLUSIVE(B, DIM=2, MASK=M)
! R will be | 0  1  1 |
!           | 0  2  6 |

8.4 COUNT_PREFIX via SUM_PREFIX
-------------------------------

	function COUNT_PREFIX_INCLUSIVE(MASK)
		LOGICAL, DIMENSION(:), intent(in) :: MASK
		INTEGER, DIMENSION(:) :: COUNT_PREFIX_INCLUSIVE
		COUNT_PREFIX_INCLUSIVE = &
		  SUM_PREFIX_INCLUSIVE(MERGE(1, 0, MASK))
	end function

8.5 PRODUCT_PREFIX via SUM_PREFIX
----------------------------------

Note: This implementation is more numerically stable than a direct
PRODUCT_PREFIX due to the error accumulation properties of addition
compared to multiplication.

	function PRODUCT_PREFIX_INCLUSIVE(ARRAY)
		INTEGER, DIMENSION(:), INTENT(IN) :: ARRAY
		INTEGER, DIMENSION(:) :: PRODUCT_PREFIX_INCLUSIVE
		PRODUCT_PREFIX_INCLUSIVE = &
		   EXP(SUM_PREFIX_INCLUSIVE(LOG(ARRAY)))
	end function