To: J3                                                     J3/25-145r1
From: Brandon Cook & Dan Bonachea
Subject: Requirements for US20: Local Prefix Operation Intrinsics
Date: 2025-June-24

#Reference: J3/25-145, J3/24-157, J3/23-235r2, J3/23-113, WG5/N-2239

1. Introduction
===============

Scan, or prefix reduction, operations are fundamental building blocks
in parallel algorithms and data manipulation tasks. The SCAN and
CO_SCAN proposal (J3/23-235r2) received "mixed support" at a previous
meeting, and prospective work items, including prefix reduction
operations, were "conditionally accepted" at the 2024 meeting, pending
further discussion. They were again discussed at the February 2025 WG5
meeting, and subsequently promoted to "accepted" work item US20 via
WG5 letter ballot in May 2025 (WG5/N-2239). The result of the WG5 vote
was 20 yes 5 no and 1 undecided with several informative comments.

This document focuses on requirements exclusively for the local prefix
reduction variant, refining previous concepts based on community and
WG5 feedback. By focusing exclusively on the local variant our aim is
to allow consideration independent of the closely related but distinct
collective subroutines. We do however revisit use cases briefly for
clarity as we are now addressing the two variants separately.

2. Nomenclature
================

The term "SCAN" has been identified as problematic due to potential
conflicts and historical connotations. In this document we adopt the
terminology of "prefix reduction" to reference the operation.

3. Motivation and Local Prefix Reduction Use Cases
==================================================

A local prefix reduction intrinsic offers significant benefits for
Fortran programmers by providing a standardized, optimizable way to
perform common computational patterns. There are many additional
motivating use cases available in the literature (and described in
J3/23-113).

3.1. Sparse Matrix Algorithms
-------------------------------------------

Many high-performance computing applications rely heavily on sparse
matrix operations. Prefix operations are important in these contexts,
for example:

    * Computing Offsets: In constructing sparse matrix
      representations, a prefix reduction over the number of non-zero
      elements per row/column is used to compute the offsets into the
      final data structure.  Document J3/24-157 highlights this use
      case from the MFDn application, noting that this step is
      non-trivial to implement portably and efficiently across
      different architectures (CPU and GPU).

    * Segmented prefix reductions for SpMV: Load-balancing sparse
      matrix-vector product (SpMV) kernels on GPUs often employ
      segmented prefix reduction operations.  The Ginkgo library
      utilizes such algorithms to achieve high performance on these
      kernels. It is not clearly possible to implement these
      approaches in Fortran without such an intrinsic.

3.2. Implementation of Existing Intrinsics
------------------------------------------

Efficient implementations of some existing Fortran intrinsics can be
expressed using prefix operations. J3/24-157 provides an example
implementation of the `PACK` intrinsic using an inclusive prefix
reduction to calculate the destination indices for the packed
elements.

4. Discussion of Requirements
====================================

Based on feedback and existing practice, the design space for prefix
intrinsics warrants careful consideration.

4.1. General vs. Specific Intrinsics
------------------------------------

Adding a general `REDUCE_PREFIX` for flexibility while also providing
a highly-optimizable `SUM_PREFIX` for the most common use cases
creates a consistent family of intrinsics, where `SUM` and `REDUCE`
have `_PREFIX` variants.

With straightforward transformations PRODUCT_PREFIX and COUNT_PREFIX
can be implemented with SUM_PREFIX and MIN_PREFIX and MAX_PREFIX are
possible with appropriate arguments to REDUCE_PREFIX(see Examples).

4.2. Handling Inclusive and Exclusive Prefix Reductions
-------------------------------------------------------

Explicitly named functions distinguish between the inclusive and
exclusive variants of prefix reductions.

5. SUM_PREFIX
==============

5.1 Existing SUM and new routines
---------------------------------

For reference, the syntax for SUM is

SUM(ARRAY, DIM [, MASK]) or
SUM(ARRAY [, MASK])

The new intrinsics are

SUM_PREFIX_INCLUSIVE(ARRAY [, MASK])
SUM_PREFIX_INCLUSIVE(ARRAY, DIM [, MASK])
SUM_PREFIX_EXCLUSIVE(ARRAY, [, MASK])
SUM_PREFIX_EXCLUSIVE(ARRAY, DIM [, MASK])

5.2 Semantic Requirements
-------------------------

These requirements follow a straightforward extension of SUM.

ARRAY is an array of numeric type.

The type and kind parameter of the result is determined by the type of
ARRAY, as specified for the `+` operator.

The rank and shape of the result is the same as the rank and shape of
ARRAY.

MASK is of type logical and conformable with ARRAY.

The values of the result in the inclusive case with rank 1 ARRAY and
no MASK should be equivalent to the naive serial implementation as
follows

RESULT(1) = ARRAY(1)
DO i = 2, size(ARRAY)
   RESULT(i) = RESULT(i-1) + ARRAY(i)
END DO

The values of the result in the exclusive case with rank 1 ARRAY and
no MASK should be equivalent to the naive serial implementation as
follows

RESULT(1) = 0
DO i = 2, size(ARRAY)
   RESULT(i) = RESULT(i-1) + ARRAY(i-1)
END DO

In the case of a multidimensional ARRAY with no DIM argument, RESULT
should be computed analogously to the above, with the iteration
proceeding in array element order.

In the case of a specified DIM, the prefix reduction should be
performed elementwise along that dimension.

In the case of a specified MASK, elements of ARRAY contribute to the
sum if and only if the corresponding element of MASK is true.


6. REDUCE_PREFIX
================

6.1 Existing REDUCE and new routines
------------------------------------

For reference, the syntax of REDUCE is

REDUCE(ARRAY, OPERATION [,MASK, IDENTITY, ORDERED]) or
REDUCE(ARRAY, OPERATION, DIM [,MASK, IDENTITY, ORDERED])

The new intrinsics are

REDUCE_PREFIX_INCLUSIVE(ARRAY, OPERATION [, MASK, IDENTITY, ORDERED])
REDUCE_PREFIX_INCLUSIVE(ARRAY, OPERATION, DIM [, MASK, IDENTITY, ORDERED])
REDUCE_PREFIX_EXCLUSIVE(ARRAY, OPERATION, IDENTITY [, MASK, ORDERED])
REDUCE_PREFIX_EXCLUSIVE(ARRAY, OPERATION, IDENTITY, DIM [, MASK, ORDERED])

Note that it is not possible to simultaneously preserve consistent
argument order both between the prefix and non-prefix versions of
REDUCE and between inclusive and exclusive definitions, with the
constraint that `IDENTITY` be a non-optional argument for
REDUCE_PREFIX_EXCLUSIVE.

6.2 Semantic Requirements
-------------------------

These requirements follow a straightforward extension of REDUCE:

OPERATION should be a pure function.

OPERATION should accept exactly two arguments; the result and each
argument must be a scalar, nonallocatable, noncoarray, nonpointer,
nonpolymorphic, nonoptional data object with the same declared type
and type parameters as the input ARRAY.

OPERATION need not be a commutative operator. The computation will not
reorder operands.

If ORDERED is absent or .false., then OPERATION should be
computationally associative. If ORDERED is present with the value
.true., then the computation is guaranteed to effectively proceed in
array element order, and operations will not be reassociated.

The rank and shape of the result is the same as the rank and shape of
ARRAY.

MASK is of type logical and conformable with ARRAY.

IDENTITY has the same type and type parameters as ARRAY and is a
scalar.

The values of the result in the inclusive case with rank 1 ARRAY,
ORDERED=.true. and no MASK should be equivalent to the naive serial
implementation as follows

RESULT(1) = ARRAY(1)
DO i = 2, size(ARRAY)
   RESULT(i) = OPERATION(RESULT(i-1), ARRAY(i))
END DO

The values of the result in the exclusive case with rank 1 ARRAY,
ORDERED=.true. and no MASK should be equivalent to the naive serial
implementation as follows

RESULT(1) = IDENTITY
DO i = 2, size(ARRAY)
   RESULT(i) = OPERATION(RESULT(i-1), ARRAY(i-1))
END DO

In the case of a multidimensional ARRAY with no DIM argument, RESULT
should be computed analogously to the above, with the iteration
proceeding in array element order.

In the case of a specified DIM, the prefix reduction should be
performed elementwise along that dimension.

In the case of a specified MASK, elements of ARRAY contribute to the
sum if and only if the corresponding element of MASK is true.


8. Examples
===========

Assume

function add(a,b) result(r)
	 integer, intent(in) :: a,b; integer :: r; r=a+b
end function add

8.1 Inclusive prefix reduction with +
--------------------------------------

A = [1, 2, 3, 4]
R = SUM_PREFIX_INCLUSIVE(A)
! R will be [1, 3, 6, 10]

R = REDUCE_PREFIX_INCLUSIVE(A, OPERATION=add)
! R will be [1, 3, 6, 10]

8.2 Exclusive prefix reduction with +
--------------------------------------

A = [1, 2, 3, 4]
R = SUM_PREFIX_EXCLUSIVE(A)
! R will be [0, 1, 3, 6]

R = REDUCE_PREFIX_EXCLUSIVE(A, OPERATION=add, IDENTITY=0)
! R will be [0, 1, 3, 6]

8.3 Exclusive Sum along dimension 2 with a mask:
------------------------------------------------

B = RESHAPE([1,2,3,4,5,6], [2,3]) ! B = | 1 3 5 |
                                  !     | 2 4 6 |
M = RESHAPE([.T., .T., .F., .T., .T., .T.], [2,3])
R = SUM_PREFIX_EXCLUSIVE(B, DIM=2, MASK=M)
! R will be | 0  1  1 |
!           | 0  2  6 |

8.4 COUNT_PREFIX via SUM_PREFIX
-------------------------------

	function COUNT_PREFIX_INCLUSIVE(MASK)
		LOGICAL, DIMENSION(:), intent(in) :: MASK
		INTEGER, DIMENSION(:) :: COUNT_PREFIX_INCLUSIVE
		COUNT_PREFIX_INCLUSIVE = &
		  SUM_PREFIX_INCLUSIVE(MERGE(1, 0, MASK))
	end function

8.5 PRODUCT_PREFIX via SUM_PREFIX
----------------------------------

Note: This implementation is more numerically stable than a direct
PRODUCT_PREFIX due to the error accumulation properties of addition
compared to multiplication.

	function PRODUCT_PREFIX_INCLUSIVE(ARRAY)
		INTEGER, DIMENSION(:), INTENT(IN) :: ARRAY
		INTEGER, DIMENSION(:) :: PRODUCT_PREFIX_INCLUSIVE
		PRODUCT_PREFIX_INCLUSIVE = &
		   EXP(SUM_PREFIX_INCLUSIVE(LOG(ARRAY)))
	end function

9.0 Summary of straw vote results and design decisions
======================================================

25-145 Requirements for US20: Local Prefix Operation Intrinsics

Straw votes:

Poll: A) General REDUCE_PREFIX, B) Highly specific, C) Hybrid with
REDUCE and SUM versions only.

People can vote more than once. Y/N/U

A) 5/10/8
B) 9/7/7
C) 10/1/8

Poll: Inclusive/Exclusive versions A) Are EXCLUSIVE and IDENTITY
optional arguments, or B) are there separate INCLUSIVE and EXCLUSIVE
names that make IDENTITY a required argument?

A/B/U 2/17/2

The paper has been revised to have separate INCLUSIVE and EXCLUSIVE
routines.

Poll: Generalize IDENTITY to allow rank-1 array?

Scalar/Array/Undecided 15/0/6

The inclusion of this change would also potentially result in
expanding the scope of edits to also include REDUCE for
consistency. In the discussion it was concluded that this could be
added in the future if a compelling use case arises.

Poll: Segmented operation

Should a segmented variant of the operation be included in the design?

Y/N/U 0/15/7

Based on these votes the following revisions were made for 25-145r1:
1. consolidated on REDUCE and SUM (option C)
2. presented separate routines for inclusive and exclusive variants
3. removed references to a generalized IDENTITY
4. removed references to a segmented variant