To: J3                                                      J3/24-157
From: Brad Richardson & Brandon Cook
Subject: In Defense of Intrinsic Scan Operations
Date: 2024-August-12
References: 23-235r2, N2234

1. Introduction
===============

In response to the "mixed support" received for the SCAN and CO_SCAN
proposal [23-235r2] at the 2024 WG5 meeting [N2234], this paper
presents use cases and motivation for their inclusion in the F202Y
work item list.

2. Use Cases and Motivation for SCAN
====================================

The Wikipedia page on the topic provides multiple use cases for scan
operations.

https://en.wikipedia.org/wiki/Prefix_sum

As further motivation, some additional use cases are provided below.

2.1 Sparse Matrix Algorithm
---------------------------

This use case is taken from MFDn, an application from the NERSC user
community that runs at multiple DOE centers and beyond. Consider a
sparse matrix composed of non-zero and zero blocks. The non-zero
blocks are themselves sparse. Which elements are non-zero is
determined by a complex set of rules (many-body quantum state
selection rules for interacting states, further complicated by a
special load-balanced ordering).

At a high level:

1. count_non_zero_blocks()
2. count_non_zero_elements_per_block()
3. allocate_large_array() # per image, holds many blocks; a single
   allocation is important for allocation performance and for GPU
   kernel offload with non-zero matrix blocks distributed to SMs
4. compute_offsets_into_array_for_each_block() # this is a scan over
   the non-zero elements per block
5. compute_matrix_elements()
6. eigen_decomposition_of_matrix()

All steps apart from 4 are amenable to parallel processing without
significant cognitive burden when targeting multiple architectures
(CPU and GPU). Step 4 is highly non-trivial to implement on any
architecture, and those implementations are typically not portable,
e.g.
https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-scan-operations-explicit-vectorization.html
is very different from
https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html.
Some examples and details from this specific application are
documented in https://arxiv.org/pdf/2110.10765.

The following parallel scan algorithm is relatively naive and, for
some input sizes, only just achieves a speedup over a CPU/serial
version thanks to avoiding a host/device transfer. In practice a
serial kernel is often used on the GPU, leading to significant idle
resources during a simulation campaign.

    !$acc data present(x,y)
    !$acc parallel loop async
    do j = 1, n
      y(j) = x(j)
    end do
    !$acc end parallel

    ! sweep up, reduction in place
    offset = 1
    do while (offset < n)
      !$acc parallel loop firstprivate(offset) present(y) async
      do concurrent (j = 0:n-1:2*offset)
        y(j + 2*offset) = y(j + offset) + y(j + 2*offset)
      end do
      !$acc end parallel
      offset = 2*offset
    end do

    ! sweep down, complete the scan
    !$acc serial async
    y(n) = 0
    !$acc end serial
    offset = shiftr(offset, 1)
    do while (offset > 0)
      !$acc parallel loop firstprivate(offset, tmp) present(y) async
      do concurrent (j = 0:n-1:2*offset) local(tmp)
        tmp = y(j + offset)
        y(j + offset) = y(j + 2*offset)
        y(j + 2*offset) = tmp + y(j + 2*offset)
      end do
      !$acc end parallel
      offset = shiftr(offset, 1)
    end do
    !$acc wait
    !$acc end data
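For reference, the operation step 4 performs is trivial to express
serially; the difficulty lies entirely in parallelizing it portably.
The sketch below (array names and values are placeholders, not taken
from MFDn) computes each block's offset into the single large array
as an exclusive prefix sum of the per-block non-zero counts.

    program block_offsets
      implicit none
      integer, parameter :: n_blocks = 5
      ! hypothetical per-block non-zero counts (placeholder values)
      integer :: nnz_per_block(n_blocks) = [3, 0, 7, 2, 5]
      integer :: block_offset(n_blocks)
      integer :: b

      ! exclusive prefix sum: a block's offset into the large array is
      ! the sum of the non-zero counts of all preceding blocks
      block_offset(1) = 0
      do b = 2, n_blocks
        block_offset(b) = block_offset(b-1) + nnz_per_block(b-1)
      end do

      print *, block_offset   ! prints 0 3 3 10 12
    end program block_offsets

The parallel version above computes the same kind of exclusive prefix
sum, but needs the up-sweep/down-sweep structure and
architecture-specific tuning to do so efficiently.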
2.2 Sparse Matrix Segmented Scan (spmv)
---------------------------------------

The algorithm described in the following paper for efficiently
computing sparse matrix-vector products, an extremely common
operation in scientific computing, is effectively impossible to
implement in Fortran. These algorithms are used in the
high-performance Ginkgo (https://ginkgo-project.github.io/) linear
algebra library.

    Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran
    Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M. Tsai, and
    Weichung Wang. 2020. Load-balancing Sparse Matrix Vector Product
    Kernels on GPUs. ACM Trans. Parallel Comput. 7, 1, Article 2
    (March 2020), 26 pages. https://doi.org/10.1145/3380930

The details are too complex to reproduce fully here, but segmented
scan operations are central to both the referenced paper and the
source code, e.g. "The segmented scan approach radically reduces the
number of atomic collisions" and "Our experiments on a large number
of matrices revealed that replacing the segmented scan algorithm with
multiple reductions results in lower performance ...".

Source code references:

* segmented scan:
  https://github.com/ginkgo-project/ginkgo/blob/caa373d1a3c635755b48cb50dc5417bdc64144af/common/cuda_hip/components/segment_scan.hpp#L26
* COO kernel:
  https://github.com/ginkgo-project/ginkgo/blob/26eb276f03b3b1e3f46cc20e3bca9f4636d225b6/common/cuda_hip/matrix/coo_kernels.cpp#L64

2.3 Implementations of Existing Intrinsics
------------------------------------------

Efficient implementations of some existing intrinsics involve scan
operations. For example, consider the following possible
implementation of the PACK intrinsic.

    module m_scan
    contains
      function inclusive_scan(A) result(S)
        integer, intent(in) :: A(:)
        integer :: S(size(A))
        integer :: i
        S(1) = A(1)
        do i = 2, size(A)
          S(i) = S(i-1) + A(i)
        end do
      end function inclusive_scan
    end module m_scan

    module m_cram
      use m_scan
    contains
      ! cram behaves like PACK: the inclusive scan of the integer
      ! mask gives each selected element its position in the result
      function cram(A, mask) result(P)
        integer, intent(in) :: A(:)
        logical, intent(in) :: mask(:)
        integer, allocatable :: P(:)
        integer, allocatable :: imask(:)
        integer :: i
        allocate(P(count(mask)))
        allocate(imask(size(mask)))
        where (mask)
          imask = 1
        elsewhere
          imask = 0
        end where
        imask = inclusive_scan(imask)
        do i = 1, size(A)
          if (mask(i)) then
            P(imask(i)) = A(i)
          end if
        end do
      end function cram
    end module m_cram

    program pack_scan
      use m_cram
      implicit none
      integer :: a(8) = [1,2,3,4,5,6,7,8]
      logical :: m(8)
      m = modulo(a, 2) == 0
      print*, pack(a, m)
      print*, cram(a, m)
    end program pack_scan

3. Use Case and Motivation for CO_SCAN
======================================

One motivating use case for a collective scan operation is
decomposing a domain when reading data from, or writing data to, a
file. For example, each image will be responsible for reading a
different part of a dataset from a file, processing that data, and
writing the results into another file. The "pseudo-Fortran" below
illustrates how this might easily be achieved with a CO_SCAN
intrinsic.

    ! determine amount of data each image will process
    my_start = inp_data_size
    ! shift my_start to correct position
    call co_postfix_scan(my_start, add, 0)
    ! read data from file
    do i = 1, inp_data_size
      read(inp_file, rec=i+my_start) inp_data(i)
    end do
    ! process the data
    my_start = out_data_size
    ! shift my_start to correct position
    call co_postfix_scan(my_start, add, 0)
    do i = 1, out_data_size
      write(out_file, rec=i+my_start) out_data(i)
    end do

It is conceivable that a programmer could write such a procedure
themselves with the existing features of the language, but writing it
efficiently is a hard problem, and the best algorithm may be platform
dependent. It would be better if compiler developers could provide
the best implementations for their supported systems.
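To make that point concrete, here is a purely illustrative sketch of
a hand-written version of the first scan in the pseudo-code above,
using only existing features (the data sizes are placeholders). It is
easy to write, but it serializes the images, so its cost grows
linearly with the number of images rather than logarithmically.

    program naive_offsets
      implicit none
      integer :: inp_data_size, my_start
      integer, allocatable :: running_total[:]

      allocate(running_total[*])
      inp_data_size = 10 * this_image()   ! placeholder for real size

      ! exclusive scan by walking the images in order: each image
      ! waits for its predecessor's running total, then adds its own
      ! size for its successor
      if (this_image() == 1) then
        my_start = 0
      else
        sync images (this_image() - 1)
        my_start = running_total[this_image() - 1]
      end if
      running_total = my_start + inp_data_size
      if (this_image() < num_images()) sync images (this_image() + 1)

      print *, "image", this_image(), "starts at offset", my_start
    end program naive_offsets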
A reference implementation of a scalar integer co_postfix_scan might
look like the following.

    module co_scan_m
      implicit none
      private
      public :: co_prefix_scan, co_postfix_scan
    contains
      subroutine co_prefix_scan(a, operation)
        integer, intent(inout) :: a
        interface
          pure function operation(lhs, rhs) result(r)
            integer, intent(in) :: lhs, rhs
            integer :: r
          end function
        end interface
        integer, allocatable :: tmp[:]
        integer :: partial, mask, dst

        if (num_images() == 1) return
        allocate(tmp[*])
        partial = a
        mask = 1
        do while (mask < num_images())
          dst = ieor(this_image()-1, mask) + 1
          if (dst <= num_images()) then
            tmp[dst] = partial
            sync images (dst)
            if (this_image() > dst) then
              partial = operation(tmp, partial)
              a = operation(tmp, a)
            else
              partial = operation(partial, tmp)
            end if
          end if
          mask = mask * 2
          sync all
        end do
      end subroutine

      subroutine co_postfix_scan(a, operation, initial)
        integer, intent(inout) :: a
        interface
          pure function operation(lhs, rhs) result(r)
            integer, intent(in) :: lhs, rhs
            integer :: r
          end function
        end interface
        integer, intent(in) :: initial
        integer, allocatable :: tmp[:]

        if (num_images() == 1) then
          a = initial
          return
        end if
        allocate(tmp[*])
        if (this_image() < num_images()) tmp[this_image()+1] = a
        if (this_image() == 1) tmp = initial
        sync all
        a = tmp
        call co_prefix_scan(a, operation)
      end subroutine
    end module

    program main
      use co_scan_m, only: co_prefix_scan, co_postfix_scan
      implicit none
      integer :: a

      a = this_image()
      call co_prefix_scan(a, add)
      call print_in_image_order(a)

      a = this_image()
      call co_postfix_scan(a, add, 0)
      call print_in_image_order(a)
    contains
      pure function add(lhs, rhs) result(r)
        integer, intent(in) :: lhs, rhs
        integer :: r
        r = lhs + rhs
      end function

      subroutine print_in_image_order(x)
        integer, intent(in) :: x
        integer :: i

        do i = 1, num_images()
          if (this_image() == i) then
            write(*,*) "image[", i, "]", x
          end if
          sync all
        end do
      end subroutine
    end program

However, compared to the implementations present in, e.g., MPICH
(https://github.com/pmodels/mpich/blob/main/src/mpi/coll/iscan/*.c),
this implementation is not SMP aware and is not generic with respect
to type, i.e. it only supports integer. Generics may only partially
mitigate that limitation, and templates will likely not eliminate it
either.

4. Conclusion
=============

As demonstrated by the above examples, scan and collective scan
operations are commonly used in high performance computing
applications and can be difficult to implement efficiently. As such,
it is strongly encouraged that they be added as intrinsic procedures
to the Fortran language.