To: J3 J3/25-144 From: Brandon Cook & Dan Bonachea Subject: Requirements for US20 collective subroutines for prefix operations Date: 2025-June-11 #Reference: J3/24-157, J3/23-235r2, J3/23-113, WG5/N-2239 1. Introduction =============== Scan, or prefix reduction, operations are fundamental building blocks in parallel algorithms and data manipulation tasks. The original `SCAN` and `CO_SCAN` proposal received "mixed support" , and prefix reduction operations were subsequently "conditionally accepted" as a prospective work item for F202Y at the 2024 WG5 meeting, pending further discussion and refinement. They were again discussed at the February 2025 WG5 meeting, and subsequently promoted to "accepted" work item US20 via WG5 letter ballot in May 2025 (WG5/N-2239). The result of the WG5 vote was 20 yes 5 no and 1 undecided with several informative comments. This paper focuses exclusively on requirements for the collective subroutine variant refining previous concepts based on community and WG5 feedback. Our aim is to allow consideration independent of the closely related but distinct local intrinsics. We briefly revisit use cases as we are now addressing the two variants separately. 2. Nomenclature ================ The term "SCAN" has been identified as problematic due to potential conflicts and historical connotations. In this document we adopt the terminology of "prefix reduction" to reference the operation. 3. Motivation and Use Cases ============================ Collective prefix reduction operations are fundamental building blocks for many parallel algorithms. 3.1 Parallel I/O use case ------------------------- While numerous use cases exist, one particularly compelling motivation is the decomposition of a domain for parallel file I/O. Consider an application where multiple images must write distinct blocks of data to a large, shared file. Before writing, each image calculates its own contribution to the total output. The size of each contribution may vary from image to image. An example is applying a filter to the result of a simulation. Then each image must determine where to start writing in the file. Communication is now necessary as this information is fundamentally distributed. The following code fragment illustrates this pattern: ! number of results may vary by image output(:) = compute_variable_results() n = size(output) ! communication is required to determine offset offset = n call co_sum_prefix_exclusive(offset) ! each image writes into the correct position write(out_file, rec=offset) output Without an intrinsic, programmers must implement this collective communication themselves. A reference implementation of a collective prefix reduction is non-trivial, and writing an *efficient* and *portable* version is a hard problem. The best algorithm is often platform-dependent, and user-level implementations are unlikely to match the performance of a vendor-provided intrinsic that is tuned for specific network hardware and system architecture. 3.2 Motivation -------------- * Intrinsic vs. Library: A WG5 comment suggested that prefix reduction operations might be better supported by a standard library. While this may be true for local reductions, for *collectives*, an intrinsic is far superior. It allows compiler and runtime vendors to provide highly optimized implementations that integrate directly with the inter-image communication software and underlying HPC interconnect hardware. This level of optimization is not achievable in a portable user-level library. * Generality vs. Performance (Hardware Offload): A key piece of feedback noted that a general collective with a user-provided procedure can inhibit performance optimizations, as the operation cannot be offloaded to smart network hardware. To address this concern, this paper proposes a hybrid approach, analogous to the relationship between `CO_SUM` and `CO_REDUCE`. 1. A general `CO_REDUCE_PREFIX` for maximum flexibility, analogous to `CO_REDUCE`. 2. A specific `CO_SUM_PREFIX` for the most common use case, analogous to `CO_SUM`. This version can be readily optimized by vendors. 4. Discussion of Requirements ==================================== 4.1 Generality vs. Performance ------------------------- A general collective with a user-provided procedure provides maximum flexibility but can inhibit performance optimizations, for example the operation cannot be offloaded to smart network hardware. * Option A) CO_REDUCE_PREFIX only. Minimal API. * Option B) CO_REDUCE_PREFIX, CO_SUM_PREFIX only. The rest of the paper follows this option. * Option C) CO_REDUCE_PREFIX, CO_SUM_PREFIX, CO_MIN_PREFIX, CO_MAX_PREFIX. Maximize symmetry with existing collective subroutines. straw poll: A straw poll: B straw poll: C 4.1 Separate inclusive and exclusive subroutines ------------------------------------------------- Inclusive vs. Exclusive Terminology: Following widespread practice (e.g., MPI), this proposal adopts the "inclusive" and "exclusive" terminology for the two common variations of prefix reduction. In short, an inclusive prefix reduction includes the value of input element A_i when computing corresponding output element R_i, whereas the exclusive variant does not. To enhance clarity and safety, it proposes separate subroutines for each variant, rather than using an `EXCLUSIVE` logical argument. This allows the `IDENTITY` argument to be made mandatory for exclusive prefix reductions, for which it is a fundamental part of the definition. Straw poll: 4.2 Image ordering ------------------ All the intrinsics proposed in this paper are collective subroutines, and thus subject to all of the common requirements specified in Fortran section 16.6. So for example, they must be invoked collectively by the same statement on all active images in the current team, with arguments that meet specified constraints for corresponding references. Mathematically, a prefix reduction operation accepts an ordered list of input values and computes an ordered list of output result values. We propose collective prefix reductions where both these input and output lists are ordered according to the image indexes in the active team. Specifically, for an inclusive prefix reduction, the result R_i provided to image i is computed from the inputs provided by images (1:i). For an exclusive prefix reduction, the result R_i provided to image i is computed from the inputs provided by images (1:i-1). 5. Collective CO_SUM_PREFIX subroutines ======================================== Prefix reduction with + across images. CO_SUM_PREFIX_INCLUSIVE(A [, STAT, ERRMSG]) CO_SUM_PREFIX_EXCLUSIVE(A [, STAT, ERRMSG]) 5.1 Semantic Requirements ------------------------- A has any numeric type. A must have the same shape, type, and type parameter values in corresponding references. A in INTENT(INOUT). Each element of the computed value assigned into A is equal to a processor-dependent approximation to the inclusive/exclusive (respectively) prefix sum of corresponding elements of A provided in corresponding references. By definition + is assumed to be (approximately) associative and commutative. The identity value for sum is implicitly zero. The computed value is assigned to A if no error condition occurs. Otherwise, A becomes undefined as in CO_SUM. Unlike CO_SUM, RESULT_IMAGE is not necessary. 6. Collective CO_REDUCE_PREFIX subroutines =========================================== Generalized prefix reduction across images. CO_REDUCE_PREFIX_INCLUSIVE(A, OPERATION [, STAT, ERRMSG]) CO_REDUCE_PREFIX_EXCLUSIVE(A, OPERATION, IDENTITY [, STAT, ERRMSG]) 6.1 Semantic Requirements ------------------------- These requirements follow a straightforward extension of CO_REDUCE: A may not be polymorphic or have an ultimate component that is allocatable or a pointer. A must have the same shape, type, and type parameter values in corresponding references. Each element of the computed value assigned into A is equal to a processor-dependent approximation to the inclusive/exclusive (respectively) prefix reduction of corresponding elements of A provided in corresponding references. OPERATION should be a pure function. OPERATION should accept exactly two arguments; the result and each argument must be a scalar, nonallocatable, noncoarray, nonpointer, nonpolymorphic, nonoptional data object with the same declared type and type parameters as the input ARRAY. OPERATION is assumed to implement an (approximately) mathematically associative operation. OPERATION must be the same function on all images in corresponding references. IDENTITY must be conformable with A. IDENTITY must have the same value in corresponding references. 6.2 Commutativity ----------------- CO_REDUCE assumes that OPERATION is commutative. A new optional logical parameter could be introduced, enabling the caller to compute a prefix reduction using a non-commutative operation. Examples of common operations that are associative but not commutative include string concatenation and matrix multiplication. straw poll: