To: J3                                                      J3/19-193
From: Tom Clune and Hamid Oloso
Subject: Performance Portability - use case for generics
Date: 2019-August-02
#Reference:

1. Introduction

An unfortunate reality of extant high-performance computing platforms
is that for a given computational kernel the optimal loop ordering
_and_ data layout may vary from architecture to architecture.  This is
particularly acute for GPUs vs CPUs, where the optimal data layout for
one may result in a 10x performance hit for the other.  (The exact
driver for this is not crucial to the use case beyond the mere fact of
the observed performance differences.)  Perhaps in theory an
optimizing compiler could eliminate this issue, but in practice this
is a real problem facing HPC applications today.  E.g., from Che et
al.:

   "On the other hand, CPUs and GPUs may prefer different data layouts
   for certain applications.  This is due to the fact that better
   cache locality is needed for contiguous memory accesses issued by
   individual CPU threads, while an efficient GPU memory transaction
   is desirable to feed data to multiple simultaneous SIMD threads."
   (http://www.cs.virginia.edu/~skadron/Papers/sc11_dymaxion_dist.pdf)

For an application that is trying to optimize performance for multiple
architectures there are currently few good options.  The most
straightforward is to have two distinct implementations of each
kernel, in which the data ordering of the arrays and the loop ordering
differ.  For smaller applications this may be possible, but for large,
evolving applications this approach is fundamentally impractical.

A growing number of organizations are now choosing to address this
concern through higher-level domain-specific languages (DSLs) in which
algorithms are expressed in a more architecture-neutral manner and are
then effectively translated to different implementations on different
machines.  Fortran may aspire to achieve such performance portability
directly, but these other approaches are successful now.  Two such DSL
layers that are currently being adopted within the weather and climate
communities are Kokkos from DOE, and GridTools from MeteoSwiss and
ECMWF.  Both of these layers rely heavily on C++ templating to achieve
the necessary flexibility without sacrificing performance.
Unfortunately, adopting these layers requires that the
performance-critical sections of an application _and_ the associated
data structures be converted to C++.  Such a (hopefully/ideally)
one-time conversion is considered to be far more acceptable than
maintenance of multiple versions within Fortran.  MeteoSwiss has
essentially rewritten their entire weather model, and rewrites of
other major climate and weather models are well under way.  At NASA,
we are evaluating Kokkos and GridTools and plan to hire several
developers in the next few months to accelerate the conversion.

2. Quick sketch of Kokkos

To better understand what might be involved in providing capabilities
analogous to Kokkos and GridTools in a Fortran context, it is useful
to understand the major elements of the existing C++ solutions.  While
they rely heavily on C++ templates, other language features also play
an important role in enabling user code to be expressed in a clean and
familiar manner.  Here we describe the major aspects of Kokkos, as the
authors are more familiar with its details.

Kokkos references:
   https://github.com/kokkos/kokkos
   https://github.com/kokkos/kokkos-tutorials

2.1 Kokkos: user kernels

Kokkos users write each of their numerical kernels as a functor or a
lambda function that takes one or more integer arguments indicating a
specific granule of work, as in the sketch below.
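For concreteness, here is a minimal sketch of such a user kernel in
C++.  The macro and View types are from the Kokkos API referenced
above; the kernel name and its fields are our own illustration:

   #include <Kokkos_Core.hpp>

   // A user kernel written as a functor: operator() receives the
   // integer index of one granule of work.  KOKKOS_INLINE_FUNCTION
   // marks it for inlining on both host and device.
   struct AxpyKernel {
     double alpha;
     Kokkos::View<const double*> x;
     Kokkos::View<double*> y;

     KOKKOS_INLINE_FUNCTION
     void operator()(const int i) const {
       y(i) = alpha * x(i) + y(i);
     }
   };

   // Launched as, e.g., Kokkos::parallel_for(n, AxpyKernel{a, x, y});
   // Equivalently, written inline as a lambda at the call site:
   //   KOKKOS_LAMBDA (const int i) { y(i) = alpha*x(i) + y(i); }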
2.2 Kokkos: Execution spaces

Execution spaces are an abstraction that controls which granules of
work will be exercised within a loop.  In the simplest cases,
execution spaces can be thought of as loop bounds, but in general they
can be set to chunk work granules and, in multidimensional cases, to
reverse the order of nesting.

2.3 Kokkos: Views

Views can be thought of as generalizations of arrays, and their use
strongly resembles array access: A(I,J) references an element.  The
declaration of a View includes an optional "layout" template parameter
that changes how elements are mapped to memory, with different
defaults on each architecture.  Through this mechanism, one can have
the same source code where A(I,J) is adjacent to A(I,J+1) on CPUs
while A(I,J) is adjacent to A(I+1,J) on GPUs.  This is a key aspect
that enables kernels to be written in a platform-agnostic manner.
(The sketch after section 2.4 shows Views and parallel loops
together.)

2.4 Kokkos: Parallel loops

The canonical parallel mechanism in Kokkos is the "parallel_for"
template.  The template takes an execution space and a user-defined
kernel and implements a loop with appropriate ordering and any extra
compiler directives (OpenMP, CUDA, OpenACC, ...).  A crucial
performance issue here is that (a) user kernels (functors and lambdas)
are inlined, and (b) access to elements of Views incurs no overhead
beyond that of native array accesses.  For (a), users instrument their
kernels with a Kokkos inline macro, which apparently works well enough
in practice even though the language is not required to inline.  For
(b), templates are crucial to avoid the run-time overhead that would
otherwise be incurred in providing seemingly indirect access to
memory.  Kokkos has analogous templates for parallel reduction and
parallel scan operations.
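Putting sections 2.2-2.4 together, the following is a minimal sketch
of the complete pattern, assuming the Kokkos API from the references
above (the array names and extents are illustrative):

   #include <Kokkos_Core.hpp>

   int main(int argc, char* argv[]) {
     Kokkos::initialize(argc, argv);
     {
       const int N = 1000;

       // 2-D Views; with the layout parameter omitted, each
       // architecture gets its preferred default (LayoutRight on host
       // CPUs, LayoutLeft on CUDA GPUs).  A layout can also be forced,
       // e.g. Kokkos::View<double**, Kokkos::LayoutLeft>.
       Kokkos::View<double**> a("a", N, N);
       Kokkos::View<double**> b("b", N, N);
       Kokkos::View<double**> c("c", N, N);

       // A rank-2 execution policy (the loop-bounds abstraction of
       // 2.2) covering i,j = 0..N-1; chunking and nesting order can
       // be tuned per architecture.
       Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({0, 0}, {N, N});

       // parallel_for maps the lambda kernel onto the architecture's
       // preferred parallel mechanism (OpenMP, CUDA, ...).
       Kokkos::parallel_for("add", policy,
         KOKKOS_LAMBDA (const int i, const int j) {
           a(i, j) = b(i, j) + c(i, j);  // operator() hides the layout
         });
     }
     Kokkos::finalize();
   }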
3. Can we do something similar in Fortran?

Converting large chunks of working Fortran code to C++ is unpalatable
for a number of reasons, and it is highly desirable to provide a
mechanism that can support realistic performance portability in a
Fortran-only context.  Some of the generic programming mechanisms that
have been floated for Fortran can support some of the necessary
machinery for a Kokkos-like approach.  In particular, functors for
kernels, execution spaces, and parallel loops seem relatively
straightforward with Intelligent Macros.

Unfortunately, analogs for Kokkos Views with the ability to
efficiently address data under multiple layouts are problematic for
several reasons.  Consider a simple Kokkos kernel of the form

   a(i,j) = b(i,j) + c(i,j)

where a, b, and c are Views, not arrays.  Unlike in C++, Fortran
arrays do not directly support variant data layouts, so we must try to
represent the indirect addressing through some sort of user procedure.
Ideally this would be some type-bound function, "ref", on variables of
derived type.  E.g.,

   a%ref(i,j) = b%ref(i,j) + c%ref(i,j)

While clunky, the RHS is possibly tolerable in practice and would
support the necessary layout variations.  Depending on the details of
construction, ref(i,j) might access the (i,j) element or the (j,i)
element of some private data component within the object.  The LHS is
more problematic, as Fortran does not permit such evaluations on the
LHS.

We are thus left with either a subroutine call,

   CALL a%set(i, j, b%ref(i,j) + c%ref(i,j))

which gets much less friendly with more realistic expressions, or the
introduction of an intermediate variable of a custom type that uses
overloaded assignment to accomplish the intended assignment in two
steps:

   tmp = a%ref(i,j)              ! holds a reference to the memory address
   tmp = b%ref(i,j) + c%ref(i,j) ! stores into the referenced address

A much better approach would be some mechanism that allows the
evaluation of the indirect layout on the LHS.  And while less
critical, a cleaner mechanism for indirect addressing on the RHS would
also simplify conversion of legacy kernels, as well as keep
expressions looking like conventional Fortran inner-loop expressions.
The C++ syntax for these types of operations (illustrated in the
sketch following the summary) results in easily-understood code that
looks more familiar to Fortran programmers than the ugly expressions
above that would be required in Fortran.

Summary:

If Fortran wishes to remain a contender for high-performance computing
in the GPU era, then it must address the ability to write
performance-portable code.  Existing Fortran capabilities are provably
inadequate, and not just by percentages: the differences discussed
here can often make 3x-10x differences in actual performance on modern
hardware.
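For reference, the following sketch (plain C++, not Kokkos itself; all
names are our own illustration) shows the language mechanism that
makes the LHS problem disappear in C++: operator() returns a
reference, so the same expression is legal on either side of an
assignment, and the layout is a compile-time policy that costs nothing
at run time:

   #include <vector>

   struct RowMajor {
     static int offset(int i, int j, int ni, int nj) { return i*nj + j; }
   };
   struct ColMajor {
     static int offset(int i, int j, int ni, int nj) { return j*ni + i; }
   };

   template <typename Layout>
   class View2D {
     std::vector<double> data_;
     int ni_, nj_;
   public:
     View2D(int ni, int nj) : data_(ni*nj), ni_(ni), nj_(nj) {}
     // Returns a reference: "v(i,j) = ..." stores, "... = v(i,j)"
     // loads, with the layout resolved at compile time.
     double& operator()(int i, int j) {
       return data_[Layout::offset(i, j, ni_, nj_)];
     }
   };

   // Usage: the kernel text is identical for either layout.
   //   View2D<RowMajor> a(n, n), b(n, n), c(n, n);
   //   a(i, j) = b(i, j) + c(i, j);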