To: J3                                                     J3/18-110r1
From: Tom Clune
Subject: Use case for containers
Date: 2018-February-08


The contents of this paper are in paper J3/##-###.xxx


Introduction:
-------------

1. As scientific models become more complex, an increasing fraction of
   the lines of code are infrastructure as-opposed to direct
   numerical calculation.  Examples of infrastructure include software
   layers that couple independently developed subsystems, frameworks
   for managing distributed parallelism, advanced I/O (e.g.,
   checkpoint/restart via NetCDF), etc.  Generally these
   infrastructure layers evolve as an attempt to avoid code
   duplication as multiple parts of the system require similar
   functionality.

2. Software "containers" are abstractions that enable aggregating
   groups of related entities for convenient and efficient access.
   Many categories of software containers have been defined, with each
   category generally tailored to a common design issue.  By providing a
   "standard" and reliable means to perform commonly occurring
   operations, containers can greatly simplify design and
   implementation of complex algorithms.

3. Fortran provides just one category of software container - Array.
   Array containers are designed to optimize random access to a
   _fixed_ collection of elements, and generally requires all
   contained elements to be of the same dynamic type.  Fortran
   provides excellent mechanism for declaring and constructing arrays
   of any type as well as for accessing, storing, and modifying array
   members (elements) with succinct, tailored syntax: tuples of
   indices within parens:

   a(i,j) = x
   x = a(i,j)
   a(i,j) = a(i,j)**2

4. Other commonly used container categories include Set, Vector (or
   List), Map (or Associative Array, Dictionary), Stack, Queue, etc.
   Unlike Array containers, these others generally are more dynamic -
   growing as necessary when new elements are added to the container.
   Here I briefly describe some of the most commonly categories of
   containers.

   a. Vectors are somewhat similar to 1-D Arrays in that the provide
      efficient random accesss.  But whereas the size of an Array is
      fixed at the time of its construction, a Vector can grow as
      elements are added.  (For those unfamiliar with containers,
      there is no magic here.  Internally arrays are reallocated with
      copies but in a manner that adding elements is of O(1)
      complexity on average.)

   b. Set containers provide efficient (O(log n)) mechanisms for
      searching and inserting distinct elements that can be ordered
      according to some criterion.  Typical implementations use a
      balanced binary tree data structure to store the elements.

   c. Maps are similar to Sets, except that the interfaces emphasize
      that the contained elements are key-value pairs.  Typically the
      keys are either integers or strings, but can generally be any
      data type that can be ordered.  A simple example would be a Map
      that whose keys are the names of students in a classroom, and
      the values are their grades.  A more relevant example would be a
      Map that provides efficient acces to a sparse array where the
      keys are the indices for the non-zero array elemnts.


5. Iterators.  Iterators provide a simple/starndard mechanism to
   efficiently loop through all of the elements in a container.  For
   containers such as Vector and Array, these are relatively simple,
   but for containers such as Set and Map that are implemented with
   binary trees, the iterator abstraction is extremely beneficial to
   the user of the container.  Iterators simultaneously hide
   complexity and encourage safe coding styles.  In pseudo-code iterator
   usage is typically something like:

   < declare container > C
   < declare container iterator > I
   < declare element > e

   I = begin(C)
   loop while I is not end(C)
       e = get(I)
       < do something with e >
       next(I)
   end loop


Use cases:
----------

The following are use cases for containers that I have worked on just
within the last year.

1. Atmospheric tracers metadata in the NASA GISS climate model.  Each
   species (CO2, CH4, NO3, ...) in the atmosphere is associated with a
   variety of metadata.  These include the molar mass, the radioactive
   decay rate, diffusivity, a category label (dust, aerosol, etc.),
   and so on.  There are O(30) such fields in the model currently.
   Most are required to have a value for all tracers, but some fields
   are only added for tracers where they are relevant.  The types of
   these metadata are LOGICAL, INTEGER, REAL, and all but one are
   scalars.  The natural representation of this is a Map container
   where the keys are the property name and the values are the various
   items of metadata.

2. Collection of tracers in NASA GISS climate model.  Different
   configurations of the model utilize different subsets of the
   available tracers in the source code.  The number ranges from 0 to
   over 100 tracers.  Each tracer has a natural name, usually the
   chemical formula, but sometimes regular names like 'terpene' or
   'dust'.  A natural mechanism to manage data associated with each
   tracer is again a Map container where the keys are the tracer names
   and the values are a derived type containing all tracer data.  For
   historical/conservative reasons only the metadata dictionaries are
   actually managed this way.  Other data are in dynamically allocated
   Arrays (Array containers) whose size can be computed after all of
   the metadata has been processed.

3. Regridding registry in NASA GEOS5 model.  This model must represent
   data on varying grids that discretize the Earth's atmosphere.  To
   transfer data between grids, largish interpolation tables are
   generated and stored.  Because of the expense of computing them,
   they are cached at the time they are first computed.  The mechanism
   used to manage this is again a Map where the keys are a pair of
   specifiers associated with the two grids that determine the
   regridding.


4. A new I/O client-servere layer in NASA GEOS5 model.

   i) Each server process manages I/O requests from a varying
      number of application processes.  This is managed as a Vector of
      clients that grows as needed.

   ii) To encode/decode messages, a Map container with integer keys is
       used to store message prototypes.  This allows each subclass of
       message to decode itself and while the top-level processing
       only needs to handle the integer label.

   iii) The server processes accumulate data requests from the client
        in a Map where they key is a message ID and the details are in
        a data type.

   iv) NetCDF metadata itself uses multiple containers.  NetCDF
       variables are represented as a Map with the variable names as
       keys and values are Variable derived type.  The Variable
       derived type itself has a Map where the keys are Attribute
       names and the values are Attribute values.  The Variable
       derived type also has a Vector of "dims" that relate the
       variable dimensions to a global list for the file.

5. Next generation pFUnit unit-testing framework for Fortran.  The
   framework maintains a global list of "exceptions" that are thrown
   by user tests.  This is naturally represented as a vector of
   entities of derived type.

6. New configurable logging utiltity (pFlogger).  This framework
   manages collections of abstract data types: Handlers (abstractions
   of Files), Loggers, Filters, and Formatters.  To enable runtime
   configurability each such entity is associated with a name.  Map's
   are then used to manage the associations.  E.g., each Logger can
   have multiple Handlers.  Loggers and Handlers can both have
   multiple Filters.  These Maps all all polymorphic.  The package
   also uses a number of containers (Vector and Map) involving
   intrinsic types.


Containers and Fortran:
------------------------

Infrastructure layers for complex Fortran applications often
implicitly/unknowingly implement crude versions of containers like
vectors and associative arrays.  The Vector pattern arises naturally
when the total number of instances of some collection cannot be
determined via a simple formula or when elements are added
sporadically through different drivers.  Intead the size is
"discovered" through some iterative process.  A common implementation
in the simplest case is a two-pass algorithm.  The passes are nearly
identical except that the first pass just accumulates the number of
elements.  An allocation is then made and the 2nd pass fills in the
allocated structure.

The need for Map containers arises when we need to find entities
through som registry.  E.g., if an ocean component needs the surface
winds from an atmospheric component, an abstract coupler enables this
by allowing both components to use the names 'U-wind' and 'V-wind' to
retrieve the associated arrays.  This avoids hardcoding indices, which
is important when considering independently developed components.

Typical Fortran implementations of these patterns are ad-hoc.  Even
within the same application two different layers may implement the
same pattern in rather different ways.  The layers also often contain
latent bugs that are not exercised in the current component, but
prevent their adoption in other components.


Obstacles to clean/robust implementation of containers in Fortran:
------------------------------------------------------------------

One large obstacle to development and use of containers in Fortran is
the lack of support for generic coding.  Developers must either
duplicate the logic of a container for each potential type of element,
or must resort to non-Fortran means (preprocessors) to have a single
implementation that is applicable to all cases.  E.g., one may wish to
have Vectors of integers, reals, logicals, and/or various derived
types, leading to many almost identical implementations and all the
associated dangers of long-term maintenance.  With Maps, the problem
just grows combinatorially as one considers variant types of keys.

Another obstacle, albeit less severe, is the inability to overload
suitable operators for container accessors.  With Fortran arrays,
parens can be used to access individual elements on both the left and
right hand sides of assignment:

      x = a(i,j,k)
      b(i,j,k) = y


Ideally, Vector and Map containers would have similar mechanism for
accessing and modifying elements.  Note that containers generally
return pointers to contained elements rather than copies of those
elements.


State of the art?
-----------------

With aggressive and complicated use of CPP/FPP preprocessing it is
possible to write robust generic Vector and Map containers in Fortran.
I and my colleague Doron Feldman have produced such a package which
was recently released as open source: gFTL.  Another similar package,
FTL, has also been independently released.  (They beat us to the
cooler name.)  These implementations are a nice step forward, and have
been of enormous benefit in the development of new powerful software
layers.  However, these containers are still not as simple to use as
one could hope for number of reasons:

1. A separate Fortran class must be constructed for each contained
   type.  In languages with stronger support for containers, this step
   is performed by the compiler.

2. Even if two separate projects use the same container package (e.g.,
   gFTL), they cannot assume that the other is providing a common
   container and must redeclare for themselves.  One easily ends up
   with many all-but-identical modules for say IntegerVector.

3. Iterators are handled via Fortrans DO WHILE construct.  Users then
   put a call to iter%next() at the end of the loop.  This works well
   in simple cases, but is dangerous in the presence of CYCLE because
   the user must remember to invoke iter%next() there as well.  Other
   languages that have an "increment" component to their loop
   constructs (e.g., C, C++) can handle iterators more robustly by
   putting the iterator increment into the header of the loop.  This
   concern is probably more general than just iterators and should
   perhaps be a paper of its own.

4. As mentioned above, the supported accessors are clunky.  The
   inability to use syntax analogous to that of the tuples of indices
   for arrays is unfortunate.  It is a minor annoyance for references
   on the RHS of an asssignment, but much more so for references that
   would otherwise be on the LHS.  E.g., if we hava a containe for a
   sparse array and want to modify an element:

      CALL sparse%set([i,j], x)  ! supported but inelegant

   compared to

      sparse(i,j) = x  ! not implementable in Fortran 2018

5. Containers of pointers are problematic.  Fortran lacks a robusts
   mechanism to order pointers via some COMPARE method.  This prevents
   any standard-conforming implementation of a Set of pointers or a
   Map that has keys that are pointers.  Note that usual trick wrapping the
   pointers in a derived type does not help with this issue.


This is a multi-faceted use case and there are a number of fronts on
which the language could be improved to support containers.  Any of
the following would be of some value even without the others:

1) Generic programming to facilitate user-implementation of
   containers that could be used with arbitrary types.  (Ideally,
   community supported packages would then emerge ala C++ STL).

2) Widen the class of overloaded "punctuation" to allow some form of
   paren/bracket notation for specifying container elements on both
   LHS and RGS of assignments.

3) Make specific types containers first class language entities, ala
   existing arrays.

4) Provide an intrinsic COMPARE that arbitrarily (but consistently)
   orders two pointers with the same declared type. Here there are a
   variety of Map and Vector containers that arise.  In many cases,
   the entities in the container are polymorphic.  I.e. the container
   allows entities that are any type that extends some base type.
   (And in at least one case the contaire entities are of unlimited
   polymorphic type.)  Here are some specific containers in the
   package that are easier to explain: