To: J3 J3/18-110r1 From: Tom Clune Subject: Use case for containers Date: 2018-February-08 The contents of this paper are in paper J3/##-###.xxx Introduction: ------------- 1. As scientific models become more complex, an increasing fraction of the lines of code are infrastructure as-opposed to direct numerical calculation. Examples of infrastructure include software layers that couple independently developed subsystems, frameworks for managing distributed parallelism, advanced I/O (e.g., checkpoint/restart via NetCDF), etc. Generally these infrastructure layers evolve as an attempt to avoid code duplication as multiple parts of the system require similar functionality. 2. Software "containers" are abstractions that enable aggregating groups of related entities for convenient and efficient access. Many categories of software containers have been defined, with each category generally tailored to a common design issue. By providing a "standard" and reliable means to perform commonly occurring operations, containers can greatly simplify design and implementation of complex algorithms. 3. Fortran provides just one category of software container - Array. Array containers are designed to optimize random access to a _fixed_ collection of elements, and generally requires all contained elements to be of the same dynamic type. Fortran provides excellent mechanism for declaring and constructing arrays of any type as well as for accessing, storing, and modifying array members (elements) with succinct, tailored syntax: tuples of indices within parens: a(i,j) = x x = a(i,j) a(i,j) = a(i,j)**2 4. Other commonly used container categories include Set, Vector (or List), Map (or Associative Array, Dictionary), Stack, Queue, etc. Unlike Array containers, these others generally are more dynamic - growing as necessary when new elements are added to the container. Here I briefly describe some of the most commonly categories of containers. a. Vectors are somewhat similar to 1-D Arrays in that the provide efficient random accesss. But whereas the size of an Array is fixed at the time of its construction, a Vector can grow as elements are added. (For those unfamiliar with containers, there is no magic here. Internally arrays are reallocated with copies but in a manner that adding elements is of O(1) complexity on average.) b. Set containers provide efficient (O(log n)) mechanisms for searching and inserting distinct elements that can be ordered according to some criterion. Typical implementations use a balanced binary tree data structure to store the elements. c. Maps are similar to Sets, except that the interfaces emphasize that the contained elements are key-value pairs. Typically the keys are either integers or strings, but can generally be any data type that can be ordered. A simple example would be a Map that whose keys are the names of students in a classroom, and the values are their grades. A more relevant example would be a Map that provides efficient acces to a sparse array where the keys are the indices for the non-zero array elemnts. 5. Iterators. Iterators provide a simple/starndard mechanism to efficiently loop through all of the elements in a container. For containers such as Vector and Array, these are relatively simple, but for containers such as Set and Map that are implemented with binary trees, the iterator abstraction is extremely beneficial to the user of the container. Iterators simultaneously hide complexity and encourage safe coding styles. In pseudo-code iterator usage is typically something like: < declare container > C < declare container iterator > I < declare element > e I = begin(C) loop while I is not end(C) e = get(I) < do something with e > next(I) end loop Use cases: ---------- The following are use cases for containers that I have worked on just within the last year. 1. Atmospheric tracers metadata in the NASA GISS climate model. Each species (CO2, CH4, NO3, ...) in the atmosphere is associated with a variety of metadata. These include the molar mass, the radioactive decay rate, diffusivity, a category label (dust, aerosol, etc.), and so on. There are O(30) such fields in the model currently. Most are required to have a value for all tracers, but some fields are only added for tracers where they are relevant. The types of these metadata are LOGICAL, INTEGER, REAL, and all but one are scalars. The natural representation of this is a Map container where the keys are the property name and the values are the various items of metadata. 2. Collection of tracers in NASA GISS climate model. Different configurations of the model utilize different subsets of the available tracers in the source code. The number ranges from 0 to over 100 tracers. Each tracer has a natural name, usually the chemical formula, but sometimes regular names like 'terpene' or 'dust'. A natural mechanism to manage data associated with each tracer is again a Map container where the keys are the tracer names and the values are a derived type containing all tracer data. For historical/conservative reasons only the metadata dictionaries are actually managed this way. Other data are in dynamically allocated Arrays (Array containers) whose size can be computed after all of the metadata has been processed. 3. Regridding registry in NASA GEOS5 model. This model must represent data on varying grids that discretize the Earth's atmosphere. To transfer data between grids, largish interpolation tables are generated and stored. Because of the expense of computing them, they are cached at the time they are first computed. The mechanism used to manage this is again a Map where the keys are a pair of specifiers associated with the two grids that determine the regridding. 4. A new I/O client-servere layer in NASA GEOS5 model. i) Each server process manages I/O requests from a varying number of application processes. This is managed as a Vector of clients that grows as needed. ii) To encode/decode messages, a Map container with integer keys is used to store message prototypes. This allows each subclass of message to decode itself and while the top-level processing only needs to handle the integer label. iii) The server processes accumulate data requests from the client in a Map where they key is a message ID and the details are in a data type. iv) NetCDF metadata itself uses multiple containers. NetCDF variables are represented as a Map with the variable names as keys and values are Variable derived type. The Variable derived type itself has a Map where the keys are Attribute names and the values are Attribute values. The Variable derived type also has a Vector of "dims" that relate the variable dimensions to a global list for the file. 5. Next generation pFUnit unit-testing framework for Fortran. The framework maintains a global list of "exceptions" that are thrown by user tests. This is naturally represented as a vector of entities of derived type. 6. New configurable logging utiltity (pFlogger). This framework manages collections of abstract data types: Handlers (abstractions of Files), Loggers, Filters, and Formatters. To enable runtime configurability each such entity is associated with a name. Map's are then used to manage the associations. E.g., each Logger can have multiple Handlers. Loggers and Handlers can both have multiple Filters. These Maps all all polymorphic. The package also uses a number of containers (Vector and Map) involving intrinsic types. Containers and Fortran: ------------------------ Infrastructure layers for complex Fortran applications often implicitly/unknowingly implement crude versions of containers like vectors and associative arrays. The Vector pattern arises naturally when the total number of instances of some collection cannot be determined via a simple formula or when elements are added sporadically through different drivers. Intead the size is "discovered" through some iterative process. A common implementation in the simplest case is a two-pass algorithm. The passes are nearly identical except that the first pass just accumulates the number of elements. An allocation is then made and the 2nd pass fills in the allocated structure. The need for Map containers arises when we need to find entities through som registry. E.g., if an ocean component needs the surface winds from an atmospheric component, an abstract coupler enables this by allowing both components to use the names 'U-wind' and 'V-wind' to retrieve the associated arrays. This avoids hardcoding indices, which is important when considering independently developed components. Typical Fortran implementations of these patterns are ad-hoc. Even within the same application two different layers may implement the same pattern in rather different ways. The layers also often contain latent bugs that are not exercised in the current component, but prevent their adoption in other components. Obstacles to clean/robust implementation of containers in Fortran: ------------------------------------------------------------------ One large obstacle to development and use of containers in Fortran is the lack of support for generic coding. Developers must either duplicate the logic of a container for each potential type of element, or must resort to non-Fortran means (preprocessors) to have a single implementation that is applicable to all cases. E.g., one may wish to have Vectors of integers, reals, logicals, and/or various derived types, leading to many almost identical implementations and all the associated dangers of long-term maintenance. With Maps, the problem just grows combinatorially as one considers variant types of keys. Another obstacle, albeit less severe, is the inability to overload suitable operators for container accessors. With Fortran arrays, parens can be used to access individual elements on both the left and right hand sides of assignment: x = a(i,j,k) b(i,j,k) = y Ideally, Vector and Map containers would have similar mechanism for accessing and modifying elements. Note that containers generally return pointers to contained elements rather than copies of those elements. State of the art? ----------------- With aggressive and complicated use of CPP/FPP preprocessing it is possible to write robust generic Vector and Map containers in Fortran. I and my colleague Doron Feldman have produced such a package which was recently released as open source: gFTL. Another similar package, FTL, has also been independently released. (They beat us to the cooler name.) These implementations are a nice step forward, and have been of enormous benefit in the development of new powerful software layers. However, these containers are still not as simple to use as one could hope for number of reasons: 1. A separate Fortran class must be constructed for each contained type. In languages with stronger support for containers, this step is performed by the compiler. 2. Even if two separate projects use the same container package (e.g., gFTL), they cannot assume that the other is providing a common container and must redeclare for themselves. One easily ends up with many all-but-identical modules for say IntegerVector. 3. Iterators are handled via Fortrans DO WHILE construct. Users then put a call to iter%next() at the end of the loop. This works well in simple cases, but is dangerous in the presence of CYCLE because the user must remember to invoke iter%next() there as well. Other languages that have an "increment" component to their loop constructs (e.g., C, C++) can handle iterators more robustly by putting the iterator increment into the header of the loop. This concern is probably more general than just iterators and should perhaps be a paper of its own. 4. As mentioned above, the supported accessors are clunky. The inability to use syntax analogous to that of the tuples of indices for arrays is unfortunate. It is a minor annoyance for references on the RHS of an asssignment, but much more so for references that would otherwise be on the LHS. E.g., if we hava a containe for a sparse array and want to modify an element: CALL sparse%set([i,j], x) ! supported but inelegant compared to sparse(i,j) = x ! not implementable in Fortran 2018 5. Containers of pointers are problematic. Fortran lacks a robusts mechanism to order pointers via some COMPARE method. This prevents any standard-conforming implementation of a Set of pointers or a Map that has keys that are pointers. Note that usual trick wrapping the pointers in a derived type does not help with this issue. This is a multi-faceted use case and there are a number of fronts on which the language could be improved to support containers. Any of the following would be of some value even without the others: 1) Generic programming to facilitate user-implementation of containers that could be used with arbitrary types. (Ideally, community supported packages would then emerge ala C++ STL). 2) Widen the class of overloaded "punctuation" to allow some form of paren/bracket notation for specifying container elements on both LHS and RGS of assignments. 3) Make specific types containers first class language entities, ala existing arrays. 4) Provide an intrinsic COMPARE that arbitrarily (but consistently) orders two pointers with the same declared type. Here there are a variety of Map and Vector containers that arise. In many cases, the entities in the container are polymorphic. I.e. the container allows entities that are any type that extends some base type. (And in at least one case the contaire entities are of unlimited polymorphic type.) Here are some specific containers in the package that are easier to explain: