To: J3 J3/18-110 From: Tom Clune Subject: Use case for containers Date: 2018-February-04 The contents of this paper are in paper J3/##-###.xxx Introduction: ------------- 1. As scientific models become more complex, an increasing fraction of the lines of code are infrastructure as-opposed to direct numerical calculation. Examples of infrastructure include software layers that couple independently developed subsystems, frameworks for managing distributed parallelism, advanced I/O (e.g., checkpoint/restart via NetCDF), etc. Generally these infrastructure layers evolve as an attempt to avoid code duplication as multiple parts of the system require similar functionality. 2. Software "containers" are abstractions that enable aggregating groups of related entities for convenient and efficient access. Many categories of software containers have been defined, with each category generally tailored to a common design issue. By providing a "standard" and reliable means to perform commonly occurring operations, containers can greatly simplify design and implementation of complex algorithms. 3. Fortran provides just one category of software container - Array. Array containers are designed to optimize random access to a _fixed_ collection of elements, and generally requires all contained elements to be of the same dynamic type. Fortran provides excellent mechanism for declaring and constructing arrays of any type as well as for accessing, storing, and modifying array members (elements) with succinct, tailored syntax: tuples of indices within parens: a(i,j) = x x = a(i,j) a(i,j) = a(i,j)**2 4. Other commonly used container categories include Set, Vector (or List), Map (or Associative Array, Dictionary), Stack, Queue, etc. Unlike Array containers, these others generally are more dynamic - growing as necessary when new elements are added to the container. Here I briefly describe some of the most commonly categories of containers. a. Vectors are somewhat similar to 1-D Arrays in that the provide efficient random accesss. But whereas the size of an Array is fixed at the time of its construction, a Vector can grow as elements are added. (For those unfamiliar with containers, there is no magic here. Internally arrays are reallocated with copies but in a manner that adding elements is of O(1) complexity on average.) b. Set containers provide efficient (O(log n)) mechanisms for searching and inserting distinct elements that can be ordered according to some criterion. Typical implementations use a balanced binary tree data structure to store the elements. c. Maps are similar to Sets, except that the interfaces emphasize that the contained elements are key-value pairs. Typically the keys are either integers or strings, but can generally be any data type that can be ordered. A simple example would be a Map that whose keys are the names of students in a classroom, and the values are their grades. A more relevant example would be a Map that provides efficient acces to a sparse array where the keys are the indices for the non-zero array elemnts. 5. Iterators. Iterators provide a simple/starndard mechanism to efficiently loop through all of the elements in a container. For containers such as Vector and Array, these are relatively simple, but for containers such as Set and Map that are implemented with binary trees, the iterator abstraction is extremely beneficial to the user of the container. Iterators simultaneously hide complexity and encourage safe coding styles. In pseudo-code iterator usage is typically something like: < declare container > C < declare container iterator > I < declare element > e I = begin(C) loop while I is not end(C) e = get(I) < do something with e > next(I) end loop Containers and Fortran: ------------------------ Infrastructure layers for complex Fortran applications often implicitly/unknowingly implement crude versions of containers like vectors and associative arrays. The Vector pattern arises naturally when the total number of instances of some collection cannot be determined via a simple formula or when elements are added sporadically through different drivers. Intead the size is "discovered" through some iterative process. A common implementation in the simplest case is a two-pass algorithm. The passes are nearly identical except that the first pass just accumulates the number of elements. An allocation is then made and the 2nd pass fills in the allocated structure. The need for Map containers arises when we need to find entities through som registry. E.g., if an ocean component needs the surface winds from an atmospheric component, an abstract coupler enables this by allowing both components to use the names 'U-wind' and 'V-wind' to retrieve the associated arrays. This avoids hardcoding indices, which is important when considering independently developed components. Typical Fortran implementations of these patterns are ad-hoc. Even within the same application two different layers may implement the same pattern in rather different ways. The layers also often contain latent bugs that are not exercised in the current component, but prevent their adoption in other components. Obstacles to clean/robust implementation of containers in Fortran: ------------------------------------------------------------------ One large obstacle to development and use of containers in Fortran is the lack of support for generic coding. Developers must either duplicate the logic of a container for each potential type of element, or must resort to non-Fortran means (preprocessors) to have a single implementation that is applicable to all cases. E.g., one may wish to have Vectors of integers, reals, logicals, and/or various derived types, leading to many almost identical implementations and all the associated dangers of long-term maintenance. With Maps, the problem just grows combinatorially as one considers variant types of keys. Another obstacle, albeit less severe, is the inability to overload suitable operators for container accessors. With Fortran arrays, parens can be used to access individual elements on both the left and right hand sides of assignment: x = a(i,j,k) b(i,j,k) = y Ideally, Vector and Map containers would have similar mechanism for accessing and modifying elements. Note that containers generally return pointers to contained elements rather than copies of those elements. State of the art? ----------------- With aggressive and complicated use of CPP/FPP preprocessing it is possible to write robust generic Vector and Map containers in Fortran. I and my colleague Doron Feldman have produced such a package which was recently released as open source: gFTL. Another similar package, FTL, has also been independently released. (They beat us to the cooler name.) These implementations are a nice step forward, and have been of enormous benefit in the development of new powerful software layers. However, these containers are still not as simple to use as one could hope for number of reasons: 1. A separate Fortran class must be constructed for each contained type. In languages with stronger support for containers, this step is performed by the compiler. 2. Even if two separate projects use the same container package (e.g., gFTL), they cannot assume that the other is providing a common container and must redeclare for themselves. One easily ends up with many all-but-identical modules for say IntegerVector. 3. Iterators are handled via Fortrans DO WHILE construct. Users then put a call to iter%next() at the end of the loop. This works well in simple cases, but is dangerous in the presence of CYCLE because the user must remember to invoke iter%next() there as well. Other languages that have an "increment" component to their loop constructs (e.g., C, C++) can handle iterators more robustly by putting the iterator increment into the header of the loop. This concern is probably more general than just iterators and should perhaps be a paper of its own. 4. As mentioned above, the supported accessors are clunky. The inability to use syntax analogous to that of the tuples of indices for arrays is unfortunate. It is a minor annoyance for references on the RHS of an asssignment, but much more so for references that would otherwise be on the LHS. E.g., if we hava a containe for a sparse array and want to modify an element: CALL sparse%set([i,j], x) ! supported but inelegant compared to sparse(i,j) = x ! not implementable in Fortran 2018 5. Containers of pointers are problematic. Fortran lacks a robusts mechanism to order pointers via some COMPARE method. This prevents any standard-conforming implementation of a Set of pointers or a Map that has keys that are pointers. Note that usual trick wrapping the pointers in a derived type does not help with this issue. What could be done in the language? ------------------------------------ This is a multi-faceted use case and there are a number of fronts on which the language could be improved to support containers. Any of the following would be of some value even without the others: 1) Generic programming to facilitate user-implementation of containers that could be used with arbitrary types. (Ideally, community supported packages would then emerge ala C++ STL). 2) Widen the class of overloaded "punctuation" to allow some form of paren/bracket notation for specifying container elements on both LHS and RGS of assignments. 3) Make specific types containers first class language entities, ala existing arrays. 4) Provide an intrinsic COMPARE that arbitrarily (but consistently) orders two pointers with the same declared type.