J3/97-244

To:       J3
From:     William Clodius
Subject:  Numeric Type Extensions and Fortran
Date:     October 14, 1997

I. Introduction

Recent proposals to add interval arithmetic to Fortran have revealed
some weaknesses in the language. This paper does not attempt to
address the specific merits of the interval arithmetic proposal, but
instead addresses those weaknesses revealed by this proposal that can
impact other potential extensions to the language, with an emphasis on
their impact on numerical extensions. In particular, it tries to
summarize different potential "fixes" for

1. Fortran's limited character set
2. The restrictions on the precedences of user defined operators
3. Restrictions on initialization expressions
4. The automatic conversion of literal values to intrinsic values
5. The inability to express certain properties of such extensions

II. Fortran's limited character set

In examining some of the following potential fixes I have been struck
by how much awkwardness is introduced by the limited size of Fortran's
current character set. Although in some ways this limitation has been
a source of strength for the language, by ensuring high portability,
there is genuine cause for concern at the costs of this restriction
for a language intended for numerical programming. In particular, it
is difficult to extend Fortran in some aspects without also increasing
the number of bracketing or operator constructs, but increasing the
number of such constructs without also increasing the character set
decreases program legibility by overloading the meaning of symbols,
e.g., the many overloadings of the character '*'.

I am aware of three potential reasons for restricting the character
sets used in a language: first, the reduction of the number of
characters needed to understand the language; second, the
unavailability of the characters in the character sets commonly used
on a given processor; and, third, the difficulty in accessing
characters nominally available on a given processor through the
keyboards provided with the processor. The first argument is largely
spurious: as long as the characters are readily distinguished
visually, humans can deal with character sets several orders of
magnitude larger than that used in Fortran. The second argument was
valid through most of the 1980s, but I believe all processors now
support 8 or 16 bit character sets and provide means of mapping the
character sets to representations used by the programmer, i.e., the
character set used in an application need not be the same as the
default character set of the system. The third argument undoubtedly
retains some validity. Although programmers can usually circumvent the
keyboard problems by remapping the keys or using special non-national
keyboards, there still remain times where such options are not
available.

There are several ways the standard can address this limitation:

1. Maintain the status quo. This will retain high portability, but
will make it difficult to extend the language in some ways without
simultaneously decreasing the legibility of the language.

2. Adopt a purely abstract syntax and leave the mapping of the syntax
to a processor's character set up to the processor. This was used in
Algol 60 and was a minor, but definite, hindrance to portability. This
might, however, facilitate availability of the language on processors
which by default do not use a Roman-based alphabet.

3. Adopt a specific character set for the standard's representation,
possibly requiring a processor to provide a converter to that
character set. This is used by Java.

4. Increase the number of characters used by the standard without
committing to a specific character set, possibly requiring a processor
to provide a converter to a character set with those characters.
Ideally this would involve a detailed study of all the widely used
character sets in order to avoid gratuitous conflicts. In practice,
language definitions typically use the intersection of ASCII with the
most widely used variants of EBCDIC. This results in a character set
that extends the current Fortran character set with the at-sign, @,
backslash, \, circumflex, ^, sharp, #, vertical bar, |, tilde, ~, the
opening and closing brackets, [ and ], and the opening and closing
braces, { and }. There are a few characters that are slightly less
widely available that might also be considered, i.e., the upside down
question mark, ¿, and the degree, °.

5. Provide multiple standard defined representations for the same
syntactic entity, some representable (awkwardly) in the current set,
others requiring additional characters. This might result in such
mappings as '(/', '/)', '(*', '*)', '(<', '>)', '<=', '>=', '/=',
etc., to '[', ']', '{', '}', left angle bracket, right angle bracket,
'≤', '≥', '≠', etc. Note that such extensions need not be limited to
widely used symbols, e.g., left and right angle brackets are not
available on most standard eight bit character sets. This duplication
complicates the parsing of the language, but ensures portability.
However, care would need to be taken to avoid infelicities similar to
C's trigraphs and digraphs.

6. Keep most symbols well defined, but allow a few symbols to be
processor defined, similar to what is done with '$' in the current
standard.

I would like to note that my preference would be for a combination of
5 and 6, i.e., extend the character set with a few more widely
available characters, i.e., [, ], {, }, |, ~, \, ^, and #, but provide
alternative mappings to less legible combinations of current Fortran
characters.

Possible symbol names and mappings, both to widely, but not
universally, available characters and to combinations of more widely
available characters, are:

   Symbol Name          Nominal Character   Mapped Combination(s)
   Left bracket         [                   (/ or (=
   Right bracket        ]                   /) or =)
   Left brace           {                   (*
   Right brace          }                   *)
   Approximate          ~                   -= or %= or ?=
   At                   @                   */ or /* or () or (.) or (/)
   Bar                  |                   +/ or //
   Circumflex           ^                   +* or ** or *%
   Degree               °                   $/ or /$ or */ or /*
   Inverted question    ¿                   ?? or -? or /? or ?=
   Sharp                #                   %%
   Backslash            \                   -/ or //

Note that in some cases I am suggesting that the nominal character
could be used to substitute for existing character combinations.

III. The restrictions on the precedences of user defined operators

The current restrictions on the precedences of user defined operators
are awkward for the design, implementation, and use of sophisticated
types. Such types often benefit from more operators than are provided
among Fortran's intrinsics, but the precedences of the user defined
operators, particularly the binary operators, are awkward. Having the
binary user defined operators have the lowest possible precedence was
a mistake in terms of user convenience. Most additional operators are
typically comparison operators, which should be able to have the same
precedence as the intrinsic comparison operators. The few
non-comparison operators may yield results that are intended for use
in comparisons and should be able to have precedences higher than that
of the comparison operators. The alternatives as I view them are:

1. Maintain the status quo

The current precedences are awkward, but their effect is more at the
level of an irritant than of a major problem. They require more
parenthesization than the ideal, but that is their primary effect.

2. Change the default precedences for the user defined binary
operators

Because the current binary operators have the lowest possible
precedences, I believe that extensive parenthesization is needed for
them to be used properly in all but the simplest expressions.
I therefore suspect that a change in their precedence level would not
change the validity of current user code, including simple expressions
(with the possible exception of user defined operators that can have
logical arguments), as long as that precedence level remained no
higher than the comparison operators (for operators that return
logical values) or remained lower than the '//' operator (for
operators that return non-logical values). If the precedences can be
redefined in this manner, then it might also be useful to distinguish
between user defined binary operators that return logical values and
those that return non-logical values, or to consider whether another
approach would solve this problem more elegantly. Still, this is
perhaps the most dangerous of the alternatives in terms of its effect
on heritage code.

3. Increase the number of intrinsic operators

The problem with the current precedence levels could also be (partly)
addressed by increasing the number of operators explicitly given
precedences in the language beyond those used by the current intrinsic
types. This requires a syntax not used for user defined operators in
order to avoid name conflicts. For comparison operations the most
natural implementation would be additional combinations of the current
comparison operators, e.g., combinations such as '<<', '>>', '<>',
'><', '<=>', etc. In addition, the symbol space of the Fortran
language could be increased to introduce new symbols for operators,
e.g., '@', '~', '^', etc. Care would have to be taken to verify that
such changes do not conflict with compiler extensions or widely used
preprocessors, and that they can be implemented in available character
sets.

4. Allow alternative syntaxes for user defined operators

Another approach would be to allow additional syntaxes for user
defined operators. The new syntaxes could have different precedences
from the current precedences.
The most obvious such syntax, for relational operators, would be
something like ?operator-name? or ?operator-name?. However, this might
conflict with common extensions.

5. Allow user defined precedences for user defined operators

This is commonly done in functional, but not imperative, languages. If
this is provided in Fortran, the absence of a precedence declaration
should result in the operator receiving its current default
precedence. The availability of user defined precedences by itself
potentially results in the same operator name being associated with
more than one precedence, which is confusing. This source of confusion
can be greatly reduced by providing operator renaming on USE and
making it illegal to have the same name available within scope at
different precedences. It may also be useful to further constrain user
defined precedences to enforce semantics similar to the current
intrinsic operators, e.g., that the precedence of the intrinsic
relational operators may only be associated with functions that return
logical results, or that the precedence of the intrinsic relational
operators must be consistent with its number of operands. If user
defined precedences are allowed, possible syntaxes include

   INTERFACE OPERATOR (defined-binary-op [, PRECEDENCE=binding] )
      ...
   END INTERFACE OPERATOR

or

   PRECEDENCE defined-operator-name=binding

where binding is either an intrinsic operator or an integer within a
set range; ranges of 0 to 10 or 0 to 100 appear to be common in
functional languages. If an integer definition of binding is chosen,
then an explicit association of integer values with the intrinsic
operators must be defined.

IV. The automatic conversion of literal values to intrinsic values

Fortran provides only a few means of defining literal-constants, all
of which are interpreted directly in terms of an intrinsic type before
conversion to the desired derived type value. As a result, the
additional precision, range, or other attribute of a type can be lost.
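As a rough illustration of this kind of loss, the sketch below uses
only intrinsic kinds rather than a derived type. The literal 0.1 is
interpreted as a default REAL before it is converted to the wider
kind, so the digits beyond default precision are already gone when the
assignment takes place. The names DP, FROM_DEFAULT, and FROM_DOUBLE
are chosen for this sketch only and do not come from any proposal.

```fortran
PROGRAM LITERAL_CONVERSION
   IMPLICIT NONE
   ! Assumption for this sketch: DP names the double precision kind.
   INTEGER, PARAMETER :: DP = KIND(1.0D0)
   REAL(DP) :: FROM_DEFAULT, FROM_DOUBLE

   FROM_DEFAULT = 0.1      ! default REAL literal, widened afterwards
   FROM_DOUBLE  = 0.1_DP   ! literal interpreted directly in kind DP

   PRINT *, 'from default literal: ', FROM_DEFAULT
   PRINT *, 'from double literal:  ', FROM_DOUBLE
   ! On common binary floating point processors the two values
   ! differ, so this comparison is typically false.
   PRINT *, 'equal? ', FROM_DEFAULT == FROM_DOUBLE
END PROGRAM LITERAL_CONVERSION
```

The same effect applies to a derived type whose defined assignment
accepts a REAL argument: the literal has already been rounded to the
intrinsic kind before the user's conversion procedure sees it.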
In the case of interval arithmetic the desired additional information
is that the literal might not be exactly representable by an intrinsic
value and is best represented by bounding values. Although the
interval arithmetic proposal has made this problem more visible, it
also potentially impacts the addition of extended precision
arithmetic, large integers, rational arithmetic, alternative bases,
etc. to Fortran. The available alternatives appear to be:

1. Maintain the status quo

Workarounds for the above problem exist through the use of CHARACTER
strings or arrays and procedures that convert such strings to
appropriate values. Although these workarounds retain the essential
value information, they prevent the static type checking of literals,
they will usually entail run time overhead, and their syntax is
unnecessarily verbose. Another limitation, that they cannot currently
be used to initialize entities with the PARAMETER attribute, will be
addressed separately. Example:

   TYPE BINARY
      PRIVATE
      INTEGER :: A
   END TYPE BINARY

   INTERFACE ASSIGNMENT (=)
      SUBROUTINE CHAR_TO_BINARY( B, C)
         TYPE(BINARY), INTENT(OUT) :: B
         CHARACTER(*), INTENT(IN) :: C
      END SUBROUTINE CHAR_TO_BINARY
   END INTERFACE
   ...

All of the other methods discussed below also require user defined
procedures to convert the literal representation to the internal
representation.

2. Allow users to define their own character types

This extension allows limited compile time checking of values. It
appears to be usable for "integer" types, but not for data with more
complex interpretations. A possible syntax would be

   TYPE, CHARACTER :: B = '01'

which defines a character set used to represent unsigned binary
values. Example:

   TYPE BINARY
      PRIVATE
      INTEGER :: A
   END TYPE BINARY

   INTERFACE ASSIGNMENT (=)
      SUBROUTINE CHAR_TO_BINARY( B, C)
         TYPE(BINARY), INTENT(OUT) :: B
         CHARACTER(*, KIND=B), INTENT(IN) :: C
      END SUBROUTINE CHAR_TO_BINARY
   END INTERFACE
   ...

Then for

   TYPE(BINARY) :: BIN

   BIN = '1001' ! Valid; presumably defines a binary integer with
                ! decimal value 9
   BIN = '1002' ! Is invalid and can be determined at compile time

3. Allow users to define their own literal constant syntax

More complicated types than integers are not well described by the
simple single character string types defined above. Instead they
require multiple components, and associated separators, where the
components and separators are defined by distinctive character sets.
This could be defined by a syntax such as

   ! A Hexadecimal integer constant is an optional sign
   ! followed by one or more characters
   REPRESENTATION HEX = 0:1'+-' // 1:*'0123456789ABCDEFabcdef'

   ! An interval constant is a floating point literal constant
   ! or a pair of such constants separated by a comma and
   ! enclosed in a special bracketing pair
   REPRESENTATION INTERVAL = 0:1'+-' // &
        1:*'0123456789' // &
        0:1('.' // 0:*'0123456789') // &
        0:1('dDeE' // 0:*'0123456789') .OR. &
        1'(' // 1'*' // INTERVAL // 1',' // &
        INTERVAL // 1'*' // 1')'

There are four possible models for identifying the constant's type:
Fortran character literal constants, e.g., SEVENTY_FIVE = HEX_'4B';
Fortran binary, octal, or hexadecimal constants, e.g.,
SEVENTY_FIVE = HEX'4B'; Fortran functions, e.g.,
SEVENTY_FIVE = HEX(4B); or Fortran real and integer literal constants,
e.g., SEVENTY_FIVE = 4B_HEX. Note, however, that some versions of the
literal constant representations could cause confusion of literals
with entity names. This can be avoided either by choosing only a
syntax that uses delimiters such as ' or ", or by the constraint that
such literal constants cannot start with a letter.

V. Restrictions on parameter initialization expressions

The avoidance of the automatic conversion of literal values to
intrinsic values is significantly enhanced if the strings or user
defined literal values can be converted at compile time to an internal
representation, i.e., they can be used as initialization expressions.
This complicates some aspects of the language implementation, but
appears to be required for some usages of C++ templates, so there is a
precedent for this in other languages. The alternatives are:

1. Maintain the status quo.

2. Allow restricted forms of user defined procedures in initialization
expressions

I suspect that the initial need for the conversion of "strings" to the
values of user defined types in initialization expressions can be met
by restricting the use of user defined procedures to those that
satisfy selected constraints. The most obvious constraints are:

   A. The procedure should only invoke other procedures that can be
      used in initialization expressions.
   B. Those language defined procedures that can be used in
      initialization expressions remain subject to some restrictions.
   C. The user defined procedures must be pure or elemental.
   D. Those procedures that can be used in initialization expressions
      must be identified as such by a keyword, i.e., INITIALIZATION.

3. Allow unrestricted use of user defined procedures in initialization
expressions

This avoids an irregularity in the language, but potentially greatly
complicates the compilation stage.

VI. The inability to express certain properties of such extensions

In order to properly define an arithmetic derived type, it can be
useful to express certain properties such as commutativity,
associativity, distributivity, and ideality (the complete avoidance of
dependence on side effects). These mostly provide information to
readers, but may also be of benefit in terms of compiler verification,
optimization, or a reduction in the amount of coding. The following
are attributes that might be useful:

1. IDEAL: A procedure attribute that is a special case of the PURE
attribute. This attribute would indicate a procedure whose effects
depend only on its arguments, or on global entities with the PARAMETER
attribute.
This is in contrast with PURE procedures, which can depend on
arbitrary global entities, although they may not modify such entities.
Procedures with this attribute are subject to optimizations beyond
those applicable to PURE procedures.

2. COMMUTABLE: A procedure attribute whose primary intent is to reduce
the amount of coding. It would indicate that the result of the
procedure does not depend on the order of the arguments. It would
allow a user to define a procedure with the arguments of different
types in one order so that a compiler could map arguments supplied in
any possible order to a single procedure. This is most useful for
binary operators. It may cause type safety problems in combination
with some forms of object orientation.

3. DISTRIBUTABLE: A relation between operators. While useful in
standard algebra, I suspect that this would primarily provide
information to the readers and would not be used by compilers. This
might also be a source of difficult-to-understand errors, as in some
cases the implementation might not fully maintain this property
although the code would do so for infinite precision arithmetic.

4. ASSOCIATIVE: This allows some optimizations for highly efficient
procedures, i.e., those intrinsics provided directly by a processor,
but this optimization would not normally be used for user defined
procedures. This might also be a source of difficult-to-understand
errors, as in some cases the implementation might not fully maintain
this property although the code would do so for infinite precision
arithmetic.

5. INITIALIZATION: Identifies a procedure that can be used in an
initialization expression.
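
To illustrate the coding burden that the COMMUTABLE attribute in item
2 is meant to remove, the following sketch shows what current Fortran
requires for a mixed-type operator: one specific procedure for each
operand order. The type SCALED, the module SCALED_MOD, and the
procedure names are hypothetical and serve only this example; no
COMMUTABLE syntax is assumed.

```fortran
MODULE SCALED_MOD
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: SCALED, OPERATOR(*)

   ! Hypothetical arithmetic type used only for this illustration.
   TYPE SCALED
      REAL :: V
   END TYPE SCALED

   ! Both operand orders must be listed explicitly.
   INTERFACE OPERATOR(*)
      MODULE PROCEDURE SCALED_TIMES_INT, INT_TIMES_SCALED
   END INTERFACE

CONTAINS

   FUNCTION SCALED_TIMES_INT(S, N) RESULT(R)
      TYPE(SCALED), INTENT(IN) :: S
      INTEGER, INTENT(IN) :: N
      TYPE(SCALED) :: R
      R%V = S%V * REAL(N)
   END FUNCTION SCALED_TIMES_INT

   ! Same operation with the operands swapped; it only forwards
   ! to the procedure above.
   FUNCTION INT_TIMES_SCALED(N, S) RESULT(R)
      INTEGER, INTENT(IN) :: N
      TYPE(SCALED), INTENT(IN) :: S
      TYPE(SCALED) :: R
      R = SCALED_TIMES_INT(S, N)
   END FUNCTION INT_TIMES_SCALED

END MODULE SCALED_MOD

PROGRAM COMMUTABLE_DEMO
   USE SCALED_MOD
   IMPLICIT NONE
   TYPE(SCALED) :: X, Y, Z

   X = SCALED(1.5)
   Y = X * 2      ! resolves to SCALED_TIMES_INT
   Z = 2 * X      ! resolves to INT_TIMES_SCALED
   PRINT *, Y%V, Z%V   ! both components are 3.0
END PROGRAM COMMUTABLE_DEMO
```

With an attribute along the lines of COMMUTABLE, the forwarding
function INT_TIMES_SCALED would presumably become unnecessary, since
the processor could map either operand order to the single procedure.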