J3/97-244

To:      J3
From:    William Clodius
Subject: Numeric Type Extensions and Fortran
Date:    October 14, 1997

  I. Introduction

Recent proposals to add interval arithmetic to Fortran have revealed some
weaknesses. This paper does not attempt to address the specific merits of the
interval arithmetic proposal, but instead addresses those weaknesses revealed
by this proposal that can impact other potential extensions to the language,
with an emphasis on its impact on numerical extensions. In particular, it
tries to summarize different potential "fixes" for

     1. Fortran's limited character set

 2. The restrictions on the precedences of user defined operators

 3. Restrictions on initialization expressions

 4. The automatic conversion of literal values to intrinsic values

 5. The inability to express certain properties of such extensions

 II. Fortran's limited character set

In examining some of the following potential fixes I have been struck by how
much awkwardness is introduced by the limited size of Fortran's current
character set. Although in some ways this limitation has been a source of
strength for the language, by ensuring high portability, there is genuine
cause for concern at the costs of this restriction for a language intended
for numerical programming. In particular it is difficult to extend Fortran in
some aspects without also increasing the number of bracketing or operator
constructs, but increasing the number of such constructs without also
increasing the character set decreases program legibility by overloading the
meaning of symbols, e.g., the many overloadings of the character '*'.

I am aware of three potential reasons for restricting the character sets used
in a language: first, the reduction of the number of characters needed to
understand the language; second, the unavailability of the characters in the
character sets commonly used on a given processor; and, third, the difficulty
in accessing characters nominally available on a given processor through the
keyboards provided with the processor. The first argument is largely
spurious, as long as the characters are readily distinguished visually humans
can deal with character sets several orders of magnitude larger than that
used in Fortran. The second argument was valid through most of the 80's, but
I believe all processors now support 8 or 16 bit character sets and provide
means of mapping the character sets to representations used by the
programmer, i.e., the character set used in an application need not be the
same as the default character set of the system. The third argument
undoubtably retains some validity. Although programmers can usually
circumvent the keyboard problems by remapping the keys or using special
non-national keyboards, there still remain times where such options are not
available.

There are several ways the standard can address this limitation:

     1. Maintain the status quo. This will retain high portability, but will
make it difficult to extend the language in some ways without simultaneously
decreasing the legibility of the language.

     2. Adopt a purely abstract syntax and leave the mapping of the syntax to
a processor's character set up to the processor. This was used in Algol 60
and was a minor, but definite, hindrance to portability. This might, however,
facilitate availability of the language on processors which by default do not
use a Roman based alphabet.

     3. Adopt a specific character set for the standard's representation,
possibly requiring a processor to provide a converter to that character set.
This is used by Java.

     4. Increase the number of characters used by the standard without
committing to a specific character set, possibly requiring a processor to
provide a converter to a character set with those characters. Ideally this
would involve a detailed study of all the widely used character sets in order
to avoid gratuitous conflicts. In practice, language definitions typically
use the intersection of ASCII with the most widely used variants of EBCDIC.
This results in a character set that extends the current Fortran character
set with the at-sign, @, backslash, \, circumflex, ^, sharp, #, vertical bar,
|, tilde, ~, the opening and closing brackets, [ and ], and the opening and
closing braces, { and }. There are a few characters that are slightly less
widely available that might also be considered, i.e., the upside down
question mark, ?, and the degree, x.

     5. Provide multiple standard defined representations for the same
syntactic entity, some representable (awkwardly) in the current set, others
requiring additional characters. This might result in such mappings as '(/',
'/)', '(*', '*)', '(<', '>)', '<=', '>=', '/=', etc., to '[', ']', '{', '}',
left angle bracket, right angle bracket, 'x', 'x', 'x', etc. Note that such
extensions need not be limited to widely used symbols, i.e., left and right
angle brackets are not available on most standard eight bit character sets.
This duplication complicates the parsing of the language, but ensures
portability. However, care would need to be taken to avoid infelicities
similar to C's triglyphs and diglyphs.

     6. Keep most symbols well defined, but allow a few symbols to be
processor defined similar to what is done with '$' in the current standard.

I would like to note that my preference would be for a combination of 5 and
6:, i.e., extend the character set with a few more widely available
characters, i.e, [, ], {, }, |, ~, \, ^, and #, but provide alternative
mappings to less legible combinations of current Fortran characters. Possible
symbol names and mappings both to widely, but not universally, available
characters and to combinations of more widely available characters are:

Symbol             Nominal      Mapped
Name               Character    Combination(s)

Left bracket       [            (/ or (=
Right bracket      ]            /) or =)
Left brace         {            (*
Right brace        }            *)
Approximate        ~            -= or %= or ?=
At                 @            */ or /* or () or (.) or (/)
Bar                |            +/ or //
Circumflex         ^            +* or ** or *%
Degree             x            $/ or /$ or */ or /*
Inverted Question  ?            ?? or -? or /? or ?=
Sharp              #            %%
Slash              \            -/ or //

Note in some cases I am suggesting that the nominal character could be used
to substitute for existing character combinations.

III. The restrictions on the precedences of user defined operators

The current restrictions on the precedences of user defined operators is
awkward for the design, implementation, and use of sophisticated types. Such
types often benefit from more operators than are provided among Fortran's
intrinsics, but the precedences of the user defined operators, particularly
the binary operators, is awkward. Having the binary user defined operators
have the lowest possible precedence was a mistake in terms of user
convenience. Most additional operators are typically comparison operators
which should be able to have the same precedence as the intrinsic comparison
operators. The few non-comparison operators may yield results that are
intended for use in comparisons and should be able to have precedences higher
than that of the comparison operators. The alternatives as I view them are:

     1. Maintain the status quo

The current precedences are awkward, but their effect is more at the irritant
than the major problem level. They require more parenthesization than the
ideal, but that is their primary effect.

     2. Change the default precedences for the user defined binary operators

Because the current binary operators have the lowest possible precedences, I
believe that extensive parenthesization is need for them to be used properly
in all but the simplest expressions. I therefore suspect that that a change
in their precedence level would not change the validity of current user code,
including simple expressions (with the possible exception of user defined
operators that can have logical arguments), as long as that precedence level
remained no higher than the comparison operators (for operators that return
logical values) or remained lower than the '//' operator (for operators that
return non-logical values). If the precedences can be redefined in this
manner, then it might also be useful to distinguish between user defined
binary operators that return logical values and those that return non-logical
values, or whether another approach would solve this problem more elegantly.
Still this is perhaps the most dangerous of the alternatives in terms of its
effect on heritage code.

     3. Increase the number of intrinsic operators.

The problem with the current precedence levels could also be (partly)
addressed by increasing the number of operators explicitly given precedences
in the language beyond those used by the current intrinsic types. This
requires a syntax not used for user defined types in order to avoid name
conflicts. For comparison operations the most natural implementation would be
additional combinations of the current comparison operators, e.g,
combinations such as '<<', '>>', '<>', '><', '<=>', etc. In addition, the
symbol space of the Fortran language could be increased to introduce new
symbols for operators, e.g., '@', '~', '^', etc. Care would have to be taken
to verify that such changes do not conflict with compiler extensions, or
widely used preprocessors and that they can be implemented in available
character sets.

     4. Allow alternative syntaxes for user defined operator

Another approach would be to allow additional syntaxes for user defined
operators. The new syntaxes could have different precedences from the current
precedences. The most obvious such syntax, for relational operators, would be
something like ?operator-name? or ?operator-name?. However, this might
conflict with common extensions.

     5. Allow user defined precedences for user defined operators

This is commonly done for functional, but not imperative, languages. If this
is provided in Fortran, the absence of a precedence declaration should result
in the operator receiving its current default precedence. The availability of
user defined precedences by itself potentially results in the same operator
name being associated with more than one precedence, which is confusing. This
source of confusion can be greatly reduced by providing operator renaming on
USE and making it illegal to have the same name available within scope at
different precedences. It may also be useful to further constrain user
defined precedences to enforce semantics similar to the current intrinsic
operators, e.g., that the precedence of the intrinsic relational operators
may only be associated with functions that return logical results, or that
the precedence of the intrinsic relational operators must be consistent with
its number of operands. If user defined precedences are allowed possible
syntaxes include

INTERFACE OPERATOR (defined-binary-op [, PRECEDENCE=binding] )
...
END INTERFACE OPERATOR

or

PRECEDENCE defined-operator-name=binding

where binding is either an intrinsic operator or an integer within a set
range, 0 to 10 or 0 to 100 appear to be common in functional languages. If an
integer definition of binding is chosen then an explicit association of
integer values with the intrinsic operators must be defined.

 IV. The automatic conversion of literal values to intrinsic values

In Fortran provides only a few means of defining literal-constants, all of
which are interpreted directly in terms of an intrinsic type before
conversion to the desired derived type value. As a result, the additional
precision, range, or other attribute of a type can be lost. In the case of
interval arithmetic the desired additional information is that the literal
might not be exactly representable by an intrinsic value and is best
represented by bounding values. Although the interval arithmetic proposal has
made this problem more visible, it also potentially impacts the addition of
extended precision arithmetic, large integers, rational arithmetic,
alternative bases, etc. to Fortran. The available alternatives appear to be:

 1. Maintain the status quo

Work arounds for the above problem exist through the use of CHARACTER strings
or arrays and procedures that convert such strings to appropriate values,
but, although these work arounds retain the essential value information, they
prevent the static type checking of literals, these work arounds will usually
entail run time overhead, and the syntax is unnecessarily verbose. Another
limitation, that they cannot be currently used to initialize entities with
the PARAMETER attribute, will be addressed separately.

Example:

   TYPE BINARY
      PRIVATE
      INTEGER :: A
   END TYPE BINARY

   INTERFACE ASSIGNMENT (=)
      SUBROUTINE CHAR_TO_BINARY( B, C)
         TYPE(BINARY), INTENT(OUT) :: B
         CHARACTER(*), INTENT(IN) :: C
      END SUBROUTINE CHAR_TO_BINARY
   END INTERFACE

   ...

Any of the other methods discussed below also require user defined procedures
to convert the literal representation to the internal representation.

 2. Allow users to define their own character types

This extension allows limited compile time checking of values. It appears to
be more usable for "integer" types, but not for data with more complex
interpretations. A possible syntax would be

   TYPE, CHARACTER :: B = '01'

which defines a character set used to represent unsigned binary values.

Example:

   TYPE BINARY
      PRIVATE
      INTEGER :: A
   END TYPE BINARY

   INTERFACE ASSIGNMENT (=)
      SUBROUTINE CHAR_TO_BINARY( B, C)
         TYPE(BINARY), INTENT(OUT) :: B
         CHARACTER(*, KIND=B), INTENT(IN) :: C
      END SUBROUTINE CHAR_TO_BINARY
   END INTERFACE

      ...

Then for

   TYPE(BINARY) :: BIN

   BIN = '1001' ! Valid presumably defines a binary integer with
                ! decimal value 9

   BIN = '1002' ! Is invalid and can be determined at compile time

 3. Allow users to define their own literal constant syntax

More complicated types than integers are not well described by the simple
single character string types defined above. Instead they require multiple
components, and associated separators, where the components and separators
are defined by distinctive character sets. This could be defined by a syntax
such as

   ! A Hexadecimal integer constant is an optional sign
   ! followed by one or more characters
   REPRESENTATION HEX = 0:1'+-' // 1:*'0123456789ABCDabcd'

   ! An interval constant is a floating point literal constant
   ! or a pair of such constants separated by commas and
   ! enclosed in a special bracketing pair

   REPRESENTATION INTERVAL = 0:1'+-' // &
                     1:*'0123456789' // &
                     0:1('.' // 0:*'0123456789') // &
                     0:1('dDeE' // 0:*'0123456789') .OR. &
                     1'(' // 1'*' // INTERVAL // 1',' // &
                                     INTERVAL // 1'*' // 1')'

There are four possible models for identifying the constant's type: Fortran
character literal constants, e.g., SEVENTY_FIVE = HEX_'10B'; Fortran binary,
octal, or hexadecimal constants, e.g., SEVENTY_FIVE = HEX'10B'; Fortran
functions, e.g., SEVENTY_FIVE = HEX(10B), or Fortran real and integer literal
constants, e.g., SEVENTY_FIVE = 10B_HEX.

Note, however, that some versions of the literal constant representations
could cause confusion of literals with entity names. This can be avoided by
either choosing only a syntax that uses delimiters such as ' or ", or by the
constraint that such literal constants cannot start with a letter.

  V. Restrictions on parameter initialization expressions

The avoidance of the automatic conversion of literal values to intrinsic
values is significantly enhanced if the strings or user defined literal
values can be converted at compile time to an internal representation, i.e.,
they can be used as initialization expressions. This complicates some aspects
of the language implementation, but appears to be required for some usages of
C++ templates so there is a precedence for this in other languages. The
alternatives are:

     1. Maintain the status quo.

     2. Allow restricted forms of user defined procedures in initialization
expressions

     I suspect that the initial need for the conversion of "strings" to the
values of user defined types in initialization expressions can be met by a
restricting the use of user defined procedures to those that satisfy selected
constraints. The most obvious constraints are:

     A. The procedure should only invoke other procedures that can be used in
initialization expressions.

     B. Those language defined procedures that can be used in initialization
expressions remain subject to some restrictions.

     C. The user defined procedures must be pure or elemental.

     D. Those procedures that can be used in initialization expressions must
be identified as such by a keyword, i.e., INITIALIZATION.

     3. Allow unrestricted use of user defined procedures in initialization
expressions. This avoids an irregularity in the language, but potentially
greatly complicates the compilation stage.

 VI. The inability to express certain properties of such extensions

In order to properly define an arithmetic derived type, it can be useful to
express certain properties such as commutivity, associativity,
distributivity, and ideality (the complete avoidance of dependence on side
effects). These mostly provide information to readers, but may also be of
benefit in terms of compiler verification, optimization, or a reduction in
the amount of coding. The following are attributes that might be useful

     1. IDEAL: A procedure attribute that is a special case of the PURE
attribute. This attribute would indicate a procedure whose effects depend
only on its arguments, or on global entities with the PARAMETER attribute.
This is in contrast with PURE procedures which can depend on arbitrary global
entities, although they may not modify such entities. Procedures with this
attribute are subject to optimizations beyond those applicable to PURE
procedures.

     2. COMMUTABLE: A procedure attribute whose primary intent is to reduce
the amount of coding. It would indicate that the result of the procedure does
not depend on the order of the arguments. It would allow a user to define a
procedure with the arguments of different types in one order so that a
compiler could map arguments supplied in any possible order to a single
procedure. This is most useful for binary operators. It may cause type safety
problems in combination with some forms of object orientation.

     3. DISTRIBUTABLE: A relation between operators. While useful in standard
algebra, I suspect that this would primarily provide information to the
readers and would not be used by compilers. This might also be a source of
difficult to understand errors as in some cases the implementation might not
fully maintain this property although the code would do so for infinite
precision arithmetic.

     4. ASSOCIATIVE: This allows some optimizations for highly efficient
procedures, i.e., those intrinsics provided directly by a processor, but this
optimization would not normally be used for user defined procedures.. This
might also be a source of difficult to understand errors as in some cases the
implementation might not fully maintain this property although the code would
do so for infinite precision arithmetic.

     5. INITIALIZATION: Identifies a procedure that can be used in an
initialization expression.