J3/06-180

Date:    08-May-2006
To:      J3
From:    Bill Long
Subject: Co-arrays to be required

In paper 06-165 the authors contend that the co-array proposal,
UK-001, has significant problems and should be an optional, rather
than mandatory, part of the standard. This paper presents arguments to
the contrary.

Background
----------

1) Fortran has remained an active language for 50 years by repeatedly
   adapting to evolving hardware and software environments, while
   maintaining its primary strengths of ease of use for scientific and
   mathematical programming and efficient program execution.

2) The primary shift occurring now in computer hardware is toward
   increased parallelism. Over half of the processor chips to be
   shipped this year will support multiple independent threads of
   execution. By 2010, the time frame of relevance for initial
   shipments of compilers for Fortran 2008, there will likely be zero
   single-threaded chips made for general purpose computers.

3) Effective programming of parallel systems has proved difficult with
   the most widely available tools, OpenMP and MPI. OpenMP scales
   poorly and MPI is difficult to use and poorly integrated with
   Fortran concepts. DARPA has recognized the fundamental problem of
   poor productivity in programming modern systems and is now
   conducting a competition for the development of a new language. Of
   the three remaining candidates, all are explicitly designed for
   parallel execution environments.

4) Co-arrays as part of the Fortran language provide a dramatic
   improvement in programmer productivity compared to MPI. As an
   adaptation to hardware changes, as well as an easy to use and easy
   to understand model, co-arrays fit squarely into the natural
   evolution of Fortran.  Because of impressive experience with
   existing implementations, both the US departments of Defense and
   Energy are strongly encouraging the standardization and use of
   Fortran with co-arrays as well as UPC, the C extension with
   essentially the same parallel programming model. (See upc.lbl.gov
   for more information on UPC.)

5) With the public release of the feature list for Fortran 2008,
   listing co-arrays as the major feature of the required sublist, J3
   and WG5 have received enthusiastic congratulations from many high
   profile user groups, especially ones for whom Fortran is still the
   principal programming language.  The major complaint about
   co-arrays from the user community is the lack of code portability.
   The central goal of language standardization is to ensure that
   portability, which is why the inclusion of co-arrays as a
   nonoptional part of Fortran is of such high importance.  Because of
   its inclusion in Fortran 2008, some customers are already
   requesting implementations in the 2007-2008 time frame, and (at
   least in the Opteron space) compiler vendors are starting to commit
   to this timetable. If we fail to meet these user's and vendor's
   expectations, the credibility of, and support for, J3 and WG5 will
   be severely damaged. There are two potential results. One is that
   implementation of co-arrays is abandoned by most vendors and the
   users migrate future codes to UPC, which is already more widely
   available.  The other is repeat of "Cray pointers" on a large
   scale, with J3 becoming irrelevant and the de-facto definition of
   Fortran being taken over by a subset of the vendors and US
   government agencies.  Neither outcome is desirable.

Specific issues raised in paper 06-165
--------------------------------------

Lines beginning with >> are taken from 06-165.

>>- the conceptual model is restricted to a small class of parallel
>>   machines

All systems generally fall into one of these four groups:

1) One single-threaded processor and associated memory.  On such a
   system, the implementation if co-arrays becomes trivial because
   there is only one image; parallel execution is a non-issue.  The
   compiler is required only to accept the new syntax and diagnose
   constraint violations. The generated code can ignore the image
   indices since either they all have the value 1 or the program has
   an error. All of the synchronization routines become trivial.
   However, as noted above, such systems will be essentially extinct
   by 2010.

2) Multiple processor threads sharing a common memory space. On such a
   system, the compiler can partition the user's memory such that each
   image and its associated processor thread has one of the
   blocks. Alternatively, each co-array could be separately
   partitioned.  Accessing memory in other images amounts to trivial
   address manipulation followed by ordinary load and store
   operations. Implementation of the synchronization statements will
   be no more difficult than, and can make use of, the similar
   technology used for OpenMP. The conceptual model maps easily to
   this architecture, and the implementation is simple and efficient.

3) Systems with distributed memory, and for which each processor or
   small group of processors has fast access to its local memory, but
   also has access to memory in other parts of the system without
   interrupting the processors for which that memory is local. The
   ability to support remote memory access is provided by a
   high-capability network integrated into the system. The co-array
   model is easily adapted to this class of systems, as proven by
   existing implementations.

4) Systems with distributed memory, but for which the network does not
   support direct remote memory access. This is typical of the
   low-cost cluster systems. Features have been added to the co-array
   proposal specifically to address performance issues on such
   systems.


In summary, all systems that support efficient parallel processing
work well with co-arrays, and users of those systems with poor
parallel characteristics still benefit from ease of use. Furthermore,
it is reasonable to assume that system performance will continue to
improve in the future.

>>- the I/O model is overly complicated

The proposal includes only the simplest extensions for parallel
I/O. Suggestions for expansion in this area have been
resisted. Noncolliding read and write operations to records of a
direct access file connected to multiple images are allowed. Also
allowed is writing to a sequential access file connected to multiple
images.  Both of these are in direct response to the expressed needs
of users. Users frequently assume that writing to a sequential file
from the processes in a parallel job coded with Fortran and MPI should
work in the "obvious" way. Whether this works is highly nonportable.
Parallel I/O very much needs standardization, at least for the
simple cases included in the co-array proposal.

>>- the synchronization model is overly complicated and too low-level

For many applications, sync_all, sync_images, and sync_memory are
sufficient.  The notify and query statements, while simulatable with
user written code, were included for user convenience.  The sync_team
statement was included to allow for special optimization of that
particular case.  The goal is to provide both the simplest
synchronization, sync_all, for simple situations, while also providing
more powerful methods for more complicated situations.  None of the
included synchronization capabilities is new or untested technology.

>>- significant further technical work on the proposal is still needed
   so there is a serious risk of delaying the whole revision schedule

Significant further technical work has been completed between meetings
175 and 176. It is our belief that the feared "serious risk" no longer
exists.

>>- the co-array model is presented as the normal execution model with
   single processors as the exception; this is the wrong emphasis given
   that much of the new language is irrelevant to most users

There is only one execution model presented. There is no "abnormal"
one available. Whether the program is executing on one image or
multiple images is irrelevant to most of the standard. Where it does
matter, is there any other sensible presentation for a unified
language?  We do not, for example, call out special cases of arrays
with only one element as "normal" (and can be handled as if they were
scalars), and then distinguish the "abnormal" case of larger arrays
that need different handling. While the concept of parallel execution
might seem "irrelevant to most users" today, there is very little
reason to expect that will be the case by 2010.

>>- the proposals as currently formulated are large and experimental and
   should be evaluated in practice before being standardized

The proposal differs only minimally from implementations that have
spanned two different systems over a period of 10 years.  The reason
there is such high user enthusiasm for the standardization of
co-arrays is because of that extensive and positive
experience. Co-arrays are not even remotely experimental.

>>We believe the facts that co-arrays are of minority interest and that
>>these proposals in detail are untested in practice point strongly to
>>their being an optional, rather than a mandatory, part of the Fortran
>>standard.

The "minority interest" will grow rapidly to a majority with available
implementations, and even at today's level exceeds the interest in
some other features incorporated into Fortran. Co-arrays have been
tested extensively in practice and proven valuable.  In that both of
the premises appear to be false, the conclusion should be rejected.

>>We believe that the best way to do this would be for
>>co-arrays to appear as a Technical Report (without necessarily
>>guaranteeing inclusion in the next standard).

Failure to mandate co-arrays defeats the central point and value of
the standard. It would be a recipe for failure of the feature, and the
irrelevance of the whole language in the long run.

>>This would allow those
>>vendors whose primary markets are for single processors to avoid the
>>burden of implementing features of little interest to their customers.

If vendors find a continuing market for single processor compilers,
the burden to implement co-arrays in such a compiler is small.

>>We urge J3 to proceed on these lines and to so recommend to WG5.

On the contrary, finishing the work remaining and completing Fortran
2008 on schedule is the right course of action.