                                                            J3/15-186r1

To: J3
From: Nick Maclaren
Subject: Ordering, Consistency, Atomics, and Progress
Date: 2015 July 27
References: 13-269, 15-139, 15-178, 15-185

This revision corrects one error in two examples (Collective_One and
Collective_Two) and adds some clarifying text.  Otherwise, it is
identical to the first version.


Fundamental Question
--------------------

So far, we have not had a discussion about what the intended data
consistency model is, and I know that different people on WG5 are
making different assumptions.  This paper does not propose anything,
but is an attempt to reach some consensus on the intent.

As the saying goes, "There ain't no Sanity Clause", and that is
especially true for parallelism.  What is 'obviously true' or
'reasonable' for one reader is not so for others, and what seems to be
a clear specification is often (even usually) interpretable in several
different ways.  My couple of decades' experience of helping people
with parallel problems indicates that such alternative interpretations
are probably the main cause of hard-to-debug problems in real programs.

We need to agree on the intended consistency model (sequential, causal
or whatever), so that we can start closing the loopholes.  Speaking to
other people in WG5, I know that they have different mental models of
the intent.  Note that neither N2056 A.4.2 nor 15-139 considers the
inter-facility consistency issues, or the problem of acausality, which
are precisely the aspects that computer science was stuck on for so
many decades.  This is a HARD problem.

Note: there may be errors in this paper, as I am beginning to go
cross-eyed.  This issue is hard enough in the abstract, and is much
worse in combination with Fortran's existing serial specification.


Atomics per se
--------------

The following examples may show why 15-139 will not help, even for
simple uses of atomics.  They are standard examples, and there are a
great many more, so there is no point in addressing these specifically.
In these cases, the interaction is with the basic rules of execution
(i.e. serial ordering between statements).  15-139 allows all of them
to print the 'impossible' value, which will confuse almost all
programmers.  Note that the really nasty one is Atomic_Three, because
it is hard to think of wording that excludes it without also excluding
the others, yet there is just no chance of ordinary users getting
their heads around acausal logic!

PROGRAM Atomic_One
! Is this allowed to print '0' from both images?
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: temp
    IF (NUM_IMAGES() /= 4) STOP
    IF (THIS_IMAGE() <= 2) THEN
        CALL ATOMIC_DEFINE(atom[THIS_IMAGE()+2],1)
        CALL ATOMIC_REF(temp,atom[5-THIS_IMAGE()])
        PRINT *, temp
    END IF
END PROGRAM Atomic_One

PROGRAM Atomic_Two
! Is this allowed to print '1' from both images?
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: temp
    IF (NUM_IMAGES() /= 4) STOP
    IF (THIS_IMAGE() <= 2) THEN
        CALL ATOMIC_REF(temp,atom[5-THIS_IMAGE()])
        CALL ATOMIC_DEFINE(atom[THIS_IMAGE()+2],1)
        PRINT *, temp
    END IF
END PROGRAM Atomic_Two

PROGRAM Atomic_Three
! Is this allowed to print '1' from both images?  The difference
! from Atomic_Two is that the value read from one atomic is used
! to set the other.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: temp
    IF (NUM_IMAGES() /= 4) STOP
    IF (THIS_IMAGE() <= 2) THEN
        CALL ATOMIC_REF(temp,atom[5-THIS_IMAGE()])
        CALL ATOMIC_DEFINE(atom[THIS_IMAGE()+2],temp)
        PRINT *, temp
    END IF
END PROGRAM Atomic_Three

Depending on the decisions taken about the issues described in 15-185,
there may well be further issues arising from that.


Collectives
-----------

The collectives are not image control statements, but Note 8.4
describes an assumed ordering.  However, that is not normative text,
and its exact meaning is not clear: in particular, whether non-blocking
transfers can be used, what code movements are legal optimisations,
whether transfers can be retried if there are lost network responses,
and exactly which constructs require the completion of all outstanding
transfers.  The issues can be exposed by using atomic subroutines, by
using coindexing and SYNC MEMORY, and in other ways, but only the first
is used in the numbered examples below (a coindexed sketch is shown
after Collective_Two for comparison).

The collectives are modelled on MPI, and should not be a problem in
themselves.  If one ignores all of the issues mentioned above, there
are two modes in which they work:

    A) CO_BROADCAST needs just a send on SOURCE_IMAGE and a receive on
       the other images; any collective with RESULT_IMAGE needs just a
       receive on RESULT_IMAGE and a send on the other images.  No
       other synchronisation is needed.

    B) Collectives with neither SOURCE_IMAGE nor RESULT_IMAGE need a
       rendezvous (i.e. they behave like what is now called a barrier).

That is still not formally required.  This brings up the question of
whether those logical requirements are visible in the ordering of
atomic operations.  Also, optimisation can perform quite large code
movements: it can start a send as soon as the data are available and
complete it only when the variable is next needed, and start a receive
as soon as the variable is available and complete it only when the
data are needed.

In the following, the question is whether a processor is allowed to
print either '0' or '1'.  Note that Collective_Six and Collective_Seven
are causally inconsistent, but it is hard to think of any wording that
forbids them from printing '1' without also forbidding Collective_Four
and Collective_Five.

PROGRAM Collective_One
! Does a collective imply any ordering?  The implication of Note 8.4
! is that it does, and this must print '1' - but is that so?
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) CALL ATOMIC_DEFINE(atom[3],1)
    CALL CO_SUM(data,2)
    IF (THIS_IMAGE() == 2) THEN    ! Correction - 1 was erroneous
        CALL ATOMIC_REF(temp,atom[3])
        PRINT *, temp
    END IF
END PROGRAM Collective_One

PROGRAM Collective_Two
! Collective_One, but not using RESULT_IMAGE.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) CALL ATOMIC_DEFINE(atom[3],1)
    CALL CO_SUM(data)
    IF (THIS_IMAGE() == 2) THEN    ! Correction - 1 was erroneous
        CALL ATOMIC_REF(temp,atom[3])
        PRINT *, temp
    END IF
END PROGRAM Collective_Two
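For comparison, the following is a minimal sketch (not one of the
numbered examples; the names Collective_Coindexed and flag are purely
illustrative) of how the question of Collective_One can be posed with
ordinary coindexed access and SYNC MEMORY instead of atomics.  Here
the ordering question appears in a sharper form: if the collective
does not order the segments, the define and the reference of flag are
in unordered segments and the program is not even conforming.

PROGRAM Collective_Coindexed
! Sketch only: Collective_One posed with coindexing and SYNC MEMORY.
! If CO_SUM(...,2) orders image 1's segment before image 2's, this
! must print '1'; if it does not, the define and reference of flag
! are in unordered segments and the program is non-conforming.
    INTEGER :: flag[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) THEN
        flag[3] = 1
        SYNC MEMORY          ! end the segment containing the define
    END IF
    CALL CO_SUM(data,2)
    IF (THIS_IMAGE() == 2) THEN
        SYNC MEMORY          ! start a new segment before the reference
        temp = flag[3]
        PRINT *, temp
    END IF
END PROGRAM Collective_Coindexed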
PROGRAM Collective_Three
! Is a collective allowed to not block?  The implication of Note 8.4
! is that it is, and this may print either '0' or '1' - but is that so?
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 2) CALL ATOMIC_REF(temp,atom[3])
    CALL CO_SUM(data,2)
    IF (THIS_IMAGE() == 1) THEN
        CALL ATOMIC_DEFINE(atom[3],1)
        PRINT *, temp
    END IF
END PROGRAM Collective_Three

PROGRAM Collective_Four
! Can a segment that precedes a 'send' see a value that is set after
! a segment that succeeds a 'receive'?  This is an even less clear
! form of Collective_One.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) THEN
        CALL ATOMIC_REF(temp,atom[3])
        PRINT *, temp
    END IF
    CALL CO_SUM(data,2)
    IF (THIS_IMAGE() == 2) CALL ATOMIC_DEFINE(atom[3],1)
END PROGRAM Collective_Four

PROGRAM Collective_Five
! Collective_Four, but not using RESULT_IMAGE.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) THEN
        CALL ATOMIC_REF(temp,atom[3])
        PRINT *, temp
    END IF
    CALL CO_SUM(data)
    IF (THIS_IMAGE() == 2) CALL ATOMIC_DEFINE(atom[3],1)
END PROGRAM Collective_Five

PROGRAM Collective_Six
! Collective_Four, but using the atomic value as an argument to the
! collective.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) CALL ATOMIC_REF(data,atom[3])
    CALL CO_SUM(data,2)
    IF (THIS_IMAGE() == 1) THEN
        CALL ATOMIC_DEFINE(atom[3],1)
        PRINT *, data
    END IF
END PROGRAM Collective_Six

PROGRAM Collective_Seven
! Collective_Six, but not using RESULT_IMAGE.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: data = 0, temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) CALL ATOMIC_REF(data,atom[3])
    CALL CO_SUM(data)
    IF (THIS_IMAGE() == 1) THEN
        CALL ATOMIC_DEFINE(atom[3],1)
        PRINT *, data
    END IF
END PROGRAM Collective_Seven


Events
------

Issues with these are described in 15-178.


Image Failure
-------------

As there is no existing, portable specification that allows for
recovery from 'system' failures, one can only speculate about the
problems that will arise.  However, in my experience, the TS's basic
assumption is unrealistic: there will be more failures in the network
and its interfaces than in the nodes themselves.  For example, I have
seen all of the following in the network mechanisms that coarrays will
be built upon on commodity systems:

    A) A node stops responding for a long period, or stops responding
       to only some nodes, but then starts responding again.  Some of
       those nodes then regard it as failed but others do not.  The TS
       says nothing about whether image failure is consistent across
       the whole program.

    B) Some of the inter-node links fail, but not others, often in
       topologically complicated ways.  This is completely outside the
       TS's model.

    C) The 'master' node stops responding, which then causes very
       strange effects on other nodes.  For example, I/O to open files
       might continue, but I/O to the standard units and closing files
       might not.

Coarrays also add a serious issue to do with what failure of a node
means, because there are logically five kinds of activity associated
with a node:

    1) Executing statements.
    2) Servicing ordinary data accesses from other nodes.
    3) Servicing atomic data accesses from other nodes.
    4) Accessing ordinary data on other nodes.
    5) Accessing atomic data on other nodes.

Any of those can fail, or stop responding, independently, and it is
unclear which of them are intended to count as failure.
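As a minimal sketch of why this matters (illustrative only, assuming
the TS's STAT_FAILED_IMAGE constant, the STAT= specifier on SYNC ALL,
and the FAILED_IMAGES intrinsic), a program observes 'failure' only
through such status values, so the question is really which of those
five activities must have stopped, and on which images, for failure to
be reported:

PROGRAM Failure_Probe
! Sketch only: an image learns about 'failure' solely through a status
! value on an image control statement or a reference to FAILED_IMAGES.
! Which of the five kinds of activity must have stopped for this to
! report failure, and whether all images get a consistent answer, is
! exactly what is unclear.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    INTEGER :: status
    SYNC ALL (STAT=status)
    IF (status == STAT_FAILED_IMAGE) PRINT *, 'Failed:', FAILED_IMAGES()
END PROGRAM Failure_Probe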
Background
----------

A great deal of current hardware and software has some very weird
properties, including 'out of thin air' effects and apparent time
reversal, and these really do occur in practice.  They can also be
caused by redundant networks, retry after transfer failure, and so on.
Speaking to other people in the parallel development and support area,
most of them had encountered similar issues, but resolved them by
rewriting the code to use different facilities or a different
algorithm, sometimes repeating that for each new environment!

The other aspect is that four decades of experience show that atomics
AS SUCH are not the issue.  The problems are (a) the consistency model
and (b) the interaction of atomics with other facilities.  Introducing
any form of parallelism necessarily requires specifying the serial
ordering (or stating that it is unspecified) MORE precisely than is
needed for serial execution; that 'gotcha' has caught out a great many
specifications.  The only successful specifications that I have seen
have started with a precise consistency model and defined the
constraints in terms of that, which is the approach that C++ has
taken.  Defining constraints in order to specify the consistency model
has been attempted many times but, as far as I know, has never been
done successfully.

The following is copied from 15-178, but is replicated here to help
readers.

Up until the late 1960s, essentially all parallel experts believed
that parallel correctness was merely a matter of getting the
synchronisation right, but experience with the early parallel systems
showed that this was not even approximately true.  It was another
decade before any real progress was made (Hoare in 1978 and Lamport in
1979); that work identified data consistency as the key concept, and
one that could NOT be reduced to a synchronisation property.  An
indication of how deceptive this area is may be seen from the fact
that it was only in the early 1990s that the theoreticians finally
managed to understand the relationship between causality (i.e. that
which is delivered by synchronisation) and the 'obvious' semantics.
See, for example:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.6403
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3356

The situation is far nastier than it seems, because a lot of hardware
and software optimisations allow apparent time reversal, causal
violations, and 'out of thin air' effects; that includes IBM POWER.
These are often, but not always, caused by legal code movements,
caching, retry, asynchronous access and so on.  Despite common belief,
such effects cannot be excluded by ANY specification that limits
itself to the semantics of a single object.

None of the experts I know trust their own intuition, because it is so
common for 'obviously correct' specifications to contain serious
flaws, often unusabilities or inconsistencies.  I have been following
this area for 40 years, have 20 years' practical experience of
supporting parallel programmers, and am still learning new ways in
which 'obvious truths' are incorrect.  Java has had threading since
1995, but its model was discovered to be badly flawed only in about
2003, and it had to be replaced in 2004.

    http://www.cs.umd.edu/~pugh/java/memoryModel/


Progress
--------

This is largely a red herring.  While specifying it precisely seems to
be hard, there seems to be little difficulty in specifying it
pragmatically.
There are two known, successful models:

    A) MPI has specified it for the model where each image handles all
       data accesses, and requests for data, and there are several
       known implementation strategies.

    B) The model that implements virtual shared memory using an
       asynchronous agent (e.g. a separate thread or interrupt) is
       widely used.  On some systems, this can be provided by the
       hardware but, on most current systems, POSIX or similar
       threading will be used.

The key to success of the second is that ALL accesses must be via that
agent, and none must be direct accesses by the main execution thread.
This is because all standard shared-memory specifications need both
threads to participate in a synchronisation.

Model A is adequate for all of Fortran 2008's image control statement
based ordering, and N2020 requires model B for atomic data (including
events and locks).  The requirement for all access to be via the agent
is specified in 8.5.2p3.  There are two, fairly minor, implementation
consequences for POSIX-based systems:

    1) Atomic access to data must be done via the agent, even for data
       on the executing image.

    2) All image control statements must synchronise with their agent,
       as well as with the other image(s).

The normative text proposed in 15-139 seems plausible, but needs to
end "... until after that value is defined by image A and the value
received by image C."
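As a minimal sketch of the pattern that wording needs to cover (using
the same illustrative ATOMIC_INT_TYPE declaration as the examples
above), image 1 plays the role of 'image A' and image 3 that of
'image C'; termination of the spin loop depends on image 2 servicing
atomic accesses to its coarray even though it executes no related
statements - that is, on model B or an equivalent progress guarantee.

PROGRAM Progress_Spin
! Sketch only: image 1 ('image A') defines an atomic on image 2, and
! image 3 ('image C') spins until it sees the value.  Without a
! progress requirement of the kind discussed above, nothing obliges
! the loop on image 3 ever to terminate.
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    TYPE(ATOMIC_INT_TYPE) :: atom[*] = 0
    INTEGER :: temp
    IF (NUM_IMAGES() /= 3) STOP
    IF (THIS_IMAGE() == 1) CALL ATOMIC_DEFINE(atom[2],1)
    IF (THIS_IMAGE() == 3) THEN
        DO
            CALL ATOMIC_REF(temp,atom[2])
            IF (temp == 1) EXIT
        END DO
    END IF
END PROGRAM Progress_Spin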