13-270
To: J3
From: Nick Maclaren
Subject: Why are atomics a nightmare?
Date: 2013 June 20

1. Introduction
---------------

I have written this to try to explain why I reacted so negatively to
the lack of a precise data consistency model for the proposed atomics.
Unfortunately, it was written in haste, and may contain bugs. Even if
you don't have the time or interest to read all of this, please look
at the first couple of sections to see why I am concerned.

I have seen almost all of these failures of sanity ('gotchas') over
the years, and most of them can be demonstrated on current systems in
programs that conform to existing, important interfaces. They are bugs
in neither the implementation nor the program - the only error would
be if the programmer relied on them not happening. Some people will
claim that some of these are forbidden by the current standard, but I
try to explain in section 4 why that is not the case.

Note that I am NOT proposing that they all be forbidden - merely that
we need an unambiguous data consistency model, so that all clueful
vendors and programmers can agree on what is required. Some of these
gotchas will remain as features.

2. Interaction With Other Features
----------------------------------

I shall give just three examples, for simplicity; there are a lot more
that I could give.

2.1 Synchronisation failure (1)
-------------------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_DEFINE(x,42)

    Image 2:
        CALL ATOMIC_REF(y,x[1])
        PRINT *, y
        SYNC IMAGES ( (/2,3/) )    ! Assumed to be matching

    Image 3:
        SYNC IMAGES ( (/2,3/) )    ! Assumed to be matching
        CALL ATOMIC_REF(y,x[1])
        PRINT *, y

Could print 42 on image 2 and 0 on image 3.

This can happen, because of the store buffering described below. In
itself, this isn't a problem, but see the next example.

2.2 Synchronisation failure (2)
-------------------------------

    REAL :: p[*] = 0
    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        p = 13
        SYNC MEMORY    ! Assumed to be matching
        CALL ATOMIC_DEFINE(x,42)

    Image 2:
        CALL ATOMIC_REF(y,x[1])
        IF (y == 42) THEN
            SYNC MEMORY    ! Assumed to be matching
            PRINT *, p[1]
        END IF

Could print 0.

This can happen, because of the store buffering described below. It
obviously conflicts with Note 8.4.1, but we agreed and stated in 13.1
paragraph 3 that whether that example would actually work is processor
dependent.

2.3 Collectives
---------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_DEFINE(x,42)
        y = x
        CALL CO_BROADCAST (y)    ! Assumed to be matching

    Image 2:
        CALL CO_BROADCAST (y)    ! Assumed to be matching
        CALL ATOMIC_REF(z,x[1])
        PRINT *, y, z

Could print 42, 0.

This can happen, because of the store buffering described below.

3. Breaches of Sanity
---------------------

3.1 Write reordering
--------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        DO i = 1,3
            CALL ATOMIC_DEFINE(x,i)
        END DO

    Image 2:
        DO i = 4,5
            CALL ATOMIC_DEFINE(x[1],i)
        END DO

    Image 3:
        DO i = 1,5
            CALL ATOMIC_REF(y,x[1])
            PRINT *, y
        END DO

    Image 4:
        DO i = 1,5
            CALL ATOMIC_REF(y,x[1])
            PRINT *, y
        END DO

Could print 0, 1, 2, 3, 4 on image 3 and 4, 5, 6, 0, 1 on image 4.

Everything happens instantaneously, but not in a consistent order. It
will break most programs that use atomics, but it is a very likely
effect on message passing implementations, due to locality. It can be
worse, and the effect can occur when a single image does all of the
writing, where an implementation relies on single-sided fencing or has
multiple paths.
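As an illustration of that single-writer variant, here is a minimal
sketch of my own (not an observed trace):

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_DEFINE(x[2],1)
        CALL ATOMIC_DEFINE(x[2],2)

    Image 2:
        CALL ATOMIC_REF(y,x)
        CALL ATOMIC_REF(z,x)
        PRINT *, y, z

Could print 2, 1 on image 2 if the two updates travel by different
paths, or if only the writing side is fenced, so that the second
update overtakes the first.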
Apparently, the rule that writes must appear in a single order is now
known as coherence, and every modern parallel interface I have seen
has specified it. However, older ones did not always do so; UDP and IP
do not, and even TCP and most file access protocols do not at their
transport level.

3.2 Erroneous accumulation
--------------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        DO i = 1,5
            CALL ATOMIC_ADD(x,i)
        END DO
        PRINT *, x

    Image 2:
        DO i = 6,10
            CALL ATOMIC_ADD(x[1],i)
        END DO
        PRINT *, x[1]

Could print 47.

Everything is instantaneous, but some actions seem to get lost. This
can and does happen on Intel and AMD multi-core CPUs, because of the
store buffering described below.

3.3 Confusing return values
---------------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        DO i = 1,5
            CALL ATOMIC_ADD(x,7,y)
            PRINT *, y
        END DO
        PRINT *, x[1]

    Image 2:
        DO i = 6,10
            CALL ATOMIC_ADD(x[1],13,y)
            PRINT *, y
        END DO
        PRINT *, x[1]

Could print 0, 20, 40, 60, 80 and 100 on image 1 and 7, 7, 20, 27, 30
and 100 on image 2 or, with some implementations, 14, 34, 54, 74, 87
and 100 on image 2.

The total is correct, but the old values are not what was expected.
This can occur when the update happens atomically, but the loading of
the return value has to be emulated (a sketch of such an emulation
appears at the end of section 3).

3.4 Time warps
--------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_REF(y,x[3])
        CALL ATOMIC_DEFINE(x,42)
        PRINT *, y

    Image 2:
        CALL ATOMIC_REF(y,x[1])
        CALL ATOMIC_DEFINE(x,y)

    Image 3:
        CALL ATOMIC_REF(y,x[2])
        CALL ATOMIC_DEFINE(x,y)

Could print 42.

I have not seen this, but understand that it can occur on some current
architectures and with some compilers, possibly POWER.

3.5 Store buffering
-------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_DEFINE(x,42)
        CALL ATOMIC_REF(y,x[2])
        PRINT *, y

    Image 2:
        CALL ATOMIC_DEFINE(x,42)
        CALL ATOMIC_REF(y,x[1])
        PRINT *, y

Could print 0 from both images.

This one is the root cause of the strange effects in the previous
examples on Intel and AMD CPUs, as it is allowed by their
architectural model for atomic operations.

3.6 Independent reads of independent writes
-------------------------------------------

    INTEGER(ATOMIC_INT_KIND) :: x[*] = 0

    Image 1:
        CALL ATOMIC_DEFINE(x,13)

    Image 2:
        CALL ATOMIC_DEFINE(x,69)

    Image 3:
        CALL ATOMIC_REF(y,x[1])
        CALL ATOMIC_REF(z,x[2])
        PRINT *, y, z

    Image 4:
        CALL ATOMIC_REF(z,x[2])
        CALL ATOMIC_REF(y,x[1])
        PRINT *, y, z

Could print 13, 0 on image 3 and 0, 69 on image 4.

Curiously, Intel and AMD CPUs do not have this one, but it is common
on both hardware and software message passing, where the locality of
the images groups as (1,3),(2,4). It is also extremely hard to exclude
on large-scale systems, as it is an inherently global property.
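To make the emulation point in 3.3 concrete, here is a minimal sketch
of one plausible way that a three-argument ATOMIC_ADD might be
emulated when the hardware provides only an uncaptured atomic add;
this is an illustration of mine, not a description of any particular
implementation:

    ! Hypothetical emulation of CALL ATOMIC_ADD(atom, value, old):
    CALL ATOMIC_ADD(atom, value)    ! the update itself is atomic
    CALL ATOMIC_REF(new, atom)      ! but this is a separate load, and
                                    ! other images' adds may have been
                                    ! applied in between
    old = new - value               ! so this need not be the value
                                    ! that the add actually operated on

The total in the variable ends up correct, but the 'old' values
returned to each image can be as confusing as those shown in 3.3.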
4. The Current Situation
------------------------

4.1 The meaning of 'atomic'
---------------------------

I have seen at least the following meanings, in order of increasing
strictness:

  a) An action that either happens or does not, and may be used to
     bypass the synchronisation rules at the cost of losing all
     consistency properties

  b) An action that either happens or does not, and where successive
     writes on a single object occur in a total order, but with no
     other property (not even that reads get the latest copy)

  c) An action that either happens or does not, and where successive
     accesses to a single object occur in a total order, but with no
     consistency property between variables

  d) An action that either happens or does not, where successive
     actions on a single object occur in a single total order, and
     where all atomic actions obey a specified consistency property

Fortran 2008 says:

    An atomic subroutine is an intrinsic subroutine that performs an
    action on its ATOM argument atomically. The effect of executing an
    atomic subroutine is as if the subroutine were executed
    instantaneously, thus not overlapping other atomic actions that
    might occur asynchronously. The sequence of atomic actions within
    ordered segments is specified in 2.3.5. How sequences of atomic
    actions in unordered segments interleave with each other is
    processor dependent.

Unfortunately, the word "instantaneously" is ambiguous in this
context. The reason for this is that, as most of us know, parallel
time is very like time in special relativity: the ordering of events
and simultaneity depend on the observer. Even if all of the subroutine
calls occur instantaneously, that says nothing about the consistency
of their order across threads.

Fortran's current rules are generally known as relaxed atomics, though
they do not even require what is now called coherence (see 3.1 above).
We deliberately took the decision to punt their semantics into the
long grass. However, the proposed TS adds read-modify-write operations
(which OpenMP calls capture), and that approach will no longer work.

4.2 Memory models
-----------------

Relaxed atomics are known to be diabolically difficult for ordinary
programmers to use correctly, and typically require the extra
constraint that 3.1 above does not happen. However, I don't know of
any memory model that can use completely relaxed read-modify-write
atomics, because of the problems described below. Many of them cover
only stores and loads, but that sort of effect makes it very tricky to
emulate read-modify-write atomics efficiently.

The simplest model to adopt would be sequential consistency (Lamport),
both for the programmers and for standardisation. Unfortunately, it is
extremely hard to implement both efficiently and scalably; half of the
logic and cost of a 128-node SGI Origin was needed for that alone.
There is a bit more on this right at the end.

A common assumption in WG5 is that coarrays will be located in the
image in which they are defined, and that atomics will be handled by
the processor instance for that image. But the standard does not say
so, and I describe below why that is unimplementable using the current
operating system standard, POSIX.

This is not a theoretical problem: I have seen almost all of the above
gotchas in actual systems in various parallel environments. In most
cases, they were NOT bugs, but it took ages to explain why to the
users whose failing programs were brought to me.
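Returning to the list of meanings in 4.1, here is the pattern that
most programmers expect to be able to write; it is essentially 2.2
again, and the sketch and its conclusions are mine:

    REAL :: p[*] = 0
    INTEGER(ATOMIC_INT_KIND) :: flag[*] = 0

    Image 1:
        p = 13
        CALL ATOMIC_DEFINE(flag,1)

    Image 2:
        DO
            CALL ATOMIC_REF(y,flag[1])
            IF (y == 1) EXIT
        END DO
        PRINT *, p[1]

Under meanings (a) to (c), image 2 may still print 0, because nothing
relates the visibility of p to that of flag; only a model with a
sufficiently strong cross-variable consistency property, as in (d),
makes this pattern reliable.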
5. Implementations
------------------

5.1 Shared-memory systems
-------------------------

5.1.1 Multi-core CPUs
---------------------

Almost all of these will use the native hardware. Currently, this is
dominated by 'x86', followed by POWER, but there are a lot of others.
Once one gets beyond single-board systems, the properties of the
interconnect become relevant, and they are not always the same as for
the underlying CPUs. These properties are likely to become more
relaxed, because of scalability constraints.

In particular, the problems above can be caused by virtually all of
the standard hardware optimisations, including caching, preloading,
store buffering, asynchronous accesses and directory-based systems.
There is no chance that those are going to become less important.

Where the underlying system uses a more relaxed memory model than the
interface requires, the compiler has to include various forms of
fencing, which can have a serious performance impact. Where the
interface is inadequately designed, the programmer gets whatever the
hardware provides, and programs using atomics cease to be portable;
this has been observed with OpenMP and with POSIX threads and
volatile.

Note that POSIX cannot be used as the implementation basis, whether
images are threads or processes, and whether or not either of the
forms of inter-process shared memory is used, for reasons described
below under message passing. Any correct implementation is necessarily
dependent on hardware facilities for the synchronisation. We can
assume that systems will continue to be different, and direct use of
the memory facilities will lead to the problems mentioned above.

5.1.2 RDMA and similar
----------------------

I am not an expert on these, and the one thing that I know for sure is
that they are all different. My guess is most of them will adopt the
approach that coarrays will be located in the image that they are
defined in, and that atomics will be handled by the hardware on which
that image is running. Bill Long has posted what Cray do, but the
semantics of other systems is unclear.

Strictly, even message passing systems can be directory-based, but I
am currently describing the infrastructure on which Fortran coarrays
will be implemented. At least some of the larger and future systems
will be directory-based, because that is generally felt to be the best
way to achieve scalable coherence. But what consistency model will
they deliver?

Looking at file systems (which are comparable in granularity to
coarrays) seems a good idea. POSIX is very close to sequentially
consistent, but no distributed file systems deliver its semantics.
NFS, AFS, GPFS, DFS and others are all different and all are less or
much less strict. Lustre can be configured from nearly POSIX-conforming
and direly slow to very different and much faster. It is likely that
at least some of the 'RDMA' systems will have at least some of the
issues described below for message passing.

5.2 Message passing
-------------------

I shall assume that this is based on TCP/IP or MPI on a POSIX-based
(or possibly Microsoft) system, because that will be far and away the
most common form. I have failed to find a precise specification of
Microsoft threading, but have reason to believe that it is similar to
POSIX. The same is likely to be true of other, similar infrastructure.

We took care to ensure that Fortran segments and synchronisation could
be implemented by MPI's progress engine; but, to go beyond that, an
implementation needs a helper thread.
The problem is that atomics, like MPI passive one-sided communication
and UPC, need access to the data of an image without the cooperation
of that image. Without helper threads, this leads to deadlock, as I
showed for Berkeley UPC. Let's ignore problems caused by unreliable
transports like UDP or IP, and assume that the implementation is based
on a 'reliable' transport like TCP or MPI.

5.2.1 Helper threads
--------------------

Using a helper thread correctly and efficiently is not easy. The
problem is with accesses to atomic variables from the images on the
owning node. If they use the remote protocol (i.e. via TCP/IP or MPI)
for all accesses to atomic variables, there is no problem. But that
does mean ALL accesses, and not just the ones via the atomic
subroutines, so the performance impact will be considerable.

The reason is that POSIX (like almost all other shared-memory
interfaces) has no one-sided synchronisation mechanisms. It does not
have fences as such, but the same argument applies to their use, so
fences are used in the outline below. In order for thread A to read
data written by a remote thread B to location X, the sequence has to
be as follows:

    Helper:
        Receive update into X
        Fence
    . . .
    Thread A:
        Fence
        Read data from X

If an implementation does not do that, then there will necessarily be
circumstances under which incorrect results occur. So we need to
specify as relaxed a model as is consistent with usability, to allow
maximum flexibility.

5.2.2 Multiple paths
--------------------

This includes the case when multiple network ports and connections are
strapped together to increase bandwidth, but it also includes
redundant networks and many other such configurations. The point is
that one message can overtake another, even when they are sent from
and to the same nodes. MPI specifically forbids this, but UDP and IP
do not, and it will obviously occur when the Fortran processor has
direct access to both paths.

This is a soluble problem, as MPI demonstrates. But, if the standard
does not require it to be excluded, we can assume that at least some
implementations will not serialise their messages. That could be for
simplicity or for extra performance.