J3/13-355r1
To: J3
From: Bill Long, Malcolm Cohen
Subject: Memory model for atomics
Date: 2013 October 17
References: N1983, N1989


Discussion
----------

{Reinhold C}: Request to specify a memory model for atomic functions
(beyond the current model). Possible side effects for EVENT statements.
Response: Defer to {Nick 22}.

{Malcolm Reason 2}: An explicit memory model for atomics [beyond what is
currently in F2008] is needed for both users and vendors.
Response: Defer to {Nick 22}.

{Nick 22}: 7.2 p15:13-15.  This specification will cause massive confusion,
and it was clear from WG5 in Delft that there was no agreement on even the
minimal semantics specified by Fortran.  In particular, several people were
assuming levels of consistency that are not always available in existing
hardware, and would need extra work in the compiler to provide.  At the
very least, there needs to be a Note saying clearly and explicitly that
currently their behavior is deliberately left entirely processor-dependent,
and WG5 intends to provide a proper semantic specification in due course.

Response: The existing normative text, which makes the results of atomic
references from unordered segments processor-dependent, is nearly
sufficient.  Some additional normative text is added, plus an explanatory
note.


Edits to N1983
--------------

[15:17+] Insert new paragraph at the end of 7.2 Atomic subroutines.

  "This Technical Specification does not specify a formal data consistency
   model for atomic references.  An inconsistent model would be worse than
   leaving the behaviour processor-dependent, and developing a proper
   consistency model is best left until after the facilities are integrated
   into the Fortran base language standard."

[29:13] Append sentence to the text to be appended to 13.1p3:
  "If two or more variables are updated by a sequence of atomic memory
   operations on an image, and these changes are observed by atomic
   accesses from an unordered segment on another image, the changes need
   not be observed on the remote image in the same order as they are made
   on the local image, even if the updates on the local image are made in
   ordered segments."

[31:30+] In the new section "8.10 Edits to annex C",
          add a new edit
  "Insert subclause A.3.2 as subclause C.10.3."

[35:26+] Append new subclause A.3.2

"Annex A.3.2 Atomic memory consistency

A.3.2.1 Relaxed memory model

Parallel programs sometimes have apparently impossible behaviour because
data transfers and other messages can be delayed, reordered and even
repeated by hardware, by communication software, and by caching and other
forms of optimisation.  Requiring processors to deliver globally
consistent behaviour is incompatible with performance on many systems.
Fortran specifies that all ordered actions will be consistent ([2.3.5] and
[8.5]), but all consistency between unordered segments is deliberately
left processor dependent or undefined.  Depending on the hardware, such
inconsistencies can be observed even when only two images and one
mechanism are involved.

A.3.2.2 Examples with atomic operations

When variables are being referenced (atomically) from segments that are
unordered with respect to the segment that is atomically defining or
redefining the variables, the results are processor dependent.  This
supports use of so-called "relaxed memory model" architectures, which can
enable more efficient execution on some hardware implementations.

The following examples assume the following declarations:

   MODULE example
     USE iso_fortran_env
     INTEGER(atomic_int_kind) :: x[*] = 0, y[*] = 0
     INTEGER(atomic_int_kind) :: tmp   ! receives the results of atomic_ref
   END MODULE example

Example 1:

With x[j] and y[j] still in their initial state (both zero), image j
executes the following sequence of statements:

  CALL atomic_define(x,1)
  CALL atomic_define(y,1)

and image k executes the following sequence of statements:

  DO
    CALL atomic_ref(tmp,y[j])
    IF (tmp==1) EXIT
  END DO
  CALL atomic_ref(tmp,x[j])
  PRINT *,tmp

The final value of tmp on image k can be either 0 or 1.  That is, even
though image j defines x[j] before it defines y[j], image k is not
guaranteed to observe the two definitions in that order.

There are many aspects of hardware and software implementation that can
cause this effect, but conceptually this example can be thought of as the
change in the value of y propagating faster across the inter-image
connections than the change in the value of x.

Changing the execution on image j by inserting
  SYNC MEMORY
in between the definitions of x and y is not sufficient to prevent
unexpected results; even though x and y are being updated in ordered
segments, the references from image k are both from a segment that is
unordered with respect to image j.

To guarantee the expected value for tmp of 1 at the end of the code
sequence on image k, it is necessary to ensure that the atomic reference on
image k is in a segment that is ordered relative to the segment on image j
that defined x[j]; SYNC MEMORY is certainly necessary, but it is not
sufficient unless its execution is itself ordered, by some other
synchronisation, with respect to the references on image k.

Example 2:

  With the initial state of x and y on image j (i.e. x[j] and y[j]) still
  being zero, execution of

    CALL atomic_ref(tmp,x[j])
    CALL atomic_define(y[j],1)
    PRINT *,tmp

  on image k1, and execution of

    CALL atomic_ref(tmp,y[j])
    CALL atomic_define(x[j],1)
    PRINT *,tmp

  on image k2, in unordered segments, might print the value 1 both times.

  This can happen by such mechanisms as "load buffering"; one might imagine
  that what is happening is that the writes (atomic_define) are overtaking
  the reads (atomic_ref).

  It is likely that insertion of SYNC MEMORY in between the calls to
  atomic_ref and atomic_define will be sufficient to prevent this anomalous
  behaviour, but that is only guaranteed by the standard if the SYNC MEMORY
  executions cause an ordering between the relevant segments on images k1
  and k2.

Example 3:

  Because there are no segment boundaries implied by collective
  subroutines, with the initial state as before, execution of

    IF (this_image()==1) THEN
      CALL atomic_define(x,42)
      y = x
    ENDIF
    CALL co_broadcast(y,1)
    IF (this_image()==2) THEN
      CALL atomic_ref(tmp,x[1])
      PRINT *,y,tmp
    END IF

  could print the values 42 and 0."
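
Illustration (not part of the edits above).  The ordering requirement
described at the end of Example 1 can be sketched as a complete program.
This is only a sketch under stated assumptions: the program name, the use
of images 1 and 2 for j and k, and the choice of SYNC IMAGES as the
mechanism that orders the segments are assumptions made for illustration
and do not come from N1983.

```fortran
PROGRAM ordered_example
  USE, INTRINSIC :: iso_fortran_env, ONLY: atomic_int_kind
  IMPLICIT NONE
  INTEGER(atomic_int_kind) :: x[*] = 0, y[*] = 0
  INTEGER(atomic_int_kind) :: tmp
  ! Hypothetical image indices standing in for j and k of Example 1.
  INTEGER, PARAMETER :: j = 1, k = 2

  IF (num_images() < 2) ERROR STOP 'at least two images are required'

  IF (this_image() == j) THEN
    CALL atomic_define(x, 1)
    CALL atomic_define(y, 1)
    ! Ends the segment containing both definitions; this execution
    ! corresponds to the SYNC IMAGES (j) executed on image k.
    SYNC IMAGES (k)
  ELSE IF (this_image() == k) THEN
    ! The segment after this statement is ordered after the segment on
    ! image j that defined x[j] and y[j].
    SYNC IMAGES (j)
    CALL atomic_ref(tmp, y[j])
    PRINT *, 'y[j] =', tmp
    CALL atomic_ref(tmp, x[j])
    PRINT *, 'x[j] =', tmp
  END IF
END PROGRAM ordered_example
```

Because the two SYNC IMAGES executions order the referencing segment on
image k after the defining segment on image j, both atomic references
obtain the value 1, and the polling loop of Example 1 is no longer needed.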

===END===