J3/14-199 To: J3 From: Nick Maclaren Subject: The Progress Issue (Coarrays) Date: 2014 June 21 References: N1754, N2007, 14-158 Several proposals have assumed or are assuming a particular progress model without requiring it in normative code, and without its implications being properly analysed and discussed in J3, let alone WG5. I want to ensure that we do not sleepwalk into accepting a specification that is less generally or reliably implementable than we realise. This paper attempts to summarise the issue, but N1754 describes the technical details (which have not changed since 2008). Introduction ------------ The question of progress (in the MPI sense) is very similar to that of data consistency, and equally nasty. This could be regarded as arising in the half-century-old debate over whether an algorithm need terminate in bounded time, in finite time, with probability one, or only often enough to be useful, to name just four common assertions. But this is not just a theoretical problem. All HPC experts are familiar with the phenomena of spurious deadlock, livelock and the closely-related one of failure to make progress. The questions are (a) what the Fortran standard should require and (b) what constraints that places on implementations. I am not going to repeat all of the evidence here - see N1754 for that, and note that the ONLY change since then is that (apparently) Cray added some atomic operations to their SHMEM man page later that year (2008). My point is that ANY assumption of progress needs a proper proposal, with an analysis of the implications. In particular, we should NOT add a requirement that Fortran coarrays can be implemented only on systems with shared-memory threading without (a) taking the decision explicitly and (b) being aware of what that implies. Current Situation (Fortran 2008) -------------------------------- While no progress model has been proposed, some of us have tried to ensure that the facilities are implementable within the MPI progress model (which has two decades of experience showing that it is both implementable and usable). In particular, we have held to the line that no progress is required except at image control statements and the only ordering is that imposed by segment ordering. This has consequences like an image control statement on image A may block until all the images to which it has previously written data have reached a subsequent image control statement. We specified explicitly that progress between image control statements (i.e. for atomic variables) is processor-dependent, which makes them useless for synchronisation in portable programs. This was a holding position. With this model, a daemon thread is NOT needed for correct execution and eventual termination, though the efficiency might be dire. However, if one permits C code called from one image to wait until an action is taken by C code called from another, then it is trivial to produce examples where deadlock is inevitable without the use of a daemon thread. Such code is currently allowed in Fortran 2008, by being legal in C99 and not forbidden by Fortran. New Assumption -------------- It is not clear whether TS18508 atomics and events are implementable using the MPI progress model (with only image control statements involved), even excluding the use of C. By that, I mean that they can be implemented in such a way that no conforming program will deadlock or livelock. I have spent quite a few hours both trying to convince myself that they are implementable and that I could produce a counterexample, and have failed in both. The example in 14-158 states that it will eventually complete. At the very least, that would require a change to the implicit implementation requirements of Fortran 2008, in that atomic subroutines would now also be required to participate in making progress. In turn, that would mean that they would need to include an event loop, and not merely send a message or simple request. I do not think that we should allow that level of mission creep without a proper WG5 discussion. Implementation Options ---------------------- This is obviously not a complete list, but is the ones that I think are most plausible today (i.e. in 2014). A) Use the MPI progress engine in all participating actions (at least image control statements and perhaps atomics). This has the following disadvantages, as I say above: 1) I am not sure that this will work for all conforming programs. 2) It will NOT work if we allow companion processor code on different images to have order dependencies. 3) Its efficiency is likely to be dire for a great many apparently reasonable (and, theoretically, entirely parallelisable) codes. B) Use a daemon thread for all remote coarray accesses, but not local ones. This is not going to work, in general, because accessing the same object in unsynchronised threads is undefined behaviour, and there are no one-sided fences in at least POSIX and Intel architecture (or, I believe, POWER, SPARC and ARM). C) Rely on special hardware or operating system support. This might be acceptable for the specialist HPC vendors, but is certainly not for the others. I believe that the only existing implementations of the current atomic facilities do this. D) Use a daemon thread (or process) for ALL coarray accesses (even local ones). This has none of the above disadvantages, but does have the following ones: 1) It requires multiple threads (or processes) per 'node' and, at least until recently, that was not supported on all HPC systems. 2) EVERY coarray access involves either an inter-thread handshake or a message to the daemon process, and thus it is slow. 3) It really DOES mean all accesses (including initialisation, termination and access via an associated non-coarray variable), because of the problem mentioned under (B). 4) Separating the two forms of access could be a significant burden on compiler vendors; that might be particularly problematic if a separate process were needed. E) Use an interrupt for all remote accesses. This is mentioned for completeness, because it was a traditional method. However, it is very much a mainframe-era approach (though I believe it is still used on some embedded systems) and is not supported for user code by POSIX or any of the other modern operating systems I have looked at, though it is supported for kernel code at least under Linux. F) Use a daemon thread for just atomic accesses (even local ones), or possibly those plus events and locks. This (in general) has the disadvantages of both (A) and (D), though it would enable the example in 14-158 (and similar code using EVENT_QUERY) to make progress. Summary ------- This paper is proposing nothing, but it attempting to explain why we should not allow ANY assumptions of progress to creep into the specification without proper consideration of the consequences.