13-276r1
To:      J3
From:    Nick Maclaren/Steve Lionel
Subject: Changes for failed image wording
Date:    2013 June 26

The following changes are intended to make it clear that the failed image facility provides the processor with a way to indicate failure that can be tested in a portable program, but does not require a processor to support failure recovery or specify exactly what the consequences are following failure. This paper also reflects a change to make STAT_FAILED_IMAGE the lowest priority failure status rather than the highest priority, and an editorial change from Malcolm.

Edits to N1967
--------------

[Introduction:iv:p2] After sentence 2 add

"The existing system does not provide a mechanism for a processor to identify what images have failed during execution of a program. This adversely affects the resilience of programs executing on large systems."

[Introduction:iv:p3] After the first comma in sentence 1, add

"for the processor to indicate which images have failed during execution and allow continued execution of the program on the remaining images, "

[5.6 STAT FAILED IMAGE:p11] This should be replaced by:

"The value of the default integer scalar constant STAT_FAILED_IMAGE is different from the value of STAT_STOPPED_IMAGE, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, or STAT_UNLOCKED. If the processor has the ability to detect that an image of the current team has failed and does so, the value of STAT_FAILED_IMAGE is assigned to the variable specified in a STAT= specifier in an execution of an image control statement, or the STAT argument in an invocation of a collective procedure.

A failed image is one for which references or definitions of a variable fail when that variable should be accessible, or the image fails to respond as part of a collective activity. A failed image remains failed for the remainder of the program execution.
If more than one nonzero status value is valid for the execution of a statement, the status variable is defined with a value other than STAT_FAILED_IMAGE. The conditions that cause an image to fail are processor dependent.

NOTE 5.4
A failed image is usually associated with a hardware failure of the processor, memory system, or interconnection network. A failure that occurs while a coindexed reference or definition, or collective action, is in progress may leave variables on other images that would be defined by that action in an undefined state. Similarly, failure while using a file may leave that file in an undefined state. A failure on one image may cause other images to fail for that reason."

[7.2;p15:22-29] Replace paragraph with:

If the STAT argument is present in an invocation of a collective subroutine and an error condition occurs, the argument is assigned a nonzero value and the effect is otherwise the same as that of executing the SYNC MEMORY statement. If execution involves synchronization with an image that has stopped, the argument becomes defined with the value of STAT_STOPPED_IMAGE in the intrinsic module ISO_FORTRAN_ENV; otherwise, if no image of the current team has stopped or failed, the argument is assigned a processor-dependent positive value that is different from the value of STAT_STOPPED_IMAGE or STAT_FAILED_IMAGE in the intrinsic module ISO_FORTRAN_ENV. If an image had failed, but no other error condition occurred, the argument is assigned the value of the constant STAT_FAILED_IMAGE.

[8.5 Edits to clause 8;p27:27-35] Replace paragraph with the following:

If the STAT= specifier appears in a CHANGE TEAM, END TEAM, FORM SUBTEAM, LOCK, SYNC ALL, SYNC IMAGES, SYNC MEMORY, or UNLOCK statement and its execution is not successful, the specified variable becomes defined with a nonzero value and the effect is otherwise the same as that of executing the SYNC MEMORY statement.
If there is a stopped image in the current team, the variable becomes defined with the constant STAT_STOPPED_IMAGE in the intrinsic module ISO_FORTRAN_ENV (13.8.2); otherwise, if no image of the current team has been detected as stopped or failed, the variable becomes defined with a processor-dependent positive value that is different from the value of STAT_STOPPED_IMAGE or STAT_FAILED_IMAGE in the intrinsic module ISO_FORTRAN_ENV (13.8.2). If an image had been detected as failed, the variable becomes defined with the constant STAT_FAILED_IMAGE.

[8.6 Edits to clause 13;p28:22-29] Replace paragraph with the following:

If the STAT argument is present in an invocation of a collective subroutine and an error condition occurs, the argument is assigned a nonzero value and the effect is otherwise the same as that of executing the SYNC MEMORY statement. If execution involves synchronization with an image that has stopped, the argument becomes defined with the value of STAT_STOPPED_IMAGE in the intrinsic module ISO_FORTRAN_ENV (13.8.2); otherwise, if no image of the current team has stopped or failed, the argument is assigned a processor-dependent positive value that is different from the value of STAT_STOPPED_IMAGE or STAT_FAILED_IMAGE in the intrinsic module ISO_FORTRAN_ENV (13.8.2). If an image had been detected as failed, but no other error condition occurred, the argument is assigned the value of the constant STAT_FAILED_IMAGE.
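
Illustration (not part of the edits): the replacement paragraphs, read together with the 5.6 rule that STAT_FAILED_IMAGE is never chosen when another nonzero value is valid, suggest a simple priority order for the status value. The Python sketch below models that reading only; it is not Fortran and not proposed standard text. The numeric values, the name STAT_OTHER_ERROR, and the function stat_value are hypothetical. In a real program the constants come from ISO_FORTRAN_ENV and their values are processor dependent. Only the ordering of the tests is meant to mirror the wording above.

```python
# Hypothetical stand-ins for the ISO_FORTRAN_ENV constants. The real values
# are processor dependent; all that matters here is that they are distinct
# and nonzero.
STAT_STOPPED_IMAGE = 1
STAT_FAILED_IMAGE = 2
# Stands in for "a processor-dependent positive value that is different from
# the value of STAT_STOPPED_IMAGE or STAT_FAILED_IMAGE".
STAT_OTHER_ERROR = 3


def stat_value(stopped: bool, other_error: bool, failed: bool) -> int:
    """Model which value a STAT variable receives under one reading of the edits.

    stopped     -- an image involved in the execution has stopped
    other_error -- some other error condition occurred
    failed      -- an image has been detected as failed
    """
    if stopped:
        # A stopped image is tested first in the replacement paragraphs.
        return STAT_STOPPED_IMAGE
    if other_error:
        # 5.6: if more than one nonzero value is valid, the value chosen is
        # one other than STAT_FAILED_IMAGE.
        return STAT_OTHER_ERROR
    if failed:
        # Lowest priority: reported only when nothing else applies.
        return STAT_FAILED_IMAGE
    # Successful execution is modelled as zero.
    return 0
```

Under this reading, stat_value(stopped=False, other_error=True, failed=True) yields the other error value rather than STAT_FAILED_IMAGE, which is the "lowest priority" behaviour described in the introduction of this paper.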