To: J3 J3/13-350 From: Reinhold Bader Subject: Spare images example Date: 2013 October 09 References: N1983 Discussion: ~~~~~~~~~~~ This late paper (sorry) suggests a further example for the teams section of the TS' Annex. EDITs to N1983: ~~~~~~~~~~~~~~~ Add the following text to Annex A.1 after the first team example: "Parallel algorithms often use work sharing schemes based on a specific mapping between image index and global data addressing. In order to make such programs continue in the case that an image fails, the combination of using a team of fixed size and provisioning of spare images can be used in order to re-establish execution of the algorithm with the failed image(s) replaced by spare ones, while retaining image indexing. The following example program illustrates how this might be done. Images for which failure cannot be tolerated in this setup are image 1 and the initial set of spare images, whose number is assumed to be small in comparison to the set of active images. PROGRAM possibly_recoverable_simulation USE, INTRINSIC :: iso_fortran_env IMPLICIT NONE INTEGER, ALLOCATABLE :: failed_img(:) INTEGER :: images_used, i, images_spare, id, me, status TYPE(team_type) :: simulation_team LOGICAL :: read_checkpoint, done[*] images_used = ... ! A value slightly below num_images() images_spare = num_images() - images_used read_checkpoint = .FALSE. setup : DO me = this_image() id = 1 IF (me > images_used) id = 0 ! ! set up spare images as replacement for failed ones failed_img = failed_images() if (size(failed_img,1) > images_spare) ERROR STOP 'cannot recover' ifail_used = images_used DO i=1, size(failed_img,1) IF (failed_img(i) > images_used .or. & failed_img(i) == 1) ERROR STOP 'cannot recover' IF (me == images_used + i) THEN me = failed_img(i) id = 1 END IF END DO ! ! set up simulation team of constant size. ! id == 0 does not participate in team execution FORM SUBTEAM (id, simulation_team, NEW_INDEX=me, STAT=status) simulation : CHANGE TEAM (simulation_team, STAT=status) iter : DO CALL simulation_procedure(read_checkpoint, status, done) ! simulation_procedure: ! sets up required objects (maybe coarrays) ! reads checkpoint if requested ! returns status on its internal synchronizations ! returns .TRUE. in done once complete read_checkpoint = .FALSE. IF (status == STAT_FAILED_IMAGE) THEN read_checkpoint = .TRUE. EXIT simulation ELSE IF (done) EXIT iter END IF END DO iter END TEAM simulation (STAT=status) SYNC ALL (STAT=status) IF (this_image() > images_used) done = done[1] IF (done) EXIT setup END DO setup END PROGRAM Supporting the concept of fail-safe execution imposes additional obligations on library writers that use the parallel language facilities: Every synchronization statement, allocation or deallocation of coarrays, or invocation of a collective procedure must specify a synchronization status variable, and implicit deallocation of coarrays must be avoided. Especially, coarray module variables that are allocated inside the team execution context are not persistent."