J3/98-209

Date:    23 Oct 1998
To:      J3
From:    R. Maine
Subject: Specs and Syntax for B.4, Stream I/O

I. BACKGROUND

Stream I/O is item B.4 on the f2k work plan. It is the only item on the
work plan that has not been addressed in some manner. The rationale for
this item is given in items 63 and 63a of the WG5 repository, N1189.

I consider this item to be of importance both in itself and also as a
component of C interoperability. Other work on C interoperability has
focussed on interoperating with C code, but interoperating with C data
files is also an important item. It would not be particularly convenient
to tell the users that in order to work with C files, they need to write
C code to do so and then call that C code using the other C
interoperability features. Furthermore, as mentioned in the quoted
rationales below, byte-stream files have become a de facto standard far
beyond the direct scope of the C environment.

From item 63:

| Rationale: C-style "byte stream" has become a de facto standard far beyond the
| direct scope of the C environment. In a scientific application it is not
| surprising to have a sensor feeding a stream of data to a processor which in
| turn feeds the results over a heterogeneous network for additional processing.
| Fortran record structure provides the user with an obstacle to overcome in this
| scenario (the processors may not have the same record conventions (even when
| the CPU architecture is the same), etc.)

And from item 63a:

| there is a category of files that are definitely not record oriented.
| This category is called "binary stream files". These files are merely
| constituted of a continuous sequence of storage units, without any internal
| structure. Stream files are prevalent in many operating systems such as Unix,
| DOS, Windows and OS/2. Also, there are "industry-standard" file formats that
| are not record oriented, such as GIF and TIFF formats for digital images.
|
| Accessing stream files with standard Fortran I/O facilities is often difficult:
| unformatted sequential access may fail because the file contains no record
| delimiters. Using unformatted direct access is also awkward since the data
| cannot be accessed easily with fixed record lengths. In short, a new file
| access is needed.

I consider this work item to be of far higher importance than might be
inferred by just looking at its current position in the work plan. It is
also not a particularly difficult item to do, either in terms of
standards work or implementation. The impact on the standard is fairly
localized, and many implementations already do something similar as an
extension.

II. SPECIFICATIONS

Considering the late date, I believe that a fairly minimalist approach to
this work item is appropriate.

WG5 item 63 mentions stream versions of all combinations of
formatted/unformatted and direct/sequential, with a new specifier on the
OPEN statement. WG5 item 63a restricts itself to unformatted I/O and
adds stream as a new kind of access instead of as a new specifier. I
think it works out far more simply and integrates better to add one new
access model along the lines of item 63a, rather than to add 4 new ones,
which is essentially what item 63 proposes.

I earlier considered specifying the new model with the form keyword,
which has some precedent. But this did not seem to integrate as well as
one would like. It raises questions about how to interpret the access
keyword when the form is specified to be stream. On reconsideration, I
am proposing that the new model be specified with the access keyword as
an alternative to sequential or direct. This appears to integrate far
better. Indeed, much of what the standard already says about the data in
unformatted files is the same stuff that needs to be said about these new
files. It is only in matters of record structure that they should much
differ.
Thus, I propose one new file structure, which would be a stream access,
unformatted form file. After deciding that this seemed like the best
approach, I reviewed WG5 document N1189 and found this to be exactly what
item 63a specifies.

In principle, one could also define stream access for formatted files,
but I do not propose to do that now. The proposal does leave that open
as a possible extension, but it seems like too much work for too little
benefit for now. Most requests relating to formatted stream I/O appear
to have more to do with enhancements of non-advancing I/O in that there
still is an underlying record structure (the lines of the file).
Formatted files that truly have no record structure are not particularly
common. For those rare cases where such a file is needed, it could be
handled as character data on an unformatted stream file in combination
with internal I/O. This would be a little awkward, but I don't see it as
common enough to be a significant issue.

If we did add a capability for stream formatted I/O, we would have to do
something to define the interactions with formatting issues that refer to
record boundaries, notably the "/" and "T" format descriptors, format
reversion, and advance="yes". I don't see any reason why we couldn't
define these interactions, but it would be extra work for what I consider
minimal return.

With these thoughts in mind, I think that most of the specifications are
laid out adequately in item 63a, which I therefore quote verbatim:

| Detailed Specification: only new keywords and options in OPEN, READ, WRITE,
| and INQUIRE are needed.
|
| A binary stream file consists of a sequence of processor-dependant storage
| units. These units must be the same to those used to define the record length
| of an unformatted direct file. The storage units are numbered from 1 to n, n
| being the last unit written. Two concepts are present in a stream file. A file
| position pointer is used to locate the next storage unit to be read or
| written. Following the last storage unit of the file, there is an end-of-file
| marker that may physically exist or not. This marker can be checked in a READ
| statement by END or IOSTAT specifiers. The file position pointer may point to
| the end-of-file marker.
|
| Opening a binary stream file could be done by simply adding ACCESS='STREAM' in
| the OPEN statement. The POSITION specifier is valid for these files.
|
| Binary stream I/O should work following in an hybrid fashion between
| sequential and direct access. Sequential access should be done by using the
| syntax of unformatted sequential READ and WRITE, except that the unit is
| connected to a binary stream file. The file position pointer is moved by the
| amount of data storage units transferred by each READ or WRITE statements
| executed. Random access should be provided by adding a POS=location specifier
| to the READ or WRITE statements. Mixed access should be allowed for the same
| unit. When a WRITE statement overwrites a portion of a stream file, only the
| amount of storage tranferred should replace the existing locations; the
| remaining storage units should remain intact (in the contrary of conventional
| unformatted sequential WRITEs).
|
| READ and WRITE statement with POS specifiers but with an empty I/O list merely
| move the pointer inside the file. In the case of a WRITE statement, if the
| position pointer is moved with the POS specifier beyond the end-of-file
| marker, the gap is filled with unitialized data.
|
| The BACKSPACE statement should be disallowed for such files, since there is no
| record delimiters.
|
| The ENDFILE statement is used to truncate the binary stream file
| at the current file pointer position.
|
| New specifiers should be added to the INQUIRE statements.
|
| In particular, a CURRPOS specifier to obtain the current position
| of the pointer, and a FILESIZE specifier that returns the amount
| of units written to the file. The ACCESS specifer should also be
| extended to allow ACCESS='STREAM' to be returned.

The most complicated part of this spec relates to the random positioning
capability. This is a desirable feature and it appears to me that this
spec lays out an approach that will work. Therefore, I'll propose to
keep it in. However, if it proves controversial or difficult, I'm
prepared to accept deletion of the random positioning feature.

One deletion from the above specs: Consistent with the philosophy of
giving the user maximum control over the contents of a stream file, and
in order to improve file portability, delete the concept of a
system-dependent end-of-file mark. There is a terminal position of the
file, and this terminal position may be set to the current position by
using ENDFILE, but there is no mark associated with this. (If the system
file structure inherently includes an end-of-file mark, that is outside
of the scope of the Fortran standard; it would not be considered to be a
part of the file as viewed by Fortran).

And one additional spec: Delete the prohibition against namelist with
internal I/O. Allowing internal namelist I/O will help facilitate
getting the effect of formatted stream I/O by using unformatted stream
I/O in conjunction with internal I/O. But this is not critical and can
be dropped if there is objection.

III. SYNTAX

Much of the syntax follows fairly obviously from the specifications, with
possible minor quibbles about spelling.

A. The OPEN statement.

ACCESS='STREAM' is allowed in the OPEN statement. In such cases, FORM
defaults to 'unformatted', which may be explicitly specified.
FORM='formatted' is not allowed. BLANK, DELIM, and PAD are already
prohibited for unformatted; this prohibition applies. RECL is not
allowed. ASYNC is allowed.

B. The CLOSE, WAIT, REWIND, and ENDFILE statements.

No syntax changes. EOR is allowed in WAIT, but will never happen. (We
already allow it for other cases where it can't happen). The specs
describe the different interpretation of ENDFILE. Take out the
prohibition against namelist on internal files.

C. The BACKSPACE and PRINT statements are disallowed, as is the form of
READ without an io-control-spec-list. (The PRINT and that form of READ
are only for formatted files).

D. The READ and WRITE statements.

Identical in syntax to other unformatted READ/WRITE statements. Cannot
have FMT=, NML=, ADVANCE=, SIZE=, EOR=, REC=. (All but the REC= apply
only to formatted files). May have ASYNC= and ID=.

Add a new POS=scalar-int-expr specifier allowed only for stream files.
If this is specified, the file is positioned to the specified position
prior to the data transfer.

E. The INQUIRE statement.

May return 'STREAM' as a value for access.

Add a STREAM= specifier just like SEQUENTIAL= and DIRECT=. (It is
possible that a processor might not allow stream access to all files).

Add a SIZE=scalar-default-int-variable specifier that returns the file
size in the same units as used for RECL=. (The WG5 item suggested an
example spelling FILESIZE, but I don't think that necessary, though I'd
accept it if the majority prefers; file_size would be another obvious
alternative along that line). This returns a value of -1 if the file
size cannot be determined (for example, if the file is a device instead
of a disk file). SIZE= may also be used for sequential and direct access
files; in those cases, the file size might not be the same as the amount
of data written to the file (i.e. the processor can return the actual
file size; it doesn't have to do anything like keep track of how much of
the size is user data versus how much is record headers).

Add a POS=scalar-default-int-variable specifier that returns the current
file position. This returns -1 if the position cannot be determined.
The result of POS= is undefined for sequential and direct files. The WG5
item used the example spelling CURRPOS, but I find that a bit awkward. I
think it better to use the same spelling for the specifier in the INQUIRE
statement as the one in the READ/WRITE statements.

F. Derived type I/O.

Uses the unformatted derived type I/O routines with no changes. (Those
routines already look almost more like stream I/O than record-oriented
anyway - within the DTIO routine you don't get any file positioning
before or after a read or write).
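
As an illustration only, the syntax of sections A, D, and E might be
exercised as in the sketch below. It assumes a processor that implements
ACCESS='STREAM' and the POS= and SIZE= specifiers as proposed in this
paper. The unit number, file name, and data values are hypothetical and
are not part of the proposal. The position and size values printed are in
processor-dependent storage units, so no particular numbers are claimed
for them.

```fortran
! Sketch only: assumes ACCESS='STREAM', POS=, and SIZE= behave as proposed.
! The unit number, file name, and data values are hypothetical.
program stream_sketch
  implicit none
  integer, parameter :: lun = 10
  integer :: ios, fpos, fsize
  integer :: vals(3)
  character(len=10) :: acc

  ! Section A: stream access; FORM is unformatted and RECL is not given.
  open (unit=lun, file='stream_sketch.dat', access='stream', &
        form='unformatted', status='replace', iostat=ios)
  if (ios /= 0) stop 'open failed'

  ! Section D: without POS=, data is transferred at the current position,
  ! which then moves on by the amount of data transferred.
  write (lun) 11, 22, 33

  ! Section E: ask for the access method, current position, and file size.
  inquire (unit=lun, access=acc, pos=fpos, size=fsize)
  print *, 'access=', trim(acc), ' pos=', fpos, ' size=', fsize

  ! Section D: with POS=, only the storage actually transferred is
  ! replaced; the two integers that follow are left intact.
  write (lun, pos=1) 99

  ! Reposition to the start of the file and read all three values back.
  read (lun, pos=1) vals
  if (any(vals /= (/ 99, 22, 33 /))) stop 1
  print '(i0,2(1x,i0))', vals

  close (lun, status='delete')
end program stream_sketch
```

The second WRITE is the case that differs from unformatted sequential
access: the values 22 and 33 survive the overwrite of the first integer,
so the final READ returns 99, 22, 33.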