Parsing Fortran 77 I/O format specifiers in Harwell-Boeing sparse matrix files

Fortran 77 I/O format specifiers

The Harwell-Boeing sparse matrix storage format (like its successor, the Rutherford-Boeing format) makes direct use of Fortran 77 I/O format specifiers. This sounds scary to the modern generation of programmers, who are indoctrinated to think of Fortran as some monstrosity from the dark ages of coding, but those who are familiar with the specifiers can testify that they are generally more flexible and easier to use than the corresponding C printf/scanf format specifiers. They inspired the Common Lisp FORMAT output facility, which is still in active use. (This is particularly ironic, as Lisp and Fortran are typically cited as two languages with completely opposite objectives!) Combined with the internal file facility (as described by Ellis, pp. 430-1 [1]), they allow quite general manipulation of numbers and strings, similar to that provided by C++ stringstreams or the equivalent Common Lisp constructs. Fortran format specifiers are described in [1] and [2], among other sources.

Fortran format specifiers make parsing Harwell-Boeing files quite easy in Fortran 77. The data file effectively decodes itself: it contains the format specifiers needed to parse its own data. The technique recalls the much later XML DTD, with its goal that documents be "self-parsing."

The specific case of storing Fortran format specifiers in the data file to be parsed is undesirable, however, for the following reasons:

- Fortran 77 is considered a "legacy language." Fewer and fewer coders learn Fortran 77, and even fewer of these learn the format specifiers. It's much more desirable to use an "obvious" encoding (such as a scientific notation format) which doesn't need to be specified explicitly in the file. The Matrix Market data file format takes this approach, for example.
- Fortran formatted I/O offers a quite complex array of formatting options. We don't want to have to support them all.
- Encoding the format specifier (for any language's I/O system) directly in the input file is a potential security hole. C programmers are well aware of this; some versions of standard tools such as lint warn about "variable format strings" (as inputs to printf or scanf). One could imagine crafting an input file so as to cause buffer overruns in the I/O system. Furthermore, some languages or language extensions support format specifiers that allow nearly arbitrary code execution.
- One unfortunate consequence of Fortran I/O in its full generality is that "what you see may not be what you get." For example, given the data
```
123456789
```
the following Fortran code
```
    READ 122,A,B,C,D
122 FORMAT (F3.1,F2.2,F3.0,TL6,F4.2)
```
breaks up the data into four different numbers as follows:
- A = 12.3 (it read 3 characters, "123", and put 1 character after the decimal point)
- B = 0.45 (read 2 characters, put 2 after the decimal point)
- C = 678.0 (read three characters, put 0 after the decimal point)
- D = 34.56 (count back 6 characters from the "9" to the "3"; read 4, put 2 after the decimal point)
It's unlikely that typical consumers of Harwell-Boeing files would have resorted to the full generality allowed by Fortran format specifiers. The whole point of the Harwell-Boeing format is to provide a standard storage scheme to allow communication of test problems. Harwell-Boeing matrix files are not typically distributed with input codes, so they should be considered pretty much human-readable. There are many cases in which integer indices are stored without separation by spaces or any other delimiter, but this is common enough and the format specifier's field width parameter makes decoding this easy.
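To make the READ example above easier to follow, here is a minimal Python sketch that walks through the same record by hand. The function name `read_fields` and the list-of-strings representation of the format are assumptions made for illustration. It handles only the Fw.d and TLn edit descriptors used in that example, and it ignores Fortran's rules for blanks within a field.

```python
import re

def read_fields(record, edits):
    """Apply simplified edit descriptors (Fw.d and TLn only) to one record."""
    pos = 0            # zero-based index of the next character to read
    values = []
    for edit in edits:
        tab = re.fullmatch(r"TL(\d+)", edit)
        if tab:
            # TLn moves the read position n characters to the left.
            pos = max(0, pos - int(tab.group(1)))
            continue
        fixed = re.fullmatch(r"F(\d+)\.(\d+)", edit)
        if fixed is None:
            raise ValueError("unsupported edit descriptor: " + edit)
        width, decimals = int(fixed.group(1)), int(fixed.group(2))
        field = record[pos:pos + width]
        pos += width
        if "." in field:
            # An explicit decimal point in the data overrides d.
            values.append(float(field))
        else:
            # No decimal point: the last d digits form the fraction.
            values.append(int(field) / 10 ** decimals)
    return values

print(read_fields("123456789", ["F3.1", "F2.2", "F3.0", "TL6", "F4.2"]))
# prints [12.3, 0.45, 678.0, 34.56]
```

The fourth value comes from characters that were already consumed by the first three fields, which is exactly the "what you see may not be what you get" effect.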

Restricted subset of Fortran 77 I/O format specifiers

For the reasons above, I've implemented only a restricted subset of the allowed Fortran format specifiers. The only necessary specifiers refer to numeric data types: integers and floating-point numbers (real or complex), the latter in either scientific notation or fixed-point format. Format specifiers not in this subset are rejected. The subset has a slightly relaxed syntax: I ignore case, and supply defaults for certain numeric parameters.

Here is the BNF grammar for the allowed set of format specifiers. (Note that Fortran format specifiers have to be enclosed in parentheses; we relax this constraint.) Parentheses, the period, *, +, and ? have their usual metalinguistic meanings; their explicit use is always escaped by a preceding backslash. "\d" refers to a single digit character.
```
<format> --> \(<format-string>\)
<format-string> --> <count>?<rest>
<count> --> \d+
<rest> --> <fixedid>|<intid>|<fltid>|<doubleid>|<generalid>
<fixedid> --> F<field-width>\.<digits-after-decimal-point>
<intid> --> I<field-width>(\.<min-num-digits>)?
<fltid> --> E<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)?
<doubleid> --> D<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)?
<generalid> --> G<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)?
<field-width> --> \d+
<min-num-digits> --> \d+  (defaults to zero)
<num-digits-in-exponent> --> \d+
<decimal-significand-length> --> \d+
<digits-after-decimal-point> --> \d+
```
Note that I've extended the format specifier for double-precision floating-point numbers (D) to allow specifying the exponent length. Standard Fortran 77 already allows an exponent length for the E and G specifiers, but not for D.

Complex numbers are typically encoded in the Harwell-Boeing collection as pairs of real numbers. An example format for nonzero values might be (6E13.5) -- consecutive pairs of real numbers in the file form a single complex number. More complex formats are possible, but we do not support these.
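As an illustration of the pairing just described, the following Python sketch reads one record written with (6E13.5) and combines consecutive reals into complex numbers. The function name `read_complex_line` and its default arguments are hypothetical. The sketch also assumes that a complex number never straddles two lines, which holds when the count per line is even.

```python
def read_complex_line(line, count=6, width=13):
    """Split one (6E13.5)-style record into a list of complex numbers."""
    line = line.rstrip("\r\n")
    reals = []
    for i in range(count):
        field = line[i * width:(i + 1) * width].strip()
        if field == "":
            break  # a short final line holds fewer than `count` fields
        # Python's float() does not accept Fortran's D exponent letter.
        reals.append(float(field.replace("D", "E").replace("d", "e")))
    if len(reals) % 2 != 0:
        raise ValueError("odd number of reals on a complex-valued line")
    # Consecutive (real, imaginary) pairs form one complex number each.
    return [complex(real, imag) for real, imag in zip(reals[0::2], reals[1::2])]

line = "".join("%13.5E" % v for v in [1.5, -2.0, 0.25, 4.0])
print(read_complex_line(line))
```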

How do we parse them?

What are some options for parsing and working with these Fortran format specifiers? First, there is a Fortran format specifier parser in the f2cl package (which converts a large subset of Fortran 77 code to Common Lisp), but it's hard to extract the relevant code. Besides, we don't want a fully general parser, as it's a security risk.

Second, we can do the parsing ourselves. Note that once we've parsed the strings, using them in Common Lisp should be fairly easy, as the format specifiers used by Common Lisp's FORMAT function were inspired in part by Fortran I/O format specifiers. It should be feasible to translate the Fortran format specifiers into CL FORMAT format specifiers. For example:
- "~5D" is a base-10 integer with a field of five characters; spaces are used as padding if the number (including sign and digits) takes up less than five characters. This is pretty much equivalent to Fortran's "(I5)".
- ~F is a fixed-format floating-point number. More specifically, one can write
```
~<field-width>,<digits-after-decimal-point>,<scaling-factor-defaulting-to-zero>,<overflowchar>,<padchar>F
```
  so that the number uses exactly <field-width> characters to print out. This is quite similar to Fortran's "Fw.d" descriptor. In Fortran 77, the w.d (<field-width> and <digits-after-decimal-point>) specifiers are required. In Fortran, the pad character is always space and the overflow character is always asterisk.
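The two correspondences listed above can be sketched as a small translation function. It is written in Python only to keep it short, and the name `fortran_to_cl_format` is hypothetical. It covers just the Iw and Fw.d cases discussed in the list. Requesting `*` as the overflow character is my own choice, made to mimic Fortran's behavior, and is not taken from any existing translator.

```python
import re

def fortran_to_cl_format(spec):
    """Translate a single Iw or Fw.d edit descriptor into a CL FORMAT directive."""
    spec = spec.strip().upper()
    m = re.fullmatch(r"I(\d+)", spec)
    if m:
        # ~wD pads on the left with spaces, like Fortran's Iw.
        return "~" + m.group(1) + "D"
    m = re.fullmatch(r"F(\d+)\.(\d+)", spec)
    if m:
        # w and d map directly; the scale factor is left at its default,
        # and '* asks FORMAT to print asterisks on overflow.
        return "~{0},{1},,'*F".format(m.group(1), m.group(2))
    raise ValueError("no translation for " + spec)

print(fortran_to_cl_format("I5"))     # prints ~5D
print(fortran_to_cl_format("F10.3"))  # prints ~10,3,,'*F
```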
Still, we need to parse the Fortran format specifiers. We could either use the CL-YACC package (basically, YACC in Common Lisp), or write a simple parser ourselves. I've chosen the latter route, as it results in fewer dependencies. Furthermore, the restricted grammar can be parsed with a simple recursive-descent parser that uses one token lookahead. The token stream is short enough that we simply lex it all at once and store it explicitly as a list. Lexing is fairly straightforward (one character lookahead, which is trivial if we have the string, and legal with both Common Lisp and C stream-based I/O).
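The lex-everything-then-descend approach can be sketched as follows. This is a Python illustration of the technique, not the Common Lisp implementation itself. The names `FormatSpec`, `lex`, and `parse_format` are hypothetical, and the sketch follows the grammar as printed, so it requires the enclosing parentheses. Treating a missing repeat count as 1 is an assumption based on ordinary Fortran semantics.

```python
from collections import namedtuple

# Hypothetical result type; kind is one of "I", "F", "E", "D", "G".
FormatSpec = namedtuple("FormatSpec", "count kind width digits exponent")
DIGITS = "0123456789"

def lex(text):
    """Lex the whole specifier at once into a list of (type, value) tokens."""
    tokens, i = [], 0
    text = text.strip().upper()          # the relaxed syntax ignores case
    while i < len(text):
        ch = text[i]
        if ch in DIGITS:
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            tokens.append(("NUM", int(text[i:j])))
            i = j
        elif ch in "().":
            tokens.append((ch, ch))
            i += 1
        elif ch in "IFEDG":
            tokens.append(("LETTER", ch))
            i += 1
        else:
            raise ValueError("unexpected character %r" % ch)
    return tokens

def parse_format(text):
    """Recursive-descent parse with one token of lookahead."""
    tokens = lex(text)
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise ValueError("expected %s, found %s" % (kind, peek()))
        value = tokens[pos][1]
        pos += 1
        return value

    expect("(")
    count = expect("NUM") if peek() == "NUM" else 1   # assumed default
    kind = expect("LETTER")
    width = expect("NUM")
    exponent = None
    if kind == "I":
        digits = 0                       # <min-num-digits> defaults to zero
        if peek() == ".":
            expect(".")
            digits = expect("NUM")
    else:
        expect(".")
        digits = expect("NUM")
        # The exponent suffix is allowed on E, D, and G, but not on F.
        if kind != "F" and peek() == "LETTER":
            if expect("LETTER") != "E":
                raise ValueError("expected exponent marker E")
            exponent = expect("NUM")
    expect(")")
    if peek() != "EOF":
        raise ValueError("trailing input after closing parenthesis")
    return FormatSpec(count, kind, width, digits, exponent)

print(parse_format("(4E20.13)"))
```

Anything outside the restricted subset, such as a P scale factor or a comma-separated list of descriptors, fails in the lexer or at the first unexpected token, which matches the policy of rejecting specifiers rather than trying to interpret them.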

Existing routines for parsing Harwell-Boeing sparse matrix files

K. A. Remington of the National Institute of Standards and Technology wrote C code for reading and writing Harwell-Boeing format sparse matrix files sometime in the mid-1990s (perhaps 1997?). I took this code and adapted it as the Harwell-Boeing I/O module for the BeBOP Sparse Matrix Converter. It uses the following defaults:
```
PTRFMT = INDFMT = "(8I10)"
VALFMT = RHSFMT = "(4E20.13)"
```
The comments in the C code say that it does not support the D and P edit descriptors. It expects an integer format like the following:
```
\(<num-per-line>I<field-width>\)
```
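A format of this shape is enough to decode index lines even when the integers run together with no delimiter, because the field width alone determines where each number starts and ends. The following Python sketch shows the idea; the function name `read_int_line` is hypothetical, and it is not a transcription of the C code.

```python
import re

def read_int_line(line, fmt="(8I10)"):
    """Slice one record into integers using a (<num-per-line>I<field-width>) format."""
    m = re.fullmatch(r"\((\d+)I(\d+)\)", fmt.strip().upper())
    if m is None:
        raise ValueError("unsupported integer format: " + fmt)
    per_line, width = int(m.group(1)), int(m.group(2))
    line = line.rstrip("\r\n")
    values = []
    for i in range(per_line):
        field = line[i * width:(i + 1) * width]
        if field.strip() == "":
            break  # the last line of a block may be only partly filled
        values.append(int(field))
    return values

print(read_int_line("123456789", "(3I3)"))  # prints [123, 456, 789]
print(read_int_line("  1 10100", "(3I3)"))  # prints [1, 10, 100]
```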
The real format that it expects (the code only supports real matrices) is pretty hard to figure out directly from the C code, as it's full of interesting pointer arithmetic. I'm guessing that it's like this:
```
\((<scaling-factor>P)?<count>[DEF]<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)?\)
```
The scaling factor specified by the optional P parameter was considered an obsolete language feature by the Fortran 77 Committee, according to Ellis. If P is in the format string, the C code strips the scaling factor, on the grounds that the scaling factor only affects output, not input. (Strictly, Fortran 77 also applies the scale factor to input fields that contain no exponent, so this is a simplification.) Otherwise, only formats D, E, and F are supported. Note that <num-digits-in-exponent> is ignored on input. We've chosen not to allow the P scaling factor because it makes the value of the output dependent on the format specifier.

New parsing routines

I finished a Common Lisp prototype parser for Harwell-Boeing files some time ago. It can be called directly from a C program (and used to fill in C arrays) by using the ECL Common Lisp system. ECL isn't too hard to port to any platform with a C compiler and linker that supports shared libraries. I don't have the time to port the Common Lisp parser code to C (I'm not going to say "unfortunately," because the CL code is much easier to debug than the C code!).

References
1. Ellis, T. M. R. Fortran 77 Programming, second edition. Addison-Wesley, 1990.
2. Oxford tutorial on Fortran I/O
This research was supported in part by the National Science Foundation. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
