The
Harwell-Boeing (a.k.a. Rutherford-Boeing) sparse matrix storage
format makes direct use of Fortran 77 I/O format specifiers. This
sounds scary for the modern generation of programmers, who are
indoctrinated to think of Fortran as some monstrosity from the dark
ages of coding, but those who are familiar with the specifiers can
testify that they are generally more flexible and easier to use than
the corresponding C printf/scanf format specifiers. They inspired the
Common Lisp
FORMAT output facility, which is still in active use. (This is
particularly ironic, as Lisp and Fortran are typically cited as two
languages with completely opposite objectives!) Combined with the
Fortran format specifiers make parsing Harwell-Boeing files quite easy in Fortran 77, as can be seen by an example code. The data file actually decodes itself -- it contains the format specifiers needed to parse itself. The technique recalls the much later XML DTD, with its goal that documents be "self-parsing."
The specific case of storing Fortran format specifiers in the data file to be parsed is undesirable, however, for the following reasons:
123456789the following Fortran code
READ 122,A,B,C,D 122 FORMAT (F3.1,F2.2,F3.0,TL6,F4.2)breaks up the data into four different numbers as follows:
It's unlikely that typical consumers of Harwell-Boeing files would have resorted to the full generality allowed by Fortran format specifiers. The whole point of the Harwell-Boeing format is to provide a standard storage scheme to allow communication of test problems. Harwell-Boeing matrix files are not typically distributed with input codes, so they should be considered pretty much human-readable. There are many cases in which integer indices are stored without separation by spaces or any other delimiter, but this is common enough and the format specifier's field width parameter makes decoding this easy.
Due to the above reasons, I've implemented only a restricted subset of the allowed Fortran format specifiers. The only necessary specifiers refer to numeric data types: integers, and floating-point numbers (real or complex) in either scientific notation or fixed-point format. Format specifiers not in this subset are rejected. The subset has a slightly relaxed syntax: I ignore case, and supply defaults for certain numeric parameters.
Here is the BNF grammar for the allowed set of format specifiers. (Note that Fortran format specifiers have to be enclosed in parenthesis; we relax this constraint.) Parentheses, the period, *, +, and ? have their usual metalinguistic meanings; their explicit use is always escaped by a preceding backslash. "\d" refers to a single digit character.
<format> --> \(<format-string>\) <format-string> --> <count>?<rest> <count> --> \d+ <rest> --> <fixedid>|<intid>|<fltid>|<doubleid>|<generalid> <fixedid> --> F<field-width>\.<digits-after-decimal-point> <intid> --> I<field-width>(\.<min-num-digits>)? <fltid> --> E<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)? <doubleid> --> D<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)? <generalid> --> G<field-width>\.<decimal-significand-length>(E<num-digits-in-exponent>)? <field-width> --> \d+ <min-num-digits> --> \d+ (defaults to zero) <num-digits-in-exponent> --> \d+ <decimal-significand-length> --> \d+ <digits-after-decimal-point> --> \d+Note that I've extended the format specifiers for general floating-point numbers (G) and double-precision floating-point numbers (D) to allow specifying the exponent length.
Complex numbers are typically encoded in the Harwell-Boeing collection as pairs of real numbers. An example format for nonzero values might be (6E13.5) -- consecutive pairs of real numbers in the file form a single complex number. More complex formats are possible, but we do not support these.
What are some options for parsing and working with these Fortran format specifiers? There is a Fortran format specifier parser in the f2cl package (which converts a large subset of Fortran 77 code to Common Lisp), but it's hard to extract the relevant code. Besides, we don't want a fully general parser, as it's a security risk.
Second, we can do the parsing ourselves. Note that once we've parsed the strings, using them in Common Lisp should be fairly easy, as the format specifiers used by Common Lisp's FORMAT function were inspired in part by Fortran I/O format specifiers. It should be feasible to translate the Fortran format specifiers into CL FORMAT format specifiers. For example:
~<field-width>,<digits-after-decimal-point>,<scaling-factor-defaulting-to-zero>,<overflowchar>,<padchar>Fso that the number uses exactly <field-width> characters to print out. This is quite similar to Fortran's "Fw.d" descriptor. In Fortran 77, the w.d (<field-width> and <digits-after-decimal-point>) specifiers are required. In Fortran, the pad character is always space and the overflow character is always asterisk.
Still, we need to parse the Fortran format specifiers. We could either use the CL-YACC package (basically, YACC in Common Lisp), or write a simple parser ourselves. I've chosen the latter route, as it results in fewer dependencies. Furthermore, the restricted grammar can be parsed with a simple recursive-descent parser that uses one token lookahead. The token stream is short enough that we simply lex it all at once and store it explicitly as a list. Lexing is fairly straightforward (one character lookahead, which is trivial if we have the string, and legal with both Common Lisp and C stream-based I/O).
K. A. Remington of the National Institute of Standards and Technology wrote a C code for reading and writing Harwell-Boeing format sparse matrix files, sometime in the mid-1990's (perhaps 1997?). I took this code and adapted it as the Harwell-Boeing I/O module for the BeBOP Sparse Matrix Converter. It uses the following defaults:
PTRFMT = INDFMT = "(8I10)" VALFMT = RHSFMT = "(4E20.13)"The comments in the C code say that it does not support the D and P edit descriptors. It expects an integer format like the following:
\(<num-per-line>I<field-width>\)The real format that it expects (the code only supports real matrices) is pretty hard to figure out directly from the C code, as it's full of interesting pointer arithmetic. I'm guessing that it's like this:
\((<scaling-factor>P)?<count>[DEF]<field-width>.<decimal-significand-length>(E<num-digits-in-exponent>)?\)The scaling factor specified by the optional P parameter was considered an obsolete language feature by the Fortran 77 Committee, according to Ellis. If P is in the format string, the scaling factor is removed, as the scaling factor only affects output, not input. Otherwise, only formats D, E, and F are supported. Note that <num-digits-in-exponent> is ignored on input. We've chosen not to allow the P scaling factor because it makes the value of the output dependent on the format specifier.
I've been finished for a while with a Common Lisp prototype parser for Harwell-Boeing files. It can be called directly from a C program (and used to fill in C arrays) by using the ECL Common Lisp system. ECL isn't too hard to port to any platform with a C compiler and linker that supports shared libraries. I don't have the time to port the Common Lisp parser code to C (I'm not going to say "unfortunately," because the CL code is much easier to debug than the C code!).
This research was supported in part by the National Science Foundation. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.