What are ReproBLAS ?

ReproBLAS stands for Reproducible Basic Linear Algebra Subprograms.

By reproducibility, we mean getting bitwise identical results from multiple runs of a program on the same input. Floating point summation is not associative because of roundoff errors, so the computed sum depends on the order of summation. Modern processors, which may dynamically assign resources, such as variable numbers of processors, do not guarantee the same order of summation from run-to-run of the same program or subroutine. Reproducibility is needed for debugging, verification and validation, correctness, and even contractual obligations.

ReproBLAS aims at providing users with a set of (Parallel and Sequential) Basic Linear Algebra Subprograms that guarantee reproducibility regardless of the number of processors, of the data partitioning, of the way reductions are scheduled, and more generally of the order in which the sums are computed. We assume only that

  1. Floating point numbers are binary and correspond to the IEEE Floating Point Standard 754-2008
  2. Floating point operations are performed in ROUND-TO-NEAREST mode (ties may be broken arbitrarily)
  3. Underflow occurs gradually (denormal numbers must be supported). It is possible to adapt the algorithms to handle abrupt underflow as well. [1]
The BLAS are commonly used in scientific programs, and the reproducible versions provided in the ReproBLAS will provide high performance while reducing user effort for debugging, correctness checking and understanding the reliability of programs.

Main Goals

In order to support computing environments ranging from multi-core laptops to highly parallel systems, including PetaScale and Cloud Computing, the main goals of the ReproBLAS are:

  1. Reproducible summation, independent of summation order, assuming only a subset of the IEEE 754 Floating Point Standard .
    This is our primary goal. In particular, we only use binary arithmetic (any precision) with the default round-to-nearest-even rounding mode. We need to make some assumptions on the maximum number of summands, which turn out to be \(2^{64}\) and \(2^{33}\) for double and single precision, resp., which is more than enough in practice.
  2. Accuracy at least as good as conventional summation, and tunable .
    The internal representation used by ReproBLAS lets us choose the desired precision. Our default in double precision is to carry at least 80 bits of precision. Our algorithm has an error bound of \[ error \leq n 2^{-80} \max|x_j| + 7 \epsilon|\sum\limits_{j = 0}^{n - 1} x_j| \] where \(\epsilon = 2^{-53}\).
  3. Handle overflow, underflow, and other exceptions reproducibly .
    It is easy to construct small sets of finite floating point numbers, where depending on the order of summation, one might compute +Inf (positive infinity), -Inf, NaN (Not-a-Number), 0 or 1 (the correct answer); Our algorithm guarantees that no overflows can occur until the final rounding to a single floating point number (assuming the above bounds on the number of summands). Summands that are +Inf, -Inf, or NaN propagate in a reproducible way to the final sum. Underflows are handled reproducibly by assuming default IEEE 754 gradual underflow semantics, but abrupt underflow may be handled as well.
  4. One read-only pass over the summands .
    This property is critical for efficient implementation of the BLAS, where summands (like \(A_{ik} \cdot B_{kj}\) in matrix multiply \(C = A \cdot B\)) are computed once on the fly and discarded immediately.
  5. One reduction .
  6. In parallel environments, the cost of a parallel reduction may dominate the total cost of a sum, so analogously to performing a single read-only pass over the data, it is important to be able to perform one parallel reduction operation (processing the data in any order) to compute a reproducible sum.
  7. Use as little memory as possible, to enable tiling .
    Many existing algorithms for reproducible summation represent intermediate sums by a data structure that we called a ``reproducible accumulator'' above. BLAS3 operations like matrix-multiplication require many such reproducible accumulators in order to exploit common optimizations like tiling; otherwise they are forced to run at the much slower speeds of BLAS1 or BLAS2. In order to fit as many of these reproducible accumulators into the available fast memory as needed, they need to be as small as possible. Our default-sized reproducible accumulator occupies 6 double precision floating point words, which is small enough for these optimization purposes.
  8. Usability.
    ReproBLAS is composed of basic operations that can be applied in many situations.

Structure
reproBLAS.h
idxdBLAS.h
idxd.h
idxd_MPI.h

The ReproBLAS are structured so that different layers may be used, depending on what sources of non-reproducibility are present. We first describe the layers, and then how to use them.

At the core of ReproBLAS is the package idxd.h which defines a new data type, the Indexed Floating Point type, as well as lower-level functions to perform reproducible arithmetic operations on indexed and native floating point data types. idxd.h provides the basic operations (conversion, addition, negation, allocation, etc.) for other modules in the library. However, even with just idxd.h, one can easily implement reproducible operations, no matter how the input values are ordered.

The package idxd_MPI.h provides special reduction operators and datatypes for indexed types in an MPI environment, allowing the user to to reproducibly compute the sum of indexed or native floating point numbers distributed across a number of processors regardless of the reduction tree shape used by MPI.

Package idxdBLAS.h provides performance-optimized sequential versions of the BLAS which take in vectors of native floating point data types and return a result in indexed floating point format.

The package reproBLAS.h is built on top of idxdBLAS.h and returns reproducible results in native floating point data types instead of indexed floating point format. Currently, reproBLAS.h only supports single-core computations. In future versions, reproBLAS.h may have a multi-threaded mode. Future versions of ReproBLAS may also include a separate module for distributed memory BLAS.

Usage

As stated above, non-reproducibility occurs because round-off makes floating point addition non-associative, so the answer depends on the order of summation. There are a variety of reasons the order of summation can vary from run to run, including changing the

Depending on the level of control over these sources of non-reproducibility, i.e. which ones the user can rule out, the user can use the fewest layers of the ReproBLAS necessary to obtain reproducibility at a minimum cost. Here are some examples:

  1. Everything is the same from run to run
    The data ordering is fixed, number of processors is fixed, the data partitioning is fixed, the choice of instructions is fixed (eg. using SSE instructions only), the data alignment is fixed, and the reduction tree shape is fixed and scheduled deterministically, then the computed results will be be reproducible: There is no need to use ReproBLAS.
  2. Fixed Data Ordering and Partitioning
    If the number of processors is fixed and the data is partitioned in a deterministic way, with the same order of data on each processor, then the local summations may be carried out in a deterministic sequential order without the ReproBLAS. The only source of non-reproducibility is the reduction tree shape, so a final reproducible can be computed using only idxd.h and idxd_MPI.h package.
  3. Fixed Data Ordering and Layout
    There is no assumption about the number of processors, or reduction tree shape and scheduling. However the data is laid out in blocks of both fixed size and fixed ordering inside each block. This is sometimes referred as an over-decomposition technique. In this case, summation inside each block can again be carried out in a deterministic sequential order without the ReproBLAS. Then idxd.h and idxd_MPI.h can be used to aggregate the intermediate values and produce the final reproducible result.
  4. No control over data/resources
    If there is no assumption about any of the above mentioned sources, i.e. even the data ordering is not known, then idxdBLAS.h and idxd_MPI.h are needed to guarantee reproducibility.
Performance

ReproBLAS is comparable in speed to commercially available optimized non-reproducible BLAS. For example, in the case of a dot product the slowdown of the ReproBLAS compared to a performance-optimized non-reproducible dot product is 4x on a single Intel "Sandy Bridge" core; here the result is reproducible regardless of how the input vector is permuted. [1] And on a large scale system of more than 512 Intel "Ivy Bridge" cores (the Edison machine at NERSC), the slowdown is less than 1.2x for the summation of 1,000,000 double precision floating point numbers [2]; here the result is also reproducible regardless of how the input vector is partitioned among nodes as well as how the local input vector is stored within a node.

Status

The current version only supports BLAS that perform matrix-matrix multiply, matrix-vector multiply, and reductions (namely summation, absolute value summation, dot product, and 2-norm) for the 4 basic types of data: double, float, double complex, and float complex.

Future versions (under development) will include:

Details about development status of ReproBLAS can be checked here.

People

Primary contributors to ReproBLAS include:

Contact

We appreciate any possible feedback, bug reports, suggestions, or comments that you might have to improve the product. Please email us at:

Links

Publications

Talks

Class Projects

Acknowledgements

This research is supported in part by NSF grants NSF OCI-1032639, NSF ACI-1339676, DOE grants DOE DE-SC0010200, DOE DE-SC0003959, DOE DE-SC0005136, DOE DE-SC0008699, DOE DE-SC0008700, DOE DE-SC0004938, DOE AC02-05CH11231, and DARPA grant HR0011-12-2-0016, ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, and Oracle. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and does not necessarily reflect the position or the policy of the sponsors.

Last update 2/21/2016