#### What are ReproBLAS ?

ReproBLAS stands for Reproducible Basic Linear Algebra Subprograms.

By reproducibility, we mean getting bitwise identical results from multiple runs of a program on the same input. Floating point summation is not associative because of roundoff errors, so the computed sum depends on the order of summation. Modern processors, which may dynamically assign resources, such as variable numbers of processors, do not guarantee the same order of summation from run-to-run of the same program or subroutine. Reproducibility is needed for debugging, verification and validation, correctness, and even contractual obligations.

ReproBLAS aims at providing users with a set of (Parallel and Sequential) Basic Linear Algebra Subprograms that guarantee reproducibility regardless of the number of processors, of the data partitioning, of the way reductions are scheduled, and more generally of the order in which the sums are computed. We assume only that

1. Floating point numbers are binary and correspond to the IEEE Floating Point Standard 754-2008
2. Floating point operations are performed in ROUND-TO-NEAREST mode (ties may be broken arbitrarily)
3. Underflow occurs gradually (denormal numbers must be supported). It is possible to adapt the algorithms to handle abrupt underflow as well. [1]
The BLAS are commonly used in scientific programs, and the reproducible versions provided in the ReproBLAS will provide high performance while reducing user effort for debugging, correctness checking and understanding the reliability of programs.

Main Goals

In order to support computing environments ranging from multi-core laptops to highly parallel systems, including PetaScale and Cloud Computing, the main goals of the ReproBLAS are:

1. Reproducible summation, independent of summation order, assuming only a subset of the IEEE 754 Floating Point Standard .
This is our primary goal. In particular, we only use binary arithmetic (any precision) with the default round-to-nearest-even rounding mode. We need to make some assumptions on the maximum number of summands, which turn out to be $2^{64}$ and $2^{33}$ for double and single precision, resp., which is more than enough in practice.
2. Accuracy at least as good as conventional summation, and tunable .
The internal representation used by ReproBLAS lets us choose the desired precision. Our default in double precision is to carry at least 80 bits of precision. Our algorithm has an error bound of $error \leq n 2^{-80} \max|x_j| + 7 \epsilon|\sum\limits_{j = 0}^{n - 1} x_j|$ where $\epsilon = 2^{-53}$.
3. Handle overflow, underflow, and other exceptions reproducibly .
It is easy to construct small sets of finite floating point numbers, where depending on the order of summation, one might compute +Inf (positive infinity), -Inf, NaN (Not-a-Number), 0 or 1 (the correct answer); Our algorithm guarantees that no overflows can occur until the final rounding to a single floating point number (assuming the above bounds on the number of summands). Summands that are +Inf, -Inf, or NaN propagate in a reproducible way to the final sum. Underflows are handled reproducibly by assuming default IEEE 754 gradual underflow semantics, but abrupt underflow may be handled as well.
4. One read-only pass over the summands .
This property is critical for efficient implementation of the BLAS, where summands (like $A_{ik} \cdot B_{kj}$ in matrix multiply $C = A \cdot B$) are computed once on the fly and discarded immediately.
5. One reduction .
6. In parallel environments, the cost of a parallel reduction may dominate the total cost of a sum, so analogously to performing a single read-only pass over the data, it is important to be able to perform one parallel reduction operation (processing the data in any order) to compute a reproducible sum.
7. Use as little memory as possible, to enable tiling .
Many existing algorithms for reproducible summation represent intermediate sums by a data structure that we called a reproducible accumulator'' above. BLAS3 operations like matrix-multiplication require many such reproducible accumulators in order to exploit common optimizations like tiling; otherwise they are forced to run at the much slower speeds of BLAS1 or BLAS2. In order to fit as many of these reproducible accumulators into the available fast memory as needed, they need to be as small as possible. Our default-sized reproducible accumulator occupies 6 double precision floating point words, which is small enough for these optimization purposes.
8. Usability.
ReproBLAS is composed of basic operations that can be applied in many situations.

Structure
reproBLAS.h
idxdBLAS.h
idxd.h
idxd_MPI.h

The ReproBLAS are structured so that different layers may be used, depending on what sources of non-reproducibility are present. We first describe the layers, and then how to use them.

At the core of ReproBLAS is the package idxd.h which defines a new data type, the Indexed Floating Point type, as well as lower-level functions to perform reproducible arithmetic operations on indexed and native floating point data types. idxd.h provides the basic operations (conversion, addition, negation, allocation, etc.) for other modules in the library. However, even with just idxd.h, one can easily implement reproducible operations, no matter how the input values are ordered.

The package idxd_MPI.h provides special reduction operators and datatypes for indexed types in an MPI environment, allowing the user to to reproducibly compute the sum of indexed or native floating point numbers distributed across a number of processors regardless of the reduction tree shape used by MPI.

Package idxdBLAS.h provides performance-optimized sequential versions of the BLAS which take in vectors of native floating point data types and return a result in indexed floating point format.

The package reproBLAS.h is built on top of idxdBLAS.h and returns reproducible results in native floating point data types instead of indexed floating point format. Currently, reproBLAS.h only supports single-core computations. In future versions, reproBLAS.h may have a multi-threaded mode. Future versions of ReproBLAS may also include a separate module for distributed memory BLAS.

Usage

As stated above, non-reproducibility occurs because round-off makes floating point addition non-associative, so the answer depends on the order of summation. There are a variety of reasons the order of summation can vary from run to run, including changing the

• Data partitioning across processors
• Number of available processors
• Choice of instructions (eg. 2 doubles for SSE instructions vs. 4 doubles for AVX instructions)
• Data alignment in memory (which may affect the order of loading data into multi-word registers)
• Reduction tree shape, which could be scheduled statically or dynamically
• Order of the input data

Depending on the level of control over these sources of non-reproducibility, i.e. which ones the user can rule out, the user can use the fewest layers of the ReproBLAS necessary to obtain reproducibility at a minimum cost. Here are some examples:

1. Everything is the same from run to run
The data ordering is fixed, number of processors is fixed, the data partitioning is fixed, the choice of instructions is fixed (eg. using SSE instructions only), the data alignment is fixed, and the reduction tree shape is fixed and scheduled deterministically, then the computed results will be be reproducible: There is no need to use ReproBLAS.
2. Fixed Data Ordering and Partitioning
If the number of processors is fixed and the data is partitioned in a deterministic way, with the same order of data on each processor, then the local summations may be carried out in a deterministic sequential order without the ReproBLAS. The only source of non-reproducibility is the reduction tree shape, so a final reproducible can be computed using only idxd.h and idxd_MPI.h package.
3. Fixed Data Ordering and Layout
There is no assumption about the number of processors, or reduction tree shape and scheduling. However the data is laid out in blocks of both fixed size and fixed ordering inside each block. This is sometimes referred as an over-decomposition technique. In this case, summation inside each block can again be carried out in a deterministic sequential order without the ReproBLAS. Then idxd.h and idxd_MPI.h can be used to aggregate the intermediate values and produce the final reproducible result.
4. No control over data/resources
If there is no assumption about any of the above mentioned sources, i.e. even the data ordering is not known, then idxdBLAS.h and idxd_MPI.h are needed to guarantee reproducibility.
Performance

ReproBLAS is comparable in speed to commercially available optimized non-reproducible BLAS. For example, in the case of a dot product the slowdown of the ReproBLAS compared to a performance-optimized non-reproducible dot product is 4x on a single Intel "Sandy Bridge" core; here the result is reproducible regardless of how the input vector is permuted. [1] And on a large scale system of more than 512 Intel "Ivy Bridge" cores (the Edison machine at NERSC), the slowdown is less than 1.2x for the summation of 1,000,000 double precision floating point numbers [2]; here the result is also reproducible regardless of how the input vector is partitioned among nodes as well as how the local input vector is stored within a node.

Status

The current version only supports BLAS that perform matrix-matrix multiply, matrix-vector multiply, and reductions (namely summation, absolute value summation, dot product, and 2-norm) for the 4 basic types of data: double, float, double complex, and float complex.

Future versions (under development) will include:

• The higher level BLAS routines trsv, trsm
• Multi-threaded mode (OpenMP for reproBLAS.h)
• Distributed memory routines (MPI)

Details about development status of ReproBLAS can be checked here.

#### People

Primary contributors to ReproBLAS include:

#### Contact

We appreciate any possible feedback, bug reports, suggestions, or comments that you might have to improve the product. Please email us at:
• peter.ahrens@berkeley.edu
• hdnguyen@eecs.berkeley.edu
• demmel@berkeley.edu

#### Publications

• Effcient Reproducible Floating Point Summation and BLAS
EECS Department, UC Berkeley, Technical Report No. UCB/EECS-2016-121, June 18, 2016.

• Reproducible Tall-Skinny QR Factorization
In the proceedings of ARITH 22, Lyon, France, June 22-26, 2015

• Parallel Reproducible Summation
IEEE Transactions on Computer, v.64, i. 7, July 2015. DOI: 10.1109/TC.2014.2345391

• Numerical Accuracy and Reproducibility at ExaScale [invited Paper] (Special Session on ExaScale Compputing: Managing Computation, Precision, Accuracy, and Performance on ExaScale Systems)
ARITH 21, Austin, Texas, April 7-10, 2013

• Fast Reproducible Floating-Point Summation (slides )
ARITH 21, Austin, Texas, April 7-10, 2013

#### Talks

• Toward hardware support for Reproducible Floating-Point Computation
SCAN2014, 16th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic and Validated Numerics, Wurzburg, Germany, Sep 21-26, 2014
• ReproBLAS: Reproducible BLAS
(Workshop on: Obtaining Bitwise Reproducible Results: Perspectives and Latest Advances)
SC'13, Denver, CO, Nov 18-22, 2013
• Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems
(Minisymposium on: Regarding Reproducible and Repeatable Computations)
SIAM Annual Metting 13, July 8-12, 2013
• Numerical Accuracy and Reproducibility at ExaScale
(Special Session on ExaScale Compputing: Managing Computation, Precision, Accuracy, and Performance on ExaScale Systems)
ARITH 21, Austin, Texas, April 7-10, 2013
• Fast Reproducible Floating-Point Summation
ARITH 21, Austin, Texas, April 7-10, 2013

#### Class Projects

• Reproducible Parallel Matrix-Vector Multiply
CS 267 Final Report, UC Berkeley, May 11, 2015.

#### Acknowledgements

This research is supported in part by NSF grants NSF OCI-1032639, NSF ACI-1339676, DOE grants DOE DE-SC0010200, DOE DE-SC0003959, DOE DE-SC0005136, DOE DE-SC0008699, DOE DE-SC0008700, DOE DE-SC0004938, DOE AC02-05CH11231, and DARPA grant HR0011-12-2-0016, ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, and Oracle. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and does not necessarily reflect the position or the policy of the sponsors.

Last update 2/21/2016