What are ReproBLAS?
ReproBLAS stands for Reproducible Basic Linear Algebra Subprograms.
By reproducibility, we mean getting bitwise identical results from multiple runs of a program on the same input.
Floating point summation is not associative because of roundoff errors, so the computed sum depends on the order of summation. Modern processors, which may dynamically assign resources such as variable numbers of processors, do not guarantee the same order of summation from run to run of the same program or subroutine. Reproducibility
is needed for debugging, verification and validation, correctness, and even contractual obligations.
ReproBLAS aims to provide users with a set of (parallel and sequential) Basic Linear Algebra Subprograms that guarantee reproducibility regardless of the number of processors, the data partitioning, the way reductions are scheduled, and, more generally, the order in which the sums are computed.
We assume only that:
- Floating point numbers are binary and conform to the IEEE Floating Point Standard 754-2008
- Floating point operations are performed in round-to-nearest mode (ties may be broken arbitrarily)
- Underflow occurs gradually (denormal numbers must be supported). It is possible to adapt the algorithms to handle abrupt underflow as well. [1]
The BLAS are commonly used in scientific programs, and the reproducible versions provided in ReproBLAS will provide high performance while reducing user effort for debugging, correctness checking, and understanding the reliability of programs.
Main Goals
In order to support computing environments ranging from multicore laptops to highly parallel systems, including petascale and cloud computing, the main goals of ReproBLAS are:

Reproducible summation, independent of summation order, assuming only a subset of the IEEE 754 Floating Point Standard.
This is our primary goal. In particular, we only use binary arithmetic (any precision) with the default round-to-nearest-even rounding mode. We need to make some assumptions on the maximum number of summands, which turn out to be \(2^{64}\) and \(2^{33}\) for double and single precision, respectively, which is more than enough in practice.

Accuracy at least as good as conventional summation, and tunable.
The internal representation used by ReproBLAS lets us choose the desired precision. Our default in double precision is to carry at least 80 bits of precision. Our algorithm has an error bound of
\[
\mathrm{error} \leq n \, 2^{-80} \max_j |x_j| + 7 \epsilon \left| \sum\limits_{j = 0}^{n - 1} x_j \right|
\]
where \(\epsilon = 2^{-53}\).

Handle overflow, underflow, and other exceptions reproducibly.
It is easy to construct small sets of finite floating point numbers where, depending on the order of summation, one might compute +Inf (positive infinity), -Inf, NaN (Not-a-Number), 0, or 1 (the correct answer).
Our algorithm guarantees that no overflows can occur until the final
rounding to a single floating point number (assuming the above bounds on the number
of summands). Summands that are +Inf, -Inf, or NaN propagate in a reproducible way to the final sum.
Underflows are handled reproducibly by assuming default IEEE 754
gradual underflow semantics, but abrupt underflow may be handled as well.

One read-only pass over the summands.
This property is critical for efficient implementation of the BLAS, where
summands (like \(A_{ik} \cdot B_{kj}\) in matrix multiply \(C = A \cdot B\))
are computed once on the fly and discarded immediately.

One reduction.
In parallel environments, the cost of
a parallel reduction may dominate the total cost of a sum, so
analogously to performing a single readonly pass over the data,
it is important to be able to perform one parallel reduction
operation (processing the data in any order) to compute a reproducible sum.

Use as little memory as possible, to enable tiling.
Many existing algorithms for reproducible summation represent intermediate sums by a data structure that we call a "reproducible accumulator". BLAS-3 operations like matrix multiplication require many such reproducible accumulators in order to exploit common optimizations like tiling; otherwise they are forced to run at the much slower speeds of BLAS-1 or BLAS-2. In order to fit as many of these reproducible accumulators into the available fast memory as needed, they need to be as small as possible. Our default-sized reproducible accumulator occupies 6 double precision floating point words, which is small enough for these optimization purposes.

Usability.
ReproBLAS is composed of basic operations that can be applied in many situations.
Structure
(Library headers, layered: reproBLAS.h, idxdBLAS.h, idxd.h, idxd_MPI.h)
The ReproBLAS are structured so that different layers may be used, depending on what sources of non-reproducibility are present. We first describe the layers, and then how to use them.
At the core of ReproBLAS is the package idxd.h, which defines a new data type, the indexed floating point type, as well as lower-level functions to perform reproducible arithmetic operations on indexed and native floating point data types.
idxd.h provides the basic operations (conversion, addition, negation, allocation, etc.) for other modules in the library. However, even with just idxd.h, one can easily implement reproducible operations, no matter how the input values are ordered.
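Even at this lowest layer, a reproducible sum is only a few calls. The sketch below is pseudocode with illustrative names, not the verbatim interface; the actual identifiers and the accumulator precision parameter should be taken from idxd.h itself:

```
// Pseudocode sketch: reproducibly sum n doubles with an indexed accumulator.
fold = DEFAULT_FOLD                    // accumulator precision setting
X    = indexed_alloc(fold)             // allocate an indexed accumulator
indexed_set_zero(fold, X)
for i in 0 .. n-1:
    indexed_add_double(fold, X, x[i])  // these adds may occur in any order
sum = indexed_to_double(fold, X)       // one final rounding to a native double
```

Because each add into the indexed accumulator is order-independent, any permutation of the loop yields the same bits in `sum`.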
The package idxd_MPI.h provides special reduction operators and datatypes for indexed types in an MPI environment, allowing the user to reproducibly compute the sum of indexed or native floating point numbers distributed across a number of processors, regardless of the reduction tree shape used by MPI.
The package idxdBLAS.h provides performance-optimized sequential versions of the BLAS, which take in vectors of native floating point data types and return a result in indexed floating point format.
The package reproBLAS.h is built on top of idxdBLAS.h and returns reproducible results in native floating point data types instead of indexed floating point format. Currently, reproBLAS.h only supports single-core computations. In future versions, reproBLAS.h may have a multithreaded mode. Future versions of ReproBLAS may also include a separate module for distributed memory BLAS.
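At this top layer, usage would reduce to a single call per reduction; the pseudocode names below are illustrative only (consult reproBLAS.h for the real interface):

```
// Pseudocode sketch: indexed types are hidden entirely at this layer.
sum = reproblas_sum(n, x, incx)            // same bits for any permutation of x
dot = reproblas_dot(n, x, incx, y, incy)   // reproducible dot product
```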
Usage
As stated above, non-reproducibility occurs because roundoff makes floating point addition non-associative, so the answer depends on the order of summation.
There are a variety of reasons the order of summation can vary from run to run, including changes in the:
- Data partitioning across processors
- Number of available processors
- Choice of instructions (e.g., 2 doubles for SSE instructions vs. 4 doubles for AVX instructions)
- Data alignment in memory (which may affect the order of loading data into multiword registers)
- Reduction tree shape, which could be scheduled statically or dynamically
- Order of the input data
Depending on the level of control over these sources of non-reproducibility, i.e. which ones the user can rule out, the user can use the fewest layers of ReproBLAS necessary to obtain reproducibility at minimum cost. Here are some examples:

Everything is the same from run to run
If the data ordering is fixed, the number of processors is fixed, the data partitioning is fixed, the choice of instructions is fixed (e.g. using SSE instructions only), the data alignment is fixed, and the reduction tree shape is fixed and scheduled deterministically, then the computed results will be reproducible: there is no need to use ReproBLAS.

Fixed Data Ordering and Partitioning
If the number of processors is fixed and the data is partitioned in a deterministic way, with the same order of data on each processor, then the local summations may be carried out in a deterministic sequential order without ReproBLAS. The only source of non-reproducibility is the reduction tree shape, so a final reproducible result can be computed using only the idxd.h and idxd_MPI.h packages.

Fixed Data Ordering and Layout
There is no assumption about the number of processors, or about the reduction tree shape and scheduling. However, the data is laid out in blocks of fixed size, with a fixed ordering inside each block. This is sometimes referred to as an overdecomposition technique. In this case, summation inside each block can again be carried out in a deterministic sequential order without ReproBLAS. Then idxd.h and idxd_MPI.h can be used to aggregate the intermediate values and produce the final reproducible result.

No control over data/resources
If there is no assumption about any of the above-mentioned sources, i.e. even the data ordering is not known, then idxdBLAS.h and idxd_MPI.h are needed to guarantee reproducibility.
Performance
ReproBLAS is comparable in speed to commercially available optimized non-reproducible BLAS. For example, in the case of a dot product, the slowdown of ReproBLAS compared to a performance-optimized non-reproducible dot product is 4x on a single Intel "Sandy Bridge" core; here the result is reproducible regardless of how the input vector is permuted. [1] And on a large scale system of more than 512 Intel "Ivy Bridge" cores (the Edison machine at NERSC), the slowdown is less than 1.2x for the summation of 1,000,000 double precision floating point numbers [2]; here the result is also reproducible regardless of how the input vector is partitioned among nodes as well as how the local input vector is stored within a node.
Status
The current version only supports BLAS that perform matrix-matrix multiply, matrix-vector multiply, and reductions (namely summation, absolute value summation, dot product, and 2-norm) for the 4 basic data types: double, float, double complex, and float complex.
Future versions (under development) will include:
- The higher level BLAS routines trsv, trsm
- Multithreaded mode (OpenMP for reproBLAS.h)
- Distributed memory routines (MPI)
Details about the development status of ReproBLAS can be checked here.
People
Primary contributors to ReproBLAS include:
Contact
We appreciate any feedback, bug reports, suggestions, or comments that you might have to help us improve the product.
Please email us at:
- peter.ahrens@berkeley.edu
- hdnguyen@eecs.berkeley.edu
- demmel@berkeley.edu
Links
Publications

P. Ahrens, H. D. Nguyen and J. Demmel
Efficient Reproducible Floating Point Summation and BLAS
EECS Department, UC Berkeley, Technical Report No. UCB/EECS-2016-121, June 18, 2016.

J. Demmel and H. D. Nguyen
Reproducible Tall-Skinny QR Factorization
In the proceedings of ARITH 22, Lyon, France, June 22-26, 2015

J. Demmel and H. D. Nguyen
Parallel Reproducible Summation
IEEE Transactions on Computers, v. 64, i. 7, July 2015. DOI: 10.1109/TC.2014.2345391

J. Demmel and H. D. Nguyen
Numerical Accuracy and Reproducibility at ExaScale
[invited paper]
(Special Session on Exascale Computing: Managing Computation, Precision, Accuracy, and Performance on Exascale Systems)
ARITH 21, Austin, Texas, April 7-10, 2013

J. Demmel and H. D. Nguyen
Fast Reproducible Floating-Point Summation
(slides)
ARITH 21, Austin, Texas, April 7-10, 2013
Talks

Toward Hardware Support for Reproducible Floating-Point Computation
SCAN-2014, 16th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic and Validated Numerics, Würzburg, Germany, Sep 21-26, 2014

ReproBLAS: Reproducible BLAS
(Workshop on Obtaining Bitwise Reproducible Results: Perspectives and Latest Advances)
SC'13, Denver, CO, Nov 18-22, 2013

Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems
(Minisymposium on Reproducible and Repeatable Computations)
SIAM Annual Meeting 13, July 8-12, 2013

Numerical Accuracy and Reproducibility at ExaScale
(Special Session on Exascale Computing: Managing Computation, Precision, Accuracy, and Performance on Exascale Systems)
ARITH 21, Austin, Texas, April 7-10, 2013

Fast Reproducible Floating-Point Summation
ARITH 21, Austin, Texas, April 7-10, 2013
Class Projects

Peter Ahrens
Reproducible Parallel Matrix-Vector Multiply
CS 267 Final Report, UC Berkeley, May 11, 2015.
Acknowledgements
This research is supported in part by
NSF grants NSF OCI-1032639 and NSF ACI-1339676;
DOE grants DOE DE-SC0010200, DOE DE-SC0003959, DOE DE-SC0005136, DOE DE-SC0008699, DOE DE-SC0008700, DOE DE-SC0004938, and DOE DE-AC02-05CH11231;
DARPA grant HR0011-12-2-0016;
and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, and Oracle.
Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.
Last update 2/21/2016