What are ReproBLAS
?
ReproBLAS
stands for Reproducible Basic Linear Algebra Subprograms.
By reproducibility, we mean getting bitwise identical results from multiple runs of a program on the same input.
Floating point summation is not associative because of roundoff errors, so the computed sum depends on the order of summation. Modern processors, which may dynamically assign resources, such as variable numbers of processors, do not guarantee the same order of summation from runtorun of the same program or subroutine. Reproducibility
is needed for debugging, verification and validation, correctness, and even contractual obligations.
ReproBLAS
aims at providing users with a set of (Parallel and Sequential) Basic Linear Algebra Subprograms
that guarantee reproducibility regardless of the number of processors, of the data partitioning,
of the way reductions are scheduled, and more generally of the order in which the sums are computed.
We assume only that
 Floating point numbers are binary and correspond to the IEEE Floating Point Standard 7542008
 Floating point operations are performed in ROUNDTONEAREST mode (ties may be broken arbitrarily)
 Underflow occurs gradually (denormal numbers must be supported). It is possible to adapt the algorithms to handle abrupt underflow as well.
^{[1]}
The
BLAS are commonly used in scientific programs,
and the reproducible versions provided in the
ReproBLAS
will provide high performance while reducing user effort for debugging,
correctness checking and understanding the reliability of programs.
Main Goals
In order to support computing environments ranging from multicore laptops to highly parallel systems,
including PetaScale and Cloud Computing, the main goals of the ReproBLAS
are:

Reproducible summation, independent of summation order, assuming only a subset of the IEEE 754 Floating Point Standard .
This is our primary goal. In particular, we only use binary arithmetic (any precision) with the default roundtonearesteven rounding mode. We need to make some assumptions on the maximum number of summands, which turn out to be \(2^{64}\) and \(2^{33}\) for double and single precision, resp., which is more than enough in practice.

Accuracy at least as good as conventional summation, and tunable .
The internal representation used by ReproBLAS lets us choose the desired precision. Our default in double precision is to carry at least 80 bits of precision. Our algorithm has an error bound of
\[
error \leq n 2^{80} \maxx_j + 7 \epsilon\sum\limits_{j = 0}^{n  1} x_j
\]
where \(\epsilon = 2^{53}\).

Handle overflow, underflow, and other exceptions reproducibly .
It is easy to construct small sets of finite floating point numbers,
where depending on the order of summation, one might compute
+Inf
(positive infinity), Inf
, NaN
(NotaNumber), 0 or 1 (the correct answer);
Our algorithm guarantees that no overflows can occur until the final
rounding to a single floating point number (assuming the above bounds on the number
of summands). Summands that are +Inf
, Inf
, or NaN
propagate in a reproducible
way to the final sum.
Underflows are handled reproducibly by assuming default IEEE 754
gradual underflow semantics, but abrupt underflow may be handled as well.

One readonly pass over the summands .
This property is critical for efficient implementation of the BLAS, where
summands (like \(A_{ik} \cdot B_{kj}\) in matrix multiply \(C = A \cdot B\))
are computed once on the fly and discarded immediately.

One reduction .
In parallel environments, the cost of
a parallel reduction may dominate the total cost of a sum, so
analogously to performing a single readonly pass over the data,
it is important to be able to perform one parallel reduction
operation (processing the data in any order) to compute a reproducible sum.

Use as little memory as possible, to enable tiling .
Many existing algorithms for reproducible summation represent
intermediate sums by a data structure that we called a
``reproducible accumulator'' above. BLAS3 operations like matrixmultiplication
require many such reproducible accumulators in order to exploit
common optimizations like tiling; otherwise they are forced to
run at the much slower speeds of BLAS1 or BLAS2.
In order to fit as many of these reproducible accumulators
into the available fast memory as needed, they need to be as small as possible.
Our defaultsized reproducible accumulator occupies 6 double
precision floating point words, which is small enough for these
optimization purposes.

Usability.
ReproBLAS is composed of basic operations that can be applied in many situations.
Structure
reproBLAS.h
binnedBLAS.h
binned.h
binnedMPI.h
The ReproBLAS
are structured so that different layers may be used,
depending on what sources of nonreproducibility are present.
We first describe the layers, and then how to use them.
At the core of ReproBLAS
is the package binned.h
which defines a new data type, the Binned Floating Point Number,
as well as lowerlevel functions to perform reproducible arithmetic operations on binned
and native floating point data types.
binned.h
provides the basic operations (conversion, addition, negation, allocation, etc.) for other modules in the library. However, even with just binned.h
, one can easily implement reproducible operations, no matter how the input values are ordered.
The package binnedMPI.h
provides special reduction operators and datatypes for binned numbers in an MPI environment, allowing the user to
to reproducibly compute the sum of binned or native floating point numbers
distributed across a number of processors
regardless of the reduction tree shape used by MPI.
Package binnedBLAS.h
provides performanceoptimized sequential versions of the BLAS
which take in vectors of native floating point data types and return a result in
binned floating point format.
The package reproBLAS.h
is built on top of binnedBLAS.h
and returns reproducible results in native floating point data types
instead of binned floating point format.
Currently, reproBLAS.h
only supports singlecore computations.
In future versions, reproBLAS.h
may have a multithreaded mode. Future versions of ReproBLAS may also include a separate module for distributed memory BLAS.
Usage
As stated above, nonreproducibility occurs because roundoff makes
floating point addition nonassociative,
so the answer depends on the order of summation.
There are a variety of reasons the order of summation can vary from run to run, including changing the
 Data partitioning across processors
 Number of available processors
 Choice of instructions (eg. 2 doubles for SSE instructions vs. 4 doubles for AVX instructions)
 Data alignment in memory (which may affect the order of loading data into multiword registers)
 Reduction tree shape, which could be scheduled statically or dynamically
 Order of the input data
Depending on the level of control over these sources of nonreproducibility,
i.e. which ones the user can rule out,
the user can use the fewest layers of the ReproBLAS
necessary to obtain reproducibility at a minimum cost.
Here are some examples:

Everything is the same from run to run
The data ordering is fixed, number of processors is fixed,
the data partitioning is fixed,
the choice of instructions is fixed (eg. using SSE instructions only),
the data alignment is fixed,
and the reduction tree shape is fixed and scheduled deterministically,
then the computed results will be be reproducible:
There is no need to use ReproBLAS
.

Fixed Data Ordering and Partitioning
If the number of processors is fixed and the data is partitioned in a deterministic way,
with the same order of data on each processor,
then the local summations may be carried out in a deterministic sequential order without the ReproBLAS
.
The only source of nonreproducibility is the reduction tree shape, so a final reproducible
can be computed using only binned.h
and binnedMPI.h
package.

Fixed Data Ordering and Layout
There is no assumption about the number of processors, or reduction tree shape and scheduling.
However the data is laid out in blocks of both fixed size and fixed ordering inside each block.
This is sometimes referred as an overdecomposition technique.
In this case, summation inside each block can again be carried out
in a deterministic sequential order without the ReproBLAS
.
Then binned.h
and binnedMPI.h
can be used to aggregate the intermediate values
and produce the final reproducible result.

No control over data/resources
If there is no assumption about any of the above mentioned sources,
i.e. even the data ordering is not known,
then binnedBLAS.h
and binnedMPI.h
are needed to guarantee reproducibility.
Performance
ReproBLAS is comparable in speed to commercially available optimized nonreproducible BLAS.
Using our default settings, reproducibly summing n floating point types requires approximately 9n floating point operations (arithmetic, comparison, and absolute value). In theory, this count can be reduced to 5n using the new "augmented addition" and "maximum magnitude" instructions in the proposed IEEE Floating Point Standard 7542018.
For example, in the case of a dot product the slowdown of the ReproBLAS
compared to a performanceoptimized nonreproducible dot product is 4x
on a single Intel "Sandy Bridge" core;
here the result is reproducible regardless of how the input vector is permuted.
^{[1]}
And on a large scale system of more than 512 Intel "Ivy Bridge" cores
(the Edison machine at NERSC),
the slowdown is less than 1.2x for the summation of 1,000,000
double precision floating point numbers
^{[2]};
here the result is also reproducible regardless of how the input vector is partitioned among nodes
as well as how the local input vector is stored within a node.
Status
The current version only supports BLAS that perform matrixmatrix multiply, matrixvector multiply, and reductions (namely summation, absolute value summation,
dot product, and 2norm) for the 4 basic types of data:
double
, float
,
double complex
, and float complex
.
Future versions (under development) will include:
 The higher level BLAS routines
trsv, trsm
 Multithreaded mode (OpenMP for
reproBLAS.h
)
 Distributed memory routines (MPI)
Details about development status of
ReproBLAS
can be checked
here.
People
Primary contributors to
ReproBLAS
include:
Contact
We appreciate any possible feedback, bug reports, suggestions, or comments that you might have
to improve the product.
Please email us at:
 willow@csail.mit.edu
 hdnguyen@eecs.berkeley.edu
 demmel@berkeley.edu
Publications

W. Ahrens, H.D. Nguyen and J. Demmel
Effcient Reproducible Floating Point Summation and BLAS
EECS Department, UC Berkeley, Technical Report No. UCB/EECS2016121, June 18, 2016.

J. Demmel and H. D. Nguyen
Reproducible TallSkinny QR Factorization
In the proceedings of ARITH 22, Lyon, France, June 2226, 2015

J. Demmel and H. D. Nguyen
Parallel Reproducible Summation
IEEE Transactions on Computer, v.64, i. 7, July 2015. DOI: 10.1109/TC.2014.2345391

J. Demmel and H. D. Nguyen
Numerical Accuracy and Reproducibility at ExaScale
[invited Paper]
(Special Session on ExaScale Compputing: Managing Computation, Precision, Accuracy, and Performance on ExaScale Systems)
ARITH 21, Austin, Texas, April 710, 2013

J. Demmel and H. D. Nguyen
Fast Reproducible FloatingPoint Summation
(slides )
ARITH 21, Austin, Texas, April 710, 2013
Talks

Reproducible BLAS (Basic Linear Algebra Subprograms)
(BirdsofaFeather Session on: Reproducibility of High Performance Codes and Simulations: Tools, Techniques, Debugging)
SC 2015, Austin, TX, Nov 1520, 2015

Toward hardware support for Reproducible FloatingPoint Computation
SCAN 2014, 16th GAMMIMACS International Symposium on Scientific Computing, Computer Arithmetic and Validated Numerics, Wurzburg, Germany, Sep 2126, 2014

ReproBLAS: Reproducible BLAS
(Workshop on: Obtaining Bitwise Reproducible Results: Perspectives and Latest Advances)
SC 2013, Denver, CO, Nov 1822, 2013

Efficient Reproducible FloatingPoint Reduction Operations on Large Scale Systems
(Minisymposium on: Regarding Reproducible and Repeatable Computations)
SIAM Annual Metting 2013, July 812, 2013

Numerical Accuracy and Reproducibility at ExaScale
(Special Session on ExaScale Compputing: Managing Computation, Precision, Accuracy, and Performance on ExaScale Systems)
ARITH 21, Austin, Texas, April 710, 2013

Fast Reproducible FloatingPoint Summation
ARITH 21, Austin, Texas, April 710, 2013
Class Projects

Willow Ahrens
Reproducible Parallel MatrixVector Multiply
CS 267 Final Report, UC Berkeley, May 11, 2015.
Related Links
Related References
Bibliography
Ahrens, W., J. Demmel, and H. D. Nguyen. 2016. “Efficient Reproducible Floating Point Summation and BLAS.” UCB/EECS2016121. EECS Department, University of California, Berkeley.
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS2016121.html.
Ansel, J., S. Kamil, K. Veeramachaneni, J. RaganKelley, J. Bosboom, U. O’Reilly, and S. Amarasinghe. 2014. “OpenTuner: An Extensible Framework for Program Autotuning.” In
International Conference on Parallel Architectures and Compilation Techniques (PACT), 303–316. PACT ’14. New York, NY, USA: ACM.
https://doi.org/10.1145/2628071.2628092.
Arteaga, A., O. Fuhrer, and T. Hoefler. 2014. “Designing BitReproducible Portable HighPerformance Applications.” In
International Parallel and Distributed Processing Symposium (IPDPS), 1235–44.
https://doi.org/10.1109/IPDPS.2014.127.
Chohra, C., P. Langlois, and D. Parello. 2015. “Efficiency of Reproducible Level 1 BLAS.” In
Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN), 99–108. Lecture Notes in Computer Science. Springer, Cham.
https://doi.org/10.1007/9783319317694_8.
———. 2016a. “Reproducible, Accurately Rounded and Efficient BLAS.” In
EuroPar Parallel Processing Workshops, 609–20. Lecture Notes in Computer Science. Springer, Cham.
https://doi.org/10.1007/9783319589435_49.
———. 2016b. “Parallel Experiments with RAREBLAS.” In
Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 135–38.
https://doi.org/10.1109/SYNASC.2016.032.
Collange, S., D. Defour, S. Graillat, and R. Iakymchuk. 2015. “Numerical Reproducibility for the Parallel Reduction on Multi and ManyCore Architectures.”
Parallel Computing 49 (November): 83–97.
https://doi.org/10.1016/j.parco.2015.09.001.
Dekker, T. J. 1971. “A FloatingPoint Technique for Extending the Available Precision.”
Numerische Mathematik 18 (3): 224–42.
https://doi.org/10.1007/BF01397083.
Demmel, J., D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. 2013. “CommunicationOptimal Parallel Recursive Rectangular Matrix Multiplication.” In
International Symposium on Parallel and Distributed Processing (SPDP), 261–72.
https://doi.org/10.1109/IPDPS.2013.80.
Demmel, J., M. Gates, G. Henry, X.S. Li, J. Riedy, and P.T.P. Tang. n.d. “A Proposal for a NextGeneration BLAS.”
https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit?usp=sharing.
Demmel, J., G. Gopalakrishnan, M. Heroux, W. Keyrouz, and K. Sato. 2015. “Reproducibility of High Performance Codes and Simulations: Tools, Techniques, Debugging.” In
SC 2015 Birds of a Feather Sessions. Austin, TX, United States.
https://gcl.cis.udel.edu/sc15bof.php.
Demmel, J., and Y. Hida. 2004. “Accurate and Efficient Floating Point Summation.”
SIAM Journal on Scientific Computing 25 (4): 1214–48.
https://doi.org/10.1137/S1064827502407627.
Demmel, J., and H. D. Nguyen. 2013. “Fast Reproducible FloatingPoint Summation.” In
Symposium on Computer Arithmetic (ARITH), 163–72.
https://doi.org/10.1109/ARITH.2013.9.
———. 2015. “Parallel Reproducible Summation.”
IEEE Transactions on Computers 64 (7): 2060–70.
https://doi.org/10.1109/TC.2014.2345391.
Diethelm, K. 2010. “Reproducible Parallel Solvers for Linear Systems.” NA Digest, V. 10, # 36. September 1, 2010.
http://www.netlib.org/nadigesthtml/10/v10n36.html#1.
Dongarra, J., S. Hammarling, N. Higham, S. D. Relton, P. ValeroLara, and M. Zounon. 2017. “The Design and Performance of Batched BLAS on Modern HighPerformance Computing Systems.”
Procedia Computer Science, International Conference on Computational Science, ICCS 2017, 1214 June 2017, Zurich, Switzerland, 108 (January): 495–504.
https://doi.org/10.1016/j.procs.2017.05.138.
Dukhan, M., R. Vuduc, and J. Riedy. 2016. “Wanted: FloatingPoint Add Roundoff Error Instruction.”
ArXiv:1603.00491 [Cs], March.
http://arxiv.org/abs/1603.00491.
Geijn, R. A. Van De, and J. Watts. 1997. “SUMMA: Scalable Universal Matrix Multiplication Algorithm.”
Concurrency: Practice and Experience 9 (4): 255–74.
https://doi.org/10.1002/(SICI)10969128(199704)9:4<255::AIDCPE250>3.0.CO;22.
Hida, Y., X. S. Li, and D. H. Bailey. 2001. “Algorithms for QuadDouble Precision Floating Point Arithmetic.” In
Symposium on Computer Arithmetic (ARITH), 155–62.
https://doi.org/10.1109/ARITH.2001.930115.
Higham, N. 1993. “The Accuracy of Floating Point Summation.”
SIAM Journal on Scientific Computing 14 (4): 783–99.
https://doi.org/10.1137/0914050.
———. 2002.
Accuracy and Stability of Numerical Algorithms. 2nd ed. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics.
https://doi.org/10.1137/1.9780898718027.
Iakymchuk, R., S. Collange, D. Defour, and S. Graillat. 2015. “ExBLAS: Reproducible and Accurate BLAS Library.” In
SC 2015 Numerical Reproducibility at Exascale Workshops (NRE). Austin, TX, United States.
https://hal.archivesouvertes.fr/hal01202396.
Iakymchuk, R., D. Defour, S. Collange, and S. Graillat. 2015a. “Reproducible Triangular Solvers for HighPerformance Computing.” In
International Conference on Information Technology  New Generations (ITNG), 353–58.
https://doi.org/10.1109/ITNG.2015.63.
———. 2015b. “Reproducible and Accurate Matrix Multiplication.” In
Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN), 126–37. Lecture Notes in Computer Science. Springer, Cham.
https://doi.org/10.1007/9783319317694_11.
“IEEE Standard for FloatingPoint Arithmetic.” 2008.
IEEE Std 7542008, August, 1–70.
https://doi.org/10.1109/IEEESTD.2008.4610935.
———.
IEEE Standard for FloatingPoint Arithmetic. 2018. New York, NY: Institute of Electrical and Electronics Engineers.
http://754r.ucbtest.org/drafts/P754238.pdf.
Kahan, W. 1965. “Pracniques: Further Remarks on Reducing Truncation Errors.”
Communications of the ACM 8 (1): 40–.
https://doi.org/10.1145/363707.363723.
Knuth, D. E. 1969. The Art of Computer Programming 2: Seminumerical Algorithms. Reading, Mass.: AddisonWesley.
Kornerup, P., V. Lefevre, N. Louvet, and J. Muller. 2012. “On the Computation of Correctly Rounded Sums.”
IEEE Transactions on Computers 61 (3): 289–98.
https://doi.org/10.1109/TC.2011.27.
Kulisch, U. 2012. Computer Arithmetic and Validity: Theory, Implementation, and Applications. 2nd ed. Walter de Gruyter.
Lefèvre, V. 2017. “Correctly Rounded ArbitraryPrecision FloatingPoint Summation.”
IEEE Transactions on Computers 66 (12): 2111–24.
https://doi.org/10.1109/TC.2017.2690632.
Li, X. S., J. W. Demmel, D. H. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, et al. 2002. “Design, Implementation and Testing of Extended and Mixed Precision BLAS.”
ACM Transactions on Mathematical Software 28 (2): 152–205.
https://doi.org/10.1145/567806.567808.
Lutz, D. R., and C. N. Hinds. 2017. “HighPrecision Anchored Accumulators for Reproducible FloatingPoint Summation.” In
Symposium on Computer Arithmetic (ARITH), 98–105.
https://doi.org/10.1109/ARITH.2017.20.
Nguyen, H. D., and J. Demmel. 2015. “Reproducible TallSkinny QR.” In
Symposium on Computer Arithmetic (ARITH), 152–59.
https://doi.org/10.1109/ARITH.2015.28.
Rump, S. M. 2009. “Ultimately Fast Accurate Summation.”
SIAM Journal on Scientific Computing 31 (5): 3466–3502.
https://doi.org/10.1137/080738490.
Rump, S. M., T. Ogita, and S. Oishi. 2010. “Fast High Precision Summation.”
Nonlinear Theory and Its Applications, IEICE 1: 2–24.
https://doi.org/10.1587/nolta.1.2.
Rump, S., T. Ogita, and S. Oishi. 2008a. “Accurate FloatingPoint Summation Part I: Faithful Rounding.”
SIAM Journal on Scientific Computing 31 (1): 189–224.
https://doi.org/10.1137/050645671.
———. 2008b. “Accurate FloatingPoint Summation Part II: Sign, KFold Faithful and Rounding to Nearest.”
SIAM Journal on Scientific Computing 31 (2): 1269–1302.
https://doi.org/10.1137/07068816X.
Solomonik, E., and J. Demmel. 2011. “CommunicationOptimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms.” In
International Conference on Parallel Processing (ICPP), Part II:90–109. EuroPar’11. Berlin, Heidelberg: SpringerVerlag.
https://doi.org/10.1007/9783642233975_10.
Acknowledgements
This research is supported in part by
NSF grants NSF OCI1032639, NSF ACI1339676,
DOE grants DOE DESC0010200, DOE DESC0003959, DOE DESC0005136, DOE DESC0008699, DOE DESC0008700, DOE DESC0004938,
DOE AC0205CH11231,
and DARPA grant HR00111220016,
ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, and Oracle.
Any opinions, findings, conclusions, or recommendations in this paper are solely
those of the authors and does not necessarily reflect the position or the policy of the sponsors.
Last update August 17, 2018.