The following is a list of titles and abstracts published in the current reporting period.

---------------------------------------------------------------

Title: SPARSITY: An Optimization Framework for Sparse Matrix Kernels (Submitted to the International Journal of High-Performance Computing Applications)

Authors: Eun-Jin Im, Katherine Yelick, Richard Vuduc

Abstract: Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. Sparsity combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block sizes, using a combination of performance models and search. In this paper, we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register-level optimizations are effective for matrices arising in certain scientific simulations, in particular, finite-element problems. Cache-level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 2x for the single-vector case and up to 6x for the multiple-vector case.

---------------------------------------------------------------

Title: Statistical Models in Automatic Performance Tuning Systems (Accepted, International Journal of High Performance Computing Applications: Special Issue on Automatic Performance Tuning)

Authors: Richard Vuduc, Jeff Bilmes, James Demmel

Abstract: Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible implementations of a subroutine, and (2) selecting the fastest implementation by an exhaustive, empirical search. This paper presents quantitative data that motivates the development of such a search-based system, and discusses two problems which arise in the context of search. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We address both problems by using statistical techniques to exploit the large amount of performance data collected during the search. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.
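To make the search problem concrete, the following C sketch illustrates one simple form of early stopping for an empirical, compile-time search. It is not the statistical stopping rule developed in the paper above; it is a simplified stand-in that halts once a run of sampled candidates fails to improve significantly on the current best. The routine benchmark_candidate and all constants are hypothetical placeholders.

    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_CANDIDATES 1000   /* size of the implementation space (assumed) */
    #define WINDOW         50     /* samples without improvement before stopping */
    #define EPSILON        0.01   /* relative improvement considered significant */

    /* Stand-in benchmark: a synthetic performance curve used only so this
       sketch runs; a real tuning system would generate, compile, and time
       candidate implementation i and report its speed in Mflop/s. */
    static double benchmark_candidate(int i)
    {
        return 100.0 + (i % 37) * 10.0 + (rand() % 100) * 0.1;
    }

    static int search_best_candidate(void)
    {
        double best_mflops = 0.0;
        int best_i = -1, since_improved = 0;

        for (int t = 0; t < NUM_CANDIDATES && since_improved < WINDOW; t++) {
            int i = rand() % NUM_CANDIDATES;        /* sample one implementation */
            double mflops = benchmark_candidate(i);

            if (mflops > best_mflops * (1.0 + EPSILON)) {
                best_mflops = mflops;               /* significantly better */
                best_i = i;
                since_improved = 0;
            } else {
                since_improved++;
            }
        }
        printf("selected candidate %d at %.1f Mflop/s\n", best_i, best_mflops);
        return best_i;
    }

    int main(void)
    {
        srand(42);                                  /* reproducible sampling */
        search_best_candidate();
        return 0;
    }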
---------------------------------------------------------------

Title: Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T*A*x (Submitted, Workshop on Parallel Linear Algebra, ICCS 2003)

Authors: Richard Vuduc, Attila Gyulassy, James Demmel, Katherine Yelick

Abstract: We present uniprocessor performance optimizations, automatic tuning techniques, and a detailed experimental analysis of the sparse matrix operation y = A^T*A*x, where A is a sparse matrix and x, y are dense vectors. We first describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply (SpMxV). We then evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2x. Moreover, we present platform-specific upper bounds on the performance of this kernel. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant payoff. Finally, we present a hybrid off-line/run-time heuristic which in practice automatically selects optimal (or near-optimal) values of the key tuning parameters, the register block sizes. There are at least three implications of this work. First, the effectiveness of our optimizations and the utility of this kernel suggest that sparse A^T*A*x should be a basic primitive in sparse matrix libraries. Second, our upper bound analysis shows that there is an opportunity to apply automatic low-level tuning methods, in the spirit of the ATLAS/PHiPAC tuning systems for dense linear algebra, to further improve the performance of this kernel. Third, the success of our heuristic provides additional validation of the Sparsity tuning methodology.
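The "single pass over A" idea in the abstract above can be sketched directly: each row a_i of A is read once, used to form the dot product t = a_i . x, and then immediately used again to accumulate y += t * a_i while it is still in cache. The following C routine is a minimal illustration, assuming A is stored in compressed sparse row (CSR) format and omitting the register blocking that the paper combines with it; the function and variable names are ours, not the paper's.

    #include <stdio.h>

    /* One-pass computation of y = A^T*(A*x) for an m-by-n CSR matrix:
       ptr holds m+1 row pointers, ind the column indices, val the values. */
    void ata_times_x_csr(int m, int n,
                         const int *ptr, const int *ind, const double *val,
                         const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            y[j] = 0.0;

        for (int i = 0; i < m; i++) {
            double t = 0.0;
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* t = a_i . x */
                t += val[k] * x[ind[k]];
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* y += t * a_i */
                y[ind[k]] += t * val[k];
        }
    }

    int main(void)
    {
        /* 2x3 example: A = [1 0 2; 0 3 0] */
        int ptr[] = {0, 2, 3};
        int ind[] = {0, 2, 1};
        double val[] = {1.0, 2.0, 3.0};
        double x[] = {1.0, 1.0, 1.0}, y[3];

        ata_times_x_csr(2, 3, ptr, ind, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);   /* expected: 3 9 6 */
        return 0;
    }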
---------------------------------------------------------------

Title: Matrix Splitting and Reordering for Sparse Matrix-Vector Multiply (Submitted as a U.C. Berkeley Technical Report)

Authors: Hyun Jin Moon, Richard Vuduc, James Demmel, Katherine Yelick

Abstract: We present experimental results from applying matrix reordering and splitting techniques to improve the performance of the sparse matrix-vector multiply (SpMxV) operation, y <-- y + Ax, where A is a sparse matrix and x, y are dense vectors. We build on the prior work of the Sparsity system, which proposed a register-level blocking optimization to exploit naturally occurring dense blocks in A. In particular, we consider two performance optimization techniques. First, we apply two-level splittings of A, A = A_1 + A_2, which better exploit variable block structure by allowing A_1 and A_2 to use different block sizes. We consider the case when A_1 is blocked but A_2 is unblocked. We furthermore relax a Sparsity requirement that blocks be aligned on a particular column boundary, and study the effects. Second, we consider reordering matrix rows and columns to create dense block structure. Specifically, we apply a method due to Pinar based on formulating the reordering problem as a traveling salesman problem (TSP). When these techniques are combined, we demonstrate speedups of up to 2x over Sparsity-optimized implementations of SpMxV on a benchmark suite of 44 matrices and 5 evaluation platforms. Finally, we present a preliminary evaluation of a heuristic for choosing a block size, given that TSP reordering either will or will not be applied. This heuristic selects a block size yielding performance, in Mflop/s, within 5% of the best block size.

---------------------------------------------------------------

Title: Accurate Floating Point Summation (Submitted to SIAM Journal on Scientific Computing)

Authors: Yozo Hida, James Demmel

Abstract: We present and analyze several simple algorithms for accurately summing n floating point numbers S = s_1 + ... + s_n, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the s_i. We assume a register is available with F > f significant bits. Then, assuming that (1) n <= floor(2^{F-f}/(1-2^{-f})) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the s_i in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor(2^{F-f}/(1-2^{-f})) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors x_1 * y_1 + ... + x_n * y_n computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used, n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.

---------------------------------------------------------------

Title: Fast and Accurate Floating Point Summation with Application to Computational Geometry (Submitted to 10th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic, and Validated Numerics)

Authors: Yozo Hida and James Demmel

Abstract: We present several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then, assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to floor(2^{F-f}/(1-2^{-f})) + 1 numbers can be accurately added by just summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place. In particular, if the sum is zero, it is computed exactly. We apply this result to the floating point formats in the IEEE floating point standard, and investigate its performance. Our results show that in the absence of massive cancellation (the most common case) the cost of guaranteed accuracy is about 30-40% more than that of straightforward summation. If massive cancellation does occur, the cost of computing the accurate sum is higher by about a factor of ten. Finally, we apply our algorithm to computing a robust geometric predicate (used in computational geometry), where our accurate summation algorithm improves the existing algorithm by a factor of two on a nearly coplanar set of points.
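The summation algorithm analyzed in the two abstracts above is simple enough to sketch. The following C code is a minimal illustration for single-precision summands accumulated in double precision (f = 24, F = 53), with the terms sorted in decreasing order of magnitude; the error bound and the limits on n quoted above come from the papers' analysis, not from this code.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Order summands by decreasing absolute value. */
    static int cmp_desc_magnitude(const void *a, const void *b)
    {
        float x = fabsf(*(const float *)a), y = fabsf(*(const float *)b);
        return (x < y) - (x > y);
    }

    /* Sorted summation in a wider accumulator (F = 53 > f = 24). */
    float accurate_sum(float *s, size_t n)
    {
        qsort(s, n, sizeof *s, cmp_desc_magnitude);
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += s[i];
        return (float)acc;           /* one final rounding to the f-bit format */
    }

    int main(void)
    {
        /* Massive cancellation: the large terms cancel, the small ones survive. */
        float s[] = {1.0e8f, 3.0f, -1.0e8f, 1.0f};
        printf("%.9g\n", accurate_sum(s, 4));   /* prints 4 */
        return 0;
    }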
---------------------------------------------------------------

Title: Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply (Published, Proceedings of Supercomputing 2002)

Authors: Richard Vuduc, James Demmel, Katherine Yelick, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Abstract: We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMxV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpMxV when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2x are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpMxV performance (Mflop/s) by as much as 2.5x over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpMxV, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of A^T*A by a vector).
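Register blocking, the optimization at the center of the SPARSITY, splitting, and bounds abstracts above, can be illustrated with a small sketch. The following C routine computes y <-- y + A*x for a matrix stored in 2x2 block compressed sparse row (BCSR) format, keeping the two output values of each block row in registers across the inner loop. The actual Sparsity-generated kernels are specialized and unrolled per machine and per (r x c) block size, so this fixed 2x2 version and its names are illustrative only.

    #include <stdio.h>

    /* y <-- y + A*x for A in 2x2 BCSR: brow_ptr has mb+1 block-row pointers,
       bcol_ind the block column indices, and bval 4 values per block
       (row-major, explicit zeros included). */
    void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_ind,
                       const double *bval, const double *x, double *y)
    {
        for (int I = 0; I < mb; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];      /* outputs stay in registers */
            for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
                const double *b = &bval[4*k];
                int j = 2 * bcol_ind[k];
                double x0 = x[j], x1 = x[j + 1];
                y0 += b[0]*x0 + b[1]*x1;              /* top row of the block    */
                y1 += b[2]*x0 + b[3]*x1;              /* bottom row of the block */
            }
            y[2*I] = y0;
            y[2*I + 1] = y1;
        }
    }

    int main(void)
    {
        /* 4x4 matrix with two 2x2 blocks on the diagonal:
           B0 = [1 2; 3 4], B1 = [5 0; 0 6] (zeros stored explicitly). */
        int brow_ptr[] = {0, 1, 2};
        int bcol_ind[] = {0, 1};
        double bval[] = {1, 2, 3, 4,  5, 0, 0, 6};
        double x[] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};

        spmv_bcsr_2x2(2, brow_ptr, bcol_ind, bval, x, y);
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);   /* expected: 3 7 5 6 */
        return 0;
    }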
---------------------------------------------------------------

Title: Automatic Performance Tuning and Analysis of Sparse Triangular Solve (Published, Workshop on Performance of High-level Languages and Libraries, International Conference on Supercomputing 2002)

Authors: Richard Vuduc, Shoaib Kamil, Jen Hsu, James Demmel, Katherine Yelick

Abstract: In this paper, we present performance optimizations and bounds for the uniprocessor solution of the sparse triangular system Lx = y for a single dense vector x, given the lower triangular sparse matrix L and dense vector y. Many of the lower triangular factors arising in sparse LU factorization have a large, dense triangle in the lower right-hand corner of the matrix; this trailing triangle can account for as much as 90% of the matrix non-zeros. Therefore, we consider both algorithmic and data structure reorganizations which partition the solve into a sparse phase and a dense phase. To the sparse phase, we adapt the register blocking optimization, previously proposed for sparse matrix-vector multiply (SpMxV) in the Sparsity system, to the sparse triangular solve (SpTS) kernel; to the dense phase, we make judicious use of highly tuned BLAS routines by switching to a dense implementation (the switch-to-dense optimization). We describe fully automatic hybrid off-line/on-line heuristics for selecting the key tuning parameters: the register block size and the point at which to use the dense algorithm. We then evaluate the performance of our optimized implementations relative to the fundamental limits on performance. Specifically, we first derive simple models of the upper bounds on the execution rate (Mflop/s) of our implementations. Using hardware counter data collected with the PAPI library, we then verify our models on three hardware platforms and a set of triangular factors from applications. We observe that our optimized implementations can achieve 80% or more of these bounds; furthermore, we observe speedups of up to 1.8x when both the register blocking and switch-to-dense optimizations are applied. We also present preliminary results confirming that our heuristics choose reasonable values for the tuning parameters.
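The switch-to-dense idea in the abstract above can also be sketched. The following C routine solves L*x = y in two phases: forward substitution over the leading rows in CSR format, then forward substitution over the trailing rows stored densely. A real implementation would call a tuned BLAS triangular solve for the dense phase and would keep the off-diagonal part of the trailing rows sparse; both are simplified here so the sketch stays self-contained, and the split point d and all names are illustrative, not the paper's.

    #include <stdio.h>

    /* Solve L*x = y.  Rows 0..d-1 are the sparse phase (CSR, diagonal stored
       last in each row); rows d..n-1 are stored densely by row in T (n columns
       each) and form the dense phase.  d is a tuning parameter. */
    void sparse_dense_trsv(int n, int d,
                           const int *ptr, const int *ind, const double *val,
                           const double *T, const double *y, double *x)
    {
        /* Sparse phase: forward substitution over CSR rows 0..d-1. */
        for (int i = 0; i < d; i++) {
            double s = y[i];
            int k;
            for (k = ptr[i]; k < ptr[i + 1] - 1; k++)   /* off-diagonal entries */
                s -= val[k] * x[ind[k]];
            x[i] = s / val[k];                          /* diagonal is last */
        }

        /* Dense phase: forward substitution over the trailing rows. */
        for (int i = d; i < n; i++) {
            double s = y[i];
            const double *row = &T[(i - d) * n];
            for (int j = 0; j < i; j++)
                s -= row[j] * x[j];
            x[i] = s / row[i];
        }
    }

    int main(void)
    {
        /* L = [1 0 0; 2 3 0; 4 5 6], split after row 1 (d = 1). */
        int ptr[] = {0, 1};
        int ind[] = {0};
        double val[] = {1.0};
        double T[] = {2, 3, 0,   4, 5, 6};   /* rows 1 and 2, stored densely */
        double y[] = {1.0, 8.0, 32.0}, x[3];

        sparse_dense_trsv(3, 1, ptr, ind, val, T, y, x);
        printf("%g %g %g\n", x[0], x[1], x[2]);   /* expected: 1 2 3 */
        return 0;
    }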