The following is a list of titles and abstracts published in the current reporting period.

---------------------------------------------------------------

Title: SPARSITY: An Optimization Framework for Sparse Matrix Kernels (Submitted to the International Journal of High-Performance Computing Applications)

Authors: Eun-Jin Im, Katherine Yelick, Richard Vuduc

Abstract: Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. Sparsity combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block sizes, using a combination of performance models and search. In this paper, we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register-level optimizations are effective for matrices arising in certain scientific simulations, in particular, finite-element problems. Cache-level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 2x for the single-vector case and up to 6x for the multiple-vector case.

---------------------------------------------------------------

Title: Statistical Models in Automatic Performance Tuning Systems (Accepted, International Journal of High Performance Computing Applications: Special Issue on Automatic Performance Tuning)

Authors: Richard Vuduc, Jeff Bilmes, James Demmel

Abstract: Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible implementations of a subroutine, and (2) selecting the fastest implementation by an exhaustive, empirical search. This paper presents quantitative data that motivates the development of such a search-based system, and discusses two problems which arise in the context of search. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We address both problems by using statistical techniques to exploit the large amount of performance data collected during the search. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.
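To make the search problem concrete, the following C sketch illustrates one simple form of early stopping for an empirical, compile-time search. It is not the statistical stopping rule developed in the paper above; it is a simplified stand-in that halts once a run of sampled candidates fails to improve significantly on the current best. The routine benchmark_candidate and all constants are hypothetical placeholders.

    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_CANDIDATES 1000   /* size of the implementation space (assumed) */
    #define WINDOW         50     /* samples without improvement before stopping */
    #define EPSILON        0.01   /* relative improvement considered significant */

    /* Stand-in benchmark: a synthetic performance curve used only so this
       sketch runs; a real tuning system would generate, compile, and time
       candidate implementation i and report its speed in Mflop/s. */
    static double benchmark_candidate(int i)
    {
        return 100.0 + (i % 37) * 10.0 + (rand() % 100) * 0.1;
    }

    static int search_best_candidate(void)
    {
        double best_mflops = 0.0;
        int best_i = -1, since_improved = 0;

        for (int t = 0; t < NUM_CANDIDATES && since_improved < WINDOW; t++) {
            int i = rand() % NUM_CANDIDATES;        /* sample one implementation */
            double mflops = benchmark_candidate(i);

            if (mflops > best_mflops * (1.0 + EPSILON)) {
                best_mflops = mflops;               /* significantly better */
                best_i = i;
                since_improved = 0;
            } else {
                since_improved++;
            }
        }
        printf("selected candidate %d at %.1f Mflop/s\n", best_i, best_mflops);
        return best_i;
    }

    int main(void)
    {
        srand(42);                                  /* reproducible sampling */
        search_best_candidate();
        return 0;
    }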
---------------------------------------------------------------

Title: Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T*A*x (Submitted, Workshop on Parallel Linear Algebra, ICCS 2003)

Authors: Richard Vuduc, Attila Gyulassy, James Demmel, Katherine Yelick

Abstract: We present uniprocessor performance optimizations, automatic tuning techniques, and a detailed experimental analysis of the sparse matrix operation y = A^T*A*x, where A is a sparse matrix and x, y are dense vectors. We first describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply (SpMxV). We then evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2x. Moreover, we present platform-specific upper bounds on the performance of this kernel. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant payoff. Finally, we present a hybrid off-line/run-time heuristic which in practice automatically selects optimal (or near-optimal) values of the key tuning parameters, the register block sizes. There are at least three implications of this work. First, the effectiveness of our optimizations and the utility of this kernel suggest that sparse A^T*A*x should be a basic primitive in sparse matrix libraries. Second, our upper bound analysis shows that there is an opportunity to apply automatic low-level tuning methods, in the spirit of the ATLAS/PHiPAC tuning systems for dense linear algebra, to further improve the performance of this kernel. Third, the success of our heuristic provides additional validation of the Sparsity tuning methodology.
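The "single pass over A" idea in the abstract above can be sketched directly: each row a_i of A is read once, used to form the dot product t = a_i . x, and then immediately used again to accumulate y += t * a_i while it is still in cache. The following C routine is a minimal illustration, assuming A is stored in compressed sparse row (CSR) format and omitting the register blocking that the paper combines with it; the function and variable names are ours, not the paper's.

    #include <stdio.h>

    /* One-pass computation of y = A^T*(A*x) for an m-by-n CSR matrix:
       ptr holds m+1 row pointers, ind the column indices, val the values. */
    void ata_times_x_csr(int m, int n,
                         const int *ptr, const int *ind, const double *val,
                         const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            y[j] = 0.0;

        for (int i = 0; i < m; i++) {
            double t = 0.0;
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* t = a_i . x */
                t += val[k] * x[ind[k]];
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* y += t * a_i */
                y[ind[k]] += t * val[k];
        }
    }

    int main(void)
    {
        /* 2x3 example: A = [1 0 2; 0 3 0] */
        int ptr[] = {0, 2, 3};
        int ind[] = {0, 2, 1};
        double val[] = {1.0, 2.0, 3.0};
        double x[] = {1.0, 1.0, 1.0}, y[3];

        ata_times_x_csr(2, 3, ptr, ind, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);   /* expected: 3 9 6 */
        return 0;
    }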
---------------------------------------------------------------

Title: Matrix Splitting and Reordering for Sparse Matrix-Vector Multiply (Submitted as a U.C. Berkeley Technical Report)

Authors: Hyun Jin Moon, Richard Vuduc, James Demmel, Katherine Yelick

Abstract: We present experimental results from applying matrix reordering and splitting techniques to improve the performance of the sparse matrix-vector multiply (SpMxV) operation, y <-- y + Ax, where A is a sparse matrix and x, y are dense vectors. We build on the prior work of the Sparsity system, which proposed a register-level blocking optimization to exploit naturally occurring dense blocks in A. In particular, we consider two performance optimization techniques. First, we apply two-level splittings of A, A = A_1 + A_2, which better exploit variable block structure by allowing A_1 and A_2 to use different block sizes. We consider the case when A_1 is blocked but A_2 is unblocked. We furthermore relax a Sparsity requirement that blocks be aligned on a particular column boundary, and study the effects. Second, we consider reordering matrix rows and columns to create dense block structure. Specifically, we apply a method due to Pinar based on formulating the reordering problem as a traveling salesman problem (TSP). When these techniques are combined, we demonstrate speedups of up to 2x over Sparsity-optimized implementations of SpMxV on a benchmark suite of 44 matrices and 5 evaluation platforms. Finally, we present a preliminary evaluation of a heuristic for choosing a block size, given that TSP reordering either will or will not be applied. This heuristic selects a block size yielding performance, in Mflop/s, within 5% of the best block size.

---------------------------------------------------------------

Title: Accurate Floating Point Summation (Submitted to SIAM Journal on Scientific Computing)

Authors: Yozo Hida, James Demmel

Abstract: We present and analyze several simple algorithms for accurately summing n floating point numbers S = s_1 + ... + s_n, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the s_i. We assume a register is available with F > f significant bits. Then, assuming that (1) n <= floor(2^{F-f}/(1-2^{-f})) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the s_i in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor(2^{F-f}/(1-2^{-f})) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors x_1 * y_1 + ... + x_n * y_n computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used, n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.

---------------------------------------------------------------

Title: Fast and Accurate Floating Point Summation with Application to Computational Geometry (Submitted to 10th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic, and Validated Numerics)

Authors: Yozo Hida and James Demmel

Abstract: We present several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then, assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to floor(2^{F-f}/(1-2^{-f})) + 1 numbers can be accurately added by just summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place. In particular, if the sum is zero, it is computed exactly. We apply this result to the floating point formats in the IEEE floating point standard, and investigate its performance. Our results show that in the absence of massive cancellation (the most common case) the cost of guaranteed accuracy is about 30-40% more than that of straightforward summation. If massive cancellation does occur, the cost of computing the accurate sum is higher by about a factor of ten. Finally, we apply our algorithm to computing a robust geometric predicate (used in computational geometry), where our accurate summation algorithm improves the existing algorithm by a factor of two on a nearly coplanar set of points.
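The summation algorithm analyzed in the two abstracts above is simple enough to sketch. The following C code is a minimal illustration for single-precision summands accumulated in double precision (f = 24, F = 53), with the terms sorted in decreasing order of magnitude; the error bound and the limits on n quoted above come from the papers' analysis, not from this code.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Order summands by decreasing absolute value. */
    static int cmp_desc_magnitude(const void *a, const void *b)
    {
        float x = fabsf(*(const float *)a), y = fabsf(*(const float *)b);
        return (x < y) - (x > y);
    }

    /* Sorted summation in a wider accumulator (F = 53 > f = 24). */
    float accurate_sum(float *s, size_t n)
    {
        qsort(s, n, sizeof *s, cmp_desc_magnitude);
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += s[i];
        return (float)acc;           /* one final rounding to the f-bit format */
    }

    int main(void)
    {
        /* Massive cancellation: the large terms cancel, the small ones survive. */
        float s[] = {1.0e8f, 3.0f, -1.0e8f, 1.0f};
        printf("%.9g\n", accurate_sum(s, 4));   /* prints 4 */
        return 0;
    }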
---------------------------------------------------------------

Title: Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply (Published, Proceedings of Supercomputing 2002)

Authors: Richard Vuduc, James Demmel, Katherine Yelick, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Abstract: We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMxV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpMxV when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2x are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpMxV performance (Mflop/s) by as much as 2.5x over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpMxV, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of A^T*A by a vector).
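Register blocking, the optimization at the center of the SPARSITY, splitting, and bounds abstracts above, can be illustrated with a small sketch. The following C routine computes y <-- y + A*x for a matrix stored in 2x2 block compressed sparse row (BCSR) format, keeping the two output values of each block row in registers across the inner loop. The actual Sparsity-generated kernels are specialized and unrolled per machine and per (r x c) block size, so this fixed 2x2 version and its names are illustrative only.

    #include <stdio.h>

    /* y <-- y + A*x for A in 2x2 BCSR: brow_ptr has mb+1 block-row pointers,
       bcol_ind the block column indices, and bval 4 values per block
       (row-major, explicit zeros included). */
    void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_ind,
                       const double *bval, const double *x, double *y)
    {
        for (int I = 0; I < mb; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];      /* outputs stay in registers */
            for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
                const double *b = &bval[4*k];
                int j = 2 * bcol_ind[k];
                double x0 = x[j], x1 = x[j + 1];
                y0 += b[0]*x0 + b[1]*x1;              /* top row of the block    */
                y1 += b[2]*x0 + b[3]*x1;              /* bottom row of the block */
            }
            y[2*I] = y0;
            y[2*I + 1] = y1;
        }
    }

    int main(void)
    {
        /* 4x4 matrix with two 2x2 blocks on the diagonal:
           B0 = [1 2; 3 4], B1 = [5 0; 0 6] (zeros stored explicitly). */
        int brow_ptr[] = {0, 1, 2};
        int bcol_ind[] = {0, 1};
        double bval[] = {1, 2, 3, 4,  5, 0, 0, 6};
        double x[] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};

        spmv_bcsr_2x2(2, brow_ptr, bcol_ind, bval, x, y);
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);   /* expected: 3 7 5 6 */
        return 0;
    }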
---------------------------------------------------------------

Title: Automatic Performance Tuning and Analysis of Sparse Triangular Solve (Published, Workshop on Performance of High-level Languages and Libraries, International Conference on Supercomputing 2002)

Authors: Richard Vuduc, Shoaib Kamil, Jen Hsu, James Demmel, Katherine Yelick

Abstract: In this paper, we present performance optimizations and bounds for the uniprocessor solution of the sparse triangular system Lx = y for a single dense vector x, given the lower triangular sparse matrix L and dense vector y. Many of the lower triangular factors arising in sparse LU factorization have a large, dense triangle in the lower right-hand corner of the matrix; this trailing triangle can account for as much as 90% of the matrix non-zeros. Therefore, we consider both algorithmic and data structure reorganizations which partition the solve into a sparse phase and a dense phase. To the sparse phase, we adapt the register blocking optimization, previously proposed for sparse matrix-vector multiply (SpMxV) in the Sparsity system, to the sparse triangular solve (SpTS) kernel; to the dense phase, we make judicious use of highly tuned BLAS routines by switching to a dense implementation (the switch-to-dense optimization). We describe fully automatic hybrid off-line/on-line heuristics for selecting the key tuning parameters: the register block size and the point at which to use the dense algorithm. We then evaluate the performance of our optimized implementations relative to the fundamental limits on performance. Specifically, we first derive simple models of the upper bounds on the execution rate (Mflop/s) of our implementations. Using hardware counter data collected with the PAPI library, we then verify our models on three hardware platforms and a set of triangular factors from applications. We observe that our optimized implementations can achieve 80% or more of these bounds; furthermore, we observe speedups of up to 1.8x when both the register blocking and switch-to-dense optimizations are applied. We also present preliminary results confirming that our heuristics choose reasonable values for the tuning parameters.
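The switch-to-dense idea in the abstract above can also be sketched. The following C routine solves L*x = y in two phases: forward substitution over the leading rows in CSR format, then forward substitution over the trailing rows stored densely. A real implementation would call a tuned BLAS triangular solve for the dense phase and would keep the off-diagonal part of the trailing rows sparse; both are simplified here so the sketch stays self-contained, and the split point d and all names are illustrative, not the paper's.

    #include <stdio.h>

    /* Solve L*x = y.  Rows 0..d-1 are the sparse phase (CSR, diagonal stored
       last in each row); rows d..n-1 are stored densely by row in T (n columns
       each) and form the dense phase.  d is a tuning parameter. */
    void sparse_dense_trsv(int n, int d,
                           const int *ptr, const int *ind, const double *val,
                           const double *T, const double *y, double *x)
    {
        /* Sparse phase: forward substitution over CSR rows 0..d-1. */
        for (int i = 0; i < d; i++) {
            double s = y[i];
            int k;
            for (k = ptr[i]; k < ptr[i + 1] - 1; k++)   /* off-diagonal entries */
                s -= val[k] * x[ind[k]];
            x[i] = s / val[k];                          /* diagonal is last */
        }

        /* Dense phase: forward substitution over the trailing rows. */
        for (int i = d; i < n; i++) {
            double s = y[i];
            const double *row = &T[(i - d) * n];
            for (int j = 0; j < i; j++)
                s -= row[j] * x[j];
            x[i] = s / row[i];
        }
    }

    int main(void)
    {
        /* L = [1 0 0; 2 3 0; 4 5 6], split after row 1 (d = 1). */
        int ptr[] = {0, 1};
        int ind[] = {0};
        double val[] = {1.0};
        double T[] = {2, 3, 0,   4, 5, 6};   /* rows 1 and 2, stored densely */
        double y[] = {1.0, 8.0, 32.0}, x[3];

        sparse_dense_trsv(3, 1, ptr, ind, val, T, y, x);
        printf("%g %g %g\n", x[0], x[1], x[2]);   /* expected: 1 2 3 */
        return 0;
    }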