This paper is a follow-up to the readings on cache-oblivious algorithms. The authors examine compiler source transformation for generating cache-oblivious recursive algorithms from a loop-based implementation, and show experimental results with some linear algebra kernels (e.g., Cholesky factorization).
(Note: The fonts in the PostScript version display a little more clearly than those in the PDF version.)
This article is a high-level overview of vectorization in the Intel C/C++ and Fortran compilers (icc/ifc).
Section 3.7 covers the Pentium Pro and is a digestible summary of the key results in the first reading by Bhandarkar and Ding. The rest of the chapter reviews basic register renaming and out-of-order execution techniques. (This is a preprint of their new book, so see me about obtaining a copy of this chapter.)
This paper discusses, in the context of a particular machine/cache model, algorithms for matrix transpose, FFT, and sorting that are asymptotically optimal with respect to the amount of data movement in a system with multi-level caches. These algorithms are interesting because they do not have any cache-dependent tuning parameters (like block sizes).
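To give a flavor of the idea, here is a minimal sketch of a divide-and-conquer matrix transpose in the cache-oblivious style. It is an illustration only, not the paper's code: the names `transpose_recursive` and `transpose` and the `base` cutoff are assumptions made for this example. The cutoff only limits recursion overhead (it must be at least 1) and is not tuned to any cache size.

```python
def transpose_recursive(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b, so b[j][i] = a[i][j].

    Hypothetical helper for illustration; `base` (>= 1) is a small constant
    that stops the recursion, not a cache-dependent block size.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half and recurse.
        mid = r0 + rows // 2
        transpose_recursive(a, b, r0, mid, c0, c1, base)
        transpose_recursive(a, b, mid, r1, c0, c1, base)
    else:
        # Split the longer dimension (columns) in half and recurse.
        mid = c0 + cols // 2
        transpose_recursive(a, b, r0, r1, c0, mid, base)
        transpose_recursive(a, b, r0, r1, mid, c1, base)


def transpose(a, base=4):
    """Return a new list-of-lists holding the transpose of the rectangular matrix a."""
    n_rows = len(a)
    n_cols = len(a[0]) if n_rows else 0
    b = [[None] * n_rows for _ in range(n_cols)]
    transpose_recursive(a, b, 0, n_rows, 0, n_cols, base)
    return b
```

Because the recursion keeps halving the larger dimension, the sub-blocks eventually fit in every level of the cache hierarchy without the code ever naming a cache size.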
This web page contains a short, high-level description of this new software standard. This is a good place to start.
This page contains the full formal software standard of which the XBLAS is a part. The XBLAS in particular are defined in Chapter 4. You should read Chapter 1, the introduction, and pages 132--144 of Chapter 4. The rest of the very lengthy Chapter 4 is a list of subroutine interfaces, and just reading the first few will give you a flavor of what we are doing. Also, we are not currently working on the Fortran versions of this code---just C.
Read this paper next. It gives a more complete explanation of the motivation and background of the standard (Chapter 4 of the standard is a bit brief on this).
Read this paper on the different implementations of high precision arithmetic.
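As background, one widely used software approach to extra precision represents a value as an unevaluated sum of two doubles ("double-double"). The sketch below is an assumption-laden illustration of that general technique and may or may not match the implementations the paper compares; the function names `two_sum` and `dd_add` are chosen for this example. It relies on Python floats being IEEE 754 doubles with round-to-nearest.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly.

    This is the classic branch-free error-free transformation
    (valid barring overflow).
    """
    s = a + b
    bb = s - a                      # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb) # what rounding discarded
    return s, err


def dd_add(x, y):
    """Add two double-double values given as (hi, lo) pairs.

    Simplified ("sloppy") variant for illustration: the low-order parts
    are accumulated with ordinary floating-point addition.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi = s + e
    lo = e - (hi - s)               # renormalize so hi carries the leading bits
    return hi, lo
```

For example, `dd_add((1.0, 0.0), (1e-20, 0.0))` keeps the tiny addend in the low word instead of losing it to rounding, which plain `1.0 + 1e-20` would do.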
Read this first for a summary of the latest Sparsity-related research results.
Read this next.
This thesis has the full details. Note: The appendix is a separate download.
Read first.
Another useful starting point.
Take a look at this paper first.
See this page for an overview of the project.
The definitive starting reference.