Before diving into how BLAS is written to be fast, it’s essential to understand memory: specifically, how data is stored in memory and how it is fed to the processor, which performs the computations. All BLAS does is optimize these two operations for a specific computer architecture. CORAL, for instance, targets AArch64.
Future posts on BLAS will refer back to concepts explained here.
The content in this post draws heavily on What Every Programmer Should Know About Memory by Ulrich Drepper. It’s phenomenal. Unless a different source is linked, all numerical values come from this paper.