In the past few months I’ve become immensely interested in scientific computing and writing fast code. I started CORAL as a project to learn both at the same time. And learn Rust.
CORAL stands for COre Rust Architecture for Linear algebra. It is an implementation of the Basic Linear Algebra Subprograms, or BLAS, in pure Rust. It is written for AArch64 architectures only.
BLAS is the set of the most common low-level linear algebra operations, or “kernels”. Most numerical routines boil down to linear algebra, so a useful BLAS must be as fast as possible. These kernels naturally separate into three levels, each monumentally more difficult to optimize than the last.
## Level 1: Vector-Vector Operations

Think of things like calculating the dot product, $x^\top y$, or scaling by a scalar, $\alpha x$. These operations are memory bound; the bottleneck is how fast memory moves around, not how fast the CPU is. Good performance can be achieved if the code is written intelligently.
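To make "written intelligently" concrete, here is a sketch of one such trick for the dot product: splitting the sum into independent accumulators so the CPU can overlap loads and multiplies instead of serializing on a single running sum. This is illustrative only; CORAL's actual kernels are written with explicit NEON intrinsics, but the idea is the same.

```rust
/// Dot product with four independent accumulators.
/// A single accumulator creates a loop-carried dependency: every add must
/// wait on the previous one. Independent partial sums hide that latency.
fn dot(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    let mut acc = [0.0f32; 4];
    let chunks = x.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            acc[lane] += x[4 * i + lane] * y[4 * i + lane];
        }
    }
    let mut sum: f32 = acc.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for i in 4 * chunks..x.len() {
        sum += x[i] * y[i];
    }
    sum
}

fn main() {
    let x: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let y = vec![1.0f32; 8];
    println!("{}", dot(&x, &y)); // 0 + 1 + ... + 7 = 28
}
```

A good autovectorizer will turn the inner loop into SIMD lanes on its own, which is exactly why this shape of code is fast.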
## Level 2: Matrix-Vector Operations

Think of things like calculating $y = Ax$, or solving a system of equations $Tx = b$ given a triangular matrix $T$ and vector $b$. These operations are also memory bound. It is here, though, that clever tricks leveraging the cache to maximize performance begin. Good performance can still be achieved with smart code.
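As a taste of what "leveraging the cache" means at this level, here is a hypothetical column-major GEMV sketch (not CORAL's actual API): instead of computing each $y_i$ as a dot product that strides across rows of $A$, it walks one contiguous column at a time, so every memory access is sequential.

```rust
/// y += A * x, with A stored column-major: element (i, j) is a[i + j * m].
/// Each iteration of the outer loop is an AXPY over one contiguous column,
/// which streams through memory rather than jumping by the row stride.
fn gemv(m: usize, n: usize, a: &[f32], x: &[f32], y: &mut [f32]) {
    assert_eq!(a.len(), m * n);
    assert_eq!(x.len(), n);
    assert_eq!(y.len(), m);
    for j in 0..n {
        let xj = x[j];
        let col = &a[j * m..(j + 1) * m];
        for i in 0..m {
            y[i] += xj * col[i];
        }
    }
}

fn main() {
    // 2x2 identity (column-major) times [3, 4].
    let a = [1.0f32, 0.0, 0.0, 1.0];
    let mut y = vec![0.0f32; 2];
    gemv(2, 2, &a, &[3.0, 4.0], &mut y);
    println!("{:?}", y); // [3.0, 4.0]
}
```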
## Level 3: Matrix-Matrix Operations

Think of things like calculating $C = AB$, general matrix-matrix multiplication (GEMM). It’s fair to say GEMM is the most executed mathematical operation on the planet. It is also compute bound, which means reaching peak performance is still an active area of research. A BLAS’s performance is almost entirely dependent on how fast it can calculate GEMM. Consequently, solving large $Ax = b$ systems, whose cost is dominated by GEMM, is how supercomputers are benchmarked today. AI only exists today because matrix multiplication became fast enough.
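For reference, here is the textbook triple loop that every fast GEMM starts from (a sketch, not CORAL's kernel). Real implementations in the GotoBLAS lineage block this loop nest for each cache level and pack panels of $A$ and $B$ so the innermost loop runs entirely out of L1 at FMA throughput; that gap between the naive loop and the blocked version is the whole game at Level 3.

```rust
/// C += A * B, all column-major: A is m x k, B is k x n, C is m x n.
/// The loop order (j, p, i) keeps the innermost loop walking contiguous
/// columns of A and C, but with no blocking or packing this still thrashes
/// cache for large matrices.
fn gemm_naive(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for j in 0..n {
        for p in 0..k {
            let bpj = b[p + j * k];
            for i in 0..m {
                c[i + j * m] += a[i + p * m] * bpj;
            }
        }
    }
}

fn main() {
    // 2x2 identity times B: C should equal B.
    let a = [1.0f32, 0.0, 0.0, 1.0];
    let b = [1.0f32, 2.0, 3.0, 4.0];
    let mut c = vec![0.0f32; 4];
    gemm_naive(2, 2, 2, &a, &b, &mut c);
    println!("{:?}", c); // [1.0, 2.0, 3.0, 4.0]
}
```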
One of BLAS’s pioneers is Kazushige Goto, who hand-optimized assembly routines for his GotoBLAS. It outperformed many BLAS implementations of its time and became the backbone of the current industry standard, OpenBLAS. If you use Python and NumPy for vector calculations, OpenBLAS is why they’re so fast.
CORAL isn’t built to compete with industry BLAS, but to reach 80% of their performance on AArch64, in order to educate myself and others on how these fast low-level algorithms work. The purpose of this “blog” is to walk through how to intelligently write code for a fast BLAS. Just not one that’s used by supercomputers.
It turns out that, on AArch64, CORAL is actually comparable to OpenBLAS when both are single-threaded. CORAL outperforms it for D/C/ZGEMM and is comparable for SGEMM (single-precision general matrix multiplication). This makes sense, since SGEMM is the most used, and therefore the most heavily tuned. You can see the benchmarks here.
However, Apple Accelerate, another BLAS implementation that is optimized for Apple Silicon, absolutely wrecks both CORAL and OpenBLAS.