coral benchmarks

Apple M4 (6P + 4E), 16GB unified memory. All benchmarks are single-precision and single-threaded.

Each plot shows:

  • coral-safe (portable-simd, safe Rust)
  • coral-neon (AArch64 / NEON)
  • a reference implementation, one of:
    • OpenBLAS (armv8)
    • Apple Accelerate
    • BLIS, for sgemm
    • faer, for sgemm/matmul

The OpenBLAS backend used is optimized for Level 2 and Level 3 routines. For some Level 1 routines, such as SNRM2, I believe this backend falls back to the reference Netlib algorithm.


OpenBLAS

Level 1

ISAMAX: index of max absolute value
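
As a point of reference for what is being benchmarked, here is a minimal scalar sketch of ISAMAX. It is not CORAL's kernel: the function name and signature are illustrative, it assumes a unit-stride slice, it returns a 0-based index as is natural in Rust (the Fortran reference routine returns a 1-based index), and it makes no attempt to handle NaN specially.

```rust
/// Illustrative scalar ISAMAX: index of the first element with the
/// largest absolute value. Returns `None` for an empty slice.
/// (Hypothetical signature; not taken from CORAL.)
fn isamax(x: &[f32]) -> Option<usize> {
    if x.is_empty() {
        return None;
    }
    let mut best_idx = 0;
    let mut best_val = x[0].abs();
    for (i, &v) in x.iter().enumerate().skip(1) {
        let a = v.abs();
        // Strict comparison keeps the first occurrence on ties.
        if a > best_val {
            best_val = a;
            best_idx = i;
        }
    }
    Some(best_idx)
}
```

For example, `isamax(&[1.0, -5.0, 5.0, 2.0])` gives `Some(1)`, since `-5.0` is the first element with the largest magnitude.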

coral-aarch64 benchmarks

Apple M4 (6P + 4E), 16GB unified memory. All benchmarks are single-threaded.

In all plots, CORAL is benchmarked against OpenBLAS. Some routines also include Apple Accelerate. When Accelerate is omitted, it’s because its AMX-backed kernels on this M4 MacBook Pro are so much faster that they would obscure the comparison with OpenBLAS.

For sgemm, faer is included, also single-threaded.



Level 1

AXPY

AXPY performs a scaled vector addition: \[ y \leftarrow \alpha x + y \]
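
The update above can be sketched as a plain scalar loop. This is only a reference version under simplifying assumptions (unit stride, equal-length slices, an illustrative function name), not the SIMD kernel that is benchmarked here.

```rust
/// Illustrative scalar AXPY: y <- alpha * x + y.
/// Assumes unit-stride slices of equal length (hypothetical signature).
fn saxpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");
    for (yi, &xi) in y.iter_mut().zip(x.iter()) {
        // Each element of y is updated in place.
        *yi += alpha * xi;
    }
}
```

With `x = [1.0, 2.0, 3.0]`, `y = [10.0, 20.0, 30.0]` and `alpha = 2.0`, the call `saxpy(2.0, &x, &mut y)` leaves `y` equal to `[12.0, 24.0, 36.0]`.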

Understanding Memory

Before diving into how BLAS is written to be fast, it’s essential to understand memory: specifically, how data is stored in memory and how it is fed to the processor that performs the computations. All BLAS does is optimize these two operations for a specific computer architecture. CORAL, for instance, targets AArch64 architectures.

Future posts on BLAS will refer back to concepts explained here.

The content in this post draws heavily on What Every Programmer Should Know About Memory by Ulrich Drepper. It’s phenomenal. Unless otherwise linked, all numerical values come from this paper.

What is CORAL?

In the past few months I’ve become immensely interested in scientific computing and writing fast code. I started CORAL as a project to learn both at the same time, and to learn Rust along the way.

CORAL stands for COre Rust Architecture for Linear algebra. It is an implementation of the Basic Linear Algebra Subprograms, or BLAS, in pure Rust. It is written for AArch64 architectures only.

BLAS is the set of the most common low-level operations, or “kernels”, for linear algebra. Because most numerical routines rely on linear algebra, a useful BLAS must be as fast as possible. These kernels naturally separate into three levels, each monumentally more difficult to optimize than the last.
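
By the standard BLAS convention, Level 1 covers vector-vector operations, Level 2 matrix-vector operations, and Level 3 matrix-matrix operations. The sketch below shows one naive kernel per level to make the jump in work concrete. The signatures are simplified assumptions for illustration: matrices are row-major slices, and the `alpha`, `beta`, stride, and transpose parameters of the real routines are left out.

```rust
/// Level 1 (vector-vector): dot product, O(n) work.
fn sdot(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Level 2 (matrix-vector): y = A x, with A an m x n row-major matrix.
/// O(m * n) work.
fn sgemv(m: usize, n: usize, a: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), m * n);
    assert_eq!(x.len(), n);
    // Each output element is the dot product of one row of A with x.
    (0..m).map(|i| sdot(&a[i * n..(i + 1) * n], x)).collect()
}

/// Level 3 (matrix-matrix): C = A B, with A m x k and B k x n,
/// all row-major. O(m * k * n) work.
fn sgemm(m: usize, k: usize, n: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0_f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
    c
}
```

These loops are correct but slow. Each level performs more arithmetic per element loaded from memory than the one before, which leaves more room for blocking and data reuse in an optimized kernel.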