Apple M4 (6P + 4E), 16GB unified memory. All benchmarks are single-precision and single-threaded.
Each plot shows:
- coral-safe (portable-simd, safe Rust)
- coral-neon (AArch64 / NEON)
- a reference implementation:
  - OpenBLAS armv8, or
  - Apple Accelerate, or
  - BLIS for sgemm
  - faer for sgemm/matmul
The OpenBLAS backend used here is optimized for Level 2 and Level 3 routines. For some Level 1 routines, such as SNRM2, I believe it falls back to the reference netlib algorithm.
OpenBLAS#
Level 1#
ISAMAX — index of max absolute value#
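For orientation, here is a minimal scalar sketch of what ISAMAX computes. The name `ref_isamax` is hypothetical and this is not coral's implementation; it returns a 0-based index, whereas Fortran BLAS returns a 1-based one.

```rust
/// Hypothetical scalar reference: 0-based index of the first element
/// with the largest absolute value, or None for an empty slice.
fn ref_isamax(x: &[f32]) -> Option<usize> {
    if x.is_empty() {
        return None;
    }
    let mut best = 0;
    for i in 1..x.len() {
        // Strict `>` keeps the earliest index when magnitudes tie.
        if x[i].abs() > x[best].abs() {
            best = i;
        }
    }
    Some(best)
}
```

A SIMD version typically has to track the per-lane maxima and their indices, then reduce across lanes while still honoring the first-occurrence rule.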

SASUM — sum of absolute values#

SAXPY — scaled vector addition#
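SAXPY computes y ← αx + y. A minimal scalar sketch, assuming unit strides (the name `ref_saxpy` is hypothetical and this is not coral's kernel):

```rust
/// Hypothetical scalar reference for y <- alpha * x + y with unit strides.
fn ref_saxpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    assert_eq!(x.len(), y.len());
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi += alpha * xi;
    }
}
```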

SCOPY — copy a vector into another#

SDOT — dot product#

SNRM2 — Euclidean norm#
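As a point of comparison, the sketch below shows the classic scaled sum-of-squares approach historically used by reference BLAS for the 2-norm. It is only an illustration under that assumption: newer reference releases use a different algorithm, and which one a given OpenBLAS build actually runs is not confirmed here. The name `ref_snrm2` is hypothetical.

```rust
/// Hypothetical sketch of the scaled sum-of-squares 2-norm.
/// Maintains the invariant: sum of squares so far == scale^2 * ssq,
/// so intermediate values never overflow when the true norm is finite.
fn ref_snrm2(x: &[f32]) -> f32 {
    let mut scale = 0.0f32;
    let mut ssq = 1.0f32;
    for &v in x {
        if v != 0.0 {
            let a = v.abs();
            if scale < a {
                // New largest magnitude: rescale the running sum.
                let r = scale / a;
                ssq = 1.0 + ssq * r * r;
                scale = a;
            } else {
                let r = a / scale;
                ssq += r * r;
            }
        }
    }
    scale * ssq.sqrt()
}
```

The division and branch inside the loop make this form hard to vectorize, which may explain part of any gap against a SIMD implementation.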

SROT — Givens rotation#
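SROT applies a plane rotation with cosine `c` and sine `s` to each pair (x[i], y[i]). A minimal scalar sketch with unit strides (hypothetical name `ref_srot`, not coral's code):

```rust
/// Hypothetical scalar reference for the Givens rotation
///   x[i] <- c * x[i] + s * y[i]
///   y[i] <- c * y[i] - s * x[i]
fn ref_srot(x: &mut [f32], y: &mut [f32], c: f32, s: f32) {
    assert_eq!(x.len(), y.len());
    for (xi, yi) in x.iter_mut().zip(y.iter_mut()) {
        // Compute the new x first, but use the old x for the new y.
        let t = c * *xi + s * *yi;
        *yi = c * *yi - s * *xi;
        *xi = t;
    }
}
```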

SROTM — modified Givens rotation#

SSCAL — scale a vector#

SSWAP — swap two vectors#

Level 2#
SGEMV — matrix–vector multiply#
\[ y \leftarrow \alpha \operatorname{op}(A)x + \beta y \]
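The formula can be sketched in scalar form for the no-transpose case, op(A) = A. This assumes column-major storage with leading dimension `lda`, as in standard BLAS; the name `ref_sgemv_n` is hypothetical and this is not coral's kernel.

```rust
/// Hypothetical scalar reference for y <- alpha * A * x + beta * y.
/// `a` is column-major, so A[i][j] lives at a[i + j * lda].
fn ref_sgemv_n(
    m: usize, n: usize, alpha: f32, a: &[f32], lda: usize,
    x: &[f32], beta: f32, y: &mut [f32],
) {
    assert!(lda >= m && x.len() == n && y.len() == m);
    assert!(n == 0 || a.len() >= lda * (n - 1) + m);
    // Scale y first; beta == 0 overwrites y instead of multiplying it.
    for yi in y.iter_mut() {
        *yi = if beta == 0.0 { 0.0 } else { beta * *yi };
    }
    // Walk A one column at a time so memory access is contiguous.
    for j in 0..n {
        let t = alpha * x[j];
        let col = &a[j * lda..j * lda + m];
        for (yi, &aij) in y.iter_mut().zip(col) {
            *yi += t * aij;
        }
    }
}
```

The column-at-a-time loop is effectively a sequence of SAXPY updates, which is why the no-transpose case vectorizes naturally in column-major layouts.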

SGER — rank-1 update#
\[ A \leftarrow \alpha x y^T + A \]
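A scalar sketch of the rank-1 update, assuming column-major storage with leading dimension `lda` (hypothetical name `ref_sger`, not coral's implementation):

```rust
/// Hypothetical scalar reference for A <- alpha * x * y^T + A.
/// `a` is column-major, so A[i][j] lives at a[i + j * lda].
fn ref_sger(
    m: usize, n: usize, alpha: f32,
    x: &[f32], y: &[f32], a: &mut [f32], lda: usize,
) {
    assert!(lda >= m && x.len() == m && y.len() == n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + m);
    for j in 0..n {
        // Column j of A receives (alpha * y[j]) * x.
        let t = alpha * y[j];
        for i in 0..m {
            a[i + j * lda] += t * x[i];
        }
    }
}
```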

SSYMV — symmetric matrix–vector multiply#
\[ y \leftarrow \alpha A x + \beta y, \quad A = A^T \]
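Because A is symmetric, only one triangle needs to be read. The sketch below handles the lower-triangle case under the usual column-major assumption; the name `ref_ssymv_lower` is hypothetical and this is not coral's kernel.

```rust
/// Hypothetical scalar reference for y <- alpha * A * x + beta * y,
/// reading only the lower triangle of the column-major matrix `a`.
fn ref_ssymv_lower(
    n: usize, alpha: f32, a: &[f32], lda: usize,
    x: &[f32], beta: f32, y: &mut [f32],
) {
    assert!(lda >= n && x.len() == n && y.len() == n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for yi in y.iter_mut() {
        *yi = if beta == 0.0 { 0.0 } else { beta * *yi };
    }
    for j in 0..n {
        // Diagonal entry contributes once.
        y[j] += alpha * a[j + j * lda] * x[j];
        // Each strictly-lower entry A[i][j] also stands in for A[j][i].
        for i in (j + 1)..n {
            let aij = a[i + j * lda];
            y[i] += alpha * aij * x[j];
            y[j] += alpha * aij * x[i];
        }
    }
}
```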
- lower triangle stored:

- upper triangle stored:

SSYR — symmetric rank-1 update#
\[ A \leftarrow \alpha x x^T + A \]
- lower triangle stored:

- upper triangle stored:

SSYR2 — symmetric rank-2 update#
\[ A \leftarrow \alpha (x y^T + y x^T) + A \]
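A scalar sketch of the rank-2 update for the upper-triangle case, assuming column-major storage (hypothetical name `ref_ssyr2_upper`, not coral's code). Only entries on or above the diagonal are touched.

```rust
/// Hypothetical scalar reference for A <- alpha * (x y^T + y x^T) + A,
/// updating only the upper triangle of the column-major matrix `a`.
fn ref_ssyr2_upper(
    n: usize, alpha: f32, x: &[f32], y: &[f32],
    a: &mut [f32], lda: usize,
) {
    assert!(lda >= n && x.len() == n && y.len() == n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in 0..n {
        for i in 0..=j {
            a[i + j * lda] += alpha * (x[i] * y[j] + y[i] * x[j]);
        }
    }
}
```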
- lower triangle stored:

- upper triangle stored:

STRMV — triangular matrix–vector multiply#
\[ x \leftarrow \operatorname{op}(A) x \]
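The in-place product can be sketched for one case: upper triangular, no transpose, non-unit diagonal, column-major storage. The name `ref_strmv_upper` is hypothetical and this is not coral's kernel.

```rust
/// Hypothetical scalar reference for x <- A * x with A upper triangular.
/// Columns are processed left to right, so x[j] is still the original
/// value when column j is applied.
fn ref_strmv_upper(n: usize, a: &[f32], lda: usize, x: &mut [f32]) {
    assert!(lda >= n && x.len() == n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in 0..n {
        let t = x[j];
        // Rows above the diagonal accumulate A[i][j] * x[j].
        for i in 0..j {
            x[i] += t * a[i + j * lda];
        }
        // The diagonal scales x[j] itself.
        x[j] = t * a[j + j * lda];
    }
}
```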
Upper triangular (STRMV):

Lower triangular (STRMV):

STRSV — triangular solve#
\[ x \leftarrow \operatorname{op}(A)^{-1} x \]
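In BLAS the right-hand side is passed in `x` and overwritten with the solution. A minimal back-substitution sketch for the upper triangular, no-transpose, non-unit case with column-major storage (hypothetical name `ref_strsv_upper`, not coral's implementation; no singularity check is performed):

```rust
/// Hypothetical scalar reference that solves A * x = b in place,
/// with A upper triangular. On entry `x` holds b; on exit, the solution.
fn ref_strsv_upper(n: usize, a: &[f32], lda: usize, x: &mut [f32]) {
    assert!(lda >= n && x.len() == n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    // Back substitution: resolve the last unknown first.
    for j in (0..n).rev() {
        x[j] /= a[j + j * lda];
        let t = x[j];
        // Eliminate x[j] from the remaining equations above it.
        for i in 0..j {
            x[i] -= t * a[i + j * lda];
        }
    }
}
```

Each step depends on the previous one, so the available parallelism sits in the inner column update rather than across unknowns.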
Upper triangular (STRSV):

Lower triangular (STRSV):

Level 3#
SGEMM — matrix–matrix multiply#
\[ C \leftarrow \alpha \operatorname{op}(A)\operatorname{op}(B) + \beta C \]
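For reference, a naive triple-loop sketch of the formula with op(A) = A and op(B) = B, assuming column-major storage. The name `ref_sgemm_nn` is hypothetical; optimized libraries use blocking, packing, and register-tiled microkernels, none of which appear here.

```rust
/// Hypothetical scalar reference for C <- alpha * A * B + beta * C,
/// where A is m x k, B is k x n, C is m x n, all column-major.
fn ref_sgemm_nn(
    m: usize, n: usize, k: usize, alpha: f32,
    a: &[f32], lda: usize, b: &[f32], ldb: usize,
    beta: f32, c: &mut [f32], ldc: usize,
) {
    assert!(lda >= m && ldb >= k && ldc >= m);
    assert!(k == 0 || a.len() >= lda * (k - 1) + m);
    assert!(n == 0 || b.len() >= ldb * (n - 1) + k);
    assert!(n == 0 || c.len() >= ldc * (n - 1) + m);
    for j in 0..n {
        // Scale column j of C; beta == 0 overwrites instead of multiplying.
        for i in 0..m {
            let cij = &mut c[i + j * ldc];
            *cij = if beta == 0.0 { 0.0 } else { beta * *cij };
        }
        // Accumulate column j of C as a combination of columns of A.
        for p in 0..k {
            let t = alpha * b[p + j * ldb];
            for i in 0..m {
                c[i + j * ldc] += t * a[i + p * lda];
            }
        }
    }
}
```

This performs 2·m·n·k floating-point operations, the count commonly used when converting a GEMM timing to GFLOP/s.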

Apple Accelerate#
The following benchmarks are the same as above, but with Apple Accelerate as the reference implementation.
For critical routines like sgemv and sgemm, Accelerate uses Apple's AMX hardware and is much faster as a result.
Consequently, this masks any comparison between my coral implementations and other BLAS libraries.
Level 1 (Accelerate)#
ISAMAX — index of max absolute value#

SASUM — sum of absolute values#

SAXPY — scaled vector addition#

SCOPY — copy a vector into another#

SDOT — dot product#

SNRM2 — Euclidean norm#

SROT — Givens rotation#

SROTM — modified Givens rotation#

SSCAL — scale a vector#

SSWAP — swap two vectors#

Level 2 (Accelerate)#
SGEMV — matrix–vector multiply#

SGER — rank-1 update#

SSYMV — symmetric matrix–vector multiply#
Lower triangle stored:

Upper triangle stored:

SSYR — symmetric rank-1 update#
Lower triangle stored:

Upper triangle stored:

SSYR2 — symmetric rank-2 update#
Lower triangle stored:

Upper triangle stored:

STRMV — triangular matrix–vector multiply#
Upper triangular (STRMV):

Lower triangular (STRMV):

STRSV — triangular solve#
Upper triangular (STRSV):

Lower triangular (STRSV):

Level 3 (Accelerate)#
SGEMM — matrix–matrix multiply#
