CORAL Benchmarks

Apple M4 (4P + 6E), 16 GB unified memory. Single-threaded (1 P-core).

In all plots I benchmark against OpenBLAS. For some routines I also benchmark against Apple Accelerate. Where Accelerate is not shown, it was so much faster on my M4 MacBook Pro that it obscured the comparison with OpenBLAS.

For GEMM, I also benchmark against faer, run single-threaded.


Level 1#

AXPY#

f32#

f64#

c32#

c64#


SCAL#

f32#

f64#

c32#

c64#


DOT#

f32#

f64#

c32#

conj (cdotc)#

unconj (cdotu)#

c64#

conj (zdotc)#

unconj (zdotu)#


Level 2#

GEMV#

$$ y \leftarrow \alpha \operatorname{op}(A) x + \beta y $$
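To make the operation concrete, here is a minimal reference sketch of GEMV for the non-transposed case, $\operatorname{op}(A) = A$. It is not CORAL's API or kernel: the name `gemv_ref`, the `f64`-only signature, and the column-major layout with leading dimension `lda` are assumptions made for illustration.

```rust
/// Reference GEMV for the non-transposed case: y <- alpha * A * x + beta * y.
/// Assumes `a` is an m-by-n column-major matrix with leading dimension `lda >= m`.
/// Hypothetical helper for illustration; not part of any library.
pub fn gemv_ref(
    m: usize, n: usize, alpha: f64, a: &[f64], lda: usize,
    x: &[f64], beta: f64, y: &mut [f64],
) {
    assert!(lda >= m && x.len() >= n && y.len() >= m);
    assert!(n == 0 || a.len() >= lda * (n - 1) + m);

    // Scale y first; beta == 0 overwrites y so stale NaNs do not propagate.
    for yi in y[..m].iter_mut() {
        *yi = if beta == 0.0 { 0.0 } else { beta * *yi };
    }
    // Accumulate one column of A at a time (unit-stride access in column-major).
    for j in 0..n {
        let t = alpha * x[j];
        let col = &a[j * lda..j * lda + m];
        for (yi, aij) in y[..m].iter_mut().zip(col) {
            *yi += t * aij;
        }
    }
}
```

A naive loop like this is mainly useful as a correctness oracle when checking an optimized kernel on small inputs.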

f32#

f64#

c32#

c64#


TRSV#

$$ x \leftarrow A^{-1} b $$
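The triangular solve can be sketched as forward substitution (lower triangular) and back substitution (upper triangular). The functions below are hypothetical reference helpers, not CORAL's implementation; they assume `f64`, column-major storage, a non-unit diagonal, no transpose, and that `x` holds $b$ on entry and the solution on exit.

```rust
/// Solve L * x = b in place by forward substitution.
/// `a` is column-major, n-by-n, lower triangular, leading dimension `lda >= n`.
/// Hypothetical helper for illustration; not part of any library.
pub fn trsv_lower_ref(n: usize, a: &[f64], lda: usize, x: &mut [f64]) {
    assert!(lda >= n && x.len() >= n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in 0..n {
        x[j] /= a[j + j * lda];
        let xj = x[j];
        // Eliminate x[j] from the remaining equations below row j.
        for i in j + 1..n {
            x[i] -= xj * a[i + j * lda];
        }
    }
}

/// Solve U * x = b in place by back substitution (same layout assumptions).
pub fn trsv_upper_ref(n: usize, a: &[f64], lda: usize, x: &mut [f64]) {
    assert!(lda >= n && x.len() >= n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in (0..n).rev() {
        x[j] /= a[j + j * lda];
        let xj = x[j];
        // Eliminate x[j] from the equations above row j.
        for i in 0..j {
            x[i] -= xj * a[i + j * lda];
        }
    }
}
```

Both loops walk down a column of `a`, which keeps memory access unit-stride under the column-major assumption.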

f32#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

f64#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

c32#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

c64#

LOWER TRIANGULAR#

UPPER TRIANGULAR#


TRMV#

$$ x \leftarrow \operatorname{op}(A) x $$
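As a rough illustration of the in-place triangular product for $\operatorname{op}(A) = A$, the sketch below processes columns in an order that never reads an already-overwritten entry of `x`. The names `trmv_lower_ref` and `trmv_upper_ref`, the `f64` type, the column-major layout, and the non-unit diagonal are assumptions for this example, not CORAL's API.

```rust
/// x <- L * x in place, L lower triangular, column-major, leading dimension `lda >= n`.
/// Columns are visited right to left so x[j] is still the original value when read.
/// Hypothetical helper for illustration; not part of any library.
pub fn trmv_lower_ref(n: usize, a: &[f64], lda: usize, x: &mut [f64]) {
    assert!(lda >= n && x.len() >= n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in (0..n).rev() {
        let t = x[j];
        for i in j + 1..n {
            x[i] += t * a[i + j * lda];
        }
        x[j] = t * a[j + j * lda];
    }
}

/// x <- U * x in place, U upper triangular (same layout assumptions).
/// Columns are visited left to right for the same reason.
pub fn trmv_upper_ref(n: usize, a: &[f64], lda: usize, x: &mut [f64]) {
    assert!(lda >= n && x.len() >= n);
    assert!(n == 0 || a.len() >= lda * (n - 1) + n);
    for j in 0..n {
        let t = x[j];
        for i in 0..j {
            x[i] += t * a[i + j * lda];
        }
        x[j] = t * a[j + j * lda];
    }
}
```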

f32#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

f64#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

c32#

LOWER TRIANGULAR#

UPPER TRIANGULAR#

c64#

LOWER TRIANGULAR#

UPPER TRIANGULAR#


Level 3#

GEMM#

$$ C \leftarrow \alpha \operatorname{op}(A)\operatorname{op}(B) + \beta C $$
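For reference, a naive sketch of GEMM with $\operatorname{op}(A) = A$ and $\operatorname{op}(B) = B$ is shown below. It is an unblocked triple loop, not the packed and vectorized kernel a tuned library would use; `gemm_ref`, the `f64` type, and the column-major layout with leading dimensions `lda`, `ldb`, `ldc` are assumptions for illustration.

```rust
/// Reference GEMM without transposes: C <- alpha * A * B + beta * C.
/// A is m-by-k, B is k-by-n, C is m-by-n, all column-major.
/// Hypothetical helper for illustration; not part of any library.
#[allow(clippy::too_many_arguments)]
pub fn gemm_ref(
    m: usize, n: usize, k: usize, alpha: f64,
    a: &[f64], lda: usize, b: &[f64], ldb: usize,
    beta: f64, c: &mut [f64], ldc: usize,
) {
    assert!(lda >= m && ldb >= k && ldc >= m);
    assert!(k == 0 || a.len() >= lda * (k - 1) + m);
    assert!(n == 0 || b.len() >= ldb * (n - 1) + k);
    assert!(n == 0 || c.len() >= ldc * (n - 1) + m);

    for j in 0..n {
        let cj = &mut c[j * ldc..j * ldc + m];
        // Scale column j of C; beta == 0 overwrites so stale NaNs do not propagate.
        for v in cj.iter_mut() {
            *v = if beta == 0.0 { 0.0 } else { beta * *v };
        }
        // Column j of C accumulates a linear combination of the columns of A.
        for p in 0..k {
            let t = alpha * b[p + j * ldb];
            let ap = &a[p * lda..p * lda + m];
            for (cij, aip) in cj.iter_mut().zip(ap) {
                *cij += t * aip;
            }
        }
    }
}
```

This form performs $2mnk$ floating-point operations for the product itself, which is the usual basis for converting a measured time into a throughput figure.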

f32#

f64#

c32#