Inducing complex matrix multiplication via the 1m method

FLAME Working Note #85

Field G. Van Zee
Department of Computer Science
The University of Texas at Austin
Austin, TX 78712

February 28, 2017

Abstract
In this article, we continue exploring the topic of so-called induced methods for implementing complex matrix multiplication. Previous work investigated two approaches and demonstrated various algorithms for each method that compute matrix products in the complex domain using only a real matrix multiplication kernel. However, algorithms based on the more generally applicable of the two methods—the 4m method—lead to implementations that, for various reasons, often underperform their real domain benchmarks. To overcome these limitations, we derive a superior 1m method for expressing complex matrix multiplication, one which addresses virtually all of the shortcomings inherent in 4m. Our derivation also naturally exposes a symmetry that allows the method to perform well when updating either column- or row-stored matrices. Applying the method to two general algorithms for matrix multiplication yields a family of algorithmic variants, each with a unique set of circumstantial affinities. Further analysis suggests 1m will match or exceed the performance of a real matrix multiplication based on the same kernel, especially for certain rank-\(k\) updates. We also show that the method is actually a special case of a larger family of algorithms based on a 2m method, which is generally well-suited for storage formats that separate real and imaginary parts into separate matrices. Implementations are developed within the BLIS framework, which facilitates their extension to all level-3 operations. Testing on a recent Intel microarchitecture confirms that the 1m method yields performance that is competitive with solutions based on conventionally implemented complex kernels.

1 Introduction

Dense matrix multiplication—the foundation of many dense linear algebra operations—is now ubiquitous within scientific and numerical computing applications. For three decades, libraries that provide the Basic Linear Algebra Subprograms (BLAS) [1] have exported common interfaces to specific implementations of matrix multiplication and related functionality. However, before a user can employ these highly-tuned functions, a library developer with knowledge of the target hardware must first implement the operations in question.

Some projects seek to provide ready-made solutions while others focus on streamlining the development process so that third parties may rapidly instantiate dense linear algebra functionality on existing and new hardware [6, 11, 10]. In either case, the basic formula is the same: a developer carefully writes a small computational kernel, usually in assembly (or a low-level, assembly-like) language, which is then inserted into a larger infrastructure of portable code. Within dense linear algebra (DLA) library development circles, it is taken for granted that one matrix multiplication kernel is needed for each floating-point precision and domain to be supported. Thus, to support all four combinations of the single- and double-precision in the real and complex domains, four kernels must be authored to enable a complete set of matrix multiplication implementations.
Many real-world applications, high-performance benchmarks, and pedagogical settings rely only on computation in the real domain. Thus, providing support for real matrix multiplication is understandably a priority. However, DLA—and by proxy, matrix multiplication—in the complex domain is essential for many fields and applications, in part because of complex numbers’ unique ability to encode both the phase and magnitude of a wave.

Unfortunately, those who seek to provide support for complex matrix operations must wrestle with additional challenges. Namely, most modern hardware lacks machine instructions for directly computing complex arithmetic on complex numbers. Instead, the library developer, or kernel author, must orchestrate computation on the real and imaginary components explicitly in order to implement multiplication and addition on complex scalars, which in many cases proves to be a more difficult programming task than in the real domain. Furthermore, for maintainers of BLAS, and BLAS-like library frameworks such as BLIS, supporting complex matrix multiplication doubles the number of assembly-coded matrix kernels that must be maintained. In other words, life would be simpler for DLA library developers if complex matrix multiplication was not necessary.

Of course, complex matrix multiplication will always be necessary. But what if complex matrix multiplication kernels were found to be unnecessary? To certain actors, particularly DLA library developers, such a finding would carry non-trivial consequence.

Recent work investigates whether (and to what degree of effectiveness) real domain matrix multiplication kernels can be repurposed and leveraged toward the implementation of complex matrix multiplication [9].

The authors develop a new class of algorithms that implement these so-called “induced methods” for matrix multiplication in the complex domain. Instead of relying on an assembly-coded complex kernel, as a conventional implementation would, these algorithms express complex matrix multiplication only in terms of real domain primitives.\(^1\)

We consider this article a companion and follow-up to this previous work, to which we will often refer [9]. For this reason, and for the sake of brevity, we omit much of the typical review of the literature on dense matrix multiplication, which we provide in aforementioned article. We also assume the reader has a basic understanding of the background provided by this previous article, either because he or she has already read the piece, or because he or she naturally possesses this familiarity independent of our work. Of course, we will review and summarize a reasonable amount of content specific to that article here, as needed, in order to properly set the stage for our present discussion.

### 1.1 Contributions

This article makes the following contributions:

- It introduces a new induced method—the 1M method—that relies upon only a single real matrix multiplication. As with the previous article, we introduce a family of algorithms and analyze issues germane to their high-performance implementations, including workspace, packing formats, cache behavior, multithreadability, and programming effort. A detailed review shows how 1M avoids virtually all of the challenges observed of the 4M method.

- It promotes code reuse and portability by continuing the previous article’s focus on solutions which may be cast in terms of real matrix multiplication kernels. Such solutions have clear implications for developer productivity, as they allow kernel programmers to focus their efforts on fewer and simpler kernels.

- It builds on the theme of the BLIS framework as a productivity multiplier [10], further demonstrating how complex matrix multiplication may be implemented with relatively minor modifications to the source code, and in such a way that results in immediate instantiation of complex implementations for all level-3 BLAS-like operations.

\(^1\)In [9], the authors use the term “primitive” to refer to a functional abstraction that implements a single real matrix multiplication. Such primitives are often not general purpose, and may come with significant prerequisites to facilitate their use.
• It demonstrates performance of 1M implementations that is not only superior to the previous effort based on the 4M method, but also competitive with solutions based on complex matrix kernels.

• It serves as a reference guide to the 1M implementations for complex matrix multiplication found within the BLIS framework, which is available to the community under an open source software license.2

We believe that these contributions are consequential because the 1M method effectively obviates the previous state-of-the-art established via the 4M method. Furthermore, we believe the thorough treatment of induced methods encompassed by the present article and its predecessor will have lasting archival as well as pedagogical value, to say nothing of the potential impact on developer productivity.

1.2 Notation
In this article, we continue the notation established in [9]. Specifically, we use uppercase Roman letters (e.g. A, B, and C) to refer to matrices, lowercase Roman letters (e.g. x, y, and z) to refer to vectors, and lowercase Greek letters (e.g. χ, ψ, and ζ) to refer to scalars. Subscripts are used typically to denote sub-matrices within a larger matrix (e.g. $A = \begin{pmatrix} A_0 & A_1 & \cdots & A_{n-1} \end{pmatrix}$) or scalars within a larger matrix or vector.

We also make extensive use of superscripts to denote the real and imaginary components of a scalar, vector, or (sub-)matrix. For example, $\alpha^r, \alpha^i \in \mathbb{R}$ denote the real and imaginary parts, respectively, of a scalar $\alpha \in \mathbb{C}$. Similarly, $A^r$ and $A^i$ refer to the real and imaginary parts of a complex matrix $A$, where $A^r$ and $A^i$ are themselves matrices with dimensions identical to $A$. Note that while this notation for real, imaginary, and complex matrices encodes information about content and origin, it does not encode how the matrices are actually stored. We will explicitly address storage details as implementation issues are discussed.

Also, at times we find it useful to refer to the real and imaginary elements of a complex object indistinguishably as fundamental elements (or f.e.). We also abbreviate floating-point operations as “flops” and memory operations as “memops”. We define the former to be a Multiply or Add (or Subtract) operation whose operands are fundamental elements and the latter to be a load or store operation on a single fundamental element. These definitions allow for a consistent accounting of complex computation relative to the real domain.

We also discuss cache and register block sizes that are key features of the matrix multiplication algorithm discussed elsewhere [10, 8, 9]. Unless otherwise noted, block sizes $n_C, m_C, k_C, m_R$, and $n_R$ refer to those appropriate for computation in the real domain. Complex domain block sizes will be denoted with a superscript $z$.

This article also discusses and references several hypothetical algorithms and functions. Unless otherwise noted, a call to function FUNC that implements $C := C + AB$ appears as $[C] := \text{FUNC}(A, B, C)$. We will also reference functions that access properties of matrices. For example, $m(A)$ and $n(A)$ would return the $m$ and $n$ dimensions of a matrix $A$, while $\text{rs}(B)$ and $\text{cs}(B)$ would return the row and column strides of $B$.

2 Background and review

2.1 Motivation
In [9], the authors list three primary motivating factors behind their effort to seek out methods for inducing complex matrix multiplication via real domain kernels:

• Productivity. By inducing complex matrix multiplication from real domain kernels, the number of kernels that must be supported would be halved. This allows the DLA library developers to focus on a smaller and simpler set of real domain kernels. This benefit would manifest most obviously when instantiating BLAS-like functionality on new hardware [8]

---

2The BLIS framework is available under the so-called “new” or “modified” or “3-clause” BSD license.
• **Portability.** Induced methods avoid dependence on complex domain kernels because they encode the idea of complex matrix product at a higher level. This would naturally allow us to encode such methods portably within a framework such as BLIS [10]. Once integrated into the framework, developers and users would benefit from the immediate availability of complex matrix multiplication implementations whenever real matrix kernels were present.

• **Performance.** Implementations of complex matrix multiplication that rely on real domain kernels would likely inherit the high-performance properties of those kernels. Any improvement to the real kernels would benefit both real and complex domains.

Thus, it is clear that finding a suitable induced method would carry significant benefit to DLA library and kernel developers.

### 2.2 The 3m and 4m methods

The authors of [9] presented two general ways of inducing complex matrix multiplication: the 3m method and the 4m method. These methods are then contrasted to the conventional approach, whereby a blocked matrix multiplication algorithm is executed with a complex domain kernel—one that implements complex arithmetic at the scalar level, in assembly language.

The 4m method begins with the classic definition of complex scalar multiplication and addition in terms of real and imaginary components of \( \chi, \psi, \zeta \in \mathbb{C} \):

\[
\begin{align*}
\zeta^r := \zeta^r + \chi^r \psi^r - \chi^i \psi^i \\
\zeta^i := \zeta^i + \chi^r \psi^i + \chi^i \psi^r
\end{align*}
\]  

(1)

We then observe that we can apply such a definition to complex matrices \( A \in \mathbb{C}^{m \times k} \), \( B \in \mathbb{C}^{k \times n} \), and \( C \in \mathbb{C}^{m \times n} \), provided that we can reference the real and imaginary parts as logically separate submatrices:

\[
\begin{align*}
C^r := C^r + A^r B^r - A^i B^i \\
C^i := C^i + A^r B^i - A^i B^r
\end{align*}
\]  

(2)

This definition expresses complex matrix multiplication in terms of four matrix products (hence the name 4M) and four matrix accumulations (i.e., additions or subtractions).

The 3m method relies on an algebraic equivalent of Eq. 2:

\[
\begin{align*}
C^r := C^r + A^r B^r - A^i B^i \\
C^i := C^i + (A^r + A^i)(B^r + B^i) - A^r B^r - A^i B^i
\end{align*}
\]

This reexpression reduces the number of matrix products to three, at the expense of increasing the number of accumulations from four to seven. However, when the cost of a matrix product greatly exceeds that of an accumulation, this trade-off can result in a net reduction in computational runtime.

The authors of [9] observe that both methods may be applied to any particular level of a blocked matrix multiplication algorithm, resulting in several algorithms, each exhibiting somewhat different properties. The blocked algorithm used in that article is shown in Figure 1 (left) and explained in detail in Section 2.1 of [9] and revisited in Section 2.4 of the present article.

Algorithms that implement the 3m method were found to yield “effective flops per second” performance that not only exceeded that of 4m, but also approached or exceeded the theoretical peak rate of the hardware. Unfortunately, these compelling results come at a cost: the numerical properties of implementations based on 3m are slightly less robust than that of algorithms based on the conventional approach or 4m. And although the author of [3] found that 3m was stable enough for most practical purposes, many applications will simply not be willing to stray from the numerical expectations implicit in conventional matrix multiplication. Thus, going forward, we will turn our attention away from 3m, and instead focus on the 4m as the standard reference method against which we will compare.
2.3 Previous findings

For the reader’s convenience, we will now summarize the key findings, observations, and other highlights from the previous article regarding algorithms and implementations based on the 4M method.

- Since the algorithms in the 4M family execute the same number of flops, the algorithms’ relative performance depends entirely on the (1) the number of memops executed and (2) the level of cache from which fundamental elements of matrices A and B (or rather, F.E. of the packed copies of these matrices, A′, B′, and B″) are reused. The number of memops is affected only by a halving of certain cache blocksize needed in order to leave cache footprints of A′′ and B′′ unchanged. The level from which F.E. are reused is determined by the level of the matrix multiplication algorithm to which the 4M method was originally applied. The lower the 4M method is applied, the higher the efficiency of data reuse from and movement through the cache hierarchy.

- The lowest level application, algorithm 4M_1A, efficiently moves F.E. of A, B, and C from main memory to the L1 cache only once per rank-k_{C} update, with virtually no excess movement due to incidental cache line proximity, and reuses F.E. from the L1 cache. It relies on a relatively simple packing format in which complex micro-panels are stored with real and imaginary F.E. separated into two consecutive real micro-panels, each with identical register blocksize and k dimensions. Algorithm 4M_1A requires negligible workspace (approximately equal to the storage capacity of the vector register set), is well-suited for multithreading, and is minimally disruptive to the encoding within the BLIS framework.

- The conventional approach can be viewed as a special case of 4M in which F.E. are reused from registers, rather than cache. In this way, a conventional implementation embodies the lowest-level application of 4M possible, in which the method is applied to individual scalars (and, typically, then optimally encoded via vector instructions).

- The way complex numbers are stored has a significant effect on performance. Interleaved pair-wise storage of real and imaginary values naturally favors implementations that reuse F.E. from registers, as is common with conventional implementations. However, this storage is awkward for algorithms based on 4M (and 3M for that matter), largely because it prevents the use of vector instructions for loading and storing F.E. of C⁴, as these instructions implicitly must load or store several consecutive values. Indeed, 4M_1A suffers from a quadrupling⁵ of the number of memops on C, in addition to being forced to access these F.E. in a non-contiguous manner. If, however, applications stored complex matrices with real and imaginary parts separated, the penalty paid by induced methods would be partially mitigated.

- While observed performance of low-level applications of 4M is decent, far exceeding an unoptimized reference implementation, it not only falls short of a comparable conventional solution based on a complex kernel, it appears to fall short of its real domain “benchmark”—that is, the performance of a similar problem size in the real domain computed by an optimized algorithm using the same real domain kernel. This level of performance may be disappointing for some, even if it achieves 90-95% of what is possible with a complex kernel. The authors conclude that its somewhat attenuated performance would relegate 4M, in practice, to serving mostly as a placeholder, to be used when complex kernels have not yet been written, rather than a competitive replacement that makes complex kernels unnecessary.

---

3 Here, the term “reuse” refers to the reuse of F.E. that corresponds to the recurrence of A′′, A′, B′, and B″ in Eq. 2, not the reuse of whole (complex) elements that naturally occurs in the execution of the matrix multiplication algorithm in Figure 1 (left).

4 The traditional pair-wise storage is also awkward for 4M algorithms during the packing of data from A and B, but this effect is not nearly as dramatic.

5 A factor of two comes from the fact that, as shown in Eq. 2, 4M touches C^r and C^s twice each, while another factor of two comes from the cache blocksize scaling required on k^C in order to maintain the cache footprints of micro-panels of A′′ and B′′.
Figure 1: Left: An illustration of the algorithm for computing high-performance matrix multiplication, taken from [9], which expresses computation in terms of a so-called “block-panel” subproblem (macro-kernel). Right: An alternate algorithm that expresses the operation in terms of “panel-block” matrix multiplication.

2.4 Revisiting the matrix multiplication algorithm

In this section, we review a common algorithm for high-performance matrix multiplication on conventional microprocessor architectures. This algorithm was first reported on in [2] and further refined in [10]. Figure 1 (left) illustrates the key features of this algorithm.

The current state-of-the-art formulation of the matrix multiplication algorithm consists of six loops, the last of which resides within a micro-kernel that is typically highly optimized for the target hardware. These loops partition the matrix operands using carefully chosen cache \((n_C, k_C, \text{ and } m_C)\) and register \((m_R \text{ and } n_R)\) blockizes that result in submatrices residing favorably at various levels of the cache hierarchy, so as to allow data to be reused many times. In addition, submatrices of \(A\) and \(B\) are copied (“packed”) to temporary workspace matrices (\(\hat{A}_i\) and \(\hat{B}_p\), respectively) in such a way that allows the micro-kernel to subsequently access matrix elements contiguously in memory, which improves cache and TLB performance. The cost of this packing is amortized over enough computation that its impact on overall performance is negligible for all but the smallest problems. At the lowest level, within the micro-kernel loop, an \(m_R \times 1\) micro-column vector and a \(1 \times n_R\) micro-row vector are loaded from the current micro-panels of \(\hat{A}_i\) and \(\hat{B}_p\), respectively, so...
that the outer product of these vectors may be computed to update the corresponding \( m_R \times n_R \) submatrix, or “micro-tile” of \( C \). The individual floating-point operations that constitute these tiny rank-1 updates are oftentimes executed via vector instructions (if the architecture supports them) in order to maximize utilization of the floating-point unit(s).

The algorithm captured by Figure 1 (left) forms the basis for all level-3 implementations found in the BLIS framework (as of this writing). This algorithm is based on a so-called block-panel matrix multiplication.\(^6\) The register \((m_R, n_R)\) and cache \((m_C, k_C, n_C)\) blocksizes labeled in the algorithmic diagram are typically chosen by the kernel developer as a function of hardware characteristics, such as the vector register set, cache sizes, and cache associativity. The authors of [5] present an analytical model for identifying suitable (if not optimal) values for these blocksizes.

Later in this article, we will observe that it may be desirable in some situations to have access to a companion algorithm that casts its largest cache-bound subproblem (computed by the macro-kernel) in terms of panel-block multiplication. This alternative algorithm is depicted in Figure 1 (right).\(^7\) Either algorithm can be used to implement all of the level-3 operations defined by the original BLAS or the BLIS framework, including the most popular and general-purpose operation, general matrix multiplication (GEMM).\(^8\)

3 1m method

The primary motivation for seeking a better induced method comes from the observation that 4m inherently must update real and imaginary F.E. of \( C \) individually and in separate steps (due to traditional pair-wise storage), and, in the case of 4m_1A, twice as frequently (due to the algorithm’s half-of-optimal cache blocksize \( k_C \)). As reviewed in Section 2.3, this imposes a significant drag on performance. If there existed an induced method that could update real and imaginary elements in one step, it may conveniently avoid both issues.

3.1 Derivation

Consider the classic definition of complex scalar multiplication and accumulation, shown in Eq. 1, refactored and expressed in terms of matrix and vector notation, where we use \(+ = \) operator to concisely denote element-wise accumulation:

\[
\begin{pmatrix}
\zeta^r \\
\zeta^i \\
\end{pmatrix} = \begin{pmatrix}
\chi^r - \chi^i \\
\chi^i \\
\end{pmatrix} \begin{pmatrix}
\psi^r \\
\psi^i \\
\end{pmatrix}
\]

(3)

Here, we have a singleton complex matrix multiplication problem that can naturally be expressed as a tiny real matrix multiplication where \( m = k = 2 \) and \( n = 1 \).

From this, we make the following key observation: If we pack \( \chi \) to \( \tilde{A}_i \) in such a way that duplicates \( \chi^r \) and \( \chi^i \) to the second column of the micro-panel (while also swapping their order and negating \( \chi^i \)), and if we pack \( \psi \) to \( \tilde{B}_p \) such that \( \psi^i \) is stored to the second row of the micro-panel (which, granted, only has one column), then a real domain GEMM micro-kernel executed on those micro-panels will compute the correct result in the complex domain, and do so with a single invocation of that micro-kernel.

Thus, Eq. 3 serves as a packing template that hints at how the data must be stored, which we may generalize. We replace \( \chi, \psi, \zeta \) with \( \alpha, \beta, \gamma \) so as to more intuitively denote complex elements of matrices \( A \), \( B \), and \( C \), respectively. Also, let us apply the Eq. 3 to the special case of \( m = 3 \), \( n = 4 \), and \( k = 2 \) to better

\(^6\)This terminology describes the shape of the typical problem computed by the macro-kernel, i.e. the second loop around the micro-kernel.

\(^7\)We renamed the matrices (and indices) in the panel-block algorithm in Figure 1 (right) so that \( B \) is the left-hand matrix product operand while \( A \) appears on the right. This allows \( \tilde{A}_i \) and \( \tilde{B}_p \) to remain the matrices that reside in the L2 and L3 caches, respectively.

\(^8\)Throughout this document, we will sometimes interchangeably use the term GEMM to refer to matrix multiplication.
observe the general pattern.

\[
\begin{pmatrix}
\gamma^r_0 & \gamma^i_0 & \gamma^r_2 & \gamma^i_3 \\
\gamma^r_0 & \gamma^i_0 & \gamma^r_1 & \gamma^i_3 \\
\gamma^r_1 & \gamma^i_1 & \gamma^r_2 & \gamma^i_3 \\
\gamma^r_0 & \gamma^i_0 & \gamma^r_1 & \gamma^i_2 \\
\gamma^r_0 & \gamma^i_0 & \gamma^r_1 & \gamma^i_1 \\
\gamma^r_0 & \gamma^i_0 & \gamma^r_1 & \gamma^i_2
\end{pmatrix}
= 
\begin{pmatrix}
\alpha^r_{00} & -\alpha^i_{00} & \alpha^i_{01} & -\alpha^i_{01} \\
\alpha^r_{00} & \alpha^i_{00} & \alpha^i_{01} & -\alpha^i_{01} \\
\alpha^r_{10} & -\alpha^i_{10} & \alpha^i_{11} & -\alpha^i_{11} \\
\alpha^r_{10} & \alpha^i_{10} & \alpha^i_{11} & \alpha^i_{11} \\
\alpha^r_{20} & -\alpha^i_{20} & \alpha^i_{21} & -\alpha^i_{21} \\
\alpha^r_{20} & \alpha^i_{20} & \alpha^i_{21} & \alpha^i_{21}
\end{pmatrix}
\begin{pmatrix}
\beta^r_{00} & \beta^i_{00} & \beta^i_{01} & \beta^i_{01} \\
\beta^r_{01} & \beta^i_{01} & \beta^i_{02} & \beta^i_{02} \\
\beta^r_{11} & \beta^i_{11} & \beta^i_{12} & \beta^i_{12} \\
\beta^r_{12} & \beta^i_{12} & \beta^i_{13} & \beta^i_{13}
\end{pmatrix}
\]

(4)

From this, we can make the following observations:

- The complex matrix multiplication \( C := C + AB \) with \( m = 3, n = 4, \) and \( k = 2 \) becomes a real matrix multiplication with \( m = 6, n = 4, \) and \( k = 4 \). In other words, the \( m \) and \( k \) dimensions are doubled for the purposes of the real GEMM primitive.

- If the primitive is the real GEMM micro-kernel, and we assume that matrices \( A \) and \( B \) above represent column-stored and row-stored micro-panels from \( \tilde{A} \) and \( \tilde{B}_p \), respectively, and also that the dimensions are conformal to the register block sizes of this micro-kernel (i.e., \( m = m_R \) and \( n = n_R \)) then the micro-panels of \( A \), are packed from a \( \frac{1}{2} m_R \times \frac{1}{2} k_C \) submatrix of \( A \), which, when expanded in the special packing format, appears as the \( m_R \times k_C \) micro-panel that the real GEMM micro-kernel expects.

- Similarly, the micro-panels of \( \tilde{B}_p \) are packed from a \( \frac{1}{2} k_C \times n_R \) submatrix of \( B \), which, when reordered into a second special packing format, appears as the \( k_C \times n_R \) micro-panel that the real GEMM micro-kernel expects.

It is easy to see by inspection that the real matrix multiplication implied by Eq. 4 induces the desired complex matrix multiplication, provided the computation is executed on matrices \( A \) and \( B \) that have been packed into special formats. We will refer to the packing format used on matrix \( A \) above as the 1E format, since the F.E. are expanded (i.e., duplicated to the next column and then swapped, with the imaginary component negated). Similarly, we will refer to the packing format used on matrix \( B \) above as the 1R format, since the F.E. are merely reordered (i.e., imaginary elements moved to the next row).

### 3.2 Two variants

Notice that implicit in the 1M method suggested by Eq. 4 is the fact that matrix \( C \) is stored by columns. This assumption is important; \( C \) must be stored by columns in order to allow the real domain micro-kernel to correctly update F.E. of \( C \) with the corresponding F.E. from the matrix product \( AB \).

Suppose, for a moment, that we instead refactored and expressed Eq. 1 as follows:

\[
(\zeta^r \quad \zeta^i) + = \begin{pmatrix}
\chi^r \\
\chi^i
\end{pmatrix}
\begin{pmatrix}
\psi^r & -\psi^i \\
\psi^i & \psi^r
\end{pmatrix}
\]

(5)

This gives us a different template, one that implies different packing formats for matrices \( A \) and \( B \). Applying Eq. 5 to the special case of \( m = 4, n = 3, \) and \( k = 2 \), yields:

\[
\begin{pmatrix}
\gamma^r_{00} & \gamma^i_{00} & \gamma^r_{01} & \gamma^i_{01} & \gamma^r_{02} & \gamma^i_{02} \\
\gamma^r_{10} & \gamma^i_{10} & \gamma^r_{11} & \gamma^i_{11} & \gamma^r_{12} & \gamma^i_{12} \\
\gamma^r_{20} & \gamma^i_{20} & \gamma^r_{21} & \gamma^i_{21} & \gamma^r_{22} & \gamma^i_{22} \\
\gamma^r_{30} & \gamma^i_{30} & \gamma^r_{31} & \gamma^i_{31} & \gamma^r_{32} & \gamma^i_{32}
\end{pmatrix}
= 
\begin{pmatrix}
\alpha^r_{00} & \alpha^i_{00} & \alpha^i_{01} & \alpha^i_{01} \\
\alpha^r_{10} & \alpha^i_{10} & \alpha^i_{11} & \alpha^i_{11} \\
\alpha^r_{20} & \alpha^i_{20} & \alpha^i_{21} & \alpha^i_{21} \\
\alpha^r_{30} & \alpha^i_{30} & \alpha^i_{31} & \alpha^i_{31}
\end{pmatrix}
\begin{pmatrix}
\beta^r_{00} & \beta^i_{00} & \beta^i_{01} & \beta^i_{01} \\
\beta^r_{10} & \beta^i_{10} & \beta^i_{11} & \beta^i_{11} \\
\beta^r_{20} & \beta^i_{20} & \beta^i_{21} & \beta^i_{21} \\
\beta^r_{30} & \beta^i_{30} & \beta^i_{31} & \beta^i_{31}
\end{pmatrix}
\]

(6)

In this variant, we see that matrix \( B \), not \( A \), is stored according to the 1E format (where columns become rows), while matrix \( A \) is stored according to the 1R format (where rows become columns). Also, we can see that matrix \( C \) must be stored by rows in order to allow the real GEMM micro-kernel to correctly update its F.E. with the corresponding values from the matrix product.

Henceforth, we will refer to the 1M variant exemplified in Eq. 4 as 1M_C, since it is predicated on column storage on the output matrix. Similarly, we will refer to the variant depicted in Eq. 6 as 1M_R, since it assumes the output matrix is stored by rows.
which is quite common. Order as would a conventional assembly-based implementation based on the so-called broadcast method, however, by inspection we can see that the accumulation of intermediate terms occurs in the same dimension, k. As with the 1M variant of 1m, the appropriate blocksizes to use with gemm are broadcast from \( \tilde{A}_1 \) and \( \tilde{B}_p \), respectively. These values are analogous to the leading dimensions of matrices stored by columns or rows. In BLIS micro-kernels, typically \( m_R = m_P \) and \( n_R = n_P \), but sometimes the kernel author may find it useful for \( m_R < m_P \) or \( n_R < n_P \).

### 3.3 Determining complex blocksizes

As we alluded in Section 3.1, the appropriate blocksizes to use with 1M are a function of the real domain blocksizes. This makes sense, since the idea is to fool the real GEMM micro-kernel, and the various loops for register and cache blocking around the micro-kernel, into thinking that it is computing a real domain matrix multiplication. Which blocksizes must be modified (halved) and which are used unchanged depends on the variant of 1M being executed (and specifically, which matrix is packed according to the 1E format).

Table 1 summarizes the complex domain blocksizes prescribed for 1M_C and 1M_R as a function of the real domain values. This is somewhat analogous to the blocksize scaling described in Tables II and III in [9]. However, there the scaling was optional in the sense that different scaling factors would still work, albeit perhaps with a performance penalty. Here, in the case of Table 1, some scaling factors (namely, on \( m_R \) or \( n_R \)) are required in order for the 1M algorithm to function properly.

<table>
<thead>
<tr>
<th>Variant</th>
<th>Blocksizes, in terms of real domain values, required for ...</th>
</tr>
</thead>
<tbody>
<tr>
<td>1M_C</td>
<td>( \frac{1}{2} k_C ) ( \frac{1}{2} m_C ) ( n_C ) ( \frac{1}{2} m_R ) ( m_P ) ( n_R ) ( n_P )</td>
</tr>
<tr>
<td>1M_R</td>
<td>( \frac{1}{2} k_C ) ( m_C ) ( \frac{1}{2} n_C ) ( m_R ) ( \frac{1}{2} n_R ) ( n_P )</td>
</tr>
</tbody>
</table>

Table 1: 1M complex domain blocksizes as a function of real domain blocksizes. Note: Blocksizes \( m_P \) and \( n_P \) represent the so-called “packing dimensions” for the micro-panels of \( \tilde{A}_1 \) and \( \tilde{B}_p \), respectively. These values are analogous to the leading dimensions of matrices stored by columns or rows. In BLIS micro-kernels, typically \( m_R = m_P \) and \( n_R = n_P \), but sometimes the kernel author may find it useful for \( m_R < m_P \) or \( n_R < n_P \).

### 3.4 Numerical stability

As with the 3M and 4M methods of [9], a formal analysis of numerical stability is beyond the scope of this article. However, by inspection we can see that the accumulation of intermediate terms occurs in the same order as would a conventional assembly-based implementation based on the so-called broadcast method, which is quite common. The only other factor that would affect stability is the cache blocksize in the \( k \) dimension, \( k_C \). However, this blocksize value already varies across architectures, and also sometimes between...
Algorithm: \[ C \] := \text{RMMPB}( A, B, C )

<table>
<thead>
<tr>
<th>for ( j = 0 : n - 1 : n_C )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Identify ( B_j, C_j ) from ( B, C )</td>
</tr>
<tr>
<td>for ( p = 0 : k - 1 : k_C )</td>
</tr>
<tr>
<td>Identify ( A_p, B_{jp} ) from ( A, B_j )</td>
</tr>
<tr>
<td>PACK ( B_{jp} \rightarrow \hat{B}_p )</td>
</tr>
<tr>
<td>for ( i = 0 : m - 1 : m_C )</td>
</tr>
<tr>
<td>Identify ( A_{pi}, C_{ji} ) from ( A_p, C_j )</td>
</tr>
<tr>
<td>PACK ( A_{pi} \rightarrow \hat{A}_i )</td>
</tr>
<tr>
<td>for ( h = 0 : n_C - 1 : n_R )</td>
</tr>
<tr>
<td>Identify ( \hat{B}<em>{ph}, C</em>{jih} ) from ( \hat{B}<em>p, C</em>{ji} )</td>
</tr>
<tr>
<td>for ( l = 0 : m_C - 1 : m_R )</td>
</tr>
<tr>
<td>Identify ( \hat{A}<em>{il}, C</em>{jihl} ) from ( \hat{A}<em>i, C</em>{jih} )</td>
</tr>
<tr>
<td>( C_{jihl} := \text{rkern}( \hat{A}<em>{il}, \hat{B}</em>{ph}, C_{jihl} ) )</td>
</tr>
</tbody>
</table>

Algorithm: \[ C \] := \text{RMMPB}( B, A, C )

<table>
<thead>
<tr>
<th>for ( j = 0 : m - 1 : n_C )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Identify ( B_j, C_j ) from ( B, C )</td>
</tr>
<tr>
<td>for ( p = 0 : k - 1 : k_C )</td>
</tr>
<tr>
<td>Identify ( A_p, B_{jp} ) from ( A, B_j )</td>
</tr>
<tr>
<td>PACK ( B_{jp} \rightarrow \hat{B}_p )</td>
</tr>
<tr>
<td>for ( i = 0 : n - 1 : m_C )</td>
</tr>
<tr>
<td>Identify ( A_{pi}, C_{ji} ) from ( A_p, C_j )</td>
</tr>
<tr>
<td>PACK ( A_{pi} \rightarrow \hat{A}_i )</td>
</tr>
<tr>
<td>for ( h = 0 : n_C - 1 : n_R )</td>
</tr>
<tr>
<td>Identify ( \hat{B}<em>{ph}, C</em>{jih} ) from ( \hat{B}<em>p, C</em>{ji} )</td>
</tr>
<tr>
<td>for ( l = 0 : m_C - 1 : m_R )</td>
</tr>
<tr>
<td>Identify ( \hat{A}<em>{il}, C</em>{jihl} ) from ( \hat{A}<em>i, C</em>{jih} )</td>
</tr>
<tr>
<td>( C_{jihl} := \text{rkern}( \hat{A}<em>{il}, \hat{B}</em>{ph}, C_{jihl} ) )</td>
</tr>
</tbody>
</table>

Figure 2: Abbreviated pseudo-codes for implementing the general matrix multiplication algorithms depicted in Figure 1. Here, \text{rkern} calls a real domain GEMM micro-kernel. Note that the only difference between the algorithms is that the loop bounds on the 5th and 3rd loops around the micro-kernel are swapped. Also, \text{RMMPB} labels the matrix product operands differently, with \( B \) referring to the left-hand matrix and \( A \) referring to the right-hand matrix.

real and complex micro-kernels on the same architecture, so this factor is not new. Therefore, we would expect the \text{1M} method (both in its \text{1M}_{LR} and \text{1M}_{LC} forms) to exhibit the numerical properties that were virtually identical to a solution based on complex micro-kernels.

3.5 Algorithms

3.5.1 General algorithm

Before investigating \text{1M} method algorithms, we will first provide algorithms for computing real matrix multiplication to serve as a reference for the reader. Specifically, we provide pseudo-code, targeting the real domain, for the two algorithms depicted in Figure 1. These algorithms are shown as \text{RMMPB} and \text{RMMPB} in Figure 2.

Notice that in the context of \text{RMMPB}, the 5th and 3rd loops around the micro-kernel (\text{rkern}) iterate over different dimensions, relative to those loops in \text{RMMPB}, but the blocksizes used in those loops are the same. Thus, \( n_C \) is used to partition in the \( m \) dimension while \( m_C \) is used to partition in the \( n \) dimension. The authors of [10] initially studied matrix multiplication only in the context of an algorithm based on block-panel subproblems, which led them to name the blocksizes after the dimensions along which they partitioned. When generalized to support the panel-block algorithm, these blocksizes refer not to a specific dimension but rather to a specific level of cache being targeted. That is, \( m_C \) targets the L2 cache while \( n_C \) blocks for the L3 cache. However, for historical purposes, we will continue to use the original names throughout this article. We also choose to label the matrix product operands differently in the panel-block setting; the left-hand matrix is named \( B \) while the right-hand matrix is named \( A \).
3.5.2 1m-specific algorithm

When applied to the block-panel algorithm depicted in Figure 1 (left), 1m_C and 1m_R yield nearly identical algorithms whose differences can be encoded within a few conditional statements within key parts of the high and low levels of code. We will refer to these specific algorithms as 1m_C_BP and 1m_R_BP, respectively. Figure 3 shows a hybrid algorithm that encompasses both, supporting row- and column-stored matrices C.

As with the 3M and 4M algorithms in [9], we have separated the so-called virtual micro-kernel into a separate function, shown in Figure 3 (right). This 1M-specific virtual micro-kernel, Vkm, largely consists of a call to the real domain micro-kernel rkern, with some added special case handling and book-keeping needed to properly induce complex matrix multiplication. Some of the details of the virtual micro-kernel will be addressed shortly.

Note that 1m_C and 1m_R can also be applied to the panel-block algorithm depicted in Figure 1 (right), yielding 1m_C_PB and 1m_R_PB. We omit pseudo-code for these algorithms for the sake of brevity.

3.6 Performance properties and algorithmic pairs

Table 2 tallies the total number of F.E. memops required by the block-panel and panel-block algorithms of both variants of 1m (1m_C and 1m_R). For comparison, we also include the corresponding memop counts for a selection of 4M algorithms as well as a conventional assembly-based solution, as first published in Table III in [9].

We can hypothesize that the observed performance signatures of 1m_C_BP and 1m_R_BP may be slightly different, because each places the additional memop overhead unique to 1m on different parts of the computation. This stems from the fact that there exists an asymmetry in the assignment of packing formats.

---

<table>
<thead>
<tr>
<th>Algorithm: ( [C] := 1M_{[CR]}_{BP}(A, B, C) )</th>
<th>( [C] := Vkm(A, B, C) )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Set bool colStore if rs(( C )) = 1</td>
<td></td>
</tr>
<tr>
<td>for ((j = 0 : n - 1 : n_C))</td>
<td></td>
</tr>
<tr>
<td>Identify ( B_j, C_j ) from ( B, C )</td>
<td></td>
</tr>
<tr>
<td>for ((p = 0 : k - 1 : k_C))</td>
<td></td>
</tr>
<tr>
<td>Identify ( A_p, B_{jp} ) from ( A, B_j )</td>
<td></td>
</tr>
<tr>
<td>if colStore Pack1r ( B_{jp} \rightarrow \tilde{B}_p )</td>
<td></td>
</tr>
<tr>
<td>else Pack1e ( B_{jp} \rightarrow B_p )</td>
<td></td>
</tr>
<tr>
<td>for ((i = 0 : m - 1 : m_C))</td>
<td></td>
</tr>
<tr>
<td>Identify ( A_{pi}, C_{ji} ) from ( A_p, C_j )</td>
<td></td>
</tr>
<tr>
<td>if colStore Pack1e ( A_{pi} \rightarrow \tilde{A}_i )</td>
<td></td>
</tr>
<tr>
<td>else Pack1r ( A_{pi} \rightarrow \tilde{A}_i )</td>
<td></td>
</tr>
<tr>
<td>for ((h = 0 : n_C - 1 : n_R))</td>
<td></td>
</tr>
<tr>
<td>Identify ( \tilde{B}<em>{ph}, C</em>{jih} ) from ( \tilde{B}<em>p, C</em>{ji} )</td>
<td></td>
</tr>
<tr>
<td>for ((l = 0 : m_C - 1 : m_R))</td>
<td></td>
</tr>
<tr>
<td>Identify ( \tilde{A}<em>{il}, C</em>{jihl} ) from ( \tilde{A}<em>i, C</em>{jih} )</td>
<td></td>
</tr>
<tr>
<td>( C_{jihl} := Vkm(\tilde{A}<em>{il}, \tilde{B}</em>{ph}, C_{jihl}) )</td>
<td></td>
</tr>
<tr>
<td>Acquire workspace ( W )</td>
<td></td>
</tr>
<tr>
<td>Determine if using ( W ); set usew</td>
<td></td>
</tr>
<tr>
<td>if ( usew )</td>
<td></td>
</tr>
<tr>
<td>Alias ( C_{\text{use}} \leftarrow W, C_{\text{in}} \leftarrow 0 )</td>
<td></td>
</tr>
<tr>
<td>else</td>
<td></td>
</tr>
<tr>
<td>Alias ( C_{\text{use}} \leftarrow C, C_{\text{in}} \leftarrow C )</td>
<td></td>
</tr>
<tr>
<td>Set bool colStore if rs(( C_{\text{use}} )) = 1</td>
<td></td>
</tr>
<tr>
<td>if ( \text{COLSTORE} ) ( 1m \times 2 = 2 )</td>
<td></td>
</tr>
<tr>
<td>else ( rs(C_{\text{use}}) = 2 )</td>
<td></td>
</tr>
<tr>
<td>( N(A) \times 2; m(B) \times 2 )</td>
<td></td>
</tr>
<tr>
<td>( C_{\text{use}} := \text{rkern}(A, B, C_{\text{in}}) )</td>
<td></td>
</tr>
<tr>
<td>if ( usew )</td>
<td></td>
</tr>
<tr>
<td>( C := W )</td>
<td></td>
</tr>
</tbody>
</table>

---

It is important to note that the renaming in panel-block algorithms such as rmmpb does not necessarily represent a swapping of the operands, as would happen if the entire operation were transposed to \( C^T := B^T A^T \), nor does it represent a change to the operation’s interface. rmmpb and rmmpb simply use different names for the left- and right-hand matrices within the algorithm descriptions. This renaming allows us to reference the L2-bound block and L3-bound panel as \( \tilde{A}_i \) and \( \tilde{B}_p \), respectively, in both algorithms.
to matrices in each 1M variant. Specifically, additional memops are required during the initial packing and the movement between caches for the matrix packed according to the algorithm’s performance properties listed in Table 3 to align indistinguishably with that of similar symmetry would be observed with similar properties sometimes make it convenient to refer to them simultaneously as an algorithmic pair. The bound packed matrix \( \tilde{\mathbf{M}} \) differs from the source operand. Also, if the operand is reused as well as a measure of memory movement efficiency. The information listed for we expect and similarities sometimes make it convenient to refer to them simultaneously as an algorithmic pair. The bound packed matrix \( \tilde{\mathbf{M}} \) is formed with differences in \( \beta \), such as when blocksize scaling of \( \beta \) is required, the precise value \( \left\lceil \frac{n}{\beta} \right\rceil \) is expressed as \( \frac{2n}{\beta} \). These simplifications allow easier comparison between algorithms while still providing meaningful approximations.

Table 2: F.E. memops incurred by various algorithms, broken down by stage of computation.

\[ \begin{array}{|c|c|c|c|c|}
\hline
\text{Algorithm} & \text{update micro-tiles}^b & \text{pack} & \text{move} & \text{move} \\
& \mathbf{C}' \rightarrow \mathbf{C}^a & \tilde{\mathbf{A}}_i & \tilde{\mathbf{A}}_i & \tilde{\mathbf{B}}_p \\
\hline
4M_H & 8mn \frac{k}{k_C} & 8mk \frac{n}{n_C} & 4mk \frac{n}{n_R} & 8kn & 4kn \frac{m}{m_C} \\
4M_IB & 8mn \frac{k}{k_C} & 8mk \frac{m}{n_C} & 4mk \frac{n}{n_R} & 8kn & 4kn \frac{2m}{m_C} \\
4M_IA & 8mn \frac{2k}{k_C} & 8mk \frac{n}{n_C} & 4mk \frac{n}{n_R} & 8kn & 4kn \frac{m}{m_C} \\
\text{assembly} & 4mn \frac{k}{k_C} & 4mk \frac{n}{n_C} & 2mk \frac{n}{n_R} & 4kn & 2kn \frac{m}{m_C} \\
1M_C_BP & 2mn \frac{2k}{k_C} & 6mk \frac{n}{n_C} & 4kn \frac{m}{n_R} & 4km & 2kn \frac{2m}{m_C} \\
1M_R_PB & 6kn \frac{m}{n_C} & 4kn \frac{m}{n_R} & 6kn & 4kn \frac{m}{m_C} \\
1M_R_BP & 4mk \frac{n}{n_C} & 2mk \frac{2n}{n_R} & 6kn & 4kn \frac{m}{m_C} \\
1M_C_PB & 4kn \frac{m}{n_C} & 2kn \frac{2m}{n_R} & 6mk & 4mk \frac{n}{m_C} \\
\hline
\end{array} \]

Table 2: F.E. memops incurred by various algorithms, broken down by stage of computation.

\(^a\) We express the number of iterations executed in the 5th, 4th, 3rd, and 2nd loops as \( \frac{n}{n_C}, \frac{k}{k_C}, \frac{m}{m_C}, \text{ and } \frac{n}{n_R} \) (with \( m \) and \( n \) swapped for panel-block algorithms 1M_C_BP and 1M_R_PB). The precise number of iterations along a dimension \( x \) using a cache blocksize \( x_C \) would actually be \( \left\lceil \frac{x}{x_C} \right\rceil \). Similarly, when blocksize scaling of \( \beta \) is required, the precise value \( \left\lfloor \frac{x}{\beta x_C} \right\rfloor \) is expressed as \( \frac{2x}{\beta x_C} \). These simplifications allow easier comparison between algorithms while still providing meaningful approximations.

\(^b\) As described in Section 3.8.1, \( m_R \times n_R \) workspace sometimes becomes mandatory, such as when \( \beta^t \neq 0 \). When workspace is employed in a 4x-based algorithm, the number of F.E. memops incurred updating the micro-tile typically doubles.
Whenever possible, the BLIS framework will perform logical transpositions
preference
Within the BLIS framework, micro-kernels are registered with a property that describes their output
3.7.1 Micro-kernel storage preference
3.7 Algorithm-sensitive details
Now that we have established two algorithmic pairs—
3.7.2 The panel-block algorithm
and
Algorithmic variant must use a micro-kernel that prefers to access
Algorithmic variant must employ a micro-kernel that prefers

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Total F.E. memops required</th>
<th>Level from which F.E. of matrix (X) are reused, and (l_{L1}): # of times each cache line is moved into the L1 cache (per rank-(k_C) update).</th>
</tr>
</thead>
<tbody>
<tr>
<td>4M_H</td>
<td>(8mn \left( \frac{k}{k_C} + 4mk + \frac{2n}{n_C}m + \frac{n}{n_R} + 2kn \right)) (4 + \frac{2m}{m_C})</td>
<td>(C) (t^C_{L1})</td>
</tr>
<tr>
<td>4M_IB</td>
<td>(8mn \left( \frac{k}{k_C} + 4mk + \frac{4n}{n_C} + \frac{n}{n_R} + 2kn \right)) (4 + \frac{4m}{m_C})</td>
<td>MEM 4</td>
</tr>
<tr>
<td>4M_1A</td>
<td>(8mn \left( \frac{2k}{k_C} + 4mk + \frac{2n}{n_C} + \frac{n}{n_R} + 2kn \right)) (4 + \frac{2m}{m_C})</td>
<td>L2 (2^n)</td>
</tr>
<tr>
<td>assembly</td>
<td>(4mn \left( \frac{k}{k_C} + 2mk + \frac{2n}{n_C} + \frac{n}{n_R} + 2kn \right)) (2 + \frac{m}{m_C})</td>
<td>L1 (1^a)</td>
</tr>
<tr>
<td>1M_C_BP</td>
<td>(2mn \left( \frac{2k}{k_C} + 2mk + \frac{3n}{n_C} + \frac{2n}{n_R} + 2kn \right)) (2 + \frac{2m}{m_C})</td>
<td>REG 1</td>
</tr>
<tr>
<td>1M_R_PB</td>
<td>(2mn \left( \frac{2k}{k_C} + 2mk + \frac{2n}{n_C} + \frac{n}{n_R} + 2kn \right)) (2 + \frac{m}{m_C})</td>
<td>REG 1</td>
</tr>
<tr>
<td>1M_R_BP</td>
<td>(2mn \left( \frac{2k}{k_C} + 2mk + \frac{3n}{n_C} + \frac{2n}{n_R} + 2kn \right)) (3 + \frac{2m}{m_C})</td>
<td>REG 1</td>
</tr>
<tr>
<td>1M_C_PB</td>
<td>(2mn \left( \frac{2k}{k_C} + 2kn \frac{2m}{n_C} + \frac{2n}{n_R} + 2kn \right)) (3 + \frac{2n}{m_C})</td>
<td>REG 1</td>
</tr>
</tbody>
</table>

\(^a\) This assumes that the micro-tile is not evicted from the L1 cache during the next call to \(rkern\).
\(^b\) In the case of 1M algorithms, we consider F.E. of \(A\) and \(B\) to be “reused” from the level of cache in which the 1E-formatted matrix resides.

### 3.7.1 Micro-kernel storage preference

Within the BLIS framework, micro-kernels are registered with a property that describes their output
preference—that is, the semantic row or column orientation of the vector registers that load and store \(C\).\(^{11}\)
Whenever possible, the BLIS framework will perform logical transpositions\(^{12}\) so that the apparent storage of \(C\) matches the preference property of the micro-kernel being used. This guarantees that the micro-kernel will be able to load and store F.E. of \(C\) using vector instructions.

This preference property is merely an interesting performance detail for conventional implementations (real and complex). However, in the case of 1M, it becomes rather important for constructing a correctly-functioning implementation. Specifically, the micro-kernel’s storage preference determines whether the 1M_C or 1M_R algorithm is prescribed. Generally speaking, a 1M_C algorithmic variant must employ a micro-kernel that prefers to access \(C\) by columns, while a 1M_R algorithmic variant must use a micro-kernel that prefers to access \(C\) by rows. An exception (or at least a caveat) to this rule is discussed in Section 3.7.3.

### 3.7.2 The panel-block algorithm

Now that we have established two algorithmic pairs—1M_C_BP/1M_R_PB and 1M_R_BP/1M_C_PB—the question becomes, why choose one algorithm over another within the same pair? It turns out that, at least in the context of 1M, we suspect that the reasons apply to limited scenarios, and that otherwise the two algorithms are interchangeable.

\(^{11}\)Even though micro-kernels are always registered as having a row or column preference for the purposes of accessing \(C\), these kernels still support matrices stored with general stride. However, 1m will never exercise the real micro-kernel’s built-in support for general stride. This topic is discussed in Section 3.8.1.

\(^{12}\)This amounts to a swapping of the row and column strides, and a swapping of the \(m\) and \(n\) dimensions.
### Small \(m\) or \(n\) dimensions

One reason to consider the panel-block algorithm is that it is somewhat more forgiving of problems where the \(m\) dimension is relatively small. To understand this phenomenon, let us instead begin by explaining why the block-panel algorithm can tolerate small \(n\) somewhat better than small \(m\).

Notice that, according to Figure 1 (left), micro-panels of \(\tilde{B}_p\) typically reside in the L3 cache. Those micro-panels are therefore reused from the L3 cache as they are multiplied against different blocks \(\tilde{A}_i\). That is, they are reused from the L3 cache across iterations of the 3rd loop. However, if \(n\) is very small, such that \(\tilde{B}_p\) consists of only a few micro-panels, then \(\tilde{B}_p\) may not fall back to the L3 cache, but rather linger in the portion of the L2 cache that is not occupied by \(\tilde{A}_i\). This tends to result in a small performance improvement when the micro-kernel accesses those micro-panels of \(\tilde{B}_p\) after computation has proceeded to the next block of \(\tilde{A}_i\) (i.e., with the next iteration of the 3rd loop).

Now, let us consider what happens if \(m\) is very small, such that \(\tilde{A}_i\) consists of only a few micro-panels. Note that the algorithm is already predisposed to keeping those micro-panels in the L2 cache, and the L1 cache is usually too small to hold even one additional micro-panel of \(\tilde{A}_i\). Thus, there exists no cache benefit to small values of \(m\) for block-panel algorithms.

The corresponding small dimension affinity for the panel-block algorithm is reversed. That algorithm tolerates small \(m\) somewhat better for the same reason that the block-panel algorithm tolerates small 
\(n\) dimensions—that is, because \(m\) is the dimension along which the \(n_C\) blocksize is applied, which determines the number of micro-panels in \(\tilde{B}_p\).

### Odd register blocksizes

Usually, \(m_R\) and \(n_R\) are both even numbers. However, sometimes one or the other is odd.\(^{13}\) In this case, the variant of \(1M\) is pre-determined. That is, if \(n_R\) is odd, then only \(1M_{C}\) may be used. This stems from the fact that, as shown in Table 1, the complex register blocksize \(n_R^z\) used during packing of \(1M_{R}\) must be equal to \(\frac{1}{2}n_R\). Obviously, \(n_R^z = \frac{1}{2}n_R\) cannot be computed as a whole number if \(n_R\) is odd. Similarly, if \(m_R\) is odd, then only \(1M_{R}\) may be used.

Figure 4 illustrates distinguishing details of the four \(1M\) algorithms at the level of the macro-kernel, while Table 4 summarizes key properties of each \(1M\) algorithm.

---

\(^{13}\) The authors have never observed or heard of an optimal or near-optimal case where both register blocksizes were odd. This can be attributed to the fact that the register blockizes capture the orientation and layout of the vector registers used to accumulate the micro-kernel’s matrix product. Since vector register lengths are (so far) universally powers of two, they are also even numbers, and we cannot imagine any benefits to forgoing use of some vector register elements. In fact, the analytical model proposed in [5] explicitly constrains the space of possible register allocations to those where vector registers are fully populated. For these reasons, we anticipate that all reasonable register allocations will have at least one even register blocksize.

---

### Table 4: Summary of Key Properties of \(1M\) Algorithms

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Packing format / Level from which F.E. are reused</th>
<th>Requires (\mu)-kernel with preference for ...</th>
<th>Tolerates (\mu)-kernel with odd ...</th>
<th>Small dimension affinity</th>
</tr>
</thead>
<tbody>
<tr>
<td>1M_C_BP</td>
<td>1E / L2 1R / REG</td>
<td>columns</td>
<td>(n_R)</td>
<td>(n)</td>
</tr>
<tr>
<td>1M_R_PB*</td>
<td>1E / L2 1R / REG</td>
<td>rows</td>
<td>(n_R)</td>
<td>(m)</td>
</tr>
<tr>
<td>1M_R_BP</td>
<td>1R / REG 1E / L1</td>
<td>rows</td>
<td>(m_R)</td>
<td>(n)</td>
</tr>
<tr>
<td>1M_C_PB*</td>
<td>1R / REG 1E / L1</td>
<td>columns</td>
<td>(m_R)</td>
<td>(m)</td>
</tr>
</tbody>
</table>

*Recall that panel-block algorithms 1M\_R\_PB and 1M\_C\_PB compute \(C = BA\), with matrix \(A\) being multiplied by matrix \(B\) from the right.
Figure 4: Illustrations that depict key details of the four $1M$ algorithms at the level of the macro-kernel, including the $1E/1R$ packing formats of $\tilde{A}_i$ and $\tilde{B}_p$, whether each packed matrix is a left- or right-hand operand to the matrix product, and the corresponding relationship between the $m$ and $n$ dimensions and the cache and register block sizes $m_C$, $m_R$, $n_C$, and $n_R$. In each diagram, the storage of the output matrix prescribed by the $1M_C/1M_R$ variant is shown as vertical (column-stored) or horizontal (row-stored) lines. The algorithms represented by the diagrams on the left will exhibit similar performance properties when the matrices are large. A similar relationship holds between the pair of algorithms on the right.

### 3.7.3 Implementing the panel-block algorithm

At first glance, it may seem that implementing the panel-block algorithms $1M_C$PB and $1M_R$PB would require a significant effort. After all, panel-block algorithms partition the matrix dimensions in a different order, result in a different macro-kernel, and produce different micro-kernel subproblems that multiply $n_R \times k_C$ micro-panels of $\tilde{B}_p$ by $k_C \times m_R$ micro-panels of $\tilde{A}_i$. In fact, this would seem to call for a different micro-kernel altogether.

As it turns out, the panel-block algorithm can be implemented by recycling existing code and leveraging the similarities within algorithmic pairs. Let us consider the pair $1M_C$BP and $1M_R$PB. Note that the micro-kernel for $1M_C$BP as well as higher-level code (including the block-panel macro-kernel) already exist within the BLIS framework. The higher-level code for $1M_R$PB can be implemented in BLIS by building an alternate control tree structure that steps through the loop partitioning functions in a different order.\(^\text{14}\) Furthermore, it is possible to implement the panel-block macro-kernel needed by $1M_R$PB in terms of the existing block-panel macro-kernel used by $1M_C$BP. Upon calling the panel-block macro-kernel function, we simply apply a logical transposition of the matrix operands, and swap $\tilde{B}_p$ and $\tilde{A}_i$. This transformation orients $C$ and the micro-panels of the packed matrices into a format and sequence expected by $1M_C$BP's macro-kernel and its column-preferential micro-kernel.

In a similar manner, $1M_C$PB can be implemented by recycling the code and row-preferential micro-kernel used by $1M_R$BP.

Note that in Section 4 (specifically in Figures 7 and 8), we will report performance results in which we refer to panel-block algorithms $1M_C$PB and $1M_R$PB as employing row-prefering and column-prefering micro-kernels, respectively. This reversal comes from the fact that we implement each panel-block macro-

\(^{14}\)A control tree is a data structure used internally by BLIS to encode the basic properties of level-3 algorithms. It specifies, for example, which dimension to partition, the blocksize to use in that partitioning, and a reference to the next node in the tree, which is processed in the body of the next loop. In general, control trees allow a library developer to factor out certain parameters from the code local to any given loop. These codes become generic and, when stored as separate functions, can then be composed together at runtime into arbitrary compound algorithms.
kernel in terms of its block-panel counterpart, which, in our present discussion, uses a micro-kernel of opposite preference.

3.8 Algorithm-agnostic details

3.8.1 Workspace

In some cases, a small amount of \( m_R \times n_R \) workspace is needed. These cases fall into one of four scenarios:

1. \( C \) is row-stored and the real micro-kernel \( \text{rkern} \) has a column preference;
2. \( C \) is column-stored and \( \text{rkern} \) has a row preference;
3. \( C \) is general-stored (i.e., neither \( \text{rs}(C) \) nor \( \text{cs}(C) \) is unit); and
4. \( \beta \) has a non-zero imaginary component.

If any of these situations apply, then the 1M virtual micro-kernel will need to use workspace. This corresponds to the setting of \( \text{usew} \) in \( \text{vk1m} \). The idea is simply that \( \text{rkern} \) will be called to compute the micro-panel product and store it to the workspace \( W \). Subsequently, the result in \( W \) can be accumulated back to \( C \).

Cases (1) and (2), while supported, actually never occur in practice because BLIS will perform (at a high level within the framework) a logical transposition whenever necessary. The net effect is that the storage of \( C \) will always appear to match the preference of the micro-kernel.

Case (3) is needed because the real micro-kernel is programmed to support the updating of real matrices stored with general stride, which cannot be spoofed to match the updating of complex matrices stored with general stride. The reason is even when stored with general stride, complex matrices store real and imaginary f.e. in contiguous pairs. There is no way to coax this pattern of data access from a real domain micro-kernel. Thus, general stride support must be implemented outside \( \text{rkern} \), within \( \text{vk1m} \).

Case (4) is needed because real domain micro-kernels are not semantically equipped to scale \( C \) by complex scalars \( \beta \) (that is, \( \beta \) such that \( \beta^2 \neq 0 \)).

3.8.2 Handling alpha and beta scalars

As in the previous article, we have simplified the general matrix multiplication to \( C := C + AB \). In practice, the operation is implemented as \( C := \beta C + \alpha AB \), where \( \alpha, \beta \in \mathbb{C} \). Let us use Algorithms 1M\text{CBP} and 1M\text{RBP} in Figure 3 as an example of how to support arbitrary values of \( \alpha \) and \( \beta \).

If no workspace is needed (because none of the four situations described in Section 3.8.1 apply), we can simply pass \( \beta^T \) into the \( \text{rkern} \) call. However, if \( \beta \) is complex, or (regardless of whether \( \beta \) is real or complex) if any of the other three workspace cases apply, then we must pass in a local \( \beta_{\text{loc}} = 0 \) to \( \text{rkern} \), compute to local workspace \( W \), and then apply \( \beta \) at the end of \( \text{vk1m} \), when \( W \) is accumulated to \( C \).

When \( \alpha \) is real, the scaling may be performed directly by \( \text{rkern} \). This situation is ideal since it almost always incurs no additional costs (since many micro-kernels multiply their intermediate \( AB \) product by \( \alpha \) unconditionally). Scaling by \( \alpha \) with non-zero imaginary components can be performed by the packing function when either \( \tilde{A}_i \) or \( \tilde{B}_p \) are packed. Though somewhat less than ideal, the overhead incurred by this treatment of \( \alpha \) is minimal since packing is a memory-bound operation.

3.8.3 Multithreading

As with Algorithm 4M\text{1A} in the previous article, 1M\text{CBP} and 1M\text{RBP} parallelize in a straightforward manner for multicore and many-core environments. Because those algorithms encode the 1M method entirely within the packing functions and the virtual micro-kernel, all other levels of code are completely oblivious to, and therefore unaffected by, the specifics of the new algorithms. Therefore, we expect that 1M\text{CBP} and 1M\text{RBP} will yield multithreaded performance that is on-par with that of the corresponding real domain matrix multiplication function, \text{RMMP}.

A similar analysis holds for panel-block algorithmic variants 1M\text{CPB} and 1M\text{RPB}.

3.8.4 Bypassing the virtual micro-kernel

Because the 1M virtual micro-kernel serves as a wrapper to the real domain micro-kernel, it would seem at first glance that there exists the potential for additional overhead in 1M algorithms, particularly from the
extra function calls. However, there are a few things to consider.

First, we should consider that a conventional solution would implement matrix multiplication using a complex micro-kernel, which sometimes has a smaller micro-tile footprint (i.e., fewer F.E.). But, a complex micro-kernel that updates fewer F.E. would need to be called more times in order to fully update the output matrix. Thus, the function call overhead incurred by 1M algorithms may already be at or near parity with that of a conventional implementation.

Secondly, even if the complex micro-kernel updates the same number of F.E. as its real domain counterpart, there exists a simple optimization that can be applied as long as that of a conventional implementation.

1m micro-kernel that updates fewer F.E. However, there are a few things to consider.

First, we should consider that a conventional solution would implement matrix multiplication using a complex micro-kernel, which sometimes has a smaller micro-tile footprint (i.e., fewer F.E.). But, a complex micro-kernel that updates fewer F.E. would need to be called more times in order to fully update the output matrix. Thus, the function call overhead incurred by 1M algorithms may already be at or near parity with that of a conventional implementation.

Secondly, even if the complex micro-kernel updates the same number of F.E. as its real domain counterpart, there exists a simple optimization that can be applied as long as that of a conventional implementation.

Finally, we suspect that, even if this optimization cannot be applied, the slowdown that results from the additional overhead may not necessarily be prohibitive.

### 3.9 Other possible packing formats

An astute reader may have already noticed that the 1E and 1R formats are not unique in their ability to facilitate induced complex matrix multiplication. Returning to Eqs. 4 and 6, it is easy to see that any permutation that permutes columns of A simultaneously with the corresponding rows of B will yield a valid packing configuration (i.e., combination of packing formats on A and B) that facilitates computing the desired result. The only difference between any two arbitrary packing configurations will be in the order in which the intermediate terms \( \alpha^r \beta^r, \alpha^i \beta^i, -\alpha^i \beta^i, \) and \( \alpha^i \beta^r \) are accumulated into \( \gamma^r \) and \( \gamma^i \).

Perhaps the most interesting of these permutations is that in which even-numbered\(^{15}\) columns of A are permuted to be consecutive with one another and grouped together, while odd-numbered columns are permuted to immediately follow them. This format, with its permutation of \( k \) dimension, would transform Eq. 4 to:

\[
\begin{pmatrix}
\gamma_{00}^r & \gamma_{01}^r & \gamma_{02}^r & \gamma_{03}^r \\
\gamma_{10}^r & \gamma_{11}^r & \gamma_{12}^r & \gamma_{13}^r \\
\gamma_{20}^r & \gamma_{21}^r & \gamma_{22}^r & \gamma_{23}^r \\
\gamma_{30}^r & \gamma_{31}^r & \gamma_{32}^r & \gamma_{33}^r
\end{pmatrix}
\begin{pmatrix}
\alpha_{00}^r & \alpha_{01}^r & -\alpha_{01}^i & -\alpha_{02}^i \\
\alpha_{10}^r & \alpha_{11}^r & -\alpha_{11}^i & -\alpha_{12}^i \\
\alpha_{20}^r & \alpha_{21}^r & -\alpha_{21}^i & -\alpha_{22}^i \\
\alpha_{30}^r & \alpha_{31}^r & -\alpha_{31}^i & -\alpha_{32}^i
\end{pmatrix}
\begin{pmatrix}
\beta_{00}^r & \beta_{01}^r & \beta_{02}^r & \beta_{03}^r \\
\beta_{10}^r & \beta_{11}^r & \beta_{12}^r & \beta_{13}^r \\
\beta_{20}^r & \beta_{21}^r & \beta_{22}^r & \beta_{23}^r \\
\beta_{30}^r & \beta_{31}^r & \beta_{32}^r & \beta_{33}^r
\end{pmatrix}
\]

Or, more generally,

\[
C = (A|\bar{A}) \left(\frac{B^r}{B^i}\right)
\]

(7)

where \( A \) and \( \bar{A} \) are column-stored, \( B^r \) and \( B^i \) are row-stored, and \( C \) remains column-stored, as before with 1M_C based on 1E and 1R. The storage format of \( B \) would recycle some of the packing infrastructure created to support 4M, wherein real and imaginary F.E. are packed to separate but adjacent micro-panels. It would still, however, require the support of a new format to store \( \bar{A} \), wherein real and imaginary F.E. are swapped and the imaginary parts are negated.

A similar permutation can be applied to Eq. 6, yielding:

\[
C = (A^r|A^i) \left(\frac{B}{\bar{B}}\right)
\]

(8)

where \( A^r \) and \( A^i \) are column-stored, \( B \) and \( \bar{B} \) are row-stored, and \( C \) remains row-stored, as before with 1M_R based on 1R and 1E.

\(^{15}\)This assumes a zero-based indexing.
These alternative packing formats do not immediately appear to have any advantages over 1E and 1R.\textsuperscript{16} Indeed, the opposite may be true: for small values of \( m_R \) and \( n_R \) (actually, small values of \( m_P \) and \( n_P \)), packing to the 1E and 1R formats may enjoy a slight advantage over other formats due to increased spatial locality when packing to 1E and 1R. This would occur because the formats call for copying (parts of) a complex scalar to both the current column or row (of \( m_R \) or \( n_R \) F.E.) as well as next column or row, which may reside within the same cache line, or within a cache line that is more likely to have already been hardware-prefetched.

3.10 2m

The previous article noted that because of its expression in terms of real and imaginary matrices, 4M would perform more favorably on complex matrices that stored real and imaginary values separately, as two real matrices. This would not benefit the accesses on \( A \) and \( B \), since those matrices must almost always be packed to achieve high performance and can easily be separated during that process. However, it would benefit the accesses on \( C \). Referring back to Eq. 4, we can see that storing the real and imaginary F.E. of \( C \) separately is equivalent to a permutation of the rows of \( C \). In order to keep the computation expressed unchanged, this permutation would need to be applied to matrix \( A \) as well, since they both share an \( m \) dimension. Applying such a permutation to Eq. 4 yields:

\[
\left( \begin{array}{cccc}
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\end{array} \right) =
\left( \begin{array}{cccc}
\alpha_0 & -\alpha_0 & 0 & 0 \\
\alpha_0 & 0 & -\alpha_0 & 0 \\
\alpha_0 & 0 & 0 & -\alpha_0 \\
\alpha_0 & 0 & 0 & 0 \\
\end{array} \right)
\left( \begin{array}{cccc}
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\end{array} \right)
\left( \begin{array}{cccc}
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\end{array} \right)
\]

(9)

And if we also apply the permutation along the \( k \) dimension discussed in Section 3.9, we have:

\[
\left( \begin{array}{cccc}
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\end{array} \right) =
\left( \begin{array}{cccc}
\alpha_0 & 0 & 0 & 0 \\
\alpha_0 & 0 & 0 & 0 \\
\alpha_0 & 0 & 0 & 0 \\
\alpha_0 & 0 & 0 & 0 \\
\end{array} \right)
\left( \begin{array}{cccc}
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\beta_0 & \beta_0 & \beta_0 & \beta_0 \\
\end{array} \right)
\left( \begin{array}{cccc}
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 \\
\end{array} \right)
\]

(10)

Or, more generally:

\[
\left( \frac{C^r}{C^i} \right) + = \left( \frac{A^r}{A^i} \right) \left( \frac{-A^i}{A^r} \right) \left( \frac{B^r}{B^i} \right)
\]

This matrix multiplication can be computed via just two calls to a real domain matrix multiplication primitive:

\[
C^r + = \left( \frac{A^r}{A^i} \right) \left( \frac{B^r}{B^i} \right), \quad C^i + = \left( \frac{A^i}{A^r} \right) \left( \frac{B^r}{B^i} \right)
\]

provided that \( A^r \), \( A^i \), \( B^r \), and \( B^i \) can each be stored so that they can be referenced as single matrices. We call this the 2M method, and the specific instance derived from Eq. 10 as 2M.C.

\textsuperscript{16}One could make the case that the packing formats described by Eqs. 7 and 8 exhibit the benefit of being somewhat easier to describe than 1E and 1R, because they can be expressed in terms of more familiar real, imaginary, complex, and modified complex matrices.
Similarly, applying permutations to the \(n\) and \(k\) dimensions to Eq. 6 yields:

\[
\begin{pmatrix}
C^r & C^i
\end{pmatrix}
= 
\begin{pmatrix}
A^r & A^i
\end{pmatrix}
\begin{pmatrix}
B^r & B^i
\end{pmatrix}
\]

which can also be broken down in two instances of real matrix multiplication:

\[
C^r = \begin{pmatrix}
A^r & A^i
\end{pmatrix}
\begin{pmatrix}
B^r
\end{pmatrix}, \quad C^i = \begin{pmatrix}
A^r & A^i
\end{pmatrix}
\begin{pmatrix}
B^i
\end{pmatrix}
\]

which corresponds to \(2M_{R,R}\).

We can now make a few observations about \(2M\):

- Eqs. 10 and 11 are identical to Eqs. 3 and 5, respectively, except that scalars are replaced with matrices.
- The initial permutation along the \(k\) dimension is actually unnecessary, and so the formatting captured by Eq. 9 also falls under \(2M\). This permutation (or lack thereof) only changes the order in which intermediate terms are accumulated.
- The storage of \(C^r\) and \(C^i\) in both Eqs. 10 and 11 is unspecified and does not depend on which matrix, \(A\) or \(B\), was originally formatted with \(1e\) (prior to permutation). Either \(C^r\) or \(C^i\) may be stored by rows, columns, or a more exotic storage scheme, and their storage formats need not even be identical. Thus, neither \(2M_{C}\) nor \(2M_{R}\) implies the storage of \(C\); rather, they only imply how the \(A\) and \(B\) matrices are formatted and stored—that is, which one contains duplicated (and negated) entries.
- The \(2M\) method can be applied at an arbitrary level of matrix multiplication. For example, if we assume from Eq. 10 that input matrices \(\begin{pmatrix}
A^r & \ldots
\end{pmatrix}, \begin{pmatrix}
A^i & \ldots
\end{pmatrix},\) and \(\begin{pmatrix}
B^r & \ldots
\end{pmatrix}\) are each stored as micro-panels (column-stored, column-stored, and row-stored, respectively), then this application of \(2M\) prescribes that \(C\) is stored by contiguous \(n_R \times n_R\) micro-tiles. If, instead, the aforementioned input matrices were generally large, then \(C\) would need to be stored in whatever manner is appropriate for the real matrix multiplication primitive.

4 Performance

In this section we present performance results for implementations of \(1M\) algorithms on a recent Intel architectures. For comparison, we include results for select \(4M\) algorithms as well as the conventional assembly-based approach.

4.1 Platform and implementation details

Results presented in this section were gathered on a single Cray XC40 compute node consisting of two 12-core Intel Xeon E5-2690 v3 processors featuring the “Haswell” microarchitecture. Each core, running at a clock rate of 3.2 GHz\(^{17}\), provides a single-core peak performance of 51.2 gigaflops (GFLOPS) in double precision and 102.4 GFLOPS in single precision.\(^{18}\) Each socket has a 30MB L3 cache that is shared among cores, and each core has a private 256KB L2 cache and 32KB L1 (data) cache. Performance experiments were gathered under the SuSE 11 operating system running the Linux 3.0.101 (x86_64) kernel. Source code was compiled by the GNU C compiler (\texttt{gcc}), version 5.2.0.\(^{19}\) The version of BLIS used in all tests was not officially released at the time of this writing, and was adapted from version 0.2.1-85.\(^{20}\)

\(^{17}\)This system uses Intel’s Turbo Boost 2.0 dynamic frequency throttling technology. According to \([?]\), the maximum the clock frequency when executing AVX instructions is 3.2 GHz when utilizing one or two cores, and 3.0 GHz when utilizing three or more cores.

\(^{18}\)Accounting for the reduced AVX clock frequency, the peak performance when utilizing 24 cores is 48 GFLOPS/core in double precision and 96 GFLOPS/core in single precision.

\(^{19}\)The following optimization flags were used during compilation: \texttt{-O3 -mavx2 -mfma -mfpmath=sse -march=core-avx2}.

\(^{20}\)Despite not yet having an official version number, this version of BLIS may be uniquely identified, with high probability, by the first 10 digits of its git “commit” (SHA1 hash) number: 1c732d3ddc.
<table>
<thead>
<tr>
<th>Precision/Domain</th>
<th>Implementation</th>
<th>$m_R$</th>
<th>$n_R$</th>
<th>$m_C$</th>
<th>$n_C$</th>
</tr>
</thead>
<tbody>
<tr>
<td>single complex</td>
<td>BLIS 1m_c</td>
<td>16/2</td>
<td>6</td>
<td>144/2</td>
<td>256/2</td>
</tr>
<tr>
<td></td>
<td>BLIS 1m_r</td>
<td>6</td>
<td>16/2</td>
<td>144</td>
<td>256/2</td>
</tr>
<tr>
<td></td>
<td>BLIS assembly</td>
<td>3</td>
<td>8</td>
<td>144</td>
<td>256</td>
</tr>
<tr>
<td></td>
<td>OpenBLAS</td>
<td>8</td>
<td>2</td>
<td>384</td>
<td>192</td>
</tr>
<tr>
<td>double complex</td>
<td>BLIS 1m_c</td>
<td>8/2</td>
<td>6</td>
<td>72/2</td>
<td>256/2</td>
</tr>
<tr>
<td></td>
<td>BLIS 1m_r</td>
<td>6</td>
<td>8/2</td>
<td>72</td>
<td>256/2</td>
</tr>
<tr>
<td></td>
<td>BLIS assembly</td>
<td>3</td>
<td>4</td>
<td>72</td>
<td>256</td>
</tr>
<tr>
<td></td>
<td>OpenBLAS</td>
<td>4</td>
<td>2</td>
<td>256</td>
<td>128</td>
</tr>
</tbody>
</table>

Table 5: Register and cache block sizes used by the various implementations of matrix multiplication, as configured for an Intel Xeon E5-2690 v3 “Haswell” processor. Note: Division by 2 is made explicit to allow the reader to quickly see the 1m block sizes as well as the block sizes that would be used by the underlying real domain micro-kernels when performing real matrix multiplication.

^a We were unable to confidently determine the $n_C$ block sizes used by OpenBLAS for the Haswell architecture.

Algorithms 1m_c_bp, 1m_c_pb, 1m_r_bp, and 1m_r_pb were implemented in the BLIS framework, as described in Sections 3.5 through 3.8. We also refer to results based on existing conventional, assembly-based micro-kernels written by hand for the Haswell microarchitecture via GNU extended inline assembly syntax.

All experiments were performed on randomized, column-stored matrices with GEMM scalars held constant: $\alpha = 2$ and $\beta = 1$. In all performance graphs, each data point represents the best of three trials.

Blocksizes for each of the implementations tested are provided in Table 5. For reference, we also provide the block sizes for single-precision and double-precision real domain matrix multiplication, as well as those used by complex GEMM implementations in OpenBLAS 0.2.19.

In all graphs presented in this section the x-axes denote the problem size, the y-axes show observed floating-point performance in units of GFLOPS, and the theoretical peak performance coincides with the top of each graph.

### 4.2 Comparing to other implementations

As in the previous article, our primary goal is not to compare the performance of the newly developed 1m implementations with that of other established BLAS solutions. Rather, our intent is to focus comparisons within and between families of induced methods. That said, we agree that some basic comparison is appropriate, and thus have included Figure 5. Here, we compare conventional assembly-coded complex GEMM implementations in BLIS with those provided by OpenBLAS 0.2.19 and Intel MKL 11.3. We also include performance for BLIS’s assembly-coded real domain GEMM, since its underlying kernel forms the basis for all induced methods presented here and in the previous article. Figure 6 shows multithreaded performance of the same libraries on 24 cores. These graphs show that BLIS’s solutions are quite competitive, and in some cases outperform OpenBLAS, while falling short of Intel’s highly optimized MKL library.

### 4.3 Sequential results

Figure 7 reports performance results for various implementations of double- and single-precision complex matrix multiplication on a single core of the Haswell processor. For these results, $m = n$ was bound to the problem size while the $k$ dimension was fixed to the corresponding value of $k_C$, as listed in Table 5. As in the previous article, we focus on this problem shape—rank-$k_C$ update—because: (1) it will yield near-optimal performance for all of our implementations tested, (2) this type of matrix multiplication frequently appears
Figure 5: Single-threaded performance of various conventional (i.e., assembly-coded) implementations of double-precision (left) and single-precision (right) complex GEMM on a single core of an Intel Xeon E5-2690 v3 “Haswell” processor. The theoretical peak performance coincides with the top of each graph.

Figure 6: Multithreaded performance of various conventional (i.e., assembly-coded) implementations of double-precision (left) and single-precision (right) complex GEMM on 24 cores of an Intel Xeon E5-2690 v3 “Haswell” processor. The theoretical peak performance coincides with the top of each graph.
Figure 7: Single-threaded performance of various implementations of double-precision (top) and single-precision (bottom) complex GEMM on a single core of an Intel Xeon E5-2690 v3 “Haswell” processor. The left and right graphs contain an identical reference curve for assembly-coded complex GEMM as well as results for three additional reference implementations (corresponding to the assembly-coded real GEMM, as well as 4M_1A and 4M_HW). These latter three implementations differ from left to right graphs in the preference of their underlying micro-kernel, indicated by a “(c)” or “(r)” (for column- or row-prefering). The left and right graphs also differ in which 1M implementations they report, with the left graphs reporting 1M_C_BP and 1M_R_PB (which employ column-prefering micro-kernels) and the right graphs reporting 1M_R_BP and 1M_C_PB (which employ row-prefering micro-kernels). The theoretical peak performance coincides with the top of each graph.
within high-performance implementations of higher-level DLA operations such as Cholesky, LU, and QR factorizations, and (3) it is the foundation for matrix multiplications where all three dimensions are large.\(^{21}\)

The primary reference implementations chosen for Figure 7 consist of a high-performance complex GEMM based on conventional assembly-coded complex kernels. This reference implementation (the solid line) was held constant between the top-left and top-right graphs, which report double-precision results, and the corresponding single-precision implementation is similarly duplicated between the bottom-left and bottom-right graphs. We also include a high-performance real GEMM and, as representatives of the 4M method, implementations of Algorithms 4M_1A and 4M_HW. For these three reference codes, we report results based on column-preferential micro-kernels on the left and those of row-preferential micro-kernels on the right. (We indicate the preference of the underlying micro-kernel of those algorithms with a “(c)” or “(r)” in the graph legends.\(^{22}\)) We include results for both types of micro-kernels because the left-right difference also includes which 1M implementations are shown. That is, results for 1M_C_BP and 1M_R_PB (which use column-prefering micro-kernels) appear on the left while those of 1M_R_BP and 1M_C_PB (which use row-prefering micro-kernels) appear on the right. We group the 1M algorithms in this manner so that we can visually confirm the similarities (or differences) in performance within algorithm pairs.

As predicted in Section 3.7.2, we find that the performance signatures of the algorithms within each of the aforementioned 1M algorithm pairs are virtually indistinguishable. That is, 1M_C_BP tracks closely with 1M_R_PB, and 1M_R_BP with 1M_C_PB. This behavior holds for both single- and double-precision. The performance signatures between pairs, however, differ slightly. This was expected given that the 1E and 1R packing formats place different memory access burdens on different packed matrices, \(A_i\) and \(B_p\), which reside in different levels of cache. It was not previously clear, however, which pair would be superior over the other, or if there would be a material difference at all. It seems that, at least in the sequential case, the difference between the 1M pairs’ performance signatures are minimal, though the difference is somewhat more noticeable in double-precision. This difference is almost certainly due to the different performance characteristics of the row- and column-preferential micro-kernels. We find evidence of this in the 4M_1A and 4M_HW results as well as those of the real GEMM implementation, which are also affected by the change in micro-kernel preference.

In all cases, the 1M implementations outperform both 4M_1A and 4M_HW, with the margin growing in single-precision. Somewhat surprisingly, the 1M implementations match or exceed the performance of their real domain benchmarks (the dotted lines in each graph), and are very competitive with assembly-coded complex GEMM regardless of the algorithm employed.

## 4.4 Multithreaded results

Figure 8 shows double- and single-precision performance using 24 threads, with one thread bound to each core of the processor. Performance is presented in units of gigaflops per core to facilitate visual assessment of scalability. For all implementations, we employed 2-way parallelism within the 5th loop and 12-way parallelism within the 3rd loop, for a total of 24 threads. This parallelization scheme was chosen in a manner consistent with that of the previous article\(^ {23}\), using a strategy set forth in [7].

Like in the single-threaded case, the performance of 1M algorithms closely track one another within pairs, albeit with somewhat more jitter. However, here we find a marked difference in performance between pairs.

\(^{21}\)The authors of [2] propose a taxonomy that includes other shape scenarios besides rank-\(k_C\) update and large quasi-square multiplication. Some of these other types of matrix product may favor algorithms that are different from the one depicted in Figure 1. This topic deserves special treatment, and thus is beyond the scope of the present article.

\(^{22}\)Though it is not indicated in the graph legends, we chose to always compare against an assembly-coded complex GEMM based on a row-preferential micro-kernel because it exhibited higher performance than its column-preferential counterpart on the test hardware. The likely reason for this outperformance relates to the way \(m_R\) and \(n_R\) affect the bandwidth needed from the L2 and L1 caches, and is briefly discussed in Section 4.4.

\(^{23}\)The two sockets of the Xeon E5-2690 v3 each have an L3 cache that is shared among those sockets’ cores. This encourages two-way parallelism in the 5th loop, which produces two panels \(B_p\) to be packed and used simultaneously on completely independent parts of matrix \(C\). Furthermore, by also parallelizing the 3rd loop, each of the 12 cores of either socket pack separate blocks \(A_i\) into their private L2 caches. Thus, when each core executes the 2nd loop (i.e., the macro-kernel), it multiplies its local block \(A_i\) by the row-panel \(B_p\) that is shared among all cores on the socket.
Figure 8: Multithreaded performance of various implementations of double-precision (top) and single-precision (bottom) complex GEMM on two Intel Xeon E5-2690 v3 “Haswell” processors, each with 12 cores. All data points reflect the use of 24 threads. The left and right graphs contain an identical reference curve for assembly-coded complex GEMM as well as results for three additional reference implementations (corresponding to the assembly-coded real GEMM, as well as 4M_1A and 4M_HW). These latter three implementations differ from left to right graphs in the preference of their underlying micro-kernel, indicated by a “(c)” or “(r)” (for column- or row-preferring). The left and right graphs also differ in which 1M implementations they report, with the left graphs reporting 1M_C_BP and 1M_R_PB (which employ column-preferring micro-kernels) and the right graphs reporting 1M_R_BP and 1M_C_PB (which employ row-preferring micro-kernels). The theoretical peak performance coincides with the top of each graph.
Specifically, the $1M_R\text{_BP}$ and $1M_C\text{_PB}$ implementations (those based on the row-prefering micro-kernel) outperform those of $1M_C\text{_BP}$ and $1M_R\text{_PB}$ (based on the column-prefering micro-kernel), with the difference more pronounced in single-precision. We suspect this is rooted not in the algorithms, per se, but in the differing micro-kernel implementations used by each pair. The $1M_R\text{_BP}/1M_C\text{_PB}$ algorithms are implemented with a real micro-kernel that is $6 \times 8$ and $6 \times 16$ in the single- and double-precision cases, respectively, while $1M_C\text{_BP}/1M_R\text{_PB}$ use $8 \times 6$ and $16 \times 6$ micro-kernels for single- and double-precision implementations, respectively. The observed difference in performance between the $1M$ pairs is likely attributable to the fact that the micro-kernels’ different values for $m_R$ and $n_R$ place different bandwidth requirements when reading f.e. from the caches (primarily L1 and L2). Specifically, larger values of $m_R$ place a heavier burden on loading elements from the L2 cache, which is usually disadvantageous since that cache resides further from the processor. By contrast, a micro-kernel with larger $n_R$ loads more elements (per $m_R \times n_R$ rank-1 update) from the L1 cache, which resides closer to the processor, and relies less heavily on loading elements from the L2 cache.

Even in the worst case (for $1M_C\text{_BP}/1M_R\text{_PB}$), the $1M$ implementations match or exceed their real domain counterparts. And when the $1M_R\text{_BP}/1M_C\text{_PB}$ algorithm pair is employed, performance is competitive with that of the conventional implementations based on complex assembly-coded micro-kernels, particularly in double-precision.

The $1M$ algorithms based on row-preferential micro-kernels, $1M_R\text{_BP}/1M_C\text{_PB}$, outperform $4M_{1A}$, especially in single-precision where the margin is quite wide. The $1M_C\text{_BP}/1M_R\text{_PB}$ algorithms, based on a column-preferential micro-kernel, fare more poorly relative to $4M_{1A}$, but that pair generally still matches or exceeds the $4M$ implementation. We suspect that $4M_{1A}$ is more resilient to the lower-performing column-preferential micro-kernel because that algorithm’s virtual micro-kernel leans heavily on the L1 cache, which on this architecture is capable of being read from and written to at relatively high bandwidth (64 bytes/cycle and 32 bytes/cycle, respectively) [4].

5 Observations

5.1 4m limitations circumvented

The previous article concluded by identifying a number of limitations inherent in the $4M$ method that, collectively, prevent the approach from becoming both a feasible and competitive alternative to matrix multiplication via conventional assembly-based kernels. We now revisit this list and briefly discuss whether, to what degree, and how those limitations are overcome by algorithms based on the $1M$ method.

- **Number of calls to primitive.** The most versatile $4M$ algorithm, $4M_{1A}$, incurs up to a 400% increase in function call overhead over a comparable assembly-based implementation. By comparison, $1M$ algorithms require at most a doubling of micro-kernel function call overhead, and in certain common cases (e.g., when $\beta \in \mathbb{R}$ and $C$ is row- or column-stored), this overhead can be avoided completely. The $1M$ method is clearly an improvement over $4M$ due to its reliance on a single invocation of the matrix multiplication primitive.

- **Inefficient reuse of input date from $A$, $B$, and $C$.** The most cache-efficient application of $4M$ is the lowest level algorithm, $4M_{1A}$, which reuses f.e. of $A$, $B$, and $C$ from the L1 cache. But, as shown in Table 3, both $1M_R$ and $1M_C$ variants reuse f.e. of two of the three matrices from registers, and with the inclusion of the panel-block algorithm, two of the four $1M$ algorithms (one from each variant) reuse f.e. of the third matrix from the L1 cache. This would seem to be a significant improvement, though observed performance improvement over $4M_{1A}$ will depend on properties of the hardware (i.e., the L1 cache performance).

- **Non-contiguous output to $C$.** Algorithms based on the $4M$ method must update only the real and then only the imaginary parts of the output matrix, twice. Since $C$ is typically stored (by rows or columns) with real and imaginary f.e. interleaved, this piecemeal approach prevents the real micro-kernel from using vector load and store instructions on $C$ during those four updates. The $1M$
method avoids this issue altogether by packing $A$ and $B$ to formats that allow the real micro-kernel to update contiguous real and imaginary F.E. simultaneously, during a single invocation. We suspect that this is, perhaps, the largest contributor to 1M’s performance superiority over 4M.

- **Reduction of $k_C$.** Algorithm 4M_1A requires that the real micro-kernel’s preferred $k_C$ blocksize be halved in the complex algorithm in order to maintain proper cache footprints of $\tilde{A}_i$ and $\tilde{B}_p$, as well the footprints of their constituent micro-panels.\(^{24}\) Using such sub-optimally sized micro-panels can noticeably hobble the performance of 4M_1A. Looking back at Table 1, it may seem like 1M suffers a similar handicap, however, the reason for halving $k_C$ and its effect are both completely different. In the case of 1M, the use of $k_C^C = \frac{1}{2}k_C$ is simply a conversion of units (complex elements to real F.E.) for the purposes of identifying the size of the complex submatrices to be packed that will induce the optimal $k_C$ value from the perspective of the real micro-kernel, not a reduction in the F.E. footprint of the micro-panels operated upon by the real micro-kernel. Indeed, the ability of 1M to achieve high performance when $k = \frac{1}{2}k_C$ is a strength in the context of certain higher-level applications, such as Cholesky, LU, and QR factorizations based on rank-k update. Those operations tend to perform better when the algorithmic blocksize (corresponding to $k_C^C$) is as narrow as possible in order to limit the amount of computation in the lower-performing unblocked subproblem.

- **Framework accommodation.** The 1M algorithms are no more disruptive to the BLIS framework than the most accommodating of 4M algorithms, 4M_1A, and much less disruptive than the remaining algorithms. This is because, like with 4M_1A, almost all of the 1M implementation details are sequestered to the packing routines and the virtual micro-kernel.

- **Interference with multithreading.** Because the 1M algorithms are implemented entirely within the packing facility and virtual micro-kernel, they parallelize just as easily as the most thread-friendly of the 4M algorithms, 4M_1A, and entirely avoid the threading difficulties of higher-level 4M algorithms.

- **Non-applicability to two-operand operations.** Like 4M_1A, algorithms based on 1M can be applied to other level-3 operations, including those that involve only two operands, such as TRMM and TRSM.\(^{25}\) Certain higher-level applications of 4M, however, are inherently incompatible with two-operand operations because they would overwrite the original contents of the input/output operand even though subsequent stages of computation depend on that original input.

This analysis suggests that the 1M method solves or avoids most of the performance-degrading weaknesses of 4M, and in the remaining cases is no worse off than the best 4M algorithm. Thus, its observed performance superiority was predictable.

### 5.2 Further discussion

Before concluding, here we offer some final thoughts on the 1M method and its place in the larger spectrum of approaches to implementing complex matrix multiplication.

#### 5.2.1 Geometric interpretation

Matrix multiplication is sometimes thought of as a three-dimensional operation with a contraction (accumulation) over the $k$ dimension. This interpretation carries into the complex domain as well. However, when each complex element is viewed in terms of its real and imaginary components, we find that a fourth pseudo-dimension of computation (of fixed size 2) emerges, one which also involves a contraction. The 1M method reorders and duplicates elements of $A$ and $B$—a form of swizzling applied when the data is packed—in such a way that effectively “flattens” this extra dimension of computation. This, combined with the exposed treatment of real and imaginary F.E., causes the resulting floating-point operations to appear

\(^{24}\)Recall that the halving of $k_C$ for 4M_1A was motivated by the desire to keep the not just two, but four real micro-panels in the L1 cache simultaneously. These correspond to the real and imaginary parts of the current micro-panels of $A_i$ and $B_p$.

\(^{25}\)As with 4M_1A, support for TRSM requires a separate pair of micro-kernels that fuse a matrix multiplication with a triangular solve with $n_R$ right-hand sides.
indistinguishable from a real domain matrix multiplication with \( m \) and \( k \) (for column-stored \( C \)) or \( k \) and \( n \) (for row-stored \( C \)) dimensions that are twice as long.

### 5.2.2 Data reuse: efficiency vs. programmability

Both the conventional approach and 1M move data efficiently through the memory hierarchy.\(^{26}\) However, once in registers, a conventional complex micro-kernel reuses those loaded values to perform twice as many flops as 1M. The previous article observes that all 4M algorithms make different variations of the same tradeoff: by forgoing the reuse of F.E. from registers and instead reusing those data from some level of cache, the algorithms avoid the need to explicitly encode complex arithmetic at the assembly level. As it turns out, 1M makes a similar tradeoff, but gives up less while gaining more: it is able to reuse F.E. from two of the three matrix operands from registers while still avoiding the need for a complex micro-kernel, and manages to replace that kernel operation with a single real matrix multiplication. And we would argue that increasing programmability and productivity by forfeiting a modest performance advantage is a good trade to make under almost any circumstance.

### 5.2.3 Micro-kernel bandwidth

Because 1M algorithms do not explicitly reuse F.E. of \( \hat{A} \) and \( \hat{B} \) from registers, they require higher memory bandwidth during the micro-kernel computation than a conventional assembly-based solution.\(^{27}\) Specifically, increased bandwidth is utilized when reading from the copy of the matrix that is packed into the 1e format, which resides in either the L2 or L1 cache, depending on which algorithm is being employed.\(^{28}\) In practice, this potential weakness of 1M is not a concern. Yes, in principle, one could design an architecture with sufficiently low memory bandwidth from L2 or L1 cache that a conventional complex matrix multiplication achieves high performance while a 1M-based implementation struggles. However, this would imply a corresponding performance shortfall in the underlying real domain matrix kernel. Given the ubiquity and importance of real matrix multiplication in the scientific community, hardware vendors have great incentive to design architectures that allow \texttt{sgemm} and \texttt{dgemm} to achieve high performance. Thus, we would expect that 1M will remain a viable alternative for the foreseeable future.

### 5.2.4 Storage

The supremacy of the 1M method is closely tied to the interleaved, pairwise storage of real and imaginary values—specifically, of the output matrix \( C \). If users and applications decide to start storing complex matrices as two real matrices (traditionally-stored, by rows or columns), one each for real and imaginary components, the 2M approach (for numerically sensitive settings) as well as low-level applications of 3M (for numerically insensitive settings) become more appropriate.

### 6 Conclusions

We began the article by reviewing the general motivations for induced methods for complex matrix multiplication, and the specific methods, 3M and 4M, studied in the previous article. Then, we recast complex scalar multiplication (and accumulation) in such a way that it revealed a template that could be used to fashion a new induced method that casts the complex matrix multiplication in terms of a single real matrix product. The key is the application of two new packing formats on the left- and right-hand matrix product

\(^{26}\)This is in contrast to, for example, Algorithm 4M\_4W, which the previous article showed makes rather inefficient use of cache lines as they travel through the L3, L2, and L1 caches.

\(^{27}\)While this bandwidth distinction actually holds for all induced methods, different algorithms will place differing degrees of bandwidth pressure on the memory hierarchy, depending on the level(s) of cache from which it reuses F.E. of \( A \) and \( B \).

\(^{28}\)The bandwidth from the 1r-formatted matrix is unaffected since it only reorders (rather than duplicates) its F.E.. And since 1M uses half the \( k_C \) that is optimal in the real domain (and therefore would perform roughly twice as many rank-\( k_C \) updates), the bandwidth required for accessing F.E. of the output matrix may also be higher, though the precise level of increase will depend on the value of \( k_C^2 \), that would be used by a comparable assembly-based implementation. However, bandwidth requirements on \( C \) are already quite low, so we would not expect this difference to measurably impact performance.
operands that allows us to disguise the complex matrix multiplication as a real matrix multiplication with slightly modified input parameters. This 1M method is shown to have two variants, depending on whether the output matrix is stored by rows or columns. We introduced an alternative “panel-block” algorithm that, combined with the original block-panel algorithm, gives rise to a total of four 1M algorithms, and discussed the similarities and differences in the performance properties of each. We also considered alternative packing formats, including a 2M method that would be applicable to more exotic storage arrangements that separate the real and imaginary components. When implemented in the BLIS framework, competitive performance was observed for 1M algorithms on a recent Intel microarchitecture. These tests provided empirical evidence of predicted performance similarities within algorithmic pairs, and also confirmed that performance can differ between pairs. Finally, we reviewed the limitations of the 4M method that are overcome by 1M and then concluded by discussing a few high-level observations.

The key takeaway from our study of induced methods is that the real and imaginary elements of complex matrices can always be reordered to accommodate the desired fundamental primitives, whether those primitives are defined to be various forms of real matrix multiplication (as is the case for the 4M, 3M, 2M, and 1M methods), or vector instructions (as is the case for micro-kernels that implement complex arithmetic in assembly code). Indeed, even in the real domain, the classic matrix multiplication algorithm’s packing format is simply a reordering of data that targets the fundamental primitive implicit in the micro-kernel—namely, an $m_R \times n_R$ rank-1 update. The family of induced methods presented here and in the previous article expand upon this basic reordering so that the mathematics of complex arithmetic can be expressed at different levels of the algorithm and of its corresponding implementation, each yielding different benefits, costs, and performance.

Acknowledgements

This research was partially sponsored by grants from Intel Corporation and the National Science Foundation (Award ACI-1550493). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

We thank the Texas Advanced Computing Center for providing access to the the Intel Xeon “Lonestar5” compute cluster on which performance data was gathered.

References


