【Neon】????码带为了Neon - Part3：Matrix Multiplication

程序员文章站 2022-04-19 17:16:54

...

原文链接：[Coding for Neon - Part 3: Matrix Multiplication]
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon—part-3-matrix-multiplication

第1部分讲了如何用Neon来加载和存储数据，第2部分讲了对向量进行操作后，如何处理最后多出来的数据（指数据不够，无法凑成一个完整的向量）。
本文介绍一种有用的数据处理：矩阵乘法。

Matrices（矩阵）

我们来看看怎样高效滴做4x4矩阵的乘法，这种操作在3D图像处理中经常用到。假设矩阵是按照列序（column-major order）的形式存储在内存里——OpenGL-ES就是采用这种格式。
【Neon】????码带为了Neon - Part3：Matrix Multiplication

Algorithm（算法）

我们对矩阵乘法进行分解，看看其中哪些操作可以通过Neon指令来实现。

【Neon】????码带为了Neon - Part3：Matrix Multiplication

上图显示了结果矩阵C的第一列的计算方法。按照常识（行乘以列），矩阵C的第1列的第1个元素为矩阵A的第1行乘以矩阵B的第1列。

【Neon】????码带为了Neon - Part3：Matrix Multiplication

如上图所示，从另外一个角度来理解，矩阵C的第1列可以这样算：它由4个列向量相加，第1个列向量为矩阵A的第1列乘以矩阵B第1列的第1个元素，如图中蓝色框框；第2个列向量为矩阵A的第2列乘以矩阵B的第1列的第2个元素，如图中紫色框框；剩下两个灰色框框的列向量请结合图示理解。这种思路可以用Neon来实现。↓↓↓
【Neon】????码带为了Neon - Part3：Matrix Multiplication

如果矩阵的每一列数据，都是存储在Neon寄存器中的一个向量，我们可以用向量-标量乘法（vector-by-scalar multiplication）指令来完成这种操作，如上图所示。最后还要把这几个列向量的对应元素相加，这可以通过同样指令的加法版本来实现。
当我们对第一个矩阵的列进行操作，产生一列结果时，从内存中读写元素是一种线性操作，不需要交错（interleaving）加载或存储指令。

Code（代码）

Floating Point（浮点型）

先来看看单精度浮点矩阵乘法的实现。首先将矩阵从内存加载到Neon寄存器中，矩阵是列序（column-major order）的，所以矩阵的列是线性存储在内存里的。每一列可以通过VLD1加载到Neon寄存器中，然后用VST1写回到内存。


@ r是地址寄存器, r1指向矩阵0(也就就是矩阵A), r2指向矩阵(也就是矩阵B)

    vld1.32  {d16-d19}, [r1]!            @ 加载矩阵A的前8个元素,矩阵是列序的,也就是加载了两列. 然后更新指针
    vld1.32  {d20-d23}, [r1]!            @ 加载矩阵A的接下来8个元素
    vld1.32  {d0-d3}, [r2]!              @ 同理加载矩阵B的前两列
    vld1.32  {d4-d7}, [r2]!              @ 同理加载矩阵B的后两列

Neon有32个64位寄存器，这里可以把两个矩阵的全部数据加载到寄存器里，并且还有多的寄存器可以用作累加器。这里d16到d23存储第一个矩阵的16个数字，d0到d7存储第二个矩阵的16个数字。（单精度浮点数是4个字节32位，每个寄存器是64位，可以存2个数字，所以8个寄存器可以存16个数字）

An Aside: D and Q registers（插一段介绍：D和Q寄存器）

大多数Neon指令可以有两种方式使用寄存器组：

分成16个Quad-word（4字长）寄存器，每个寄存器128位，名称为q0到q15
分成32个Double-word寄存器（2字长），每个寄存器64位，名称为d0到d31

【Neon】????码带为了Neon - Part3：Matrix Multiplication

32位系统里一个字长（word）是32位，64位系统是64位。
Reference：https://blog.csdn.net/fabulous1111/article/details/79525384

这些d和q只是别名，其实q0中的数据与d0 d1的数据相同，通过d0可以访问q0的前面64位数据，通过d1可以访问q0的后面64位数据。在C语言中这和联合体（union）很像。
在本文的浮点矩阵乘法例子中，常使用Q寄存器，因为本文矩阵的一列是4个32位的浮点数，刚好存进一个128位的Q寄存器里。

Back to the Code（看回代码）

用4行Neon乘法指令算出矩阵C的第一列：


    vmul.f32    q12, q8, d0[0]      @ q8就是d16d17, 是矩阵A的第一列, q12 = q8 * d0[0]
    vmla.f32    q12, q9, d0[1]      @ q9就是矩阵B的第二列, q12 += q9 * d0[1]
    vmla.f32    q12, q10, d1[0]     @ 同理
    vmla.f32    q12, q11, d1[1]     @ 算完了

注意在本代码中，乘法指令中通过D寄存器来获取和使用标量。尽管q0[3]和d0[1]是一样的值，并且从语义上更好理解（第1列的第4个值）。但是作者正在使用的GNU汇编器不支持这种格式，所以只能从D寄存器来指定标量，或许你的汇编器不必如此。

所以第一行代码的意思就是 $q_{12} =[ x_0y_0,x_1y_0,x_2y_0,x_3y_0]^T$ ，q8就是矩阵A的第一列的4个数字，d0就是矩阵B第一列的前2个数字，d1就是矩阵B第一列的后2个数字。
后面几行代码差不多意思，自行理解，用的是乘加指令。4行代码实现了下面的运算：

$q_{12}= \left[ \begin{array}{ccc} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{array} \right]\times y_0 + \left[ \begin{array}{ccc} x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \end{array} \right]\times y_1 + \left[ \begin{array}{ccc} x_{8} \\ x_{9} \\ x_{A} \\ x_{B} \end{array} \right]\times y_2 + \left[ \begin{array}{ccc} x_{C} \\ x_{D} \\ x_{E} \\ x_{F} \end{array} \right]\times y_3 \\= \left[ \begin{array}{ccc} x_{0}y_0 + x_{4}y_1 + x_{8}y_2 + x_{C} y_{3} \\ x_{1}y_0 + x_{5}y_1 + x_{9}y_2 + x_{D} y_{3} \\ x_{2}y_0 + x_{6}y_1 + x_{A}y_2 + x_{E} y_{3} \\ x_{3}y_0 + x_{7}y_1 + x_{B}y_2 + x_{F} y_{3} \end{array} \right] = \left[ \begin{array}{ccc} x_{0} & x_{4} & x_{8} & x_{C} \\ x_{1} & x_{5} & x_{9} & x_{D} \\ x_{2} & x_{6} & x_{A} & x_{E} \\ x_{3} & x_{7} & x_{B} & x_{F} \end{array} \right] \cdot \left[ \begin{array}{ccc} y_{0} \\ y_{1} \\ y_{2} \\ y_{3} \end{array} \right]$ q12=⎣⎢⎢⎡x0x1x2x3⎦⎥⎥⎤×y0+⎣⎢⎢⎡x4x5x6x7⎦⎥⎥⎤×y1+⎣⎢⎢⎡x8x9xAxB⎦⎥⎥⎤×y2+⎣⎢⎢⎡xCxDxExF⎦⎥⎥⎤×y3=⎣⎢⎢⎡x0y0+x4y1+x8y2+xCy3x1y0+x5y1+x9y2+xDy3x2y0+x6y1+xAy2+xEy3x3y0+x7y1+xBy2+xFy3⎦⎥⎥⎤=⎣⎢⎢⎡x0x1x2x3x4x5x6x7x8x9xAxBxCxDxExF⎦⎥⎥⎤⋅⎣⎢⎢⎡y0y1y2y3⎦⎥⎥⎤

如果只是算矩阵和一个向量的相乘（3D图形中另外一种常见的操作），那上面几行代码就搞定了，并且可以把结果向量保存到内存中。然而，要完成矩阵乘矩阵，还需要迭代几次，算完q1到q3寄存器中y4到yF的值。

现在把上面这段代码封装成函数：

@ 上面的懂这个就懂了, 不写注释了
@ 函数名是mul_col_f32, 后面3个是参数

    .macro mul_col_f32 res_q, col0_d, col1_d
    vmul.f32    res_q, q8,  col0_d[0]      
    vmla.f32    res_q, q9,  col0_d[1]     
    vmla.f32    res_q, q10, col1_d[0]      
    vmla.f32    res_q, q11, col1_d[1]      
    .endm

$4\times4$ 浮点数矩阵的乘法的代码看起来像这样：

@ 上面的懂这个就懂了, 不写注释了
@ 函数名是mul_col_f32, 后面3个是参数

    @ 加载数据
    vld1.32  {d16-d19}, [r1]!            
    vld1.32  {d20-d23}, [r1]!           
    vld1.32  {d0-d3}, [r2]!             
    vld1.32  {d4-d7}, [r2]!       
          
    @ 调用函数, 实现矩阵与向量的乘法
    mul_col_f32 q12, d0, d1           
    mul_col_f32 q13, d2, d3           
    mul_col_f32 q14, d4, d5           
    mul_col_f32 q15, d6, d7           
 
    @ 保存数据
    vst1.32  {d24-d27}, [r0]!  
    vst1.32  {d28-d31}, [r0]!

Fixed Point（定点数）

使用定点（fixed point）算术来计算，通常比浮点运算快，因为它使用较少的内存带宽来读写位数较少的值。对于相同的操作，整数的乘法通常快过浮点数的乘法。
但是在用定点算术时，必须谨慎选择表示方式，以避免溢出或饱和，保证应用程序所需的精度。
用定点算术实现矩阵乘法，与浮点相似。在本例中使用Q1.14定点数格式（Q1.14数字具有1个整数位和14个小数位），其操作与浮点数相似，只需要更改累加器最后部分的位移，下面是封装的函数：


    .macro mul_col_s16 res_d, col_d
    vmull.s16   q12, d16, \col_d[0]   @ q12 = d16 * col_d[0], d16是64位寄存器, 刚好存4个整形数
    vmlal.s16   q12, d17, \col_d[1]
    vmlal.s16   q12, d18, \col_d[2]
    vmlal.s16   q12, d19, \col_d[3]
    vqrshrn.s32 \res_d, q12, #14      @ shift right and narrow accumulator into
                                      @ Q1.14 fixed point format, with saturation
    .endm

相比于前面浮点数封装的函数，可以看到主要的区别如下：

现在一个数字是16位的，所以一个D寄存器可以hold得住4个数字。
两个16位数乘以两个16位数，结果是一个32位数。这里用VMULL和VMLAL，因为结果存储在Q寄存器，用两倍大小的元素保护结果的所有位数。
最后的结果必须是16位，但是累加器是32位的。用VQRSHRN得到一个16位的结果

每个元素从32位缩小到16位也会对内存访问有所影响；可以使用较少的指令加载和存储数据。定点数矩阵乘法的代码如下：


    vld1.16  {d16-d19}, [r1]       @ load sixteen elements of matrix 0
    vld1.16  {d0-d3}, [r2]         @ load sixteen elements of matrix 1
 
    mul_col_s16 d4, d0                      @ matrix 0 * matrix 1 col 0
    mul_col_s16 d5, d1                      @ matrix 0 * matrix 1 col 1
    mul_col_s16 d6, d2                      @ matrix 0 * matrix 1 col 2
    mul_col_s16 d7, d3                      @ matrix 0 * matrix 1 col 3
 
    vst1.16  {d4-d7}, [r0]         @ store sixteen elements of result

Scheduling（调度）

后面再详细讲讲调度（scheduling），现在先来看看改进后的调度指令的效果。
在上面的.macro函数里，相邻的乘法指令写进同样的寄存器，由于指令集周期的问题，Neon（pipeline）必须等到每次乘法结束，才能开始下一条指令。
如果从函数里面取出这些指令，重新排列，则可以用其它互不依赖的指令，来分离这些相互依赖的指令，就可以并行地进行。（原文不是这样讲的，但我没读透，不好翻译）
在本例中我们可以重排代码，把访问累加寄存器的时机错开。


    vmul.f32    q12, q8, d0[0]              @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)
    vmul.f32    q13, q8, d2[0]              @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)
    vmul.f32    q14, q8, d4[0]              @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)
    vmul.f32    q15, q8, d6[0]              @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)
 
    vmla.f32    q12, q9, d0[1]              @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
    vmla.f32    q13, q9, d2[1]              @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
     ...
     ...

这个版本的代码在基于Cortex-A8的系统上性能翻倍。

你可以看看你的内核的技术参考手册，里面有关于指令周期延迟的详细描述。你值得花点时间熟悉它，以找到一些潜在的性能改进办法。

Source（源码）

这里懒得翻译了，其细节在前面都分析了，这里是完整源码。保留原文保留原味。

@
@ NEON matrix multiplication examples
@

.syntax unified

@
@ matrix_mul_float:
@ Calculate 4x4 (matrix 0) * (matrix 1) and store to result 4x4 matrix.
@  matrix 0, matrix 1 and result pointers can be the same,
@  ie. my_matrix = my_matrix * my_matrix is possible.
@
@ r0 = pointer to 4x4 result matrix, single precision floats, column major order
@ r1 = pointer to 4x4 matrix 0, single precision floats, column major order
@ r2 = pointer to 4x4 matrix 1, single precision floats, column major order
@

    .global matrix_mul_float
matrix_mul_float:
    vld1.32     {d16-d19}, [r1]!            @ load first eight elements of matrix 0
    vld1.32     {d20-d23}, [r1]!            @ load second eight elements of matrix 0
    vld1.32     {d0-d3}, [r2]!              @ load first eight elements of matrix 1
    vld1.32     {d4-d7}, [r2]!              @ load second eight elements of matrix 1

    vmul.f32    q12, q8, d0[0]              @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)
    vmul.f32    q13, q8, d2[0]              @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)
    vmul.f32    q14, q8, d4[0]              @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)
    vmul.f32    q15, q8, d6[0]              @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)

    vmla.f32    q12, q9, d0[1]              @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
    vmla.f32    q13, q9, d2[1]              @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
    vmla.f32    q14, q9, d4[1]              @ rslt col2 += (mat0 col1) * (mat1 col2 elt1)
    vmla.f32    q15, q9, d6[1]              @ rslt col3 += (mat0 col1) * (mat1 col3 elt1)

    vmla.f32    q12, q10, d1[0]             @ rslt col0 += (mat0 col2) * (mat1 col0 elt2)
    vmla.f32    q13, q10, d3[0]             @ rslt col1 += (mat0 col2) * (mat1 col1 elt2)
    vmla.f32    q14, q10, d5[0]             @ rslt col2 += (mat0 col2) * (mat1 col2 elt2)
    vmla.f32    q15, q10, d7[0]             @ rslt col3 += (mat0 col2) * (mat1 col2 elt2)

    vmla.f32    q12, q11, d1[1]             @ rslt col0 += (mat0 col3) * (mat1 col0 elt3)
    vmla.f32    q13, q11, d3[1]             @ rslt col1 += (mat0 col3) * (mat1 col1 elt3)
    vmla.f32    q14, q11, d5[1]             @ rslt col2 += (mat0 col3) * (mat1 col2 elt3)
    vmla.f32    q15, q11, d7[1]             @ rslt col3 += (mat0 col3) * (mat1 col3 elt3)

    vst1.32     {d24-d27}, [r0]!            @ store first eight elements of result
    vst1.32     {d28-d31}, [r0]!            @ store second eight elements of result

    mov         pc, lr                      @ return to caller





@ Macro: mul_col_s16
@
@ Multiply a four s16 element column of a matrix by the columns of a second matrix
@ to give a column of results. Elements are assumed to be in Q1.14 format.
@ Inputs:   col_d - d register containing a column of the matrix
@ Outputs:  res_d - d register containing the column of results 
@ Corrupts: register q12
@ Assumes:  the second matrix columns are in registers d16-d19 in column major order
@

    .macro mul_col_s16 res_d, col_d
    vmull.s16   q12, d16, \col_d[0]         @ multiply col element 0 by matrix col 0
    vmlal.s16   q12, d17, \col_d[1]         @ multiply-acc col element 1 by matrix col 1
    vmlal.s16   q12, d18, \col_d[2]         @ multiply-acc col element 2 by matrix col 2
    vmlal.s16   q12, d19, \col_d[3]         @ multiply-acc col element 3 by matrix col 3
    vqrshrn.s32 \res_d, q12, #14            @ shift right and narrow accumulator into
                                            @  Q1.14 fixed point format, with saturation
    .endm

@
@ matrix_mul_fixed:
@ Calculate 4x4 (matrix 0) * (matrix 1) and store to result 4x4 matrix.
@  matrix 0, matrix 1 and result pointers can be the same,
@  ie. my_matrix = my_matrix * my_matrix is possible
@
@ r0 = pointer to 4x4 result matrix, Q1.14 fixed point, column major order
@ r1 = pointer to 4x4 matrix 0, Q1.14 fixed point, column major order
@ r2 = pointer to 4x4 matrix 1, Q1.14 fixed point, column major order
@

    .global matrix_mul_fixed
    
matrix_mul_fixed:
    vld1.16     {d16-d19}, [r1]             @ load sixteen elements of matrix 0
    vld1.16     {d0-d3}, [r2]               @ load sixteen elements of matrix 1

    mul_col_s16 d4, d0                      @ matrix 0 * matrix 1 col 0
    mul_col_s16 d5, d1                      @ matrix 0 * matrix 1 col 1
    mul_col_s16 d6, d2                      @ matrix 0 * matrix 1 col 2
    mul_col_s16 d7, d3                      @ matrix 0 * matrix 1 col 3

    vst1.16     {d4-d7}, [r0]               @ store sixteen elements of result

    mov         pc, lr                      @ return to caller

【Neon】????码带为了Neon - Part3：Matrix Multiplication

Matrices（矩阵）

Algorithm（算法）

Code（代码）

Floating Point（浮点型）

An Aside: D and Q registers（插一段介绍：D和Q寄存器）

Back to the Code（看回代码）

Fixed Point（定点数）

Scheduling（调度）

Source（源码）

【Neon】????码带为了Neon - Part3：Matrix Multiplication

????码带为了Neon - Part2:Dealing With Leftovers