
ARM 64-bit Architecture Optimization

程序员文章站 (Programmer Article Station), 2022-06-08 22:09:51

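####Preface
  This article introduces NEON assembly optimization on the 64-bit ARM architecture and is suitable for readers of any level. The previous article, "ARM 32-bit Architecture Optimization", already covered the basic ARM syntax.
  A friendly reminder: when compiling for embedded devices (that is, ARM-based boards), it is best to add -fsigned-char, because on these targets plain char defaults to unsigned char rather than signed char. In addition, when compiling ARM assembly optimization code, add the -c option to the compile flags.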
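
The char signedness issue mentioned above can be observed with a small C sketch. The function names (`plain_char_is_signed`, `widen_plain`, `widen_signed`) are illustrative and do not come from the original article; the sketch only assumes a hosted C compiler such as GCC or Clang.

```c
#include <limits.h>

/* Returns 1 if plain char is signed in this build, 0 if it is unsigned.
   Building with -fsigned-char forces the signed behaviour. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widening through plain char: the result depends on char signedness,
   so 0xFF becomes -1 when char is signed and 255 when it is unsigned. */
int widen_plain(unsigned char raw)
{
    char c = (char)raw;
    return (int)c;
}

/* Widening through an explicitly signed type gives the same result
   whatever the default signedness of plain char is. */
int widen_signed(unsigned char raw)
{
    signed char c = (signed char)raw;
    return (int)c;
}
```

Code that must behave the same on x86 and ARM can either be built with -fsigned-char or use `signed char` / `int8_t` explicitly, as `widen_signed` does.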

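####1. Introduction to the 64-bit ARM registers
#####1.1 ARM registers
   Unless stated otherwise, "ARM registers" in this article means the AArch64 registers.
   There are 31 64-bit general-purpose registers (X0~X30). The low 32 bits of each are called the W registers (W0~W30). The correspondence between Xn and Wn is shown in the figure:
[Figure: correspondence between the Xn and Wn registers]
  This figure is taken from http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf B1.2.1 Register in AArch64 state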
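
As a rough model of the Xn/Wn relationship, the following C sketch treats an X register as a `uint64_t`. The helper names `read_w` and `write_w` are hypothetical. It relies on one architectural fact that the text above does not spell out: a write to Wn clears the upper 32 bits of Xn.

```c
#include <stdint.h>

/* Reading Wn returns the low 32 bits of Xn. */
uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing Wn replaces the low 32 bits and zeroes bits 63..32 of Xn,
   so the previous contents of the X register do not survive. */
uint64_t write_w(uint64_t old_x, uint32_t w)
{
    (void)old_x; /* the old upper half is discarded by the write */
    return (uint64_t)w;
}
```

This is why a 32-bit result produced in Wn can be used directly as a zero-extended 64-bit value in Xn.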
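  Note that register usage follows the AAPCS64 procedure call standard, as shown in the figure:
[Figure: AAPCS64 roles of the general-purpose registers]
   X0~X7 pass function arguments and return results. In general, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
   X8 is the indirect result location register: it holds the address of the memory to which a subroutine (here meaning the callee function; the same applies below unless stated otherwise) writes a result that does not fit in registers. It does not hold the return address, which is kept in X30 (LR);
   X19~X28 are callee-saved registers: a subroutine that uses them must save them first and restore them before returning;
   X18 (Platform Register, PR) is platform-specific and reserved for special purposes; do not use it;
   Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers onto the stack. For more information see: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf General language issues
Original English text, quoted from: https://wiki.cdot.senecacollege.ca/wiki/Aarch64_Register_and_Instruction_Quick_Start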
   r0-r7 are used for arguments and return values; additional arguments are on the stack
   For syscalls, the syscall number is in r8
   r9-r15 are for temporary values (may get trampled)
   r16-r18 are used for intra-procedure-call and platform values (avoid)
   The called routine is expected to preserve r19-r28 *** These registers are generally safe to use in your program.
   r29 and r30 are used as the frame register and link register (avoid)
For details see: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.1 General-purpose Registers
#####1.2 NEON registers
   There are 32 128-bit NEON registers (V0~V31).
######1.2.1 Scalar registers
   Depending on the data type, each register can be viewed as a different scalar register:
    a 128-bit register (Q0~Q31);
    a 64-bit register (D0~D31);
    a 32-bit register (S0~S31);
    a 16-bit register (H0~H31);
    an 8-bit register (B0~B31).
  Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on, as shown in the figure:
[Figure: scalar views (B, H, S, D, Q) of a NEON register]
   This figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 54
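
The nesting of the scalar views can be modelled in C. This is only a sketch: the `vreg_t` struct and the `view_*` helpers are illustrative names, and the 128-bit register is represented as two 64-bit halves.

```c
#include <stdint.h>

/* A 128-bit V register modelled as two halves:
   lo holds bits 63..0, hi holds bits 127..64. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} vreg_t;

/* Each narrower view is the least significant part of the wider one. */
uint64_t view_d(vreg_t v) { return v.lo; }           /* Dn: bits 63..0 */
uint32_t view_s(vreg_t v) { return (uint32_t)v.lo; } /* Sn: bits 31..0 */
uint16_t view_h(vreg_t v) { return (uint16_t)v.lo; } /* Hn: bits 15..0 */
uint8_t  view_b(vreg_t v) { return (uint8_t)v.lo; }  /* Bn: bits 7..0  */
```

Unlike AArch32, where D0 and D1 together form Q0, every view here shares the same register number, so Dn, Sn, Hn and Bn all alias the low bits of Vn.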

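######1.2.2 Vector registers

A 64-bit or 128-bit wide vector register can hold one or more elements, as shown in the figure:
[Figure: element layouts of 64-bit and 128-bit vector registers]
   An index then selects an individual element, for example V0.D[0] (the lowest 64-bit element of V0).
   This figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 55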
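
Element indexing can be sketched in C as follows. The function `get_lane` is a hypothetical helper: it takes the two 64-bit halves of a register, an element width in bits (8, 16, 32 or 64), and an index, with element 0 being the least significant.

```c
#include <stdint.h>

/* Returns element `index` of a 128-bit register given as two halves
   (lo = bits 63..0, hi = bits 127..64), viewed as lanes of `bits` width.
   `bits` must be 8, 16, 32 or 64, and `index` must be below 128 / bits. */
uint64_t get_lane(uint64_t lo, uint64_t hi, unsigned bits, unsigned index)
{
    unsigned bit_pos = bits * index;            /* lowest bit of the lane */
    uint64_t half = (bit_pos < 64) ? lo : hi;   /* lanes never straddle halves */
    unsigned shift = bit_pos % 64;
    uint64_t mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
    return (half >> shift) & mask;
}
```

For instance, `get_lane(lo, hi, 64, 0)` corresponds to the element written as D[0], and `get_lane(lo, hi, 32, 3)` to S[3].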

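######1.2.3 Calling convention
   V0~V7 pass function arguments and return results;
   V8~V15 are callee-saved: a subroutine that uses them must save them on the stack, and only the low 64 bits of each (D8~D15) need to be preserved;
   V0~V7 and V16~V31 are caller-saved: the caller must save them if it needs their values after a call;
   
   Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.2 SIMD and Floating-Point Registers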

#####2. NEON instruction set

######2.1 ARMv8/AArch64 instruction format

In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:

{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

Where:

  • <prefix> - prefix, such as S/U/F/P to represent the signed/unsigned/float/polynomial data types.
  • <op> - operation, such as ADD, AND etc.
  • <suffix> - suffix
    • P: “pairwise” operations, such as ADDP.
    • V: the new reduction (across-all-lanes) operations, such as FMAXV.
    • 2: new widening/narrowing “second part” instructions, such as ADDHN2, SADDL2.

ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.
SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.

  • <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).
     For example:
UADDLP    V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S

For more information, please refer to the documents listed in the Appendix.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
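
To make the `UADDLP V0.8H, V0.16B` example from section 2.1 concrete, here is a C sketch of what the instruction computes: adjacent unsigned bytes are added in pairs and each sum is widened to 16 bits. The function name `uaddlp_16b_to_8h` is illustrative. On AArch64 builds the sketch uses the `vpaddlq_u8` intrinsic from `arm_neon.h`; elsewhere it falls back to a scalar loop so that it still compiles.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Pairwise add long, unsigned: dst[i] = src[2*i] + src[2*i + 1],
   with every sum held in a 16-bit element so it cannot overflow. */
void uaddlp_16b_to_8h(const uint8_t src[16], uint16_t dst[8])
{
#if defined(__aarch64__)
    /* 16 x 8-bit lanes in, 8 x 16-bit lanes out. */
    vst1q_u16(dst, vpaddlq_u8(vld1q_u8(src)));
#else
    /* Scalar fallback with the same result on non-ARM hosts. */
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
#endif
}
```

The arrangement specifiers in the assembly (`.16B` for the source, `.8H` for the destination) match the element counts and widths of the two arrays.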

######2.2 Post-index and pre-index addressing in instructions
[Figures: post-index and pre-index addressing modes]
Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf, page 150
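
Since the figures for this section are not available, the difference between the two modes can be sketched in C. The helpers `pre_index` and `post_index` are hypothetical: each takes a pointer to the base register value and an offset, updates the base, and returns the address that the load or store would access.

```c
#include <stdint.h>

/* Pre-index, e.g. "stp d8, d9, [sp, #-16]!":
   the base is updated first and the new value is the access address. */
uint64_t pre_index(uint64_t *base, int64_t offset)
{
    *base += (uint64_t)offset;
    return *base;
}

/* Post-index, e.g. "ldp d8, d9, [sp], #16":
   the old base is the access address, then the base is updated. */
uint64_t post_index(uint64_t *base, int64_t offset)
{
    uint64_t addr = *base;
    *base += (uint64_t)offset;
    return addr;
}
```

A pre-index store with a negative offset followed later by a post-index load with the matching positive offset gives the usual push/pop pair: both access the same address and the base ends up where it started.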

#####3. AArch64 instruction manuals
######3.1 AArch64 reference manual (English)
  Download: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf

######3.2 Cross-reference of ARM 32-bit and AArch64 instructions
  Download: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

######3.3 Instruction quick-reference card
  Download: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf

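#####4. Moving from ARM32 optimization to AArch64

Reference: https://blog.linuxplumbersconf.org/2014/ocw/system/presentations/2343/original/08 - Migrating code from ARM to ARM64.pdf
######4.1 Function return
[Figure: function return in ARM32 and in AArch64]
######4.2 Pushing registers onto the stack
######4.2.1 Pushing general-purpose registers
[Figure: pushing general-purpose registers in ARM32 and in AArch64]
  Because SP must stay 16-byte aligned, AArch64 code pushes registers in pairs.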
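
The arithmetic behind pairing can be sketched in C. The helper `push_frame_bytes` is a hypothetical name: given a number of 64-bit registers to save, it returns how many bytes of stack must be reserved so that SP remains a multiple of 16.

```c
#include <stddef.h>

/* Bytes to reserve for `count` 64-bit registers, rounded up to the
   next multiple of 16 so that SP keeps its required alignment. */
size_t push_frame_bytes(size_t count)
{
    size_t raw = count * 8;           /* each X or D register is 8 bytes */
    return (raw + 15) & ~(size_t)15;  /* round up to a multiple of 16 */
}
```

One register and two registers both cost 16 bytes, so saving a single register wastes half a slot; this is why registers are normally saved two at a time with stp/ldp.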
######4.2.2 Pushing NEON registers

.macro push_v_regs
	stp    d8, d9, [sp, #-16]!
	stp    d10, d11, [sp, #-16]!
	stp    d12, d13, [sp, #-16]!
	stp    d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
	ldp    d14, d15, [sp], #16
	ldp    d12, d13, [sp], #16
	ldp    d10, d11, [sp], #16
	ldp    d8, d9, [sp], #16
.endm

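As for why d8~d15 are pushed when the registers being used are v8~v15, see section 1.2.3 on the calling convention.
Unfortunately, when debugging with GDB, this way of pushing produces the following message: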

tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt' failed.

Workaround:

.macro push_v_regsd
   // Lower SP before each store so the saved data always lies at or above SP.
   sub   sp, sp, #32
   st1   {v8.8h, v9.8h}, [sp]
   sub   sp, sp, #32
   st1   {v10.8h, v11.8h}, [sp]
   sub   sp, sp, #32
   st1   {v12.8h, v13.8h}, [sp]
   sub   sp, sp, #32
   st1   {v14.8h, v15.8h}, [sp]
.endm
.macro pop_v_regsd
  // Restore in reverse order; the post-index form adds 32 to SP after each load.
  ld1   {v14.8h, v15.8h}, [sp], #32
  ld1   {v12.8h, v13.8h}, [sp], #32
  ld1   {v10.8h, v11.8h}, [sp], #32
  ld1   {v8.8h, v9.8h}, [sp], #32
.endm

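Note: although this method solves the problem seen during GDB debugging, once debugging is finished you should switch back to pushing the d registers (that is, push_v_regs); otherwise timing information cannot be collected. To make switching between the two easier, a macro definition can be used: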

 #define push_v_regs push_v_regsd
**For more on pushing to the stack in AArch64, see:**
Stack reference 1: https://*.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop

#####5. Further reading
Reference: https://www.raspberrypi.org/forums/viewtopic.php?t=191774
Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf