ARM 64-bit Architecture Optimization

程序员文章站 2022-06-08 22:09:51

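####Preface
　　This article introduces NEON assembly optimization on 64-bit ARM and is suitable for readers of any level. The previous article, "arm架构32位优化" (ARM 32-bit Architecture Optimization), already covered basic ARM syntax.
　　A friendly reminder: when compiling for embedded devices (that is, ARM boards), it is best to add -fsigned-char, because plain char defaults to unsigned char on ARM rather than signed char. In addition, add the -c option when compiling ARM assembly optimization code.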
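The effect of -fsigned-char can be checked from C. The following is a minimal sketch, not code from the original article: the function names `plain_char_is_signed`, `char_to_int` and `require_signed_char` are made up for illustration, and the only library facility used is `CHAR_MIN` from `<limits.h>`.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if plain `char` is signed for this compiler and target, else 0.
   CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widens a byte held in a plain char to int.
   For the bit pattern 0xFF this gives -1 with signed char, 255 with unsigned char. */
int char_to_int(char c)
{
    return (int)c;
}

/* Hypothetical guard for code that relies on signed char arithmetic:
   aborts early if the build did not use -fsigned-char (or an equivalent). */
void require_signed_char(void)
{
    assert(CHAR_MIN < 0);
}
```

On a typical ARM Linux toolchain, `plain_char_is_signed()` would be expected to return 0 without the flag and 1 with -fsigned-char, which is why code ported from x86 can change behavior silently.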

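####1、Introduction to the 64-bit ARM registers
#####1.1、ARM registers
　　　Unless stated otherwise, "ARM registers" in this article means the AArch64 registers.
　　　There are 31 64-bit general-purpose registers (X0~X30); their low 32 bits are called the W registers (W0~W30). The correspondence between Xn and Wn is shown in the figure:
　　　[Figure: mapping between the Xn and Wn registers]
　　The figure is taken from http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf B1.2.1 Register in AArch64 state
　　Note that the ARM registers follow the AAPCS64 calling convention:
　　　X0~X7 pass function arguments and return results. Generally, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
　　　X8 holds the address of the indirect result location, that is, the memory where the subroutine (here meaning the called function; the same applies below unless stated otherwise) writes a result that is too large to return in registers. The return address itself is held in X30 (LR);
　　　X19~X28 are callee-saved registers: a subroutine that uses them must save them first and restore them before returning;
　　　X18 (Platform Register, PR) is platform-specific and reserved for special purposes; do not use it;
　　　Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers onto the stack. For more information see: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf　General language issues
Original English text, quoted from: https://wiki.cdot.senecacollege.ca/wiki/Aarch64_Register_and_Instruction_Quick_Start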
　　　r0-r7 are used for arguments and return values; additional arguments are on the stack
　　　For syscalls, the syscall number is in r8
　　　r9-r15 are for temporary values (may get trampled)
　　　r16-r18 are used for intra-procedure-call and platform values (avoid)
　　　The called routine is expected to preserve r19-r28 *** These registers are generally safe to use in your program.
　　　r29 and r30 are used as the frame register and link register (avoid)
For details see: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf　5.1.1 General-purpose Registers
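The register roles listed above can be summarized as a small lookup. This is only an illustrative sketch in C: the enum and the function names `classify_gpr` and `callee_must_preserve` are invented here, and the ranges follow the AAPCS64 document cited above (section 5.1.1).

```c
#include <assert.h>

/* Roles of the AArch64 general-purpose registers X0..X30 under AAPCS64. */
typedef enum {
    ROLE_ARG_RESULT,      /* x0-x7: arguments and return values */
    ROLE_INDIRECT_RESULT, /* x8: address of an indirectly returned result */
    ROLE_TEMPORARY,       /* x9-x15: caller-saved temporaries */
    ROLE_INTRA_CALL,      /* x16-x17: IP0/IP1 intra-procedure-call scratch */
    ROLE_PLATFORM,        /* x18: platform register, best avoided */
    ROLE_CALLEE_SAVED,    /* x19-x28: preserved by the called routine */
    ROLE_FRAME_POINTER,   /* x29 */
    ROLE_LINK_REGISTER    /* x30: return address */
} gpr_role;

/* Maps a register number n (0..30) to its role. */
gpr_role classify_gpr(int n)
{
    assert(n >= 0 && n <= 30);
    if (n <= 7)  return ROLE_ARG_RESULT;
    if (n == 8)  return ROLE_INDIRECT_RESULT;
    if (n <= 15) return ROLE_TEMPORARY;
    if (n <= 17) return ROLE_INTRA_CALL;
    if (n == 18) return ROLE_PLATFORM;
    if (n <= 28) return ROLE_CALLEE_SAVED;
    if (n == 29) return ROLE_FRAME_POINTER;
    return ROLE_LINK_REGISTER;
}

/* Returns 1 if a subroutine that modifies Xn must restore it before
   returning (x19-x29 under AAPCS64), else 0. */
int callee_must_preserve(int n)
{
    gpr_role role = classify_gpr(n);
    return role == ROLE_CALLEE_SAVED || role == ROLE_FRAME_POINTER;
}
```

For hand-written assembly this means x9~x15 are the cheapest scratch registers, while any use of x19~x28 has to be paired with a save and restore.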
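#####1.2 NEON registers
　　　There are 32 128-bit NEON registers (V0~V31).
######1.2.1 Scalar registers
　　Each register can be viewed as a scalar register of a different width depending on the data type:
　　　　a 128-bit register (Q0~Q31);
　　　　a 64-bit register (D0~D31);
　　　　a 32-bit register (S0~S31);
　　　　a 16-bit register (H0~H31);
　　　　an 8-bit register (B0~B31).
　　Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on. As shown in the figure:
　　　 [Figure: B/H/S/D/Q scalar views of a NEON register]
　　　The figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 54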
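The nesting of the scalar views can be modeled in plain C. The sketch below is an assumption-laden illustration rather than real register access: `vreg` and the `view_*` helpers are hypothetical names, and the register is represented as 16 bytes with byte 0 as the least significant.

```c
#include <assert.h>
#include <stdint.h>

/* Models one 128-bit NEON register; bytes[0] is the least significant byte. */
typedef struct {
    uint8_t bytes[16];
} vreg;

/* Assembles the lowest nbytes (1..8) bytes of the register into an integer. */
uint64_t low_bits(const vreg *v, unsigned nbytes)
{
    assert(nbytes >= 1 && nbytes <= 8);
    uint64_t value = 0;
    for (unsigned i = 0; i < nbytes; i++)
        value |= (uint64_t)v->bytes[i] << (8 * i);
    return value;
}

/* Bn, Hn, Sn and Dn all start at bit 0 of the same register Vn. */
uint8_t  view_b(const vreg *v) { return (uint8_t)low_bits(v, 1); }
uint16_t view_h(const vreg *v) { return (uint16_t)low_bits(v, 2); }
uint32_t view_s(const vreg *v) { return (uint32_t)low_bits(v, 4); }
uint64_t view_d(const vreg *v) { return low_bits(v, 8); }
```

Each narrower view is simply the low part of the next wider one, which is the relationship the note about S0, D0 and Q0 describes.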

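######1.2.2 Vector registers

A 64-bit or 128-bit vector register holds one or more elements, as shown in the figure:
　　　 [Figure: element layouts of 64-bit and 128-bit vector registers]
　　　An index is then used to access an individual element, for example V0.D[0].
　　　The figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 55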

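######1.2.3 Calling convention
　　　V0~V7 pass function arguments and return results;
　　　V8~V15 are callee-saved: a subroutine that uses them must save them on the stack first, and only the low 64 bits (D8~D15) have to be preserved;
　　　V0~V7 and V16~V31 are caller-saved: the caller must save them itself if it needs their values after a call.
　　　
　　　Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.2 SIMD and Floating-Point Registers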

#####2、NEON instruction set

######2.1 ARMv8/AArch64 instruction format

In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:

{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

Where:

<prefix> - prefix, such as using S/U/F/P to represent signed/unsigned/float/polynomial data type.
<op> – operation, such as ADD, AND etc.
<suffix> - suffix
- P: “pairwise” operations, such as ADDP.
- V: the new reduction (across-all-lanes) operations, such as FMAXV.
- 2: new widening/narrowing “second part” instructions, such as ADDHN2, SADDL2.

ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.
SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.

<T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).
For example:

UADDLP    V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S

For more information, please refer to the documents listed in the Appendix.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
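To make the suffixes more concrete, here is a scalar C model of two of the instructions mentioned above: the pairwise widening add `UADDLP` and the "second part" widening add `SADDL2`. It is a sketch of the lane arithmetic only. The function names are invented for illustration, source and destination are separate arrays (the real `UADDLP V0.8H, V0.16B` reads and writes the same register), and lane 0 is taken to be the least significant element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of UADDLP Vd.4H, Vn.8B (src_lanes = 8) or UADDLP Vd.8H, Vn.16B
   (src_lanes = 16): each 16-bit result lane is the unsigned sum of two
   adjacent 8-bit source lanes. */
void uaddlp_model(uint16_t *dst, const uint8_t *src, size_t src_lanes)
{
    assert(src_lanes == 8 || src_lanes == 16);
    for (size_t i = 0; i < src_lanes / 2; i++)
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
}

/* Model of SADDL2 Vd.8H, Vn.16B, Vm.16B: only the upper eight lanes of each
   source are used; they are sign-extended to 16 bits and added lane by lane. */
void saddl2_model(int16_t dst[8], const int8_t a[16], const int8_t b[16])
{
    for (size_t i = 0; i < 8; i++)
        dst[i] = (int16_t)(a[8 + i] + b[8 + i]);
}
```

Because the results are widened, neither operation can overflow: the largest `UADDLP` lane value is 255 + 255 = 510, and the `SADDL2` lanes range from -256 to 254.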

######2.2 Post-index and pre-index addressing in instructions
[Figure: pre-index and post-index addressing modes]

Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf, page 150
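The difference between the two addressing modes comes down to when the base register is updated. The following C model is a hedged sketch: `toy_machine` and both function names are hypothetical, and memory is a 64-byte array with the base register stored as a byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy memory plus one base register holding a byte offset into it. */
typedef struct {
    uint8_t bytes[64];
    size_t  base;
} toy_machine;

/* Pre-index, like `str x0, [base, #imm]!`:
   the base is updated first, then the store uses the new base. */
void store_pre_index(toy_machine *m, ptrdiff_t imm, uint64_t value)
{
    size_t addr = (size_t)((ptrdiff_t)m->base + imm);
    assert(addr + sizeof value <= sizeof m->bytes);
    m->base = addr;
    memcpy(&m->bytes[addr], &value, sizeof value);
}

/* Post-index, like `ldr x0, [base], #imm`:
   the load uses the old base, then the base is updated. */
uint64_t load_post_index(toy_machine *m, ptrdiff_t imm)
{
    uint64_t value;
    assert(m->base + sizeof value <= sizeof m->bytes);
    memcpy(&value, &m->bytes[m->base], sizeof value);
    m->base = (size_t)((ptrdiff_t)m->base + imm);
    return value;
}
```

A pre-index store with a negative immediate followed later by a post-index load with the matching positive immediate behaves like a push and a pop, which is the pattern stack code on AArch64 typically uses.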

#####3、AArch64 instruction manuals
######3.1 AArch64 manual (English)
　　Download: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf

######3.2 Comparison table of ARM 32-bit and AArch64 instructions
　　Download: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

######3.3 Instruction quick-reference card
　　Download: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf

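#####4、Moving from ARM32 optimization to AArch64

Reference: https://blog.linuxplumbersconf.org/2014/ocw/system/presentations/2343/original/08 - Migrating code from ARM to ARM64.pdf
######4.1 Function return
[Figure: function return]
######4.2 Pushing registers onto the stack
######4.2.1 Pushing general-purpose registers

　　Because SP must stay 16-byte aligned and each X register is 8 bytes wide, AArch64 code pushes registers in pairs.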
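The arithmetic behind pushing in pairs can be written out as follows. This is a small illustrative sketch in C; `align_up_16` and `save_area_bytes` are names invented here, not part of any toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a byte count up to the next multiple of 16, the alignment SP needs. */
size_t align_up_16(size_t bytes)
{
    assert(bytes <= SIZE_MAX - 15u);
    return (bytes + 15u) & ~(size_t)15u;
}

/* Stack bytes needed to save `count` 8-byte registers while keeping SP
   16-byte aligned: an odd register still costs a full 16-byte slot. */
size_t save_area_bytes(size_t count)
{
    assert(count <= (SIZE_MAX - 15u) / 8u);
    return align_up_16(count * 8u);
}
```

Saving one register therefore costs as much stack as saving two, so pairing registers in a single `stp` wastes no space; saving eight 64-bit registers takes 64 bytes.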
######4.2.2 Pushing NEON registers

.macro push_v_regs
	// save d8-d15 (the low 64 bits of v8-v15), one 16-byte pair per store
	stp    d8, d9, [sp, #-16]!
	stp    d10, d11, [sp, #-16]!
	stp    d12, d13, [sp, #-16]!
	stp    d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
	// restore in the reverse order of push_v_regs
	ldp    d14, d15, [sp], #16
	ldp    d12, d13, [sp], #16
	ldp    d10, d11, [sp], #16
	ldp    d8, d9, [sp], #16
.endm

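As for why d8~d15 are pushed when the registers in use are v8~v15: only the low 64 bits of v8~v15 have to be preserved by the callee, see "1.2.3 Calling convention".
Unfortunately, when debugging with GDB, this way of pushing triggers the following error: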

tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt’ failed.

Workaround:

.macro push_v_regsd
   sub   sp, sp, #128                   // reserve 8 x 16 bytes; SP stays 16-byte aligned
   mov   x9, sp                         // x9 is assumed to be free as a scratch register here
   st1   {v8.8h, v9.8h}, [x9], #32
   st1   {v10.8h, v11.8h}, [x9], #32
   st1   {v12.8h, v13.8h}, [x9], #32
   st1   {v14.8h, v15.8h}, [x9]
.endm
.macro pop_v_regsd
   mov   x9, sp                         // SP must point at the bottom of the save area again
   ld1   {v8.8h, v9.8h}, [x9], #32
   ld1   {v10.8h, v11.8h}, [x9], #32
   ld1   {v12.8h, v13.8h}, [x9], #32
   ld1   {v14.8h, v15.8h}, [x9]
   add   sp, sp, #128                   // release the save area
.endm

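Note: although this method avoids the problem seen during GDB debugging, once debugging is finished you should switch back to pushing the d registers (that is, push_v_regs); otherwise timing statistics may fail to be collected. To make it easy to switch between the two approaches, use a macro definition: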

 #define push_v_regs push_v_regsd

**For more on pushing to the stack in AArch64, see:**
　　Stack reference 1: https://*.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
　　Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
　　Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop

#####5、Further reading
Reference: https://www.raspberrypi.org/forums/viewtopic.php?t=191774
Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf
