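ARM 64-bit architecture optimization
####Preface
This article introduces NEON assembly optimization for the 64-bit ARM architecture (AArch64) and is written for readers at any level. The previous article, "ARM 32-bit architecture optimization" (《arm架构32位优化》), already covered basic ARM syntax.
Friendly reminder: when compiling for embedded devices (i.e. ARM boards), it is best to add -fsigned-char, because plain char defaults to unsigned char on these targets rather than signed char. In addition, when compiling ARM assembly optimization code, add the -c option (compile or assemble only, without linking).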
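The effect of char signedness can be checked with a small C sketch. The function names below are hypothetical and only illustrate the point: widening a plain char gives different results depending on whether the compiler treats it as signed, while an explicit signed char behaves the same everywhere.

```c
#include <limits.h>

/* Returns 1 if plain char is signed with the current compiler settings, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widens a plain char: sign-extends if char is signed, zero-extends otherwise. */
int widen_plain_char(char c)
{
    return c;
}

/* Widens an explicitly signed char: always sign-extends, whatever the target. */
int widen_signed_char(signed char c)
{
    return c;
}
```

Code that stores negative pixel deltas or coefficients in plain char is where this typically matters; declaring such data as signed char (or int8_t) avoids relying on the compiler flag at all.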
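####1. Introduction to ARM 64-bit registers
#####1.1 ARM registers
Unless stated otherwise, "ARM registers" in this article means the AArch64 registers.
There are 31 64-bit general-purpose registers (X0~X30); their lower 32 bits are called the W registers (W0~W30). The correspondence between Xn and Wn is shown in the figure:
This figure is taken from http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf B1.2.1 Register in AArch64 state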
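The Xn/Wn relationship can be modelled in a few lines of C. This is only a sketch with hypothetical helper names, not real register access. It also shows one behaviour the text does not spell out: a write to Wn clears the upper 32 bits of Xn.

```c
#include <stdint.h>

/* Reading Wn yields the low 32 bits of Xn. */
uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing Wn replaces the low 32 bits and clears the upper 32 bits of Xn. */
uint64_t write_w(uint64_t old_x, uint32_t w)
{
    (void)old_x;          /* nothing of the old Xn value survives the write */
    return (uint64_t)w;
}
```

For example, `write_w(0xFFFFFFFFFFFFFFFF, 1)` gives 1, which is why a 32-bit result in Wn can be used as a 64-bit unsigned value without an extra extension instruction.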
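Note that ARM register usage follows the AAPCS64 procedure call standard, as shown in the figure:
X0~X7 pass function arguments and return results. In general, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
X8 is the indirect result location register: it holds the address of the memory block used when a subroutine (here meaning the called function; the same applies below unless stated otherwise) returns a result too large for registers. The return address is held in X30 (the link register), not in X8;
X19~X28 are callee-saved registers: a subroutine that uses them must save them first and restore them before returning;
X18 (Platform Register, PR) is platform-specific and reserved for special purposes; do not use it;
Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers onto the stack. For more information, see: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf General language issues
English original, quoted from: https://wiki.cdot.senecacollege.ca/wiki/Aarch64_Register_and_Instruction_Quick_Start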
- r0-r7 are used for arguments and return values; additional arguments are on the stack
- For syscalls, the syscall number is in r8
- r9-r15 are for temporary values (may get trampled)
- r16-r18 are used for intra-procedure-call and platform values (avoid)
- The called routine is expected to preserve r19-r28. These registers are generally safe to use in your program.
- r29 and r30 are used as the frame register and link register (avoid)
For details, see: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.1 General-purpose Registers
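The register roles above can be condensed into a lookup function, which is handy as a checklist when deciding which registers a hand-written routine may use freely. The enum and function names are hypothetical. The X8 and X16/X17 roles follow the AAPCS64 document referenced above; the syscall-number use of r8 in the quoted list is a Linux convention and is not modelled here.

```c
#include <assert.h>

typedef enum {
    ROLE_ARG_RESULT,       /* X0-X7: arguments and return values       */
    ROLE_INDIRECT_RESULT,  /* X8: indirect result location             */
    ROLE_TEMPORARY,        /* X9-X15: temporaries, may be trampled     */
    ROLE_INTRA_CALL,       /* X16-X17: intra-procedure-call scratch    */
    ROLE_PLATFORM,         /* X18: platform register, avoid            */
    ROLE_CALLEE_SAVED,     /* X19-X28: must be preserved by the callee */
    ROLE_FRAME_POINTER,    /* X29 */
    ROLE_LINK_REGISTER     /* X30 */
} xreg_role;

/* Maps a general-purpose register number (0..30) to its calling-convention role. */
xreg_role classify_xreg(int n)
{
    assert(n >= 0 && n <= 30);
    if (n <= 7)  return ROLE_ARG_RESULT;
    if (n == 8)  return ROLE_INDIRECT_RESULT;
    if (n <= 15) return ROLE_TEMPORARY;
    if (n <= 17) return ROLE_INTRA_CALL;
    if (n == 18) return ROLE_PLATFORM;
    if (n <= 28) return ROLE_CALLEE_SAVED;
    if (n == 29) return ROLE_FRAME_POINTER;
    return ROLE_LINK_REGISTER;
}
```

In practice this means a leaf routine that only needs X0~X15 has nothing to save, while any use of X19~X28 requires a push and pop around the routine body.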
#####1.2 NEON registers
There are 32 128-bit NEON registers (V0~V31).
######1.2.1 Scalar registers
Depending on the data type, each register can be accessed as one of the following scalar registers:
a 128-bit register (Q0~Q31);
a 64-bit register (D0~D31);
a 32-bit register (S0~S31);
a 16-bit register (H0~H31);
an 8-bit register (B0~B31).
Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on, as shown in the figure:
This figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 54
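The nesting of the scalar views can be sketched in C as successive truncations of the same 128-bit value. The `vreg128` type and the `view_*` helpers are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* One 128-bit NEON register modelled as two 64-bit halves. */
typedef struct {
    uint64_t lo;   /* bits 63..0   */
    uint64_t hi;   /* bits 127..64 */
} vreg128;

/* Dn, Sn, Hn and Bn all alias the lowest bits of the same register n. */
uint64_t view_d(vreg128 v) { return v.lo; }
uint32_t view_s(vreg128 v) { return (uint32_t)v.lo; }
uint16_t view_h(vreg128 v) { return (uint16_t)v.lo; }
uint8_t  view_b(vreg128 v) { return (uint8_t)v.lo; }
```

Note the contrast with 32-bit ARM, where D0 and D1 are the two halves of Q0: in AArch64 the views with the same number always belong to the same V register.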
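######1.2.2 Vector registers
A 64-bit or 128-bit wide vector register holds one or more elements, as shown in the figure:
An individual element is then accessed by index, for example V0.D[0].
This figure is from: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 55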
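As a rough illustration of how a vector register is divided into elements, the following C sketch computes the lane count of an arrangement and reads one 64-bit element by index. The helper names are hypothetical, and the register is modelled as 16 bytes with the least significant byte first.

```c
#include <assert.h>
#include <stdint.h>

/* Lane count of an arrangement: register width in bits / element width in bits. */
int lane_count(int reg_bits, int elem_bits)
{
    assert(reg_bits == 64 || reg_bits == 128);
    assert(elem_bits == 8 || elem_bits == 16 || elem_bits == 32 || elem_bits == 64);
    return reg_bits / elem_bits;
}

/* Reads 64-bit element `idx` (0 or 1) of a 128-bit register stored as 16 bytes. */
uint64_t get_lane_d(const uint8_t reg[16], int idx)
{
    assert(idx == 0 || idx == 1);
    uint64_t value = 0;
    for (int i = 7; i >= 0; --i)          /* most significant byte first */
        value = (value << 8) | reg[idx * 8 + i];
    return value;
}
```

So a 128-bit register viewed as halfwords has `lane_count(128, 16)` = 8 lanes, and a 64-bit register viewed as bytes also has 8.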
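######1.2.3 Calling convention
V0~V7 pass function arguments and return results;
V8~V15 must be saved on the stack by a subroutine that uses them (only the low 64 bits, D8~D15, have to be preserved);
V0~V7 and V16~V31 may need to be saved by the caller;
Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.2 SIMD and Floating-Point Registers
#####2. NEON instruction set
######2.1 ARMv8/AArch64 instruction format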
In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:
{<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>
Where:
- <prefix> - prefix, such as S/U/F/P to represent the signed/unsigned/float/polynomial data type.
- <op> - operation, such as ADD, AND etc.
- <suffix> - suffix
  - P: "pairwise" operations, such as ADDP.
  - V: the new reduction (across-all-lanes) operations, such as FMAXV.
  - 2: new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2.
    ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as the high 64-bit part of the NEON register.
    SADDL2: add the two high 64-bit vectors of the NEON registers and produce a 128-bit vector result.
- <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents double-word (64-bit).
For example:
UADDLP V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S
For more information, please refer to the documents listed in the Appendix.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
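To make the suffix descriptions concrete, here are scalar C reference models of three of the instructions named above. They are a sketch for checking assembly output against, not the instructions themselves; the function names are hypothetical, and each vector is modelled as a plain array with element 0 as the lowest lane.

```c
#include <stdint.h>

/* UADDLP Vd.8H, Vn.16B: add each adjacent pair of unsigned bytes into one 16-bit lane. */
void uaddlp_8h_16b(uint16_t dst[8], const uint8_t src[16])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
}

/* SADDL2 Vd.8H, Vn.16B, Vm.16B: widen and add the upper eight signed bytes of each source. */
void saddl2_8h_16b(int16_t dst[8], const int8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)(a[8 + i] + b[8 + i]);
}

/* ADDHN2 Vd.16B, Vn.8H, Vm.8H: add 16-bit lanes (wrapping), keep the high byte of
 * each sum and write the eight results to the upper half of dst.
 * The lower half of dst is left unchanged. */
void addhn2_16b_8h(uint8_t dst[16], const uint16_t a[8], const uint16_t b[8])
{
    for (int i = 0; i < 8; ++i) {
        uint16_t sum = (uint16_t)(a[i] + b[i]);
        dst[8 + i] = (uint8_t)(sum >> 8);
    }
}
```

The "2" variants are what allow a full 128-bit input or output to be processed in two steps: the plain form (for example ADDHN or SADDL) handles the lower half and the "2" form handles the upper half.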
######2.2 Post-index and pre-index addressing in instructions
Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf, page 150
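The difference between the two addressing modes is only the order of the base update and the memory access. A minimal C model, with hypothetical helper names, assuming 64-bit loads and an immediate that is a multiple of 8 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pre-index, like `ldr x0, [x1, #imm]!`:
 * the base is updated first, then the load uses the new address. */
uint64_t load_pre_index(const uint64_t **base, ptrdiff_t imm_bytes)
{
    assert(imm_bytes % 8 == 0);
    *base += imm_bytes / 8;
    return **base;
}

/* Post-index, like `ldr x0, [x1], #imm`:
 * the load uses the current base, then the base is updated. */
uint64_t load_post_index(const uint64_t **base, ptrdiff_t imm_bytes)
{
    assert(imm_bytes % 8 == 0);
    uint64_t value = **base;
    *base += imm_bytes / 8;
    return value;
}
```

This is why a push is written with pre-index (`[sp, #-16]!`, make room and then store) and the matching pop with post-index (`[sp], #16`, load and then release the room).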
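#####3. ARM 64-bit architecture instruction manuals
######3.1 AArch64 manual (English)
Download: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
######3.2 ARM 32-bit to AArch64 instruction comparison table
Download: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf
######3.3 Instruction quick-reference card
Download: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf
#####4. Moving from ARM32 optimization to AArch64
Reference: https://blog.linuxplumbersconf.org/2014/ocw/system/presentations/2343/original/08 - Migrating code from ARM to ARM64.pdf
######4.1 Function return
######4.2 Pushing registers onto the stack
######4.2.1 Pushing general-purpose registers
Because SP must stay 16-byte aligned and each X register is 8 bytes wide, AArch64 code pushes general-purpose registers in pairs (typically with stp/ldp).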
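The arithmetic behind pushing in pairs can be sketched as follows. The helper names are hypothetical; the point is simply that the stack space for any number of 8-byte registers has to be rounded up to a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to reserve for `count` 64-bit registers, rounded up to a multiple of 16. */
size_t push_area_bytes(size_t count)
{
    size_t raw = count * 8;
    return (raw + 15) & ~(size_t)15;
}

/* Returns 1 if a stack pointer value satisfies the 16-byte alignment rule. */
int sp_is_aligned(uintptr_t sp)
{
    return (sp & 15) == 0;
}
```

Pushing a single X register therefore still costs 16 bytes of stack, so an odd register is usually paired with a neighbour or with xzr rather than pushed alone.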
######4.2.2 Pushing NEON registers
.macro push_v_regs
stp d8, d9, [sp, #-16]!
stp d10, d11, [sp, #-16]!
stp d12, d13, [sp, #-16]!
stp d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
ldp d14, d15, [sp], #16
ldp d12, d13, [sp], #16
ldp d10, d11, [sp], #16
ldp d8, d9, [sp], #16
.endm
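As for why d8~d15 are pushed when the registers in use are v8~v15: only the low 64 bits of v8~v15 have to be preserved across a call; see "1.2.3 Calling convention".
Unfortunately, when debugging with GDB, this way of pushing produces:
tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt' failed.
Workaround: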
.macro push_v_regsd
// Reserve 32 bytes at a time so that saved data never lies below sp
sub sp, sp, #32
st1 {v8.8h, v9.8h}, [sp]
sub sp, sp, #32
st1 {v10.8h, v11.8h}, [sp]
sub sp, sp, #32
st1 {v12.8h, v13.8h}, [sp]
sub sp, sp, #32
st1 {v14.8h, v15.8h}, [sp]
.endm
.macro pop_v_regsd
// Restore in reverse order; post-index releases 32 bytes after each load
ld1 {v14.8h, v15.8h}, [sp], #32
ld1 {v12.8h, v13.8h}, [sp], #32
ld1 {v10.8h, v11.8h}, [sp], #32
ld1 {v8.8h, v9.8h}, [sp], #32
.endm
Note: although this method avoids the problem during GDB debugging, once debugging is finished you still need to go back to pushing the d registers (i.e. push_v_regs); otherwise timing statistics cannot be collected. To make switching between the two approaches easy, use macro definitions (this requires the assembly file to go through the C preprocessor, e.g. a .S file):
#define push_v_regs push_v_regsd
#define pop_v_regs pop_v_regsd
**For more on pushing registers in AArch64, see:**
Stack reference 1: https://*.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop
#####5. Further reading
Reference: https://www.raspberrypi.org/forums/viewtopic.php?t=191774
Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf