[TOC]

概述

文章参考:https://zhuanlan.zhihu.com/p/441686632

官网指令集汇总:https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#markdown-toc-addition

文章参考:https://developer.arm.com/architectures/instruction-sets/intrinsics

intrinsics指令分类

正常指令:生成大小相同且类型通常与操作数向量相同的结果向量;

长指令:对双字向量操作数执行运算,生成四字向量的结果。所生成的元素一般是操作数元素宽度的两倍,并属于同一类型;

宽指令:一个双字向量操作数和一个四字向量操作数执行运算,生成四字向量结果。所生成的元素和第一个操作数的元素是第二个操作数元素宽度的两倍;

窄指令:四字向量操作数执行运算,并生成双字向量结果,所生成的元素一般是操作数元素宽度的一半;

饱和指令:当超过数据类型指定的范围则自动限制在该范围内。

功能类别 介绍
Load/Store 对数据进行向量加载和存储,既可以对单个数据进行加载和存储,也可以对向量结构体数据进行加载和存储
Arithmetic 对整数和浮点数向量加减运算
Multiply 整型或浮点型的向量乘法运算,同时包含了乘法和加法混合运算,以及乘法和减法的运算的混合运算
Shift 向量位移操作,其中位移数据可以是立即数也可以是向量
Logical and compare 包含了逻辑运算(与或非运算等)和比较运算(等于、大于、小于等)
Floating-point 包含了浮点和其他类型数据之间的相互转化操作
Permutation 对向量进行重排操作
Misecllaneous 标量数据赋值到向量的操作
Data processing 一般性处理,极值操作、绝对值差、数值取反、平方根倒数等
Type conversion 数值类型转换,数据的组合及提取等

NEON intrinsics指令详述

本节将对每种类型的 NEON intrinsics 指令做出详细的描述。

Load/Store

以解交织的方式加载数据(TODO 什么是解交织??)

1
2
3
4
5
// 以解交织方式加载数据到n个向量寄存器, n为1~4
Result_t vld[n]<q>_type(Scalar_t *p_addr);

// 以解交织方式加载数据到n个向量寄存器的第N通道, n为1~4
Result_t vld[n]<q>_lane_type(Scalar_t *p_addr, Vector_t M, int N);

以交织的方式存储数据

1
2
3
4
5
// 将n个向量寄存器数据以交织方式存储到内存中, n为1~4
void vst[n]<q>_type(Scalar_t* N, Vector_t M);

// 将n个寄存器的N通道数据以交织方式存储到内存中, n为1~4
void vst[n]<q>_lane_type(Scalar_t *p_addr, Vector_t M, int N);

img

Arithmetic

整数和浮点数的加减运算。

1
2
3
// 基本的加减操作
Result_t vadd<q>_type(Vector1_t N, Vector2_t M);
Result_t vsub<q>_type(Vector1_t N, Vector2_t M);

具体指令参考:

Intrinsic(内联指令) AArch64 Instruction Supported architectures 说明 备注
int8x8_t vadd_s8(int8x8_t a,int8x8_t b) ADD Vd.8B,Vn.8B,Vm.8B v7/A32/A64 正常指令add。向量8位8通道int数据a+b做加法运算,
int8x16_t vaddq_s8(int8x16_t a,int8x16_t b) ADD Vd.16B,Vn.16B,Vm.16B v7/A32/A64 正常指令add。向量8位16通道int数据a+b做加法运算,
int16x4_t vadd_s16( int16x4_t a, int16x4_t b) ADD Vd.4H,Vn.4H,Vm.4H v7/A32/A64
int16x8_t vaddq_s16( int16x8_t a, int16x8_t b) ADD Vd.8H,Vn.8H,Vm.8H v7/A32/A64
int32x2_t vadd_s32( int32x2_t a, int32x2_t b) ADD Vd.2S,Vn.2S,Vm.2S v7/A32/A64
int32x4_t vaddq_s32( int32x4_t a, int32x4_t b) ADD Vd.4S,Vn.4S,Vm.4S v7/A32/A64
int64x1_t vadd_s64( int64x1_t a, int64x1_t b) ADD Dd,Dn,Dm v7/A32/A64
int64x2_t vaddq_s64( int64x2_t a, int64x2_t b) ADD Vd.2D,Vn.2D,Vm.2D v7/A32/A64
uint8x8_t vadd_u8( uint8x8_t a, uint8x8_t b) ADD Vd.8B,Vn.8B,Vm.8B v7/A32/A64
uint8x16_t vaddq_u8( uint8x16_t a, uint8x16_t b) ADD Vd.16B,Vn.16B,Vm.16B v7/A32/A64
uint16x4_t vadd_u16( uint16x4_t a, uint16x4_t b) ADD Vd.4H,Vn.4H,Vm.4H v7/A32/A64
uint16x8_t vaddq_u16( uint16x8_t a, uint16x8_t b) ADD Vd.8H,Vn.8H,Vm.8H v7/A32/A64
uint32x2_t vadd_u32( uint32x2_t a, uint32x2_t b) ADD Vd.2S,Vn.2S,Vm.2S v7/A32/A64
uint32x4_t vaddq_u32( uint32x4_t a, uint32x4_t b) ADD Vd.4S,Vn.4S,Vm.4S v7/A32/A64
uint64x1_t vadd_u64( uint64x1_t a, uint64x1_t b) ADD Dd,Dn,Dm v7/A32/A64
uint64x2_t vaddq_u64( uint64x2_t a, uint64x2_t b) ADD Vd.2D,Vn.2D,Vm.2D v7/A32/A64
float32x2_t vadd_f32( float32x2_t a, float32x2_t b) FADD Vd.2S,Vn.2S,Vm.2S v7/A32/A64
float32x4_t vaddq_f32( float32x4_t a, float32x4_t b) FADD Vd.4S,Vn.4S,Vm.4S v7/A32/A64
float64x1_t vadd_f64( float64x1_t a, float64x1_t b) FADD Dd,Dn,Dm A64
float64x2_t vaddq_f64( float64x2_t a, float64x2_t b) FADD Vd.2D,Vn.2D,Vm.2D A64
int64_t vaddd_s64( int64_t a, int64_t b) ADD Dd,Dn,Dm A64
uint64_t vaddd_u64( uint64_t a, uint64_t b) ADD Dd,Dn,Dm A64

Multiply

整型和浮点型的乘法运算, 参与计算的都是向量

1
2
// 基本乘法操作
Result_t vmul<q>_type(Vector1_t N, Vector2_t M);
Intrinsic(内联函数) AArch64 Instruction Supported architectures 说明 备注
int8x8_t vmul_s8(int8x8_t a,int8x8_t b) a -> Vn.8B b -> Vm.8B MUL Vd.8B,Vn.8B,Vm.8B Vd.8B -> result v7/A32/A64
int8x16_t vmulq_s8( int8x16_t a, int8x16_t b) a -> Vn.16B b -> Vm.16B MUL Vd.16B,Vn.16B,Vm.16B Vd.16B -> result v7/A32/A64
int16x4_t vmul_s16( int16x4_t a, int16x4_t b) a -> Vn.4H b -> Vm.4H MUL Vd.4H,Vn.4H,Vm.4H Vd.4H -> result v7/A32/A64
int16x8_t vmulq_s16( int16x8_t a, int16x8_t b) a -> Vn.8H b -> Vm.8H MUL Vd.8H,Vn.8H,Vm.8H Vd.8H -> result v7/A32/A64
int32x2_t vmul_s32( int32x2_t a, int32x2_t b) a -> Vn.2S b -> Vm.2S MUL Vd.2S,Vn.2S,Vm.2S Vd.2S -> result v7/A32/A64
int32x4_t vmulq_s32( int32x4_t a, int32x4_t b) a -> Vn.4S b -> Vm.4S MUL Vd.4S,Vn.4S,Vm.4S Vd.4S -> result v7/A32/A64
uint8x8_t vmul_u8( uint8x8_t a, uint8x8_t b) a -> Vn.8B b -> Vm.8B MUL Vd.8B,Vn.8B,Vm.8B Vd.8B -> result v7/A32/A64
uint8x16_t vmulq_u8( uint8x16_t a, uint8x16_t b) a -> Vn.16B b -> Vm.16B MUL Vd.16B,Vn.16B,Vm.16B Vd.16B -> result v7/A32/A64
uint16x4_t vmul_u16( uint16x4_t a, uint16x4_t b) a -> Vn.4H b -> Vm.4H MUL Vd.4H,Vn.4H,Vm.4H Vd.4H -> result v7/A32/A64
uint16x8_t vmulq_u16( uint16x8_t a, uint16x8_t b) a -> Vn.8H b -> Vm.8H MUL Vd.8H,Vn.8H,Vm.8H Vd.8H -> result v7/A32/A64
uint32x2_t vmul_u32( uint32x2_t a, uint32x2_t b) a -> Vn.2S b -> Vm.2S MUL Vd.2S,Vn.2S,Vm.2S Vd.2S -> result v7/A32/A64
uint32x4_t vmulq_u32( uint32x4_t a, uint32x4_t b) a -> Vn.4S b -> Vm.4S MUL Vd.4S,Vn.4S,Vm.4S Vd.4S -> result v7/A32/A64
float32x2_t vmul_f32( float32x2_t a, float32x2_t b) a -> Vn.2S b -> Vm.2S FMUL Vd.2S,Vn.2S,Vm.2S Vd.2S -> result v7/A32/A64
float32x4_t vmulq_f32( float32x4_t a, float32x4_t b) a -> Vn.4S b -> Vm.4S FMUL Vd.4S,Vn.4S,Vm.4S Vd.4S -> result v7/A32/A64
float64x1_t vmul_f64( float64x1_t a, float64x1_t b) a -> Dn b -> Dm FMUL Dd,Dn,Dm Dd -> result A64
float64x2_t vmulq_f64( float64x2_t a, float64x2_t b) a -> Vn.2D b -> Vm.2D FMUL Vd.2D,Vn.2D,Vm.2D Vd.2D -> result A64
uint64x2_t vmull_high_u32( uint32x4_t a, uint32x4_t b) a -> Vn.4S b -> Vm.4S UMULL2 Vd.2D,Vn.4S,Vm.4S Vd.2D -> result A64