SIMD And AVX Explanation With C# Language
1 Required Files and Software Versions
- Microsoft.Bcl.Simd
- System.Numerics.Vectors
- Visual Studio 2015 Professional
- CPU-Z Software
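2 Key Resources
Microsoft.Bcl.Simd, download link: Microsoft.Bcl.Simd
Microsoft SIMD-enabled Vector Types, download link: System.Numerics.Vectors
CPU-Z Software, download link: CPU-Z
Brief introduction:
Microsoft.Bcl.Simd and Microsoft SIMD-enabled Vector Types (System.Numerics.Vectors) are the key intermediate assemblies for SIMD acceleration in C# and .NET. Code written against their vector types is mapped by the JIT compiler to hardware SIMD instructions such as SSE and AVX.
CPU-Z is a free utility that gathers information about the main devices in a system, similar to GPU-Z. The information it collects includes: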
- Processor name and number, codename, process, package, cache levels.
- Mainboard and chipset.
- Memory type, size, timings, and module specifications (SPD).
- Real time measurement of each core's internal frequency, memory frequency.
3 Basic Concepts
3.1 Background (from SSE & AVX Vectorization)
In recent years, CPUs have reached some physical and power limitations, so CPU speeds have not increased noticeably in terms of GHz. As computation requirements continue increasing, CPU designers have decided to solve this problem with three solutions:
- Adding more cores. This way Operating Systems can distribute running applications among different cores. Also, programs can create multiple threads to maximize core usage.
- Adding vector operations to each core. This solution allows the CPU to execute the same instructions on a vector of data. This can only be done at the application level.
- Out-of-order execution of multiple instructions. Modern CPUs can execute up to four instructions at the same time if they are independent.
Vector registers started in 1997 with the MMX instruction set, which had 64-bit registers. After that, the SSE instruction sets were released (several versions of them, from SSE1 to SSE4.2), with 128-bit registers. In 2011, Intel released the Sandy Bridge architecture with the AVX instruction set (256-bit registers). In 2016 the first AVX-512 CPU was released, with 512-bit registers (vectors of up to 16 32-bit floats).
In this Course we'll focus on both SSE and AVX instruction sets, because they are commonly found in recent processors. AVX-512 is out of scope, but most of the course can be reused, just by changing the 256-bit registers to the 512-bit counterparts (ZMM registers).
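In short, as microprocessors have evolved, SIMD execution units and registers have kept getting wider: Intel went from the 64-bit registers of the original MMX to the 128-bit registers of the SSE family, then to 256 bits with AVX, and the latest AVX-512 has 512-bit SIMD registers.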
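The register widths above translate directly into how many values one instruction can process. The following is a minimal sketch of that arithmetic; the class and method names (LaneCount, Lanes) are made up for illustration and are not part of any library.

```csharp
using System;

public static class LaneCount
{
    // Number of elements of a given size that fit in one SIMD register.
    public static int Lanes(int registerBits, int elementBits)
    {
        return registerBits / elementBits;
    }

    public static void Main()
    {
        // Register widths named in the text: MMX, SSE, AVX, AVX-512.
        int[] registerWidths = { 64, 128, 256, 512 };
        foreach (int bits in registerWidths)
        {
            Console.WriteLine($"{bits}-bit register: {Lanes(bits, 32)} x 32-bit, {Lanes(bits, 16)} x 16-bit");
        }
    }
}
```

For example, a 256-bit AVX register holds eight 32-bit values, and a 512-bit AVX-512 register holds sixteen.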
3.2 SIMD
Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment. SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve the performance of multimedia use.
From: SIMD Wiki
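The audio volume example from the definition can be sketched with Vector<float> from System.Numerics. This is only an illustration under assumptions: the names VolumeDemo and Scale and the sample values are invented here, and the samples are taken to be plain floats in an array.

```csharp
using System;
using System.Numerics;

public static class VolumeDemo
{
    // Multiplies every sample by gain, Vector<float>.Count samples at a time.
    public static void Scale(float[] samples, float gain)
    {
        int width = Vector<float>.Count;
        int i = 0;
        for (; i <= samples.Length - width; i += width)
        {
            // A single multiply is applied to a whole block of samples.
            (new Vector<float>(samples, i) * gain).CopyTo(samples, i);
        }
        // Scalar loop for the samples left over after the last full block.
        for (; i < samples.Length; ++i) samples[i] *= gain;
    }

    public static void Main()
    {
        // Hypothetical sample data, chosen only for the demonstration.
        float[] samples = { 0.5f, -0.25f, 1.0f, 0.0f, 0.75f, -1.0f, 0.125f, 0.5f, -0.5f };
        Scale(samples, 0.5f);
        Console.WriteLine(string.Join(" ", samples));
    }
}
```

The same instruction (a multiply by the gain) is applied to several data points at once, which is the "single instruction, multiple data" idea. The scalar loop at the end handles arrays whose length is not a multiple of the vector width.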
3.3 AVX or AVX2
Advanced Vector Extensions (AVX, also known as Sandy Bridge New Extensions) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.
AVX2 expands most integer commands to 256 bits and introduces fused multiply-accumulate (FMA) operations. AVX-512 expands AVX to 512-bit support utilizing a new EVEX prefix encoding proposed by Intel in July 2013 and first supported by Intel with the Knights Landing processor, which shipped in 2016.
From: AVX Wiki
3 Instruction Support Check
Use CPU-Z to check whether your computer's CPU supports hardware-accelerated vector instruction sets such as SSE and AVX.
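CPU-Z reports what the hardware supports. As a complementary check from inside a .NET program, the sketch below prints whether the runtime is actually using SIMD for Vector<T> and how wide the vectors are. Vector.IsHardwareAccelerated and Vector<T>.Count are real members of System.Numerics; the class and method names (SimdCheck, Describe) are made up for this example.

```csharp
using System;
using System.Numerics;

public static class SimdCheck
{
    // Summarizes what the JIT reports for Vector<T> on the current machine.
    public static string Describe()
    {
        // Vector<byte>.Count is the vector size in bytes, so times 8 gives bits.
        int bits = Vector<byte>.Count * 8;
        return $"IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
               $"Vector<T> width={bits} bits, " +
               $"Vector<int>.Count={Vector<int>.Count}, " +
               $"Vector<short>.Count={Vector<short>.Count}";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If IsHardwareAccelerated prints False on a CPU that CPU-Z shows as supporting SSE or AVX, the build settings are a likely cause; on .NET Framework, hardware acceleration generally depends on running as a 64-bit process with JIT optimizations enabled.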
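4 Program Code
Key point: after creating a new C# project, run the following commands in the NuGet Package Manager Console. They add the assemblies that let C# code use SSE or AVX for vectorized operations: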
Install-Package System.Numerics.Vectors -Version 4.4.0
Install-Package Microsoft.Bcl.Simd -Version 1.1.5-beta
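Note: the "already exists in the project" message here appears because I had already run the two commands above, so adding the packages a second time produces this message.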
Code in text form
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Numerics;
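namespace SIMD
{
    public class Stud
    {
        // Runs three element-wise addition benchmarks: scalar int, Vector<int>, and Vector<short>.
        static void allBench(int loop, int arraySize, int rerun)
        {
            Stopwatch sw = new Stopwatch();
            Random random = new Random();
            // LINQ-based generator, kept for reference; the benchmarks use genRandomIntArray2.
            Func<int, int[]> genRandomIntArray = n => Enumerable.Repeat(0, n).Select(i => random.Next(2 * n)).ToArray();
            // Fills an array of length n with random values in [0, 2n).
            Func<int, int[]> genRandomIntArray2 = n =>
            {
                int[] ar = new int[n];
                for (int k = 0; k < n; ++k) ar[k] = random.Next(2 * n);
                return ar;
            };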
            // Scalar baseline: adds the two arrays element by element, K times.
            Action<int, int> bench1 = (K, N) =>
            {
                int[] x = genRandomIntArray2(N);
                int[] y = genRandomIntArray2(N);
                int[] r = new int[N];
                sw.Restart(); // start timing after setup so only the additions are measured
                for (int k = 0; k < K; ++k)
                {
                    for (int i = 0; i < N; ++i) r[i] = x[i] + y[i];
                }
                //Console.WriteLine(String.Join(" ", x.Take(10)));
                //Console.WriteLine(String.Join(" ", y.Take(10)));
                //Console.WriteLine(String.Join(" ", r.Take(10)));
                Console.WriteLine($"Elapsed {sw.ElapsedMilliseconds,6} ms");
            };
            // SIMD version: adds Vector<int>.Count elements per vector operation.
            Action<int, int> bench2 = (K, N) =>
            {
                int[] x = genRandomIntArray2(N);
                int[] y = genRandomIntArray2(N);
                int[] r = new int[N];
                int width = Vector<int>.Count;
                sw.Restart(); // start timing after setup so only the additions are measured
                for (int k = 0; k < K; ++k)
                {
                    int i = 0;
                    for (; i <= N - width; i += width)
                    {
                        Vector<int> v0 = new Vector<int>(x, i) + new Vector<int>(y, i);
                        v0.CopyTo(r, i);
                    }
                    // Scalar tail in case N is not a multiple of the vector width.
                    for (; i < N; ++i) r[i] = x[i] + y[i];
                }
                //Console.WriteLine(String.Join(" ", x.Take(10)));
                //Console.WriteLine(String.Join(" ", y.Take(10)));
                //Console.WriteLine(String.Join(" ", r.Take(10)));
                Console.WriteLine($"Elapsed {sw.ElapsedMilliseconds,6} ms");
            };
            // SIMD version on 16-bit values: twice as many elements fit in each vector.
            Action<int, int> bench3 = (K, N) =>
            {
                int[] xb = genRandomIntArray2(N);
                short[] x = xb.Select(ix => (short)ix).ToArray();
                int[] yb = genRandomIntArray2(N);
                short[] y = yb.Select(ix => (short)ix).ToArray();
                short[] r = new short[N];
                int width = Vector<short>.Count;
                sw.Restart(); // start timing after setup so only the additions are measured
                for (int k = 0; k < K; ++k)
                {
                    int i = 0;
                    for (; i <= N - width; i += width)
                    {
                        Vector<short> v0 = new Vector<short>(x, i) + new Vector<short>(y, i);
                        v0.CopyTo(r, i);
                    }
                    // Scalar tail in case N is not a multiple of the vector width.
                    for (; i < N; ++i) r[i] = (short)(x[i] + y[i]);
                }
                //Console.WriteLine(String.Join(" ", x.Take(10)));
                //Console.WriteLine(String.Join(" ", y.Take(10)));
                //Console.WriteLine(String.Join(" ", r.Take(10)));
                Console.WriteLine($"Elapsed {sw.ElapsedMilliseconds,6} ms");
            };
            int m0 = rerun;          // how many times each benchmark is repeated
            int m1 = loop;           // passes over the arrays per run
            int m2 = arraySize * 16; // keeps the array length a multiple of 16 elements
            Console.WriteLine("#### Normal Computation");
            for (int i = 0; i < m0; ++i) bench1(m1, m2);
            Console.WriteLine();
            Console.WriteLine("#### SIMD Computation");
            for (int i = 0; i < m0; ++i) bench2(m1, m2);
            Console.WriteLine();
            Console.WriteLine("#### SIMD Computation 2");
            for (int i = 0; i < m0; ++i) bench3(m1, m2);
            Console.WriteLine();
        }
        // Entry point for the benchmark; call this from the application's Main method.
        public static void run_SIMD_AVX_on_CSharp()
        {
            allBench(loop: 10000, arraySize: 500, rerun: 10); // array size will be multiplied by 16
        }
    }
}
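The benchmark only reports elapsed time, so it is worth confirming separately that the vectorized addition produces the same values as the scalar loop. The following is a small sketch of such a check, not part of the original program; the names AddCheck, AddScalar and AddVector are made up, and the array length 37 is chosen deliberately so that it is not a multiple of the vector width.

```csharp
using System;
using System.Linq;
using System.Numerics;

public static class AddCheck
{
    // Reference implementation: one addition per element.
    public static int[] AddScalar(int[] x, int[] y)
    {
        int[] r = new int[x.Length];
        for (int i = 0; i < x.Length; ++i) r[i] = x[i] + y[i];
        return r;
    }

    // Vector<int> implementation with a scalar loop for the leftover elements.
    public static int[] AddVector(int[] x, int[] y)
    {
        int[] r = new int[x.Length];
        int width = Vector<int>.Count;
        int i = 0;
        for (; i <= x.Length - width; i += width)
        {
            (new Vector<int>(x, i) + new Vector<int>(y, i)).CopyTo(r, i);
        }
        for (; i < x.Length; ++i) r[i] = x[i] + y[i];
        return r;
    }

    public static void Main()
    {
        int[] x = Enumerable.Range(0, 37).ToArray();
        int[] y = Enumerable.Range(0, 37).Select(v => 100 - v).ToArray();
        bool same = AddScalar(x, y).SequenceEqual(AddVector(x, y));
        Console.WriteLine($"Results match: {same}");
    }
}
```

A check like this can be run once before timing anything, so that a speedup is never reported for code that computes the wrong result.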
5 Test Results
6 Important Links
http://www.qingpingshan.com/bc/aspnet/334584.html
https://instil.co/2016/03/21/parallelism-on-a-single-core-simd-with-c/
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Software
https://www.nuget.org/packages/System.Numerics.Vectors/