C++ AMP Basics
C++ AMP (Accelerated Massive Parallelism) is a parallel library together with a small language extension that brings heterogeneous computing to C++ applications. Visual Studio 2012 and later provide tooling for debugging and profiling C++ AMP applications, including GPU debugging and GPU parallel visualization. For applications that fit data-parallel computation, the speedup can be substantial.
Microsoft's documentation for C++ AMP includes a small getting-started example, "C++ AMP Overview":
https://msdn.microsoft.com/en-us/library/hh265137.aspx
A slide deck introducing basic AMP syntax and how to debug GPU code in Visual Studio:
http://www.gregcons.com/KateBlog/content/binary/GregoryCppAMP.pdf
Searching Google for "c++ amp accelerated massive parallelism with microsoft visual c++ pdf" turns up a downloadable e-book.
A Chinese translation of the book is available:
https://download.csdn.net/download/qq_18521747/8906191
Book citation: Kate Gregory, Ade Miller. C++ AMP: 用Visual C++加速大规模并行计算 [M]. 人民邮电出版社, 2014.
The idea is to take loops whose iterations are independent, which would normally run on the CPU, and offload them to the GPU. The GPU assigns one thread to each element of the computation; for c[1000] = a[1000] + b[1000], for example, it launches 1000 threads that compute in parallel. AMP supports one-, two-, and three-dimensional array operations and a library of basic math functions, and can even perform FFTs. (I have not yet gotten the AMP FFT library to work, so I did not use it.)
For example, you might want to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}.
Without C++ AMP, the usual approach is to loop over the arrays and compute each element:
#include <iostream>
void StandardMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[5];

    for (int idx = 0; idx < 5; idx++)
    {
        sumCPP[idx] = aCPP[idx] + bCPP[idx];
    }

    for (int idx = 0; idx < 5; idx++)
    {
        std::cout << sumCPP[idx] << "\n";
    }
}
Using C++ AMP, you might write the following code instead:
#include <amp.h>
#include <iostream>
using namespace concurrency;
const int size = 5;
void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}
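As noted above, AMP also handles two- and three-dimensional data and provides basic math functions. Below is a minimal 2D sketch (not from the book): the matrix size and the fast_math::sqrtf call are only illustrative, chosen to show extent<2>, index<2>, and the math library together.

#include <amp.h>
#include <amp_math.h>
#include <vector>
using namespace concurrency;

// Element-wise sqrt(a + b) over a 4 x 3 matrix using a 2D compute domain.
void AddAndSqrt2D() {
    const int rows = 4, cols = 3;
    std::vector<float> aData(rows * cols, 2.0f);
    std::vector<float> bData(rows * cols, 7.0f);
    std::vector<float> result(rows * cols);

    array_view<const float, 2> a(rows, cols, aData);
    array_view<const float, 2> b(rows, cols, bData);
    array_view<float, 2> r(rows, cols, result);
    r.discard_data();

    parallel_for_each(r.extent, [=](index<2> idx) restrict(amp)
    {
        r[idx] = fast_math::sqrtf(a[idx] + b[idx]); // each thread handles one element
    });
    r.synchronize(); // copy the results back into 'result'
}

Here the compute domain r.extent is a 4 x 3 extent<2>, so twelve threads are created, one per element.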
The notes below are excerpted from the book: Kate Gregory, Ade Miller. C++ AMP: 用Visual C++加速大规模并行计算 [M]. 人民邮电出版社, 2014.
Chapter 3: C++ AMP Basics
array<T,N>
accelerator 与 accelerator_view
index<N>
extent<N>
array_view<T,N>
parallel_for_each
restrict(amp)
All of the above are class templates and identifiers; the chapter also covers copying data between the CPU and GPU and the math library functions.
3.1 array<T,N>
The array template lives in the concurrency namespace and takes two parameters, T and N. T is the element type of the collection; N is a positive integer giving the number of dimensions (the rank), usually 1, 2, or 3.
An array holds a collection of elements of the same type, a matrix, on the GPU. An array lives on one view of an accelerator (GPU), an accelerator_view. Every accelerator has at least one such view, and every accelerator has its own default view.
array<int,1> a(5); // declares a one-dimensional int array with 5 elements
Constructing an array also allocates the corresponding storage.
array<float,2> b(4,2);
array<int,3> c(4,3,2);
The three arrays declared above contain no values; the constructors only create empty arrays. We can write elements into an array afterwards, or copy elements in while creating it:
std::vector<int> v(5);
array<int,1> a(5,v.begin(),v.end());
The memory layout of an array is fixed: all of its elements are stored in order in one contiguous block of memory (GPU video memory).
Retrieving data from an array (i.e. copying from GPU video memory back to CPU memory):
copy(a, v.begin()); // copy the array's data from GPU memory into the vector v in CPU memory
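Putting the pieces together, here is a minimal round-trip sketch (the function name SquareOnGpu is just for illustration): copy a vector into an array, square each element on the accelerator, and copy the result back.

#include <amp.h>
#include <vector>
using namespace concurrency;

// Copy host data into an array, compute on the accelerator,
// then copy the results back to host memory.
void SquareOnGpu(std::vector<int>& v) {
    const int n = static_cast<int>(v.size());
    array<int, 1> a(n, v.begin(), v.end()); // allocate on the accelerator and copy v in
    parallel_for_each(a.extent, [&a](index<1> idx) restrict(amp)
    {
        a[idx] = a[idx] * a[idx];
    });
    copy(a, v.begin()); // copy the results back into v
}

Note that an array, unlike an array_view, is captured by reference in the restrict(amp) lambda.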
An array is bound to a particular view of a particular accelerator. If the system has only one accelerator, that accelerator is used. If several accelerators are installed, you can direct code to run on a specific one: declare an accelerator_view av and pass it to the constructor to choose the accelerator on which the array is created:
array<float,1> m(n,v.begin(),av);
3.2 accelerator and accelerator_view
The accelerator class lives in the concurrency namespace. It can represent not only a GPU but also a virtual accelerator.
An accelerator's memory can hold one or more arrays, and computations can be run on those arrays, optimized for data-parallel operations.
The static function accelerator::get_all() returns a vector of the accelerators available at runtime, so we can pick different code paths depending on how the target machine is configured.
For example, we can inspect accelerator properties to tell whether a device is an emulator or a real GPU, and query its capabilities, such as whether it supports double-precision computation.
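As a minimal sketch (not from the book), you might pick the first real (non-emulated) accelerator that supports double precision and make it the process-wide default:

#include <amp.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

// Choose the first non-emulated accelerator that supports double precision.
void PickDoublePrecisionAccelerator() {
    std::vector<accelerator> accls = accelerator::get_all();
    auto it = std::find_if(accls.begin(), accls.end(), [](const accelerator& a)
    {
        return !a.is_emulated && a.supports_double_precision;
    });
    if (it != accls.end())
    {
        accelerator::set_default(it->device_path); // must be called before any AMP work runs
    }
}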
The default constructor selects the best accelerator available at runtime.
An accelerator is usually a physical device, and one device can have several logical views, which are isolated from one another. An accelerator is a compute unit that isolates resources and execution contexts: multiple threads can share a single view, or several separate views can be used on the same accelerator to avoid problems with shared state.
Every accelerator has a default view.
accelerator device(accelerator::default_accelerator);
accelerator_view av=device.default_view;
array<float,1> C(n,av);
The three lines above are equivalent to this single line:
array<float,1> C(n);
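To get the isolated views mentioned above, you can create additional views on the same accelerator. A minimal sketch (queuing_mode_immediate is only one of the options):

#include <amp.h>
using namespace concurrency;

// Create extra, isolated views on the same physical accelerator.
void IsolatedViews() {
    const int n = 1024;
    accelerator device(accelerator::default_accelerator);
    accelerator_view av1 = device.default_view;                        // the shared default view
    accelerator_view av2 = device.create_view();                       // a second, isolated view
    accelerator_view av3 = device.create_view(queuing_mode_immediate); // commands are submitted immediately
    array<float, 1> data(n, av2);                                      // this array is bound to av2
}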
The following program prints the configuration of the GPU accelerators on the local machine:
//===============================================================================
//
// Microsoft Press
// C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++
//
//===============================================================================
// Copyright (c) 2012-2013 Ade Miller & Kate Gregory. All rights reserved.
// This code released under the terms of the
// Microsoft Public License (Ms-PL), http://ampbook.codeplex.com/license.
//
// THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF
// ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO
// THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
// PARTICULAR PURPOSE.
//===============================================================================
#include <tchar.h>
#include <SDKDDKVer.h>
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>   // std::remove_if, std::for_each
#include <string>      // std::wstring
#include <cstdlib>     // system()
#include <amp.h>
using namespace concurrency;
// Note: This code is somewhat different from the code described in the book. It produces a more detailed
// output and accepts a /a switch that will show the REF and CPU accelerators. If you want the original
// output, as shown on page 22, then the /o switch will produce that.
int _tmain(int argc, _TCHAR* argv[])
{
    bool show_all = false;
    bool old_format = false;
    if (argc > 1)
    {
        if (std::wstring(argv[1]).compare(L"/a") == 0)
        {
            show_all = true;
        }
        if (std::wstring(argv[1]).compare(L"/o") == 0)
        {
            show_all = false;
            old_format = true;
        }
    }

    std::vector<accelerator> accls = accelerator::get_all();
    if (!show_all)
    {
        accls.erase(std::remove_if(accls.begin(), accls.end(), [](accelerator& a)
        {
            return (a.device_path == accelerator::cpu_accelerator) || (a.device_path == accelerator::direct3d_ref);
        }), accls.end());
    }

    if (accls.empty())
    {
        std::wcout << "No accelerators found that are compatible with C++ AMP" << std::endl << std::endl;
        return 0;
    }

    std::cout << "Show " << (show_all ? "all " : "") << "AMP Devices (";
#if defined(_DEBUG)
    std::cout << "DEBUG";
#else
    std::cout << "RELEASE";
#endif
    std::cout << " build)" << std::endl;

    std::wcout << "Found " << accls.size()
               << " accelerator device(s) that are compatible with C++ AMP:" << std::endl;

    int n = 0;
    if (old_format)
    {
        std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
        {
            std::wcout << " " << ++n << ": " << a.description
                       << ", has_display=" << (a.has_display ? "true" : "false")
                       << ", is_emulated=" << (a.is_emulated ? "true" : "false")
                       << std::endl;
        });
        std::wcout << std::endl;
        return 1;
    }

    // Note: accelerator::dedicated_memory is reported in kilobytes, so the value printed
    // below (divided by 1024 * 1024) is actually in GB despite the "Mb" label.
    std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
    {
        std::wcout << " " << ++n << ": " << a.description << " "
                   << std::endl << " device_path = " << a.device_path
                   << std::endl << " dedicated_memory = " << std::setprecision(4) << float(a.dedicated_memory) / (1024.0f * 1024.0f) << " Mb"
                   << std::endl << " has_display = " << (a.has_display ? "true" : "false")
                   << std::endl << " is_debug = " << (a.is_debug ? "true" : "false")
                   << std::endl << " is_emulated = " << (a.is_emulated ? "true" : "false")
                   << std::endl << " supports_double_precision = " << (a.supports_double_precision ? "true" : "false")
                   << std::endl << " supports_limited_double_precision = " << (a.supports_limited_double_precision ? "true" : "false")
                   << std::endl;
    });
    std::wcout << std::endl;

    system("pause");
    return 1;
}
/*
Show AMP Devices (DEBUG build)
Found 3 accelerator device(s) that are compatible with C++ AMP:
1: Intel(R) HD Graphics 4600
device_path = PCI\VEN_8086&DEV_0416&SUBSYS_380117AA&REV_06\3&11583659&0&10
dedicated_memory = 0.1099 Mb
has_display = true
is_debug = true
is_emulated = false
supports_double_precision = true
supports_limited_double_precision = true
2: AMD Radeon HD 8570M
device_path = PCI\VEN_1002&DEV_6663&SUBSYS_380117AA&REV_00\4&57D6125&0&0008
dedicated_memory = 1.988 Mb
has_display = false
is_debug = true
is_emulated = false
supports_double_precision = true
supports_limited_double_precision = true
3: Microsoft Basic Render Driver
device_path = direct3d\warp
dedicated_memory = 0 Mb
has_display = false
is_debug = true
is_emulated = true
supports_double_precision = true
supports_limited_double_precision = true
*/