
C++ AMP Basics

程序员文章站 2022-07-04 12:02:23
...

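C++ AMP (Accelerated Massive Parallelism) is a parallel library plus a small language extension that helps C++ applications do heterogeneous computing. Visual Studio 2012 and later provide new tools and features for debugging and profiling C++ AMP applications, including GPU debugging and GPU concurrency visualization. Applications that suit data-parallel computation can see significant speedups.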

Microsoft's official introduction to C++ AMP, which includes small getting-started examples: C++ AMP Overview

https://msdn.microsoft.com/en-us/library/hh265137.aspx

Slides introducing basic AMP syntax and how to debug GPU code in Visual Studio:

http://www.gregcons.com/KateBlog/content/binary/GregoryCppAMP.pdf

Searching Google for "c++ amp accelerated massive parallelism with microsoft visual c++ pdf" turns up a downloadable copy of the e-book.

A Chinese translation of the book is also available:

https://download.csdn.net/download/qq_18521747/8906191

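Citation: Kate Gregory, Ade Miller. C++ AMP:用Visual C++加速大规模并行计算 (C++ AMP: Accelerating Massive Parallel Computation with Visual C++) [M]. Posts & Telecom Press (人民邮电出版社), 2014.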

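The idea is to take a loop of independent computations that would normally run on the CPU and accelerate it on the GPU, which assigns one thread to each unit of work. For example, to compute c[i] = a[i] + b[i] for 1000 elements, the GPU launches 1000 threads that compute in parallel. AMP supports one-, two-, and three-dimensional array computations, basic math library functions, and even FFT. (I have not yet gotten the AMP FFT library working, so I did not use it.)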


For example, you might want to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}.

Without C++ AMP, you would typically loop over the arrays and compute each value:




#include <iostream>  
  
void StandardMethod() {  
  
    int aCPP[] = {1, 2, 3, 4, 5};  
    int bCPP[] = {6, 7, 8, 9, 10};  
    int sumCPP[5];  
  
    for (int idx = 0; idx < 5; idx++)  
    {  
        sumCPP[idx] = aCPP[idx] + bCPP[idx];  
    }  
  
    for (int idx = 0; idx < 5; idx++)  
    {  
        std::cout << sumCPP[idx] << "\n";  
    }  
}

int main() {
    StandardMethod();
}

 

Using C++ AMP, you might write the following code instead:

#include <amp.h>  
#include <iostream>  
using namespace concurrency;  
  
const int size = 5;  
  
void CppAmpMethod() {  
    int aCPP[] = {1, 2, 3, 4, 5};  
    int bCPP[] = {6, 7, 8, 9, 10};  
    int sumCPP[size];  
  
    // Create C++ AMP objects.  
    array_view<const int, 1> a(size, aCPP);  
    array_view<const int, 1> b(size, bCPP);  
    array_view<int, 1> sum(size, sumCPP);  
    sum.discard_data();  
  
    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );

    // Print the results. The expected output is 7, 9, 11, 13, 15 (one per line).
    // Reading sum on the CPU implicitly copies the results back from the accelerator.
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}

int main() {
    CppAmpMethod();
}

The notes below are excerpted from the book: Kate Gregory, Ade Miller. C++ AMP:用Visual C++加速大规模并行计算 [M]. Posts & Telecom Press (人民邮电出版社), 2014.

Chapter 3: C++ AMP Basics

  1. array<T,N>
  2. accelerator and accelerator_view
  3. index<N>
  4. extent<N>
  5. array_view<T,N>
  6. parallel_for_each
  7. restrict(amp)

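Items 1 to 5 are class templates; parallel_for_each is a function, and restrict(amp) is a restriction specifier.

Copying data between the CPU and the GPU

Math library functions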

3.1 array<T,N>

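The array template lives in the concurrency namespace and takes two parameters, T and N. T is the element type of the collection; N is a positive integer giving the number of dimensions, or rank, usually 1, 2, or 3.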

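An array is a collection (a matrix) of elements of the same type that resides on the GPU. Each array lives on one view of an accelerator (GPU), an accelerator_view. Every accelerator has at least one such view, its default view.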

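array<int,1> a(5); // declares a one-dimensional int array of 5 elements

Constructing an array also allocates the corresponding storage on the accelerator.

array<float,2> b(4,2); // two-dimensional float array with 4 x 2 elements

array<int,3> c(4,3,2); // three-dimensional int array with 4 x 3 x 2 elements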

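The three arrays declared above hold no meaningful values yet: these constructors only allocate storage. We can write elements into an array afterwards, or copy elements in when the array is created: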

std::vector<int> v(5);

array<int,1> a(5,v.begin(),v.end());

The memory layout of an array is fixed: all elements are stored in order in one contiguous block of memory (GPU video memory).
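C++ AMP lays out multi-dimensional arrays in row-major order, meaning the rightmost index varies fastest. The CPU-only sketch below is not C++ AMP code. It only illustrates how a rank-N extent such as the (4,3,2) of `array<int,3> c` above determines the element count and maps each index to a position in one contiguous block. `element_count` and `linear_offset` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// CPU-only illustration (not C++ AMP code): row-major flattening of a
// rank-N index into a single contiguous block, last index fastest.
// element_count and linear_offset are hypothetical helpers.

// Total number of elements = product of all dimension lengths.
template <std::size_t N>
std::size_t element_count(const std::array<std::size_t, N>& extent) {
    std::size_t count = 1;
    for (std::size_t dim : extent) {
        count *= dim;
    }
    return count;
}

// Position of idx inside the contiguous block, computed Horner-style:
// offset = ((i0 * e1 + i1) * e2 + i2) ... for a row-major layout.
template <std::size_t N>
std::size_t linear_offset(const std::array<std::size_t, N>& extent,
                          const std::array<std::size_t, N>& idx) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < N; ++d) {
        offset = offset * extent[d] + idx[d];
    }
    return offset;
}
```

For `c(4,3,2)`, `element_count<3>({4, 3, 2})` returns 24 and `linear_offset<3>({4, 3, 2}, {3, 2, 1})` returns 23, the last slot. The rank is passed explicitly because a braced list alone cannot deduce the `std::array` size.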

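Retrieving data from an array (that is, copying it from GPU memory back to CPU memory):

copy(a, v.begin()); // copies the array's data from GPU memory into vector v in CPU memory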

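An array is bound to a specific view of a specific accelerator. If the system has only one accelerator, the array is placed on that accelerator's default view.

If the system has several accelerators, you can direct code to run on a particular one.

Pass an accelerator_view av to choose the accelerator on which the array is constructed: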

array<float,1> m(n,v.begin(),av);


3.2 accelerator and accelerator_view

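The accelerator class lives in the concurrency namespace. An accelerator can represent not only a GPU but also a virtual (emulated) accelerator.

An accelerator's memory can hold one or more arrays, and data-parallel computations can be run efficiently on those arrays.


accelerator::get_all() returns a std::vector of the accelerators available at run time, so a program can choose a different code path depending on the configuration of the machine it runs on.

For example, we can inspect an accelerator's properties to see whether it is an emulator or a real GPU, and query its capabilities, such as whether it supports double-precision arithmetic.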
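The selection logic described above can be sketched in portable C++. This is a sketch under stated assumptions, not C++ AMP code: `DeviceInfo` and `pick_device` are hypothetical names, and in a real AMP program the fields would be read from each `concurrency::accelerator` returned by `accelerator::get_all()` (its `description`, `is_emulated`, `supports_double_precision`, and `dedicated_memory`, which is reported in KB). Preferring the device with the most dedicated memory is an illustrative heuristic, not the runtime's own rule.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical plain struct mirroring a few properties that
// concurrency::accelerator exposes.
struct DeviceInfo {
    std::wstring description;
    bool is_emulated;                // true for WARP, REF, and similar emulators
    bool supports_double_precision;
    std::size_t dedicated_memory_kb; // accelerator::dedicated_memory is in KB
};

// Pick the non-emulated device with the most dedicated memory,
// optionally requiring double-precision support. Returns an index into
// `devices`, or std::nullopt if nothing qualifies (the caller could then
// fall back to a CPU code path).
std::optional<std::size_t> pick_device(const std::vector<DeviceInfo>& devices,
                                       bool need_double) {
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const DeviceInfo& d = devices[i];
        if (d.is_emulated) continue;                             // skip emulators
        if (need_double && !d.supports_double_precision) continue;
        if (!best || d.dedicated_memory_kb > devices[*best].dedicated_memory_kb) {
            best = i;
        }
    }
    return best;
}
```

Returning an index rather than a copy keeps the sketch close to how one would index into the vector from `get_all()`; an empty result signals that only emulated devices are present.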


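The default constructor, accelerator(), gives you the accelerator that the runtime selects as the best one.

An accelerator is usually a physical device, and such a device can expose several logical views that are isolated from one another. Each accelerator_view is an isolated resource and execution context on its accelerator. Multiple threads can share a single view, or you can use several separate views on the same accelerator to avoid sharing state between them.

Every accelerator has a default view.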


accelerator device(accelerator::default_accelerator);

accelerator_view av=device.default_view;

array<float,1> C(n,av);


These three lines are equivalent to the single line below:

array<float,1> C(n);

The following program prints configuration information about the GPU accelerators on the local machine:


//===============================================================================
//
// Microsoft Press
// C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++
//
//===============================================================================
// Copyright (c) 2012-2013 Ade Miller & Kate Gregory.  All rights reserved.
// This code released under the terms of the 
// Microsoft Public License (Ms-PL), http://ampbook.codeplex.com/license.
//
// THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF
// ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO
// THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
// PARTICULAR PURPOSE.
//===============================================================================

#include <tchar.h>
#include <SDKDDKVer.h>
#include <algorithm>   // std::remove_if, std::for_each
#include <cstdlib>     // system
#include <iostream>
#include <iomanip>
#include <string>      // std::wstring
#include <vector>
#include <amp.h>

using namespace concurrency;

// Note: This code is somewhat different from the code described in the book. It produces a more detailed
// output and accepts a /a switch that will show the REF and CPU accelerators. If you want the original 
// output, as shown on page 22, then the /o switch will produce that.

int _tmain(int argc, _TCHAR* argv[])
{
    bool show_all = false;
    bool old_format = false;
    if (argc > 1) 
    {
        if (std::wstring(argv[1]).compare(L"/a") == 0)
        {
            show_all = true;
        }
        if (std::wstring(argv[1]).compare(L"/o") == 0)
        {
            show_all = false;
            old_format = true;
        }
    }

    // Enumerate every accelerator visible to the C++ AMP runtime.
    std::vector<accelerator> accls = accelerator::get_all();
    if (!show_all)
    {
        // Unless /a was given, hide the CPU accelerator and the REF (reference rasterizer) emulator.
        accls.erase(std::remove_if(accls.begin(), accls.end(), [](accelerator& a) 
        { 
            return (a.device_path == accelerator::cpu_accelerator) || (a.device_path == accelerator::direct3d_ref); 
        }), accls.end());
    }

    if (accls.empty())
    {
        std::wcout << "No accelerators found that are compatible with C++ AMP" << std::endl << std::endl;
        return 0;
    }
    std::cout << "Show " << (show_all ? "all " : "") << "AMP Devices (";
#if defined(_DEBUG)
    std::cout << "DEBUG";
#else
    std::cout << "RELEASE";
#endif
    std::cout <<  " build)" << std::endl;
    std::wcout << "Found " << accls.size() 
        << " accelerator device(s) that are compatible with C++ AMP:" << std::endl;
    int n = 0;
    if (old_format)
    {
        std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
        {
            std::wcout << "  " << ++n << ": " << a.description 
                << ", has_display=" << (a.has_display ? "true" : "false") 
                << ", is_emulated=" << (a.is_emulated ? "true" : "false")
                << std::endl;
        });
        std::wcout << std::endl;
        return 0;
    }

    std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
    {
        std::wcout << "  " << ++n << ": " << a.description << " "  
            << std::endl << "       device_path                       = " << a.device_path
            // dedicated_memory is reported in KB, so dividing by 1024 * 1024 gives GB.
            << std::endl << "       dedicated_memory                  = " << std::setprecision(4) << float(a.dedicated_memory) / (1024.0f * 1024.0f) << " GB"
            << std::endl << "       has_display                       = " << (a.has_display ? "true" : "false") 
            << std::endl << "       is_debug                          = " << (a.is_debug ? "true" : "false") 
            << std::endl << "       is_emulated                       = " << (a.is_emulated ? "true" : "false") 
            << std::endl << "       supports_double_precision         = " << (a.supports_double_precision ? "true" : "false") 
            << std::endl << "       supports_limited_double_precision = " << (a.supports_limited_double_precision ? "true" : "false") 
            << std::endl;
    });
    std::wcout << std::endl;
    system("pause");
    return 0;
}

/*
Show AMP Devices (DEBUG build)
Found 3 accelerator device(s) that are compatible with C++ AMP:
  1: Intel(R) HD Graphics 4600
       device_path                       = PCI\VEN_8086&DEV_0416&SUBSYS_380117AA&REV_06\3&11583659&0&10
       dedicated_memory                  = 0.1099 GB
       has_display                       = true
       is_debug                          = true
       is_emulated                       = false
       supports_double_precision         = true
       supports_limited_double_precision = true
  2: AMD Radeon HD 8570M
       device_path                       = PCI\VEN_1002&DEV_6663&SUBSYS_380117AA&REV_00\4&57D6125&0&0008
       dedicated_memory                  = 1.988 GB
       has_display                       = false
       is_debug                          = true
       is_emulated                       = false
       supports_double_precision         = true
       supports_limited_double_precision = true
  3: Microsoft Basic Render Driver
       device_path                       = direct3d\warp
       dedicated_memory                  = 0 GB
       has_display                       = false
       is_debug                          = true
       is_emulated                       = true
       supports_double_precision         = true
       supports_limited_double_precision = true
*/







