运筹系列61：Julia中使用GPU

程序员文章站 2022-03-06 18:24:46

1. 快速上手源码地址：https://github.com/exanauts/Simplex.jl，使用julia实现。git clone https://github.com/exanauts/Simplex.jl下载源码，然后cd netlib,运行gcc emps.c -o emps运行./get.sh下载例子，如果下载不下来，按照下图修改代码：测试报告使用netlib.jl进行打印：......

https://nextjournal.com/sdanisch/julia-gpu-programming

1. 基础

1.1 安装

参见https://juliagpu.gitlab.io/CUDA.jl/installation/overview/#NVIDIA-driver
还有一种使用docker的方式：docker run --rm -it --gpus=all nvcr.io/hpc/julia:v1.2.0，注意docker要安装GPU版本，julia版本号要更换成最新的。
CUDA库非常方便，里面包含了很多以前要分开调用的库：
运筹系列61：Julia中使用GPU

1.2 性能对比

下面是一个简单的性能对比：

julia> for Typ in (CuArray, Array)
           x = Typ(ones(Float32, 5000000))
           y = Typ(zeros(Float32, 5000000))
           t = @elapsed begin
               for i in 0:100
                   for j in 0:100
                       @sync y .= x .* 3.2
                   end
               end
           end
           if y isa CuArray
               println("GPU time: ", t)
           else
               println("CPU time: ", t)
           end
       end

GPU time: 0.865418836
CPU time: 27.222574322

这段代码是在P100上跑的。在GeForce GTX 960M上要11秒。
再看一个QR分解的例子：

julia> x=randn(10000,10000);qr(x);@time qr(x);
 14.761824 seconds (7 allocations: 768.433 MiB, 0.67% gc time)

julia> x=CUDA.randn(10000,10000);CUDA.qr(x);CUDA.@time CUDA.qr(x);
  0.417191 seconds (79 CPU allocations: 2.016 KiB) (4 GPU allocations: 391.399 MiB, 0.28% gc time of which 98.06% spent allocating)

julia> x=randn(1000,1000);qr(x);@time qr(x);
  0.037463 seconds (7 allocations: 8.179 MiB)
  
julia> x=CUDA.randn(10000,10000);CUDA.qr(x);CUDA.@time CUDA.qr(x);
  0.010269 seconds (78 CPU allocations: 1.969 KiB) (4 GPU allocations: 4.920 MiB, 3.40% gc time of which 93.56% spent allocating)

x=randn(10000,10000);@time x*selectdim(x,1,1);
  0.025847 seconds (5 allocations: 78.297 KiB)

CUDA.@time y*selectdim(y,1,1);
  0.001221 seconds (66 CPU allocations: 2.797 KiB) (1 GPU allocation: 39.062 KiB, 0.47% gc time)

只有当矩阵足够大时，差异才比较明显。

需要注意，GPU是一个独立的硬件，具有自己的内存空间和不同的架构。因此，从RAM到GPU存储器（VRAM）的传输时间很长。即使在GPU上启动内核（换句话说，调度函数调用）也会带来较大的延迟。 GPU的时间约为10us，而CPU的时间则为几纳秒。
此外，GPU默认较低的精度，而较高的精度计算可以轻松地消除所有性能增益。最后，GPU函数(内核)本质上是并行的，所以编写GPU内核至少和编写并行CPU代码一样困难，因此许多算法都不能很好地移植到GPU上。

1.3 基本信息查看

查看信息

julia> CUDA.name(CuDevice(0))
"Tesla P100-PCIE-16GB"

查看是否可用：

__init__() = @assert CUDA.functional(true)

或者

if CUDA.functional()
    to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)
else
    to_gpu_or_not_to_gpu(x::AbstractArray) = x
end

使用CUDA.memory_status()查看GPU显存状态，通过CUDA.reclaim()释放所有显存，使用CUDA.unsafe_free!(a)来明确释放变量显存。

2 CuArray

用于数组和矩阵编程，使用Cu(cpu)可以快速将数据从cpu转移到gpu上。

2.1 声明与定义

julia> CuArray{Int}(undef, 2)
2-element CuArray{Int64,1}:
 0
 0

julia> CuArray{Int}(undef, (1,2))
1×2 CuArray{Int64,2}:
 0  0

julia> similar(ans)
1×2 CuArray{Int64,2}:
 0  0
julia> a = CuArray([1,2])
2-element CuArray{Int64,1}:
 1
 2

julia> b = Array(a)
2-element Array{Int64,1}:
 1
 2

julia> copyto!(b, a)
2-element Array{Int64,1}:
 1
 2

julia> CUDA.rand(2)
2-element CuArray{Float32,1}:
 0.74021935
 0.9209938

julia> CUDA.randn(Float64, 2, 1)
2×1 CuArray{Float64,2}:
 -0.3893830994647195
  1.618410515635752

2.2 操作符

可以使用原子操作符，以及map/reduce操作符，从1.5版本开始加上了accumulate。注意没有for循环操作。

julia> a = CuArray{Float32}(undef, (1,2));

julia> a .= 5
1×2 CuArray{Float32,2}:
 5.0  5.0

julia> map(sin, a)
1×2 CuArray{Float32,2}:
 -0.958924  -0.958924
 
julia> a = CUDA.ones(2,3)
2×3 CuArray{Float32,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0

julia> reduce(+, a)
6.0f0

# 先用sin做map，再用*做reduce
julia> mapreduce(sin, *, a; dims=2)
2×1 CuArray{Float32,2}:
 0.59582335
 0.59582335

julia> b = CUDA.zeros(1)
1-element CuArray{Float32,1}:
 0.0

julia> Base.mapreducedim!(identity, +, b, a)
1×1 CuArray{Float32,2}:
 6.0

# accumulate会保留中间结果
julia> accumulate(+, a; dims=2)
2×3 CuArray{Float32,2}:
 1.0  2.0  3.0
 1.0  2.0  3.0

使用逻辑符可以进行元素筛选：

julia> a = CuArray([1,2,3])
3-element CuArray{Int64,1}:
 11
 12
 13

julia> a[[false,true,false]]
1-element CuArray{Int64,1}:
 12
julia> findall(isodd, a)
2-element CuArray{Int64,1}:
 1
 3

julia> findfirst(isodd, a)
1

julia> b = CuArray([11 12 13; 21 22 23])
2×3 CuArray{Int64,2}:
 11  12  13
 21  22  23

julia> findmin(b)
(11, CartesianIndex(1, 1))

julia> findmax(b; dims=2)
([13; 23], CartesianIndex{2}[CartesianIndex(1, 3); CartesianIndex(2, 3)])

2.3 数组函数

julia> a = CuArray(collect(1:6))
6-element CuArray{Int64,1}:
 1
 2
 3
 4
 5
 6

julia> b = reshape(a, (2,3))
2×3 CuArray{Int64,2}:
 1  3  5
 2  4  6

julia> c = view(a, 2:5)
4-element CuArray{Int64,1}:
 2
 3
 4
 5
julia> x = CUDA.rand(2)
2-element CuArray{Float32,1}:
 0.74021935
 0.9209938

julia> y = CUDA.rand(2)
2-element CuArray{Float32,1}:
 0.03902049
 0.9689629

julia> CUBLAS.dot(2, x, 0, y, 0)
0.057767443f0

2.4 线性规划

julia> using LinearAlgebra

julia> dot(Array(x), Array(y))
0.92129254f0

julia> a \ b
2×2 CuArray{Float32,2}:
  1.29018    0.942772
 -0.765663  -0.782648

julia> Array(a) \ Array(b)
2×2 Array{Float32,2}:
  1.29018    0.942773
 -0.765663  -0.782648

GPU还没有CPU速度快……

2.5 稀疏矩阵

将cpu的sparse array转移到cuda上即可。二维的矩阵可以用CuSparseMatrixCSC和CuSparseMatrixCSR。

julia> using SparseArrays

julia> x = sprand(10,0.2)
10-element SparseVector{Float64,Int64} with 4 stored entries:
  [3 ]  =  0.585812
  [4 ]  =  0.539289
  [7 ]  =  0.260036
  [8 ]  =  0.910047

julia> using CUDA.CUSPARSE

julia> d_x = CuSparseVector(x)
10-element CuSparseVector{Float64} with 4 stored entries:
  [3 ]  =  0.585812
  [4 ]  =  0.539289
  [7 ]  =  0.260036
  [8 ]  =  0.910047
  
julia> nonzeros(d_x)
4-element CuArray{Float64,1}:
 0.5858115517433242
 0.5392892841426182
 0.26003585026904785
 0.910046541351011

julia> nnz(d_x)
4

2.6 批操作

注意这里使用了enumerate+CuIterator函数

julia> batches = [([1], [2]), ([3], [4])]

julia> for (batch, (a,b)) in enumerate(CuIterator(batches))
         println("Batch $batch: ", a .+ b)
       end
Batch 1: [3]
Batch 2: [7]

3 使用@cuda宏

一般的做法是：
1）定义一个kernel function，传入数组
2）使用threadIdx().x寻找到下标
3）直接修改线程中的值
4）使用@cuda threads = num function()进行调用
先看一个简单的例子：

julia> a = CuArray([1., 2., 3.])
julia> function apply(op, a)
        i = threadIdx().x
        a[i] = op(a[i])
        return
      end
julia> @cuda threads=length(a) apply(x->x^2, a)
julia> a
3-element CuArray{Float32,1}:
1.0
4.0
9.0

a = CUDA.zeros(1024)

function kernel(a)
    i = threadIdx().x
    a[i] += 1
    return
end

@cuda threads=length(a) kernel(a)

2.4 注意事项

1）不要使用标量循环

julia> function diff_y(a, b)
   s = size(a)
   for j = 1:s[2]
       for i = 1:s[1]
           @inbounds a[i,j] = b[i,j+1] - b[i,j]
       end
   end
end
N = 64
nx = N^2
ny = N
a = ones(Float32,nx,ny-1)
b = ones(Float32,nx,ny)
julia> using BenchmarkTools

julia> @btime diff_y(a,b);
 39.599 μs (0 allocations: 0 bytes)

julia> @btime diff_y((CuArray(a)),(CuArray(b)));
 4.499 s (3354624 allocations: 165.38 MiB)
 
julia> function diff_y(a, b)
   s = size(a)
   for j = 1:s[2]
       @inbounds a[:,j] .= b[:,j+1] - b[:,j]
   end
end

julia> @btime diff_y((CuArray(a)),(CuArray(b)));
 2.503 ms (16884 allocations: 661.50 KiB)

2）不要使用多核

function diff_y(a, b)
   a .= @views b[:, 2:end] .- b[:, 1:end-1]
end

@btime diff_y((CuArray(a)),(CuArray(b)));
 39.057 μs (40 allocations: 2.08 KiB)

将数据保留在GPU上

3. GPU的应用

3.1 深度学习

首先测试一个简单的高斯模糊

运筹系列61：Julia中使用GPU

本文地址：https://blog.csdn.net/kittyzc/article/details/109069777