Accelerating large-scale chi-square tests for interaction detection with numba
https://numba.pydata.org/numba-doc/latest/cuda
https://zhuanlan.zhihu.com/p/68846159
For now, the presumed four core steps are: transferring data from host memory to the GPU, tallying matrices on the GPU, performing matrix computations on the GPU, and running parallel computations on the GPU.
1. Data transfer
To copy a NumPy array host->device:
ary = np.arange(10)
d_ary = cuda.to_device(ary)
To enqueue the transfer on a stream (asynchronous copy):
stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)
To copy device->host into a new array:
hary = d_ary.copy_to_host()
To copy device->host into an existing array:
ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)
2. Matrix tallying and computation
3. A simple implementation:
Below is a simple matrix-multiplication kernel written with CUDA; the basic operations from the numpy package are all compatible inside kernels:
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
4. Parallel computation:
Use the @njit decorator together with the prange function.
The example below demonstrates a parallel loop with a reduction (A
is a one-dimensional Numpy array):
from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit decorator,
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s