Accelerating large-scale chi-square tests for interaction detection with numba
https://numba.pydata.org/numba-doc/latest/cuda
https://zhuanlan.zhihu.com/p/68846159
For now, the presumed four core steps are: transferring data from host memory to the GPU, tallying matrices on the GPU, performing matrix computations on the GPU, and running parallel computations on the GPU.
1. Data transfer
To copy a NumPy array host->device:
ary = np.arange(10)
d_ary = cuda.to_device(ary)
To enqueue the transfer on a stream (asynchronous copy):
stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)
To copy device->host into a new array:
hary = d_ary.copy_to_host()
To copy device->host into an existing array:
ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)
2. Matrix tallying and computation
3. A simple implementation:
Below is a simple matrix-multiplication kernel written with CUDA; the basic operations from the numpy package are all compatible inside kernels:
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
4. Parallel computation:
Use the @njit decorator together with the prange function.
The example below demonstrates a parallel loop with a reduction (A
is a one-dimensional Numpy array):
from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit decorator,
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s