Cuda 1 - Basics on GPU, CUDA, Memory Model

UDACITY教程 Intro to Parallel Programming

Basics on GPU, CUDA, Memory Model

Parallel Algorithms(Reduce, Scan, Histogram, Sort)

Optimize Parallel GPU Programs

Others(Library, OpenACC, Dynamic parallelism)

1. GPU Architecture

gpu-sm-block

Hardware -> SM: Streaming Multiprocessor, 高度线程化的多核流处理器

Software -> Block: Could run group of threads cooperate to work

One SM –> Multi-Block; Threads in different blocks should not cooperate(even in a same SM)

GPU Device Query

2. 3-Ways to Accelerate Applications

Libraries, OpenACC Directives, Programming Languages

3. Cuda Kernel

host(cpu): h_var;  device(gpu): d_var  // 约定写法，h/d开头，区分memory
square<<<1, 64>>> (d_out, d_in);  // 尖括号内blocks, threads(512 or 1024 at most)
<<<dim3(bx,by,bz), dim3(tx,ty,tz),shared_mem>>>  // 拓展写法

hello world example

4. Problem 1

Convert color image to gray: solution

5. Parallel Communication Patterns

gpu-communication-pattern

stencil patterns: data reuse, 从特定位置邻居获取data

transpose: reorder data elements in array: array of structures(AOS), structure of arrays(SOA)

out[i + j*128] = in[j + i*128] ==> transpose operation

6. Memory Model

Local > Shared >> Global Memory

gpu-memory-model

7. Synchronize & Mutex

同步：__syncthreads()

互斥：atomicAdd() example

8. Problem 2

Image blur: solution

__global__ void blur_kernel(const cv::cuda::PtrStepSz<uchar3> src,
                            cv::cuda::PtrStepSz<uchar3> dst,
                            const float* const filter, const int filterWidth) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= src.cols || y >= src.rows) return;

  vector<float> result(3, 0);
  for (int i = 0; i < 3; i++)
    for (int filter_r = -filterWidth / 2; filter_r <= filterWidth / 2;
         ++filter_r) {
      for (int filter_c = -filterWidth / 2; filter_c <= filterWidth / 2;
           ++filter_c) {
        int image_y = filter_r + y, image_x = filter_c + x;
        if (image_x < 0 || image_x >= src.cols || image_y < 0 ||
            image_y >= src.rows) {
          continue;
        }
        uchar3 v = src(image_y, image_x);
        float filter_value = filter[(filter_r + filterWidth / 2) * filterWidth +
                                    filter_c + filterWidth / 2];
        result[i] += v[i] * filter_value;
      }
    }
  dst(y, x) = uchar3(result);
}

void balance_white_gpu(cv::cuda::GpuMat& src) {
  const int m = 32;
  int numRows = src.rows, numCols = src.cols;
  if (numRows == 0 || numCols == 0) return;
  const dim3 gridSize(ceil((float)numCols / m), ceil((float)numRows / m), 1);
  const dim3 blockSize(m, m, 1);
  cv::cuda::GpuMat dst;
  blur_kernel << <gridSize, blockSize>>> (src, dst, filter, filterWidth);
}