Cuda 3 - Optimize GPU Programs

UDACITY教程 Intro to Parallel Programming

Basics on GPU, CUDA, Memory Model

Parallel Algorithms(Reduce, Scan, Histogram, Sort)

Optimize Parallel GPU Programs

Others(Library, OpenACC, Dynamic parallelism)

1. APOD

apod

2. HotsPots

Amdahl’s Law:

max speedup = 1 / (1-p) (p is portion can be parallelized)

y: the total speedup, x: the portion can be speedup

y = 1 / (1-p + p/x) ==> y = x / ((1-p)x + p)

if x is much bigger than p, y = 1 / (1-p)

3. Occupancy

gpu-occupancy

Optimize Example: Transpose

gpu-tile

4. WARP-Avoid Thread Divergence

Set of threads that execute the same instructions at a time

Nvidia Card is 32, so:

switch(threadIdx.x / 32) {case 0...31}  // no slowdown
switch(threadIdx.x % 32) {case 0...31}  // 32x slowdown

5. Streams

Sequence of operations execute in order(memory transfers, kernels) gpu-stream

cudaStream_t s1, s2;
// behaviour is full concurrency: default stream also maps to a single Stream
// https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
// cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

// 这里注意，cudaMemcpyAsync只是说这个func是异步返回，具体内部是否能完全把2个copy+计算并行化，还需要3个必要条件
// 1. host用pinned memory，用cudaMallocHost 2. 放到不同的stream  3. 这个gpu卡上有free DMA copy engine
// 否则，这个asycn的copy调用，行为上就退化到和cudaMemcpy一样，只是说最终host thread会卡在cudaDeviceSynchronize而已
// https://stackoverflow.com/questions/14093601/effect-of-using-page-able-memory-for-asynchronous-memory-copy、
cudaStreamCreate(&s1);
checkCudaErrors(cudaMemcpyAsync(d_fimTheta, out_t, numbytes, cudaMemcpyHostToDevice, s1));
cudaStreamCreate(&s2);
checkCudaErrors(cudaMemcpyAsync(d_fimMag, out_m, numbytes, cudaMemcpyHostToDevice, s2));

A<<<1, 1024, s1>>>(d_fimTheta);
B<<<1, 1024, s2>>>(d_fimMag);
cudaDeviceSynchronize();
cudaStreamDestroy(s1); cudaStreamDestroy(s2);

6. Summary

gpu-improve-1

gpu-improve-2

gpu-improve-3

7. Problem 5

Fast histogram: Fast histogram