CUDA 3 - Optimizing GPU Programs

Udacity course: Intro to Parallel Programming

  • Basics on GPU, CUDA, Memory Model
  • Parallel Algorithms (Reduce, Scan, Histogram, Sort)
  • Optimize Parallel GPU Programs
  • Others (Libraries, OpenACC, Dynamic Parallelism)

1. APOD

[Figure: APOD cycle (Analyze, Parallelize, Optimize, Deploy)]

2. Hotspots

Amdahl’s Law:

max speedup = 1 / (1-p), where p is the fraction of the program that can be parallelized

More generally, let y be the total speedup and x the speedup of the parallelizable portion:

y = 1 / ((1-p) + p/x) = x / ((1-p)x + p)

As x grows large, p/x goes to 0 and y approaches 1 / (1-p)
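As an illustrative worked example of the formula above (the values p = 0.9 and x = 100 are assumptions chosen only for the arithmetic, not figures from the course):

```latex
\begin{align*}
y &= \frac{1}{(1-p) + p/x} = \frac{1}{0.1 + 0.9/100} = \frac{1}{0.109} \approx 9.17 \\
y_{\max} &= \lim_{x \to \infty} \frac{1}{(1-p) + p/x} = \frac{1}{1-p} = \frac{1}{0.1} = 10
\end{align*}
```

Even with the parallel part running 100x faster, the whole program gains less than 10x, because the serial 10% dominates. This is why it pays to profile for hotspots before optimizing.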

3. Occupancy

[Figure: gpu-occupancy]
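Occupancy is commonly defined as the ratio of threads (or warps) resident on a multiprocessor to the hardware maximum. As a rough sketch of how one might query the theoretical value with the CUDA runtime occupancy API: the kernel `saxpy` and the helper `theoreticalOccupancy` are hypothetical names introduced here for illustration, and device 0 is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, used only as something to measure.
__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns the theoretical occupancy (0..1) of saxpy on device 0 for the
// given block size, or -1.0f if a runtime call fails.
float theoreticalOccupancy(int blockSize)
{
    int blocksPerSM = 0;
    // How many blocks of this size can be resident on one SM at once
    // (0 bytes of dynamic shared memory assumed).
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, saxpy, blockSize, 0) != cudaSuccess)
        return -1.0f;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1.0f;

    // Resident threads divided by the SM's maximum resident threads.
    return static_cast<float>(blocksPerSM * blockSize) /
           static_cast<float>(prop.maxThreadsPerMultiProcessor);
}
```

Calling `theoreticalOccupancy(256)` and `theoreticalOccupancy(64)` and comparing the results is one way to see how block size affects occupancy. Note that this is an upper bound; achieved occupancy can be lower, and higher occupancy does not always mean higher performance.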

Optimization example: transpose

[Figure: gpu-tile]
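A minimal sketch of a tiled transpose, assuming the tiling refers to the common technique of staging a tile in shared memory so that both the global read and the global write are coalesced. The names `transposeTiled`, `transpose`, and `TILE` are hypothetical, and a square n x n row-major matrix is assumed.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Each block moves one TILE x TILE tile through shared memory.
__global__ void transposeTiled(const float* in, float* out, int n)
{
    // One extra column of padding avoids shared-memory bank conflicts
    // when the tile is read column-wise below.
    __shared__ float tile[TILE][TILE + 1];

    // Coalesced read: consecutive threadIdx.x touch consecutive addresses.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    // Swap the block indices so the write is also coalesced, and read
    // the tile transposed.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Host-side launcher; d_in and d_out are device pointers to n * n floats.
void transpose(const float* d_in, float* d_out, int n)
{
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, n);
}
```

A 32 x 32 block uses 1024 threads, the maximum per block on current NVIDIA GPUs; a smaller tile, or a tile where each thread handles several elements, may give better occupancy and is worth measuring.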

4. Warp: Avoid Thread Divergence

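A warp is a set of threads that execute the same instruction at the same time.

On NVIDIA GPUs a warp is 32 threads, so:

switch(threadIdx.x / 32) {case 0...31}  // no slowdown: all threads of a warp take the same case
switch(threadIdx.x % 32) {case 0...31}  // 32x slowdown: every thread of a warp takes a different case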
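The two switch statements above can be sketched as a pair of kernels. This is an illustration under assumptions: the names `pathWork`, `branchPerWarp`, and `branchPerLane` are hypothetical, the block is assumed to be 1D (so `threadIdx.x / warpSize` is the warp index), and only four paths are used to keep it short. For bodies this trivial the compiler may replace the branches with predicated instructions, so the measured slowdown can be smaller than the worst case.

```cuda
#include <cuda_runtime.h>

// Four different code paths; which one runs depends on `path`.
__device__ int pathWork(int path, int i)
{
    switch (path) {
        case 0:  return i;
        case 1:  return 2 * i;
        case 2:  return i * i;
        default: return -i;
    }
}

// Branch on the warp index: all 32 threads of a warp pick the same path,
// so no warp diverges. `out` must hold gridDim.x * blockDim.x ints.
__global__ void branchPerWarp(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pathWork((threadIdx.x / warpSize) % 4, i);
}

// Branch on the lane index: neighbouring threads of one warp pick
// different paths, so each warp may execute up to four paths in turn.
__global__ void branchPerLane(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pathWork((threadIdx.x % warpSize) % 4, i);
}
```

Both kernels do the same amount of useful work per thread; the difference lies only in how the paths are distributed across a warp, which is what a profiler's branch-efficiency metrics would show.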

5. Streams

A stream is a sequence of operations (memory transfers, kernels) that execute in order.

[Figure: gpu-stream]

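cudaStream_t s1, s2;
// s1 and s2 can run fully concurrently with each other. Work issued to the legacy
// default stream, however, synchronizes with every blocking stream; to avoid that,
// compile with per-thread default streams (CUDA 7+) or create non-blocking streams:
// https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
// cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

// Note: cudaMemcpyAsync only guarantees that the call returns to the host asynchronously.
// For the two copies and the kernels to actually overlap, three conditions must hold:
//   1. the host buffers are pinned memory (allocated with cudaMallocHost),
//   2. the operations are issued to different streams,
//   3. the GPU has a free DMA copy engine.
// Otherwise the async copy behaves like a plain cudaMemcpy; the only difference is
// that the host thread ends up blocking at cudaDeviceSynchronize instead.
// https://stackoverflow.com/questions/14093601/effect-of-using-page-able-memory-for-asynchronous-memory-copy
cudaStreamCreate(&s1);
checkCudaErrors(cudaMemcpyAsync(d_fimTheta, out_t, numbytes, cudaMemcpyHostToDevice, s1));
cudaStreamCreate(&s2);
checkCudaErrors(cudaMemcpyAsync(d_fimMag, out_m, numbytes, cudaMemcpyHostToDevice, s2));

// The third launch parameter is the dynamic shared memory size; the stream is the fourth.
A<<<1, 1024, 0, s1>>>(d_fimTheta);
B<<<1, 1024, 0, s2>>>(d_fimMag);
cudaDeviceSynchronize();
cudaStreamDestroy(s1); cudaStreamDestroy(s2);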
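For a more self-contained picture, here is a hedged sketch of the usual copy/compute overlap pattern: one buffer is split into two chunks, and each chunk's upload, kernel, and download are issued to its own stream. The kernel `scale` and the helper `scaleInTwoStreams` are hypothetical names, and the caller is assumed to pass a host buffer allocated with cudaMallocHost (pinned memory).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies n floats by factor in place.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales h_pinned in place on the GPU using two streams.
// h_pinned is assumed to be pinned host memory; with pageable memory the
// code still works, but the copies are not expected to overlap.
cudaError_t scaleInTwoStreams(float* h_pinned, int n, float factor)
{
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    float* d_buf = nullptr;

    cudaError_t err = cudaMalloc((void**)&d_buf, n * sizeof(float));
    if (err != cudaSuccess) return err;
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (n + kStreams - 1) / kStreams;
    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        const int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        const size_t bytes = count * sizeof(float);

        // Within one stream these three operations run in order; across
        // streams they may overlap if the hardware allows it.
        cudaMemcpyAsync(d_buf + offset, h_pinned + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(
            d_buf + offset, count, factor);
        cudaMemcpyAsync(h_pinned + offset, d_buf + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams; this also reports errors from the async work.
    err = cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    return err;
}
```

Whether the two chunks actually overlap depends on the three conditions listed in the comments above; a timeline view in a profiler is the practical way to check.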

6. Summary

[Figures: gpu-improve-1, gpu-improve-2, gpu-improve-3]

7. Problem 5

Fast histogram
