CUDA 3 - Optimizing GPU Programs
Udacity course: Intro to Parallel Programming
- Basics on GPU, CUDA, Memory Model
- Parallel Algorithms (Reduce, Scan, Histogram, Sort)
- Optimize Parallel GPU Programs
- Others (Libraries, OpenACC, Dynamic Parallelism)
1. APOD (Analyze, Parallelize, Optimize, Deploy)
2. Hotspots
Amdahl’s Law:
max speedup = 1 / (1-p), where p is the portion of the program that can be parallelized
y: the total speedup; x: the speedup of the parallelized portion
y = 1 / ((1-p) + p/x) ==> y = x / ((1-p)x + p)
As x grows large, p/x approaches 0, so y approaches 1 / (1-p)
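A quick worked example of the formula above. The numbers p = 0.9 and x = 10 are chosen only for illustration and do not come from the course:

```latex
\[
y = \frac{1}{(1-p) + \frac{p}{x}}
  = \frac{1}{0.1 + \frac{0.9}{10}}
  = \frac{1}{0.19} \approx 5.26,
\qquad
\lim_{x \to \infty} y = \frac{1}{1-p} = \frac{1}{0.1} = 10 .
\]
```

So a 10x speedup of the parallel 90% gives only about 5.26x overall, and no speedup of that portion can push the total past 10x.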
3. Occupancy
Optimization example: Transpose
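A minimal sketch of a tiled transpose, assuming a square n x n row-major matrix of floats. It is not necessarily the exact kernel from the course. The names transposeTiled, transposeOnGpu and TILE are made up for this example, and error checking is left out to keep it short:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // tile width; an assumption, chosen to match the warp size

// Transposes an n x n row-major matrix through a shared-memory tile so that
// both the global read and the global write touch consecutive addresses.
__global__ void transposeTiled(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices: this block writes the mirrored tile of `out`.
    x = blockIdx.y * TILE + threadIdx.x;  // column in `out`
    y = blockIdx.x * TILE + threadIdx.y;  // row in `out`
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: copies to the device, runs the kernel, copies back.
std::vector<float> transposeOnGpu(const std::vector<float>& h_in, int n)
{
    size_t bytes = size_t(n) * n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, n);

    std::vector<float> h_out(size_t(n) * n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The shared-memory tile is what lets consecutive threads write consecutive output addresses. A naive kernel that writes out[x * n + y] directly would have a stride of n between neighbouring threads.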
4. Warps: Avoid Thread Divergence
A warp is a set of threads that execute the same instruction at the same time
On NVIDIA cards a warp is 32 threads, so a divergent branch can slow a warp down by at most 32x
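To make divergence concrete, here is a small sketch with two kernels that do the same kind of work but branch differently. The kernel and helper names are made up for this example, it assumes one-dimensional thread blocks, and the two kernels deliberately apply the branches to different elements, so their outputs differ:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent: even and odd threads of the same warp take different branches,
// so each warp has to execute both branches one after the other.
__global__ void divergentKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) data[i] *= 2.0f;
    else                      data[i] += 1.0f;
}

// Warp-aligned: the condition is the same for all 32 threads of a warp,
// so every warp follows a single path.
__global__ void warpAlignedKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) data[i] *= 2.0f;
    else                                   data[i] += 1.0f;
}

// Hypothetical host helper: runs one of the two kernels on a copy of `h`.
std::vector<float> runBranchDemo(std::vector<float> h, bool aligned)
{
    int n = static_cast<int>(h.size());
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 128;
    int grid = (n + block - 1) / block;
    if (aligned) warpAlignedKernel<<<grid, block>>>(d, n);
    else         divergentKernel<<<grid, block>>>(d, n);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

The general idea is to arrange the data or the indexing so that the branch condition changes only at warp boundaries.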
5. Streams
A stream is a sequence of operations (memory transfers, kernels) that execute in order
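A rough sketch of how streams are typically used, assuming the host buffer is pinned (allocated with cudaMallocHost) so the asynchronous copies can actually overlap. The kernel addOne and the helper addOneInTwoStreams are made up for this example, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical host helper: splits a pinned host buffer into two chunks and
// processes each chunk in its own stream.
void addOneInTwoStreams(float* h_pinned, int n)
{
    const int kStreams = 2;
    int chunk = (n + kStreams - 1) / kStreams;

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        int offset = s * chunk;
        int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);
        // Within one stream these three operations run in the order issued.
        // Operations in different streams are allowed to overlap.
        cudaMemcpyAsync(d + offset, h_pinned + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(count + 255) / 256, 256, 0, streams[s]>>>(d + offset, count);
        cudaMemcpyAsync(h_pinned + offset, d + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);  // wait for everything queued in this stream
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(d);
}
```

With two streams, the copy for the second chunk can proceed while the kernel for the first chunk is still running, which is the main reason to use streams at all.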
6. Summary
7. Problem 5
Fast histogram:
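One common way to speed up a histogram is to give each block a private copy in shared memory and merge it at the end. The sketch below shows that technique only. It is not the reference solution for Problem 5, and it assumes byte-valued input with 256 bins. The names histogramShared, histogramOnGpu and NUM_BINS are made up here:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int NUM_BINS = 256;  // assumption: one bin per possible byte value

// Each block accumulates a private histogram in shared memory, then merges it
// into the global histogram with one atomicAdd per bin.
__global__ void histogramShared(const unsigned char* in, int n, unsigned int* bins)
{
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: each thread handles several input elements.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[in[i]], 1u);
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}

// Hypothetical host helper: returns the 256-bin histogram of `h_in`.
std::vector<unsigned int> histogramOnGpu(const std::vector<unsigned char>& h_in)
{
    int n = static_cast<int>(h_in.size());
    unsigned char* d_in = nullptr;
    unsigned int* d_bins = nullptr;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogramShared<<<64, 256>>>(d_in, n, d_bins);  // launch size is an arbitrary choice

    std::vector<unsigned int> h_bins(NUM_BINS);
    cudaMemcpy(h_bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_bins);
    return h_bins;
}
```

Shared-memory atomics contend only within a block, so the global histogram sees one atomicAdd per bin per block instead of one per input element.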