Cuda 2 - Parallel Algorithms(Reduce, Scan, Histogram, Sort)

UDACITY教程 Intro to Parallel Programming

  • Basics on GPU, CUDA, Memory Model
  • Parallel Algorithms(Reduce, Scan, Histogram, Sort)
  • Optimize Parallel GPU Programs
  • Others(Library, OpenACC, Dynamic parallelism)

1. Reduce

Input: 
1) set of elements 
2) reduction operator: binary(2 to 1), associative(- is not, + is)

Serial Implementation: work/step complexity O(n)

Parallel Impplementation: O(log(n)), 并行归并 Example

2. Scan

统计该元素之前所有位置的op结果

gpu-scan

Parallel it: 1) logn steps and nlogn works 2) 2logn steps and 2n work

nvidia-scan-implementation

3. Histogram

  • 效率较低: Need to use atomic operation(Hash计数问题)

  • 优化: Reduced-based algorithm: Each thread has its own bin

4. Problem 3

Image Tone-mapping: solution

5. Use Scan in Sparse Matrix

Use Scan to compute the Address of the density-array.

  • Sparse Matrix: - CSR representation

    csr

  • Solution-1 - thread/element by Segmented Scan:

    multi

    • Use ROWPTR to generate segmented value array
    • Thread per element: map (VALUE[n] * X[COLUMN[n]])
    • Backwards inclusive segmented sum scan
    • Use ROWPTR to gather sparse output into dense vector
  • Solution-2 - thread/row cuda code:(similar # of elements/row, 3X faster)

    csr-cuda

  • Solution-3 - Hybrid:

    multi-3

6. Sort

  • ODD_EVEN Sort(O(n), 类似冒泡排序)
  • Merge Sort
  • Sorting Networks: 比较的顺序很tricky!
  • Quick Sort
  • Radix Sort: GPU容易实现,且效率高 radix-sort

7. Problem 4

Red Eye Removal: solution

Loading Disqus comments...
Table of Contents