ApraPipes 1.0
CUDA Kernel Programming Guide

Performance Guide

The performance guidelines in the official CUDA documentation are very important and useful. Follow the official CUDA documentation rather than other sources.

Coalesced Access to Global Memory

See the "Coalesced Access to Global Memory" section of the official CUDA documentation.

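  • Refer to OverlayKernel.cu and EffectsKernel.cu for examples.
  • Access pixels as uchar4 (4 bytes per thread): with 32x32 threads per block, each block covers 4 x 32 x 32 = 4096 bytes (4 KB).
  • Coalescing makes a big difference: roughly 2x in performance.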
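
As a rough illustration of the access pattern described above, the following sketch maps one thread to one uchar4 pixel and launches 32x32 threads per block. The kernel `halveBrightness`, the helper `blocksFor`, the launcher, and the `pitchInPixels` parameter are hypothetical names, not code from OverlayKernel.cu or EffectsKernel.cu. The sketch assumes a row-major RGBA image in device memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not from ApraPipes): halves the RGB channels of an image.
__global__ void halveBrightness(const uchar4* src, uchar4* dst,
                                int width, int height, int pitchInPixels)
{
    // threadIdx.x is the fastest-varying index within a warp, so mapping it
    // to the column makes the 32 threads of a warp touch 32 adjacent uchar4
    // values: one contiguous 128-byte span of the row.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * pitchInPixels + x;
    uchar4 p = src[idx];   // single 4-byte load per thread
    p.x >>= 1;
    p.y >>= 1;
    p.z >>= 1;             // alpha (p.w) is left unchanged
    dst[idx] = p;          // single 4-byte store per thread
}

// Number of blocks needed to cover `size` elements with `block` threads each.
inline unsigned int blocksFor(int size, unsigned int block)
{
    return (size + block - 1) / block;
}

// Hypothetical launcher using the 32x32 block shape mentioned above.
void launchHalveBrightness(const uchar4* src, uchar4* dst, int width,
                           int height, int pitchInPixels, cudaStream_t stream)
{
    dim3 block(32, 32);   // 1024 threads x 4 bytes = 4096 bytes per block
    dim3 grid(blocksFor(width, block.x), blocksFor(height, block.y));
    halveBrightness<<<grid, block, 0, stream>>>(src, dst, width, height,
                                                pitchInPixels);
}
```

Swapping the roles of x and y, so that threadIdx.x walks down a column, would place neighboring threads a full row apart in memory, which is the uncoalesced pattern to avoid.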

Math Library

NVIDIA CUDA Math API

  • Use the multiplication functions from the CUDA Math API.
  • This makes a big difference in performance.
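
The notes above do not name the specific functions that were used. As one possible reading, the sketch below calls two multiplication intrinsics from the CUDA Math API, `__fmul_rn` and `__fmaf_rn`, in two hypothetical kernels. The kernel names, parameters, and the `blocksFor` helper are illustrative assumptions, and any speedup should be measured on the target GPU rather than assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = in[i] * gain.
__global__ void applyGain(const float* in, float* out, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // __fmul_rn: single-precision multiply, rounded to nearest even.
        // The compiler never merges it into a fused multiply-add.
        out[i] = __fmul_rn(in[i], gain);
    }
}

// Hypothetical kernel: out[i] = in[i] * gain + offset.
__global__ void applyGainOffset(const float* in, float* out,
                                float gain, float offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // __fmaf_rn: multiply and add as one operation with a single
        // rounding step (round to nearest even).
        out[i] = __fmaf_rn(in[i], gain, offset);
    }
}

// Number of blocks needed to cover `n` elements with `threads` per block.
inline unsigned int blocksFor(int n, unsigned int threads)
{
    return (n + threads - 1) / threads;
}

// Hypothetical launcher; `in` and `out` are assumed to be device pointers.
void launchApplyGain(const float* in, float* out, float gain, int n,
                     cudaStream_t stream)
{
    const unsigned int threads = 256;
    applyGain<<<blocksFor(n, threads), threads, 0, stream>>>(in, out, gain, n);
}
```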

__device__ functions

To keep the code clean and reusable, I was using __device__ functions, but performance dropped by half, so I switched to macros. I did not investigate why.
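
The cause of the slowdown is not identified here. For illustration only, the sketch below shows the macro approach next to a `__forceinline__ __device__` function, a qualifier that asks the compiler to inline the call. Whether the function variant recovers the lost performance in this codebase is untested. `BLEND_CHANNEL`, `blendChannel`, and both kernels are hypothetical names that do not come from the ApraPipes sources.

```cuda
#include <cuda_runtime.h>

// Macro variant (the approach the text settled on): expanded textually at
// every use. Arguments are parenthesized; avoid passing expressions with
// side effects because each argument may be evaluated more than once.
#define BLEND_CHANNEL(src, ovl, alpha) \
    (((src) * (255 - (alpha)) + (ovl) * (alpha)) / 255)

// Function variant: same arithmetic, with a request to inline the call.
__forceinline__ __device__ int blendChannel(int src, int ovl, int alpha)
{
    return (src * (255 - alpha) + ovl * alpha) / 255;
}

// Hypothetical kernel using the macro.
__global__ void blendWithMacro(const uchar4* src, const uchar4* ovl,
                               uchar4* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uchar4 s = src[i], o = ovl[i], r;
    r.x = (unsigned char)BLEND_CHANNEL(s.x, o.x, o.w);
    r.y = (unsigned char)BLEND_CHANNEL(s.y, o.y, o.w);
    r.z = (unsigned char)BLEND_CHANNEL(s.z, o.z, o.w);
    r.w = s.w;   // keep the source alpha
    dst[i] = r;
}

// Hypothetical kernel using the inlined device function.
__global__ void blendWithFunction(const uchar4* src, const uchar4* ovl,
                                  uchar4* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uchar4 s = src[i], o = ovl[i], r;
    r.x = (unsigned char)blendChannel(s.x, o.x, o.w);
    r.y = (unsigned char)blendChannel(s.y, o.y, o.w);
    r.z = (unsigned char)blendChannel(s.z, o.z, o.w);
    r.w = s.w;   // keep the source alpha
    dst[i] = r;
}
```

Timing both kernels on the same input, and comparing the generated code, would be one way to check whether inlining explains the difference.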


© Copyright 2020-2024, Apra Labs Inc.