Performance Guide

The official CUDA documentation is the most important and useful reference. Follow it in preference to other sources.

Coalesced Access to Global Memory

Refer to OverlayKernel.cu and EffectsKernel.cu.
Each thread handles one uchar4 (4 bytes). With 32x32 threads per block, a block touches 4 x 32 x 32 = 4096 bytes (4 KB).
Coalescing these accesses made a big difference: roughly 2x in performance.
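
As an illustration of the access pattern described above, here is a minimal sketch. It is not taken from OverlayKernel.cu or EffectsKernel.cu: the kernel name invertCoalesced, the helper runInvertCheck, and the pixel-inversion operation are all hypothetical. The point is only that threadIdx.x indexes the fastest-varying (column) dimension, so the 32 threads of a warp read 32 adjacent uchar4 values (128 contiguous bytes).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the original sources): inverts the RGB
// channels of a row-major width x height uchar4 image.
// threadIdx.x maps to the column index, so neighbouring threads in a warp
// touch neighbouring 4-byte elements and the loads/stores can coalesce.
__global__ void invertCoalesced(const uchar4* in, uchar4* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column: fastest-varying
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {
        int idx = y * width + x;                    // row-major index
        uchar4 p = in[idx];                         // one 4-byte load per thread
        out[idx] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
    }
}

// Hypothetical host helper: runs the kernel and checks the result on the CPU.
// Returns false on a mismatch or if a CUDA call fails (for example, no GPU).
bool runInvertCheck(int width, int height)
{
    size_t count = (size_t)width * height;
    size_t bytes = count * sizeof(uchar4);
    std::vector<uchar4> hostIn(count), hostOut(count);
    for (size_t i = 0; i < count; ++i)
        hostIn[i] = make_uchar4((unsigned char)(i % 256), (unsigned char)((i * 3) % 256),
                                (unsigned char)((i * 7) % 256), 255);

    uchar4 *devIn = nullptr, *devOut = nullptr;
    if (cudaMalloc(&devIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&devOut, bytes) != cudaSuccess) { cudaFree(devIn); return false; }
    cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 32);  // 32x32 threads, 4 KB of uchar4 data per block
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    invertCoalesced<<<grid, block>>>(devIn, devOut, width, height);

    bool ok = cudaMemcpy(hostOut.data(), devOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(devIn);
    cudaFree(devOut);

    for (size_t i = 0; ok && i < count; ++i)
        ok = hostOut[i].x == 255 - hostIn[i].x && hostOut[i].y == 255 - hostIn[i].y &&
             hostOut[i].z == 255 - hostIn[i].z && hostOut[i].w == hostIn[i].w;
    return ok;
}

int main()
{
    printf("%s\n", runInvertCheck(640, 480) ? "ok" : "failed");
    return 0;
}
```

Swapping the roles of x and y in the index calculation (idx = x * height + y) would make neighbouring threads stride through memory instead, which is the kind of uncoalesced pattern to avoid.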

Math Library

For multiplication, use the functions from the CUDA math library.
This made a big difference.
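
The notes do not say which multiplication routine is meant. Purely as an assumption, the sketch below uses the integer intrinsic __mul24, which multiplies the low 24 bits of its operands. The kernel scaleMul24 and the helper runMul24Check are hypothetical names. According to the CUDA documentation, __mul24 was the faster choice on compute capability 1.x devices, while on later hardware the plain * operator is generally at least as fast, so measure before adopting it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a constant factor using
// the __mul24 intrinsic. Only valid when both operands fit in 24 bits.
__global__ void scaleMul24(const int* in, int* out, int factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __mul24(in[i], factor);  // assumption: the notes may mean another function
}

// Hypothetical host helper: compares the intrinsic against ordinary
// multiplication on the CPU. Returns false on mismatch or CUDA failure.
bool runMul24Check(int n, int factor)
{
    size_t bytes = (size_t)n * sizeof(int);
    std::vector<int> hostIn(n), hostOut(n);
    for (int i = 0; i < n; ++i)
        hostIn[i] = i;  // small values, well inside the 24-bit range

    int *devIn = nullptr, *devOut = nullptr;
    if (cudaMalloc(&devIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&devOut, bytes) != cudaSuccess) { cudaFree(devIn); return false; }
    cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleMul24<<<blocks, threads>>>(devIn, devOut, factor, n);

    bool ok = cudaMemcpy(hostOut.data(), devOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(devIn);
    cudaFree(devOut);

    for (int i = 0; ok && i < n; ++i)
        ok = hostOut[i] == hostIn[i] * factor;
    return ok;
}

int main()
{
    printf("%s\n", runMul24Check(1024, 3) ? "ok" : "failed");
    return 0;
}
```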

__device__ functions

To write clean, reusable code, I was using __device__ functions, but performance dropped by half. So I switched to macros instead. I did not investigate why the slowdown occurred.
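
One alternative worth measuring before falling back to macros is marking the helper __forceinline__, which asks the compiler to inline the call. The sketch below has not been checked against the kernels in these notes. The names BLEND_MACRO, blend, blendKernel, and runBlendCheck are hypothetical, and whether a missing inline was the cause of the slowdown here is an assumption. nvcc usually inlines small __device__ functions on its own, so the two variants may well compile to the same code.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Macro version, as the notes describe switching to.
#define BLEND_MACRO(a, b, t) ((a) + ((b) - (a)) * (t))

// Function version: typed arguments, each evaluated once, with inlining requested.
__forceinline__ __device__ float blend(float a, float b, float t)
{
    return a + (b - a) * t;
}

// Hypothetical kernel that computes the same blend both ways.
__global__ void blendKernel(const float* a, const float* b,
                            float* outFn, float* outMacro, float t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        outFn[i]    = blend(a[i], b[i], t);
        outMacro[i] = BLEND_MACRO(a[i], b[i], t);
    }
}

// Hypothetical host helper: checks that both variants agree with the CPU.
// Returns false on mismatch or if a CUDA call fails.
bool runBlendCheck(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    std::vector<float> hA(n), hB(n), hFn(n), hMacro(n);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = (float)i + 2.0f; }

    float *dA = nullptr, *dB = nullptr, *dFn = nullptr, *dMacro = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess && cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dFn, bytes) == cudaSuccess && cudaMalloc(&dMacro, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        int threads = 256;
        blendKernel<<<(n + threads - 1) / threads, threads>>>(dA, dB, dFn, dMacro, 0.5f, n);
        ok = cudaMemcpy(hFn.data(), dFn, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(hMacro.data(), dMacro, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dFn); cudaFree(dMacro);  // cudaFree(nullptr) is a no-op

    for (int i = 0; ok && i < n; ++i) {
        float expected = hA[i] + (hB[i] - hA[i]) * 0.5f;
        ok = fabsf(hFn[i] - expected) < 1e-5f && fabsf(hMacro[i] - expected) < 1e-5f;
    }
    return ok;
}

int main()
{
    printf("%s\n", runBlendCheck(1024) ? "ok" : "failed");
    return 0;
}
```

Timing both outputs separately with a profiler, or comparing the generated code, would show whether the function call itself was responsible for the drop.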