Learn GPU Optimization

It has been a while since I learned GPU knowledge. I am going to keep updating more recent materials for ramping up my GPU knowledge. FlashAttention [1] We start from recapping the standard Self-Attention mechanism, which is computed in 3-passes: Notes: The shape and represent the sequence length and internal dimension, respectively. , , , …