Tile a Matrix Multiplication
Pick a tile size T (T×T output per thread block). Bigger T = more data reuse, but the tile must fit in fast on-chip shared memory.
8
FLOPs per tile: Bytes loaded: Arithmetic intensity: Shared memory:
Compute used:
Memory bandwidth:
Shared memory:
Hardware: 50 TFLOPS peak compute, 3 TB/s HBM bandwidth, 48 KB shared memory per block. Inner dimension K = 16 (fp32). The compute/memory crossover is at AI ≈ 17 FLOPs/byte; the dashed line on the roofline.