Pick a tile size T (T×T output per thread block). Bigger T = more data reuse, but the tile must fit in fast on-chip shared memory.
8
FLOPs per tile:—Bytes loaded:—Arithmetic intensity:—Shared memory:—
Compute used:
—
Memory bandwidth:
—
Shared memory:
—
—
Hardware: 50 TFLOPS peak compute, 3 TB/s HBM bandwidth, 48 KB shared memory per block. Inner dimension K = 16 (fp32). The compute/memory crossover is at AI ≈ 17 FLOPs/byte; the dashed line on the roofline.