Tile a Matrix Multiplication

Pick a tile size T: each thread block computes one T×T tile of the output. A larger T reuses each loaded value more often, but the tile must still fit in fast on-chip shared memory.
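To make the reuse concrete, here is a minimal pure-Python sketch of the tiling idea. It is not GPU code: the name `tiled_matmul` and the `loads` counter are illustrative assumptions, the outer (i0, j0) loop stands in for thread blocks, and the local `a_tile` and `b_tile` lists stand in for shared memory.

```python
def tiled_matmul(A, B, T):
    """Multiply A (M x K) by B (K x N), one T x T output tile at a time.

    Returns (C, loads), where loads counts the elements copied into the
    stand-in "shared memory" tiles (hypothetical bookkeeping for illustration).
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    loads = 0
    for i0 in range(0, M, T):              # each (i0, j0) pair plays one thread block
        for j0 in range(0, N, T):
            i1, j1 = min(i0 + T, M), min(j0 + T, N)
            # Stage the rows of A and the columns of B that this output tile needs.
            a_tile = [A[i][:] for i in range(i0, i1)]
            b_tile = [[B[k][j] for j in range(j0, j1)] for k in range(K)]
            loads += (i1 - i0) * K + K * (j1 - j0)
            # Every staged value is reused across a full row or column of the tile.
            for i in range(i1 - i0):
                for j in range(j1 - j0):
                    acc = 0.0
                    for k in range(K):
                        acc += a_tile[i][k] * b_tile[k][j]
                    C[i0 + i][j0 + j] = acc
    return C, loads
```

Under this counting, a 16×16 output with K = 16 stages 1024 elements at T = 8 and 512 at T = 16, so doubling T halves the traffic for the same arithmetic.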

tile size T: 8

FLOPs per tile: —
Bytes loaded: —
Arithmetic intensity: —
Shared memory: —

Compute used: —
Memory bandwidth: —
Shared memory: —

Hardware: 50 TFLOPS peak compute, 3 TB/s HBM bandwidth, 48 KB shared memory per block. Inner dimension K = 16, fp32 (4 bytes per element). The compute/memory crossover is at AI = 50 TFLOPS ÷ 3 TB/s ≈ 17 FLOPs/byte, drawn as the dashed line on the roofline.
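The readouts above can be approximated with a short calculation. The sketch below is one plausible cost model, not necessarily the one this page uses. It assumes 2 FLOPs (multiply plus add) per inner-product term, that only the T×K tile of A and the K×T tile of B are loaded and held in shared memory, and that 48 KB means 48 × 1024 bytes. The name `tile_stats` is hypothetical.

```python
PEAK_FLOPS = 50e12          # 50 TFLOPS peak compute (from the text)
HBM_BW = 3e12               # 3 TB/s HBM bandwidth (from the text)
SMEM_BYTES = 48 * 1024      # 48 KB per block (assumption: KB = 1024 bytes)
K = 16                      # inner dimension (from the text)
BYTES_PER_ELEM = 4          # fp32

CROSSOVER_AI = PEAK_FLOPS / HBM_BW   # about 16.7 FLOPs/byte


def tile_stats(T):
    """Roofline numbers for one T x T output tile under the stated assumptions."""
    flops = 2 * T * T * K                        # assumption: multiply + add per term
    bytes_loaded = 2 * T * K * BYTES_PER_ELEM    # assumption: A tile + B tile only
    ai = flops / bytes_loaded                    # simplifies to T / 4 for fp32
    attainable = min(PEAK_FLOPS, ai * HBM_BW)    # roofline: lower of the two ceilings
    return {
        "flops": flops,
        "bytes_loaded": bytes_loaded,
        "ai": ai,
        "smem_bytes": bytes_loaded,              # assumption: both tiles stay resident
        "fits": bytes_loaded <= SMEM_BYTES,
        "compute_used": attainable / PEAK_FLOPS,
        "bound": "compute" if ai >= CROSSOVER_AI else "memory",
    }


if __name__ == "__main__":
    for T in (8, 32, 64, 128):
        print(T, tile_stats(T))
```

Under these assumptions T = 8 gives 2048 FLOPs over 1024 bytes, an arithmetic intensity of 2 FLOPs/byte, which sits well to the left of the crossover and is therefore memory-bound. In this model the intensity grows as T/4, so the tile crosses into the compute-bound region only once T reaches roughly 67.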