Actual tokens
Padding waste
Not in peak MB
Fixed memory (model + optimizer + gradients)