Microbatch Memory Visualization

[Figure: interactive chart of per-microbatch memory, with a max_tokens_per_mb control. Legend: actual tokens; padding waste; not in peak MB; fixed memory (model + optimizer + gradients).]