https://hgpu.org/?p=27590
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism