* initial commit
* expose some UX. update test
* add test. update bench
* update test. add doc
* fix ngpu
* fix FSDP
* fix
* fix fsdp test
* fix
* grammar
* simplify fsdp test
* update benchmark script
* update
* make claim more conservative
* register fused adam
* update benchmark script
* add more ops
* update default
* use TorchAOBaseTensor
* fix fsdp param_dtype
* fix param_dtype
* dtype check to prevent unnecessary errors
* move checks
* add note
* fix
* simplify script
* add module-based UX
* fix
* use FP8 impl of __torch_dispatch__
* rename _dynamice interface
* update test
* fix compile on 2.4
* log torch version
* make log interval customizable
* make naming for explicit
* update readme
* some change
* fix big bug
* add docstring. update _get_linear_inserter
* add TorchAOBaseTensor back
* fix FSDP
* update FSDP test. add autocast support
* reduce iter
* update int8_mm fallback
* put leading dims logic to _dynamic_int8_mm