Allow 1 mantissa bit diff in TestFused8BitRowwiseQuantizationConversion (pytorch#2015)

Summary:
Pull Request resolved: pytorch#2015

The reference implementation of FP8 quantization is in Python, but the actual implementation is in C++/CUDA. Per summerdengfb's investigation, Python has a known floating-point representation issue (https://www.geeksforgeeks.org/floating-point-error-in-python/), which can cause discrepancies between the two quantization results. To work around this, we allow a 1-bit difference (the LSB of the mantissa) in the FP8 quantization result in `TestFused8BitRowwiseQuantizationConversion`.

Reviewed By: q10, shintaro-iwasaki

Differential Revision: D49255499

fbshipit-source-id: b28294f8076bda61589e10699119375f03b091a8
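As a rough illustration (not the actual test code), a tolerance check of this kind could be written as the minimal sketch below. The helper name and the use of NumPy are assumptions; it relies on the fact that, for same-sign FP8 values, a 1-bit difference in the mantissa LSB shows up as a difference of exactly 1 in the byte encoding:

```python
import numpy as np

def assert_fp8_close(reference: np.ndarray, actual: np.ndarray) -> None:
    """Assert two arrays of FP8-encoded bytes match within 1 mantissa LSB.

    Hypothetical helper: for same-sign FP8 values, adjacent byte encodings
    are adjacent representable values, so a 1-bit mantissa-LSB difference
    corresponds to an element-wise byte difference of at most 1.
    """
    # Widen to int16 before subtracting to avoid uint8 wraparound.
    diff = np.abs(reference.astype(np.int16) - actual.astype(np.int16))
    assert np.all(diff <= 1), f"max byte diff {diff.max()} exceeds 1 LSB"
```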