update main blog NLL results for MNIST
albertfgu committed Jun 28, 2022
1 parent a7f1c32 commit 0105f28
Showing 2 changed files with 5 additions and 5 deletions.
6 changes: 3 additions & 3 deletions s4/s4.py
@@ -1401,12 +1401,12 @@ def sample_checkpoint(path, model, length, rng):
 # A more visually interesting task is generating MNIST digits, by predicting entire
 # sequences of pixels! Here, we simply feed in a sequence of pixels into the model and have it
 # predict the next one like language modeling. With a little
-# tweaking, we are able to get the model to an NLL of 0.52 on this
-# task with size 512 and 6 layers (~2m parameters).
+# tweaking, we are able to get the model to an NLL of 0.36 on this
+# task with size 512 and 6 layers (~4m parameters).
 #
 # The metric usually used for this task is *[bits per
 # dimension](https://paperswithcode.com/sota/image-generation-on-mnist)* which is
-# NLL in base 2 for MNIST. A score of 0.52 is ~0.76 BPD which is near PixelCNN++.
+# NLL in base 2 for MNIST. A loss of 0.36 is ~0.52 BPD which is SOTA according to PapersWithCode.
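The conversion behind the updated figure is just a change of logarithm base: NLL is reported in nats, and bits per dimension divides by ln 2. A one-line sketch (the function name is hypothetical, not part of the repo):

```python
import math

def nll_to_bpd(nll_nats: float) -> float:
    # Bits per dimension = NLL in nats / ln(2), i.e. the same
    # negative log-likelihood expressed in base-2 logarithms.
    return nll_nats / math.log(2)

print(round(nll_to_bpd(0.36), 2))  # -> 0.52, matching the quoted BPD
```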


# <img src="images/sample.png" width="100%">
4 changes: 2 additions & 2 deletions s4/s4d.py
@@ -433,9 +433,9 @@ def test_conversion(N=8, L=16):
 # It's neat that generalizing the diagonal case to diagonal plus low-rank simply reduces to a slightly different, but computationally equivalent, linear algebra primitive!

 # Note that these primitives can be implemented in many ways, which has been the source of some confusion about their efficiencies (is diagonal faster than DPLR?) and implementations (does DPLR require a custom CUDA kernel?).
-# In summary, the DPLR kernel (i.e. Cauchy) and all versions of diagonal kernels (i.e. Vandermonde) actually have the *exact same computational complexities* as well as "implementation complexity", because the computational core in all cases is a similar structured matrix product. This can be computed in:
+# In short, the DPLR kernel (i.e. Cauchy) and any version of diagonal kernel (i.e. Vandermonde) actually have the *exact same computational complexities* as well as "implementation complexity", because the computational core in all cases is a similar structured matrix product. This can be computed in:
 #
-# * $O(NL)$ time and $O(NL)$ space, by naively materializing the matrix
+# * $O(NL)$ time and $O(NL)$ space, by naively materializing the matrix (good enough for most purposes!)
 # * $O(NL)$ time and $O(N+L)$ space, which either requires a custom kernel (e.g. in PyTorch) or taking advantage of clever compilers (e.g. JAX with XLA) as in our implementation above
 # * $\widetilde{O}(N+L)$ time and $O(N+L)$ space theoretically, from a rich body of literature in scientific computing
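As a concrete illustration of the first option in the list, here is a minimal NumPy sketch (not the repo's JAX implementation; names and values are illustrative) of the naive $O(NL)$ computation for the diagonal case: the kernel is a matrix-vector product against a fully materialized Vandermonde matrix.

```python
import numpy as np

def vandermonde_kernel(A, B, C, L):
    # Convolution kernel of a diagonal SSM: K[l] = sum_n C_n * B_n * A_n**l.
    # Materializing the N x L Vandermonde matrix V[n, l] = A_n**l costs
    # O(NL) time and O(NL) space -- the naive option above.
    V = A[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    return (C * B) @ V                       # (L,) kernel

# Tiny example with made-up parameters
A = np.array([0.9, 0.5])  # diagonal of the state matrix
B = np.array([1.0, 2.0])  # input projection
C = np.array([0.3, 0.7])  # output projection
K = vandermonde_kernel(A, B, C, L=4)
```

The memory-efficient $O(N+L)$ variants compute the same product without ever storing `V`, e.g. by fusing the power and reduction steps in a custom kernel or letting XLA do the fusion.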
