update main blog NLL results for MNIST
albertfgu committed Jun 28, 2022
1 parent a7f1c32 commit 0105f28
Showing 2 changed files with 5 additions and 5 deletions.
6 changes: 3 additions & 3 deletions s4/s4.py
@@ -1401,12 +1401,12 @@ def sample_checkpoint(path, model, length, rng):
 # A more visually interesting task is generating MNIST digits, by predicting entire
 # sequences of pixels! Here, we simply feed in a sequence of pixels into the model and have it
 # predict the next one like language modeling. With a little
-# tweaking, we are able to get the model to an NLL of 0.52 on this
-# task with size 512 and 6 layers (~2m parameters).
+# tweaking, we are able to get the model to an NLL of 0.36 on this
+# task with size 512 and 6 layers (~4m parameters).
 #
 # The metric usually used for this task is *[bits per
 # dimension](https://paperswithcode.com/sota/image-generation-on-mnist)* which is
-# NLL in base 2 for MNIST. A score of 0.52 is ~0.76 BPD which is near PixelCNN++.
+# NLL in base 2 for MNIST. A loss of 0.36 is ~0.52 BPD which is SOTA according to PapersWithCode.
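The conversion behind the updated figure is just a change of logarithm base: NLL is reported in nats, and bits per dimension divides by ln 2. A one-line sketch (the function name is hypothetical, not part of the repo):

```python
import math

def nll_to_bpd(nll_nats: float) -> float:
    # Bits per dimension = NLL in nats / ln(2), i.e. the same
    # negative log-likelihood expressed in base-2 logarithms.
    return nll_nats / math.log(2)

print(round(nll_to_bpd(0.36), 2))  # -> 0.52, matching the quoted BPD
```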


# <img src="images/sample.png" width="100%">
4 changes: 2 additions & 2 deletions s4/s4d.py
@@ -433,9 +433,9 @@ def test_conversion(N=8, L=16):
 # It's neat that generalizing the diagonal case to diagonal plus low-rank simply reduces to a slightly different, but computationally equivalent, linear algebra primitive!

 # Note that these primitives can be implemented in many ways, which has been the source of some confusion about their efficiencies (is diagonal faster than DPLR?) and implementations (does DPLR require a custom CUDA kernel?).
-# In summary, the DPLR kernel (i.e. Cauchy) and all versions of diagonal kernels (i.e. Vandermonde) actually have the *exact same computational complexities* as well as "implementation complexity", because the computational core in all cases is a similar structured matrix product. This can be computed in:
+# In short, the DPLR kernel (i.e. Cauchy) and any version of diagonal kernel (i.e. Vandermonde) actually have the *exact same computational complexities* as well as "implementation complexity", because the computational core in all cases is a similar structured matrix product. This can be computed in:
 #
-# * $O(NL)$ time and $O(NL)$ space, by naively materializing the matrix
+# * $O(NL)$ time and $O(NL)$ space, by naively materializing the matrix (good enough for most purposes!)
 # * $O(NL)$ time and $O(N+L)$ space, which either requires a custom kernel (e.g. in PyTorch) or taking advantage of clever compilers (e.g. JAX with XLA) as in our implementation above
 # * $\widetilde{O}(N+L)$ time and $O(N+L)$ space theoretically, from a rich body of literature in scientific computing
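As a concrete illustration of the first option in the list, here is a minimal NumPy sketch (not the repo's JAX implementation; names and values are illustrative) of the naive $O(NL)$ computation for the diagonal case: the kernel is a matrix-vector product against a fully materialized Vandermonde matrix.

```python
import numpy as np

def vandermonde_kernel(A, B, C, L):
    # Convolution kernel of a diagonal SSM: K[l] = sum_n C_n * B_n * A_n**l.
    # Materializing the N x L Vandermonde matrix V[n, l] = A_n**l costs
    # O(NL) time and O(NL) space -- the naive option above.
    V = A[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    return (C * B) @ V                       # (L,) kernel

# Tiny example with made-up parameters
A = np.array([0.9, 0.5])  # diagonal of the state matrix
B = np.array([1.0, 2.0])  # input projection
C = np.array([0.3, 0.7])  # output projection
K = vandermonde_kernel(A, B, C, L=4)
```

The memory-efficient $O(N+L)$ variants compute the same product without ever storing `V`, e.g. by fusing the power and reduction steps in a custom kernel or letting XLA do the fusion.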
