Apparently valid ZIP files can be processed by archive/zip but not zipstream #7

Open
leonardr opened this issue Nov 8, 2024 · 5 comments

leonardr commented Nov 8, 2024

While using zipstream for a project I've encountered a set of ZIP files that pass validation according to tools like zip -t, and which can be processed by archive/zip and mholt/archiver, but which can't be processed by zipstream.

I originally generated these files using archive/zip, but I also tried using mholt/archiver, and got slightly different files that triggered the same problem. So I don't think the problem is with the generator.

Below is a standalone Go program that opens the file from disk and runs it through archive/zip successfully. It then opens the same file and tries to run it through zipstream.

The file itself is confidential, but I can share it with you privately if you're interested.

Author

leonardr commented Nov 8, 2024

package main

import (
	"archive/zip"
	"errors"
	"fmt"
	"io"
	"os"

	"github.com/krolaw/zipstream"
)

func main() {
	filename := "generated-preview.epub"

	// Process with zip.Reader, which needs random access and the file size.
	reader, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	// Ask the OS for the size rather than hard-coding it
	// (10969298 bytes for the file in this report).
	info, err := reader.Stat()
	if err != nil {
		panic(err)
	}
	archive, err := zip.NewReader(reader, info.Size())
	if err != nil {
		panic(err)
	}
	for i, file := range archive.File {
		fmt.Printf("%d %s\n", i, file.Name)
	}
	fmt.Println()

	// Process again with zipstream, which reads the entries sequentially.
	stream, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer stream.Close()

	zipReader := zipstream.NewReader(stream)
	i := 0
	for {
		header, err := zipReader.Next()
		// Print before checking err so the failing iteration is visible;
		// the header shows up as <nil> when Next returns an error.
		fmt.Printf("%d %v\n", i, header)
		if errors.Is(err, io.EOF) {
			break
		} else if err != nil {
			panic(err)
		}
		i++
	}
}

Author

leonardr commented Nov 8, 2024

Here's the result of running the Go program. The checksum-looking things are the names of files in the ZIP.

$ go run parse_bad_epub.go
0 mimetype
1 META-INF/container.xml
2 content.opf
3 toc.ncx
4 toc.xhtml
5 8249cb941cbbe8eae0c6bdb21daf85c2
6 45ab87f1e4edc2e4ccd2752b72a95ef3
7 ad7ee2a691ec56748153045b06a0512f
8 a6712fc22957d8263167a41c6adca364
9 4d72ccaabf26b4b9c3b60babb888b244
10 b8a67e75254fb035d32ba4e7caaf7a03
11 c9fc6c09706e32b4d717d3fd66d9efcf
12 d2bd1316d10cbd527e7638624453e679
13 69df0bff4774856e57fb2d400f09ab3c
14 7e79e158191aced4a730a0792b431058
15 1b03c86912b64d3c3c7950307694606a
16 1dbc18185c65386017058d8cae42a8f1
17 148a9dc7bbf717d7717f552cd433f010
18 97789d8c93a4edbc2524c77c7a12f2a4
19 93946ba61271df2c95af191911cba2cd
20 e83d5ef1a9af99afa5c54d84f1b4784b
21 b7d503b3325dec5e071fe2d0d2fa620c
22 dfbdaae468ab26547e5ff66bf1a00836
23 6f7f9751ada36d43432be0cb9dc1b539
24 febb46eeb1d396e0a8e0b5066928b730
25 2bd1683e32bc7ad87d2a2524777a6be4
26 7d950af25f5571dbff69129506fa625c
27 6c946b0247666d7d05ffd05e0f48ceb5
28 5557938a653af2dc118545e908fd05a9
29 ad2de6a465fa23d99d6c78d56b78cf83
30 bfc4bb134bf9ab7180386f522446334b
31 6681d84796dbdaeaa98899492c24d549
32 e656a89128234a2455f198ffbf0da273
33 fb2ff6a36224e388f042a17d756c8b61
34 ba4affce47c9caa76e7af620a3af7f3e
35 bd3fb71acf3d6993fb8c2c473fb388cd
36 80164d992d6618fc3dde37fcad264cae
37 55105de1b413a8be20a2fd3417b7046e
38 8517547f7b5b16f00bdcd038b7269578
39 4ee47ec12360824f9fe3948611ae3ec8
40 d534455469648cff27ec3ede2812c097
41 da948a2e89e2f78ee1f128faebe7a7f7
42 fdedfa2b17b900e7068fe8f7e80bb835
43 564bcd02d34b0648228ba079cd9c5a73
44 fc26618ee5b86f84cb9569d1767c8172
45 d41068ba579c47c21fba1e63a6d061af
46 b1f2ebddaaedccea86a7857c110c067c
47 29abcc2d3ed83a89333d574c6b23da09
48 63d44ee4653e9935909b388f681a1b90
49 da4b43f52ccf7f74403e087e39836d39
50 413827c9d245480184708a3975d15d40
51 adb68c1ccd23dfd4e92f40c5f0f5fed0
52 90a361860d5bfbfc8f193a033dded420
53 70776b0f04ce3fde52387829f6fd61bc
54 793e50748b9d96ecf91c8653c3b949d7
55 2332e87109fcf75b98ed65047b4c6787
56 c2e939044d7effbd5f7e6631467d6cf9
57 a54eca75c5acffc6a4c8587df80577b3
58 31fad1d3f9adaf9de8eb8316c3f38a69
59 2e6182a63b2482ccb3f0952c863805af
60 fa4f81dbfe50740cc6903f27789b578f
61 26333aa551575ba06ae37a0dafba5061
62 7010dc29e529d304fff99a498662b5e8
63 b8d011dcad65dfd41e9f366b34df0bf5
64 6a65f7d38439a61769df9d35a6d9a08c
65 0a6a8024a0ef054add47f29a4579e1d7
66 57807073a72aa8a2c1d45d3e25a64bd3
67 ff22ca2e2326ac1868fab89c54b0f60c
68 643a1fc861622ed59ce621496ee13f6d
69 642b21dacc304d9c1594b0f770774b67
70 c78d15faab3d8946bd52a574ed9c4f90
71 981c8b5aaffe5f0e7f4e70eab0806ba3
72 97feaab555833c35dba798187f6a634b
73 b2389fca51b4da2df96b3a52d967fa7c
74 954f6bd3d0e841460ed544b01c15f4d4
75 2c1d2f9c5e87aba08e0df397fb8cbe25
76 2d7c490efec764dbfab95ec0f397474b
77 3d9863884ce9a87f662135e0e4eb0277
78 b334d03fc46a48ec18a48444b51cec79
79 cd6f0cd1dc42f661d797419cc58ec27e
80 a52e2994e90aa7b4ab505cd7dc61c88b
81 4c7181bf79e84722b8eea591fb89853f
82 825e6c1877fe8418d608f3639ef2150b
83 86ea23902cbf6fbdc794c40c487a5e05
84 45c5b2fee49c24f0fba955bb707af238
85 d6139a8426d576f135c55ef5e5f61749
86 f1babf0f920e8623577a2f827a33d5e9
87 e6667c911764a481b742e3ef083b2415
88 2147ff5e9a296ca7e89ebc92d2adece6

0 &{mimetype  false 0 20 8 0 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
1 &{META-INF/container.xml  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
2 &{content.opf  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
3 &{toc.ncx  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
4 &{toc.xhtml  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
5 &{8249cb941cbbe8eae0c6bdb21daf85c2  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
6 &{45ab87f1e4edc2e4ccd2752b72a95ef3  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
7 &{ad7ee2a691ec56748153045b06a0512f  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
8 &{a6712fc22957d8263167a41c6adca364  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
9 &{4d72ccaabf26b4b9c3b60babb888b244  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
10 &{b8a67e75254fb035d32ba4e7caaf7a03  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
11 &{c9fc6c09706e32b4d717d3fd66d9efcf  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
12 &{d2bd1316d10cbd527e7638624453e679  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
13 &{69df0bff4774856e57fb2d400f09ab3c  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
14 &{7e79e158191aced4a730a0792b431058  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
15 &{1b03c86912b64d3c3c7950307694606a  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
16 &{1dbc18185c65386017058d8cae42a8f1  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
17 &{148a9dc7bbf717d7717f552cd433f010  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
18 &{97789d8c93a4edbc2524c77c7a12f2a4  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
19 &{93946ba61271df2c95af191911cba2cd  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
20 &{e83d5ef1a9af99afa5c54d84f1b4784b  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
21 &{b7d503b3325dec5e071fe2d0d2fa620c  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
22 &{dfbdaae468ab26547e5ff66bf1a00836  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
23 &{6f7f9751ada36d43432be0cb9dc1b539  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
24 &{febb46eeb1d396e0a8e0b5066928b730  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
25 &{2bd1683e32bc7ad87d2a2524777a6be4  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
26 &{7d950af25f5571dbff69129506fa625c  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
27 &{6c946b0247666d7d05ffd05e0f48ceb5  false 0 20 8 8 0001-01-01 00:00:00 +0000 UTC 0 0 0 0 0 0 0 [] 0}
28 <nil>
panic: zip: not a valid zip file

Author

leonardr commented Nov 8, 2024

The only other clue I have is that the problem goes away if I change the original generation code to use the Store compression method everywhere instead of the Deflate method.

Owner

krolaw commented Nov 9, 2024

Looking through the code, it looks like if the previous file has a set length in the header AND there's some data inserted between the files, then Next() won't find the header it is expecting, even though the file is still a valid zip. But I don't think archive/zip does that. Could you try creating a minimal zip with just the files that cause the issue? i.e. 87, 88 and 89...

In the meantime, I'll think about getting the code to skip data until it finds the header.

Owner

krolaw commented Nov 9, 2024

You might like to try the modernise branch and see if it helps.
