crawlsite.js crashes on PDFs #10

Closed
minthemiddle opened this issue Jun 7, 2018 · 10 comments

@minthemiddle

When the script reaches a PDF, it crashes.

Example:

(node:23872) UnhandledPromiseRejectionWarning: Error: net::ERR_ABORTED at https://code.design/files/code-design-magazine-001.pdf
    at navigate (/Users/martin/Sites/crawlsite/node_modules/puppeteer/lib/Page.js:539:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:23872) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:23872) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
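
The warnings above show that the rejected navigation promise is never caught. A minimal sketch of guarding that call, assuming the crawler navigates with Puppeteer's `page.goto()`; `safeGoto` is a hypothetical helper and not part of crawlsite.js:

```javascript
// Hypothetical helper: navigate without letting a rejected navigation
// (for example net::ERR_ABORTED on a PDF link) end up as an unhandled rejection.
// `page` is assumed to be a Puppeteer Page, or any object with an async goto().
async function safeGoto(page, url, options = {waitUntil: 'networkidle2'}) {
  try {
    const response = await page.goto(url, options);
    return {ok: true, response, error: null};
  } catch (err) {
    // Hand the failure back to the caller so the crawl can skip this URL.
    return {ok: false, response: null, error: err};
  }
}
```

A caller could check `result.ok` and continue with the next link when it is `false`. This only avoids the crash; it does not tell you why the navigation failed.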
@ebidel
Contributor

ebidel commented Jun 7, 2018

Good catch. Do you have the starting page you were running it on? That'll help me debug.

@minthemiddle
Author

Yes, my non-profit: https://code.design

@aamakerlsa

@ebidel Any progress on the PDF crash issue? This is a really cool project!

@aamakerlsa

aamakerlsa commented Oct 13, 2018

I found a way around this by making the following modification:

.filter(el => el.localName === 'a' && el.href && el.href.indexOf('.pdf') < 0) // element is an anchor with an href.

Basically, it checks that the href of the a tag does NOT contain .pdf.
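
A slightly stricter variant of the same idea, as a sketch: it parses the URL and looks only at the end of the path, so query strings and upper-case extensions are handled. `hrefLooksLikePdf` is a hypothetical name, and the helper assumes an absolute URL (which `el.href` is in the DOM):

```javascript
// Hypothetical helper: true when the URL's path ends in ".pdf" (case-insensitive).
// Unlike indexOf('.pdf'), this ignores ".pdf" appearing elsewhere in the URL.
function hrefLooksLikePdf(href) {
  try {
    return new URL(href).pathname.toLowerCase().endsWith('.pdf');
  } catch (err) {
    // Not a parseable absolute URL; treat it as "not a PDF link".
    return false;
  }
}
```

In the filter above, the last condition could then read `!hrefLooksLikePdf(el.href)`, provided the helper is defined in the same context where the filter runs.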

@ebidel
Contributor

ebidel commented Oct 16, 2018

@aamakerlsa Right, it would be something like that. However, not every PDF link contains ".pdf" in the name :)
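
One way to avoid relying on the file name is to look at the `Content-Type` response header instead. A rough sketch, assuming the header value is already available (for example from a separate HEAD request); `isPdfContentType` is a hypothetical helper:

```javascript
// Hypothetical helper: decide from a Content-Type header value whether
// the resource is declared as a PDF.
function isPdfContentType(contentType) {
  if (typeof contentType !== 'string') return false;
  // Drop parameters such as "; charset=binary" and compare only the media type.
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/pdf';
}
```

With a Puppeteer response object, the value would come from `response.headers()['content-type']`. This is not watertight either: a server can send a PDF with a generic type such as `application/octet-stream`.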

@TruptiM18

Can I work on this issue?

@ebidel
Contributor

ebidel commented Jan 14, 2019

Sure

@TruptiM18

TruptiM18 commented Jan 21, 2019

@ebidel Thanks.
Can we just read the header of the file the href points to, in hex, and figure out whether it is a PDF or not?
Pdf File Format Basic Structure
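
A sketch of that check, assuming the first bytes of the file have already been fetched somehow. PDF files start with the ASCII signature `%PDF-` (hex `25 50 44 46 2D`); `startsWithPdfMagic` is a hypothetical helper that only tests offset 0:

```javascript
// The PDF signature "%PDF-" as raw bytes: 25 50 44 46 2D.
const PDF_MAGIC = Buffer.from('%PDF-', 'latin1');

// Hypothetical helper: true when `bytes` (a Buffer or Uint8Array) begins
// with the PDF signature. Assumes the signature sits at the very first byte.
function startsWithPdfMagic(bytes) {
  const buf = Buffer.isBuffer(bytes) ? bytes : Buffer.from(bytes);
  return buf.subarray(0, PDF_MAGIC.length).equals(PDF_MAGIC);
}
```

Only the first five bytes are needed, so in principle the whole body does not have to be downloaded, but some request for every link would still be required.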

@TruptiM18

Hi @ebidel,
Did you get a chance to look into the above query?
Thanks.

@ebidel
Contributor

ebidel commented Jan 29, 2019

Not sure if that would work but you could try. You'd have to read the response body of every request though :(
