crawlsite.js crashes on PDFs #10

minthemiddle · 2018-06-07T16:01:51Z

When the script reaches a PDF, it crashes.

Example:

(node:23872) UnhandledPromiseRejectionWarning: Error: net::ERR_ABORTED at https://code.design/files/code-design-magazine-001.pdf
    at navigate (/Users/martin/Sites/crawlsite/node_modules/puppeteer/lib/Page.js:539:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:23872) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:23872) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.


ebidel · 2018-06-07T16:55:17Z

Good catch. Do you have the starting page you were running it on? That'll help me debug.

minthemiddle · 2018-06-07T16:59:10Z

Yes, my non-profit: https://code.design

aamakerlsa · 2018-10-13T02:13:52Z

@ebidel any progress on the crash on PDF documents issue... this is a really cool project!

aamakerlsa · 2018-10-13T03:19:40Z

I found a way around this by making this modification:

.filter(el => el.localName === 'a' && el.href && el.href.indexOf('.pdf') < 0) // element is an anchor with an href.

... basically it checks to make sure the href of the a tag does NOT contain .pdf
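
A plain substring test like the one above also skips any link that merely contains ".pdf" somewhere in the URL, and it misses upper-case ".PDF". A slightly tighter variant is sketched below: it parses the URL and looks only at the end of the path. `hasPdfExtension` is a hypothetical helper name, not part of crawlsite.js, and the example.com links are made up for illustration. It still cannot detect PDFs served from URLs without an extension.

```javascript
// Hypothetical helper (not part of crawlsite.js): true when the URL's path
// ends in ".pdf", ignoring case, query strings, and fragments.
function hasPdfExtension(href) {
  try {
    // The WHATWG URL constructor is a global in Node.js and in browsers.
    const {pathname} = new URL(href);
    return pathname.toLowerCase().endsWith('.pdf');
  } catch (err) {
    // Not an absolute URL we can parse; treat it as "not a PDF link".
    return false;
  }
}

// Example links; only the first one comes from the report above.
const links = [
  'https://code.design/files/code-design-magazine-001.pdf',
  'https://example.com/report.PDF?download=1',
  'https://example.com/docs/my.pdf-notes.html',
];

// Keep only the links that do not look like PDFs.
const crawlable = links.filter(href => !hasPdfExtension(href));
```

Here `crawlable` keeps only the third link, which the substring check would have dropped even though it is an HTML page.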

ebidel · 2018-10-16T16:33:33Z

@aamakerlsa Right, it would be something like that. However, not every PDF link contains ".pdf" in the name :)
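
Since no URL filter will catch every PDF, it may also help to keep a rejected navigation from taking down the whole crawl. This is a minimal sketch, assuming the crawler navigates with Puppeteer's `page.goto(url, options)`. `safeGoto` is a hypothetical wrapper, not part of crawlsite.js, and the `waitUntil` value is an assumption about how the crawler is configured.

```javascript
// Hypothetical wrapper (not part of crawlsite.js): resolves to the response
// on success, or to null when navigation rejects (e.g. net::ERR_ABORTED).
async function safeGoto(page, url) {
  try {
    // The waitUntil option is an assumption; use whatever the crawler uses.
    return await page.goto(url, {waitUntil: 'networkidle2'});
  } catch (err) {
    // Log and move on instead of leaving the rejection unhandled.
    console.warn(`Skipping ${url}: ${err.message}`);
    return null;
  }
}
```

A caller would then check for `null` and skip that URL rather than crash.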

TruptiM18 · 2019-01-13T09:19:34Z

Can I work on this issue?

ebidel · 2019-01-14T19:57:36Z

Sure

TruptiM18 · 2019-01-21T02:28:20Z

@ebidel Thanks.
Can we just read the header of the file that the href points to, in hex, and figure out whether it's a PDF file or not?
Pdf File Format Basic Structure
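
To make the idea concrete: a PDF file starts with the ASCII signature `%PDF-`, which is `25 50 44 46 2D` in hex. A sketch of that check on bytes that have already been fetched is below. `startsWithPdfSignature` is a hypothetical helper name, and how the bytes are obtained is left open.

```javascript
// The PDF signature "%PDF-" as raw bytes (hex 25 50 44 46 2D).
const PDF_SIGNATURE = Buffer.from('%PDF-', 'latin1');

// Hypothetical helper: true when the given bytes (a Buffer or Uint8Array)
// begin with the PDF signature.
function startsWithPdfSignature(bytes) {
  const head = Buffer.from(bytes).subarray(0, PDF_SIGNATURE.length);
  return head.equals(PDF_SIGNATURE);
}

startsWithPdfSignature(Buffer.from('%PDF-1.7\n'));        // true
startsWithPdfSignature(Buffer.from('<!doctype html>'));   // false
```

Only the first five bytes are needed, so the whole file does not have to be inspected once a response is in hand.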

TruptiM18 · 2019-01-29T05:30:29Z

Hi @ebidel,
Did you get a chance to look into the above query?
Thanks.

ebidel · 2019-01-29T21:04:02Z

Not sure if that would work but you could try. You'd have to read the response body of every request though :(
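
A different approach, not the one proposed above, would be to look at the `Content-Type` response header before touching the body. The sketch below assumes a plain headers object with lower-case names, which is the shape Puppeteer's `response.headers()` returns. `isPdfContentType` is a hypothetical helper name. Servers can mislabel files (for example as `application/octet-stream`), so this is a heuristic too.

```javascript
// Hypothetical helper: true when the headers declare the PDF media type.
// Expects an object keyed by lower-case header names.
function isPdfContentType(headers) {
  const value = headers['content-type'] || '';
  // Drop parameters such as "; charset=binary" before comparing.
  const mediaType = value.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/pdf';
}

isPdfContentType({'content-type': 'application/pdf'});            // true
isPdfContentType({'content-type': 'text/html; charset=utf-8'});   // false
```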

minthemiddle closed this as completed Jun 11, 2023
