Add support for web scraping #29

ankrgyl · 2022-09-11T22:48:12Z

This change adds the basics required to support scraping:

- A web.py package that wraps selenium/chromium and can extract screenshots
- JavaScript code (find_leaf_nodes.js) that finds all leaf nodes with words in an HTML document. This is in JS specifically so that we can reuse it for other use cases (e.g. a Chrome plugin) where the text gets extracted client-side
- A new WebDocument class (+ a bit of supportive refactoring) that wraps the web driver
- Tests + documentation
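
To make the "leaf nodes with words" idea concrete, here is a rough sketch in Python using only the standard library. It is not the find_leaf_nodes.js from this change and may differ from it in details. It assumes a leaf node is an element with no child elements (void tags like br are ignored) whose text is not just whitespace. The names LeafTextFinder and find_leaf_nodes are made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked as open elements.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling rather than words on the page.
SKIP_TAGS = {"script", "style"}


class LeafTextFinder(HTMLParser):
    """Collects (tag, text) for elements that contain text but no child elements."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, has_child_element, text_parts]
        self.leaves = []  # results in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # assumption: void tags do not count as child elements
        if self.stack:
            self.stack[-1][1] = True  # the parent is no longer a leaf
        self.stack.append([tag, False, []])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        # Simplification: close tags that do not match the innermost open
        # element are ignored, so badly nested HTML is not handled well.
        if not self.stack or self.stack[-1][0] != tag:
            return
        tag, has_child, parts = self.stack.pop()
        text = " ".join(" ".join(parts).split())  # collapse whitespace
        if not has_child and text and tag not in SKIP_TAGS:
            self.leaves.append((tag, text))


def find_leaf_nodes(html):
    """Return a list of (tag, text) pairs for every leaf element with words."""
    finder = LeafTextFinder()
    finder.feed(html)
    finder.close()
    return finder.leaves
```

For example, find_leaf_nodes("<div><h1>Title</h1><p>Hello <b>bold</b> world</p></div>") picks out the h1 and b elements. The p is skipped because it has a child element, which is one consequence of the leaf definition assumed here.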

jagilley · 2022-09-12T16:01:32Z

You're legendary, Impira team, was just trying to implement this functionality myself. FWIW the web scraping part LGTM

ankrgyl · 2022-09-12T16:37:06Z

Nice to hear! Have you taken it for a spin yet?

jagilley · 2022-09-12T18:40:30Z

No, haven't got the chance yet @ankrgyl


ankrgyl · 2022-09-12T19:03:25Z

Okay no worries @jagilley let me know when you do! Merging this in now

jagilley · 2022-09-12T19:33:44Z

Works great, thank you @ankrgyl!

ankrgyl added 14 commits September 10, 2022 15:05

Add basic HTML scraping

6590e4a

Add web scraping support

a93c3e8

Add selenium to dependencies

9135474

Improve a bunch of stuff

8ddb1d8

Improve screenshot logic

79b77d2

Clip the last window

f11df9f

Add tests

c2e5b8c

Add docs and use webdriver-manager

f5b09eb

Remove unused import

280d70c

Use --no-sandbox if running root

b129859

Add find leaf nodes

a40c24c

Switch to Chromium driver

fb78dda

Forgive an invalid session and retry

f4851f8

Fix PDF typo

284abfa

amazingvince approved these changes Sep 12, 2022


Fix flaky tests. We may need to keep an eye on these tests if they stay flaky.

6f0b236

ankrgyl merged commit 0d20beb into main Sep 12, 2022

ankrgyl deleted the web branch September 12, 2022 20:46
