Add support for web scraping #29

Merged
ankrgyl merged 15 commits into main from web
Sep 12, 2022

Conversation

ankrgyl
Contributor

@ankrgyl ankrgyl commented Sep 11, 2022

This change adds the basics required to support scraping:

  • A web.py package that wraps selenium/chromium and can extract screenshots
  • JavaScript code (find_leaf_nodes.js) that finds all leaf nodes containing words in an HTML document. It is written in JS so that we can reuse it in other contexts (e.g. a Chrome plugin) where the text is extracted client-side
  • A new WebDocument class (+ a bit of supportive refactoring) that wraps the web driver
  • Tests + documentation
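
For illustration only (this is not the code in this PR): a minimal Python sketch of the "leaf nodes with words" idea, using only the standard library's html.parser rather than a browser. The names LeafTextFinder and find_leaf_text are hypothetical, and the definition of a leaf used here (an element with no child elements whose text contains at least one word character) is an assumption; find_leaf_nodes.js may define it differently.

```python
import re
from html.parser import HTMLParser

# Elements that never have an end tag; they are ignored when deciding leaf status.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not readable words.
SKIPPED_TAGS = {"script", "style"}
WORD = re.compile(r"\w")


class LeafTextFinder(HTMLParser):
    """Collects (tag, text) for elements with no child elements and some words."""

    def __init__(self):
        super().__init__()
        self._stack = []   # open elements as [tag, text_parts, has_child_element]
        self.leaves = []   # results as (tag, normalized_text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack[-1][2] = True  # the parent now has a child element
        self._stack.append([tag, [], False])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if not any(entry[0] == tag for entry in self._stack):
            return
        # Close any unclosed descendants first, then the matching element.
        while self._stack:
            open_tag, parts, has_child = self._stack.pop()
            self._record(open_tag, parts, has_child)
            if open_tag == tag:
                break

    def _record(self, tag, parts, has_child):
        text = " ".join(" ".join(parts).split())  # collapse whitespace
        if not has_child and tag not in SKIPPED_TAGS and WORD.search(text):
            self.leaves.append((tag, text))


def find_leaf_text(html):
    """Return (tag, text) pairs for leaf elements that contain words."""
    parser = LeafTextFinder()
    parser.feed(html)
    parser.close()
    return parser.leaves


if __name__ == "__main__":
    sample = "<div><h1>Title</h1><p>Hello<br>world</p><span> </span></div>"
    for tag, text in find_leaf_text(sample):
        print(tag, text)
```

One limitation of this sketch: text that sits directly inside an element alongside child elements (mixed content) is dropped, because that element is not a leaf under the definition assumed here.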

@jagilley

You're legendary, Impira team, was just trying to implement this functionality myself. FWIW the web scraping part LGTM

@ankrgyl
Contributor Author

ankrgyl commented Sep 12, 2022

Nice to hear! Have you taken it for a spin yet?

@jagilley

No, haven't got the chance yet @ankrgyl

@ankrgyl
Contributor Author

ankrgyl commented Sep 12, 2022

Okay no worries @jagilley let me know when you do! Merging this in now

@ankrgyl ankrgyl merged commit 0d20beb into main Sep 12, 2022
@jagilley

Works great, thank you @ankrgyl!

@ankrgyl ankrgyl deleted the web branch September 12, 2022 20:46
3 participants