web-crawler

Simple web crawler that follows `<a href>` links within the same domain.
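
To make that concrete, here's a minimal sketch of the same-domain crawl idea (not this repo's actual implementation; it assumes Node 18+ for the global `fetch` and uses a naive regex in place of a real HTML parser):

```ts
// Breadth-first crawl that only follows links on the starting page's origin.
async function crawl(start: string): Promise<Set<string>> {
  const origin = new URL(start).origin;
  const seen = new Set<string>();
  const queue: string[] = [start];

  while (queue.length > 0) {
    const current = queue.shift()!;
    if (seen.has(current)) continue;
    seen.add(current);

    const html = await (await fetch(current)).text();
    // Naive <a href="..."> extraction; a real HTML parser is more robust.
    for (const match of html.matchAll(/<a[^>]+href="([^"]+)"/g)) {
      try {
        const next = new URL(match[1], current); // resolve relative links
        if (next.origin === origin) queue.push(next.href);
      } catch {
        // skip malformed hrefs
      }
    }
  }
  return seen;
}
```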

Parts / Structure

How to ...

... run in local / development mode

  • install Nix via `curl https://nixos.org/nix/install | sh`
  • clone this repo
  • within the repo folder, run the `nix-shell` command

... run lambda

  • go to the `packages/web-crawler-lambda` folder and run `yarn run offline`, or, with logging enabled, `DEBUG=log:parser yarn run offline`
  • after compilation you can navigate to `localhost:8080/get-site-map`
  • to crawl a custom page, use the `page` query string, e.g. `http://localhost:8080/get-site-map?page=http://www.google.com` (see the snippet below)
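
One way to exercise the endpoint programmatically (a hedged sketch; it assumes Node 18+ for the global `fetch` and that the handler responds with JSON):

```ts
// Query the locally running lambda; `page` selects the site to crawl.
async function getSiteMap(page: string): Promise<unknown> {
  const url = `http://localhost:8080/get-site-map?page=${encodeURIComponent(page)}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json(); // assumption: the handler returns JSON
}

getSiteMap("http://www.google.com").then(console.log).catch(console.error);
```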

... run tests and lints

  • unit tests - `yarn test:unit`
  • lints - `yarn lint`
  • integration tests - go to `packages/web-crawler-lambda` and run `yarn run test`

... run watch mode

Just append `--watch` to the test command, i.e. `yarn test:unit --watch`.

... deploy

Setup

AWS Credentials
`serverless config credentials --provider aws --key YOUR_ACCESS_KEY --secret YOUR_SECRET_KEY`

More here.

Some why's

  • nix-shell - to get a fully isolated, reproducible environment with all node.js dependencies, without needing the global flag
  • serverless - easy to set up, test, and deploy, both locally and to the cloud, with no vendor lock-in
  • lambda - there is no point in keeping an instance like EC2 running for an RPC-like call
  • yarn workspaces - to get a clean view of the npm dependency tree and to split the app into more meaningful parts

Limitations / Caveats

This solution is fine for static pages, since it parses the HTML returned by a GET request, but it is not suitable for Single Page Apps (SPAs). In a SPA the frontend takes on more responsibility: dynamic importing of files/scripts has become standard, and such a page cannot be crawled when user interaction is required, e.g. scrolling, where only the visible part of the page is eagerly downloaded and the remaining elements show up only after the user interacts. To make the crawler more robust, a more advanced crawling approach would be needed, one that does not merely parse the HTML page but instead mimics a real browser (i.e. puppeteer).
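
A hedged sketch of that browser-driven approach using puppeteer (not part of this repo; it assumes puppeteer has been added as a dependency):

```ts
import puppeteer from "puppeteer";

// Render the page in a headless browser so dynamically injected
// content exists before we extract the links.
async function crawlSpa(url: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is idle, so lazily loaded scripts have run.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Collect every <a href> as an absolute URL, like the static crawler does.
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href),
    );
    // Keep only same-domain links, matching the crawler's scope.
    const origin = new URL(url).origin;
    return links.filter((link) => new URL(link).origin === origin);
  } finally {
    await browser.close();
  }
}
```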

For zsh users

  • as nix-shell spawns another shell, some of your aliases / commands / paths may not work; if you are a zsh user it should be seamless, assuming your zsh config lives within your home directory, i.e. ~/.zshrc; if not, override ZDOTDIR/zshenv before running nix-shell and apply the necessary config.
