Simple web crawler which follows `<a href>` links within the same domain.
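The core idea, sketched in TypeScript purely for illustration (this is not the repo's actual implementation): fetch a page, pull out its `<a href>` attributes, and keep only URLs on the same domain.

```ts
// Naive same-domain link extractor using built-in Node 18+ APIs (fetch, URL).
// A real parser would handle HTML edge cases that this regex does not.
async function sameDomainLinks(pageUrl: string): Promise<string[]> {
  const base = new URL(pageUrl);
  const html = await (await fetch(pageUrl)).text();
  const links = new Set<string>();
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      const resolved = new URL(match[1], base); // resolves relative hrefs too
      if (resolved.hostname === base.hostname) links.add(resolved.href);
    } catch {
      // skip malformed URLs
    }
  }
  return [...links];
}
```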
`page-parser`
- getting the HTML source of a page and following its links

`web-crawler-lambda`
- lambda service with embedded `page-parser`

`web-crawler-ui`
- basic user interface for `web-crawler`
- #TODO
- get `nixpkg` via `curl https://nixos.org/nix/install | sh`
- clone this repo
- within the folder, run the `nix-shell` command
- go to the folder `packages/web-crawler-lambda` and run `yarn run offline`, or with logging enabled: `DEBUG=log:parser yarn run offline`
- after compilation you can navigate to `localhost:8080/get-site-map`
- to run it against a custom page, use a query string, like so: `http://localhost:8080/get-site-map?page=http://www.google.com`
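For a quick scripted check, something like the snippet below could call the endpoint. It assumes the service is running via `yarn run offline`; the response format is not specified here, so the snippet just prints the raw body.

```ts
// Hypothetical smoke test against the locally running service.
// Top-level await assumes an ES module context (Node 18+).
const target = encodeURIComponent("http://www.google.com");
const res = await fetch(`http://localhost:8080/get-site-map?page=${target}`);
console.log(res.status, await res.text());
```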
unit tests
- `yarn test:unit` (see the test sketch after this list)

lints
- `yarn lint`

integration tests
- go to `packages/web-crawler-lambda` and run `yarn run test`
To run tests in watch mode, just append `--watch` to the test command, i.e. `yarn test:unit --watch`.
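As an illustration of what a unit test behind `yarn test:unit` could look like: the snippet below assumes a jest-style runner and a hypothetical `extractLinks` export with a made-up signature, neither of which is confirmed by the repo.

```ts
import { extractLinks } from "../src/parser"; // hypothetical module path

describe("page-parser", () => {
  it("keeps only links from the same domain", () => {
    const html = '<a href="/about">About</a><a href="http://other.com/x">X</a>';
    // hypothetical signature: extractLinks(html, baseUrl)
    expect(extractLinks(html, "http://example.com")).toEqual([
      "http://example.com/about",
    ]);
  });
});
```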
- set up your AWS credentials (more details below)
- go to `packages/web-crawler-lambda` and run `yarn run deploy`
More about the serverless deploy: configure your credentials with
`serverless config credentials --provider aws --key YOUR_ACCESS_KEY --secret YOUR_SECRET_KEY`
(see the Serverless documentation for more details).
`nix-shell`
- to get a fully isolated, reproducible environment with all `node.js` dependencies, with no global install flag required

`serverless`
- easy to set up, test, and deploy both locally and to the cloud, with no vendor lock-in

`lambda`
- there is no point in holding an instance like `EC2` for an RPC-like call (see the handler sketch after this list)

`yarn workspaces`
- to have a clean view of the `npm` dependency tree and to split the app into more meaningful parts
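For context on the `lambda` point above, a handler for this kind of RPC-like call might look roughly like the sketch below. The `crawl` import, the handler name, and the response shape are all assumptions for illustration, not the repo's actual code.

```ts
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
import { crawl } from "page-parser"; // hypothetical workspace import

// One stateless invocation per request -- nothing to keep warm, unlike EC2.
export async function getSiteMap(
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> {
  const page = event.queryStringParameters?.page;
  if (!page) {
    return { statusCode: 400, body: JSON.stringify({ error: "missing ?page" }) };
  }
  const siteMap = await crawl(page);
  return { statusCode: 200, body: JSON.stringify(siteMap) };
}
```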
This solution is fine for static pages, as it parses the HTML returned by a GET request, but it is not suitable for Single Page Apps. In an SPA the frontend takes on more responsibility: dynamic importing of files/scripts is standard, and such a page cannot be crawled when user interaction is required. With infinite scrolling, for instance, only the visible part of the page is eagerly downloaded, and the remaining elements show up only after the user interacts. To make the crawler more robust, a more advanced crawling algorithm would be needed, one that does not merely parse the HTML page but instead mimics a real browser (i.e. puppeteer).
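A minimal sketch of that headless-browser approach, assuming `puppeteer` is installed (the function name is made up); unlike a plain GET, it lets the page render before collecting links:

```ts
import puppeteer from "puppeteer";

async function renderedLinks(pageUrl: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is quiet so dynamically loaded scripts have run.
    await page.goto(pageUrl, { waitUntil: "networkidle0" });
    // Read hrefs from the rendered DOM, not from the raw HTML source.
    return await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
  } finally {
    await browser.close();
  }
}
```

Infinite scrolling would still need extra handling, e.g. programmatically scrolling the page before reading the DOM.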
- as you will spawn another shell via `nix-shell`, some of your aliases/commands/paths may not work; however, if you are a `zsh` user it should be seamless, assuming your `zsh` config lives in your home directory, i.e. `~/.zshrc`. If not, override `ZDOTDIR`/`zshenv` before running `nix-shell` and apply the necessary config.