web-crawler

Simple web crawler that follows `<a href>` links within the same domain.
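
To make that concrete, here's a minimal sketch of the same-domain crawl idea (not this repo's actual implementation; it assumes Node 18+ for the global `fetch` and uses a naive regex in place of a real HTML parser):

```ts
// Breadth-first crawl that only follows links on the starting page's origin.
async function crawl(start: string): Promise<Set<string>> {
  const origin = new URL(start).origin;
  const seen = new Set<string>();
  const queue: string[] = [start];

  while (queue.length > 0) {
    const current = queue.shift()!;
    if (seen.has(current)) continue;
    seen.add(current);

    const html = await (await fetch(current)).text();
    // Naive <a href="..."> extraction; a real HTML parser is more robust.
    for (const match of html.matchAll(/<a[^>]+href="([^"]+)"/g)) {
      try {
        const next = new URL(match[1], current); // resolve relative links
        if (next.origin === origin) queue.push(next.href);
      } catch {
        // skip malformed hrefs
      }
    }
  }
  return seen;
}
```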

Parts / Structure

How to ...

... run in local / development mode

  • install Nix via `curl https://nixos.org/nix/install | sh`
  • clone this repo
  • within the repo folder, run the `nix-shell` command

... run lambda

  • go to the `packages/web-crawler-lambda` folder and run `yarn run offline`, or, with logging enabled, `DEBUG=log:parser yarn run offline`
  • after compilation you can navigate to `localhost:8080/get-site-map`
  • to crawl a custom page, use the `page` query string, e.g. `http://localhost:8080/get-site-map?page=http://www.google.com` (see the snippet below)
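
One way to exercise the endpoint programmatically (a hedged sketch; it assumes Node 18+ for the global `fetch` and that the handler responds with JSON):

```ts
// Query the locally running lambda; `page` selects the site to crawl.
async function getSiteMap(page: string): Promise<unknown> {
  const url = `http://localhost:8080/get-site-map?page=${encodeURIComponent(page)}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json(); // assumption: the handler returns JSON
}

getSiteMap("http://www.google.com").then(console.log).catch(console.error);
```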

... run tests and lints

  • unit tests - `yarn test:unit`
  • lints - `yarn lint`
  • integration tests - go to `packages/web-crawler-lambda` and run `yarn run test`

... run watch mode

Just append `--watch` to the test command, i.e. `yarn test:unit --watch`.

... deploy

Setup

AWS Credentials
`serverless config credentials --provider aws --key YOUR_ACCESS_KEY --secret YOUR_SECRET_KEY`

More here.

Some why's

  • nix-shell - to get a fully isolated, reproducible environment with all node.js dependencies, without needing the global flag
  • serverless - easy to set up, test, and deploy, both locally and to the cloud, with no vendor lock-in
  • lambda - there is no point in keeping an instance like EC2 running for an RPC-like call
  • yarn workspaces - to get a clean view of the npm dependency tree and to split the app into more meaningful parts

Limitations / Caveats

This solution is fine for static pages, since it parses the HTML returned by a GET request, but it is not suitable for Single Page Apps (SPAs). In a SPA the frontend takes on more responsibility: dynamic importing of files/scripts has become standard, and such a page cannot be crawled when user interaction is required, e.g. scrolling, where only the visible part of the page is eagerly downloaded and the remaining elements show up only after the user interacts. To make the crawler more robust, a more advanced crawling approach would be needed, one that does not merely parse the HTML page but instead mimics a real browser (i.e. puppeteer).
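
A hedged sketch of that browser-driven approach using puppeteer (not part of this repo; it assumes puppeteer has been added as a dependency):

```ts
import puppeteer from "puppeteer";

// Render the page in a headless browser so dynamically injected
// content exists before we extract the links.
async function crawlSpa(url: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is idle, so lazily loaded scripts have run.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Collect every <a href> as an absolute URL, like the static crawler does.
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href),
    );
    // Keep only same-domain links, matching the crawler's scope.
    const origin = new URL(url).origin;
    return links.filter((link) => new URL(link).origin === origin);
  } finally {
    await browser.close();
  }
}
```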

For zsh users

  • as nix-shell spawns another shell, some of your aliases / commands / paths may not work; if you are a zsh user it should be seamless, assuming your zsh config lives within your home directory, i.e. ~/.zshrc; if not, override ZDOTDIR/zshenv before running nix-shell and apply the necessary config.
