clairvoyant

A conservative web spider for downloading HTML pages.

Here, "conservative" means:

Links are gathered only from specified areas of a document, e.g. a few "div"s or "table"s. HTML parsing is based on jsoup.
Only a whitelist directs the crawl path. The spider will not go anywhere else, which strictly limits the search boundary.
Simple to use: locate the jar, run java -jar clairvoyant-assembly-1.0.jar to see the help information, write a JSON file, then go.
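
The whitelist idea above can be illustrated with a minimal Java sketch that uses only the standard library. It is not clairvoyant's actual implementation: the class name `WhitelistSketch`, the method `isAllowed`, the example.com patterns, and the choice of full-string matching are all assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

public class WhitelistSketch {
    // Hypothetical whitelist: a URL is followed only if it matches one of these patterns.
    static final List<Pattern> WHITELIST = List.of(
        Pattern.compile("https://example\\.com/articles/\\d+\\.html"),
        Pattern.compile("https://example\\.com/index(\\?page=\\d+)?")
    );

    // Returns true when the whole URL matches at least one whitelist pattern.
    // (Full-string matching is an assumption; the real tool may match differently.)
    public static boolean isAllowed(String url) {
        return WHITELIST.stream().anyMatch(p -> p.matcher(url).matches());
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://example.com/articles/42.html")); // true
        System.out.println(isAllowed("https://example.com/about.html"));       // false
    }
}
```

Any URL that matches no pattern is simply never queued, which is what keeps the crawl inside its boundary.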

spider format

start: starting URLs
concurrency: maximum number of concurrent threads
delay: the wait in milliseconds before the next crawl operation
timeout: connection timeout
filters: a list of filter tuples, each a pair of (a valid regex for matching URLs, a jQuery-style selector designating an area of the HTML page)
store: a local directory in which to store crawled HTML pages
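
As a rough illustration, a spider file using the fields above might look like the following. Only the key names come from the list. The value types, the representation of each filter tuple as a two-element array, the millisecond unit for timeout, and every URL, selector, and path are assumptions. The help output of java -jar clairvoyant-assembly-1.0.jar remains the authoritative reference for the exact layout.

```json
{
  "start": ["https://example.com/index"],
  "concurrency": 4,
  "delay": 1000,
  "timeout": 5000,
  "filters": [
    ["https://example\\.com/index.*", "div.list"],
    ["https://example\\.com/articles/.*", "div#content"]
  ],
  "store": "/tmp/clairvoyant-pages"
}
```

In this sketch, links on index pages would be collected only from inside div.list, and links on article pages only from inside div#content.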
