A conservative web spider for downloading HTML pages.
Here, "conservative" means:
- Links are gathered only from specified areas of a document, e.g. a few "div"s or "table"s. HTML parsing is based on jsoup.
- Only a white list directs the crawling path. The spider follows no URL outside it, which strictly limits the search boundary.
- Simple to use: locate the jar and type
java -jar clairvoyant-assembly-1.0.jar
to see the help information, write a JSON configuration, then go.
The JSON configuration has the following fields:
- start: starting URLs
- concurrency: maximum concurrent threads
- delay: the wait in milliseconds before the next crawl operation
- timeout: connection timeout
- filters: a list of filters, each a
TUPLE(a valid regex for matching URLs, a jQuery-style selector designating an area in the HTML page)
- store: a local directory in which crawled HTML pages are stored
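
To illustrate how the filters act as a white list, here is a minimal Java sketch. It is not taken from the clairvoyant source: the class FilterSketch, its methods, and the example.com patterns are hypothetical, and it assumes a regex must match the whole URL and that the first matching filter wins. A URL that matches no filter is simply not crawled; a URL that does match yields the selector of the area from which links would be gathered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of the filter TUPLE idea; names are illustrative only.
class FilterSketch {
    // One filter: a URL regex paired with the selector of the area to read links from.
    private static final class Filter {
        final Pattern url;
        final String selector;
        Filter(String urlRegex, String selector) {
            this.url = Pattern.compile(urlRegex);
            this.selector = selector;
        }
    }

    private final List<Filter> filters = new ArrayList<>();

    FilterSketch add(String urlRegex, String selector) {
        filters.add(new Filter(urlRegex, selector));
        return this;
    }

    // Returns the selector of the first filter whose regex matches the whole URL
    // (an assumption of this sketch), or empty when the URL is outside the white list.
    Optional<String> selectorFor(String url) {
        for (Filter f : filters) {
            if (f.url.matcher(url).matches()) {
                return Optional.of(f.selector);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        FilterSketch white = new FilterSketch()
            .add("https?://example\\.com/news/.*", "div.content")
            .add("https?://example\\.com/archive/\\d+", "table.listing");
        // Inside the white list: links would be gathered from div.content only.
        System.out.println(white.selectorFor("http://example.com/news/today.html"));
        // Outside the white list: the spider would not go there.
        System.out.println(white.selectorFor("http://example.com/about.html"));
    }
}
```

In a real crawl the returned selector would be handed to the HTML parser (jsoup here) so that only links inside the selected area are queued.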
ShiZhan (c) 2013 Apache License Version 2.0