This is a Scrapy project that crawls .onion websites over the Tor network. For each page it saves the h1 and h2 headings, title, domain, URL, plain HTML and words to Elasticsearch or to plain JSON.
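Once pages have been indexed (see the setup below), the stored documents can be queried straight from Elasticsearch. A minimal example against the crawl index created later in this README, searching the title field listed above (the query term is just a placeholder):

$ curl -XGET "localhost:9200/crawl/_search?q=title:example&pretty"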
- Install Elasticsearch and Oracle Java
- Install Python 2.7
- Install Tor (use the zero-circuit patch)
- Install the Polipo HTTP proxy
- Install I2P (needed only for the i2pSpider)
Install the required system packages:
$ sudo apt-get install git
$ sudo apt-get install python-virtualenv
$ sudo apt-get install python-pip
$ sudo apt-get install libffi-dev
$ sudo apt-get install python-dev libxml2-dev libxslt-dev
$ sudo apt-get install libssl-dev
$ sudo apt-get install python-twisted
$ sudo apt-get install python-simplejson
$ sudo apt-get install gcc
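The same packages can also be installed with a single command:

$ sudo apt-get install git python-virtualenv python-pip libffi-dev python-dev libxml2-dev libxslt-dev libssl-dev python-twisted python-simplejson gcc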
Set up the Python environment inside the onionElasticBot directory:
$ cd onionElasticBot
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
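To confirm the environment is ready (this assumes requirements.txt installs Scrapy, which the crawl commands below depend on), check that Scrapy is available inside the virtualenv:

$ scrapy version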
Set up Polipo:
$ sudo apt-get install polipo
$ sudo cp polipo_conf /etc/polipo/config
$ sudo service polipo restart
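To check that the proxy chain works, you can fetch the Tor check page, first directly through Tor's SOCKS port and then through Polipo. This assumes Tor's default SOCKS port 9050 and Polipo's default listening port 8123; the repository's polipo_conf may use different values.

$ curl -s --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/ | grep -i congratulations
$ curl -sI --proxy http://127.0.0.1:8123 http://check.torproject.org/

Any HTTP response from the second command means Polipo is listening and able to fetch pages.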
Create the Elasticsearch index and mappings:
$ curl -XPUT -i "localhost:9200/crawl/" -d "@./mappings.json"
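To verify that the mappings were applied, read them back:

$ curl -XGET "localhost:9200/crawl/_mapping?pretty"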
Run the crawler (any of the following):
$ scrapy crawl OnionSpider -s DEPTH_LIMIT=100
or
$ scrapy crawl i2pSpider -s DEPTH_LIMIT=100 -s LOG_LEVEL=DEBUG -s ELASTICSEARCH_TYPE=i2p
or
$ scrapy crawl i2pSpider -s DEPTH_LIMIT=1 -s ROBOTSTXT_OBEY=0 -s ELASTICSEARCH_TYPE=i2p
or
$ scrapy crawl OnionSpider -o items.json -t json
or
$ scrapy crawl OnionSpider -s DEPTH_LIMIT=1 -s ALLOWED_DOMAINS=/home/juha/allowed_domains.txt -s TARGET_SITES=/home/juha/seed_list.txt -s ELASTICSEARCH_TYPE=targetitemtype
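ALLOWED_DOMAINS and TARGET_SITES point to plain-text files. The exact format expected by the spiders should be checked against the spider code, but a plausible layout is one entry per line, for example (placeholder addresses):

$ cat /home/juha/allowed_domains.txt
exampleonionsite1234.onion
$ cat /home/juha/seed_list.txt
http://exampleonionsite1234.onion/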
Run the crawler forever:
$ bash runforever.sh OnionSpider 3600
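runforever.sh ships with the repository; a minimal sketch of a script with the same interface is shown below, assuming the first argument is the spider name and the second the number of seconds to sleep between crawls (the actual script may differ, e.g. by using the value as a crawl time limit instead):

#!/bin/bash
# Usage: bash runforever.sh <spider> <pause-seconds>
# Re-runs the given spider in an endless loop, pausing between crawls.
SPIDER="$1"
PAUSE="${2:-3600}"
while true; do
    scrapy crawl "$SPIDER"
    sleep "$PAUSE"
done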