diff --git a/README.md b/README.md
index faf2881..f5c7778 100644
--- a/README.md
+++ b/README.md
@@ -6,13 +6,17 @@
 ![index](index.png)
 ![search](search.png)
-1. Install the required components
+1. Dependencies
 --------
 + python 2.7 and pip
 + mongodb
-+ scrapy (pip install scrapy)
-+ flask (pip install Flask)
-+ pymongo (pip install pymongo)
++ scrapy
++ flask
++ pymongo
++ Elasticsearch (search engine, optional)
+
+ [Installation steps on Ubuntu (click here)](install.md)
+
 2. Crawler
 --------
@@ -39,7 +43,7 @@
 + Search: open http://localhost:5000 in a browser to search the vulnerabilities; separate multiple keywords with spaces.
-+ For full-text search, if Elasticsearch is installed and enabled it speeds up the queries; otherwise MongoDB's built-in search is used. For installation and setup see [Installing Elasticsearch](elasticsearch_install.md).
++ By default MongoDB's database search is used, which is slow for full-text queries; installing the Elasticsearch search engine is recommended. [How to install and configure Elasticsearch (click here)](elasticsearch_install.md)
 4. Create indexes for the MongoDB database
 --------
@@ -53,30 +57,31 @@
 db.wooyun_drops.ensureIndex({"datetime":1})
 5. Virtual machines
 ------
-+ VM 1: all wooyun vulnerability and knowledge-base content crawled at the end of June 2016, 30 GB in total (about 11 GB compressed). Download: [http://pan.baidu.com/s/1kVdJuNd](http://pan.baidu.com/s/1kVdJuNd), extraction password hn9d (updated 8.3)
++ VM 1: all wooyun vulnerability and knowledge-base content crawled at the end of June 2016, with Elasticsearch search integrated, 35 GB in total (about 14 GB compressed). Download: [http://pan.baidu.com/s/1kVtY2rX](http://pan.baidu.com/s/1kVtY2rX), extraction password: 5ik7 (updated 8.5)
 Usage:
	1. The archive extracts to a VMware virtual machine image that can be opened and run directly in VMware;
	2. Because the VM was "suspended" when the archive was made, its IP address may not match the host's subnet; reboot the VM to obtain a new IP address. The VM credentials are hancool/qwe123;
-	3. Enter the wooyun_public directory and update to the latest code with git pull;
+	3. Enter the wooyun_public directory and update to the latest code with git pull (if git reports a merge conflict, run git reset --hard origin/master first, then git pull);
	4. Enter the wooyun_public/flask directory and run ./app.py;
	5. Open a browser at http://ip:5000, where ip is the VM's network address (check with ifconfig eth0)
-+ VM 2: a packaged virtual machine with all components and the program installed (without crawled content, about 980 MB). Download: [http://pan.baidu.com/s/1sj67KDZ](http://pan.baidu.com/s/1sj67KDZ), password: bafi
++ VM 2: a packaged virtual machine with all components and Elasticsearch search installed, without crawled content, about 2.3 GB compressed (content cannot be crawled while wooyun is closed for its upgrade). Download: [http://pan.baidu.com/s/1nvrS3zj](http://pan.baidu.com/s/1nvrS3zj), extraction password: 2290 (updated 8.5)
 Usage:
-	1. Import the virtual machine with VMware or VirtualBox
-	2. Log in with username hancool, password qwe123
-	3. Enter the wooyun_public directory and update to the latest code with git pull
+	1. The archive extracts to a VMware virtual machine image that can be opened and run directly in VMware;
+	2. Because the VM was "suspended" when the archive was made, its IP address may not match the host's subnet; reboot the VM to obtain a new IP address. The VM credentials are hancool/qwe123;
+	3. Enter the wooyun_public directory and update to the latest code with git pull (if git reports a merge conflict, run git reset --hard origin/master first, then git pull);
	4. Enter the wooyun and wooyun_drops directories under wooyun_public and run the crawler to fetch all data with a local offline cache: scrapy crawl wooyun -a page_max=0 -a local_store=true -a update=true
	5. Enter the flask directory under wooyun_public and run ./app.py to start the web service
	6. Open a browser at http://ip:5000, where ip is the VM's network address (check with ifconfig eth0)
-### 6. Other
+6. Other
+--------
+ This program is for technical research and personal use only. All components are open-source software; the vulnerabilities and knowledge base come from wooyun's public disclosures, copyright wooyun.org.
diff --git a/elasticsearch_install.md b/elasticsearch_install.md
index 34307bc..e5c7442 100644
--- a/elasticsearch_install.md
+++ b/elasticsearch_install.md
@@ -1,30 +1,30 @@
 Elasticsearch Install
 =============================
-Full-text search with MongoDB is slow and memory-hungry; one solution is the Elasticsearch engine, syncing the data to Elasticsearch with mongo-connector for fast searching.
+Full-text search with MongoDB is slow and memory-hungry; the solution is the Elasticsearch engine, syncing the data to Elasticsearch with mongo-connector for fast searching.
-By default Elasticsearch tokenizes Chinese text character by character, which makes Chinese queries painful. The IK plugin is now the standard Chinese tokenizer, but after repeated install attempts the results are still not ideal. Something may be misconfigured; pointers from experts are welcome.
-
-Install Elasticsearch (via apt-get)
+Install Elasticsearch
 --------
-1. Install the package repository
+
+1. Install the JDK (or JRE)
```bash
-wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
-echo "deb https://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
+sudo apt-get install openjdk-7-jdk
```
-2. Install the JDK and elasticsearch
+2. Download elasticsearch
```bash
-sudo apt-get update
-sudo apt-get install openjdk-7-jdk elasticsearch
+wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz
+tar xvf elasticsearch-2.3.4.tar.gz
```
-3. Add elasticsearch to the system startup services
+
+3. Run elasticsearch
```bash
-sudo update-rc.d elasticsearch defaults 95 10
-sudo /etc/init.d/elasticsearch start
+cd elasticsearch-2.3.4/bin
+./elasticsearch
```
+
 4. Test it: once installed and running, elasticsearch listens on port 9200
```bash
@@ -66,16 +66,6 @@ rs.initiate( {"_id" : "rs0",
"version" : 1, "members" : [ { "_id" : 0, "host" : 3,搭建好replicSet之后,退出mongo shell重新登录,提示符会变成:rs0:PRIMARY>,就可以退出Mongodb -安装mongo-connector,将数据同步到elasticsearch -------- - -```bash -sudo pip install mongo-connector elastic2_doc_manager -sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager -``` -显示Logging to mongo-connector.log.后将会把mongodb数据库的信息同步到elasticsearch中,完全同步完成估计需要10-15分钟时间,同步期间不能中断,否则可能导致elasticsearch与mongodb数据不一致。 - - 安装中文分词插件elasticsearch-analysis-ik ------- @@ -83,7 +73,7 @@ sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manage ```bash cd ~ -sudo apt-get install unzip wget +sudo apt-get install unzip wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.4/elasticsearch-analysis-ik-1.9.4.zip unzip elasticsearch-analysis-ik-1.9.4.zip ``` @@ -91,45 +81,57 @@ unzip elasticsearch-analysis-ik-1.9.4.zip 2、将插件复制到elasticsearch的plugins目录 ```bash -sudo cp -R ~/elasticsearch-analysis-ik/ /usr/share/elasticsearch/plugins -sudo chmod +rx /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik +cp -r elasticsearch-analysis-ik elasticsearch-2.3.4/plugins ``` 3、修改elasticsearch.yml配置,定义插件配置 ```bash -sudo vi /etc/elasticsearch/elasticsearch.yml +vi elasticsearch-2.3.4/config/elasticsearch.yml ``` 在最后增加: - index: - analysis: - analyzer: - ik_syno: - type: custom - tokenizer: ik_max_word - filter: [my_synonym_filter] - ik_syno_smart: - type: custom - tokenizer: ik_smart - filter: [my_synonym_filter] - filter: - my_synonym_filter: - type: synonym - synonyms_path: analysis/synonym.txt - -同时,增加一个空的analysis/synonym.txt文件: + index.analysis.analyzer.ik.type : 'ik' + index.analysis.analyzer.default.type : 'ik' + +4、退出并重启elasticsearch ```bash -sudo mkdir /etc/elasticsearch/analysis -sudo touch /etc/elasticsearch/analysis/synonym.txt + elasticsearch-2.3.4/bin/elasticsearch -d + (-d表示以后台方式运行) ``` -4、重启elasticsearch +安装mongo-connector,将数据同步到elasticsearch +------- ```bash -sudo service elasticsearch restart 
+sudo pip install mongo-connector elastic2_doc_manager
+sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager
+```
+Once "Logging to mongo-connector.log." is shown, the MongoDB data is synced to elasticsearch. A full sync takes roughly 30 minutes and must not be interrupted, or elasticsearch and MongoDB may end up inconsistent.
+
+During the sync you may hit this error:
+
+```bash
+OperationFailed: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))
+2016-08-04 17:24:53,372 [ERROR] mongo_connector.oplog_manager:633 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(u'127.0.0.1', 27017), u'local'), u'oplog.rs')
+2016-08-04 17:24:54,371 [ERROR] mongo_connector.connector:304 - MongoConnector: OplogThread unexpectedly stopped! Shutting down
+```
+
+#### Solution:
+
+Raise the timeout from the default 10 to 200
+
+```bash
+sudo vi /usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py
+```
+ Change:
+	self.elastic = Elasticsearch(hosts=[url],**kwargs.get('clientOptions', {}))
+
+ to:
+	self.elastic = Elasticsearch(hosts=[url],timeout=200, **kwargs.get('clientOptions', {}))
+
+
 Enable full-text search
 -------
 1. Install elasticsearch-py
@@ -149,7 +151,7 @@ git pull
```bash
 vi ~/wooyun_public/flask/app.py
 Change:
-	SEARCH_BY_ES = True
+	SEARCH_BY_ES = 'auto'
```
 References
 -------
 3. [http://es.xiaoleilu.com](http://es.xiaoleilu.com)
-4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
\ No newline at end of file
+4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
+
+5. [https://segmentfault.com/a/1190000002470467](https://segmentfault.com/a/1190000002470467)
+
+6. [https://github.com/medcl/elasticsearch-analysis-ik/issues/207](https://github.com/medcl/elasticsearch-analysis-ik/issues/207)
+
+7. [https://github.com/mongodb-labs/mongo-connector/wiki/Usage%20with%20ElasticSearch](https://github.com/mongodb-labs/mongo-connector/wiki/Usage%20with%20ElasticSearch)
\ No newline at end of file
diff --git a/flask/app.py b/flask/app.py
index 3863fea..d96fca7 100755
--- a/flask/app.py
+++ b/flask/app.py
@@ -3,6 +3,7 @@
 import math
 import re
 import time
+import urllib2
 import pymongo
 from flask import Flask, request, session, g, redirect, url_for, abort, render_template, flash
 # setting:
@@ -12,9 +13,12 @@
 MONGODB_COLLECTION_BUGS = 'wooyun_list'
 MONGODB_COLLECTION_DROPS = 'wooyun_drops'
 ROWS_PER_PAGE = 20
-#search engine,if has install elasticsearch and mongo-connector,please use elasicsearch for full text search
-#else set False
-SEARCH_BY_ES = False
+ELASTICSEARCH_HOST = 'localhost:9200'
+# ELASTICSEARCH CHOICE
+#   auto: detect elasticsearch automatically; use it if it is up, otherwise use mongodb
+#   yes:  always use elasticsearch
+#   no:   never use elasticsearch
+SEARCH_BY_ES = 'auto'
 # flask app:
 app = Flask(__name__)
 app.config.from_object(__name__)
@@ -77,8 +81,7 @@
 def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):
     # get the page rows
     if page >= 1 :
         row_start = (page - 1) * app.config['ROWS_PER_PAGE']
-    #get elasticsearch in localhost:9200
-    es = Elasticsearch()
+    es = Elasticsearch([app.config['ELASTICSEARCH_HOST'],])
     if keywords.strip() == '':
         query_dsl = {
             "query": {
@@ -97,7 +100,7 @@
             "query": {
                 "filtered": {
                     "query": {
-                        "match":{
+                        "match": {
                             field_name : {
                                 'query':keywords,
                                 'operator':'and'
@@ -125,6 +128,15 @@
     return page_info
+def check_elastichsearch_open():
+    try:
+        html = urllib2.urlopen('http://%s' %app.config['ELASTICSEARCH_HOST']).read()
+        if len(html) > 0:
+            return True
+        else:
+            return False
+    except:
+        return False
 def get_wooyun_total_count():
     client = pymongo.MongoClient(connection_string)
@@ -153,9 +165,8 @@ def search():
     content_search_by = request.args.get('content_search_by', 'by_bugs')
     if page < 1:
         page = 1
-    #if there is elasticsearch config ,then the fulltext search by es
-    #else by mongodb search
-    if app.config['SEARCH_BY_ES'] is True and search_by_html is True:
+    # search by elasticsearch or mongodb
+    if app.config['SEARCH_BY_ES'] == 'yes' or ( app.config['SEARCH_BY_ES'] == 'auto' and check_elastichsearch_open() is True ):
         page_info = search_mongodb_by_es(keywords, page, content_search_by, search_by_html)
     else:
         page_info = search_mongodb(keywords, page, content_search_by, search_by_html)
diff --git a/install.md b/install.md
new file mode 100644
index 0000000..74a688a
--- /dev/null
+++ b/install.md
@@ -0,0 +1,39 @@
+Installing wooyun_public on Ubuntu
+=============================
+
+The steps below were tested on Ubuntu 14.04 and 16.04. Dependencies to install:
+
++ python 2.7 and pip
++ mongodb
++ scrapy
++ flask
++ pymongo
+
+Steps
+--------
+1. Install python, pip, and mongodb
+
+```bash
+sudo apt-get install python python-pip mongodb
+```
+2. Install scrapy
+
+```bash
+# If installing scrapy fails, first install the packages below with apt-get,
+# then pip install lxml; after that scrapy installs normally
+sudo apt-get install libxml2-dev libxslt1-dev python-dev zlib1g-dev libevent-dev python-openssl
+
+sudo pip install lxml
+sudo pip install scrapy
+```
+3. Install flask and pymongo
+
+```bash
+sudo pip install flask pymongo
+```
+4. Clone the source from github
+
+```bash
+git clone https://github.com/hanc00l/wooyun_public
+```
+
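The backend-selection logic that the flask/app.py hunks in this patch introduce can be sketched in isolation. This is an illustrative rendering, not code from the repository: it uses Python 3 (`urllib.request`) where the project itself targets Python 2 and `urllib2`, and the helper names `elasticsearch_is_up` and `choose_backend` are hypothetical stand-ins for `check_elastichsearch_open` and the branch added in `search()`.

```python
import urllib.request

ELASTICSEARCH_HOST = 'localhost:9200'  # same default the patch adds to app.py
SEARCH_BY_ES = 'auto'                  # 'auto' | 'yes' | 'no', as documented in the patch

def elasticsearch_is_up(host=ELASTICSEARCH_HOST, timeout=3):
    """Probe Elasticsearch's HTTP port; any response body counts as 'up'."""
    try:
        body = urllib.request.urlopen('http://%s' % host, timeout=timeout).read()
        return len(body) > 0
    except Exception:
        # connection refused, timeout, DNS failure: treat as 'down'
        return False

def choose_backend(setting, es_up):
    """Mirror the condition added in search(): use Elasticsearch when forced
    ('yes') or when 'auto' and the probe succeeded; otherwise MongoDB."""
    if setting == 'yes' or (setting == 'auto' and es_up):
        return 'elasticsearch'
    return 'mongodb'
```

With this shape, `SEARCH_BY_ES = 'auto'` degrades gracefully: the app keeps working against MongoDB whenever Elasticsearch is not listening, which is why the patch changes the setting from the old `True`/`False` boolean to a three-valued string.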