Use elasticsearch for fast search
hanc00l committed Aug 5, 2016
1 parent 4a443db commit 79ae32d
Showing 4 changed files with 134 additions and 71 deletions.
29 changes: 17 additions & 12 deletions README.md
@@ -6,13 +6,17 @@
![index](index.png)
![search](search.png)

1. Install the required components
1. Dependencies
--------
+ python 2.7 and pip
+ mongodb
+ scrapy (pip install scrapy)
+ flask (pip install Flask)
+ pymongo (pip install pymongo)
+ scrapy
+ flask
+ pymongo
+ Elasticsearch (search engine, optional)

[Installation on Ubuntu (click here)](install.md)


2. Crawler
--------
@@ -39,7 +43,7 @@

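+ Search: open http://localhost:5000 in a browser to search for vulnerabilities; separate multiple keywords with spaces.

+ For full-text search, installing and enabling Elasticsearch improves search efficiency; otherwise mongodb's built-in search is used. For installation and setup, see [Installing Elasticsearch](elasticsearch_install.md)
+ By default, searches run against the mongodb database, which is slow for full-text search; installing the Elasticsearch search engine is recommended. [How to install and configure Elasticsearch (click here)](elasticsearch_install.md)

4. Create indexes for the mongodb database
--------
@@ -53,30 +57,31 @@ db.wooyun_drops.ensureIndex({"datetime":1})
5. Virtual machines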
------

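+ VM 1: contains the full wooyun vulnerability database and knowledge base as crawled at the end of June 2016, 30G in total (about 11G compressed). Download: [http://pan.baidu.com/s/1kVdJuNd](http://pan.baidu.com/s/1kVdJuNd), extraction password: hn9d (updated Aug 3)
+ VM 1: contains the full wooyun vulnerability database and knowledge base as crawled at the end of June 2016, with Elasticsearch search integrated, 35G in total (about 14G compressed). Download: [http://pan.baidu.com/s/1kVtY2rX](http://pan.baidu.com/s/1kVtY2rX), extraction password: 5ik7 (updated Aug 5)

Usage:
1. The archive extracts to a vmware virtual machine image that vmware can open and run directly;
2. The VM was in the "suspended" state when the archive was made, so its current IP address may not be in the same range as the host's; reboot the VM to obtain a new IP address. The VM username/password is hancool/qwe123;
3. Enter the wooyun_public directory and first update to the latest code with git pull;
3. Enter the wooyun_public directory and first update to the latest code with git pull (if a merge conflict is reported, run git reset --hard origin/master and then git pull again)
4. Enter the wooyun_public/flask directory and run ./app.py;
5. Open a browser and go to http://ip:5000, where ip is the address of the VM's network interface (check it with ifconfig eth0)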

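+ VM 2: a packaged VM with all components and programs installed (no content included, about 980M). Download: [http://pan.baidu.com/s/1sj67KDZ](http://pan.baidu.com/s/1sj67KDZ), password: bafi
+ VM 2: a packaged VM with all components and elasticsearch search installed, no content included, about 2.3G compressed (wooyun is still closed for an upgrade, so content cannot be crawled). Download: [http://pan.baidu.com/s/1nvrS3zj](http://pan.baidu.com/s/1nvrS3zj), extraction password: 2290 (updated Aug 5)

Usage:
1. Import the VM with vmware or virtualbox
2. Log in with username hancool, password qwe123
3. Enter the wooyun_public directory and first update to the latest code with git pull
1. The archive extracts to a vmware virtual machine image that vmware can open and run directly;
2. The VM was in the "suspended" state when the archive was made, so its current IP address may not be in the same range as the host's; reboot the VM to obtain a new IP address. The VM username/password is hancool/qwe123;
3. Enter the wooyun_public directory and first update to the latest code with git pull (if a merge conflict is reported, run git reset --hard origin/master and then git pull again);
4. Enter the wooyun and wooyun_drops directories under wooyun_public in turn and run the crawler to fetch data (crawl everything and cache it locally for offline use): scrapy crawl wooyun -a page_max=0 -a local_store=true -a update=true
5. Enter the flask directory under wooyun_public and run ./app.py to start the web service
6. Open a browser and go to http://ip:5000, where ip is the address of the VM's network interface (check it with ifconfig eth0)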


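### 6. Other
6. Other
--------

+ This program is for technical research and personal use only. All of its components are open source; the vulnerabilities and knowledge base come from wooyun's publicly disclosed vulnerabilities, and copyright belongs to wooyun.org.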

108 changes: 58 additions & 50 deletions elasticsearch_install.md
@@ -1,30 +1,30 @@
Elasticsearch Install
=============================

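Full-text search in mongodb is slow and memory-hungry; one solution is to use the elasticsearch engine: sync the data to elasticsearch with mongo-connector, then run fast searches there.
Full-text search in mongodb is slow and memory-hungry; the solution is to use the elasticsearch engine: sync the data to elasticsearch with mongo-connector, then run fast searches there.

By default, elasticsearch tokenizes Chinese text one character at a time, which makes Chinese queries very painful. Chinese search nowadays mostly relies on the IK plugin for word segmentation; after repeated installation and testing, the results are still not ideal. Something may be misconfigured, so pointers from more experienced users are welcome.

Installing elasticsearch (via apt-get)
Installing elasticsearch
--------
1. Add the package repository

1. Install the JDK (or JRE)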

```bash
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
sudo apt-get install openjdk-7-jdk
```
2. Install the JDK and elasticsearch
2. Download elasticsearch

```bash
sudo apt-get update
sudo apt-get install openjdk-7-jdk elasticsearch
wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz
tar xvf elasticsearch-2.3.4.tar.gz
```
3. Add elasticsearch to the system startup items

3. Run elasticsearch

```bash
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
cd elasticsearch-2.3.4/bin
./elasticsearch
```

4. Test it: once installed and running, elasticsearch listens on port 9200

```bash
@@ -66,70 +66,72 @@ rs.initiate( {"_id" : "rs0", "version" : 1, "members" : [ { "_id" : 0, "host" :
3. Once the replica set is configured, exit the mongo shell and log in again; the prompt changes to rs0:PRIMARY>, after which you can exit Mongodb
Installing mongo-connector and syncing data to elasticsearch
-------
```bash
sudo pip install mongo-connector elastic2_doc_manager
sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager
```
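Once "Logging to mongo-connector.log." is displayed, the contents of the mongodb database are synced to elasticsearch. A full sync is estimated to take 10-15 minutes and must not be interrupted; otherwise the data in elasticsearch and mongodb may become inconsistent.
Installing the Chinese word segmentation plugin elasticsearch-analysis-ik
-------
1. Download the prebuilt plugin from github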
```bash
cd ~
sudo apt-get install unzip wget
sudo apt-get install unzip
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.4/elasticsearch-analysis-ik-1.9.4.zip
unzip elasticsearch-analysis-ik-1.9.4.zip
```
2. Copy the plugin into elasticsearch's plugins directory
```bash
sudo cp -R ~/elasticsearch-analysis-ik/ /usr/share/elasticsearch/plugins
sudo chmod +rx /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
cp -r elasticsearch-analysis-ik elasticsearch-2.3.4/plugins
```
3. Edit elasticsearch.yml to define the plugin configuration
```bash
sudo vi /etc/elasticsearch/elasticsearch.yml
vi elasticsearch-2.3.4/config/elasticsearch.yml
```
Append at the end:
index:
  analysis:
    analyzer:
      ik_syno:
        type: custom
        tokenizer: ik_max_word
        filter: [my_synonym_filter]
      ik_syno_smart:
        type: custom
        tokenizer: ik_smart
        filter: [my_synonym_filter]
    filter:
      my_synonym_filter:
        type: synonym
        synonyms_path: analysis/synonym.txt
Also create an empty analysis/synonym.txt file:
index.analysis.analyzer.ik.type : 'ik'
index.analysis.analyzer.default.type : 'ik'
4. Exit and restart elasticsearch
```bash
sudo mkdir /etc/elasticsearch/analysis
sudo touch /etc/elasticsearch/analysis/synonym.txt
elasticsearch-2.3.4/bin/elasticsearch -d
(-d runs it in the background)
```
4. Restart elasticsearch
Installing mongo-connector and syncing data to elasticsearch
-------
```bash
sudo service elasticsearch restart
sudo pip install mongo-connector elastic2_doc_manager
sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager
```
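Once "Logging to mongo-connector.log." is displayed, the contents of the mongodb database are synced to elasticsearch. A full sync is estimated to take about 30 minutes and must not be interrupted; otherwise the data in elasticsearch and mongodb may become inconsistent.
The following error may occur during the sync: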
```bash
OperationFailed: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))
2016-08-04 17:24:53,372 [ERROR] mongo_connector.oplog_manager:633 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(u'127.0.0.1', 27017), u'local'), u'oplog.rs')
2016-08-04 17:24:54,371 [ERROR] mongo_connector.connector:304 - MongoConnector: OplogThread <OplogThread(Thread-7, started 140485117060864)> unexpectedly stopped! Shutting down
```
#### Solution:
Change the timeout value from the default of 10 to 200
```bash
sudo vi /usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py
```
Change:
self.elastic = Elasticsearch(hosts=[url],**kwargs.get('clientOptions', {}))
to:
self.elastic = Elasticsearch(hosts=[url],timeout=200, **kwargs.get('clientOptions', {}))
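One caveat with hardcoding `timeout=200` next to `**kwargs.get('clientOptions', {})`: if `clientOptions` ever carries its own `timeout`, Python raises a `TypeError` for the duplicate keyword argument. A minimal sketch of a merge that avoids this follows. The helper `build_client_options` and the stand-in `make_client` are hypothetical names used only for illustration; they are not part of mongo-connector or elasticsearch-py.

```python
DEFAULT_TIMEOUT = 200  # seconds; the value suggested above


def build_client_options(kwargs, default_timeout=DEFAULT_TIMEOUT):
    """Return client options with a default timeout that callers can override."""
    options = {'timeout': default_timeout}
    # Values supplied through clientOptions win over the default.
    options.update(kwargs.get('clientOptions', {}))
    return options


def make_client(hosts, **options):
    """Hypothetical stand-in for Elasticsearch(hosts=..., **options)."""
    return {'hosts': hosts, 'options': options}


# No clientOptions given: the default timeout is applied.
client = make_client(['localhost:9200'], **build_client_options({}))
print(client['options'])

# clientOptions already sets a timeout: no duplicate-keyword error,
# and the caller's value is kept.
client = make_client(['localhost:9200'],
                     **build_client_options({'clientOptions': {'timeout': 60}}))
print(client['options'])
```

Building the options dict first keeps the default in one place and lets a configured value take precedence instead of failing at call time.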
Enabling full-text search
-------
1. Install elasticsearch-py
@@ -149,7 +151,7 @@ git pull
```bash
vi ~/wooyun_public/flask/app.py
Change:
SEARCH_BY_ES = True
SEARCH_BY_ES = 'auto'
```
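The `SEARCH_BY_ES` setting takes one of three string values: `'auto'`, `'yes'` or `'no'`. A minimal sketch of how such a setting could be resolved into a backend choice, assuming a hypothetical `probe` callable that reports whether elasticsearch is reachable (the function name `use_elasticsearch` is illustrative and not taken from app.py):

```python
def use_elasticsearch(mode, probe):
    """Decide which backend serves a search.

    mode  -- 'yes', 'no' or 'auto' (the SEARCH_BY_ES setting)
    probe -- zero-argument callable returning True when elasticsearch responds
    """
    if mode == 'yes':
        return True
    if mode == 'auto':
        # Only probe when needed, so 'yes' and 'no' never touch the network.
        return bool(probe())
    # 'no' or any unrecognized value falls back to mongodb.
    return False


print(use_elasticsearch('auto', lambda: True))
print(use_elasticsearch('auto', lambda: False))
print(use_elasticsearch('no', lambda: True))
```

Passing the probe in as a callable keeps the decision logic testable without a running elasticsearch instance.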
Reference links
-------
@@ -159,4 +161,10 @@ vi ~/wooyun_public/flask/app.py
3. [http://es.xiaoleilu.com](http://es.xiaoleilu.com)
4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
5. [https://segmentfault.com/a/1190000002470467](https://segmentfault.com/a/1190000002470467)
6. [https://github.com/medcl/elasticsearch-analysis-ik/issues/207](https://github.com/medcl/elasticsearch-analysis-ik/issues/207)
7. [https://github.com/mongodb-labs/mongo-connector/wiki/Usage%20with%20ElasticSearch](https://github.com/mongodb-labs/mongo-connector/wiki/Usage%20with%20ElasticSearch)
29 changes: 20 additions & 9 deletions flask/app.py
@@ -3,6 +3,7 @@
import math
import re
import time
import urllib2
import pymongo
from flask import Flask, request, session, g, redirect, url_for, abort, render_template, flash
# setting:
@@ -12,9 +13,12 @@
MONGODB_COLLECTION_BUGS = 'wooyun_list'
MONGODB_COLLECTION_DROPS = 'wooyun_drops'
ROWS_PER_PAGE = 20
# search engine: if elasticsearch and mongo-connector are installed, use elasticsearch for full-text search;
# otherwise set this to False
SEARCH_BY_ES = False
ELASTICSEARCH_HOST = 'localhost:9200'
# ELASTICSEARCH MODE
# auto: detect elasticsearch automatically; use it if it is running, otherwise use mongodb
# yes: always use elasticsearch
# no: never use elasticsearch
SEARCH_BY_ES = 'auto'
# flask app:
app = Flask(__name__)
app.config.from_object(__name__)
@@ -77,8 +81,7 @@ def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):
    # get the page rows
    if page >= 1 :
        row_start = (page - 1) * app.config['ROWS_PER_PAGE']
    # connect to elasticsearch at localhost:9200
    es = Elasticsearch()
    es = Elasticsearch([app.config['ELASTICSEARCH_HOST'],])
    if keywords.strip() == '':
        query_dsl = {
            "query": {
@@ -97,7 +100,7 @@ def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):
            "query": {
                "filtered": {
                    "query": {
                        "match":{
                        "match": {
                            field_name : {
                                'query':keywords,
                                'operator':'and'
@@ -125,6 +128,15 @@ def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):

    return page_info

def check_elastichsearch_open():
    try:
        html = urllib2.urlopen('http://%s' %app.config['ELASTICSEARCH_HOST']).read()
        if len(html) > 0:
            return True
        else:
            return False
    except Exception:
        # any failure to reach elasticsearch means it is treated as closed
        return False

def get_wooyun_total_count():
    client = pymongo.MongoClient(connection_string)
@@ -153,9 +165,8 @@ def search():
    content_search_by = request.args.get('content_search_by', 'by_bugs')
    if page < 1:
        page = 1
    # if elasticsearch is configured, run full-text search through es;
    # otherwise search through mongodb
    if app.config['SEARCH_BY_ES'] is True and search_by_html is True:
    # search by elasticsearch or mongodb
    if app.config['SEARCH_BY_ES'] == 'yes' or ( app.config['SEARCH_BY_ES'] == 'auto' and check_elastichsearch_open() is True ):
        page_info = search_mongodb_by_es(keywords, page, content_search_by, search_by_html)
    else:
        page_info = search_mongodb(keywords, page, content_search_by, search_by_html)
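
`check_elastichsearch_open()` calls `urllib2.urlopen` without a timeout, so in `'auto'` mode an unresponsive elasticsearch host could stall a search request. A minimal sketch of a bounded probe follows. It is written for Python 3 (`urllib.request`) rather than the Python 2.7 `urllib2` used by app.py, and the function name `elasticsearch_is_open` and the 2-second default are assumptions for illustration.

```python
import http.client
import urllib.error
import urllib.request


def elasticsearch_is_open(host, timeout=2.0):
    """Return True if an HTTP GET to http://<host> returns a non-empty body."""
    url = 'http://%s' % host
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return len(response.read()) > 0
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connection, timeout, HTTP error status or malformed host:
        # treat elasticsearch as unavailable.
        return False


if __name__ == '__main__':
    # 'localhost:9200' matches the ELASTICSEARCH_HOST setting shown earlier.
    print(elasticsearch_is_open('localhost:9200'))
```

Because the probe runs on every search in `'auto'` mode, a short timeout keeps the mongodb fallback responsive when elasticsearch is down.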
39 changes: 39 additions & 0 deletions install.md
@@ -0,0 +1,39 @@
Installing wooyun_public on Ubuntu
=============================

The following is the installation procedure on ubuntu 14.04 and 16.04. Required dependencies:

+ python 2.7 and pip
+ mongodb
+ scrapy
+ flask
+ pymongo

Steps
--------
1. Install python, pip and mongodb

```bash
sudo apt-get install python python-pip mongodb
```
2. Install scrapy

```bash
# If installing scrapy fails, first apt-get install the dependency packages below, then pip install lxml; scrapy should then install normally
sudo apt-get install libxml2-dev libxslt1-dev python-dev zlib1g-dev libevent-dev python-openssl

sudo pip install lxml
sudo pip install scrapy
```
3. Install flask and pymongo

```bash
sudo pip install flask pymongo
```
4. Download the source code from github

```bash
git clone https://github.com/hanc00l/wooyun_public
```
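
After the steps above, it can help to confirm that the Python dependencies actually import. A minimal sketch using only the standard library follows; the helper name `missing_modules` is an assumption, while the module names `scrapy`, `flask` and `pymongo` come from the dependency list above.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == '__main__':
    missing = missing_modules(['scrapy', 'flask', 'pymongo'])
    if missing:
        print('Missing modules: %s' % ', '.join(missing))
    else:
        print('All dependencies import correctly.')
```

Run it with the same interpreter that will run the crawler and `app.py`, since packages installed for one Python version are not visible to another.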

