add elasticsearch for fulltext search
hanc00l committed Aug 2, 2016
1 parent c0f5e62 commit 6d4de42
Showing 3 changed files with 256 additions and 23 deletions.
45 changes: 26 additions & 19 deletions README.md
@@ -6,55 +6,62 @@
![index](index.png)
![search](search.png)

### 1. Install the required components
1. Install the required components
--------
+ python 2.7 and pip
+ mongodb
+ scrapy (pip install scrapy)
+ flask (pip install Flask)
+ pymongo (pip install pymongo)

### 2. Crawler
2. Crawler
--------

+ The crawlers for WooYun's public vulnerabilities and for the knowledge base (drops) live in scrapy/wooyun and scrapy/wooyun_drops respectively

+ Run scrapy crawl wooyun -a page_max=1 -a local_store=false -a update=false; three arguments control the crawl:

-a page_max: number of pages to crawl; default 1; 0 means crawl all pages
-a local_store: whether to store each vulnerability offline locally; default false
-a update: whether to re-crawl already-fetched items; default false

+ For the first full crawl, run scrapy crawl wooyun -a page_max=0 -a update=true

+ For routine incremental crawls, run scrapy crawl wooyun -a page_max=1; tune page_max to match your crawl frequency and how often the site updates

+ The list of all public vulnerabilities and each vulnerability's text are stored in mongodb, about 2G in total; mirroring the whole site's text and images for offline browsing takes roughly 10G of disk and 2 hours (over a 10M telecom line); crawling the entire knowledge base takes about 500M. (As of October 2015)
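As a sanity check on the figures above, the "about 2 hours" estimate follows directly from moving roughly 10 GB over a 10 Mbit/s line; a back-of-the-envelope sketch (protocol overhead ignored):

```python
# Rough transfer-time estimate for the offline mirror described above.
# Assumes the quoted figures: ~10 GB of text and images over a
# 10 Mbit/s telecom line.
def transfer_hours(gigabytes, mbit_per_s):
    megabits = gigabytes * 1000 * 8  # 1 GB ~ 8000 Mbit (decimal units)
    return megabits / mbit_per_s / 3600.0

print(round(transfer_hours(10, 10), 1))  # ~2.2 hours, matching the estimate above
```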

### 3. Search
3. Search
--------
+ Vulnerability search uses Flask as the web server and Bootstrap for the front end

+ Start the web server: run python app.py in the flask directory; the default port is 5000

+ Search: open http://localhost:5000 in a browser to search vulnerabilities; separate multiple keywords with spaces.
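The MongoDB side of multi-keyword handling is not part of this commit; as a minimal sketch of the idea, space-separated keywords can be ANDed together as regex conditions (`build_mongo_query` is a hypothetical helper, not code from this repository):

```python
def build_mongo_query(keywords, field='title'):
    # Split the query string on whitespace and require every keyword
    # to match the chosen field (case-insensitive regex, ANDed together).
    terms = keywords.split()
    if not terms:
        return {}
    return {'$and': [{field: {'$regex': kw, '$options': 'i'}} for kw in terms]}

# "sql injection" -> both words must appear in the title
query = build_mongo_query('sql injection')
```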

### 4. Create indexes for the mongodb database (without indexes, queries may return errors; the following are mongo shell commands)
mongo
use wooyun
db.wooyun_list.ensureIndex({"datetime":1})
db.wooyun_drops.ensureIndex({"datetime":1})
+ For full-text search, installing and enabling Elasticsearch greatly improves efficiency; otherwise mongodb's built-in search is used. See [Installing Elasticsearch](elasticsearch_install.md) for how to install and enable it

4. Create indexes for the mongodb database
--------
```bash
mongo
use wooyun
db.wooyun_list.ensureIndex({"datetime":1})
db.wooyun_drops.ensureIndex({"datetime":1})
```

### 5. Virtual machines
5. Virtual machines
------

+ VM 1: a complete crawl of the WooYun vulnerability and knowledge bases taken at the end of June 2016, 30G in total (about 11G compressed). Baidu net-disk link: [http://pan.baidu.com/s/1o7IEaAQ](http://pan.baidu.com/s/1o7IEaAQ) extraction code: d4cq

Usage:
1. The archive extracts to a vmware virtual machine image that can be opened and run directly with vmware;
2. Because the VM was "suspended" when the archive was made, its current IP address may not be on the same subnet as the host; reboot the VM to obtain a new IP address. The VM account/password is hancool/qwe123;
3. Enter the wooyun_public directory and update to the latest code with git pull;
4. Enter the wooyun_public/flask directory and run ./app.py;
5. Open a browser at http://ip:5000, where ip is the VM's network address (check with ifconfig eth0)

+ VM 2: a VM with all components and the program installed (no crawled content, about 980M). Baidu net-disk link: [http://pan.baidu.com/s/1sj67KDZ](http://pan.baidu.com/s/1sj67KDZ) password: bafi
162 changes: 162 additions & 0 deletions elasticsearch_install.md
@@ -0,0 +1,162 @@
Elasticsearch Install
=============================

Full-text search through mongodb is slow and memory-hungry; one solution is the elasticsearch engine: sync the data into elasticsearch with mongo-connector, then search there for fast queries.

By default elasticsearch tokenizes Chinese one character at a time, which makes Chinese queries quite painful. The IK plugin is now the standard choice for Chinese word segmentation; after repeated install-and-test cycles the results are still not ideal. Something may be misconfigured; pointers from experienced users are welcome.
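To see why per-character tokenization hurts, compare it with a toy greedy longest-match segmenter; this is only a crude stand-in for the IK plugin, using a hypothetical two-character vocabulary:

```python
def char_tokenize(text):
    # Elasticsearch's default for Chinese: one token per character.
    return list(text)

def greedy_tokenize(text, vocab):
    # Toy longest-match segmentation; the real IK plugin ships a large
    # built-in dictionary, this just illustrates the idea.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(char_tokenize(u'漏洞库'))               # ['漏', '洞', '库']
print(greedy_tokenize(u'漏洞库', {u'漏洞'}))  # ['漏洞', '库']
```

A query for 漏洞 matches the first output only as two unrelated single-character tokens, which is why relevance suffers without a segmenter.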

Install elasticsearch (via apt-get)
--------
1. Add the package repository

```bash
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
```
2. Install the JDK and elasticsearch

```bash
sudo apt-get update
sudo apt-get install openjdk-7-jdk elasticsearch
```
3. Add elasticsearch to the system startup services

```bash
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
```
4. Test it; once installed and running, elasticsearch listens on port 9200

```bash
curl -X GET http://localhost:9200
{
  "name" : "Sebastian Shaw",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.4",
    "build_hash" : "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
    "build_timestamp" : "2016-06-30T11:24:31Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
```


Configure mongodb
-------

1. Edit /etc/mongodb.conf and add:

    replSet=rs0     # name of the replica set
    oplogSize=100   # size of the oplog collection (too large is not supported)

Restart mongodb

```bash
sudo service mongodb restart
```
2. Enter the mongodb shell and initialize the replica set

```bash
mongo
rs.initiate( {"_id" : "rs0", "version" : 1, "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017" } ]})
```
3. After the replica set is up, exit the mongo shell and log in again; the prompt will change to rs0:PRIMARY>, and you can then exit mongodb

Install mongo-connector and sync the data into elasticsearch
-------
```bash
sudo pip install mongo-connector elastic2_doc_manager
sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager
```
Once "Logging to mongo-connector.log." is shown, the mongodb data is being synced into elasticsearch. A full sync takes roughly 10-15 minutes and must not be interrupted; otherwise elasticsearch and mongodb may end up inconsistent.
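The reason app.py can later query elasticsearch with index 'wooyun' and the collection name as the doc type is mongo-connector's namespace mapping: each MongoDB "db.collection" pair becomes an index/type pair. A minimal sketch of that convention (`es_target` is a hypothetical helper, assuming the elastic2 doc manager's default behavior):

```python
def es_target(namespace):
    # mongo-connector namespaces look like "db.collection"; the elastic2
    # doc manager uses the db as the ES index and the collection as the
    # doc type (assumption based on the defaults, not code from this repo).
    db, collection = namespace.split('.', 1)
    return {'index': db, 'doc_type': collection}

print(es_target('wooyun.wooyun_list'))  # {'index': 'wooyun', 'doc_type': 'wooyun_list'}
```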
Install the Chinese word-segmentation plugin elasticsearch-analysis-ik
-------
1. Download the prebuilt plugin from github
```bash
cd ~
sudo apt-get install unzip wget
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.4/elasticsearch-analysis-ik-1.9.4.zip
unzip elasticsearch-analysis-ik-1.9.4.zip
```
2. Copy the plugin into elasticsearch's plugins directory
```bash
sudo cp -R ~/elasticsearch-analysis-ik/ /usr/share/elasticsearch/plugins
sudo chmod +rx /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
```
3. Edit the elasticsearch.yml configuration to define the analyzers
```bash
sudo vi /etc/elasticsearch/elasticsearch.yml
```
Append at the end:

    index:
      analysis:
        analyzer:
          ik_syno:
            type: custom
            tokenizer: ik_max_word
            filter: [my_synonym_filter]
          ik_syno_smart:
            type: custom
            tokenizer: ik_smart
            filter: [my_synonym_filter]
        filter:
          my_synonym_filter:
            type: synonym
            synonyms_path: analysis/synonym.txt

Also create an empty analysis/synonym.txt file:
```bash
sudo mkdir /etc/elasticsearch/analysis
sudo touch /etc/elasticsearch/analysis/synonym.txt
```
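The file can stay empty; if you later add synonyms, the synonym token filter expects Solr-style rules, one per line. The entries below are purely illustrative, not part of this setup:

```text
# comma-separated terms are treated as equivalent
xss, cross site scripting
# "=>" maps the left-hand terms to the right-hand ones
sqli => sql injection
```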
4. Restart elasticsearch
```bash
sudo service elasticsearch restart
```
Enable full-text search
-------
1. Install elasticsearch-py
```bash
pip install elasticsearch
```
2. Pull the latest app.py
```bash
cd ~/wooyun_public
git pull
```
3. Edit app.py, changing SEARCH_BY_ES to True
```bash
vi ~/wooyun_public/flask/app.py
# change: SEARCH_BY_ES = True
```
References
-------
1. [https://imququ.com/post/elasticsearch.html](https://imququ.com/post/elasticsearch.html)
2. [https://github.com/medcl/elasticsearch-analysis-ik](https://github.com/medcl/elasticsearch-analysis-ik)
3. [http://es.xiaoleilu.com](http://es.xiaoleilu.com)
4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
72 changes: 68 additions & 4 deletions flask/app.py
100644 → 100755
@@ -2,6 +2,7 @@
#-*- coding: utf-8 -*-
import math
import re
import time
import pymongo
from flask import Flask, request, session, g, redirect, url_for, abort, render_template, flash
# setting:
@@ -11,6 +12,9 @@
MONGODB_COLLECTION_BUGS = 'wooyun_list'
MONGODB_COLLECTION_DROPS = 'wooyun_drops'
ROWS_PER_PAGE = 20
# search engine: if elasticsearch and mongo-connector are installed, set True
# to use elasticsearch for full-text search; otherwise leave False
SEARCH_BY_ES = False
# flask app:
app = Flask(__name__)
app.config.from_object(__name__)
@@ -64,6 +68,63 @@ def search_mongodb(keywords, page, content_search_by, search_by_html)
#
return page_info

def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):
    from elasticsearch import Elasticsearch

    field_name = 'html' if search_by_html else 'title'
    page_info = {'current': page, 'total': 0,
                 'total_rows': 0, 'rows': []}
    # get the page rows
    if page >= 1:
        row_start = (page - 1) * app.config['ROWS_PER_PAGE']
        # connect to elasticsearch on localhost:9200
        es = Elasticsearch()
        if keywords.strip() == '':
            query_dsl = {
                "query": {
                    "filtered": {
                        "query": {
                            "match_all": {}
                        }
                    }
                },
                "sort": {"datetime": {"order": "desc"}},
                "from": row_start,
                "size": app.config['ROWS_PER_PAGE']
            }
        else:
            query_dsl = {
                "query": {
                    "filtered": {
                        "query": {
                            "match": {
                                field_name: {
                                    'query': keywords,
                                    'operator': 'and'
                                }
                            }
                        }
                    }
                },
                "sort": {"datetime": {"order": "desc"}},
                "from": row_start,
                "size": app.config['ROWS_PER_PAGE']
            }
        res = es.search(body=query_dsl, index=app.config['MONGODB_DB'],
                        doc_type=content[content_search_by]['mongodb_collection'])
        # get total rows and pages
        page_info['total_rows'] = res['hits']['total']
        page_info['total'] = int(math.ceil(page_info['total_rows'] / (app.config['ROWS_PER_PAGE'] * 1.0)))
        # build each result row
        for doc in res['hits']['hits']:
            c = doc['_source']
            c['datetime'] = time.strftime('%Y-%m-%d', time.strptime(c['datetime'], '%Y-%m-%dT%H:%M:%S'))
            if 'url' in c:
                urlsep = c['url'].split('//')[1].split('/')
                c['url_local'] = '%s-%s.html' % (urlsep[1], urlsep[2])
            page_info['rows'].append(c)

    return page_info
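The per-row post-processing in search_mongodb_by_es (reformatting the ES datetime and deriving a local filename from the WooYun URL) can be exercised on its own; a standalone check with a made-up document (the URL and its id are illustrative):

```python
import time

def postprocess(c):
    # ES stores datetimes like '2016-07-01T12:30:00'; templates want '2016-07-01'.
    c['datetime'] = time.strftime('%Y-%m-%d', time.strptime(c['datetime'], '%Y-%m-%dT%H:%M:%S'))
    if 'url' in c:
        # http://host/bugs/wooyun-2010-012345 -> bugs-wooyun-2010-012345.html
        urlsep = c['url'].split('//')[1].split('/')
        c['url_local'] = '%s-%s.html' % (urlsep[1], urlsep[2])
    return c

doc = {'datetime': '2016-07-01T12:30:00',
       'url': 'http://www.wooyun.org/bugs/wooyun-2010-012345'}  # made-up example
print(postprocess(doc)['url_local'])  # bugs-wooyun-2010-012345.html
```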


def get_wooyun_total_count():
client = pymongo.MongoClient(connection_string)
@@ -92,15 +153,18 @@ def search():
    content_search_by = request.args.get('content_search_by', 'by_bugs')
    if page < 1:
        page = 1
    #
    page_info = search_mongodb(
        keywords, page, content_search_by, search_by_html)
    # if elasticsearch is configured, full-text search goes through es;
    # otherwise fall back to mongodb search
    if app.config['SEARCH_BY_ES'] is True and search_by_html is True:
        page_info = search_mongodb_by_es(keywords, page, content_search_by, search_by_html)
    else:
        page_info = search_mongodb(keywords, page, content_search_by, search_by_html)
    #
    return render_template(content[content_search_by]['template_html'], keywords=keywords, page_info=page_info, search_by_html=search_by_html, title=u'搜索结果-乌云公开漏洞、知识库搜索')


def main():
    app.run(host='0.0.0.0', debug=True)
    app.run(host='0.0.0.0', debug=False)

if __name__ == '__main__':
    main()
