add elasticsearch for fulltext search
hanc00l committed Aug 2, 2016
1 parent c0f5e62 commit 6d4de42
Showing 3 changed files with 256 additions and 23 deletions.
45 changes: 26 additions & 19 deletions README.md
@@ -6,55 +6,62 @@
![index](index.png)
![search](search.png)

### 1. Install the required components
1. Install the required components
--------
+ python 2.7 and pip
+ mongodb
+ scrapy (pip install scrapy)
+ flask (pip install Flask)
+ pymongo (pip install pymongo)

### 2. Crawler
2. Crawler
--------

+ The crawlers for WooYun's public vulnerabilities and for the knowledge base (drops) live in scrapy/wooyun and scrapy/wooyun_drops respectively

+ Run scrapy crawl wooyun -a page_max=1 -a local_store=false -a update=false; three arguments control the crawl:

-a page_max: number of pages to crawl; default 1; 0 means crawl all pages
-a local_store: whether to store each vulnerability offline locally; default false
-a update: whether to re-crawl already-fetched items; default false

+ For the first full crawl, run scrapy crawl wooyun -a page_max=0 -a update=true

+ For routine incremental crawls, run scrapy crawl wooyun -a page_max=1; tune page_max to match your crawl frequency and how often the site updates

+ The list of all public vulnerabilities and each vulnerability's text are stored in mongodb, about 2G in total; mirroring the whole site's text and images for offline browsing takes roughly 10G of disk and 2 hours (over a 10M telecom line); crawling the entire knowledge base takes about 500M. (As of October 2015)
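As a sanity check on the figures above, the "about 2 hours" estimate follows directly from moving roughly 10 GB over a 10 Mbit/s line; a back-of-the-envelope sketch (protocol overhead ignored):

```python
# Rough transfer-time estimate for the offline mirror described above.
# Assumes the quoted figures: ~10 GB of text and images over a
# 10 Mbit/s telecom line.
def transfer_hours(gigabytes, mbit_per_s):
    megabits = gigabytes * 1000 * 8  # 1 GB ~ 8000 Mbit (decimal units)
    return megabits / mbit_per_s / 3600.0

print(round(transfer_hours(10, 10), 1))  # ~2.2 hours, matching the estimate above
```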

### 3. Search
3. Search
--------
+ Vulnerability search uses Flask as the web server and Bootstrap for the front end

+ Start the web server: run python app.py in the flask directory; the default port is 5000

+ Search: open http://localhost:5000 in a browser to search vulnerabilities; separate multiple keywords with spaces.
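The MongoDB side of multi-keyword handling is not part of this commit; as a minimal sketch of the idea, space-separated keywords can be ANDed together as regex conditions (`build_mongo_query` is a hypothetical helper, not code from this repository):

```python
def build_mongo_query(keywords, field='title'):
    # Split the query string on whitespace and require every keyword
    # to match the chosen field (case-insensitive regex, ANDed together).
    terms = keywords.split()
    if not terms:
        return {}
    return {'$and': [{field: {'$regex': kw, '$options': 'i'}} for kw in terms]}

# "sql injection" -> both words must appear in the title
query = build_mongo_query('sql injection')
```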

### 4. Create indexes for the mongodb database (without indexes, queries may return errors; the following are mongo shell commands)
mongo
use wooyun
db.wooyun_list.ensureIndex({"datetime":1})
db.wooyun_drops.ensureIndex({"datetime":1})
+ For full-text search, installing and enabling Elasticsearch greatly improves efficiency; otherwise mongodb's built-in search is used. See [Installing Elasticsearch](elasticsearch_install.md) for how to install and enable it

4. Create indexes for the mongodb database
--------
```bash
mongo
use wooyun
db.wooyun_list.ensureIndex({"datetime":1})
db.wooyun_drops.ensureIndex({"datetime":1})
```

### 5. Virtual machines
5. Virtual machines
------

+ VM 1: a complete crawl of the WooYun vulnerability and knowledge bases taken at the end of June 2016, 30G in total (about 11G compressed). Baidu net-disk link: [http://pan.baidu.com/s/1o7IEaAQ](http://pan.baidu.com/s/1o7IEaAQ) extraction code: d4cq

Usage:
1. The archive extracts to a vmware virtual machine image that can be opened and run directly with vmware;
2. Because the VM was "suspended" when the archive was made, its current IP address may not be on the same subnet as the host; reboot the VM to obtain a new IP address. The VM account/password is hancool/qwe123;
3. Enter the wooyun_public directory and update to the latest code with git pull;
4. Enter the wooyun_public/flask directory and run ./app.py;
5. Open a browser at http://ip:5000, where ip is the VM's network address (check with ifconfig eth0)

+ VM 2: a VM with all components and the program installed (no crawled content, about 980M). Baidu net-disk link: [http://pan.baidu.com/s/1sj67KDZ](http://pan.baidu.com/s/1sj67KDZ) password: bafi
162 changes: 162 additions & 0 deletions elasticsearch_install.md
@@ -0,0 +1,162 @@
Elasticsearch Install
=============================

Full-text search through mongodb is slow and memory-hungry; one solution is the elasticsearch engine: sync the data into elasticsearch with mongo-connector, then search there for fast queries.

By default elasticsearch tokenizes Chinese one character at a time, which makes Chinese queries quite painful. The IK plugin is now the standard choice for Chinese word segmentation; after repeated install-and-test cycles the results are still not ideal. Something may be misconfigured; pointers from experienced users are welcome.
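To see why per-character tokenization hurts, compare it with a toy greedy longest-match segmenter; this is only a crude stand-in for the IK plugin, using a hypothetical two-character vocabulary:

```python
def char_tokenize(text):
    # Elasticsearch's default for Chinese: one token per character.
    return list(text)

def greedy_tokenize(text, vocab):
    # Toy longest-match segmentation; the real IK plugin ships a large
    # built-in dictionary, this just illustrates the idea.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(char_tokenize(u'漏洞库'))               # ['漏', '洞', '库']
print(greedy_tokenize(u'漏洞库', {u'漏洞'}))  # ['漏洞', '库']
```

A query for 漏洞 matches the first output only as two unrelated single-character tokens, which is why relevance suffers without a segmenter.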

Install elasticsearch (via apt-get)
--------
1. Add the package repository

```bash
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
```
2. Install the JDK and elasticsearch

```bash
sudo apt-get update
sudo apt-get install openjdk-7-jdk elasticsearch
```
3. Add elasticsearch to the system startup services

```bash
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
```
4. Test it; once installed and running, elasticsearch listens on port 9200

```bash
curl -X GET http://localhost:9200
{
  "name" : "Sebastian Shaw",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.4",
    "build_hash" : "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
    "build_timestamp" : "2016-06-30T11:24:31Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
```


Configure mongodb
-------

1. Edit /etc/mongodb.conf and add:

    replSet=rs0     # name of the replica set
    oplogSize=100   # size of the oplog collection (too large is not supported)

Restart mongodb

```bash
sudo service mongodb restart
```
2. Enter the mongodb shell and initialize the replica set

```bash
mongo
rs.initiate( {"_id" : "rs0", "version" : 1, "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017" } ]})
```
3. After the replica set is up, exit the mongo shell and log in again; the prompt will change to rs0:PRIMARY>, and you can then exit mongodb

Install mongo-connector and sync the data into elasticsearch
-------
```bash
sudo pip install mongo-connector elastic2_doc_manager
sudo mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager
```
Once "Logging to mongo-connector.log." is shown, the mongodb data is being synced into elasticsearch. A full sync takes roughly 10-15 minutes and must not be interrupted; otherwise elasticsearch and mongodb may end up inconsistent.
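The reason app.py can later query elasticsearch with index 'wooyun' and the collection name as the doc type is mongo-connector's namespace mapping: each MongoDB "db.collection" pair becomes an index/type pair. A minimal sketch of that convention (`es_target` is a hypothetical helper, assuming the elastic2 doc manager's default behavior):

```python
def es_target(namespace):
    # mongo-connector namespaces look like "db.collection"; the elastic2
    # doc manager uses the db as the ES index and the collection as the
    # doc type (assumption based on the defaults, not code from this repo).
    db, collection = namespace.split('.', 1)
    return {'index': db, 'doc_type': collection}

print(es_target('wooyun.wooyun_list'))  # {'index': 'wooyun', 'doc_type': 'wooyun_list'}
```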
Install the Chinese word-segmentation plugin elasticsearch-analysis-ik
-------
1. Download the prebuilt plugin from github
```bash
cd ~
sudo apt-get install unzip wget
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.4/elasticsearch-analysis-ik-1.9.4.zip
unzip elasticsearch-analysis-ik-1.9.4.zip
```
2. Copy the plugin into elasticsearch's plugins directory
```bash
sudo cp -R ~/elasticsearch-analysis-ik/ /usr/share/elasticsearch/plugins
sudo chmod +rx /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
```
3. Edit the elasticsearch.yml configuration to define the analyzers
```bash
sudo vi /etc/elasticsearch/elasticsearch.yml
```
Append at the end:

    index:
      analysis:
        analyzer:
          ik_syno:
            type: custom
            tokenizer: ik_max_word
            filter: [my_synonym_filter]
          ik_syno_smart:
            type: custom
            tokenizer: ik_smart
            filter: [my_synonym_filter]
        filter:
          my_synonym_filter:
            type: synonym
            synonyms_path: analysis/synonym.txt

Also create an empty analysis/synonym.txt file:
```bash
sudo mkdir /etc/elasticsearch/analysis
sudo touch /etc/elasticsearch/analysis/synonym.txt
```
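The file can stay empty; if you later add synonyms, the synonym token filter expects Solr-style rules, one per line. The entries below are purely illustrative, not part of this setup:

```text
# comma-separated terms are treated as equivalent
xss, cross site scripting
# "=>" maps the left-hand terms to the right-hand ones
sqli => sql injection
```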
4. Restart elasticsearch
```bash
sudo service elasticsearch restart
```
Enable full-text search
-------
1. Install elasticsearch-py
```bash
pip install elasticsearch
```
2. Pull the latest app.py
```bash
cd ~/wooyun_public
git pull
```
3. Edit app.py, changing SEARCH_BY_ES to True
```bash
vi ~/wooyun_public/flask/app.py
# change: SEARCH_BY_ES = True
```
References
-------
1. [https://imququ.com/post/elasticsearch.html](https://imququ.com/post/elasticsearch.html)
2. [https://github.com/medcl/elasticsearch-analysis-ik](https://github.com/medcl/elasticsearch-analysis-ik)
3. [http://es.xiaoleilu.com](http://es.xiaoleilu.com)
4. [http://www.cnblogs.com/ciaos/p/3601209.html](http://www.cnblogs.com/ciaos/p/3601209.html)
72 changes: 68 additions & 4 deletions flask/app.py
100644 → 100755
@@ -2,6 +2,7 @@
#-*- coding: utf-8 -*-
import math
import re
import time
import pymongo
from flask import Flask, request, session, g, redirect, url_for, abort, render_template, flash
# setting:
@@ -11,6 +12,9 @@
MONGODB_COLLECTION_BUGS = 'wooyun_list'
MONGODB_COLLECTION_DROPS = 'wooyun_drops'
ROWS_PER_PAGE = 20
# search engine: if elasticsearch and mongo-connector are installed, set True
# to use elasticsearch for full-text search; otherwise leave False
SEARCH_BY_ES = False
# flask app:
app = Flask(__name__)
app.config.from_object(__name__)
@@ -64,6 +68,63 @@ def search_mongodb(keywords, page, content_search_by, search_by_html)
#
return page_info

def search_mongodb_by_es(keywords, page, content_search_by, search_by_html):
    from elasticsearch import Elasticsearch

    field_name = 'html' if search_by_html else 'title'
    page_info = {'current': page, 'total': 0,
                 'total_rows': 0, 'rows': []}
    # get the page rows
    if page >= 1:
        row_start = (page - 1) * app.config['ROWS_PER_PAGE']
        # connect to elasticsearch on localhost:9200
        es = Elasticsearch()
        if keywords.strip() == '':
            query_dsl = {
                "query": {
                    "filtered": {
                        "query": {
                            "match_all": {}
                        }
                    }
                },
                "sort": {"datetime": {"order": "desc"}},
                "from": row_start,
                "size": app.config['ROWS_PER_PAGE']
            }
        else:
            query_dsl = {
                "query": {
                    "filtered": {
                        "query": {
                            "match": {
                                field_name: {
                                    'query': keywords,
                                    'operator': 'and'
                                }
                            }
                        }
                    }
                },
                "sort": {"datetime": {"order": "desc"}},
                "from": row_start,
                "size": app.config['ROWS_PER_PAGE']
            }
        res = es.search(body=query_dsl, index=app.config['MONGODB_DB'],
                        doc_type=content[content_search_by]['mongodb_collection'])
        # get total rows and pages
        page_info['total_rows'] = res['hits']['total']
        page_info['total'] = int(math.ceil(page_info['total_rows'] / (app.config['ROWS_PER_PAGE'] * 1.0)))
        # build each result row
        for doc in res['hits']['hits']:
            c = doc['_source']
            c['datetime'] = time.strftime('%Y-%m-%d', time.strptime(c['datetime'], '%Y-%m-%dT%H:%M:%S'))
            if 'url' in c:
                urlsep = c['url'].split('//')[1].split('/')
                c['url_local'] = '%s-%s.html' % (urlsep[1], urlsep[2])
            page_info['rows'].append(c)

    return page_info
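The per-row post-processing in search_mongodb_by_es (reformatting the ES datetime and deriving a local filename from the WooYun URL) can be exercised on its own; a standalone check with a made-up document (the URL and its id are illustrative):

```python
import time

def postprocess(c):
    # ES stores datetimes like '2016-07-01T12:30:00'; templates want '2016-07-01'.
    c['datetime'] = time.strftime('%Y-%m-%d', time.strptime(c['datetime'], '%Y-%m-%dT%H:%M:%S'))
    if 'url' in c:
        # http://host/bugs/wooyun-2010-012345 -> bugs-wooyun-2010-012345.html
        urlsep = c['url'].split('//')[1].split('/')
        c['url_local'] = '%s-%s.html' % (urlsep[1], urlsep[2])
    return c

doc = {'datetime': '2016-07-01T12:30:00',
       'url': 'http://www.wooyun.org/bugs/wooyun-2010-012345'}  # made-up example
print(postprocess(doc)['url_local'])  # bugs-wooyun-2010-012345.html
```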


def get_wooyun_total_count():
client = pymongo.MongoClient(connection_string)
@@ -92,15 +153,18 @@ def search():
    content_search_by = request.args.get('content_search_by', 'by_bugs')
    if page < 1:
        page = 1
    #
    page_info = search_mongodb(
        keywords, page, content_search_by, search_by_html)
    # if elasticsearch is configured, full-text search goes through es;
    # otherwise fall back to mongodb search
    if app.config['SEARCH_BY_ES'] is True and search_by_html is True:
        page_info = search_mongodb_by_es(keywords, page, content_search_by, search_by_html)
    else:
        page_info = search_mongodb(keywords, page, content_search_by, search_by_html)
    #
    return render_template(content[content_search_by]['template_html'], keywords=keywords, page_info=page_info, search_by_html=search_by_html, title=u'搜索结果-乌云公开漏洞、知识库搜索')


def main():
    app.run(host='0.0.0.0', debug=True)
    app.run(host='0.0.0.0', debug=False)

if __name__ == '__main__':
    main()
