Project for the Information Retrieval course: build a tool for crawling data from the internet, then build a search engine on top of the crawled data.
- The project uses:
- Scrapy + Splash: crawl data from websites (kenh14.vn and dantri.com)
- MySQL: store the crawled data
- Elasticsearch + Logstash: build the search engine
- Create a table in the database with the following columns (a creation sketch follows this list):
- id: bigint
- url: varchar(255)
- title: tinytext
- time: datetime
- short_content: mediumtext
- full_content: longtext
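For convenience, here is a minimal sketch of creating that table from Python. The pymysql dependency, the connection values, the table name "news", and the AUTO_INCREMENT primary key are all assumptions for illustration, not values taken from this repository.

```python
# Sketch only: creates the article table described above.
# pymysql and every connection value (host, user, password, database,
# table name "news") are illustrative assumptions; adjust them to match
# the configuration used by the spiders.
import pymysql

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS news (
    id            BIGINT AUTO_INCREMENT PRIMARY KEY,
    url           VARCHAR(255),
    title         TINYTEXT,
    time          DATETIME,
    short_content MEDIUMTEXT,
    full_content  LONGTEXT
)
"""

conn = pymysql.connect(host="localhost", user="root",
                       password="your_password", database="news")
try:
    with conn.cursor() as cursor:
        cursor.execute(CREATE_TABLE)
    conn.commit()
finally:
    conn.close()
```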
- Crawl data (folder 'news'):
- Install environment:
- Pull the Splash Docker image: sudo docker pull scrapinghub/splash
- Run the Splash container: sudo docker run -p 8050:8050 scrapinghub/splash
- Install Scrapy and the Splash plugin: pip3 install scrapy scrapy-splash
- Crawl from dantri.com and kenh14.vn (a minimal spider sketch is shown after this section)
- Duplicate content is detected with Jaccard similarity against the articles already stored in MySQL (see the sketch after this section)
- Configure your database connection in /news/news/spiders/kenh14_spider.py and /news/news/spiders/dantri_spider.py
- Start crawling:
- cd news
- scrapy crawl kenh14
- scrapy crawl dantri
- /news/csvTomysql.py: loads the data from your.csv into MySQL
- The crawled data is stored in the MySQL database
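For reference, a minimal Scrapy + Splash spider sketch, as mentioned above. The spider name, start URL, and CSS selectors are placeholders; the real spiders live in /news/news/spiders/ and target kenh14.vn and dantri.com.

```python
# Illustrative sketch, not one of the project's actual spiders.
# Assumes scrapy-splash is wired up in settings.py (SPLASH_URL pointing to
# the Docker container, e.g. http://localhost:8050, plus the scrapy_splash
# downloader middlewares described in the scrapy-splash documentation).
import scrapy
from scrapy_splash import SplashRequest


class ExampleNewsSpider(scrapy.Spider):
    name = "example_news"                  # placeholder spider name
    start_urls = ["https://dantri.com/"]   # placeholder start URL

    def start_requests(self):
        for url in self.start_urls:
            # Let Splash render the page (JavaScript included) before parsing.
            yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # Placeholder selectors; the real spiders use site-specific selectors.
        for article in response.css("article"):
            yield {
                "url": response.urljoin(article.css("a::attr(href)").get("")),
                "title": article.css("h3::text").get(),
            }
```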
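The Jaccard-based duplicate check mentioned above can be sketched as follows. The word-level tokenization and the 0.8 threshold are assumptions for illustration; the actual logic lives inside the spiders.

```python
# Illustrative sketch of Jaccard-similarity duplicate detection.
# Tokenization and the 0.8 threshold are assumptions, not values taken
# from this repository.

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two word sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(new_content, stored_contents, threshold=0.8):
    """True if new_content is too similar to any article already in MySQL."""
    return any(jaccard_similarity(new_content, old) >= threshold
               for old in stored_contents)
```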
- Build the search engine:
- Install environment:
- Install Elasticsearch
- Install Logstash
- Create the index in Elasticsearch (see elasticsearch.txt)
- Copy the file logstash_mysql.conf to /usr/share/logstash/
- Push data from MySQL to the Elasticsearch index: sudo bin/logstash --path.settings /etc/logstash/ -f logstash_mysql.conf
- Test search in Elasticsearch: cd /src, then run python serachEngine.py -i "index" -s "search_word" (a query sketch is shown below)
- The UI is in /interface/index.php
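For reference, a minimal sketch of the kind of query the search script runs, as noted above. The field name full_content and the use of the official elasticsearch Python client are assumptions for illustration, not details taken from this repository.

```python
# Illustrative sketch of querying the Elasticsearch index from Python.
# Assumes the official "elasticsearch" client (7.15+ / 8.x, where the
# query= keyword is available) and a full-text field named "full_content";
# both are assumptions.
import argparse
from elasticsearch import Elasticsearch

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--index", required=True, help="index name")
parser.add_argument("-s", "--search", required=True, help="search phrase")
args = parser.parse_args()

es = Elasticsearch("http://localhost:9200")

result = es.search(index=args.index,
                   query={"match": {"full_content": args.search}})

for hit in result["hits"]["hits"]:
    source = hit["_source"]
    print(source.get("title"), source.get("url"))
```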