Empowering knowledge exploration and generation with LlamaIndex and RAG over the full Wikipedia knowledge base.
Demo video: `RAG-wiki.mp4`
1. Clone this repo:

   ```shell
   git clone https://github.com/lyyf2002/RAG-WIKI
   ```
2. Download the subset of Wikipedia processed by me, which contains only about 200 MB of text.
3. Ensure the files follow this hierarchy (`storage` is the path that stores the index):

   ```
   ROOT
   ├── wiki
   ├── storage
   └── RAG-WIKI
   ```
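Before running the app, it can help to confirm the layout above is in place. A minimal sketch (the `check_layout` helper is illustrative, not part of the repo; it only checks the three top-level names from the tree above):

```python
import tempfile
from pathlib import Path

def check_layout(root):
    """Return the names from the expected ROOT layout (wiki, storage,
    RAG-WIKI) that are missing under `root`."""
    expected = ("wiki", "storage", "RAG-WIKI")
    return [name for name in expected if not (Path(root) / name).exists()]

# Demo against a temporary ROOT containing only two of the three entries:
root = tempfile.mkdtemp()
(Path(root) / "wiki").mkdir()
(Path(root) / "RAG-WIKI").mkdir()
missing = check_layout(root)
print(missing)  # ['storage']
```

Note that `storage` missing is fine on a first run, since `app.py` creates it (see step 7).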
4. To process the full Wikipedia dump or other data, follow steps 5-7.
5. Download the full Wikipedia data you want from the Wikipedia database backup dumps, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.
6. Use wikiextractor to extract cleaned text from the dump:

   ```shell
   wikiextractor -o wiki --json --no-templates enwiki-latest-pages-articles.xml.bz2
   ```
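With `--json`, wikiextractor writes JSON-lines files (one object per line with `id`, `url`, `title`, and `text` fields) under nested directories such as `wiki/AA/wiki_00`. A minimal sketch of iterating over those documents (the `iter_wiki_docs` helper and the synthetic demo file are illustrative, not part of the repo):

```python
import json
import os
import tempfile

def iter_wiki_docs(wiki_dir):
    """Yield (title, text) pairs from wikiextractor's --json output,
    walking every file under wiki_dir (e.g. wiki/AA/wiki_00)."""
    for root, _dirs, files in os.walk(wiki_dir):
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    yield doc["title"], doc["text"]

# Tiny demo with a synthetic file in the same layout (no real dump needed):
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "AA"))
with open(os.path.join(tmp, "AA", "wiki_00"), "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "1", "url": "", "title": "Alan Turing",
                        "text": "Alan Turing was a mathematician."}) + "\n")

docs = list(iter_wiki_docs(tmp))
print(docs[0][0])  # Alan Turing
```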
7. `storage` will be created by `app.py` the first time you run it. You can change the path to load a different index stored earlier.
8. `cd RAG-WIKI`
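The create-or-load behavior described in step 7 is the standard LlamaIndex persistence pattern; a minimal sketch, assuming the pre-0.10 top-level `llama_index` API, an installed `llama-index`, a configured OpenAI-compatible key, and the paths from the hierarchy above (how `app.py` actually does this may differ):

```python
import os

from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "../storage"  # the `storage` path from the hierarchy above

if not os.path.exists(PERSIST_DIR):
    # First run: embed the extracted wiki text and persist the index.
    documents = SimpleDirectoryReader("../wiki", recursive=True).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: reload the previously stored index instead of rebuilding.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
```

Pointing `PERSIST_DIR` at a different path is what lets you switch between previously built indexes.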
9. Update the `api_base` and `api_key` in `app.py`. You can get a free key for testing at https://github.com/chatanywhere/GPT_API_free
10. Install the dependencies:

    ```shell
    pip install streamlit
    pip install llama-index
    pip install langchain
    ```
11. Run the app:

    ```shell
    streamlit run app.py
    ```