This project is an example of GraphRAG, providing a system for processing documents, extracting entities and relationships, and managing them in a SQLite database. It leverages OpenAI's GPT models for natural language processing tasks and SQLite for database management.
- `app.py`: Main application script that initializes components and runs the document processing and querying workflow.
- `graph_manager.py`: Manages the SQLite database, including building and reprojecting the graph, calculating centrality measures, and managing graph operations.
- `query_handler.py`: Handles user queries by leveraging the graph data and OpenAI's GPT models for natural language processing.
- `document_processor.py`: Processes documents by splitting them into chunks, extracting entities and relationships, and summarizing them.
- `graph_database.py`: Manages the connection to the SQLite database.
- `logger.py`: Provides a logging utility to log messages to both console and file with configurable log levels.
- Clone the repository:

  ```sh
  git clone git@github.com:stephenc222/example-graphrag-with-sqlite.git
  cd example-graphrag-with-sqlite
  ```

- Install dependencies:

  ```sh
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the root directory and add the following variables:

  ```sh
  OPENAI_API_KEY=your_openai_api_key
  DB_PATH=your_sqlite_db_path
  LOG_LEVEL=INFO # Optional, default is INFO
  ```

- Initialize the SQLite database:

  ```sh
  python initialize_db.py
  ```

- Run the application:

  ```sh
  python app.py
  ```
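Under the hood, the application presumably reads these values at startup. A minimal sketch, assuming the `python-dotenv` package from the dependency list (the fallback `DB_PATH` shown here is illustrative):

```python
import os
from dotenv import load_dotenv

# Read OPENAI_API_KEY, DB_PATH, and LOG_LEVEL from the .env file.
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DB_PATH = os.getenv("DB_PATH", "data/graph_database.sqlite")  # fallback is illustrative
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")  # default is INFO

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is required; add it to your .env file")
```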
- Initial Indexing: The application will first index the initial set of documents defined in `constants.py` as `DOCUMENTS`.
- Querying: After indexing, the application will handle a predefined query to extract themes from the documents. Centrality measures will also be calculated to enhance the query responses.
- Reindexing with New Documents: The application will then add new documents defined in `constants.py` as `DOCUMENTS_TO_ADD_TO_INDEX` and reindex the graph.
- Second Query: After reindexing, the application will handle another predefined query to extract themes from the updated set of documents.
`app.py`

- Overview: Acts as the entry point of the application.
- Responsibilities:
- Initializes the components: logger, document processor, graph manager, and query handler.
- Handles the main workflow (see the sketch after this list):
- Performs initial indexing of documents.
- Executes a user query.
- Reindexes the graph with new documents.
- Runs a second user query based on the updated graph.
- Uses the logging utility to track the workflow progress.
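A condensed sketch of that workflow. Apart from `ask_question_with_centrality`, `GraphManager(db_path)`, and the `constants.py` names, which appear elsewhere in this README, the constructor and method names below are assumptions, not the project's exact API:

```python
from constants import DOCUMENTS, DOCUMENTS_TO_ADD_TO_INDEX
from document_processor import DocumentProcessor  # hypothetical import paths
from graph_manager import GraphManager
from query_handler import QueryHandler

processor = DocumentProcessor()
graph_manager = GraphManager("data/graph_database.sqlite")
query_handler = QueryHandler(graph_manager)

# 1. Initial indexing: extract and summarize entities/relationships.
graph_manager.build_graph(processor.process_documents(DOCUMENTS))

# 2. First query against the initial graph.
print(query_handler.ask_question_with_centrality("What are the main themes in these documents?"))

# 3. Reindex with the additional documents, then query again.
graph_manager.reindex_graph(processor.process_documents(DOCUMENTS_TO_ADD_TO_INDEX))
print(query_handler.ask_question_with_centrality("What are the main themes in these documents?"))
```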
`graph_manager.py`

- Overview: Manages graph-related operations in the SQLite database.
- Responsibilities:
- Builds the graph from document summaries.
- Reprojects the graph for community and centrality analysis.
- Performs calculations such as degree centrality, betweenness centrality, and closeness centrality.
- Supports reindexing with new documents and recalculating centrality measures.
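The dependency list below includes no graph library, so these measures are presumably computed from the stored edge list. A sketch of degree centrality over a hypothetical `edges(source, target)` table (the schema is an assumption; isolated nodes are ignored for simplicity):

```python
import sqlite3

def degree_centrality(db_path: str) -> dict[str, float]:
    """Degree centrality: a node's edge count divided by (n - 1)."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        # Count edges touching each node in either direction.
        "SELECT node, COUNT(*) FROM ("
        "  SELECT source AS node FROM edges"
        "  UNION ALL"
        "  SELECT target AS node FROM edges"
        ") GROUP BY node"
    ).fetchall()
    conn.close()
    n = len(rows)
    return {node: degree / (n - 1) for node, degree in rows} if n > 1 else {}
```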
`query_handler.py`

- Overview: Handles natural language queries.
- Responsibilities:
- Extracts answers from the graph using centrality measures.
- Uses OpenAI GPT models to provide concise answers based on graph data and centrality results.
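A sketch of how that combination might look, using the current `openai` Python client (the model name, prompt wording, and function signature are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_centrality(question: str, centrality_summary: str) -> str:
    """Fold centrality results into the prompt so the model can weight key entities."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[
            {"role": "system", "content": "Answer concisely using the graph context provided."},
            {"role": "user", "content": f"Centrality data:\n{centrality_summary}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```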
`document_processor.py`

- Overview: Manages the extraction and summarization of entities and relationships from documents.
- Responsibilities:
- Splits documents into chunks.
- Extracts entities and relationships from the chunks using OpenAI GPT models.
- Summarizes the extracted entities and relationships for graph processing.
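For illustration, a simple character-window chunker (the chunk size and overlap are arbitrary placeholders; the project's actual splitting strategy may differ):

```python
def split_into_chunks(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split a document into overlapping character windows so entity
    mentions that span a chunk boundary are not lost."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks
```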
`graph_database.py`

- Overview: Manages the SQLite database connection.
- Responsibilities:
- Provides utility functions to connect to the SQLite database.
- Clears the database if necessary.
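A plausible shape for these utilities, assuming `nodes` and `edges` tables (the table names are assumptions):

```python
import sqlite3

def get_connection(db_path: str) -> sqlite3.Connection:
    """Open a connection to the SQLite graph database."""
    return sqlite3.connect(db_path)

def clear_database(db_path: str) -> None:
    """Drop all graph data, e.g. before a full re-index."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("DELETE FROM edges")
        conn.execute("DELETE FROM nodes")
```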
`logger.py`

- Overview: Provides a logging utility for the application.
- Responsibilities:
- Logs messages to both console and file.
- Supports configurable log levels via the `LOG_LEVEL` environment variable.
- Ensures logs are created in the correct format.
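A sketch of such a utility built on the standard `logging` module (the handler setup and log file name are illustrative):

```python
import logging
import os

def get_logger(name: str) -> logging.Logger:
    """Log to both console and a file at the level given by LOG_LEVEL (default INFO)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        logger.setLevel(os.getenv("LOG_LEVEL", "INFO").upper())
        formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler("app.log")):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```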
Centrality measures identify the most important nodes (entities) in a graph based on their structural properties. In this project, they surface key themes and influential concepts in the documents.
- Degree Centrality: Measures how many connections a node has. Nodes with a high degree centrality are the most connected and can represent key topics or ideas in the document set.
- Betweenness Centrality: Identifies nodes that act as bridges between other nodes. Nodes with high betweenness centrality often represent concepts that connect different themes.
- Closeness Centrality: Measures how quickly a node can reach all other nodes. Entities with high closeness centrality are well-connected to all other entities and can be key summarizers or connectors of information.
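For intuition, here is a tiny worked example of degree centrality on an in-memory edge list (the entity names are made up; the application derives its measures from the SQLite graph instead):

```python
from collections import Counter

edges = [("AI", "Ethics"), ("AI", "Healthcare"), ("Ethics", "Healthcare"), ("AI", "Robotics")]

# Count the edges touching each node.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

n = len(degree)  # 4 nodes, so normalize by n - 1 = 3
centrality = {node: count / (n - 1) for node, count in degree.items()}
# AI: 3/3 = 1.0 (most connected), Ethics and Healthcare: 2/3, Robotics: 1/3
```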
In the application, these measures are exposed through `GraphManager`:

```python
graph_manager = GraphManager(db_path)
graph_manager.calculate_centrality_measures()
```
- Initial Indexing: The system processes an initial set of documents, extracting entities and relationships and storing them in a SQLite graph.
- Querying: A user query is handled by leveraging the centrality measures calculated from the graph, providing an intelligent answer using the OpenAI GPT model.
- Reindexing: The system reindexes the graph when new documents are added, recalculates the centrality measures, and processes another user query.
```python
query = "What are the main themes in these documents?"
answer = query_handler.ask_question_with_centrality(query)
print(f"Answer: {answer}")
```
Each component has its own logger, ensuring that log messages provide insight into the progress of document processing, graph operations, and query handling. The log level can be configured at runtime via the `LOG_LEVEL` environment variable.
- `openai`: For interacting with OpenAI's GPT models.
- `dotenv`: For loading environment variables from a `.env` file.
- `sqlite3`: For interacting with the SQLite database.
- `pickle`: For saving and loading processed data.
- `logging`: For tracking workflow progress across the application.
After running `app.py` to populate the SQLite database, use the following command to export the graph data for D3.js visualization:

```sh
python export_graph_data.py data/graph_database.sqlite
```

Use a static file server to serve the `public` directory:

```sh
python -m http.server --directory public 8000
```

Then navigate to http://localhost:8000/ to view the graph visualization.
This project is licensed under the MIT License. See the LICENSE file for details.