Knowledge Extraction and Graph Construction

This code extracts knowledge from the provided AWS S3 documentation URL, performs Named Entity Recognition (NER), identifies relationships between entities, generates insert queries, and constructs a knowledge graph. It relies on BeautifulSoup, requests, NLTK, BERT, GPT-2, NetworkX, and Matplotlib.

Approach Used to Solve

Text Extraction: The script extracts text content from a specified URL using BeautifulSoup and requests. Only the first 25 lines of text are considered.
Text Preprocessing: The extracted text is preprocessed using NLTK for sentence segmentation, tokenization, lowercasing, stopword removal, lemmatization, and custom cleaning. This results in a clean and processed text.
Named Entity Recognition (NER): The script uses a pre-trained BERT model for token classification to perform NER. Entities such as persons, organizations, and locations are extracted from the preprocessed text.
Relationship Identification: Two entities are treated as related when they occur within a specified window size of each other in the text, so relationships are based on proximity rather than on grammatical structure.
Insert Query Generation: GPT-2 is employed to generate insert queries from combinations of the identified entities and relationships.
Knowledge Graph Construction: A knowledge graph is constructed using NetworkX to represent entities as nodes, relationships as edges, and insert queries as node attributes. The graph is visualized using Matplotlib in a circular layout.
Accuracy, Quality, and Relevance Calculation: The script dynamically generates ground truth entities and relationships and calculates accuracy, quality, and relevance based on the NER and relationship identification results.
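
The proximity rule in the Relationship Identification step can be illustrated with a short sketch. This is not the notebook's code: the function name `find_relationships`, the default `window_size=5`, the sample sentence, and the restriction to single-token entities are all assumptions made for illustration.

```python
from itertools import combinations


def find_relationships(tokens, entities, window_size=5):
    """Pair entities whose token positions are at most window_size apart.

    Hypothetical helper: assumes each entity is a single token.
    """
    # Keep (position, token) for every token that is a known entity.
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    relationships = []
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and abs(i - j) <= window_size:
            # Store each unordered pair only once.
            if (a, b) not in relationships and (b, a) not in relationships:
                relationships.append((a, b))
    return relationships


# Illustrative input, not taken from the AWS S3 documentation.
tokens = "amazon s3 stores objects in buckets across aws regions".split()
entities = {"amazon", "s3", "buckets", "aws"}
print(find_relationships(tokens, entities, window_size=3))
```

A larger window links entities that sit further apart, at the cost of more spurious pairs; with `window_size=4` the same input also links `s3` and `buckets`.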

Tools Used

BeautifulSoup: For HTML parsing
requests: For fetching content from URLs
NLTK: For Natural Language Processing tasks
Models used
- BERT (Bidirectional Encoder Representations from Transformers):
  - Model: BertForTokenClassification
  - Tokenizer: BertTokenizer
  - Usage: Named Entity Recognition (NER)
- GPT-2 (Generative Pre-trained Transformer 2):
  - Model: GPT2LMHeadModel
  - Tokenizer: GPT2Tokenizer
  - Usage: Insert query generation and language modeling
NetworkX: For graph construction
Matplotlib: For graph visualization

Assumptions

The relevant text on the webpage is enclosed in specific HTML tags.
The maximum token length for BERT model input is set to 256.
The script assumes a certain window size for relationship identification; users can adjust this based on their requirements.
Because of OpenAI API key constraints, I used GPT-2 for insert query generation; users can experiment with other GPT-2 variants as applicable.
Because of external DB constraints, the script visualizes the knowledge graph with a circular layout. If required, users can choose any other layout available in NetworkX or use an external DB instead.
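
The tag-based extraction assumption, together with the 25-line limit from the Text Extraction step, can be sketched as follows. The script itself uses BeautifulSoup and requests; this sketch swaps in the standard-library `html.parser` so it runs without extra packages and without network access. The names `TagTextExtractor` and `extract_lines`, the default `tag="p"`, and the sample HTML are hypothetical, and the sketch assumes every opening tag has an explicit closing tag.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0        # greater than 0 while inside the target tag
        self.chunks = []      # one cleaned string per completed tag
        self._current = []    # raw text pieces of the tag being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._current).split())
                if text:
                    self.chunks.append(text)
                self._current = []

    def handle_data(self, data):
        if self.depth > 0:
            self._current.append(data)


def extract_lines(html, tag="p", max_lines=25):
    """Return the text of the first max_lines occurrences of tag."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.chunks[:max_lines]


# Illustrative HTML, not the real AWS S3 page.
sample = (
    "<html><body><h1>Title</h1>"
    "<p>Amazon S3 stores <b>objects</b>.</p>"
    "<p>  Buckets hold objects. </p>"
    "</body></html>"
)
print(extract_lines(sample))
```

Keeping the tag name and the line limit as parameters makes it easy to adapt the extraction to pages that are structured differently.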

Achieved Accuracy, Quality and Relevance

Accuracy - 0.0
Quality - 1.0
Relevance - 1.0
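
This README does not state how the three scores are defined. As a point of comparison only, a common set-based way to score extracted entities against a reference set is precision and recall, sketched below. The function name `precision_recall` and the example entity sets are hypothetical; they are not the notebook's formulas or results.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted items against a reference."""
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    # Guard against empty sets to avoid division by zero.
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Made-up example sets, for illustration only.
predicted_entities = {"amazon", "s3", "bucket"}
reference_entities = {"amazon", "s3", "aws", "region"}
precision, recall = precision_recall(predicted_entities, reference_entities)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scores of this kind depend heavily on how the reference set is built, so a reference generated from the same pipeline output should be interpreted with care.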

Usage

Install the required libraries (torch is needed by the transformers models and is imported below):

pip install beautifulsoup4 requests nltk transformers torch networkx matplotlib

Import the corresponding modules:

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertForTokenClassification, GPT2LMHeadModel, GPT2Tokenizer
import torch
import gc
import networkx as nx
import matplotlib.pyplot as plt

Download the NLTK resources:

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

kraviteja95/Knowledge-Graph-Using-LLMs
