main_content_extractor

Description

This library extracts only the main content from an HTML document.
It was developed to gather information for LLMs and to prepare input data for LangChain and LlamaIndex.

Because the extracted output preserves HTML element and hierarchy information, it is useful when you need to work with that structure.
For example, you can obtain a list of links or headers from the main content.
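
As a rough illustration of the point above, the following sketch collects links and headers from an HTML fragment using only the Python standard library. The `extracted_html` string is a hypothetical stand-in for the value returned by `MainContentExtractor.extract(content)`, and the `LinkAndHeadingCollector` class is an illustrative helper, not part of this library.

```python
from html.parser import HTMLParser


class LinkAndHeadingCollector(HTMLParser):
    """Collect link targets and heading text from an HTML fragment."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._current_heading = None  # tag name while inside a heading
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in self.HEADING_TAGS:
            self._current_heading = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a heading element.
        if self._current_heading is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_heading:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current_heading = None


# Hypothetical stand-in for the output of MainContentExtractor.extract(content).
extracted_html = """
<div>
  <h1>Web technology</h1>
  <p>See the <a href="/docs/Web/HTML">HTML guide</a>.</p>
  <h2>Tutorials</h2>
  <ul><li><a href="/docs/Learn">Learn web development</a></li></ul>
</div>
"""

collector = LinkAndHeadingCollector()
collector.feed(extracted_html)
links = collector.links
headings = collector.headings
```

A dedicated HTML library would likely be more convenient for real pages; the standard-library parser is used here only to keep the sketch dependency-free.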

trafilatura is an excellent library for main content extraction, but it sometimes drops necessary data and cannot output HTML.
This library exists to address those problems.

The sequence of main content extraction is illustrated in the diagram content_extraction_sequence.png in the repository.

In addition to HTML, output in Text and Markdown formats is supported. This makes it easier to produce data in a format that is convenient for an LLM.
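
To suggest why Markdown output can be convenient for LLM input, here is a minimal sketch that splits Markdown text into sections at its headings, so each section can be passed on separately. The `split_markdown_by_heading` function and the `sample_markdown` string are hypothetical; the sample stands in for text obtained with `output_format="markdown"`.

```python
import re


def split_markdown_by_heading(markdown):
    """Split Markdown text into (heading, body) sections at ATX headings.

    Limitation: lines starting with '#' inside fenced code blocks are
    also treated as headings in this simplified sketch.
    """
    sections = []
    heading, body = None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            # Close the previous section before starting a new one.
            if heading is not None or any(s.strip() for s in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or any(s.strip() for s in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Hypothetical stand-in for Markdown returned by the extractor.
sample_markdown = "# Web\n\nIntro text.\n\n## Guides\n\n- HTML\n- CSS\n"
sections = split_markdown_by_heading(sample_markdown)
```

Text that appears before the first heading is kept as a section whose heading is `None`.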

Main content extraction relies on trafilatura.
Because trafilatura cannot output HTML directly, the content is first output as XML that carries the HTML information and is then converted to HTML.
The conversion from XML to HTML is irreversible, and the resulting HTML does not perfectly match the original data.
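
The XML-to-HTML step described above can be pictured with the following sketch. It is not this library's implementation: the element names (`main`, `head`, `ref`, `list`, `item`), the `rend` and `target` attributes, and the `TAG_MAP` table are simplified assumptions about what such XML might look like. It also shows why the result cannot match the original page exactly, since details such as the original attributes and wrapper elements are absent from the XML.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified mapping from XML element names to HTML tags.
TAG_MAP = {"main": "div", "p": "p", "list": "ul", "item": "li"}


def convert(element):
    """Recursively rebuild an XML element as an HTML element."""
    attrs = {}
    if element.tag == "head":
        # Assumption: the heading level is carried in the "rend" attribute.
        html_tag = element.get("rend", "h2")
    elif element.tag == "ref":
        # Assumption: link targets are stored in a "target" attribute.
        html_tag = "a"
        attrs = {"href": element.get("target", "")}
    else:
        # Unknown elements fall back to a generic container.
        html_tag = TAG_MAP.get(element.tag, "div")
    new = ET.Element(html_tag, attrs)
    new.text = element.text
    new.tail = element.tail
    for child in element:
        new.append(convert(child))
    return new


# Hypothetical XML resembling an intermediate extraction result.
sample_xml = (
    '<main><head rend="h1">Title</head>'
    '<p>Read the <ref target="https://example.com">docs</ref>.</p>'
    '<list><item>First</item><item>Second</item></list></main>'
)

html = ET.tostring(
    convert(ET.fromstring(sample_xml)), encoding="unicode", method="html"
)
```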

Installation

pip install MainContentExtractor

How to Use

import requests
from main_content_extractor import MainContentExtractor

# Fetch the page HTML using requests
url = "https://developer.mozilla.org/ja/docs/Web"
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors
response.encoding = 'utf-8'
content = response.text

# Extract the main content and get it as HTML (the default output)
extracted_html = MainContentExtractor.extract(content)

# Extract the main content and get it as Markdown
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")
