textexplorer

Modern Text Exploration Library

A Python library for advanced text analysis and exploration, primarily focused on PDF documents. This library combines various NLP techniques to extract insights from text data.

⚠️ This is a work in progress and should not be used in production.

Features

PDF text extraction
Advanced text analysis using spaCy
Text statistics (Flesch-Kincaid grade level, type-token ratio)
Extraction of:
- Out of vocabulary words
- Noun phrases
- Common words
- Keywords using TextRank
Text clustering capabilities (with OpenAI embeddings)
Visualization of text clusters

Installation

uv pip install text_explorer

Dependencies

spaCy (with en_core_web_lg model)
OpenAI API key (for embeddings)
Other dependencies will be installed automatically

Usage

Basic Text Analysis

from text_explorer import PDFExtractor, TextProcessor

# Extract text from PDF
pdf_extractor = PDFExtractor()
text = pdf_extractor.extract_text("your_document.pdf")

# Initialize text processor
text_processor = TextProcessor(model_name="en_core_web_lg", text=text)

# Get out of vocabulary words (useful for finding PDF parsing errors)
oov_words = text_processor.extract_top_oov_words(k=10)

# Extract noun phrases
noun_phrases = text_processor.extract_top_noun_phrases(k=10)

# Get most common words
common_words = text_processor.extract_top_words(k=10)

# Extract keywords using TextRank
keywords = text_processor.extract_keywords(k=10)

# Get readability metrics
grade_level = text_processor.flesch_kincaid_grade()
token_ratio = text_processor.type_token_ratio()

Clustering (Work in Progress)

from text_explorer import ClusterVisualizer
from sklearn.cluster import KMeans
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize components
text_splitter = SemanticChunker(OpenAIEmbeddings())
clustering_function = KMeans(n_clusters=5, random_state=42)
cluster_visualizer = ClusterVisualizer(clustering_function=clustering_function)

# Split text into chunks
docs = text_splitter.create_documents([text])

# Cluster and visualize
df_clusters = cluster_visualizer.cluster_texts(docs, embeddings)
df_labeled = cluster_visualizer.label_clusters(df_clusters)
cluster_visualizer.visualize_clusters(df_labeled, embeddings)

License & Contributing

Apache 2.0 & Contributing have more details.

Contributions are welcome!

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
dev_data		dev_data
text_explorer		text_explorer
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
usage.py		usage.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

textexplorer

Features

Installation

Dependencies

Usage

Basic Text Analysis

Clustering (Work in Progress)

License & Contributing

About

Releases

Packages

Languages

License

NirantK/textexplorer

Folders and files

Latest commit

History

Repository files navigation

textexplorer

Features

Installation

Dependencies

Usage

Basic Text Analysis

Clustering (Work in Progress)

License & Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages