SambaNova AI Starter Kits

Data Extraction Examples

Overview

This kit includes a series of notebooks that demonstrate various methods for extracting text from documents in different input formats, including Markdown, PDF, CSV, RTF, DOCX, XLS, and HTML.

Getting started

Deploy the starter kit

Option 1: Run through local virtual environment

Important: With this option, some functionalities require you to install the following packages directly on your system:

pandoc (for local rtf files loading)

tesseract-ocr (for PDF ocr and table extraction)

poppler-utils (for PDF ocr and table extraction)
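Before running the notebooks locally, it can help to confirm that these system packages are actually on your PATH. The following is a minimal sketch, not part of the kit: the helper name `find_missing_tools` is hypothetical, and the executable names are assumptions (`pandoc` for pandoc, `tesseract` for tesseract-ocr, and `pdftoppm` as one of the tools shipped with poppler-utils).

```python
import shutil

# Package name -> executable expected on PATH.
# Assumed names: adjust them if your distribution installs different binaries.
REQUIRED_TOOLS = {
    "pandoc": "pandoc",
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All system packages found.")
```

If anything is reported missing, install it with your system package manager before opening the RTF or PDF notebooks.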

Clone repo.

git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

2.1 Install requirements: It is recommended to use a virtualenv or conda environment for installation.

cd ai-starter-kit
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirements.txt

2.2 Install requirements for the Paddle utility: It is recommended to use a virtualenv or conda environment for installation.

Use this option if you want to use the Paddle OCR recipe for PDF OCR and table extraction. In that case, use the requirementsPaddle file instead.

cd ai-starter-kit
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirementsPaddle.txt

Some text extraction examples use the Unstructured library. Please register at Unstructured.io to get a free API key, then create an environment file to store the API key and URL provided.

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="your_API_url_here"\n' > export.env
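To make the role of the environment file concrete, here is a minimal sketch of how a notebook could read it using only the standard library. The helper names `load_env_file` and `apply_env` are hypothetical and are not taken from the kit, which may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values):
    """Copy parsed values into os.environ without overwriting existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

A typical call would be `apply_env(load_env_file("export.env"))`, after which the key is available as `os.environ["UNSTRUCTURED_API_KEY"]`.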

Option 2: Run via Docker

With this option, all functionalities and notebooks are ready to use.

You need to have the Docker engine installed; see the Docker installation instructions.

Clone repo.

git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

Some text extraction examples use the Unstructured library. Please register at Unstructured.io to get a free API key, then create an environment file to store the API key and URL provided.

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="your_API_url_here"\n' > export.env

3.1 Run data extraction docker container

sudo docker-compose up data_extraction_service

3.2 Run data extraction docker container for Paddle utility

Use this option if you want to use the Paddle OCR recipe for PDF OCR and table extraction. In that case, use the startPaddle script instead.

sudo docker-compose up data_extraction_paddle_service

File loaders

You will find several data extraction recipes and pipelines in the notebooks folder as follows:

CSV Documents

csv_extraction.ipynb: This notebook provides examples of text extraction from CSV files using different packages. Depending on your specific use case, some packages may perform better than others.
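As an illustration of what text extraction from a CSV file can look like, here is a minimal sketch that uses only Python's standard `csv` module. It is not the notebook's code, and the function name `csv_to_text_records` and the sample data are hypothetical. Each row becomes one small block of "column: value" lines, a form that is convenient to pass on to downstream text pipelines.

```python
import csv
import io


def csv_to_text_records(csv_text):
    """Turn each CSV row into one block of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        lines = [f"{column}: {value}" for column, value in row.items()]
        records.append("\n".join(lines))
    return records


if __name__ == "__main__":
    sample = "name,role\nAda,engineer\nGrace,admiral\n"
    for record in csv_to_text_records(sample):
        print(record)
        print("---")
```

Keeping the column name next to every value preserves context that would be lost if rows were simply joined with commas.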

XLS/XLSX Documents

xls_extraction.ipynb: This notebook provides examples of text extraction from files in different input formats using the Unstructured library. Section 2 includes two loading examples: the first uses the Unstructured API and the second uses the local Unstructured loader.

DOC/DOCX Documents

docx_extraction.ipynb: This notebook provides examples of text extraction from files in different input formats using the Unstructured library. Section 3 includes two loading examples: the first uses the Unstructured API and the second uses the local Unstructured loader.

RTF Documents

rtf_extraction.ipynb: This notebook provides examples of text extraction from files in different input formats using the Unstructured library. Section 4 includes two loading examples: the first uses the Unstructured API and the second uses the local Unstructured loader.

Markdown Documents

markdown_extraction.ipynb: This notebook provides examples of text extraction from files in different input formats using the Unstructured library. Section 5 includes two loading examples: the first uses the Unstructured API and the second uses the local Unstructured loader.

HTML Documents

web_extraction.ipynb: This notebook provides examples of text extraction from files in different input formats using the Unstructured library. Section 6 includes two loading examples: the first uses the Unstructured API and the second uses the local Unstructured loader.
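For readers who want to see the basic idea of HTML text extraction without any external service, the following is a minimal standard-library sketch. It does not use Unstructured and is not the notebook's method; the class name `TextExtractor` and the function `html_to_text` are hypothetical. It collects visible text and skips the contents of `script` and `style` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A dedicated library will typically also classify elements (titles, lists, tables), which this sketch does not attempt.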

PDF Documents

pdf_extraction.ipynb: This notebook provides examples of text extraction from PDF documents using different packages, including both OCR and non-OCR packages. Depending on your specific use case, some packages may perform better than others.
retrieval_from_pdf_tables.ipynb: This notebook provides an example of a simple RAG retriever and an example of a multivector RAG retriever for retrieval from PDFs that contain tables. For SambaNova model endpoint usage, refer to the ai-starter-kit docs.
qa_qc_util.ipynb: This notebook offers a simple utility for visualizing text boxes extracted using the Fitz package. This visualization can be particularly helpful when dealing with complex multi-column PDF documents, and in the debugging process.
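The multivector retrieval idea mentioned for retrieval_from_pdf_tables.ipynb can be sketched in a few lines: each document (for example, a raw table) is indexed under several short representations such as summaries, the query is matched against those representations, and the full parent document is returned. The sketch below is a toy under stated assumptions and is not the notebook's implementation: `ToyMultiVectorRetriever` is a hypothetical name, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMultiVectorRetriever:
    """Index several short representations per document; return the parent."""

    def __init__(self):
        self.parents = {}  # doc_id -> full content (e.g. a raw table)
        self.vectors = []  # (embedding, doc_id) pairs, several per document

    def add(self, doc_id, content, representations):
        self.parents[doc_id] = content
        for text in representations:
            self.vectors.append((embed(text), doc_id))

    def retrieve(self, query, k=1):
        if k < 1:
            return []
        query_vec = embed(query)
        ranked = sorted(
            self.vectors,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        doc_ids = []
        for _, doc_id in ranked:
            if doc_id not in doc_ids:  # keep each parent only once
                doc_ids.append(doc_id)
            if len(doc_ids) == k:
                break
        return [self.parents[doc_id] for doc_id in doc_ids]
```

The design point is that the text used for matching (a summary) is decoupled from the text returned to the model (the full table), which matters because raw tables often embed poorly.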

Included files

data: Contains sample data for running the notebooks, and is used as storage for intermediate steps for recipes.
src: Contains the source code for some functionalities used in the notebooks.
docker: Contains the Dockerfile for the data extraction starter kit.
