PragmaticMachineLearning/docai

DocAI

Extract structured data from unstructured documents using Answer.AI's Byaldi, OpenAI's gpt-4o, and LangChain's structured output.

Installation

pyenv virtualenv 3.10.6 docai
pyenv activate docai
poetry install

Environment variables

Make sure both OPENAI_API_KEY and HF_TOKEN are set in your environment:

export OPENAI_API_KEY=<your key>
export HF_TOKEN=<your token>
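
To catch a missing key before any model is loaded, a small pre-flight check can help. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and only the two variable names come from this README.

```python
import os

# The two variables this README asks for.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def check_environment():
    """Raise a readable error if any required variable is missing."""
    missing = missing_env_vars()
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `check_environment()` at the top of a script would fail fast with the names of the missing variables instead of a later authentication error.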

Sample usage

Build the index from the pdfs/ folder:

python scripts/build_index.py --folder "pdfs/" --index_name "application"
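
For orientation, an index-building script with these two flags could look roughly like the sketch below. It is an illustration under assumptions, not the contents of scripts/build_index.py: the function names, the checkpoint name `vidore/colpali-v1.2`, and the choice of `overwrite=True` are all assumed here. Only the `--folder` and `--index_name` flags come from the command above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the README command."""
    parser = argparse.ArgumentParser(
        description="Build a Byaldi index from a folder of PDFs."
    )
    parser.add_argument("--folder", required=True, help="Folder containing the PDFs")
    parser.add_argument("--index_name", required=True, help="Name for the saved index")
    return parser.parse_args(argv)


def build_index(folder, index_name, model_name="vidore/colpali-v1.2"):
    """Index every document in `folder` with Byaldi.

    The checkpoint name is an assumption; the repository may load a
    different model. Byaldi is imported lazily so that argument parsing
    works even when the package is not installed.
    """
    from byaldi import RAGMultiModalModel

    model = RAGMultiModalModel.from_pretrained(model_name)
    model.index(
        input_path=folder,
        index_name=index_name,
        store_collection_with_index=False,  # keep the saved index small
        overwrite=True,  # assumed: rebuild if the index already exists
    )
    return model
```

A script entry point would then call `args = parse_args()` followed by `build_index(args.folder, args.index_name)`.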

Extract structured information from the index (open scripts/extract.py to see the queries and Pydantic models):

python scripts/extract.py

Sample output

What losses have occurred in the past 5 years?
LossHistory(
    losses=[
        Loss(loss_date='2/20/21', loss_amount=7003.0, loss_description='Claimant was in his sleeper when his truck got hit by insured driver on the left', date_of_claim='4/19/21'),
        Loss(loss_date='2/4/21', loss_amount=92584.0, loss_description='The IV was attempting to merge on the highway when the IV lost control and struck', date_of_claim='4/30/21'),
        Loss(loss_date='9/14/21', loss_amount=5583.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21'),
        Loss(loss_date='9/14/21', loss_amount=6299.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21')
    ]
)

What is the basic application information?
Application(
    insured_name='Greentown Burgers LLC',
    insured_address='Not provided',
    insured_phone='Not provided',
    insured_email='Not provided',
    effective_date='07/22/2024'
)
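
The sample output above implies a schema along the following lines. This is a sketch reconstructed from the printed field names and values, not a copy of the models in scripts/extract.py: the field types are inferred, and the helper `structured_llm_for` is hypothetical. It shows one way to bind a schema to gpt-4o with LangChain's `with_structured_output`; the retrieval step that feeds document pages to the model is omitted.

```python
from typing import List

from pydantic import BaseModel


class Loss(BaseModel):
    # Dates are kept as strings because the sample output prints them that way.
    loss_date: str
    loss_amount: float
    loss_description: str
    date_of_claim: str


class LossHistory(BaseModel):
    losses: List[Loss]


class Application(BaseModel):
    insured_name: str
    insured_address: str
    insured_phone: str
    insured_email: str
    effective_date: str


def structured_llm_for(schema, model_name="gpt-4o"):
    """Return a chat model whose responses are parsed into `schema`.

    Hypothetical helper: requires langchain-openai and OPENAI_API_KEY,
    so the import is deferred until the function is actually called.
    """
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=model_name).with_structured_output(schema)


# Rebuilding the first record from the sample output by hand:
first_loss = Loss(
    loss_date="2/20/21",
    loss_amount=7003.0,
    loss_description="Claimant was in his sleeper when his truck got hit by insured driver on the left",
    date_of_claim="4/19/21",
)
history = LossHistory(losses=[first_loss])
```

Adding a new extraction target then amounts to defining another model and pairing it with a question, for example `structured_llm_for(Application)`.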
