- Add hi_res_model_name to partition and deprecate model_name
- Bump unstructured to 0.12.0
- Add support for returning extracted image blocks as base64 encoded data stored in metadata fields
- Bump unstructured to 0.11.6
- Handle invalid hi_res_model_name kwarg
- Enable self-hosted authorization using UNSTRUCTURED_API_KEY env variable
- Bump unstructured to 0.11.0
- Bump unstructured to 0.10.30
- Make sure `multipage_sections` param defaults to `true` as per the readme
- Bump unstructured to 0.10.29
- Add `max_characters` param for chunking. This param gives users additional control to "chunk" elements into larger or smaller `CompositeElement`s
- Bump unstructured to 0.10.28
- Make sure chipperv2 is called when `hi_res_model_name==chipper`
- Bump unstructured to 0.10.26
- Bring parent_id metadata field back after fixing a backwards compatibility bug
- Restrict Chipper usage to one request at a time. The model is very resource intensive, and this will prevent issues while we improve it.
- Bump unstructured to 0.10.25
- Use a generator when splitting pdfs in parallel mode
- Add a default memory minimum for 503 check
- Fix an UnboundLocalError when an invalid docx file is caught
- Bump unstructured to 0.10.23
- Simplify the error message for BadZipFile errors
- Bump unstructured to 0.10.21
- Fix an unhandled error when a non-pdf file is sent with content-type pdf
- Fix an unhandled error when a non-docx file is sent with content-type docx
- Fix an unhandled error when a non-Unstructured json schema is sent
- Bump unstructured to 0.10.19
- Bump unstructured to 0.10.18
- Remove spurious whitespace in `app-start.sh`. This fixes deployments in some envs such as Google Cloud Run.
- Adds `languages` kwarg. `ocr_languages` will eventually be deprecated and replaced by `languages` to specify what languages to use for OCR
- Adds a startup log and other minor cleanups
- Adds `chunking_strategy` kwarg and associated params. These params allow users to "chunk" elements into larger or smaller `CompositeElement`s
- Remove `parent_id` from the element metadata. New metadata fields are causing errors with existing installs. We'll re-add this once a fix is widely available.
- Fix some pdfs incorrectly returning a "file is encrypted" error. The `pypdf.is_encrypted` check caused us to return this error even if the file is readable.
- Bump unstructured to 0.10.16
- Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
- Bump unstructured to 0.10.15
- Bump unstructured to 0.10.14
- Improve parallel mode retry handling
- Improve logging during error handling. We don't need to log stack traces for expected errors.
- Bump unstructured to 0.10.13
- Bump unstructured-inference to 0.5.25
- Remove dependency on unstructured-api-tools
- Add a top level error handler for more consistent response bodies
- Tesseract minor version bump to 5.3.2
- Update readme for parameter `hi_res_model_name`
- Fix a bug using `hi_res_model_name` in parallel mode
- Bump unstructured library to 0.10.12
- Bump unstructured-inference to 0.5.22
- Bump unstructured library to 0.10.8
- Bump unstructured-inference to 0.5.17
- Reject traffic when overloaded via `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`
- Docker image built with Python 3.10 rather than 3.8
- Fix wrong handling of param `skip_infer_table_types`
- Pin `safetensors` to fix a build error with 0.0.38
- Fix page break has None page number bug
- Bump unstructured to 0.10.5
- Bump unstructured-ingest to 0.5.15
- Fix UnboundLocalError using pdfs in parallel mode
- Bump unstructured to 0.10.4
- Fix a bug in parallel mode causing `not a valid pdf` errors
- Bump unstructured to 0.10.2, unstructured-inference to 0.5.13
- Bump unstructured library to 0.9.2
- Fix a misleading error in make docker-test
- Bump unstructured library to 0.9.0
- Add table support for images with parameter `skip_infer_table_types`
- Add support for gzipped files
- Image tweak, move application entrypoint to scripts/app-start.sh
- Throw 400 error if a PDF is password protected
- Improve logging of params to single line json
- Add support for `include_page_breaks` parameter
- Support model name as api parameter
- Add retry parameters on fanout requests
- Bump unstructured library to 0.8.1
- Fix how to remove an element's coordinate information
- Add table extraction support for hi_res strategy
- Add support for `encoding` parameter
- Add support for `xml_keep_tags` parameter
- Add env variables for additional parallel mode tweaking
- Support .msg files
- Refactor parallel mode and add smoke test
- Fix header value for api key
- Bump unstructured library to 0.7.8 for bug fixes
- Update documentation and tests for filetypes to sync with partition.auto
- Add support for .rst, .tsv, .xml
- Move PYPDF2 to pypdf since PYPDF2 is deprecated
- Add support for `ocr_only` strategy and `ocr_languages` parameter
- Remove building `detectron2` from source in Dockerfile
- Convert strategy from fast to auto for images since there is no fast strategy for images
- Bump image to use python 3.8.17 instead of 3.8.15
- Add returning text/csv to pipeline_api
- Add support for csv files
- Add parallel processing mode for pages within a pdf
- Bump version of base image to use new stable version of tesseract
- Bump to unstructured==0.7.1 for various bug fixes.
- Supports additional filetypes: epub, odt, rtf
- Updating data type of optional os env var `ALLOWED_ORIGINS`
- Add optional CORS to api if os env var `ALLOWED_ORIGINS` is set
- Add config for unstructured.trace logger
- Fix image build steps to support detectron2 install from Mac M1/M2
- Upgrade to openssl 1.1.1 to accommodate the latest urllib3
- Bump unstructured for SpooledTemporaryFile fix
- Add msg and json types to supported
- Bump unstructured to the latest version
- Posting a bad .pdf results in a 400
- Remove coordinates field from response elements by default
- Add caching from the registry for `make docker-build`
- Add fix for empty content type error
- Bump unstructured-api-tools for better 'file type not supported' response messages
- Updated detectron version
- Update docker-build to use the public registry as a cache
- Adds a strategy parameter to pipeline_api
- Passing file, file_filename, and content_type to `partition`
- Sensible logging config
- Minor version bump
- Minor version bump
- Updated Dockerfile for public release
- Remove rate limiting in the API
- Add file type validation via UNSTRUCTURED_ALLOWED_MIMETYPES
- Major semver route also supported: /general/v0/general
- Changed pipeline name to `pipeline-general`
- Changed pipeline to handle a variety of documents not just emails
- Update Dockerfile, all supported library files.
- Add sample-docs for pdf and pdf image.
- Add emails pipeline Dockerfile
- Add pipeline notebook
- Initial pipeline setup
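
For readers unfamiliar with the overload protection mentioned in the entries above (rejecting traffic via `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`, with a default minimum for the 503 check), the following is a minimal sketch of how such a check might work. It is not the project's actual implementation: the function names, the default of 2048 MB, and the fallback on unparsable values are all assumptions made for illustration, and the caller is expected to supply the currently available memory itself.

```python
import os
from typing import Mapping

# Hypothetical default, chosen only for this sketch.
DEFAULT_MINIMUM_MB = 2048


def should_reject(available_mb: float, env: Mapping[str, str] = os.environ) -> bool:
    """Return True when free memory is below the configured minimum."""
    raw = env.get("UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB")
    try:
        minimum_mb = int(raw) if raw is not None else DEFAULT_MINIMUM_MB
    except ValueError:
        # Assumption: fall back to the default if the value is not an integer.
        minimum_mb = DEFAULT_MINIMUM_MB
    return available_mb < minimum_mb


def status_for(available_mb: float, env: Mapping[str, str] = os.environ) -> int:
    """Map the memory check onto an HTTP status: 503 when overloaded, else 200."""
    return 503 if should_reject(available_mb, env) else 200
```

A server could call `status_for` before accepting a request and return early on 503, leaving the client free to retry later.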