- Patch various python CVEs
- Bump to `unstructured` 0.16.11
- No longer attempts to download NLTK asset from S3 which could result in a 403
- Update `strategy` parameter to allow `'` and `"` as input surrounding the value.
- Bump to `unstructured` 0.15.10
- Add `include_slide_notes` parameter, indicating whether slide notes in `ppt` and `pptx` files should be partitioned. Default is `True`. Now, when slide notes are present in the file, they will be included alongside other elements, which may shift the index numbers of non-note elements.
- Bump to `unstructured` 0.15.7
- Resolve NLTK CVE.
- Bump to `unstructured` 0.15.6
- Bump to `unstructured` 0.15.5
- Use the library's `detect_filetype` in API to determine mimetype
- Add content_type api parameter
- Bump to `unstructured` 0.15.1
- Remove constraint on `safetensors` that was preventing us from bumping `transformers`.
- Bump to `unstructured` 0.15.0
- Bump to `unstructured` 0.14.10
- Fix certain filetypes failing mimetype lookup in the new base image
- Replace rockylinux with chainguard/wolfi as a base image for `amd64`
- Bump to `unstructured` 0.14.6
- Bump to `unstructured-inference` 0.7.35
- Bump to `unstructured` 0.14.4
- Add handling for `pdf_infer_table_structure` to reflect the "tables off by default" behavior in `unstructured`.
- Fix list params such as `extract_image_block_types` not working via the python/js clients
- Allow for a different server port with the PORT variable
- Change pdf_infer_table_structure parameter from being disabled in auto strategy.
- Add support for `unique_element_ids` parameter.
- Add max lifetime, via MAX_LIFETIME_SECONDS env-var, to API containers
- Bump unstructured to 0.13.5
- Change default values for `pdf_infer_table_structure` and `skip_infer_table_types`. Mark `pdf_infer_table_structure` deprecated.
- Add support for the `starting_page_number` param.
- Bump unstructured to 0.12.4
- Add support for both `list[str]` and `str` input formats for `ocr_languages` parameter
- Adds support for additional MIME types from `unstructured`
- Document the support for gzip files and add additional testing
- Bump Pydantic to 2.5.x and remove it from explicit dependencies list (will be managed by fastapi)
- Introduce Form params description in the code, which will form openapi and swagger documentation
- Roll back some openapi customizations
- Keep backward compatibility for passing parameters in form of `list[str]` (will not be shown in the documentation)
- Bump unstructured to 0.12.2
- Fix bug that ignored `combine_under_n_chars` chunking option argument.
- Add hi_res_model_name to partition and deprecate model_name
- Bump unstructured to 0.12.0
- Add support for returning extracted image blocks as base64 encoded data stored in metadata fields
- Bump unstructured to 0.11.6
- Handle invalid hi_res_model_name kwarg
- Enable self-hosted authorization using UNSTRUCTURED_API_KEY env variable
- Bump unstructured to 0.11.0
- Bump unstructured to 0.10.30
- Make sure `multipage_sections` param defaults to `true` as per the readme
- Bump unstructured to 0.10.29
- Add `max_characters` param for chunking. This param gives users additional control to "chunk" elements into larger or smaller `CompositeElement`s
- Bump unstructured to 0.10.28
- Make sure chipperv2 is called when `hi_res_model_name==chipper`
- Bump unstructured to 0.10.26
- Bring parent_id metadata field back after fixing a backwards compatibility bug
- Restrict Chipper usage to one at a time. The model is very resource intense, and this will prevent issues while we improve it.
- Bump unstructured to 0.10.25
- Use a generator when splitting pdfs in parallel mode
- Add a default memory minimum for 503 check
- Fix an UnboundLocalError when an invalid docx file is caught
- Bump unstructured to 0.10.23
- Simplify the error message for BadZipFile errors
- Bump unstructured to 0.10.21
- Fix an unhandled error when a non pdf file is sent with content-type pdf
- Fix an unhandled error when a non docx file is sent with content-type docx
- Fix an unhandled error when a non-Unstructured json schema is sent
- Bump unstructured to 0.10.19
- Bump unstructured to 0.10.18
- Remove spurious whitespace in `app-start.sh`. This fixes deployments in some envs such as Google Cloud Run.
- Adds `languages` kwarg. `ocr_languages` will eventually be deprecated and replaced by `languages` to specify what languages to use for OCR
- Adds a startup log and other minor cleanups
- Adds `chunking_strategy` kwarg and associated params. These params allow users to "chunk" elements into larger or smaller `CompositeElement`s
- Remove `parent_id` from the element metadata. New metadata fields are causing errors with existing installs. We'll readd this once a fix is widely available.
- Fix some pdfs incorrectly returning a file is encrypted error. The `pypdf.is_encrypted` check caused us to return this error even if the file is readable.
- Bump unstructured to 0.10.16
- Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
- Bump unstructured to 0.10.15
- Bump unstructured to 0.10.14
- Improve parallel mode retry handling
- Improve logging during error handling. We don't need to log stack traces for expected errors.
- Bump unstructured to 0.10.13
- Bump unstructured-inference to 0.5.25
- Remove dependency on unstructured-api-tools
- Add a top level error handler for more consistent response bodies
- Tesseract minor version bump to 5.3.2
- Update readme for parameter `hi_res_model_name`
- Fix a bug using `hi_res_model_name` in parallel mode
- Bump unstructured library to 0.10.12
- Bump unstructured-inference to 0.5.22
- Bump unstructured library to 0.10.8
- Bump unstructured-inference to 0.5.17
- Reject traffic when overloaded via `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`
- Docker image built with Python 3.10 rather than 3.8
- Fix incorrect handling on param skip_infer_table_types
- Pin `safetensors` to fix a build error with 0.0.38
- Fix page break has None page number bug
- Bump unstructured to 0.10.5
- Bump unstructured-ingest to 0.5.15
- Fix UnboundLocalError using pdfs in parallel mode
- Bump unstructured to 0.10.4
- Fix a bug in parallel mode causing `not a valid pdf` errors
- Bump unstructured to 0.10.2, unstructured-inference to 0.5.13
- Bump unstructured library to 0.9.2
- Fix a misleading error in make docker-test
- Bump unstructured library to 0.9.0
- Add table support for image with parameter `skip_infer_table_types`
- Add support for gzipped files
- Image tweak, move application entrypoint to scripts/app-start.sh
- Throw 400 error if a PDF is password protected
- Improve logging of params to single line json
- Add support for `include_page_breaks` parameter
- Support model name as api parameter
- Add retry parameters on fanout requests
- Bump unstructured library to 0.8.1
- Fix how to remove an element's coordinate information
- Add table extraction support for hi_res strategy
- Add support for `encoding` parameter
- Add support for `xml_keep_tags` parameter
- Add env variables for additional parallel mode tweaking
- Support .msg files
- Refactor parallel mode and add smoke test
- Fix header value for api key
- Bump unstructured library to 0.7.8 for bug fixes
- Update documentation and tests for filetypes to sync with partition.auto
- Add support for .rst, .tsv, .xml
- Move PYPDF2 to pypdf since PYPDF2 is deprecated
- Add support for `ocr_only` strategy and `ocr_languages` parameter
- Remove building `detectron2` from source in Dockerfile
- Convert strategy from fast to auto for images since there is no fast strategy for images
- Bump image to use python 3.8.17 instead of 3.8.15
- Add returning text/csv to pipeline_api
- Add support for csv files
- Add parallel processing mode for pages within a pdf
- Bump version of base image to use new stable version of tesseract
- Bump to unstructured==0.7.1 for various bug fixes.
- Supports additional filetypes: epub, odt, rtf
- Updating data type of optional os env var `ALLOWED_ORIGINS`
- Add optional CORS to api if os env var `ALLOWED_ORIGINS` is set
- Add config for unstructured.trace logger
- Fix image build steps to support detectron2 install from Mac M1/M2
- Upgrade to openssl 1.1.1 to accommodate the latest urllib3
- Bump unstructured for SpooledTemporaryFile fix
- Add msg and json types to supported
- Bump unstructured to the latest version
- Posting a bad .pdf results in a 400
- Remove coordinates field from response elements by default
- Add caching from the registry for `make docker-build`
- Add fix for empty content type error
- Bump unstructured-api-tools for better 'file type not supported' response messages
- Updated detectron version
- Update docker-build to use the public registry as a cache
- Adds a strategy parameter to pipeline_api
- Passing file, file_filename, and content_type to `partition`
- Sensible logging config
- Minor version bump
- Minor version bump
- Updated Dockerfile for public release
- Remove rate limiting in the API
- Add file type validation via UNSTRUCTURED_ALLOWED_MIMETYPES
- Major semver route also supported: /general/v0/general
- Changed pipeline name to `pipeline-general`
- Changed pipeline to handle a variety of documents not just emails
- Update Dockerfile, all supported library files.
- Add sample-docs for pdf and pdf image.
- Add emails pipeline Dockerfile
- Add pipeline notebook
- Initial pipeline setup
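
Two of the parameter-handling entries above (allowing `'` and `"` around the `strategy` value, and accepting both `list[str]` and `str` for `ocr_languages`) can be illustrated with a minimal sketch. This is not the API's actual implementation: the helper names `normalize_strategy` and `normalize_ocr_languages` are hypothetical, and the assumption that a string value separates language codes with `,` or `+` is ours, not something the changelog states.

```python
from typing import List, Union


def normalize_strategy(raw: str) -> str:
    """Hypothetical helper: strip one pair of matching quotes around the value."""
    value = raw.strip()
    # Only remove quotes when the first and last characters are the same quote mark.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        value = value[1:-1]
    return value


def normalize_ocr_languages(raw: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: accept a single string or a list and return a flat list.

    Assumption: codes inside a string are separated by ',' or '+'.
    """
    items = [raw] if isinstance(raw, str) else list(raw)
    codes: List[str] = []
    for item in items:
        for part in item.replace("+", ",").split(","):
            part = part.strip()
            if part:  # skip empty fragments such as trailing separators
                codes.append(part)
    return codes
```

With these helpers, `normalize_strategy("'hi_res'")` and `normalize_strategy("hi_res")` yield the same value, and a mismatched pair of quotes is left untouched so that a malformed value can still be rejected by later validation.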