Releases · Unstructured-IO/unstructured · GitHub

17 Oct 19:32

yuming-long

0.16.0

Enhancements

Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate unstructured-ingest repository.
- Replace extras in requirements/ingest directory with a new ingest.txt extra for installing the unstructured-ingest library.
- Remove the unstructured.ingest submodule.
- Delete all shell scripts previously used for destination ingest tests.

Features

Fixes

Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.

10 Oct 20:55

christinestraub

0.15.14

Enhancements

Features

Add (but do not install) a new post-partitioning decorator to handle metadata added for all file types, such as .filename, .filetype and .languages. It will be installed in a closely following PR to replace the four decorators currently used for this purpose.

Fixes

Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
Remove "unused" date_from_file_object parameter. As part of simplifying the partitioning parameter set, remove the date_from_file_object parameter. A file object does not have a last-modified date attribute, so it can never give a useful value. When a file object is used as the document source (such as in the Unstructured API), the last-modified date must come from the metadata_last_modified argument.
Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements passed to assign_and_map_hash_ids can contain duplicated element instances, which leads to an error when mapping parent ids.
Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
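
The double-decoration fixes above share one idea: metadata is applied once, by the principal partitioner, and delegating partitioners carry no decorator of their own. The following is a minimal sketch of that pattern, not the library's implementation. Every name in it (`apply_metadata_sketch`, `Element`, `partition_html_sketch`, `partition_md_sketch`, the `calls` counter) is hypothetical, and the Markdown handling is deliberately trivial.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from functools import wraps
from typing import Callable


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""

    text: str
    metadata: dict = field(default_factory=dict)


# Counts how often the decorator body runs, to make double-decoration visible.
calls = {"count": 0}


def apply_metadata_sketch(func: Callable[..., list[Element]]) -> Callable[..., list[Element]]:
    """Hypothetical post-partitioning decorator: stamp a filename on every element."""

    @wraps(func)
    def wrapper(*args, **kwargs) -> list[Element]:
        elements = func(*args, **kwargs)
        calls["count"] += 1
        for element in elements:
            element.metadata["filename"] = kwargs.get("metadata_filename")
        return elements

    return wrapper


@apply_metadata_sketch  # only the principal partitioner is decorated
def partition_html_sketch(text: str, **kwargs) -> list[Element]:
    return [Element(line) for line in text.splitlines() if line.strip()]


def partition_md_sketch(text: str, **kwargs) -> list[Element]:
    # Delegating partitioner: no decorator here. Metadata is applied once,
    # by the principal partitioner it delegates to.
    html_like = text.replace("# ", "")
    return partition_html_sketch(html_like, **kwargs)
```

Calling `partition_md_sketch("# Title\nBody", metadata_filename="doc.md")` runs the decorator body a single time. Decorating the delegating function as well would run it twice over the same elements.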

20 Sep 14:25

MthwRobinson

0.15.13

Enhancements

Improve pdfminer image cleanup process. Optimize the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This reduces the overall processing time for PDF documents.

Features

Fixes

Fix high memory overhead for intersection area computation. Use numpy.float32 for coordinates and remove intermediate variables to reduce memory usage when computing intersection areas.
Fix the arm64 image build. arm64 builds are now fixed and will be available again starting with the 0.15.13 release.

13 Sep 14:39

MthwRobinson

0.15.12

Enhancements

Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.

10 Sep 12:55

MthwRobinson

0.15.10

Enhancements

Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.
Modified analysis drawing tools to dump to files and draw from dumps. If the analysis parameter of the partition_pdf function is set to True, the layouts for Object Detection, Pdfminer Extraction, OCR and the final layout will be dumped as JSON files. The drawers now accept dict (dump) objects instead of internal class instances.
Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This improves the speed of deduplicating elements for pages with a lot of elements.
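
To illustrate the vectorization item above, here is a minimal sketch of computing IOU for every pair of boxes with numpy broadcasting instead of a nested loop. It is an illustration under stated assumptions, not the library's code: the function name `pairwise_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return an (N, M) matrix of IOU values for boxes given as (x1, y1, x2, y2)."""
    a = boxes_a.astype(np.float32)[:, None, :]  # shape (N, 1, 4)
    b = boxes_b.astype(np.float32)[None, :, :]  # shape (1, M, 4)

    # Width and height of each pairwise intersection, clipped at zero
    # so that non-overlapping boxes contribute no area.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h  # shape (N, M)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])  # shape (N, 1)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])  # shape (1, M)
    union = area_a + area_b - inter

    # Leave the result at zero where the union is empty (degenerate boxes).
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

All pairs are evaluated in a handful of array operations, which is where the speedup on pages with many elements would come from.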

Features

Fixes

30 Aug 19:13

MthwRobinson

0.15.9

Enhancements

Features

Add support for encoding parameter in partition_csv

27 Aug 15:55

MthwRobinson

0.15.8

Enhancements

Bump unstructured.paddleocr to 2.8.1.0.

Features

Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
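
The fallback described in the last item can be sketched as follows. This is an illustration only: `detect_ole_file_type`, its `sniffed_type` argument and the `OLE_EXTENSIONS` table are hypothetical names, and the library's real detection logic and extension table are not shown in these notes.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional

# Hypothetical mapping used only for this example.
OLE_EXTENSIONS = {".doc": "DOC", ".ppt": "PPT", ".xls": "XLS", ".msg": "MSG"}


def detect_ole_file_type(filename: str, sniffed_type: Optional[str]) -> Optional[str]:
    """Prefer the content-based result; fall back to the extension when it is unknown."""
    if sniffed_type is not None:
        return sniffed_type
    # Content sniffing could not identify the OLE file, so trust the extension
    # rather than guessing a default type such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

With this ordering, an unidentified OLE file named `report.DOC` resolves to DOC, while a file with no recognized extension stays unidentified instead of being mislabeled.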

20 Aug 19:53

christinestraub

0.15.7

Enhancements

Features

Fixes

Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
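
One way to avoid the nested directory described above is to check whether the target path already ends in "nltk_data" before appending that name. The sketch below shows only that check. `resolve_download_dir` is a hypothetical helper and is not claimed to match the actual fix.

```python
from pathlib import Path


def resolve_download_dir(base_dir: str) -> Path:
    """Return the directory to download into without nesting "nltk_data" twice."""
    base = Path(base_dir)
    # If the caller already points at an "nltk_data" directory, use it as-is.
    if base.name == "nltk_data":
        return base
    return base / "nltk_data"
```

Both `/home/user` and `/home/user/nltk_data` then resolve to the same download location, so later checks for existing downloads look in one place.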

20 Aug 12:47

MthwRobinson

0.15.6

Enhancements

Features

Fixes

Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve a CVE.
Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
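
The TableChunk behavior above can be pictured with a small sketch that splits a table on row boundaries and builds the text and the HTML of each chunk from the same rows, so the two cannot drift apart. This is an assumption-laden illustration, not the library's chunker: `split_table_rows`, the list-of-rows input and the character-count window are all hypothetical.

```python
from __future__ import annotations

from html import escape


def split_table_rows(rows: list[list[str]], max_chars: int) -> list[tuple[str, str]]:
    """Split rows into chunks on row boundaries; return (text, html) pairs."""
    chunks: list[list[list[str]]] = []
    current: list[list[str]] = []
    current_len = 0
    for row in rows:
        row_len = len(" ".join(row))
        # Start a new chunk when adding this row would overflow the window.
        # A single oversized row still gets a chunk of its own.
        if current and current_len + 1 + row_len > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current_len += row_len + (1 if current else 0)
        current.append(row)
    if current:
        chunks.append(current)

    def to_text(chunk: list[list[str]]) -> str:
        return " ".join(cell for row in chunk for cell in row)

    def to_html(chunk: list[list[str]]) -> str:
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
            for row in chunk
        )
        return f"<table>{body}</table>"

    # Text and HTML are derived from the same rows, so they always correspond.
    return [(to_text(chunk), to_html(chunk)) for chunk in chunks]
```

Because every chunk is a whole number of rows wrapped in its own `<table>`, each HTML fragment stays parseable on its own.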

16 Aug 14:35

MthwRobinson

0.15.5

Enhancements

Features

Fixes

Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.
