Releases: Unstructured-IO/unstructured
0.16.0
Enhancements
- Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate `unstructured-ingest` repository.
  - Replace extras in `requirements/ingest` directory with a new `ingest.txt` extra for installing the `unstructured-ingest` library.
  - Remove the `unstructured.ingest` submodule.
  - Delete all shell scripts previously used for destination ingest tests.
Features
Fixes
- Add language parameter to `OCRAgentGoogleVision`. Introduces an optional language parameter in the `OCRAgentGoogleVision` constructor to serve as a language hint for `document_text_detection`. This ensures compatibility with the OCRAgent's `get_instance` method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.
0.15.14
Enhancements
Features
- Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like `.filename`, `.filetype` and `.languages`. This will be installed in a closely following PR to replace the four currently being used for this purpose.
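
To illustrate what a post-partitioning metadata decorator can look like, here is a minimal sketch. It is not the library's implementation: `apply_metadata_sketch`, `partition_lines`, and the use of plain dicts for elements are all hypothetical names and simplifications chosen for this example.

```python
import functools
from typing import Callable, Optional


def apply_metadata_sketch(file_type: str) -> Callable:
    """Hypothetical stand-in for a post-partitioning metadata decorator."""

    def decorator(partitioner: Callable[..., list]) -> Callable[..., list]:
        @functools.wraps(partitioner)
        def wrapper(filename: str, languages: Optional[list] = None, **kwargs) -> list:
            # Run the undecorated partitioner first ...
            elements = partitioner(filename, **kwargs)
            # ... then stamp the metadata that is common to every file type.
            for element in elements:
                metadata = element.setdefault("metadata", {})
                metadata["filename"] = filename
                metadata["filetype"] = file_type
                if languages is not None:
                    metadata["languages"] = list(languages)
            return elements

        return wrapper

    return decorator


@apply_metadata_sketch(file_type="text/plain")
def partition_lines(filename: str, text: str = "") -> list:
    """Toy partitioner (an assumption for this sketch): one element per non-blank line."""
    return [{"text": line.strip()} for line in text.splitlines() if line.strip()]
```

Because the common metadata is applied in one place after partitioning, a partitioner that delegates to another decorated partitioner would not need its own decoration.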
Fixes
- Update Python SDK usage in `partition_via_api`. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
- Remove "unused" `date_from_file_object` parameter. As part of simplifying the partitioning parameter set, remove the `date_from_file_object` parameter. A file object does not have a last-modified date attribute, so it can never give a useful value. When a file object is used as the document source (such as in Unstructured API), the last-modified date must come from the `metadata_last_modified` argument.
- Fix occasional `KeyError` when mapping parent ids to hash ids. Occasionally the input elements into `assign_and_map_hash_ids` can contain duplicated element instances, which leads to an error when mapping parent ids.
- Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
- Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
- Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
- Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
- Remove obsolete min_partition/max_partition args from TXT and EML. The legacy `min_partition` and `max_partition` parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from `partition_text()` and `partition_email()`.
- Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new `@apply_metadata()` decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
- Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
- Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
0.15.13
Enhancements
- Improve `pdfminer` image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances overall processing speed of PDF documents.
Features
Fixes
- Fixes high memory overhead for intersection area computation. Use `numpy.float32` for coordinates and remove intermediate variables to reduce memory usage when computing intersection areas.
- Fixes the `arm64` image build. `arm64` builds are now fixed and will be available again starting with the `0.15.13` release.
0.15.12
Enhancements
- Improve `pdfminer` element processing. Implemented splitting of `pdfminer` elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text.
0.15.10
Enhancements
- Enhance `pdfminer` element cleanup. Expand removal of `pdfminer` elements to include those inside all `non-pdfminer` elements, not just `tables`.
- Modified analysis drawing tools to dump to files and draw from dumps. If the parameter `analysis` of the `partition_pdf` function is set to `True`, the layout for Object Detection, Pdfminer Extraction, OCR and final layouts will be dumped as JSON files. The drawers now accept dict (dump) objects instead of internal class instances.
- Vectorize pdfminer elements deduplication computation. Use `numpy` operations to compute IOU and sub-region membership instead of a simple loop. This improves the speed of deduplicating elements for pages with a lot of elements.
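
As a rough illustration of the vectorization idea, the sketch below computes all pairwise IOU values with `numpy` broadcasting instead of a nested loop. The function name `pairwise_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example, not the library's internal API; the `float32` cast mirrors the memory-saving note in the 0.15.13 fixes.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b) -> np.ndarray:
    """Return an (N, M) matrix of IOU values for boxes given as (x1, y1, x2, y2)."""
    a = np.asarray(boxes_a, dtype=np.float32)[:, None, :]  # shape (N, 1, 4)
    b = np.asarray(boxes_b, dtype=np.float32)[None, :, :]  # shape (1, M, 4)

    # Width and height of each pairwise overlap, clipped at zero when boxes are disjoint.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter

    # Avoid division by zero for degenerate (zero-area) pairs.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

A deduplication pass could then threshold this matrix (for example `pairwise_iou(a, b) > 0.5`, where the threshold is also an assumption) to find candidate duplicates in one operation.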
Features
Fixes
0.15.9
Enhancements
Features
- Add support for encoding parameter in partition_csv
 
0.15.8
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
 
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
 
Fixes
- Replace `pillow-heif` with `pi-heif`. Replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the wheel for `pi-heif`.
- Minify text_as_html from DOCX. Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
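
The extension fallback for OLE files can be sketched as follows. This is only an illustration under stated assumptions: `detect_ole_file_type`, the `sniffed` argument (standing in for whatever content-based detection returned), and the `OLE_EXTENSIONS` table are hypothetical. The 8-byte signature is the standard OLE compound-file header shared by DOC, XLS, PPT, and MSG files.

```python
from pathlib import Path
from typing import Optional

# Standard OLE compound-file signature; DOC, XLS, PPT and MSG files all start with it.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table used only when content inspection is inconclusive.
OLE_EXTENSIONS = {".doc": "DOC", ".xls": "XLS", ".ppt": "PPT", ".msg": "MSG"}


def detect_ole_file_type(filename: str, head: bytes, sniffed: Optional[str] = None) -> Optional[str]:
    """Return a file-type label for an OLE file, or None if it cannot be determined."""
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE container at all
    if sniffed is not None:
        return sniffed  # content-based detection succeeded; trust it
    # Content-based detection failed: fall back to the filename extension
    # rather than guessing a default such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

The design point is that an unidentified OLE container yields either the extension's type or `None`, never an arbitrary default.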
0.15.7
Enhancements
Features
Fixes
- Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
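
One way to avoid the nested-directory problem is to normalize the target path before downloading. The helper below is a hypothetical sketch of that idea (the name `resolve_nltk_data_dir` is invented for this example and the actual fix may differ); it uses `PurePosixPath` so the behavior is the same on every platform.

```python
from pathlib import PurePosixPath


def resolve_nltk_data_dir(base: str) -> PurePosixPath:
    """Return a download directory that ends in exactly one 'nltk_data' component."""
    path = PurePosixPath(base)
    if path.name == "nltk_data":
        # The caller already pointed at an nltk_data directory; do not nest another one.
        return path
    return path / "nltk_data"
```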
 
0.15.6
Enhancements
Features
Fixes
- Bump to NLTK 3.9.x. Bumps to the latest `nltk` version to resolve CVE.
- Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.
- Synchronized text and html on `TableChunk` splits. When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.
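
The row-boundary splitting described above can be sketched like this. It is a simplified illustration, not the library's chunker: `split_table_rows`, the dict-based chunk shape, and measuring the window in characters of plain text are assumptions made for the example.

```python
from html import escape


def _render(rows: list) -> dict:
    """Build text and HTML from the same rows so the two always stay in sync."""
    text = "\n".join(" ".join(row) for row in rows)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return {"text": text, "text_as_html": f"<table>{body}</table>"}


def split_table_rows(rows: list, max_chars: int) -> list:
    """Split a table into chunks on whole-row boundaries, each within max_chars of text."""
    chunks, current, current_len = [], [], 0
    for row in rows:
        row_len = len(" ".join(row))
        added = row_len + (1 if current else 0)  # +1 for the newline joining rows
        if current and current_len + added > max_chars:
            chunks.append(_render(current))
            current, current_len, added = [], 0, row_len
        # A single row longer than max_chars is kept whole in this sketch.
        current.append(row)
        current_len += added
    if current:
        chunks.append(_render(current))
    return chunks
```

Because each chunk's HTML is generated from complete rows, every `text_as_html` value is a well-formed `<table>` on its own.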
0.15.5
Enhancements
Features
Fixes
- Revert to using `unstructured.pytesseract` fork. Due to the unavailability of some recent release versions of `pytesseract` on PyPI, the project now uses the `unstructured.pytesseract` fork to ensure stability and continued support.
- Bump `libreoffice` version in image. Bumps the `libreoffice` version to `25.2.5.2` to address CVEs.
- Downgrade NLTK dependency version for compatibility. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.