Commit 1f8030d

fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An NLTK version bump was originally introduced in #3512 and rolled back in #3527 for two reasons: `nltk==3.8.2` was yanked from PyPI, and we observed significant slowdowns in processing time after bumping to `nltk==3.8.2`. The processing time regression does not appear in `nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally, we verified locally that processing a long `.docx` file takes about as long as we would expect.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
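As a rough complement to the CI check, the sketch below shows one way to confirm that an environment picked up the bumped pin. The helper names (`parse_release`, `meets_minimum`) and the `MIN_NLTK_VERSION` constant are hypothetical and not part of this commit. Treating 3.9.1 as the minimum is an assumption based on the version pinned here, not on the CVE advisory.

```python
import re
from importlib import metadata

# Assumption: 3.9.1 is used as the minimum because it is the version this
# commit pins; the threshold is not taken from the CVE advisory itself.
MIN_NLTK_VERSION = (3, 9, 1)


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. "3.9.1" -> (3, 9, 1).

    Suffixes such as "rc1" or ".post1" are ignored, so this is a rough check.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version: str, minimum: tuple[int, ...] = MIN_NLTK_VERSION) -> bool:
    """Compare the numeric release segment against the minimum, component-wise."""
    return parse_release(version) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("nltk")
    except metadata.PackageNotFoundError:
        print("nltk is not installed")
    else:
        status = "ok" if meets_minimum(installed) else "older than the pinned version"
        print(f"nltk {installed}: {status}")
```

For anything beyond a quick sanity check, a full version parser that handles pre-release and post-release suffixes would likely be a better fit than this regex-based comparison.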
1 parent a861ed8 commit 1f8030d

35 files changed: 112 additions and 101 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
-## 0.15.6-dev1
+## 0.15.6

 ### Enhancements

 ### Features

 ### Fixes

+* **Bump to NLTK 3.9.x** Bumps to the latest `nltk` version to resolve CVE.
 * **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
 * **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.

requirements/base.txt

Lines changed: 3 additions & 3 deletions
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
 # unstructured-client
 nest-asyncio==1.6.0
 # via unstructured-client
-nltk==3.8.1
+nltk==3.9.1
 # via -r ./base.in
 numpy==1.26.4
 # via -r ./base.in
@@ -110,7 +110,7 @@ sniffio==1.3.1
 # via
 # anyio
 # httpx
-soupsieve==2.5
+soupsieve==2.6
 # via beautifulsoup4
 tabulate==0.9.0
 # via -r ./base.in
@@ -129,7 +129,7 @@ typing-inspect==0.9.0
 # via
 # dataclasses-json
 # unstructured-client
-unstructured-client==0.25.4
+unstructured-client==0.25.5
 # via
 # -c ././deps/constraints.txt
 # -r ./base.in

requirements/deps/constraints.txt

Lines changed: 3 additions & 0 deletions
@@ -56,3 +56,6 @@ fsspec==2024.5.0
 wrapt>=1.14.0

 langchain-community>=0.2.5
+
+grpcio==1.64.3
+label-studio-sdk==0.0.34

requirements/dev.txt

Lines changed: 2 additions & 2 deletions
@@ -310,7 +310,7 @@ pyyaml==6.0.2
 # -c ./test.txt
 # jupyter-events
 # pre-commit
-pyzmq==26.1.0
+pyzmq==26.1.1
 # via
 # ipykernel
 # jupyter-client
@@ -360,7 +360,7 @@ sniffio==1.3.1
 # -c ./base.txt
 # anyio
 # httpx
-soupsieve==2.5
+soupsieve==2.6
 # via
 # -c ./base.txt
 # beautifulsoup4

requirements/extra-markdown.txt

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 #
 importlib-metadata==8.2.0
 # via markdown
-markdown==3.6
+markdown==3.7
 # via -r ./extra-markdown.in
 zipp==3.20.0
 # via importlib-metadata

requirements/extra-paddleocr.txt

Lines changed: 4 additions & 4 deletions
@@ -13,7 +13,7 @@ astor==0.8.1
 # via paddlepaddle
 attrdict==2.0.1
 # via unstructured-paddleocr
-cachetools==5.4.0
+cachetools==5.5.0
 # via premailer
 certifi==2024.7.4
 # via
@@ -64,13 +64,13 @@ idna==3.7
 # anyio
 # httpx
 # requests
-imageio==2.34.2
+imageio==2.35.1
 # via
 # imgaug
 # scikit-image
 imgaug==0.4.0
 # via unstructured-paddleocr
-importlib-resources==6.4.0
+importlib-resources==6.4.3
 # via matplotlib
 kiwisolver==1.4.5
 # via matplotlib
@@ -83,7 +83,7 @@ lxml==5.3.0
 # -c ./base.txt
 # premailer
 # unstructured-paddleocr
-matplotlib==3.9.1.post1
+matplotlib==3.9.2
 # via imgaug
 more-itertools==10.4.0
 # via cssutils

requirements/extra-pdf-image.txt

Lines changed: 9 additions & 8 deletions
@@ -6,7 +6,7 @@
 #
 antlr4-python3-runtime==4.9.3
 # via omegaconf
-cachetools==5.4.0
+cachetools==5.5.0
 # via google-auth
 certifi==2024.7.4
 # via
@@ -48,7 +48,7 @@ fsspec==2024.5.0
 # torch
 google-api-core[grpc]==2.19.1
 # via google-cloud-vision
-google-auth==2.33.0
+google-auth==2.34.0
 # via
 # google-api-core
 # google-cloud-vision
@@ -58,13 +58,14 @@ googleapis-common-protos==1.63.2
 # via
 # google-api-core
 # grpcio-status
-grpcio==1.65.4
+grpcio==1.64.3
 # via
+# -c ././deps/constraints.txt
 # google-api-core
 # grpcio-status
 grpcio-status==1.62.3
 # via google-api-core
-huggingface-hub==0.24.5
+huggingface-hub==0.24.6
 # via
 # timm
 # tokenizers
@@ -76,7 +77,7 @@ idna==3.7
 # via
 # -c ./base.txt
 # requests
-importlib-resources==6.4.0
+importlib-resources==6.4.3
 # via matplotlib
 iopath==0.1.10
 # via layoutparser
@@ -92,7 +93,7 @@ lxml==5.3.0
 # pikepdf
 markupsafe==2.1.5
 # via jinja2
-matplotlib==3.9.1.post1
+matplotlib==3.9.2
 # via
 # pycocotools
 # unstructured-inference
@@ -120,7 +121,7 @@ onnx==1.16.2
 # via
 # -r ./extra-pdf-image.in
 # unstructured-inference
-onnxruntime==1.18.1
+onnxruntime==1.19.0
 # via unstructured-inference
 opencv-python==4.8.0.76
 # via
@@ -147,7 +148,7 @@ pdfminer-six==20231228
 # via
 # -r ./extra-pdf-image.in
 # pdfplumber
-pdfplumber==0.11.3
+pdfplumber==0.11.4
 # via layoutparser
 pikepdf==9.1.1
 # via -r ./extra-pdf-image.in

requirements/huggingface.txt

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ fsspec==2024.5.0
 # -c ././deps/constraints.txt
 # huggingface-hub
 # torch
-huggingface-hub==0.24.5
+huggingface-hub==0.24.6
 # via
 # tokenizers
 # transformers

requirements/ingest/azure.txt

Lines changed: 2 additions & 2 deletions
@@ -6,9 +6,9 @@
 #
 adlfs==2024.7.0
 # via -r ./ingest/azure.in
-aiohappyeyeballs==2.3.5
+aiohappyeyeballs==2.3.7
 # via aiohttp
-aiohttp==3.10.3
+aiohttp==3.10.4
 # via adlfs
 aiosignal==1.3.1
 # via aiohttp

requirements/ingest/biomed.txt

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ beautifulsoup4==4.12.3
 # bs4
 bs4==0.0.2
 # via -r ./ingest/biomed.in
-soupsieve==2.5
+soupsieve==2.6
 # via
 # -c ./ingest/../base.txt
 # beautifulsoup4
