
Commit 7b25dfc

MthwRobinson and scanny authored
fix(CVE-2024-39705): remove nltk download (#3361)
### Summary

Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which highlights the risk of remote code execution when running `nltk.download`. Removes `nltk.download` in favor of a `.tgz` file containing the required NLTK data files, whose SHA256 hash is checked to validate the download. An error is now raised if `nltk.download` is invoked. The logic for determining the NLTK download directory is borrowed from `nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test` and set `NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a Python interactive session, run:
   ```python
   from unstructured.nlp.tokenize import download_nltk_packages

   download_nltk_packages()
   ```
3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded data.

---------

Co-authored-by: Steve Canny <[email protected]>
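The download-and-verify approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the URL, the expected hash, and the helper names here are all hypothetical stand-ins.

```python
from __future__ import annotations

import hashlib
import os
import tarfile
import tempfile
import urllib.request

# Hypothetical values -- the real project pins its own archive URL and hash.
NLTK_DATA_URL = "https://example-bucket.s3.amazonaws.com/nltk_data.tgz"
NLTK_DATA_SHA256 = "0" * 64  # placeholder; must match the published archive


def sha256_of(path: str) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_nltk_packages(target_dir: str | None = None) -> None:
    """Fetch the pinned .tgz, verify its hash, then extract it."""
    target_dir = target_dir or os.environ.get(
        "NLTK_DATA", os.path.expanduser("~/nltk_data")
    )
    with tempfile.NamedTemporaryFile(suffix=".tgz", delete=False) as tmp:
        urllib.request.urlretrieve(NLTK_DATA_URL, tmp.name)
        # Refuse to touch the archive unless the digest matches the pin.
        if sha256_of(tmp.name) != NLTK_DATA_SHA256:
            raise ValueError("SHA256 mismatch: refusing to extract NLTK data archive")
        with tarfile.open(tmp.name) as archive:
            archive.extractall(target_dir)
```

Pinning the hash means a compromised or swapped archive fails loudly at download time instead of executing untrusted code later, which is the core of the CVE mitigation.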
1 parent d48fa3b commit 7b25dfc

File tree

12 files changed: +179 −27 lines

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -256,6 +256,8 @@ jobs:
       matrix:
         python-version: [ "3.9","3.10" ]
     runs-on: ubuntu-latest
+    env:
+      NLTK_DATA: ${{ github.workspace }}/nltk_data
     needs: [ setup_ingest, lint ]
     steps:
       # actions/checkout MUST come before auth
```

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,4 +1,4 @@
-## 0.14.10-dev13
+## 0.14.10
 
 ### Enhancements
 
@@ -14,6 +14,7 @@
 
 * **Fix counting false negatives and false positives in table structure evaluation**
 * **Fix Slack CI test** Change channel that Slack test is pointing to because previous test bot expired
+* **Remove NLTK download** Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705
 
 ## 0.14.9
 
```
Dockerfile

Lines changed: 2 additions & 3 deletions
```diff
@@ -1,4 +1,4 @@
-FROM quay.io/unstructured-io/base-images:wolfi-base-d46498e@sha256:3db0544df1d8d9989cd3c3b28670d8b81351dfdc1d9129004c71ff05996fd51e as base
+FROM quay.io/unstructured-io/base-images:wolfi-base-e48da6b@sha256:8ad3479e5dc87a86e4794350cca6385c01c6d110902c5b292d1a62e231be711b as base
 
 USER root
 
@@ -18,8 +18,7 @@ USER notebook-user
 
 RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
   pip3.11 install unstructured.paddlepaddle && \
-  python3.11 -c "import nltk; nltk.download('punkt')" && \
-  python3.11 -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
+  python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
   python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
   python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
 
```

test_unstructured/nlp/test_tokenize.py

Lines changed: 10 additions & 4 deletions
```diff
@@ -2,22 +2,28 @@
 from unittest.mock import patch
 
 import nltk
+import pytest
 
 from test_unstructured.nlp.mock_nltk import mock_sent_tokenize, mock_word_tokenize
 from unstructured.nlp import tokenize
 
 
+def test_error_raised_on_nltk_download():
+    with pytest.raises(ValueError):
+        tokenize.nltk.download("tokenizers/punkt")
+
+
 def test_nltk_packages_download_if_not_present():
     with patch.object(nltk, "find", side_effect=LookupError):
-        with patch.object(nltk, "download") as mock_download:
-            tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        with patch.object(tokenize, "download_nltk_packages") as mock_download:
+            tokenize._download_nltk_packages_if_not_present()
 
-    mock_download.assert_called_with("fake_package")
+    mock_download.assert_called_once()
 
 
 def test_nltk_packages_do_not_download_if():
     with patch.object(nltk, "find"), patch.object(nltk, "download") as mock_download:
-        tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+        tokenize._download_nltk_packages_if_not_present()
 
     mock_download.assert_not_called()
 
```
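The new `test_error_raised_on_nltk_download` test expects `tokenize.nltk.download` to raise a `ValueError`. One way such a guard could be wired up is sketched below; this is a self-contained illustration with a stand-in namespace, not the project's actual module layout or error message.

```python
import types

# Stand-in for the wrapped nltk module so this sketch is self-contained.
# The real unstructured.nlp.tokenize module wraps the actual nltk package.
nltk = types.SimpleNamespace()


def _forbidden_download(*args, **kwargs):
    """Replacement for nltk.download that always raises, mirroring the new test."""
    raise ValueError(
        "nltk.download is disabled (CVE-2024-39705); use download_nltk_packages() instead"
    )


# Any code path that reaches nltk.download now fails fast instead of
# fetching and executing untrusted index content.
nltk.download = _forbidden_download
```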

typings/nltk/__init__.pyi

Lines changed: 17 additions & 0 deletions
```diff
@@ -0,0 +1,17 @@
+from __future__ import annotations
+
+from nltk import data, internals
+from nltk.data import find
+from nltk.downloader import download
+from nltk.tag import pos_tag
+from nltk.tokenize import sent_tokenize, word_tokenize
+
+__all__ = [
+    "data",
+    "download",
+    "find",
+    "internals",
+    "pos_tag",
+    "sent_tokenize",
+    "word_tokenize",
+]
```

typings/nltk/data.pyi

Lines changed: 7 additions & 0 deletions
```diff
@@ -0,0 +1,7 @@
+from __future__ import annotations
+
+from typing import Sequence
+
+path: list[str]
+
+def find(resource_name: str, paths: Sequence[str] | None = None) -> str: ...
```

typings/nltk/downloader.pyi

Lines changed: 5 additions & 0 deletions
```diff
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+from typing import Callable
+
+download: Callable[..., bool]
```

typings/nltk/internals.pyi

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+from __future__ import annotations
+
+def is_writable(path: str) -> bool: ...
```
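The `is_writable` stub exists because the download-directory logic borrowed from `nltk` walks candidate locations and picks the first writable one, honoring `NLTK_DATA` first. A rough sketch of that idea (simplified, assumed logic with a hypothetical helper name, not the project's exact code):

```python
import os


def guess_download_dir(candidates=None):
    """Return the first writable candidate directory, honoring NLTK_DATA.

    Simplified sketch of nltk-style download-directory resolution; the
    real search list also includes system-wide paths.
    """
    env = os.environ.get("NLTK_DATA")
    paths = ([env] if env else []) + list(candidates or []) + [
        os.path.expanduser("~/nltk_data")
    ]
    for path in paths:
        # Existing and writable: use it directly.
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
        # Not yet created, but its parent is writable: usable too.
        parent = os.path.dirname(path) or "."
        if not os.path.exists(path) and os.access(parent, os.W_OK):
            return path
    raise OSError("no writable nltk_data directory found")
```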

typings/nltk/tag.pyi

Lines changed: 5 additions & 0 deletions
```diff
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+def pos_tag(
+    tokens: list[str], tagset: str | None = None, lang: str = "eng"
+) -> list[tuple[str, str]]: ...
```

typings/nltk/tokenize.pyi

Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+from __future__ import annotations
+
+def sent_tokenize(text: str, language: str = ...) -> list[str]: ...
+def word_tokenize(text: str, language: str = ..., preserve_line: bool = ...) -> list[str]: ...
```
