Commit d83df42

qued, ahmetmeleq, ryannikolaidis, MaksOpp authored
chore: switch to charset normalizer (#4060)
Closes [SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible). Removes `chardet` as a dependency, standardizing on `charset-normalizer`. This involved:

- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test... `charset-normalizer` misdiagnosed the encoding of a file used as a test fixture. My guess is that the ~10 characters in the file were not enough for `charset-normalizer` to do a proper inference, so I re-encoded another slightly longer file that's also used for encoding testing, and it got that one.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the expected markdown results (this was a task I missed when adding the markdown ingest tests)

---------

Co-authored-by: Ahmet Melek <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: qued <[email protected]>
Co-authored-by: Maksymilian Operlejn <[email protected]>
1 parent 5368197 commit d83df42
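
The commit message guesses that roughly ten characters were too few for a reliable inference. The sketch below illustrates why that is plausible, using only the standard library: a short run of non-UTF-8 bytes decodes without error under several single-byte codecs, so a detector has little evidence to rank them. The sample string and the candidate codec list are assumptions made for illustration, not the contents of the repository's fixture.

```python
# Hypothetical illustration: the sample text and codec list are assumptions,
# not taken from the repository's test fixtures.
CANDIDATE_CODECS = ["utf-8", "cp1252", "latin-1", "cp1250", "iso8859-15", "mac_roman"]


def plausible_decodings(data: bytes, codecs=CANDIDATE_CODECS) -> dict[str, str]:
    """Return each codec that decodes `data` without error, mapped to the decoded text."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence, so it is ruled out.
            continue
    return results


# A short, non-UTF-8 byte string similar in size to a ~10 character fixture.
short_sample = "Umlauts: äöüß".encode("latin-1")
decodings = plausible_decodings(short_sample)

for codec, text in decodings.items():
    print(f"{codec:12} {text!r}")
```

UTF-8 is ruled out because the bytes are not valid UTF-8, but several single-byte codecs remain, and some of them even produce identical text. A longer sample with natural-language context gives a statistical detector more to work with, which matches the fix described in the commit message.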

14 files changed: +65 -67 lines changed


.github/workflows/ingest-test-fixtures-update-pr.yml

Lines changed: 4 additions & 0 deletions
@@ -115,6 +115,10 @@ jobs:
         run: |
           source .venv/bin/activate
           make html-fixtures-update
+      - name: Update markdown fixtures
+        run: |
+          source .venv/bin/activate
+          make markdown-fixtures-update
 
       - name: Save branch name to environment file
         id: branch

CHANGELOG.md

Lines changed: 2 additions & 9 deletions
@@ -1,6 +1,7 @@
 ## 0.18.11-dev1
 
 ### Enhancements
+- **Standardized on `charset-normalizer` library for encoding detection** Previously we had both `chardet` and `charset-normalizer` as dependencies. We are dropping `chardet` and only using `charset-normalizer`.
 
 ### Features
 - **Type-aware `<input>` mapping in HTML transformations** Bare `<input>` elements are now classified by their `type` attribute (checkbox → Checkbox, radio → RadioButton, others → FormFieldValue).
@@ -11,19 +12,11 @@
 
 ## 0.18.10
 
-### Enhancements
-
-### Features
-- **Add OCR_AGENT_CACHE_SIZE environment variable** Added configurable cache size for OCR agents to control memory usage.
-
-### Fixes
-
-## 0.18.10-dev0
-
 ### Enhancements
 - **Updated CodeQL** Updated CodeQL GHA to v3 from deprecated v2.
 
 ### Features
+- **Add OCR_AGENT_CACHE_SIZE environment variable** Added configurable cache size for OCR agents to control memory usage.
 
 ### Fixes

example-docs/fake-html-cp1252.html

Lines changed: 7 additions & 7 deletions
@@ -4,15 +4,15 @@
 
 <h1>My First Heading</h1>
 <p>My first paragraph.</p>
-<p>Some CP1252-specific characters:</p>
+<p>Some text with CP1252-specific characters:</p>
 
 <pre>
-¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
-° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
-À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
-Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
-à á â ã ä å æ ç è é ê ë ì í î ï
-ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
+Die schöne Frau hat einen Kaffee mit Kuchen gegessen. Sie sagte: "Das war köstlich!" und lächelte dabei. Der Preis betrug 15,50 €.
+L'été était très chaud cette année. J'ai acheté un café au lait pour 3,50 €. C'était délicieux ! L'homme a dit : "C'est parfait !"
+El niño comió paella con ñoquis. La señora dijo: "¡Qué rico!" y pagó 25,75 €. El restaurante tenía un menú del día.
+Kvinnan åt köttbullar med lingonsylt. Hon sa: "Det var fantastiskt!" och betalade 45,90 €. Mannen frågade: "Vill du ha mer?"
+O João comprou um café por 2,50 €. Ele disse: "Está ótimo!" e sorriu. A mulher perguntou: "Quer mais alguma coisa?"
+De vrouw dronk koffie met koekjes. Ze zei: "Het was heerlijk!" en betaalde 4,25 €. Het kind vroeg: "Mag ik ook wat?"
 </pre>
 
 </body>
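
The new fixture text includes the euro sign, which is one of the characters that separates CP1252 from ISO-8859-1 (Latin-1). The following standard-library sketch shows that distinction on one sentence from the fixture. Whether this was the reason the characters were chosen is an assumption; the commit message only says the file was replaced with longer text.

```python
# One sentence taken from the updated fixture text.
sample = "Der Preis betrug 15,50 €."

# In CP1252 the euro sign is the single byte 0x80.
euro_cp1252 = "€".encode("cp1252")

# Latin-1 has no euro sign, so encoding it fails.
try:
    "€".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The CP1252 bytes are not valid UTF-8: 0x80 is a bare continuation byte.
encoded = sample.encode("cp1252")
try:
    encoded.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(euro_cp1252, latin1_has_euro, valid_utf8)
print(encoded.decode("cp1252"))
```

Because the bytes fail strict UTF-8 decoding, a file like this forces the detection path to run rather than succeeding on a default UTF-8 read.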

example-docs/umlauts-non-utf8.md

Lines changed: 5 additions & 1 deletion
@@ -1 +1,5 @@
-Umlauts: ����
+## k�nnen
+
+k�nnen
+
+����

requirements/base.in

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 -c ./deps/constraints.txt
-chardet
+charset-normalizer
 filetype
 python-magic
 lxml
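
With `charset-normalizer` as the only detection dependency, a call site that used to rely on `chardet` could look roughly like the sketch below. This is not the repository's code (the changed call site is not shown in this diff); `decode_with_detection` and its `fallback` parameter are hypothetical names. It uses `charset_normalizer.from_bytes(...).best()`, which returns the best match or `None`. The guarded import only keeps the sketch runnable where the package is not installed.

```python
try:
    # Third-party package; a base dependency after this commit.
    from charset_normalizer import from_bytes
except ImportError:
    # Assumption for this sketch only: degrade gracefully if the package is absent.
    from_bytes = None


def decode_with_detection(data: bytes, fallback: str = "utf-8") -> tuple[str, str]:
    """Hypothetical helper: return (encoding, text) for raw bytes.

    Uses charset-normalizer's best guess when available, otherwise decodes
    with `fallback`, replacing undecodable bytes.
    """
    if from_bytes is not None:
        best = from_bytes(data).best()
        if best is not None:
            # `encoding` is the detected codec name; str() yields the decoded text.
            return best.encoding, str(best)
    return fallback, data.decode(fallback, errors="replace")


raw = "Die schöne Frau hat einen Kaffee mit Kuchen gegessen.".encode("utf-8")
encoding, text = decode_with_detection(raw)
print(encoding, text)
```

The package also provides a `detect` function modeled on `chardet.detect`, which may be the smaller change for code that only needs the returned dictionary.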

requirements/base.txt

Lines changed: 4 additions & 5 deletions
@@ -10,18 +10,17 @@ backoff==2.2.1
     # via -r ./base.in
 beautifulsoup4==4.13.4
     # via -r ./base.in
-certifi==2025.7.9
+certifi==2025.7.14
     # via
     #   httpcore
     #   httpx
     #   requests
     #   unstructured-client
 cffi==1.17.1
     # via cryptography
-chardet==5.2.0
-    # via -r ./base.in
 charset-normalizer==3.4.2
     # via
+    #   -r ./base.in
     #   requests
     #   unstructured-client
 click==8.2.1
@@ -80,7 +79,7 @@ numpy==2.2.6
     # via -r ./base.in
 olefile==0.47
     # via python-oxmsg
-orderly-set==5.4.1
+orderly-set==5.5.0
     # via deepdiff
 packaging==25.0
     # via
@@ -90,7 +89,7 @@ psutil==7.0.0
     # via -r ./base.in
 pycparser==2.22
     # via cffi
-pypdf==5.7.0
+pypdf==5.8.0
     # via unstructured-client
 python-dateutil==2.9.0.post0
     # via unstructured-client

requirements/extra-paddleocr.txt

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ beautifulsoup4==4.13.4
     # via
     #   -c requirements/base.txt
     #   unstructured-paddleocr
-certifi==2025.7.9
+certifi==2025.7.14
     # via
     #   -c requirements/base.txt
     #   httpcore

requirements/extra-pdf-image.txt

Lines changed: 8 additions & 9 deletions
@@ -10,7 +10,7 @@ antlr4-python3-runtime==4.9.3
     # via omegaconf
 cachetools==5.5.2
     # via google-auth
-certifi==2025.7.9
+certifi==2025.7.14
     # via
     #   -c requirements/base.txt
     #   requests
@@ -46,7 +46,7 @@ flatbuffers==25.2.10
     # via onnxruntime
 fonttools==4.58.5
     # via matplotlib
-fsspec==2025.5.1
+fsspec==2025.7.0
     # via
     #   huggingface-hub
     #   torch
@@ -64,14 +64,13 @@ googleapis-common-protos==1.70.0
     #   grpcio-status
 grpcio==1.73.1
     # via
-    #   -c requirements/deps/constraints.txt
     #   google-api-core
     #   grpcio-status
 grpcio-status==1.73.1
     # via google-api-core
 hf-xet==1.1.5
     # via huggingface-hub
-huggingface-hub==0.33.2
+huggingface-hub==0.33.4
     # via
     #   accelerate
     #   timm
@@ -121,7 +120,7 @@ onnx==1.18.0
     # via
     #   -r ./extra-pdf-image.in
     #   unstructured-inference
-onnxruntime==1.22.0
+onnxruntime==1.22.1
     # via
     #   -r ./extra-pdf-image.in
     #   unstructured-inference
@@ -148,7 +147,7 @@ pdfminer-six==20250327
     #   unstructured-inference
 pi-heif==1.0.0
     # via -r ./extra-pdf-image.in
-pikepdf==9.9.0
+pikepdf==9.10.0
     # via -r ./extra-pdf-image.in
 pillow==11.3.0
     # via
@@ -190,7 +189,7 @@ pycparser==2.22
     #   cffi
 pyparsing==3.2.3
     # via matplotlib
-pypdf==5.7.0
+pypdf==5.8.0
     # via
     #   -c requirements/base.txt
     #   -r ./extra-pdf-image.in
@@ -243,7 +242,7 @@ sympy==1.14.0
     # via
     #   onnxruntime
     #   torch
-timm==1.0.16
+timm==1.0.17
     # via
     #   effdet
     #   unstructured-inference
@@ -267,7 +266,7 @@ tqdm==4.67.1
     #   -c requirements/base.txt
     #   huggingface-hub
     #   transformers
-transformers==4.53.1
+transformers==4.53.2
     # via unstructured-inference
 typing-extensions==4.14.1
     # via

requirements/huggingface.txt

Lines changed: 4 additions & 4 deletions
@@ -4,7 +4,7 @@
 #
 #    pip-compile ./huggingface.in
 #
-certifi==2025.7.9
+certifi==2025.7.14
     # via
     #   -c requirements/base.txt
     #   requests
@@ -21,13 +21,13 @@ filelock==3.18.0
     #   huggingface-hub
     #   torch
     #   transformers
-fsspec==2025.5.1
+fsspec==2025.7.0
     # via
     #   huggingface-hub
     #   torch
 hf-xet==1.1.5
     # via huggingface-hub
-huggingface-hub==0.33.2
+huggingface-hub==0.33.4
     # via
     #   tokenizers
     #   transformers
@@ -98,7 +98,7 @@ tqdm==4.67.1
     #   huggingface-hub
     #   sacremoses
     #   transformers
-transformers==4.53.1
+transformers==4.53.2
     # via -r ./huggingface.in
 typing-extensions==4.14.1
     # via

requirements/test.txt

Lines changed: 4 additions & 6 deletions
@@ -30,19 +30,17 @@ flake8==7.3.0
     #   flake8-print
 flake8-print==5.0.0
     # via -r ./test.in
-freezegun==1.5.2
+freezegun==1.5.3
     # via -r ./test.in
 grpcio==1.73.1
-    # via
-    #   -c requirements/deps/constraints.txt
-    #   -r ./test.in
+    # via -r ./test.in
 iniconfig==2.1.0
     # via pytest
 liccheck==0.9.2
     # via -r ./test.in
 mccabe==0.7.0
     # via flake8
-mypy==1.16.1
+mypy==1.17.0
     # via -r ./test.in
 mypy-extensions==1.1.0
     # via
@@ -93,7 +91,7 @@ python-dateutil==2.9.0.post0
     # via
     #   -c requirements/base.txt
     #   freezegun
-ruff==0.12.2
+ruff==0.12.3
     # via -r ./test.in
 semantic-version==2.10.0
     # via liccheck
