@david-cilluffo

Summary

Adds support for special character control words that are currently parsed as Unknown and silently dropped.

Characters added:

  • Em dash (\emdash) → —
  • En dash (\endash) → –
  • Bullet (\bullet) → •
  • Smart single quotes (\lquote, \rquote) → ‘ ’
  • Smart double quotes (\ldblquote, \rdblquote) → “ ”
  • Tab (\tab) → U+0009
  • Line break (\line) → U+000A

Problem

RTF files from applications like Scrivener use these control words extensively. Currently they're silently dropped, causing data loss.

Input RTF:
The transformation in reverse\emdash confident expert

Current output: The transformation in reverseconfident expert (dash missing)

With this PR: The transformation in reverse—confident expert
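
The before/after behavior above can be reproduced with a small scanner. This is only a sketch under simplifying assumptions, not the parser's actual code: `render` and its `lookup` hook are hypothetical names, numeric parameters and control symbols are ignored, and a single space after a control word is treated as its delimiter.

```rust
/// Simplified sketch: expands RTF control words in a plain text run.
/// `lookup` is a hypothetical hook that returns the replacement character
/// for a recognized control word, or None for an unknown one.
fn render(input: &str, lookup: impl Fn(&str) -> Option<char>) -> String {
    let mut out = String::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // Collect the letters that form the control word name.
        let mut word = String::new();
        while let Some(&next) = chars.peek() {
            if next.is_ascii_alphabetic() {
                word.push(next);
                chars.next();
            } else {
                break;
            }
        }
        // A single space after a control word is a delimiter, not text.
        if chars.peek() == Some(&' ') {
            chars.next();
        }
        // Unknown words produce nothing, which is how the dash gets lost.
        if let Some(replacement) = lookup(&word) {
            out.push(replacement);
        }
    }
    out
}
```

Calling `render` with a lookup that always returns `None` models the old behavior and drops the dash. A lookup that maps `"emdash"` to U+2014 keeps it.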

Testing

Added 6 unit tests for special character parsing.

Breaking changes

None. Previously unknown control words are now recognized and converted to their Unicode equivalents.

Adds support for RTF special character control words that were previously
parsed as Unknown and silently dropped:

- \emdash → U+2014 (—)
- \endash → U+2013 (–)
- \bullet → U+2022 (•)
- \lquote → U+2018 (‘)
- \rquote → U+2019 (’)
- \ldblquote → U+201C (“)
- \rdblquote → U+201D (”)
- \tab → U+0009 (tab)
- \line → U+000A (newline)
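
The mapping listed above amounts to a lookup from control-word name to a single character. A minimal sketch, assuming a hypothetical helper `special_char_for` that takes the name without its leading backslash (the merged code may be structured differently):

```rust
/// Hypothetical helper, not the crate's actual API: maps a control-word
/// name (without the leading backslash) to the character listed above.
/// Returns None for any other word.
fn special_char_for(word: &str) -> Option<char> {
    match word {
        "emdash" => Some('\u{2014}'),
        "endash" => Some('\u{2013}'),
        "bullet" => Some('\u{2022}'),
        "lquote" => Some('\u{2018}'),
        "rquote" => Some('\u{2019}'),
        "ldblquote" => Some('\u{201C}'),
        "rdblquote" => Some('\u{201D}'),
        "tab" => Some('\t'),
        "line" => Some('\n'),
        _ => None,
    }
}
```

Returning `Option<char>` leaves the caller free to decide what happens to words that are still unrecognized.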

This fixes data loss when parsing RTF from applications like Scrivener
that use these control words extensively.

Includes 6 new unit tests covering all special characters.

d0rianb (Owner) commented Dec 18, 2025

Thanks for this PR, this looks great!

@d0rianb d0rianb merged commit 3fec3d1 into d0rianb:master Dec 18, 2025
1 check passed
@david-cilluffo david-cilluffo deleted the feat/special-characters branch December 18, 2025 12:05