11
22# Processing UTF-8 byte sequences into grapheme clusters
33
4- ## Motivation
5-
6- // 👪👪👪
4+ ## The Objective
5+
6+ 👪👪👪
7+
8+ - Process a consecutive sequence of UTF-8 encoded text into a sequence of grapheme clusters.
9+ - Stop processing on one of the following conditions:
10+   - the end of the input stream is reached
11+   - a control character (such as newline or escape character) has been found
12+   - the maximum number of grapheme clusters, measured in narrow-width columns (a.k.a. the page width), has been consumed (wide characters count as two narrow characters)
13+ - Allow resuming text processing when we previously stopped in the middle of a grapheme cluster.
14+ - The algorithm must be as resource-efficient as possible:
15+   - do not require any dynamic memory allocations during text processing
16+   - reduce instruction branching as much as possible
17+   - utilize SIMD to improve throughput
18+ - Invalid codepoints are treated as East Asian Width Narrow (1 column).
719
8- We want to be able to pure scan text in the terminal, in order to be able to know what to
9- print to the terminal's screen, but we must stop at the right page margin, as well as, when
10- a control character (like newline or escape character) was found.
11- The input is a consecutive memory region containing terminal output to be interpreted.
12- Text is encoded in UTF-8.
20+ ## Consequences
1321
14- The ultimate goal is to scan text as fast as possible and stop scanning at one of the conditions:
22+ Do not report at the end of a codepoint, because the following codepoint may extend
23+ the current grapheme cluster. Report only at the end of a complete grapheme cluster.
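
The "report only at the end of a complete grapheme cluster" rule can be sketched as a small state machine that holds a cluster back until the next codepoint proves that it has ended. This is a minimal illustration, not the article's implementation: `ClusterState`, `feedCodepoint`, and `countClusters` are hypothetical names, and `extendsCluster` is a heavily simplified stand-in for the UAX #29 boundary rules (a few hard-coded ranges, and any codepoint following a ZWJ is assumed to join).

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: simplified stand-in for the UAX #29 property tables.
constexpr bool isZwj(char32_t cp) noexcept { return cp == 0x200D; }

constexpr bool extendsCluster(char32_t cp) noexcept
{
    return isZwj(cp)
        || (cp >= 0x0300 && cp <= 0x036F)    // combining diacritical marks
        || (cp >= 0xFE00 && cp <= 0xFE0F)    // variation selectors
        || (cp >= 0x1F3FB && cp <= 0x1F3FF); // emoji skin tone modifiers
}

// Hypothetical segmenter state; it lives across calls so that a cluster
// may span several input chunks.
struct ClusterState
{
    bool open = false;     // a cluster was started but not yet reported
    bool afterZwj = false; // the previous codepoint was a ZWJ
};

// Feeds one codepoint. Returns true if the previously open cluster is now
// known to be complete, i.e. `cp` starts the next cluster.
constexpr bool feedCodepoint(ClusterState& state, char32_t cp) noexcept
{
    bool const continues = state.open && (extendsCluster(cp) || state.afterZwj);
    bool const completed = state.open && !continues;
    state.open = true;
    state.afterZwj = isZwj(cp);
    return completed;
}

// Counts the clusters in a complete input. The last cluster can only be
// reported once the end of input is known.
inline std::size_t countClusters(char32_t const* codepoints, std::size_t count) noexcept
{
    ClusterState state;
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < count; ++i)
        if (feedCodepoint(state, codepoints[i]))
            ++clusters;
    return clusters + (state.open ? 1 : 0);
}
```

Note that the state needs no buffer and no allocation: two flags are enough to decide whether the next codepoint closes the cluster that is currently open.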
1524
16- 1. end of UTF-8 byte stream is reached
17- 2. a non-text character has been found (e.g. a control character)
18- 3. page width has been reached
25+ ## Implementation
1926
20- Scanning US-ASCII text is easy, trivial in fact. And US-ASCII can be even scanned using SIMD instructions,
21- increasing scanning performance dramatically.
27+ Scanning US-ASCII can be easily implemented using SIMD, increasing scanning performance dramatically.
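
As a rough illustration of the US-ASCII fast path, the sketch below checks eight bytes per step using SWAR (SIMD within a register) bit tricks on a 64-bit integer. This is a portable stand-in chosen for the example, not real SIMD intrinsics and not necessarily what the article's code does; `scanAsciiText` is a hypothetical name. It returns how many leading bytes are printable US-ASCII, stopping at a control character, a non-ASCII byte, or the column limit.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: number of leading bytes in `text` that are printable
// US-ASCII (0x20..0x7E), scanning at most `maxColumns` bytes. Each such byte
// occupies exactly one terminal column.
inline std::size_t scanAsciiText(std::string_view text, std::size_t maxColumns) noexcept
{
    constexpr std::uint64_t Ones = 0x0101010101010101ULL;
    constexpr std::uint64_t Highs = 0x8080808080808080ULL;

    std::size_t const limit = text.size() < maxColumns ? text.size() : maxColumns;
    std::size_t i = 0;

    // Fast path: test 8 bytes at once; leave as soon as any byte must stop the scan.
    while (limit - i >= 8)
    {
        std::uint64_t block = 0;
        std::memcpy(&block, text.data() + i, 8);

        bool const hasNonAscii = (block & Highs) != 0;                          // any byte >= 0x80
        bool const hasControl = ((block - Ones * 0x20) & ~block & Highs) != 0;  // any byte < 0x20
        std::uint64_t const del = block ^ (Ones * 0x7F);
        bool const hasDel = ((del - Ones) & ~del & Highs) != 0;                 // any byte == 0x7F
        if (hasNonAscii || hasControl || hasDel)
            break;
        i += 8;
    }

    // Slow path: find the exact stop position byte by byte (also handles the tail).
    while (i < limit)
    {
        auto const ch = static_cast<unsigned char>(text[i]);
        if (ch < 0x20 || ch >= 0x7F)
            break;
        ++i;
    }
    return i;
}
```

The wide test only answers "does this block contain a stop byte?", so the exact position is left to the byte-wise loop. A version using real SIMD instructions would follow the same structure with wider registers.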
2228
2329Scanning non-US-ASCII text, that is, complex Unicode codepoints, is far more involved.
2430
25- In order to satisfy point number 3 - stop scanning at the page width - we must take into account
31+ In order to reliably stop scanning at the page width, we must take into account
2632that the character we see on the screen is not necessarily just a single byte,
2733nor even a single UTF-32 codepoint, but rather a sequence of UTF-32 codepoints.
2834This is what we call a **grapheme cluster**. A grapheme cluster is a user-perceived single grapheme entity,
@@ -31,36 +37,12 @@ that can be one or more Unicode codepoints.
3137We therefore must be able to determine the border of when a grapheme cluster ends and the next one begins.
3238
3339Because scanning US-ASCII text can be implemented using SIMD but complex Unicode cannot, we split both
34- tasks into their own sub tasks, and then alter between the two in order to scan the sum of all Unicode.
35-
36- In this article, we'll befocusing on scanning for complex unicode.
37-
38- To make things even more complex, we also must be able to suspend and resume scanning at any arbitrary point
39- in time, because we are not guaranteed to always have all bytes available. They may come in later calls.
40-
41- ## Objective
40+ tasks into their own subtasks, and then alternate between the two in order to scan arbitrary Unicode text.
4241
43- Scan a sequence of UTF-8 bytes into grapheme clusters,
44- emitting events for each grapheme cluster and their east asian widths,
45- for up to a given amount of east asian widths (sum of each cluster's width),
46- terminating also early on control characters, allowing to suspend and resume
47- at any arbitrary point in the sequence of input bytes.
48-
49- ## Requirements
50-
51- - The underlying input sequence to process at once is a consecutive sequence of bytes
52- - East asian widths are mapped to terminal columns (Narrow=1, Wide=2)
53- - Input is a consecutive sequence of bytes and the maximum number of total widths to process at most
54- - Output is the number of widths being processed that fit into the input's maximum number of total widths
55- - Invalid codepoints are treated with east asian width Narrow (1 column)
56- - Processing up to a given amount of total widths
57- - Processing can interrupt and resume at any time (like in a finite state machine)
58-
59- ## Consequences
42+ In this article, we'll be focusing on scanning complex Unicode.
6043
61- Do not report at the end of a codepoint,
62- because maybe the following codepoint
63- may extend the current grapheme cluster
44+ We also must be able to suspend and resume scanning text at any arbitrary point
45+ in time, because we are not guaranteed to always have all bytes available in a single call.
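
One way to picture suspend and resume is a UTF-8 decoder whose entire state is a small struct that the caller keeps between calls. The sketch below is an assumption-laden illustration, not the article's code: `Utf8DecoderState`, `feedByte`, and `decode` are hypothetical names, malformed input is reduced to emitting U+FFFD and dropping the offending byte, and overlong or surrogate encodings are not rejected. The `std::vector` is only there to make the demo observable; the decoder itself allocates nothing.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical decoder state; it survives between calls.
struct Utf8DecoderState
{
    char32_t codepoint = 0; // bits collected so far
    unsigned pending = 0;   // continuation bytes still expected
};

constexpr char32_t ReplacementCharacter = 0xFFFD;

// Feeds one byte. Returns a codepoint once one is complete, std::nullopt
// while a multi-byte sequence is still in progress.
inline std::optional<char32_t> feedByte(Utf8DecoderState& state, std::uint8_t byte) noexcept
{
    if (state.pending != 0)
    {
        if ((byte & 0xC0) == 0x80)
        {
            state.codepoint = static_cast<char32_t>((state.codepoint << 6) | (byte & 0x3Fu));
            if (--state.pending == 0)
                return state.codepoint;
            return std::nullopt;
        }
        // Simplification: drop the partial sequence together with this byte.
        state = Utf8DecoderState{};
        return ReplacementCharacter;
    }

    if (byte < 0x80)
        return static_cast<char32_t>(byte);
    if ((byte & 0xE0) == 0xC0) { state = {static_cast<char32_t>(byte & 0x1F), 1}; return std::nullopt; }
    if ((byte & 0xF0) == 0xE0) { state = {static_cast<char32_t>(byte & 0x0F), 2}; return std::nullopt; }
    if ((byte & 0xF8) == 0xF0) { state = {static_cast<char32_t>(byte & 0x07), 3}; return std::nullopt; }
    return ReplacementCharacter; // stray continuation byte or invalid lead byte
}

// Decodes one chunk; an incomplete trailing sequence stays in `state`
// and is completed by the next call.
inline void decode(Utf8DecoderState& state, std::string_view chunk, std::vector<char32_t>& out)
{
    for (char const ch : chunk)
        if (auto const cp = feedByte(state, static_cast<std::uint8_t>(ch)))
            out.push_back(*cp);
}
```

For example, the family emoji U+1F46A is encoded as `F0 9F 91 AA`. Feeding `F0 9F` in one call yields nothing and leaves `pending == 2`; feeding `91 AA` in the next call completes the codepoint.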
6446
6547## Example Processing: Family Emoji
6648