vignettes/local-sensitive.Rmd
44 additions & 70 deletions
@@ -8,122 +8,96 @@ vignette: >
8
8
---
9
9
10
10
```{r}
11
-
#| label = "setup",
12
-
#| include = FALSE
11
+
#| include: FALSE
13
12
knitr::opts_chunk$set(
14
13
collapse = TRUE,
15
14
comment = "#>"
16
15
)
17
16
library(stringr)
18
17
```
19
18
20
-
When you're working with non-English text, there may be special characters that you need to instruct R how to encode. For example, a character that looks the same, may be encoded differently.
19
+
stringr provides a number of locale-sensitive functions, meaning their behavior depends on your locale, of which your language is a very important part. stringr defaults to English rules, `locale = "en"`, but you can override by providing a different `locale`. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, "en" is English, "en_GB" is British English, and "en_US" is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To determine which locales are supported in stringr, see `stringi::stri_locale_list()`.
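As a minimal sketch of how a locale identifier can be checked and then passed along (the `has_locale()` helper is hypothetical and introduced only for illustration; it is not part of stringr or stringi):

```r
library(stringr)

# Hypothetical helper (not part of stringr): is this identifier among the
# locales reported by stringi?
has_locale <- function(id) id %in% stringi::stri_locale_list()

has_locale("en_GB")
has_locale("tr")

# Locale-sensitive functions take the identifier through `locale`
str_to_lower("ABC", locale = "en_GB")
```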
21
20
22
-
```r
23
-
u<- c("\u00fc", "u\u0308")
24
-
str_view(u)
25
-
#> [1] │ ü
26
-
#> [2] │ ü
27
-
```
28
-
29
-
Alternatively, two distinct characters may be treated as the same character. In Turkish, there are two i's, with and without a dot. However, default behavior will result in the two different lowercase i's being capitalized as the same letter.
30
-
31
-
```r
32
-
str_to_upper(c("i", "ı"))
33
-
#> [1] "I" "I"
34
-
```
35
-
36
-
Within `stringr` there are a number of locale-sensitive functions, meaning their behavior depends on your locale. `stringr` defaults to English rules by using the “en” locale and requires you to specify the `locale` argument to override it. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To determine which locales are supported in `stringr`, see `stringi::stri_locale_list()`.
37
-
38
-
The three main categories of locale dependent operations:
21
+
There are three main types of function that vary based on locale:
39
22
40
23
1. Case conversion
41
24
2. Sorting and ordering
42
25
3. String comparison
43
26
44
27
## Case conversion
45
-
The rules for changing cases differ among languages. Let's return to Turkish, and its two i's. Since they are two distinct letters, they should be capitalized and lowercased differently by using the `locale` argument. Let's first see what happens when we don't use the `locale` argument.
46
-
47
-
```r
48
-
str_to_upper(c("i", "ı"))
49
-
# [1] "I" "I"
50
-
str_to_lower(c("İ", "I"))
51
-
# [1] "i̇" "i"
52
-
```
53
28
54
-
Now, by specifying the correct locale, we get the correct case conversion.
29
+
Most languages that use the Latin alphabet (like English) have upper and lower case, but the rules aren't always the same. For example, Turkish has two forms of the letter "I", dotted and dotless.
55
30
56
31
```{r}
32
+
str_to_upper(c("i", "ı"))
57
33
str_to_upper(c("i", "ı"), locale = "tr")
34
+
58
35
str_to_lower(c("İ", "I"), locale = "tr")
59
36
```
60
37
61
-
It is also important to consider the locale when converting text to title case, as this also differs by language. For example, Dutch has a digraph (ij), a two-symbol letter that is treated as a single letter. Default `str_to_title()` behavior would capitalize this digraph incorrectly by only capitalizing the first letter.
38
+
Another example is Dutch, where "ij" is a digraph, a two-symbol letter treated as a single letter. `str_to_title()` will incorrectly capitalize this unless you specify the Dutch locale:
# Default English locale doesn't correctly capitalize IJ
67
-
str_to_title(dutch_words)
68
-
69
-
# Specifying locale = "nl" results in the digraph being correctly capitalized
44
+
str_to_title(dutch_words)
70
45
str_to_title(dutch_words, locale = "nl")
71
46
```
72
47
73
-
*Note: `str_to_title()` handles character-level locale differences (e.g., Turkish i, Dutch ij), but it doesn't implement language-specific rules about which words to capitalize in titles (articles, prepositions, etc.). For that, you'd need specialized libraries or custom logic.*
48
+
(Note that `str_to_title()` handles character-level locale differences, but it doesn't implement language-specific rules about which words to capitalize in titles. Fortunately, title case appears to be a concept that applies primarily to English.)
74
49
75
-
## Sorting and ordering
76
-
Sorting strings correctly requires understanding that alphabetical order varies dramatically across languages. What seems like a simple A-Z sequence in English becomes complex when working internationally. For example, Lithuanian places 'y' between 'i' and 'k' rather than at the end of the alphabet. Czech treats "ch" as a single compound letter that sorts after all other 'h' words, not by the 'c'. These differences mean that sorting using the default English locale will appear completely scrambled to speakers of other languages. These sorting differences can be mitigated by using `str_sort()` and `str_order()` with explicit locale specification.
50
+
Case-sensitive string comparison also comes up in `str_equal()`/`str_unique()` and in pattern matching functions. To take advantage of locale-specific case matching, supply `locale` to `str_equal()`/`str_unique()`, and use `coll()` instead of `fixed()` for pattern matching functions.
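A short sketch of that advice, reusing the Turkish i's from earlier; the vector name `i` is just an illustrative choice:

```r
library(stringr)

i <- c("I", "İ", "i", "ı")

# fixed() has no `locale` argument, so it cannot apply Turkish rules
str_detect(i, fixed("i", ignore_case = TRUE))

# coll() uses collation rules; the locale decides which letters pair up
str_detect(i, coll("i", ignore_case = TRUE))
str_detect(i, coll("i", ignore_case = TRUE, locale = "tr"))

# str_equal() takes the locale directly
str_equal("İpek", "ipek", ignore_case = TRUE, locale = "tr")
```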
As mentioned at the beginning of this vignette, letters that appear the same can have different Unicode representations, which may result in string comparison issues. Luckily, `str_equal()` handles Unicode normalization correctly.
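For instance, a minimal sketch with the two encodings of "ü" (the variable name `u` is only for illustration):

```r
library(stringr)

# Precomposed ü versus u followed by a combining diaeresis
u <- c("\u00fc", "u\u0308")

# Base R compares the underlying code points, so these differ
u[1] == u[2]

# str_equal() treats canonically equivalent strings as equal
str_equal(u[1], u[2])
```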
65
+
## Sorting and ordering
66
+
67
+
Alphabetical order can vary dramatically across languages. For example, Lithuanian places 'y' between 'i' and 'k' and Czech treats "ch" as a single compound letter that sorts after all other 'h' words.
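A small sketch of both cases, assuming the `cs` and `lt` locales are available in your ICU build; the vector name `x` is only for illustration:

```r
library(stringr)

x <- c("a", "c", "ch", "h", "z")

# Default English rules versus Czech, where "ch" sorts after "h"
str_sort(x)
str_sort(x, locale = "cs")

# Lithuanian: "y" sorts between "i" and "k"
str_sort(c("i", "k", "y"), locale = "lt")

# str_order() returns the sorting permutation rather than the sorted values
str_order(x, locale = "cs")
```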
91
68
92
69
```{r}
93
-
# Unicode normalization - identical appearance, different encoding
94
-
name1 <- "José" # precomposed é (single character)
Case-sensitive string comparison becomes problematic in international contexts because different languages have different rules for which characters are considered equivalent. To handle this properly, combine the `ignore_case = TRUE` argument with the appropriate locale setting to ensure that case folding follows the correct linguistic rules.
105
-
106
-
```{r}
107
-
turkish_names <- c("İpek", "Işık", "İbrahim")
108
-
search_name <- "ipek"
82
+
## String comparison
109
83
110
-
# English case-insensitive comparison - WRONG for Turkish