Commit 9ebba79

Add vignette about locale sensitive functions. Fixes #404
1 parent 2ebd55e commit 9ebba79

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
---
title: "Locale sensitive functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Locale sensitive functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| label = "setup",
#| include = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(stringr)
```

When you're working with non-English text, you may encounter characters that need special handling. For example, two characters that look identical may be encoded differently.

```r
u <- c("\u00fc", "u\u0308")
str_view(u)
#> [1] │ ü
#> [2] │ ü
```

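To see why the two strings differ, it can help to inspect their code points. The following is a minimal sketch using `str_length()` and base R's `utf8ToInt()`; the variable names are illustrative only.

```r
library(stringr)

u <- c("\u00fc", "u\u0308")

# str_length() counts code points, so the two strings have different lengths
lengths_in_code_points <- str_length(u)
lengths_in_code_points
#> [1] 1 2

# The underlying code points: one precomposed letter versus
# "u" followed by a combining diaeresis
code_points <- lapply(enc2utf8(u), utf8ToInt)
code_points[[1]]
#> [1] 252
code_points[[2]]
#> [1] 117 776
```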
Conversely, two distinct characters may be treated as the same character. Turkish has two i's, one dotted and one dotless. By default, however, both lowercase i's are capitalized to the same letter.

```r
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
```

`stringr` has a number of locale-sensitive functions, meaning their behavior depends on the locale. `stringr` defaults to English rules by using the “en” locale; to override this, specify the `locale` argument. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To see which locales are supported, run `stringi::stri_locale_list()`.

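As a rough sketch of how you might check locale availability before relying on a locale, the following uses `stringi::stri_locale_list()` and `stringi::stri_locale_info()`. The variable names are illustrative, and the exact list depends on the ICU version installed on your system.

```r
library(stringr)

# All locale identifiers known to the underlying ICU library
available_locales <- stringi::stri_locale_list()
head(available_locales)

# Check whether specific locales are available before relying on them
c("tr", "nl", "cs") %in% available_locales

# Break a locale identifier into its language and region parts
locale_parts <- stringi::stri_locale_info("en_GB")
locale_parts$Language
locale_parts$Country
```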
There are three main categories of locale-dependent operations:

1. Case conversion
2. Sorting and ordering
3. String comparison

## Case conversion
The rules for changing case differ among languages. Let's return to Turkish and its two i's. Because they are two distinct letters, each has its own uppercase and lowercase form, which you get by supplying the `locale` argument. Let's first see what happens when we don't use the `locale` argument.

```r
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_lower(c("İ", "I"))
#> [1] "i̇" "i"
```

Now, by specifying the correct locale, we get the correct case conversion.

```{r}
str_to_upper(c("i", "ı"), locale = "tr")
str_to_lower(c("İ", "I"), locale = "tr")
```

It is also important to consider the locale when converting text to title case, as this also differs by language. For example, Dutch has the digraph "ij", a two-symbol letter that is treated as a single letter. By default, `str_to_title()` capitalizes this digraph incorrectly, capitalizing only its first letter.

```{r, warning=FALSE}
dutch_words <- c("ijsvrij yoghurt", "ijmuiden", "bij elkaar")

# Default English locale doesn't correctly capitalize IJ
str_to_title(dutch_words)

# Specifying locale = "nl" results in the digraph being correctly capitalized
str_to_title(dutch_words, locale = "nl")
```

*Note: `str_to_title()` handles character-level locale differences (e.g., Turkish i, Dutch ij), but it doesn't implement language-specific rules about which words to capitalize in titles (articles, prepositions, etc.). For that, you'd need specialized libraries or custom logic.*

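To illustrate what such custom logic could look like, here is a minimal sketch for English titles. The helper `title_case_en()` and its `minor` word list are hypothetical, not part of `stringr`, and the word list is far from complete.

```r
library(stringr)

# Hypothetical helper: title-case each word, but keep "minor" words
# (articles, short prepositions, conjunctions) lowercase unless they come first
title_case_en <- function(x, minor = c("a", "an", "the", "of", "in", "and")) {
  words <- str_split(x, " ")
  vapply(words, function(w) {
    titled <- str_to_title(w)
    is_minor <- str_to_lower(w) %in% minor
    is_minor[1] <- FALSE  # the first word is always capitalized
    titled[is_minor] <- str_to_lower(w[is_minor])
    str_c(titled, collapse = " ")
  }, character(1))
}

title_case_en(c("the lord of the rings", "war and peace"))
#> [1] "The Lord of the Rings" "War and Peace"
```

This sketch splits on single spaces only and ignores punctuation, so treat it as a starting point rather than a complete solution.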
## Sorting and ordering
Sorting strings correctly requires recognizing that alphabetical order varies across languages. What seems like a simple A-Z sequence in English becomes more complex in other languages. For example, Lithuanian places 'y' between 'i' and 'k' rather than near the end of the alphabet. Czech treats "ch" as a single letter that sorts after 'h' rather than under 'c'. As a result, text sorted with the default English locale can look wrong to speakers of other languages. To sort correctly, use `str_sort()` and `str_order()` with an explicit `locale`.

```{r}
czech_words <- c("had", "chata", "hrad", "chůze", "house")

# Czech words sorted incorrectly with default locale
str_sort(czech_words)

# Czech sorting - "ch" is a letter that comes after 'h'
str_sort(czech_words, locale = "cs")
str_order(czech_words, locale = "cs")
```

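The Lithuanian rule mentioned above can be explored in the same way. This is a small sketch with single letters; the vector name `letters_to_sort` is illustrative, and the exact Lithuanian ordering comes from the ICU collation rules on your system.

```r
library(stringr)

letters_to_sort <- c("k", "y", "j", "i", "z")

# Default English locale: "y" sorts near the end of the alphabet
sorted_en <- str_sort(letters_to_sort)
sorted_en
#> [1] "i" "j" "k" "y" "z"

# Lithuanian locale: "y" is expected to sort next to "i"
sorted_lt <- str_sort(letters_to_sort, locale = "lt")
sorted_lt
```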
## String comparison
As mentioned at the beginning of this vignette, letters that appear the same can have different Unicode representations, which can cause problems when comparing strings. Luckily, `str_equal()` handles Unicode normalization correctly.

```{r}
# Unicode normalization - identical appearance, different encoding
name1 <- "José" # precomposed é (single code point)
name2 <- "Jose\u0301" # e + combining acute accent (two code points)

# They look identical but aren't equal with ==
name1 == name2

# str_equal() handles Unicode normalization correctly
str_equal(name1, name2)
```

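If you need to keep using `==` (or other base R tools that compare strings exactly), one option is to normalize the strings up front. The sketch below assumes NFC (composed form) as the target and uses `stringi::stri_trans_nfc()`; the `_nfc` variable names are illustrative.

```r
library(stringr)

name1 <- "Jos\u00e9"    # precomposed é
name2 <- "Jose\u0301"   # e + combining acute accent

# Normalize both strings to NFC (composed form) before comparing with ==
name1_nfc <- stringi::stri_trans_nfc(name1)
name2_nfc <- stringi::stri_trans_nfc(name2)

name1_nfc == name2_nfc
#> [1] TRUE
```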
Case-insensitive string comparison becomes problematic in international contexts because different languages have different rules for which characters are considered equivalent. To handle this properly, combine the `ignore_case = TRUE` argument with the appropriate locale setting to ensure that case folding follows the correct linguistic rules.

```{r}
turkish_names <- c("İpek", "Işık", "İbrahim")
search_name <- "ipek"

# English case-insensitive comparison - WRONG for Turkish
str_equal(turkish_names, search_name, ignore_case = TRUE)

# Turkish case-insensitive comparison - CORRECT
str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr")
```

In some situations, the same word or name can be written in multiple ways, differing in case, in spelling, or in both.

```{r}
# Multiple strings comparison
customer_names <- c("Müller", "mueller", "MÜLLER", "Mueller")
search_term <- "müller"
# Case-insensitive with German locale
str_equal(customer_names, search_term, ignore_case = TRUE, locale = "de")
```

## Other locale-sensitive functions

When using case-insensitive pattern matching, functions like `str_detect()`, `str_extract()`, and `str_replace()` also become locale-sensitive.
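
One way to make pattern matching locale-aware is `stringr`'s `coll()` modifier, which accepts both `ignore_case` and `locale`. The following is a sketch rather than a recommendation for every case; the names `pattern_tr` and `matches_tr` and the replacement string are illustrative.

```r
library(stringr)

turkish_names <- c("İpek", "Işık", "İbrahim")

# coll() compares using the collation rules of the given locale;
# ignore_case = TRUE makes the comparison case-insensitive
pattern_tr <- coll("ipek", ignore_case = TRUE, locale = "tr")

matches_tr <- str_detect(turkish_names, pattern_tr)
matches_tr

# The same modifier works with other pattern-matching functions
str_extract(turkish_names, pattern_tr)
str_replace(turkish_names, pattern_tr, "Ipek")
```

Collation-based matching is generally slower than `fixed()` or `regex()`, so it is usually worth reserving for text where locale rules actually matter.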
