Commit 33d955c

Polishing
1 parent 6a96bf3 commit 33d955c

3 files changed: 45 additions, 74 deletions

.vscode/settings.json

Lines changed: 0 additions & 4 deletions
@@ -2,9 +2,5 @@
 "[r]": {
 "editor.formatOnSave": true,
 "editor.defaultFormatter": "Posit.air-vscode"
-},
-"[quarto]": {
-"editor.formatOnSave": true,
-"editor.defaultFormatter": "quarto.quarto"
 }
 }

vignettes/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/.quarto/

vignettes/local-sensitive.Rmd

Lines changed: 44 additions & 70 deletions
@@ -8,122 +8,96 @@ vignette: >
 ---

 ```{r}
-#| label = "setup",
-#| include = FALSE
+#| include: FALSE
 knitr::opts_chunk$set(
 collapse = TRUE,
 comment = "#>"
 )
 library(stringr)
 ```

-When you're working with non-English text, there may be special characters that you need to instruct R how to encode. For example, a character that looks the same, may be encoded differently.
+stringr provides a number of locale-sensitive functions, meaning their behavior depends on your locale, of which your language is a very important part. stringr defaults to English rules, `locale = "en"`, but you can override by providing a different `locale`. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, "en" is English, "en_GB" is British English, and "en_US" is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To determine which locales are supported in stringr, see `stringi::stri_locale_list()`.

-```r
-u <- c("\u00fc", "u\u0308")
-str_view(u)
-#> [1] │ ü
-#> [2] │ ü
-```
-
-Alternatively, two distinct characters may be treated as the same character. In Turkish, there are two i's, with and without a dot. However, default behavior will result in the two different lowercase i's being capitalized as the same letter.
-
-```r
-str_to_upper(c("i", "ı"))
-#> [1] "I" "I"
-```
-
-Within `stringr` there are a number of locale-sensitive functions, meaning their behavior depends on your locale. `stringr` defaults to English rules by using the “en” locale and requires you to specify the `locale` argument to override it. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To determine which locales are supported in `stringr`, see `stringi::stri_locale_list()`.
-
-The three main categories of locale dependent operations:
+There are three main types of function that vary based on locale:

 1. Case conversion
 2. Sorting and ordering
 3. String comparison

 ## Case conversion
-The rules for changing cases differ among languages. Let's return to Turkish, and its two i's. Since they are two distinct letters, they should be capitalized and lowercased differently by using the `locale` argument. Let's first see what happens when we don't use the `locale` argument.
-
-```r
-str_to_upper(c("i", "ı"))
-# [1] "I" "I"
-str_to_lower(c("İ", "I"))
-# [1] "i̇" "i"
-```

-Now, by specifying the correct locale, we get the correct case conversion.
+Most languages that use the Latin alphabet (like English) have upper and lower case, but the rules aren't always the same. For example, Turkish has two forms of the letter "I", dotted and dotless.

 ```{r}
+str_to_upper(c("i", "ı"))
 str_to_upper(c("i", "ı"), locale = "tr")
+
 str_to_lower(c("İ", "I"), locale = "tr")
 ```

-It is also important to consider the locale when converting text to title case, as this also differs by language. For example, Dutch has a digraph (ij), a two symbol letter that is treated as a single letter. Default `string_to_title()` behavior would capitalize this digraph incorrectly by only capialising the first letter.
+Another example is Dutch, where "ij" is a digraph, a two symbol letter treated as a single letter. `str_to_title()` will incorrectly capitalize this unless you specify the Dutch locale:

-```{r, warning=FALSE}
+```{r}
+#| warning: false
 dutch_words <- c("ijsvrij yoghurt", "ijmuiden", "bij elkaar")

-# Default English locale doesn't correctly capitalize IJ
-str_to_title(dutch_words)
-
-# Specifying locale = "nl" results in the digraph being correctly capitalized
+str_to_title(dutch_words)
 str_to_title(dutch_words, locale = "nl")
 ```

-*Note: `str_to_title()` handles character-level locale differences (e.g., Turkish i, Dutch ij), but it doesn't implement language-specific rules about which words to capitalize in titles (articles, prepositions, etc.). For that, you'd need specialized libraries or custom logic.*
+(Note that `str_to_title()` handles character-level locale differences but it doesn't implement language-specific rules about which words to capitalize in titles. Fortunately, title case appears to be a concept that applies primarily to English.)

-## Sorting and ordering
-Sorting strings correctly requires understanding that alphabetical order varies dramatically across languages. What seems like a simple A-Z sequence in English becomes complex when working internationally. For example, Lithuanian places 'y' between 'i' and 'k' rather than at the end of the alphabet. Czech treats "ch" as a single compound letter that sorts after all other 'h' words, not by the 'c'. These differences mean that sorting using the default English locale will appear completely scrambled to speakers of other languages. These sorting differences can be mitigated by using `str_sort()` and `str_order()` with explicit locale specification.
+Case-sensitive string comparison also comes up in `str_equal()`/`str_unique()` and in pattern matching functions. To take advantage of locale-specific case matching, supply `locale` to `str_equal()`/`str_unique()`, and use `coll()` instead of `fixed()` for pattern matching functions.

 ```{r}
-czech_words <- c("had", "chata", "hrad", "chůze", "house")
+turkish_names <- c("İpek", "Işık", "İbrahim")
+search_name <- "ipek"

-# Czech words sorted incorrectly with default locale
-str_sort(czech_words)
+# incorrect
+str_equal(turkish_names, search_name, ignore_case = TRUE)
+str_detect(turkish_names, fixed(search_name, ignore_case = TRUE))

-# Czech sorting - "ch" is a letter that comes after 'h'
-str_sort(czech_words, locale = "cs")
-str_order(czech_words, locale = "cs")
+# correct
+str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr")
+str_detect(turkish_names, coll(search_name, ignore_case = TRUE, locale = "tr"))
 ```

-## String comparison
-As mentioned at the beginning of this vignette, letters that appear the same can have different Unicode representations, which may result in string comparison issues. Luckily, `str_equal()` handles Unicode normalization correctly.
+## Sorting and ordering
+
+Alphabetical order can vary dramatically across languages. For example, Lithuanian places 'y' between 'i' and 'k' and Czech treats "ch" as a single compound letter that sorts after all other 'h' words.

 ```{r}
-# Unicode normalization - identical appearance, different encoding
-name1 <- "José" # precomposed é (single character)
-name2 <- "Jose\u0301" # e + combining acute accent (two characters)
+czech_words <- c("had", "chata", "hrad", "chůze")
+lithuanian_words <- c("ąžuolas", "ėglė", "šuo", "yra", "žuvis")

-# They look identical but aren't equal with ==
-name1 == name2
+# incorrect
+str_sort(czech_words)
+str_sort(lithuanian_words)

-# str_equal() handles Unicode normalization correctly
-str_equal(name1, name2)
+# correct
+str_sort(czech_words, locale = "cs")
+str_sort(lithuanian_words, locale = "lt")
 ```

-Case-sensitive string comparison becomes problematic in international contexts because different languages have different rules for which characters are considered equivalent. To handle this properly, combine the `ignore_case = TRUE` argument with the appropriate locale setting to ensure that case folding follows the correct linguistic rules.
-
-```{r}
-turkish_names <- c("İpek", "Işık", "İbrahim")
-search_name <- "ipek"
+## String comparison

-# English case-insensitive comparison - WRONG for Turkish
-str_equal(turkish_names, search_name, ignore_case = TRUE)
+Letters that appear the same can have different Unicode representations:

-# Turkish case-insensitive comparison - CORRECT
-str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr")
+```{r}
+name1 <- "José" # precomposed é (single character)
+name2 <- "Jose\u0301" # e + combining acute accent (two characters)
+str_view(c(name1, name2))
 ```

-In some situations, the same word or name can be represented in multiple ways.
+They look identical but `==` says they are different:

 ```{r}
-# Multiple strings comparison
-customer_names <- c("Müller", "mueller", "MÜLLER", "Mueller")
-search_term <- "müller"
-# Case-insensitive with German locale
-str_equal(customer_names, search_term, ignore_case = TRUE, locale = "de")
+name1 == name2
 ```

-## Other locale-sensitive functions
+Fortunately, stringr's comparison functions correctly handle these differences:

-When using case-insensitive pattern matching, functions like `str_detect()`, `str_extract()`, and `str_replace()` also become locale-sensitive.
+```{r}
+str_equal(name1, name2)
+str_unique(c(name1, name2))
+```
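
Reviewer note (not part of the commit): the vignette's new String comparison section relies on the two spellings of "José" being canonically equivalent without being byte-identical. The sketch below is one way to check that claim locally. It assumes stringi is installed (stringr depends on it) and uses `stringi::stri_trans_nfc()` purely for illustration; the vignette itself does not call it. The variable names `raw_equal`, `nfc_equal`, and `canon_equal` are hypothetical.

```r
library(stringr)
library(stringi)

# Same visible text, two encodings (values taken from the vignette).
name1 <- "Jos\u00e9"   # precomposed e-acute (U+00E9)
name2 <- "Jose\u0301"  # "e" followed by a combining acute accent (U+0301)

# The two strings contain a different number of code points.
lengths_in_code_points <- str_length(c(name1, name2))

# Base R's == compares the strings without normalizing them first.
raw_equal <- name1 == name2

# Normalizing both sides to NFC before comparing removes the mismatch.
nfc_equal <- stri_trans_nfc(name1) == stri_trans_nfc(name2)

# str_equal() treats canonically equivalent strings as equal.
canon_equal <- str_equal(name1, name2)

print(lengths_in_code_points)
print(c(raw = raw_equal, nfc = nfc_equal, str_equal = canon_equal))
```

Normalizing explicitly is mainly useful when the strings are later passed to code that compares bytes, such as joins or hashing; for direct comparisons, `str_equal()` as used in the vignette is enough.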
