|
| 1 | +--- |
| 2 | +title: "Locale sensitive functions" |
| 3 | +output: rmarkdown::html_vignette |
| 4 | +vignette: > |
| 5 | + %\VignetteIndexEntry{Locale sensitive functions} |
| 6 | + %\VignetteEngine{knitr::rmarkdown} |
| 7 | + %\VignetteEncoding{UTF-8} |
| 8 | +--- |
| 9 | + |
| 10 | +```{r} |
| 11 | +#| label = "setup", |
| 12 | +#| include = FALSE |
| 13 | +knitr::opts_chunk$set( |
| 14 | + collapse = TRUE, |
| 15 | + comment = "#>" |
| 16 | +) |
| 17 | +library(stringr) |
| 18 | +``` |
| 19 | + |
| 20 | +When you're working with non-English text, you may run into characters that R handles differently depending on how they are encoded. For example, two characters that look the same may be encoded differently. |
| 21 | + |
| 22 | +```r |
| 23 | +u <- c("\u00fc", "u\u0308") |
| 24 | +str_view(u) |
| 25 | +#> [1] │ ü |
| 26 | +#> [2] │ ü |
| 27 | +``` |
| 28 | + |
| 29 | +Conversely, two distinct characters may be treated as the same character. Turkish has two i's, one dotted (i) and one dotless (ı). By default, however, both lowercase i's are capitalized to the same letter. |
| 30 | + |
| 31 | +```r |
| 32 | +str_to_upper(c("i", "ı")) |
| 33 | +#> [1] "I" "I" |
| 34 | +``` |
| 35 | + |
| 36 | +Within `stringr` there are a number of locale-sensitive functions, meaning their behavior depends on your locale. `stringr` defaults to English rules by using the “en” locale and requires you to specify the `locale` argument to override it. A locale is specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). To determine which locales are supported in `stringr`, see `stringi::stri_locale_list()`. |
| 37 | + |
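Before passing a `locale`, it can help to check that the code is one your installation knows about. The following is a minimal sketch: `is_known_locale()` is a hypothetical helper written for this illustration, not a function in `stringr` or `stringi`, and the set of available locales depends on the ICU library that `stringi` was built against.

```r
library(stringi)

# Hypothetical helper (not part of stringr or stringi): returns TRUE for each
# code that appears in the locales reported by the installed ICU library.
is_known_locale <- function(code) {
  code %in% stri_locale_list()
}

is_known_locale(c("en", "not_a_locale"))
```

A check like this only confirms that the locale identifier is listed; it does not say anything about how well that locale's rules are covered.
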
| 38 | +There are three main categories of locale-dependent operations: |
| 39 | + |
| 40 | +1. Case conversion |
| 41 | +2. Sorting and ordering |
| 42 | +3. String comparison |
| 43 | + |
| 44 | +## Case conversion |
| 45 | +The rules for changing cases differ among languages. Let's return to Turkish, and its two i's. Since they are two distinct letters, they should be capitalized and lowercased differently by using the `locale` argument. Let's first see what happens when we don't use the `locale` argument. |
| 46 | + |
| 47 | +```r |
| 48 | +str_to_upper(c("i", "ı")) |
| 49 | +#> [1] "I" "I" |
| 50 | +str_to_lower(c("İ", "I")) |
| 51 | +#> [1] "i̇" "i" |
| 52 | +``` |
| 53 | + |
| 54 | +Now, by specifying the correct locale, we get the correct case conversion. |
| 55 | + |
| 56 | +```{r} |
| 57 | +str_to_upper(c("i", "ı"), locale = "tr") |
| 58 | +str_to_lower(c("İ", "I"), locale = "tr") |
| 59 | +``` |
| 60 | + |
| 61 | +It is also important to consider the locale when converting text to title case, as the rules differ by language. For example, Dutch has a digraph (ij), a two-symbol letter that is treated as a single letter. By default, `str_to_title()` capitalizes this digraph incorrectly, capitalizing only its first letter. |
| 62 | + |
| 63 | +```{r, warning=FALSE} |
| 64 | +dutch_words <- c("ijsvrij yoghurt", "ijmuiden", "bij elkaar") |
| 65 | +
|
| 66 | +# Default English locale doesn't correctly capitalize IJ |
| 67 | +str_to_title(dutch_words) |
| 68 | +
|
| 69 | +# Specifying locale = "nl" results in the digraph being correctly capitalized |
| 70 | +str_to_title(dutch_words, locale = "nl") |
| 71 | +``` |
| 72 | + |
| 73 | +*Note: `str_to_title()` handles character-level locale differences (e.g., Turkish i, Dutch ij), but it doesn't implement language-specific rules about which words to capitalize in titles (articles, prepositions, etc.). For that, you'd need specialized libraries or custom logic.* |
| 74 | + |
| 75 | +## Sorting and ordering |
| 76 | +Sorting strings correctly requires understanding that alphabetical order varies across languages. What looks like a simple A-Z sequence in English becomes more complex in other languages. For example, Lithuanian places 'y' between 'i' and 'j' rather than near the end of the alphabet. Czech treats "ch" as a single letter that sorts after 'h', so words starting with "ch" come after all words starting with 'h' rather than among the 'c' words. As a result, text sorted with the default English locale can look wrong to speakers of other languages. To sort correctly, use `str_sort()` and `str_order()` with an explicit `locale`. |
| 77 | + |
| 78 | +```{r} |
| 79 | +czech_words <- c("had", "chata", "hrad", "chůze", "house") |
| 80 | +
|
| 81 | +# Czech words sorted incorrectly with default locale |
| 82 | +str_sort(czech_words) |
| 83 | +
|
| 84 | +# Czech sorting - "ch" is a letter that comes after 'h' |
| 85 | +str_sort(czech_words, locale = "cs") |
| 86 | +str_order(czech_words, locale = "cs") |
| 87 | +``` |
| 88 | + |
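The Lithuanian rule mentioned above can be sketched the same way. This example is an illustration added here, not part of the original vignette code, and it assumes the installed ICU data includes the Lithuanian collation tailoring (`locale = "lt"`).

```r
library(stringr)

letters_lt <- c("k", "y", "j", "i")

# Default "en" locale: 'y' sorts after 'k'
sorted_en <- str_sort(letters_lt)
sorted_en

# Lithuanian locale: 'y' is expected to sort between 'i' and 'j'
sorted_lt <- str_sort(letters_lt, locale = "lt")
sorted_lt
```
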
| 89 | +## String comparison |
| 90 | +As mentioned at the beginning of this vignette, letters that appear the same can have different Unicode representations, which can cause string comparisons to fail unexpectedly. Luckily, `str_equal()` compares strings using Unicode canonicalization rules, so canonically equivalent strings compare as equal. |
| 91 | + |
| 92 | +```{r} |
| 93 | +# Unicode normalization - identical appearance, different encoding |
| 94 | +name1 <- "Jos\u00e9" # precomposed é (single character) |
| 95 | +name2 <- "Jose\u0301" # e + combining acute accent (two characters) |
| 96 | +
|
| 97 | +# They look identical but aren't equal with == |
| 98 | +name1 == name2 |
| 99 | +
|
| 100 | +# str_equal() handles Unicode normalization correctly |
| 101 | +str_equal(name1, name2) |
| 102 | +``` |
| 103 | + |
| 104 | +Case-insensitive string comparison becomes problematic in international contexts because languages differ in which characters they treat as case variants of one another. To handle this properly, combine the `ignore_case = TRUE` argument with the appropriate `locale` so that case folding follows the correct linguistic rules. |
| 105 | + |
| 106 | +```{r} |
| 107 | +turkish_names <- c("İpek", "Işık", "İbrahim") |
| 108 | +search_name <- "ipek" |
| 109 | +
|
| 110 | +# English case-insensitive comparison - WRONG for Turkish |
| 111 | +str_equal(turkish_names, search_name, ignore_case = TRUE) |
| 112 | +
|
| 113 | +# Turkish case-insensitive comparison - CORRECT |
| 114 | +str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr") |
| 115 | +``` |
| 116 | + |
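| 117 | +In some situations, the same word or name can be written in multiple ways. Note that `ignore_case = TRUE` only ignores differences in case: it does not treat a transliteration such as "ue" as equal to "ü", so "mueller" and "Mueller" do not match below. |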
| 118 | + |
| 119 | +```{r} |
| 120 | +# Multiple strings comparison |
| 121 | +customer_names <- c("Müller", "mueller", "MÜLLER", "Mueller") |
| 122 | +search_term <- "müller" |
| 123 | +# Case-insensitive with German locale |
| 124 | +str_equal(customer_names, search_term, ignore_case = TRUE, locale = "de") |
| 125 | +``` |
| 126 | + |
| 127 | +## Other locale-sensitive functions |
| 128 | + |
| 128 | +Pattern-matching functions like `str_detect()`, `str_extract()`, and `str_replace()` also become locale-sensitive when you use case-insensitive matching with a locale-aware modifier such as `coll()`. |
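
One way to make pattern matching locale-aware is the `coll()` modifier, which accepts both `ignore_case` and `locale`. The sketch below returns to the Turkish i's from earlier in this vignette; the exact results may depend on the ICU version behind `stringi`, so treat the comments as expectations rather than guarantees.

```r
library(stringr)

# "I", dotted capital "İ" (U+0130), "i", dotless "ı" (U+0131)
i <- c("I", "\u0130", "i", "\u0131")

# Default "en" locale: "i" is expected to pair with "I"
matches_en <- str_subset(i, coll("i", ignore_case = TRUE))
matches_en

# Turkish locale: "i" is expected to pair with the dotted capital instead
matches_tr <- str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
matches_tr
```

The same `coll()` pattern can be passed to `str_detect()`, `str_extract()`, and `str_replace()`.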