Commit 53df1bb

committed
Response to claude review
1 parent 170eb8b commit 53df1bb

1 file changed

+33
-39
lines changed

vignettes/locale-sensitive.Rmd

Lines changed: 33 additions & 39 deletions
@@ -16,55 +16,72 @@ knitr::opts_chunk$set(
1616
library(stringr)
1717
```
1818

19-
stringr provides a number of locale-sensitive functions, i.e. functions whose behaviour depends on your locale, of which your language is a very important part. stringr defaults to English, `locale = "en"`, but you can override by providing a different `locale` specified by a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. For example, "en" is English, "en_GB" is British English, and "en_US" is American English. For a list of language codes see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) and to see which are supported in stringr, run `stringi::stri_locale_list()`.
19+
A locale is a set of parameters that define a user's language, region, and cultural preferences. It determines language-specific rules for text processing, including how to:
2020

21-
There are three main types of function that vary based on locale:
21+
- Convert between uppercase and lowercase letters
22+
- Sort text alphabetically
23+
- Format dates, numbers, and currency
24+
- Handle character encoding and display
25+
26+
In stringr, you can control the locale using the `locale` argument, which takes language codes like "en" (English), "tr" (Turkish), or "es_MX" (Mexican Spanish). In general, a locale is a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. You can see which locales are supported in stringr by running `stringi::stri_locale_list()`.
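As a minimal sketch of the identifier format described above, the chunk below checks a few identifiers against the list returned by `stringi::stri_locale_list()`. The helper `split_locale()` is hypothetical (it is not part of stringr or stringi) and only illustrates the language/region structure; the exact contents of the locale list depend on the ICU version installed.

```{r}
library(stringr)

# All locale identifiers known to the underlying ICU library
locales <- stringi::stri_locale_list()
head(locales)

# Check whether specific identifiers are available before relying on them
c("en", "tr", "es_MX") %in% locales

# Hypothetical helper, for illustration only: split an identifier into
# its lower-case language part and its optional upper-case region part
split_locale <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)[[1]]
  list(
    language = parts[1],
    region = if (length(parts) > 1) parts[2] else NA_character_
  )
}

split_locale("es_MX")
split_locale("tr")
```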
27+
28+
This vignette describes locale-sensitive stringr functions, i.e. functions with a `locale` argument. These functions fall into two broad categories:
2229

2330
1. Case conversion
2431
2. Sorting and ordering
25-
3. String comparison
2632

2733
## Case conversion
2834

29-
Most languages that use the Latin alphabet (like English) have upper and lower case, but the rules for converting between the two aren't always the same. For example, Turkish has two forms of the letter "I", dotted and dotless:
35+
`str_to_lower()`, `str_to_upper()`, `str_to_title()`, and `str_to_sentence()` all change the case of their inputs. But while most languages that use the Latin alphabet (like English) have upper and lower case, the rules for converting between the two aren't always the same. For example, Turkish has two forms of the letter "I": alongside the dotted lowercase "i" and the dotless uppercase "I", it also has "ı", the dotless lowercase i, and "İ", the dotted uppercase I. This means the rules for converting i to upper case and I to lower case are different from English:
3036

3137
```{r}
32-
str_to_upper(c("i", "ı"))
33-
str_to_upper(c("i", "ı"), locale = "tr")
38+
# English
39+
str_to_upper("i")
40+
str_to_lower("I")
3441
35-
str_to_lower(c("İ", "I"), locale = "tr")
42+
# Turkish
43+
str_to_upper("i", locale = "tr")
44+
str_to_lower("I", locale = "tr")
3645
```
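The same Turkish rules should also apply to `str_to_title()` and `str_to_sentence()`, since they capitalise the first letter of a word or sentence. The following is a small sketch with made-up inputs (the strings are illustrative, not taken from the vignette):

```{r}
library(stringr)

# Illustrative inputs that start with a lowercase dotted i
city <- "istanbul"
sentence <- "istanbul büyük bir şehir."

# Default (English) rules: the initial i becomes a dotless capital I
str_to_title(city)
str_to_sentence(sentence)

# Turkish rules: the initial i becomes the dotted capital İ
title_tr <- str_to_title(city, locale = "tr")
sentence_tr <- str_to_sentence(sentence, locale = "tr")
title_tr
sentence_tr
```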
3746

38-
Another example is Dutch, where "ij" is a digraph treated as a single letter. This means that `string_to_title()` will incorrectly capitalize it unless you use a Dutch locale:
47+
Another example is Dutch, where "ij" is a digraph treated as a single letter. This means that `str_to_sentence()` will incorrectly capitalize "ij" at the start of a sentence unless you use a Dutch locale:
3948

4049
```{r}
4150
#| warning: false
42-
dutch_words <- c("ijsvrij yoghurt", "ijmuiden", "bij elkaar")
51+
dutch_sentence <- "ijsland is een prachtig land in Noord-Europa."
4352
44-
str_to_title(dutch_words)
45-
str_to_title(dutch_words, locale = "nl")
53+
# Incorrect
54+
str_to_sentence(dutch_sentence)
55+
# Correct
56+
str_to_sentence(dutch_sentence, locale = "nl")
4657
```
4758

48-
(Note that `str_to_title()` handles character-level locale differences but it doesn't implement locale-specific rules about which words not to capitalize. Fortunately, title case appears to be concept that applies primarily to English.)
49-
50-
Case-sensitive string comparison also comes up in `str_equal()`/`str_unique()` and in pattern matching functions. To take advantage of locale-specific case matching, supply `locale` to `str_equal()`/`str_unique()` and use `coll()` instead of `fixed()` in pattern matching functions.
59+
Case conversion also matters for case-insensitive comparison, which comes up in two places. Firstly, `str_equal()` and `str_unique()` can optionally ignore case, so it's important to also supply `locale` when working with non-English text. For example, imagine we're searching for a Turkish name, ignoring case:
5160

5261
```{r}
5362
turkish_names <- c("İpek", "Işık", "İbrahim")
5463
search_name <- "ipek"
5564
5665
# incorrect
5766
str_equal(turkish_names, search_name, ignore_case = TRUE)
58-
str_detect(turkish_names, fixed(search_name, ignore_case = TRUE))
5967
6068
# correct
6169
str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr")
70+
```
71+
72+
Secondly, case-insensitive comparison comes up in pattern matching functions like `str_detect()`. You might be accustomed to using `ignore_case = TRUE` with `regex()` or `fixed()`, but if you want locale-sensitive comparison you instead need to use `coll()`:
73+
74+
```{r}
75+
# incorrect
76+
str_detect(turkish_names, fixed(search_name, ignore_case = TRUE))
77+
78+
# correct
6279
str_detect(turkish_names, coll(search_name, ignore_case = TRUE, locale = "tr"))
6380
```
6481

6582
## Sorting and ordering
6683

67-
Alphabetical order can vary dramatically across languages. For example, Lithuanian places 'y' between 'i' and 'k' and Czech treats "ch" as a single compound letter that sorts after all other 'h' words.
84+
`str_sort()`, `str_order()`, and `str_rank()` all rely on the alphabetical ordering of letters. But not every language uses the same ordering as English. For example, Lithuanian places 'y' between 'i' and 'k', and Czech treats "ch" as a single compound letter that sorts after all other 'h' words. So to sort words correctly in these languages, you must provide the appropriate locale:
6885

6986
```{r}
7087
czech_words <- c("had", "chata", "hrad", "chůze")
@@ -78,26 +95,3 @@ str_sort(lithuanian_words)
7895
str_sort(czech_words, locale = "cs")
7996
str_sort(lithuanian_words, locale = "lt")
8097
```
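`str_order()` and `str_rank()` take the same `locale` argument as `str_sort()`. As a brief sketch reusing the Czech words from the chunk above, the ordering indices can be used to reorder the vector (or any parallel vector), and should agree with what `str_sort()` returns for the same locale:

```{r}
library(stringr)

czech_words <- c("had", "chata", "hrad", "chůze")

# Positions that would sort the vector under English and Czech rules
str_order(czech_words)
order_cs <- str_order(czech_words, locale = "cs")
order_cs

# Indexing by the Czech ordering matches str_sort() with the same locale
sorted_cs <- czech_words[order_cs]
sorted_cs

# Rank of each word in its original position under Czech rules
rank_cs <- str_rank(czech_words, locale = "cs")
rank_cs
```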
81-
82-
## String comparison
83-
84-
Letters that appear the same can have different Unicode representations:
85-
86-
```{r}
87-
name1 <- "José" # precomposed é (single character)
88-
name2 <- "Jose\u0301" # e + combining acute accent (two characters)
89-
str_view(c(name1, name2))
90-
```
91-
92-
They look identical but `==` says they are different:
93-
94-
```{r}
95-
name1 == name2
96-
```
97-
98-
Fortunately, stringr's comparison functions correctly handle these differences:
99-
100-
```{r}
101-
str_equal(name1, name2)
102-
str_unique(c(name1, name2))
103-
```
