R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to remove non-alphabet characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this belong to another group of characters?

library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:07:30+0000

POSIX character classes need to be wrapped inside a regex character class (a bracket expression), so the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.
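As a minimal sketch of the wrapped form in base R (the sample string and the variable names are only illustrative, and the result assumes the default POSIX-style interpretation of punct):

```r
# Illustrative input; any string containing ASCII punctuation would do.
x <- "this is the final + test!"

# The POSIX class [:punct:] sits inside an outer bracket expression [...],
# so the regex engine treats it as a named class rather than a literal set.
stripped <- gsub("[[:punct:]]", "", x)
stripped
```

Here both the `+` and the `!` are removed, leaving the two spaces that surrounded the `+`.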

This POSIX named class in the ASCII range matches all non-controls, non-alphanumeric, non-space characters.

ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

If a locale is in effect, however, it could alter the behavior of [[:punct:]].

The R documentation (?regex) states the following: "Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale."

The Open Group LC_CTYPE definition for punct says:

Define characters to be classified as punctuation characters.

In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.

However, the stringi package seems to depend on ICU and locale is a fundamental concept in ICU.

With the stringi package, I recommend using the Unicode properties \p{P} and \p{S}.

\p{P} matches any kind of punctuation character, but it misses nine of the ASCII characters that the POSIX class punct includes ($ + < = > ^ ` | ~). This is because Unicode splits what POSIX considers punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes into play:
```
stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "
```
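To see the Punctuation/Symbol split concretely, a small sketch like the following lists which printable ASCII characters each Unicode property matches. It assumes stringi is installed; the names `p_chars` and `s_chars` are only illustrative.

```r
library(stringi)

# Printable, non-space ASCII characters (codes 33 to 126), one per element.
ascii <- rawToChar(as.raw(33:126), multiple = TRUE)

# Characters in the Unicode general category Punctuation (P).
p_chars <- ascii[stri_detect_regex(ascii, "\\p{P}")]

# Characters in the Unicode general category Symbol (S).
s_chars <- ascii[stri_detect_regex(ascii, "\\p{S}")]

paste(p_chars, collapse = "")
paste(s_chars, collapse = "")
```

The second call should return the nine characters that \p{P} alone leaves behind, including the `+` from the question.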

Or fall back to gsub from base R, which handles this very well.

gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "
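Both approaches leave runs of spaces where punctuation used to be. If that matters, one possible follow-up (a sketch, not part of the original answer) is to collapse whitespace and trim the ends with stringi; the variable name `cleaned` is only illustrative.

```r
library(stringi)

string1 <- c(
  "this is a test",
  "this, is also a test",
  "this is the final. test",
  "this is the final + test!"
)

# Replace Unicode punctuation and symbols with a space.
cleaned <- stri_replace_all_regex(string1, "[\\p{P}\\p{S}]", " ")

# Collapse runs of whitespace to a single space, then trim both ends.
cleaned <- stri_trim_both(stri_replace_all_regex(cleaned, "\\s+", " "))
cleaned
```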
