Remove ?, 🔥, ? , ? and other such emojis/images/signs from Java strings

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for example:

▓ railway??
→ Cats and dogs
I'm on ??
Apples ? 
? Vi sign
? I'm the king ? 
Corée ? du Nord ?  (French)
 gjør at både ?╗ (Norwegian)
Star me ★
Star ? once more
早上好 ? (Chinese)
Καλημέρα ? (Greek)
another ? sign ?
добрай раніцы ? (Belarusian)
? ??? ?????? ? (Hindi)
? ? ? ? Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.?

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser does not remove most of these signs. The ? sign is the only one I have seen it remove so far. Other signs such as ? ? ★ ? ? ? ? ? ? ? ? ?? are left in place.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?


1 Reply

深蓝 · Answer 1 · 2021-10-17T00:05:49+0000

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");

So:

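[\p{L}\p{M}\p{N}\p{P}\p{Z}\p{Cf}\p{Cs}\s] is a character class matching letters (\p{L}), combining marks (\p{M}), numbers (\p{N}), punctuation (\p{P}), separators such as spaces (\p{Z}), invisible format characters (\p{Cf}), surrogate code units (\p{Cs}), and whitespace including newlines (\s). \p{L} covers letters from every script: Latin, Cyrillic, Greek, CJK ideographs, and so on. Because Java matches by code point, a well-formed emoji above U+FFFF is still seen as a symbol and removed; \p{Cs} only matches unpaired surrogates. In a Java string literal each backslash must be doubled, so \p{L} is written "\\p{L}".
The ^ at the start of the character class negates it, so the pattern matches every character that is not on the whitelist, and replaceAll deletes those matches.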
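To see which characters the whitelist keeps, each code point can be tested against the character class on its own. The following is only a sketch: the class name CategoryProbe and the helper isKept are made up for illustration, and the sample characters are written as Unicode escapes so the source file encoding does not matter.

```java
final class CategoryProbe {
    // Hypothetical helper: true if the whitelist character class matches this code point.
    static boolean isKept(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.matches("[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");
    }

    public static void main(String[] args) {
        // a, e-acute, CJK ideograph, black star, right arrow, shade block, fire emoji, '!'
        String sample = "a\u00E9\u65E9\u2605\u2192\u2593\uD83D\uDD25!";
        // codePoints() yields the fire emoji as one code point, not two surrogates.
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, isKept(cp) ? "kept" : "removed"));
    }
}
```

The letters and the exclamation mark are reported as kept, while the star, arrow, shade block and fire emoji fall into the symbol categories and are reported as removed.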

Example:

String str = "hello world _# 皆さん、こんにちは！　私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""));
// Output:
//   hello world _# 皆さん、こんにちは！　私はジョンと申します。
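If the filter is applied to many strings, it may be worth compiling the pattern once and reusing it. Below is a minimal sketch under that assumption. The class name SymbolStripper and the extra step that collapses leftover whitespace are my own additions, not part of the answer; note that the tidy-up step also turns newlines and ideographic spaces into a single regular space, so drop it if those matter.

```java
import java.util.regex.Pattern;

final class SymbolStripper {
    // Matches everything that is NOT on the whitelist from the answer.
    private static final Pattern NOT_WHITELISTED =
        Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Optional tidy-up (an assumption): runs of separators left behind by deleted symbols.
    private static final Pattern SPACE_RUNS = Pattern.compile("[\\p{Z}\\s]+");

    static String strip(String input) {
        String kept = NOT_WHITELISTED.matcher(input).replaceAll("");
        return SPACE_RUNS.matcher(kept).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Samples modelled on the question, with symbols written as escapes.
        System.out.println(strip("Star me \u2605"));              // Star me
        System.out.println(strip("\u2192 Cats and dogs"));        // Cats and dogs
        System.out.println(strip("\u65E9\u4E0A\u597D \uD83D\uDD25"));
    }
}
```

The last call prints the three Chinese characters without the trailing emoji or space.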

If you need more information, check out the Java documentation for regexes.
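One side effect to be aware of: the whitelist has no symbol category, so ordinary ASCII symbols such as $, + and = are removed along with the emojis. If that is not wanted, the symbols to keep can be listed explicitly in the character class. The sketch below assumes a hypothetical variant that keeps only $, +, <, = and >; the class and field names are illustrative. Adding a whole category such as \p{Sm} instead would also keep arrows like →, which the question wants removed.

```java
import java.util.regex.Pattern;

final class SymbolWhitelistVariant {
    // The whitelist from the answer, unchanged.
    static final Pattern ORIGINAL =
        Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical variant: additionally keeps a handful of ASCII symbols.
    static final Pattern KEEP_ASCII_SYMBOLS =
        Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s\\$\\+<=>]");

    static String clean(Pattern filter, String input) {
        return filter.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String price = "$5 + tax = 100%\u2605";
        System.out.println(clean(ORIGINAL, price));           // 5  tax  100%
        System.out.println(clean(KEEP_ASCII_SYMBOLS, price)); // $5 + tax = 100%
    }
}
```

The percent sign survives in both cases because it is classified as punctuation rather than as a symbol.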
