Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
249 views
in Technique[技术] by (71.8m points)

Remove ?, 🔥, ? , ? and other such emojis/images/signs from Java strings

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English; some are in other languages, including ones written in non-Latin scripts. For example:

▓ railway??
→ Cats and dogs
I'm on ??
Apples ? 
? Vi sign
? I'm the king ? 
Corée ? du Nord ?  (French)
 gjør at både ?╗ (Norwegian)
Star me ★
Star ? once more
早上好 ? (Chinese)
Καλημέρα ? (Greek)
another ? sign ?
добрай раніцы ? (Belarusian)
? ??? ?????? ? (Hindi)
? ? ? ? Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.?

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser fails to remove most of the signs. The ? sign is the only one I have found so far that it removes. Other signs, such as ? ? ★ ? ? ? ? ? ? ? ? ??, are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?


1 Reply

0 votes
by (71.8m points)

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

// Backslashes are doubled because the pattern is a Java string literal.
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");

So:

  • [\p{L}\p{M}\p{N}\p{P}\p{Z}\p{Cf}\p{Cs}\s] is a character class that matches all letter (\p{L}), mark (\p{M}), numeric (\p{N}), punctuation (\p{P}), separator (\p{Z}), format (\p{Cf}), and surrogate (\p{Cs}) characters, plus whitespace such as newlines (\s). \p{L} covers letters from every script, such as Latin, Cyrillic, and Kanji.
  • The ^ at the start of the character class negates it, so the pattern matches every character that is not in the whitelist, and replaceAll deletes those matches.
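
If the filter runs on many strings, it may be worth compiling the pattern once instead of calling String.replaceAll each time. The following is only a minimal sketch of that idea: the class name SymbolFilter and the method strip are made up for illustration, and the sample inputs are written with Unicode escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and structure are illustrative assumptions.
class SymbolFilter {
    // Same whitelist as above, negated: matches anything that is NOT a letter,
    // mark, number, punctuation, separator, format char, surrogate, or whitespace.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Star me \u2605"));       // U+2605 BLACK STAR     -> "Star me "
        System.out.println(strip("\u2192 Cats and dogs")); // U+2192 RIGHTWARDS ARROW -> " Cats and dogs"
        System.out.println(strip("I'm on \uD83D\uDD25"));  // U+1F525 FIRE          -> "I'm on "
    }
}
```

Note that the spaces around a removed sign stay in place, so results can start or end with a space.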

Example:

String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""));
// Output:
//   "hello world _# 皆さん、こんにちは! 私はジョンと申します。"
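
Removing a sign from the middle or the edge of a string leaves the surrounding spaces behind. If that matters, one possible follow-up step is to collapse runs of whitespace and trim the result. This is a sketch under that assumption, not part of the filter itself; WhitespaceTidy and clean are hypothetical names, and note that this step also turns newlines into single spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative assumptions.
class WhitespaceTidy {
    // The whitelist filter from the answer.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // One or more whitespace or Unicode separator characters in a row.
    private static final Pattern SPACE_RUN = Pattern.compile("[\\s\\p{Z}]+");

    static String clean(String input) {
        String stripped = NOT_ALLOWED.matcher(input).replaceAll("");
        // Collapse each run into a single space, then drop leading/trailing spaces.
        return SPACE_RUN.matcher(stripped).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("\u2605 I'm the king \u2605")); // -> "I'm the king"
        System.out.println(clean("Cor\u00E9e \u2605 du Nord"));  // -> "Corée du Nord"
    }
}
```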

If you need more information, see the Javadoc for java.util.regex.Pattern, which documents the \p{...} Unicode category syntax.
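
One caveat with this whitelist: ordinary keyboard symbols such as $ (category Sc) and + (category Sm) belong to the Unicode symbol categories, so they are removed along with the emojis. If they should survive, one option is to also whitelist the printable ASCII range. The sketch below compares both filters; the class AsciiSymbolKeeper, its method names, and the extended pattern are assumptions for illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the answer's filter with an ASCII-friendly variant.
class AsciiSymbolKeeper {
    // The whitelist from the answer: no symbol categories are kept.
    private static final Pattern STRICT =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Assumed variant: additionally keeps printable ASCII, U+0021 to U+007E.
    private static final Pattern KEEP_ASCII =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s\\x21-\\x7E]");

    static String stripStrict(String s) {
        return STRICT.matcher(s).replaceAll("");
    }

    static String stripKeepingAscii(String s) {
        return KEEP_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "Price: $5 + tax \u2605"; // ends with U+2605 BLACK STAR
        System.out.println(stripStrict(sample));       // -> "Price: 5  tax "
        System.out.println(stripKeepingAscii(sample)); // -> "Price: $5 + tax "
    }
}
```

Adding \p{Sm} to the whitelist instead would also keep arrows such as →, which is why the sketch uses an explicit ASCII range.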

