Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c# - Parsing through Arabic / RTL text from left to right

Let's say I have a string in an RTL language such as Arabic with some English chucked in:

string s = "Test:?????;?????;a;b";

Notice there are semicolons in the string. When I call the Split method, as in string[] spl = s.Split(';');, some of the resulting strings come out in reverse order. This is what happens:

?????spl[0] = "?Test:?????"
spl[1] = "?"?????
spl[2] = ?"a"
spl[3] = ?"b"

The above is out of order compared to the original. Instead, I expect to get this:

spl[0] = "Test:?????"
spl[1] = "?????"
spl[2] = "a"
spl[3] = "b"

I'm prepared to write my own split function. However, the chars in the string are also read in reverse order, so I'm back to square one. I just want to go through each character in the order it's shown on the screen.


1 Reply


As your string currently stands, the Arabic word that was typed first is stored before the one that was typed second; the fact that the second word is displayed "first" (that is, further to the left) is just a (correct) result of the Unicode Bidirectional Algorithm in displaying the text.

That is: the string you start with ("Test:?????;?????;a;b") is the result of the user entering "Test:", then ?????, then ";", then ?????, and then ";a;b". Thus, the way C# is splitting it does in fact mirror the way that the string is created. It's just that the way it is created is not reflected in the display of the string, because the two consecutive Arabic words are treated as a single unit when they are displayed.
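
One way to check the stored (logical) order for yourself is to dump each UTF-16 code unit with its index, since a char loop always walks the string in storage order regardless of how it is rendered. The following is only a small sketch: LogicalOrder and Describe are hypothetical names, and the two Arabic words are placeholders written as \u escapes, not the words from the question.

```csharp
using System;
using System.Collections.Generic;

static class LogicalOrder
{
    // Returns one "index: U+XXXX" entry per UTF-16 code unit, in storage order.
    public static List<string> Describe(string s)
    {
        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            result.Add($"{i}: U+{(int)s[i]:X4}");
        }
        return result;
    }

    static void Main()
    {
        // Placeholder text (an assumption): "Test:" + Arabic word + ";" + Arabic word + ";a;b"
        string s = "Test:\u0639\u0631\u0628\u064A;\u0646\u0635;a;b";
        foreach (string line in Describe(s))
        {
            Console.WriteLine(line);
        }
    }
}
```

The indices increase in the order the characters were entered, so the word typed first has the lower indices even when the screen shows it to the right of the second word.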

If you'd like a string to display Arabic words in left-to-right order with semicolons in between, while also storing the words in that same order, then you should put a Left-to-Right mark (U+200E) after the semicolon. This will effectively section off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.

For instance, the following code begins with a string identical to the one you use (with the addition of a single Left-to-Right mark), yet it will split it up the way you are expecting (that is, spl[0] = "Test:?????" and spl[1] = "?????"):

static void Main(string[] args) {
    // \u200E is the Left-to-Right mark, placed right after the first semicolon.
    string s = "Test:?????;\u200E?????;a;b";
    string[] spl = s.Split(';');
}
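
Note that after the split, the piece following the mark still starts with the invisible U+200E character. The parameterless Trim() will not remove it, because U+200E is a format character rather than white space. If the pieces are compared or stored later, it may be worth stripping the mark explicitly. The sketch below shows one possible way to do that; BidiSplit, SplitAndTrim, and MarkForDisplay are hypothetical helper names, not part of the .NET API.

```csharp
using System;
using System.Linq;

static class BidiSplit
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK

    // Splits on ';' and removes Left-to-Right marks from the start and end of each piece.
    public static string[] SplitAndTrim(string s)
    {
        return s.Split(';').Select(part => part.Trim(Lrm)).ToArray();
    }

    // Inserts a Left-to-Right mark after every ';' so that each piece
    // is laid out as its own unit when the string is displayed.
    public static string MarkForDisplay(string s)
    {
        return s.Replace(";", ";" + Lrm);
    }
}
```

With helpers like these, the marked string is only used for display, while the split pieces stay free of invisible characters.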
