Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


macos - Different UTF-8 encoding in filenames on OS X

I have a small shell script in .x:

$ cat .x
u="Böhmáí"
touch "$u"
ls > .list
echo "$u" >.text

cat .list .text
diff .list .text
od -bc .list
od -bc .text

When I run this script with sh -x .x (-x is only there to show the commands):

$ sh -x .x
+ u=Böhmáí
+ touch Böhmáí
+ ls
+ echo Böhmáí
+ cat .list .text
Böhmáí
Böhmáí
+ diff .list .text
1c1
< Böhmáí
---
> Böhmáí
+ od -bc .list
0000000   102 157 314 210 150 155 141 314 201 151 314 201 012            
           B   o   ̈  **   h   m   a   ́  **   i   ́  **  \n
0000015
+ od -bc .text
0000000   102 303 266 150 155 303 241 303 255 012                        
           B   ö  **   h   m   á  **   í  **  \n
0000012

The same string, Böhmáí, is encoded into different bytes in the filename than in the contents of the file. In the terminal (UTF-8 encoded) the string looks the same in both variants.

Where is the rabbit?


1 Reply


(This is mostly stolen from a previous answer of mine...)

Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
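The two representations can be inspected with Python's standard unicodedata module. This is only a minimal sketch to make the byte values above concrete; the variable names are illustrative and not part of the original question.

```python
import unicodedata

precomposed = "\u00e4"   # single code point: Latin small letter a with diaeresis
decomposed = "a\u0308"   # two code points: a + combining diaeresis

# Both render the same on screen, but they are different code point sequences.
print(precomposed == decomposed)            # False
print(precomposed.encode("utf-8").hex())    # c3a4
print(decomposed.encode("utf-8").hex())     # 61cc88

# Unicode normalization converts between the two forms.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```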

OS X's HFS+ filesystem requires that all filenames be stored in fully decomposed form, so what you read back from it is the UTF-8 representation of that decomposed form. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.

So what's happening here is that your shell script contains "Böhmáí" in precomposed form, so it gets stored that way in the variable u, and stored that way in the .text file. But when you create a file with that name (with touch), the filesystem converts it to the decomposed form for the actual filename. And when you ls it, it shows the form the filesystem has: the decomposed form.
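If the goal is to compare a name read back from the filesystem with the name the script used, one option is to normalize both sides to the same form first. The sketch below assumes standard NFC normalization is adequate for the characters involved (HFS+ uses its own variant of decomposition, which matches for this string); same_name is a hypothetical helper, not an existing API.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# What the script wrote to .text (precomposed).
from_script = "B\u00f6hm\u00e1\u00ed"
# What ls wrote to .list (decomposed, as stored by the filesystem).
from_ls = "Bo\u0308hma\u0301i\u0301"

print(from_script == from_ls)              # False: different code points
print(same_name(from_script, from_ls))     # True: equal after normalization

# Byte lengths match the od output above once the trailing newline is excluded.
print(len(from_script.encode("utf-8")))    # 9
print(len(from_ls.encode("utf-8")))        # 12
```

Normalizing at comparison time leaves the stored data untouched, which avoids guessing which form a given filesystem will hand back.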

