Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c++ - How are u8-literals supposed to work?

I'm having trouble understanding the semantics of u8 literals, or rather, understanding the result on g++ 4.8.1.

This is my expectation:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);

This is the result on g++ 4.8.1:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
  • The source file is ISO-8859(-1)
  • We use these compiler directives: -m64 -std=c++11 -pthread -O3 -fpic

In my world, regardless of the encoding of the source file, the resulting UTF-8 string should be longer than 3 bytes.

Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.

Update

If I explicitly tell the compiler what encoding the source file is in, as many suggested, I get the expected behavior for u8 literals. However, regular literals also get encoded as UTF-8.

That is:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
  • compiler directive: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
  • I also tried a few other charset names defined by iconv, e.g. ISO_8859-1.

I'm even more confused now than before...


1 Reply


The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.

So you have several factors at play:

  1. Which encoding is the source file written in? In your case, apparently ISO-8859-1. In this encoding, the string literal "åäö" is 3 bytes, containing the values 0xe5, 0xe4, 0xf6.
  2. Which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
  3. Which encoding does the compiler use for the generated string in the object file? You specify this to be UTF-8 via the u8 prefix.

Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859-1, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte string as a result (each of those characters encodes to 2 bytes in UTF-8).

However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.

You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
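To illustrate the escape-sequence option, here is a minimal sketch. The helper names (to_bytes, escaped_literal, check_escaped_literal) are made up for this example. Because the literal contains only ASCII characters in the source file, the result should not depend on what the compiler assumes about the source encoding. The reinterpret_cast is only there so the same code also compiles under C++20, where u8 literals are arrays of char8_t rather than char.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies the code units of a string literal into a
// std::string, dropping the terminating null character.
template <typename CharT, std::size_t N>
std::string to_bytes(const CharT (&lit)[N]) {
    return std::string(reinterpret_cast<const char*>(lit), N - 1);
}

// The three characters are spelled as universal character names, so the
// source file itself stays pure ASCII.
std::string escaped_literal() {
    return to_bytes(u8"\u00E5\u00E4\u00F6");
}

void check_escaped_literal() {
    const std::string utf8 = escaped_literal();
    assert(utf8.size() == 6);                    // 2 bytes per character
    assert(utf8 == "\xC3\xA5\xC3\xA4\xC3\xB6");  // the UTF-8 bytes of the three characters
}
```

Calling check_escaped_literal() from main should pass no matter how the source file is saved, which is the point of using the escapes.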

Edit:

To clarify a bit, when you specify a string literal with the u8 prefix in your source code, then you are telling the compiler that "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process' environment, or perhaps just using a hardcoded default)

If the string in your source text contains the bytes 0xe5, 0xe4, 0xf6, and you tell the compiler that "the source text is encoded as ISO-8859-1", then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix, and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
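The conversion described above can be sketched by hand. This is not what GCC does internally (it relies on iconv); it is just an illustration of the ISO-8859-1 to UTF-8 mapping, and latin1_to_utf8 is a name invented for this example. It works because every ISO-8859-1 byte value equals its Unicode code point, so bytes below 0x80 pass through and the rest become two-byte sequences.

```cpp
#include <cassert>
#include <string>

// Illustrative only: re-encode ISO-8859-1 bytes as UTF-8.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);
    for (const char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII range: identical in both encodings.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: lead byte 110xxxxx, continuation byte 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

void check_latin1_to_utf8() {
    // 0xe5, 0xe4, 0xf6 in ISO-8859-1 become 0xc3 0xa5, 0xc3 0xa4, 0xc3 0xb6.
    assert(latin1_to_utf8("\xE5\xE4\xF6") == "\xC3\xA5\xC3\xA4\xC3\xB6");
    assert(latin1_to_utf8("abc") == "abc");
}
```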

However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):

  • it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
  • alternatively, it might say "ok, I have 3 bytes here, I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.

Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.
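If you want to catch this kind of pass-through yourself, you can check the bytes at runtime. The following is a rough sketch of a well-formedness check, not a production validator, and is_valid_utf8 is a hypothetical name. It shows why the three ISO-8859-1 bytes are rejected as UTF-8: 0xe5 announces a three-byte sequence, but 0xe4 is not a continuation byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, surrogates, and code points
// above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length
        if (lead < 0x80) { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else { return false; }   // continuation byte or invalid lead byte

        if (s.size() - i - 1 < extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}

void check_is_valid_utf8() {
    assert(!is_valid_utf8("\xE5\xE4\xF6"));             // raw ISO-8859-1 bytes
    assert(is_valid_utf8("\xC3\xA5\xC3\xA4\xC3\xB6"));  // properly converted
}
```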

But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it, to UTF-8". But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).

The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.

The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix, along with the corresponding UTF-16 and UTF-32 prefixes (u and U), was introduced precisely to allow you to specify which encoding you want the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
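A small sketch of the difference between the prefixes, assuming a C++11 or later compiler. code_units is a helper invented for this example. As far as I know, GCC picks the encoding of prefix-less literals via -fexec-charset, which defaults to UTF-8; that would explain why the plain literal in the question's update compares equal to the u8 one. The plain-literal line is deliberately not asserted, since its value is up to the implementation.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The prefix fixes the output encoding, so these hold on any conforming compiler.
static_assert(code_units(u8"\u00E5\u00E4\u00F6") == 6, "UTF-8: 2 bytes per character");
static_assert(code_units(u"\u00E5\u00E4\u00F6") == 3, "UTF-16: 1 code unit per character");
static_assert(code_units(U"\u00E5\u00E4\u00F6") == 3, "UTF-32: 1 code unit per character");

// No prefix: the compiler chooses. Typically 6 with a UTF-8 execution
// character set and 3 with ISO-8859-1, so no assertion here.
constexpr std::size_t plain_units = code_units("\u00E5\u00E4\u00F6");
```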

