Re: Backslash n [OT] was Line Separator and Paragraph Separator

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 23 2003 - 19:08:30 CST


> Of course, indeed I just said that! If it were true then that would imply
> that '\xNNNN' == '\uNNNN' making the \u and \U escapes rather pointless.

That's not pointless:

- '\xNNNN' is interpreted by C compilers as '\xNN' followed by two uppercase
letters N, where '\xNN' is compiled according to the source code character
set. This generates 3 characters in the source code character set, which will
then be converted to the destination charset used at run-time (this charset
may or may not be Unicode, depending on how the "char" C datatype is set, but
most probably it won't be converted to any UTF and will stay in the source
charset). The hex sequence after '\x' is almost always limited to 2 digits,
as this corresponds to the size of char (most often a single byte); the only
exception is 9-bit systems, where 1 byte will still contain only 1 char, but
with 512 combinations (so there may exist 512 characters in the source
charset)...

- '\uNNNN' is to be interpreted in the Unicode encoding and charset only,
whatever the source or destination charset. It should compile correctly only
to create wchar_t instances, provided that the target charset contains this
Unicode character. But some compilers may be able to convert the Unicode
codepoint into a target charset/encoding, using some UTF scheme (only
available for string and wchar_t constants, not for char constants). There's
no support here for Unicode characters outside the BMP, unless you specify a
pair of surrogates, in string constants only, like "\uD800\uDC00".

- '\UNNNNNNNN' is similar but for codepoints in UTF-32 form. It may be
available on C compilers that support wchar_t with more than 16 bits (most
probably then 32-bit or 24-bit). The C compiler should forbid any invalid
codepoint, such as surrogates and noncharacters like U+FFFE. In practice, for
now, most C/C++ compilers support wchar_t as 16-bit unsigned shorts, and have
no support for '\U' in character constants, but may provide this support for
string constants if the target charset is Unicode (in that case it may
convert it first to a UTF-16 sequence), or if the target charset contains the
corresponding character. (For example it can be used in some Chinese source
code encoded with GB18030, as a way to allow the source code to be remapped
to ASCII or UTF-* for transmission, even if the target system will use GB2312
or Unicode at run-time.)
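
To make the surrogate and '\U' points above concrete, here is a rough C
sketch. It is not taken from any compiler; the function names
(is_surrogate, is_noncharacter, utf16_encode, supplementary_literal_units)
are made up for illustration, and it assumes a C99 compiler that accepts
'\U' in wide string constants. It shows the kind of check and UTF-16
conversion a compiler might apply to a codepoint, and how many wchar_t
units a '\U' escape outside the BMP ends up occupying.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* All names below are illustrative assumptions, not a real compiler API. */

/* 1 if cp is a surrogate code point (U+D800..U+DFFF). */
int is_surrogate(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* 1 if cp is a Unicode noncharacter: U+FDD0..U+FDEF, or the last two
   code points of any plane (U+FFFE, U+FFFF, U+1FFFE, ...). */
int is_noncharacter(uint32_t cp) {
    if (cp > 0x10FFFF) return 0;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Encodes cp as UTF-16 into out[0..1]. Returns the number of 16-bit
   units written (1 or 2), or 0 if cp is out of range or a surrogate. */
int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp > 0x10FFFF || is_surrogate(cp)) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: one unit, unchanged */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Number of wchar_t units the compiler produced for U+10000: typically
   2 where wchar_t is 16 bits wide, 1 where it is 32 bits wide. */
size_t supplementary_literal_units(void) {
    return wcslen(L"\U00010000");
}
```

For instance, utf16_encode(0x10000, units) returns 2 and stores 0xD800 and
0xDC00, which is the "\uD800\uDC00" pair mentioned above, while
utf16_encode(0xD800, units) returns 0 because a lone surrogate is rejected.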


