Re: Backslash n [OT] was Line Separator and Paragraph Separator

From: jon@hackcraft.net
Date: Thu Oct 23 2003 - 03:18:09 CST


> From: <jon@hackcraft.net>
> > However because the universal-character-name escapes (\uXXXX and
> > \UXXXXXXXX) are defined relative to a particular encoding, namely ISO
> > 10646, it would be an error if ('\n' != '\u000A' || '\r' != '\u000D').
> > Whether this is implemented by using the values 0x0A and 0x0D for LF and
> > CR respectively (e.g. by using US-ASCII or a proper superset of US-ASCII
> > such as Unicode) or by converting those values to another encoding when
> > parsing isn't specified.
>
> You're wrong here:
> Neither Unicode nor ISO specifies that the source constants '\n' or '\r',
> which are made with an escaping mechanism of _multiple_ distinct characters
> specific to some programming languages, must be bound at compile-time or
> run-time to a LF or CR character.

How foolish of me to assume that ISO 14882 somehow represented the position of
ISO on the matter.

> The '\n' and '\r' conventions are specific to each language, and C/C++ use
> conventions distinct from those in Java for example... This is not an
> encoding issue, but a language feature.

That's why I referred to the ISO standard for the C++ language rather than any
for encodings.

> In C or C++, if you want to be sure that your program will be portable when
> you need to specify LF or CR exclusively, you MUST NOT use the '\n' and '\r'
> constants but instead the numeric escapes in strings (i.e. "\012" or "\x0A"
> for LF, and "\015" or "\x0D" for CR), or simply the integer constants for
> the char, int, or wchar_t datatypes (i.e. 10 or 012 or 0x0A for LF, and 13
> or 015 or 0x0D for CR), and make sure that your run-time library will map
> these values correctly with your run-time locale or system environment (you
> may need to specify file-open flags to control this mapping, such as the "t"
> flag for fopen function calls).

This is true if you want to specify LF or CR in a string of bytes that will be
accepted as a particular character encoding - that is, if you are ignoring the
built-in string and character features and dropping down to the byte level in
your own code. If you want to specify LF or CR in the implementation character
set then you should not use these numeric values, as they may be incorrect;
rather, you should use \n and \r, or \u000A and \u000D, or \U0000000A and
\U0000000D.
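
To make the two levels concrete, here is a minimal sketch. The function names
are hypothetical, made up purely for illustration. The first function fixes
the byte values; the second leaves the values to the implementation's
character set.

```cpp
#include <string>

// Byte level: fixed numeric values, independent of the execution character
// set. Suitable when a wire or file format demands the octets 13 and 10.
// (wire_crlf is a hypothetical helper name.)
inline std::string wire_crlf() {
    return std::string("\x0D\x0A", 2);
}

// Character level: whatever the implementation uses for carriage return and
// new-line in its execution character set.
inline std::string text_crlf() {
    return std::string("\r\n", 2);
}

// True only where the execution character set places CR and LF at 13 and 10,
// as ASCII-derived sets do. The language standard does not promise this.
inline bool escapes_match_ascii() {
    return text_crlf() == wire_crlf();
}
```

On an ASCII-based implementation I would expect escapes_match_ascii() to
return true, but that is a property of the implementation, not of the
language.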

> So a test like: "if ('\n'==10)" may or may not be true in C/C++, depending
> on the compiler implementation (but not on the system platform...), and the
> same test in Java will always be true...

Of course, indeed I just said that! If it were guaranteed to be true then that
would imply that '\xNNNN' == '\uNNNN', making the \u and \U escapes rather
pointless. Further, the same holds for all characters: if ('a'==0x61) may not
be true in C++ either.
But if ('\n'=='\u000A') should always be true, because ISO 14882 defines \n as
LF and defines \uNNNN as "that character whose short name in ISO/IEC 10646 is
0000NNNN", and the character whose short name in ISO/IEC 10646 is 0000000A is
LF.
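
The distinction can be written down as a short sketch. It assumes a compiler
in C++11 mode or later, since as far as I know that is where
universal-character-names for control characters are permitted inside
character literals; earlier modes may reject them. The function names are
hypothetical.

```cpp
// Assumption: compiled as C++11 or later, so that \u000A and \u000D are
// accepted inside character literals.

// Both sides name the same abstract character, so these comparisons should
// hold whatever numeric value the implementation assigns to it.
static_assert('\n' == '\u000A', "\\n and \\u000A should both denote LF");
static_assert('\r' == '\u000D', "\\r and \\u000D should both denote CR");

// These compare a character against a fixed number, so the result depends on
// the execution character set (true for ASCII-derived sets, false for
// EBCDIC, for instance). Hypothetical helper names.
constexpr bool a_is_0x61() { return 'a' == 0x61; }
constexpr bool newline_is_10() { return '\n' == 10; }
```

The static_asserts are checked at compile time; the two functions merely
report what the implementation happens to do.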


