RE: How will software source code represent 21 bit unicode characters?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Apr 17 2001 - 04:20:14 EDT


William Overington wrote:
> Has this matter already been addressed anywhere?

I think the C standard is in the process of making a decision about this. If
memory serves, we will have escapes like '\uXXXX' and '\UXXXXXXXX'.
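
For what it is worth, the two forms recalled above match the "universal
character names" of ISO C99: '\u' followed by exactly four hexadecimal digits
and '\U' followed by exactly eight. A minimal sketch of how a tool might read
them follows. The names parse_ucn and hex_value are made up for this
illustration, and the digit arithmetic assumes an ASCII-based character set.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit.
 * Assumes 'a'..'f' and 'A'..'F' are contiguous, as they are in ASCII. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * parse_ucn (hypothetical helper): read "\uXXXX" or "\UXXXXXXXX" at s.
 * Returns the number of chars consumed (6 or 10) and stores the code
 * point in *cp, or returns 0 if s does not start with a complete escape.
 */
size_t parse_ucn(const char *s, unsigned long *cp)
{
    size_t digits, i;
    unsigned long value = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\') return 0;
    if (s[1] == 'u') digits = 4;        /* short form: four digits */
    else if (s[1] == 'U') digits = 8;   /* long form: eight digits */
    else return 0;

    for (i = 0; i < digits; i++) {
        int d = hex_value((unsigned char)s[2 + i]);
        if (d < 0) return 0;            /* also stops at the final '\0' */
        value = (value << 4) | (unsigned long)d;
    }
    *cp = value;
    return 2 + digits;
}
```

Because the digit count is fixed by the letter, the reader never has to guess
where the number ends. The price is that every escape is padded to four or
eight digits.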

I am sure that some people on this list have precise and up-to-date info at
least about C and Java.

> May I, with permission, start a discussion by suggesting that \uhhhh
> \vhhhhh and \whhhhhh would be good formats. Programmers could then
> enter unicode characters into software source code using \u and four
> hexadecimal characters or using \v and five hexadecimal characters or
> using \w and six hexadecimal characters, as convenient for any
> particular character. Leading zeros would be allowed so that \w0000e9
> would be the same as \v000e9 and \u00e9.

Isn't a three-symbol system too complicated? I am already unhappy with a
two-symbol scheme like '\u' vs. '\U'.

In a perfect world, we would probably have enclosing delimiters (e.g.
'\<4E00>') so that the number of digits can vary.
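
To show what the delimited form would buy, here is a rough sketch of a reader
for it. The '\<...>' syntax is only the hypothetical notation from the
paragraph above, not something any C standard defines, and the names
parse_bracket_escape and hex_digit are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit.
 * Assumes an ASCII-based character set. */
static int hex_digit(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * parse_bracket_escape (hypothetical): read "\<H...H>" at s, where H...H
 * is one or more hex digits. Returns the number of chars consumed and
 * stores the code point in *cp, or returns 0 on malformed input or a
 * value above U+10FFFF.
 */
size_t parse_bracket_escape(const char *s, unsigned long *cp)
{
    size_t i = 2;
    unsigned long value = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\' || s[1] != '<') return 0;

    for (;;) {
        int d = hex_digit((unsigned char)s[i]);
        if (d < 0) break;                   /* '>' or anything else */
        value = (value << 4) | (unsigned long)d;
        if (value > 0x10FFFFUL) return 0;   /* outside the Unicode range */
        i++;
    }
    if (i == 2 || s[i] != '>') return 0;    /* need a digit and a '>' */
    *cp = value;
    return i + 1;
}
```

With a closing delimiter, '\<E9>', '\<4E00>' and '\<20000>' all go through the
same path, and leading zeros are harmless, so a single escape letter is
enough.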

> I am aware, from having used C a little, that some \ pairs are used
> for such things as the tab character and wonder if perhaps \v and \w
> are available for use in the manner suggested in the previous
> paragraph.

In C, '\v' is already used for 0x0B ("vertical tab").
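
A small sketch of the clash: an existing C compiler reads the proposed
spelling \v000e9 as a vertical tab followed by the five ordinary characters
"000e9". The names proposed and inspect_proposed are made up for the example,
and the value 0x0B assumes an ASCII-based execution character set.

```c
#include <assert.h>
#include <string.h>

/* What a C compiler of today makes of the proposed spelling \v000e9. */
static const char proposed[] = "\v000e9";

/*
 * inspect_proposed (hypothetical): check that the literal starts with a
 * vertical tab and return its length in chars.
 */
size_t inspect_proposed(void)
{
    assert(proposed[0] == '\v');   /* 0x0B on ASCII-based systems */
    return strlen(proposed);       /* the tab plus "000e9" */
}
```

So the result is a six-char string, not the single character U+00E9 that the
proposal intends.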

> However, there will probably for many years be a practical need in
> many programming environments to enter program source code in ASCII
> characters.

There will always be this need, because of invisible characters, combining
characters, characters having tiny or misleading glyphs, etc.
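
As an illustration of that need, here is a sketch of a helper that writes one
code point in an ASCII-only form, so that something like U+200B (zero width
space) or U+0301 (combining acute accent) stays visible in source text. The
name escape_code_point is hypothetical, and the '\u' / '\U' output follows the
escapes mentioned earlier. Real C restricts which code points may be written
this way, and this sketch ignores that.

```c
#include <assert.h>
#include <stdio.h>

/*
 * escape_code_point (hypothetical): write cp into buf as ASCII-only text.
 * Printable ASCII is copied as is, a backslash is doubled, and everything
 * else becomes \uXXXX (up to U+FFFF) or \UXXXXXXXX (above U+FFFF).
 * Returns what snprintf returns: the length of the text, not counting
 * the terminating '\0'.
 */
int escape_code_point(unsigned long cp, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (cp == 0x5C)                          /* backslash */
        return snprintf(buf, size, "\\\\");
    if (cp >= 0x20 && cp <= 0x7E)            /* printable ASCII */
        return snprintf(buf, size, "%c", (int)cp);
    if (cp <= 0xFFFF)
        return snprintf(buf, size, "\\u%04lX", cp);
    return snprintf(buf, size, "\\U%08lX", cp);
}
```

A buffer of 11 chars is enough for the longest output, the ten chars of the
'\U' form plus the terminator.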

_ Marco


