How will software source code represent 21-bit Unicode characters?

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Tue Apr 17 2001 - 02:33:16 EDT


In Java source code one may currently represent a 16-bit Unicode character
by using \uhhhh, where each h is a hexadecimal digit.
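As a small illustration of the existing escape, here is a sketch in which
the class name UnicodeEscapeDemo and the method eAcute are made up for the
example. It only shows that a \u escape with four hexadecimal digits yields
a single 16-bit char.

```java
final class UnicodeEscapeDemo {
    // The literal below uses a backslash-u escape with four hex digits (00e9).
    static String eAcute() {
        return "\u00e9";
    }

    public static void main(String[] args) {
        String s = eAcute();
        // One 16-bit char whose value is 0xe9: prints "1 e9".
        System.out.println(s.length() + " " + Integer.toHexString(s.charAt(0)));
    }
}
```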

How will Java, and maybe other languages, represent 21-bit Unicode
characters?

Many computer languages may in future allow 21-bit Unicode characters to be
represented in source code using only ASCII characters. Would people agree
that it might be a good idea to try to reach a consensus now on a desired
format, in the hope that language authors would adopt it unless there is
some good reason not to for a particular language?

Has this matter already been addressed anywhere?

May I start the discussion by suggesting that \uhhhh, \vhhhhh and \whhhhhh
would be good formats? Programmers could then enter Unicode characters into
source code using \u and four hexadecimal digits, \v and five hexadecimal
digits, or \w and six hexadecimal digits, as convenient for any particular
character. Leading zeros would be allowed, so that \w0000e9 would be the
same as \v000e9 and \u00e9.
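
To make the suggestion concrete, here is a rough sketch of how the three
formats might be decoded at run time. It is only an illustration of the
proposal, not anything an existing compiler does; the class name
EscapeDecoder and the method decode are hypothetical. It treats the input as
plain text, so an escaped backslash before u, v or w is not handled.

```java
final class EscapeDecoder {
    // Hypothetical helper for the proposed formats: backslash-u takes 4 hex
    // digits, backslash-v takes 5 and backslash-w takes 6.
    static String decode(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int digits = 0;
            if (c == '\\' && i + 1 < src.length()) {
                switch (src.charAt(i + 1)) {
                    case 'u': digits = 4; break;
                    case 'v': digits = 5; break;
                    case 'w': digits = 6; break;
                    default: break;
                }
            }
            if (digits == 0) {
                // Not one of the three escapes: copy the character unchanged.
                out.append(c);
                i++;
                continue;
            }
            int end = i + 2 + digits;
            if (end > src.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int codePoint = 0;
            for (int j = i + 2; j < end; j++) {
                int d = Character.digit(src.charAt(j), 16);
                if (d < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + j);
                }
                codePoint = codePoint * 16 + d;
            }
            // Six hex digits can exceed the Unicode range, so check first.
            if (!Character.isValidCodePoint(codePoint)) {
                throw new IllegalArgumentException("not a Unicode code point at index " + i);
            }
            out.appendCodePoint(codePoint);
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Leading zeros are allowed, so all three spellings agree: prints "true".
        String a = decode("\\w0000e9");
        System.out.println(a.equals(decode("\\v000e9")) && a.equals(decode("\\u00e9")));
    }
}
```

A character beyond 16 bits, for example decode("\\v1d538"), comes back as a
single code point, which a Java String stores as two 16-bit chars.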

I am aware, from having used C a little, that some \ pairs are used for such
things as the tab character, and I wonder whether \v and \w are available
for use in the manner suggested in the previous paragraph.

I realize that it would be nice to be able to enter strings into source code
directly as Unicode characters, and this suggestion of \u, \v and \w does
not seek to impede that possibility. However, there will probably be a
practical need for many years, in many programming environments, to enter
program source code in ASCII characters.

William Overington

17 April 2001


