Re: character names (questions)

From: Mark Davis (markdavis@ispchannel.com)
Date: Thu Apr 06 2000 - 15:22:46 EDT


One of the very few times I have to correct Ken:

D841 is a code unit in UTF-16
DF00 is a code unit in UTF-16
10300 is a code point (aka scalar value) in the Unicode codespace. It is represented by the code units:

  F0 90 8C 80 in UTF-8 (four 8-bit units)
  D800 DF00 in UTF-16 (two 16-bit units)
  00010300 in UTF-32 (one 32-bit unit)

[from my handy dandy code converter at http://www.macchiato.com/mark/UnicodeConverter]
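
The same conversion can be checked by hand. What follows is only a minimal Python sketch of the standard UTF-8 and UTF-16 arithmetic, not the code behind the converter above; the function names (utf8_units, utf16_units, from_surrogates) are made up for illustration.

```python
def utf8_units(cp):
    """Return the UTF-8 code units (bytes) for a code point, as a list of ints."""
    # Python's codec does the work; surrogate code points are rejected by it.
    return list(chr(cp).encode("utf-8"))


def utf16_units(cp):
    """Return the UTF-16 code units for a code point, as a list of ints."""
    if cp < 0x10000:
        return [cp]                      # one 16-bit unit
    v = cp - 0x10000                     # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits -> low surrogate
    return [high, low]


def from_surrogates(high, low):
    """Return the code point represented by a UTF-16 surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def show(units, width):
    """Format code units as uppercase hex with a fixed number of digits."""
    return " ".join(f"{u:0{width}X}" for u in units)


if __name__ == "__main__":
    cp = 0x10300
    print(show(utf8_units(cp), 2), "in UTF-8")
    print(show(utf16_units(cp), 4), "in UTF-16")
    print(show([cp], 8), "in UTF-32")
    # The pair D841 DF00 is well formed, but it names a different code point.
    print(f"D841 DF00 -> {from_surrogates(0xD841, 0xDF00):X}")
```

By this arithmetic the pair D841 DF00 comes out as code point 20700, in Plane 2, not 10300.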

Ken is right that a code point will only have a name if it is assigned.
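
To see this from a program, here is a small sketch using Python's standard unicodedata module. It assumes whatever version of the Unicode Character Database ships with the interpreter, which is later than the one under discussion here, so names for newly assigned code points may differ from what is quoted in this thread. The helper character_name is hypothetical.

```python
import unicodedata


def character_name(cp):
    """Return the character name for a code point, or None if it has none."""
    # unicodedata.name() returns the supplied default when no name is defined,
    # which is the case for surrogate, private use, and unassigned code points.
    return unicodedata.name(chr(cp), None)


if __name__ == "__main__":
    for cp in (0x00C0, 0xD841, 0xDF00, 0xE000):
        category = unicodedata.category(chr(cp))
        print(f"{cp:04X} [{category}] {character_name(cp)}")
```

The general category reported alongside ("Cs" for surrogates, "Co" for private use) shows why the lookup comes back empty for those code points.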

Mark

Kenneth Whistler wrote:

> Viranga asked:
>
> > I have 4 questions about character names:
>
> Mark Davis, John Jenkins, and Markus Scherer addressed many of these
> questions. And I do suggest you take a look at the ICU implementations,
> so you don't have to reinvent the wheel here.
>
> I just have a couple clarifications of terminology for you.
>
> >
> > (1) how does one figure out the character names of the code points
> > (in ranges in the UnicodeData.txt file)?
>
> "code points" do not have character names in the Unicode Standard.
>
> The thing that gets an associated character name is an "encoded character."
>
> This may seem like a quibble, but it actually becomes important when you
> consider surrogate code points.
>
> 00C0 is a code point in the Unicode codespace.
>
> The abstract character "capital A with a grave accent" is encoded at
> that code point (00C0).
>
> The encoded character U+00C0 has the normative character name "LATIN CAPITAL
> LETTER A WITH GRAVE".
>
> Now for surrogates:
>
> D841 is a code point in the Unicode codespace.
> DF00 is a code point in the Unicode codespace.
> 10300 is a code point in the Unicode codespace.
>
> D841 and DF00 are surrogate Unicode values. They cannot be assigned to
> abstract characters (individually), and because no encoded character is
> ever associated with them (individually), they also have no character
> names.
>
> The abstract character "the first letter of the Etruscan alphabet" will soon
> be encoded at the code point, 10300.
>
> That encoded character U-00010300 will have the normative character name
> "ETRUSCAN LETTER A".
>
> In the encoding form, UTF-16, U-00010300 ETRUSCAN LETTER A is represented
> by the surrogate pair D841 DF00 (a sequence of two 16-bit Unicode values).
>
> >
> > ...and also for the private use ranges
> > (which we'll probably be needing).
>
> As John Jenkins pointed out, private use code points also have no
> character names.
>
> > (2) how do I locate the ISO/IEC character naming guidelines?
> > I looked in "The Unicode Standard Version 3.0" and it refers
> > me to Informative Annex K of ISO/IEC 10646. Is the information
> > available electronically? I looked at the ISO site and it said
> > that "there is no electronic access to the contents of ISO
> > standards" (http://www.iso.ch/infoe/faq.htm#Standards). It did
> > mention that this was in the pipeline, but didn't say when.
>
> You have to buy the standard from ISO or a national standards body to
> get the official thing. SC2 is working on getting an online version
> available, but there are problems regarding which version of the standard
> it will be.
>
> > (3) when surrogates are introduced, will there be mappings from
> > surrogate pairs to character names? Will they be included
> > in later versions of UnicodeData.txt?
>
> I concur with Mark Davis here. It is most likely that UnicodeData.txt will
> simply be extended to use 5 digit Unicode scalar value representations of
> encoded characters from Planes 1, 2, and 14, once they are added to the
> standard.
>
> > (4) why are they called "character names" and not "code point names"?
>
> See the explanation above.
>
> --Ken Whistler


