I like Ken's definition. I don't think it helps to introduce another concept such as "code unit",
since most developers and users are used to "code point" meaning a 16-bit code value.
Mark Davis wrote:
> One of the very few times I have to correct Ken:
> D841 is a code unit in UTF-16
> DF00 is a code unit in UTF-16
> 10300 is a code point (aka scalar value) in the Unicode codespace. It is represented by the code units:
> F0 90 8C 80 in UTF-8 (four 8-bit units)
> D800 DF00 in UTF-16 (two 16-bit units)
> 00010300 in UTF-32 (one 32-bit unit)
> [from my handy dandy code converter at http://www.macchiato.com/mark/UnicodeConverter]
> Ken is right that a code point will only have a name if it is assigned.
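
For anyone who wants to reproduce the three rows above without the converter, here is a minimal Python sketch. The helper name code_units is my own invention, not part of any library, and I am assuming big-endian byte order so that the hex digits read the same way as in Mark's list.

```python
def code_units(code_point):
    """Return the code units of one code point in each encoding form, as hex strings."""
    ch = chr(code_point)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
    utf32 = ch.encode("utf-32-be")   # big-endian, no byte order mark
    return {
        # UTF-8 code units are 8 bits wide: one byte each.
        "UTF-8": [format(b, "02X") for b in utf8],
        # UTF-16 code units are 16 bits wide: group the bytes in twos.
        "UTF-16": [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)],
        # UTF-32 code units are 32 bits wide: group the bytes in fours.
        "UTF-32": [utf32[i:i + 4].hex().upper() for i in range(0, len(utf32), 4)],
    }


if __name__ == "__main__":
    for form, units in code_units(0x10300).items():
        print(form, " ".join(units))
```

The point of the sketch is only that one code point (10300) maps to a different number of code units depending on the encoding form: four, two, or one.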
> Kenneth Whistler wrote:
> > Viranga asked:
> > > I have 4 questions about character names:
> > Mark Davis, John Jenkins, and Markus Scherer addressed many of these
> > questions. And I do suggest you take a look at the ICU implementations,
> > so you don't have to reinvent the wheel here.
> > I just have a couple clarifications of terminology for you.
> > >
> > > (1) how does one figure out the character names of the code points
> > > (in ranges in the UnicodeData.txt file)?
> > "code points" do not have character names in the Unicode Standard.
> > The thing that gets an associated character name is an "encoded character."
> > This may seem like a quibble, but it actually becomes important when you
> > consider surrogate code points.
> > 00C0 is a code point in the Unicode codespace.
> > The abstract character "capital A with a grave accent" is encoded at
> > that code point (00C0).
> > The encoded character U+00C0 has the normative character name "LATIN CAPITAL
> > LETTER A WITH GRAVE".
> > Now for surrogates:
> > D841 is a code point in the Unicode codespace.
> > DF00 is a code point in the Unicode codespace.
> > 10300 is a code point in the Unicode codespace.
> > D841 and DF00 are surrogate Unicode values. They cannot be assigned to
> > abstract characters (individually), and because no encoded character is
> > ever associated with them (individually), they also have no character
> > names.
> > The abstract character "the first letter of the Etruscan alphabet" will soon
> > be encoded at the code point, 10300.
> > That encoded character U-00010300 will have the normative character name
> > "ETRUSCAN LETTER A".
> > In the encoding form, UTF-16, U-00010300 ETRUSCAN LETTER A is represented
> > by the surrogate pair D800 DF00 (a sequence of two 16-bit Unicode values).
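
The mapping between a supplementary code point and its surrogate pair is plain arithmetic, so it may help to see it written out. This is a rough Python sketch under my own naming (to_surrogate_pair and from_surrogate_pair are hypothetical helpers, not library functions). For 10300 it yields D800 DF00, which matches the UTF-16 row in Mark's list.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (10000..10FFFF) into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10300)
    print(format(high, "04X"), format(low, "04X"))
```

Neither half means anything on its own, which is the reason the individual surrogates carry no character names.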
> > >
> > > ...and also for the private use ranges
> > > (which we'll probably be needing).
> > As John Jenkins pointed out, private use code points also have no
> > character names.
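
The distinction is easy to see from a program. Below is a small sketch using Python's standard unicodedata module. The wrapper character_name is my own helper, and the results depend on the version of the Unicode database shipped with the interpreter, so treat it as an illustration rather than a reference.

```python
import unicodedata


def character_name(code_point):
    """Return the character name encoded at a code point, or None if it has none."""
    # unicodedata.name() raises ValueError for nameless code points unless
    # a default is supplied; None is used as that default here.
    return unicodedata.name(chr(code_point), None)


if __name__ == "__main__":
    for cp in (0x00C0, 0xD841, 0xDF00, 0xE000):
        print(format(cp, "04X"), character_name(cp))
```

The encoded character at 00C0 reports its name, while the surrogate code points (D841, DF00) and the private use code point (E000) report no name at all.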
> > > (2) how do I locate the ISO/IEC character naming guidelines?
> > > I looked in "The Unicode Standard Version 3.0" and it refers
> > > me to Informative Annex K of ISO/IEC 10646. Is the information
> > > available electronically? I looked at the ISO site and it said
> > > that "there is no electronic access to the contents of ISO
> > > standards" (http://www.iso.ch/infoe/faq.htm#Standards). It did
> > > mention that this was in the pipeline, but didn't say when.
> > You have to buy the standard from ISO or a national standards body to
> > get the official thing. SC2 is working on getting an online version
> > available, but there are problems regarding which version of the standard
> > it will be.
> > > (3) when surrogates are introduced, will there be mappings from
> > > surrogate pairs to character names? Will they be included
> > > in later versions of UnicodeData.txt?
> > I concur with Mark Davis here. It is most likely that UnicodeData.txt will
> > simply be extended to use 5-digit Unicode scalar value representations of
> > encoded characters from Planes 1, 2, and 14, once they are added to the
> > standard.
> > > (4) why are they called "character names" and not "code point names"?
> > See the explanation above.
> > --Ken Whistler
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT