Re: character names (questions)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 06 2000 - 14:52:11 EDT

Next message: Mark Davis: "Re: character names (questions)"
Previous message: Markus Scherer: "Re: Microsoft Code Page Tables"
Maybe in reply to: Viranga Ratnaike: "character names (questions)"
Next in thread: Mark Davis: "Re: character names (questions)"

Viranga asked:

> I have 4 questions about character names:

Mark Davis, John Jenkins, and Markus Scherer addressed many of these
questions. And I do suggest you take a look at the ICU implementations,
so you don't have to reinvent the wheel here.

I just have a couple of clarifications of terminology for you.

>
> (1) how does one figure out the character names of the code points
> (in ranges in the UnicodeData.txt file)?

"Code points" do not have character names in the Unicode Standard.

The thing that gets an associated character name is an "encoded character."

This may seem like a quibble, but it actually becomes important when you
consider surrogate code points.

00C0 is a code point in the Unicode codespace.

The abstract character "capital A with a grave accent" is encoded at
that code point (00C0).

The encoded character U+00C0 has the normative character name "LATIN CAPITAL
LETTER A WITH GRAVE".
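As a minimal sketch of the distinction, using Python's standard unicodedata
module (my choice for illustration, not something this thread relies on):
the code point is just a number, while the name belongs to the character
encoded there.

```python
import unicodedata

# The encoded character at code point 00C0.
ch = "\u00C0"

# The code point is only an integer position in the codespace.
code_point = ord(ch)

# The character name is a property of the encoded character.
name = unicodedata.name(ch)

print(f"U+{code_point:04X} {name}")
```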

Now for surrogates:

D800 is a code point in the Unicode codespace.
DF00 is a code point in the Unicode codespace.
10300 is a code point in the Unicode codespace.

D800 and DF00 are surrogate Unicode values. They cannot be assigned to
abstract characters (individually), and because no encoded character is
ever associated with them (individually), they also have no character
names.

The abstract character "the first letter of the Etruscan alphabet" will soon
be encoded at the code point, 10300.

That encoded character U-00010300 will have the normative character name
"ETRUSCAN LETTER A".

In the encoding form, UTF-16, U-00010300 ETRUSCAN LETTER A is represented
by the surrogate pair D800 DF00 (a sequence of two 16-bit Unicode values).
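
The UTF-16 arithmetic behind a surrogate pair can be sketched as follows.
The helper name to_surrogate_pair is my own, for illustration only. For the
scalar value 10300 the computation gives D800 DF00, which the last lines
cross-check against Python's built-in UTF-16 encoder.

```python
from typing import Tuple


def to_surrogate_pair(scalar: int) -> Tuple[int, int]:
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only values in 10000..10FFFF use surrogate pairs")
    offset = scalar - 0x10000          # 20-bit offset into Planes 1..16
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10300)
print(f"{high:04X} {low:04X}")

# Cross-check: big-endian UTF-16 bytes of the same scalar value.
encoded = "\U00010300".encode("utf-16-be")
print(encoded.hex().upper())
```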

>
> ...and also for the private use ranges
> (which we'll probably be needing).

As John Jenkins pointed out, private use code points also have no
character names.
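
A quick way to see this, again as a sketch with Python's unicodedata module
(an illustration of mine, not part of the standard's data files): surrogate
and private use code points report general categories Cs and Co, and a name
lookup finds nothing. The helper name_or_none is hypothetical.

```python
import unicodedata


def name_or_none(code_point: int):
    """Return the character name for a code point, or None if it has none."""
    return unicodedata.name(chr(code_point), None)


# One encoded character, two surrogate code points, one private use code point.
for cp in (0x00C0, 0xD800, 0xDF00, 0xE000):
    category = unicodedata.category(chr(cp))
    print(f"{cp:04X} {category} {name_or_none(cp)}")
```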

> (2) how do I locate the ISO/IEC character naming guidelines?
> I looked in "The Unicode Standard Version 3.0" and it refers
> me to Informative Annex K of ISO/IEC 10646. Is the information
> available electronically? I looked at the ISO site and it said
> that "there is no electronic access to the contents of ISO
> standards" (http://www.iso.ch/infoe/faq.htm#Standards). It did
> mention that this was in the pipeline, but didn't say when.

You have to buy the standard from ISO or a national standards body to
get the official thing. SC2 is working on getting an online version
available, but there are problems regarding which version of the standard
it will be.

> (3) when surrogates are introduced, will there be mappings from
> surrogate pairs to character names? Will they be included
> in later versions of UnicodeData.txt?

I concur with Mark Davis here. It is most likely that UnicodeData.txt will
simply be extended to use 5-digit Unicode scalar value representations of
encoded characters from Planes 1, 2, and 14, once they are added to the
standard.
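
For what it is worth, a parser that reads the first field as a hexadecimal
number does not care whether it has 4 or 5 digits. The sketch below assumes
semicolon-delimited lines with the value in the first field and the name in
the second, and it treats "<..., First>" / "<..., Last>" entries as range
markers rather than names. The sample lines are illustrative only; in
particular the 10300 entry is hypothetical and just reuses the name
anticipated above.

```python
# Illustrative sample in the style of UnicodeData.txt (not the real file).
SAMPLE = """\
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
10300;ETRUSCAN LETTER A;Lo;0;L;;;;;N;;;;;
"""


def parse_names(text):
    """Return ({value: name}, [(first, last, label)]) from UnicodeData-style text."""
    names = {}
    ranges = []
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        value = int(fields[0], 16)     # works for 4-, 5-, or 6-digit values
        name = fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            range_start = value        # remember where the range begins
        elif name.startswith("<") and name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            ranges.append((range_start, value, label))
            range_start = None
        else:
            names[value] = name
    return names, ranges


names, ranges = parse_names(SAMPLE)
for value, name in sorted(names.items()):
    print(f"{value:04X} {name}")
for first, last, label in ranges:
    print(f"{first:04X}..{last:04X} {label} (range, no individual names)")
```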

> (4) why are they called "character names" and not "code point names"?

See the explanation above.

--Ken Whistler

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT