Re: Terminology question: character-like thing

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Wed Sep 29 1999 - 11:25:17 EDT


By coincidence, IBM is starting up a new site for Unicode developers (at
http://www.ibm.com/developer/unicode/), and I have a paper there that
discusses some of these topics.

As to your specific question, I can provide some more information. As in
the paper, I'll avoid the use of the term "character" to prevent ambiguity
where possible.

- A "combining character sequence" (CCS) is a base code point followed by
zero or more combining code points.
- A code point that is canonically equivalent to a CCS is called a
"composite" (aka "precomposed character").
- A combining character sequence is a type of "grapheme" (aka "user
character"). Besides CCSs, graphemes also include Indic syllables, Thai/Lao
syllables and Hangul Jamo syllables.
- Graphemes themselves are a type of "text element", which just means any
sequence of characters that has particular significance to a given type of
process. Examples of text elements include graphemes, consonant clusters,
words, line-break tokens, compiler identifiers, sentences, etc.
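
To make the first two definitions concrete, here is a minimal sketch in
Python using the standard unicodedata module. The function names are
hypothetical, chosen only for illustration, and "combining code point" is
approximated as any code point whose general category starts with "M"
(Mn, Mc, Me). That approximation is an assumption of the sketch, not a
definition taken from the standard.

```python
import unicodedata


def is_combining(cp: str) -> bool:
    # Assumption: treat every mark (general category Mn, Mc, Me) as combining.
    return unicodedata.category(cp).startswith("M")


def combining_character_sequences(text: str) -> list[str]:
    """Split text into CCSs: a base code point plus zero or more combining ones."""
    sequences: list[str] = []
    for cp in text:
        if is_combining(cp) and sequences:
            # Attach the combining code point to the sequence in progress.
            sequences[-1] += cp
        else:
            # A base code point (or a stray leading mark) starts a new sequence.
            sequences.append(cp)
    return sequences


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def is_composite(cp: str) -> bool:
    """True if a single code point canonically decomposes to base + marks."""
    if len(cp) != 1:
        return False
    decomposed = unicodedata.normalize("NFD", cp)
    return len(decomposed) > 1 and all(is_combining(c) for c in decomposed[1:])
```

For example, combining_character_sequences("e\u0301a") returns
["e\u0301", "a"], and is_composite("\u00e9") is True because U+00E9 is
canonically equivalent to the sequence U+0065 U+0301. A Hangul syllable
such as U+AC00 also decomposes canonically, but into Jamo rather than
base plus combining marks, so this sketch does not report it as a
composite.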

Probably the best term for what you are looking for would be "unencoded
combining character sequence" (UCCS). That term is slightly ambiguous: a
single base character is also a CCS, so strictly speaking any unencoded
base character is also a UCCS. As you say, these have no particular
correspondence to glyphs, which can be either finer grained or coarser
grained than characters/code points.
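
One rough way to test whether a given CCS is "unencoded" in this sense is
to normalize it to NFC and see whether it collapses to a single code
point. The sketch below assumes the input is already a single CCS; the
function name is hypothetical. It is only an approximation: a few
precomposed code points are excluded from NFC composition, so their
sequences would be reported as unencoded even though a composite exists.

```python
import unicodedata


def is_unencoded_ccs(ccs: str) -> bool:
    """Approximate test: the CCS has no single-code-point NFC form.

    Assumes `ccs` is one combining character sequence (base + marks).
    """
    return len(unicodedata.normalize("NFC", ccs)) > 1


# e + COMBINING ACUTE ACCENT composes to U+00E9, so it is encoded.
encoded = "e\u0301"

# e + COMBINING OGONEK + COMBINING ACUTE ACCENT: NFC composes only the
# ogonek (giving U+0119) and leaves the acute as a separate code point.
unencoded = "e\u0328\u0301"
```

Here is_unencoded_ccs(encoded) is False and is_unencoded_ccs(unencoded)
is True, which matches the LATIN SMALL LETTER E WITH OGONEK AND ACUTE
example in the original question.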

Mark

Juliusz Chroboczek wrote:

> I still have a problem with Unicode terminology.
>
> I think I understand the concept of glyph. It is my understanding
> that Unicode defines the set of characters as being in one-to-one
> correspondence with codepoints; thus, we have non-combining characters
> and combining characters. There also is an equivalence on strings of
> characters (or, equivalently, finite sequences of codepoints), whence
> the canonical representatives (``normalisation forms''). (I'm
> glossing over the fact that there are actually several notions of
> equivalence.)
>
> Now, it seems to me that underlying all of this there is a notion of
> ``non-necessarily encoded non-combining character'' (NNENCCS) that
> corresponds to a sequence of zero, one or several combining characters
> followed by a single non-combining character (taken up to equivalence,
> of course). Think of the set of non-Unicode characters as the set of
> all precomposed forms that might conceivably be encoded in Unicode
> (although, of course, they won't, for very good reasons). Examples of
> NNENCCS are things such as LATIN SMALL LETTER E WITH OGONEK AND ACUTE
> or ARABIC LETTER ALIF WITH DOT ABOVE.
>
> Does this notion make sense? Note here that I'm not assuming that the
> NNENCCSes are in one-to-one correspondence with glyphs, and I think the
> notion is pretty natural for, say, Arabic too, as it makes sense to
> speak of the ARABIC LETTER HEH WITH ACUTE without specifying the form
> of the HEH.
>
> What's the official name of a NNENCCS?
>
> Thanks,
>
> J.


