Re: U+xxxx, U-xxxxxx, and the basics

From: Mark Davis (markdavis@ispchannel.com)
Date: Mon Mar 06 2000 - 12:40:15 EST


The diagram on p. 19 is relevant here, although it should also have shown a LATIN CAPITAL LETTER Q WITH CIRCUMFLEX to be even clearer.

LATIN CAPITAL LETTER Q WITH CIRCUMFLEX can be considered an abstract character. It is representable in Unicode. It is not directly encoded in Unicode, although one could speak of it as being indirectly encoded.

We did not use the term 'encoded' in such cases because some people attach a special meaning to it, namely being representable by a single code point. However, the important issue is whether the abstract character as such is *representable* in Unicode, which it is.
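
A minimal sketch of the difference between "encoded as a single code point" and "representable", assuming Python and its standard unicodedata module (the variable names are only illustrative). NFC normalization composes a base letter and a combining mark into one code point only when a precomposed code point exists, so it serves here as a rough test:

```python
import unicodedata

# Q followed by U+0302 COMBINING CIRCUMFLEX ACCENT
q_circ = "Q\u0302"
# A followed by the same combining mark, for comparison
a_circ = "A\u0302"

# NFC composes a sequence only if a precomposed code point exists.
q_nfc = unicodedata.normalize("NFC", q_circ)
a_nfc = unicodedata.normalize("NFC", a_circ)

print([f"U+{ord(c):04X}" for c in q_nfc])  # ['U+0051', 'U+0302']
print([f"U+{ord(c):04X}" for c in a_nfc])  # ['U+00C2']
```

Q with circumflex stays a two-code-point sequence: representable, but with no single code point of its own. A with circumflex collapses to U+00C2.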

However, I would caution that the whole notion of an "abstract character" is extremely fuzzy, and best avoided. As I wrote in http://www-4.ibm.com/software/developer/library/utfencodingforms:

"Avoiding ambiguity
We have seen that characters, glyphs, code points, and code units are all different. Unfortunately the term 'character' is vastly overloaded. At various times people can use it to mean any of these things:

        An image on paper (glyph)
        What an end-user thinks of as a character (grapheme)
        What a character encoding standard encodes (code point)
        A memory storage unit in a character encoding (code unit)

Because of this, ironically, it is best to avoid the use of the term 'character' entirely when discussing character encodings, and stick to the term 'code point'."
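
The code point versus code unit distinction can be sketched as follows, assuming Python, where a str is indexed by code point. The sample string is an arbitrary choice for illustration, and U+10400 is included only because it lies outside the BMP:

```python
# a, å (U+00E5), and U+10400 DESERET CAPITAL LETTER LONG I
text = "a\u00e5\U00010400"

code_points = len(text)                            # number of code points
utf8_units = len(text.encode("utf-8"))             # 8-bit code units
utf16_units = len(text.encode("utf-16-be")) // 2   # 16-bit code units
utf32_units = len(text.encode("utf-32-be")) // 4   # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)  # 3 7 4 3
```

The same three code points occupy 7, 4, or 3 code units depending on the encoding form, which is why counting "characters" is ambiguous unless the unit is named.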

By the way, the term "text element" is often overused as well. It is just any sequence of one or more characters that is treated as a unit by *some* process. So "a" is a text element, as is "å" (whether represented as a composite or as a combining sequence), but so is "My aunt Mary".
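
As a rough illustration of the two representations of "å", again assuming Python's unicodedata module and illustrative variable names: the two forms are different code point sequences, but a process that normalizes its input treats them as the same text element.

```python
import unicodedata

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
combining = "a\u030a"    # a + COMBINING RING ABOVE

# Compared code point by code point, the two strings differ.
same_code_points = precomposed == combining

# After normalization to a common form, they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", combining))
same_after_nfd = (unicodedata.normalize("NFD", precomposed)
                  == unicodedata.normalize("NFD", combining))

print(same_code_points, same_after_nfc, same_after_nfd)  # False True True
```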

John Cowan wrote:

> Peter Constable wrote:
>
> > LATIN CAPITAL LETTER Q WITH CIRCUMFLEX is a text element, but
> > not a Unicode abstract character.
>
> Right enough. It is not a *Unicode* abstract character, because it is
> not encoded in Unicode (though it is representable in Unicode). But it is
> an abstract character nonetheless: nothing says that every abstract
> character must be encoded in Unicode.
>
> Specifically, see clause 3.3, definition D3, bullet point 5.
>
> --
>
> Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
> Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
> Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan
> Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)


