RE: [Proposal] Extended UTF-16 by using Plane 14

From: Christian Wittern (chris@ccbs.ntu.edu.tw)
Date: Tue Apr 13 1999 - 23:40:33 EDT


Gary Roberts wrote:
>
> I thought it would be useful to extol some of the virtues of the scheme
> John Jenkins suggests, and expand on the idea a little. If one grabs some
> private use characters from the BMP (how many depends upon how many
> variants of the same character exist in your proposed script), you can
> represent a glyph variant as two UTF-16 characters: one is the abstract
> character that already exists in Unicode (or a private use character in
> UTF-16 when the standard is missing a character it should include),
> followed by the 'variant tag' private use character that designates the
> exact glyph you desire.
>
> The scheme should allow most characters to be represented with four bytes
> (shorter than your proposed utf-16 modification), and all other characters
> in six bytes. Another advantage is that a naive utf-16 display might
> display sufficiently well for a document to still be legible (as the
> utf-16 reader won't know that variant tags should be displayed zero width,
> it could get ugly, but a user might still figure out what is going
> on).
>
> In short, I think this scheme would be much better for the problem stated
> than using ever more user defined characters that are essentially
> uninterpretable by other systems.

A similar approach is already being implemented by a group in Korea. The
digitization of the Korean Tripitaka has now switched to an encoding in which
the UTF-16 representation of a character is followed by another 16-bit
sequence that indicates the variant number. I don't think this is compatible
with what other people are doing, but at least it can easily be converted to
a standard text format.
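
To make the byte counts Gary mentions concrete, here is a rough Python sketch
of the "character plus variant tag" idea. The tag range (U+E000 upward) and
the function names are my own assumptions for illustration; they are not taken
from the Korean project or from any proposal. The sketch also assumes private
use code points are used only as tags, never as base characters.

```python
# Hypothetical layout: variant number n is stored as the private use
# code point VARIANT_BASE + n. This range is an assumption.
VARIANT_BASE = 0xE000
VARIANT_MAX = 0xF8FF  # last code point of the BMP private use area


def encode_variant(base: str, variant: int) -> bytes:
    """Encode one character followed by its variant tag as UTF-16 (big-endian, no BOM)."""
    if len(base) != 1:
        raise ValueError("base must be a single character")
    if not 0 <= variant <= VARIANT_MAX - VARIANT_BASE:
        raise ValueError("variant number out of range")
    return (base + chr(VARIANT_BASE + variant)).encode("utf-16-be")


def strip_variants(data: bytes) -> str:
    """Drop every variant tag, leaving ordinary text."""
    text = data.decode("utf-16-be")
    return "".join(c for c in text if not VARIANT_BASE <= ord(c) <= VARIANT_MAX)
```

A BMP character plus its tag comes to four bytes, and a character that needs a
surrogate pair comes to six. Stripping the tags gives back plain text, which
is the "easily converted" property.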

>
> *
>
> On Tue, 13 Apr 1999, John Jenkins wrote:
>
> > Christian Wittern <chris@ccbs.ntu.edu.tw> writes:
> >
> > > The characters in question have deliberately not undergone the
> > > unification in question, since the preservation of the exact glyph
> > > shape is deemed of interest. This again is a reason to use the private
> > > character area, and again, it is a reason the number of characters
> > > needed might possibly exceed 131000.
> > >
> >
> > Then they don't want a character set, they want a glyph registry. If one
> > really *does* want to create access to a vast number of glyphs through
> > Unicode, then it would be best to take a CCCII-type approach and classify
> > the glyphs into families representing the abstract character of which they
> > are all variant appearances. You can then use a zero-width "variant tag" to
> > distinguish them. This is an approach Apple is investigating as a means of
> > helping solve the variant problem among ideographs.

Well, one of the problems CCCII ran into was the fact that some characters
can be variants of more than one character. Also, the orthograph -> variant
relation is not so easily solved on a code level, since it depends on a lot
of context.
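
A small Python sketch of the first problem, with made-up placeholder ids (none
of this is real CCCII data): once a glyph is listed under two abstract
characters, it can no longer be filed in a single family without looking at
context.

```python
from collections import defaultdict

# Hypothetical registry entries: (glyph id, abstract character it is a
# variant of). The ids are placeholders, not real data.
REGISTRY = [
    ("glyph-001", "char-A"),
    ("glyph-002", "char-A"),
    ("glyph-002", "char-B"),  # same shape, variant of two characters
    ("glyph-003", "char-B"),
]


def families_by_glyph(entries):
    """Map each glyph id to the set of abstract characters it can stand for."""
    table = defaultdict(set)
    for glyph, char in entries:
        table[glyph].add(char)
    return table


def ambiguous_glyphs(entries):
    """Return glyph ids that belong to more than one family."""
    table = families_by_glyph(entries)
    return sorted(g for g, chars in table.items() if len(chars) > 1)
```

Here `ambiguous_glyphs(REGISTRY)` picks out `glyph-002`, the one shape that a
strict one-family-per-glyph layout cannot place.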

Christian Wittern


