RE: Level of Unicode support required for various languages

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 31 2007 - 09:17:53 CST


    David Starner wrote:
    > On 10/30/07, vunzndi@vfemail.net <vunzndi@vfemail.net> wrote:
    > > Quoting "Mark E. Shoulson" <mark@kli.org>:
    > > > vunzndi@vfemail.net wrote:
    > > >> The minimum is likely to be about 30 thousand, to be honest nobody
    > > >> knows what the upper limit is, but 100k would not be inconceivable.
    > > > I knew that dropping planes 17-65535 was a bad idea!

    A bad idea for what? If this means dropping planes that would have been
    needed for private use, there is ample space in the PUA blocks to create
    private surrogates and to extend that limit at will, supporting billions of
    private-use characters.

    > > Yes, this has always struck me as a strange decision, but it would
    > > seem to be one that can be reversed without any stability issues.
    > Except for breaking all the code out there that uses UTF-16, SCSU, or
    > otherwise depends on the limit.

    Yes: allocating private surrogates within the PUA blocks will not affect
    UTF-16 or SCSU requirements. The code points will be treated as if they were
    isolated PUA characters, even when they are actually used in sequences as
    private surrogates; these UTFs will not detect the sequence boundaries, but
    that does not matter here, since the interpretation of PUA characters is
    left to private agreement.

    The only important requirement for applications is to ensure that sequences
    of PUA characters are not reordered or truncated without knowledge of their
    semantics. At best, one could suggest that ANY sequence of PUA characters be
    treated as a transparent, unbreakable binary object, until one of the PUA
    characters is recognized under a locally agreed PUA convention.

    So a sequence like <E400 E800 EC00> can still be privately interpreted as
    the encoded form of a single private entity (each code point here serves as
    a private surrogate: a leading high "private surrogate", a middle one, and a
    low one, each carrying 10 bits of information), even though Unicode-compliant
    applications see the sequence as three distinct PUA characters. The sequence
    contains none of the standard surrogates in D800..DBFF and DC00..DFFF, so it
    does not break UTF-16 rules.
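    The private-surrogate scheme sketched above could look like the following
    (a minimal sketch; the three base code points E400/E800/EC00 are taken from
    the example in this post, and the scheme itself is a private convention, not
    anything defined by Unicode):

    ```python
    # Hypothetical private-surrogate convention: three disjoint BMP PUA ranges,
    # each carrying 10 bits of a 30-bit private entity identifier.
    HIGH_BASE, MID_BASE, LOW_BASE = 0xE400, 0xE800, 0xEC00

    def encode_private(value: int) -> str:
        """Pack a 30-bit private entity id into three BMP PUA code points."""
        assert 0 <= value < 1 << 30
        return "".join(chr(base + ((value >> shift) & 0x3FF))
                       for base, shift in ((HIGH_BASE, 20), (MID_BASE, 10), (LOW_BASE, 0)))

    def decode_private(s: str) -> int:
        """Recover the 30-bit id from a three-code-point PUA sequence."""
        high, mid, low = (ord(c) for c in s)
        return ((high - HIGH_BASE) << 20) | ((mid - MID_BASE) << 10) | (low - LOW_BASE)

    entity = encode_private(123456789)
    # Every code point stays inside the BMP PUA (E000..F8FF), and none falls in
    # the standard surrogate range D800..DFFF, so UTF-16 is unaffected.
    assert all(0xE000 <= ord(c) <= 0xF8FF for c in entity)
    assert decode_private(entity) == 123456789
    ```
    
    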

    Now, if more than one billion private-use entities is not enough, you could
    use longer sequences like <F0000..F3FFF, F4000..F7FFF, F8000..FBFFF> (using
    PUA code points in plane 15): each code point here encodes 14 bits of
    information, the whole sequence representing 42 bits (more than four
    trillion encodable entities...). In UTF-32, it will be stored as 3 plane-15
    code points (total = 96 bits of storage, or 12 bytes); in UTF-16, as three
    surrogate pairs, i.e. six 16-bit code units (total = 96 bits, or 12 bytes);
    in UTF-8, as 3 sequences of 4 bytes (total = 12 bytes): there is no
    difference in space requirements here between UTF-8, UTF-16 and UTF-32.
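    The equal-storage claim is easy to check (the three sample code points below
    are arbitrary picks from the hypothetical plane-15 ranges):

    ```python
    # Any three plane-15 code points take 12 bytes in every UTF: supplementary
    # code points cost 4 bytes each in UTF-8, UTF-16 (one surrogate pair) and
    # UTF-32 alike.
    seq = "".join(chr(cp) for cp in (0xF1234, 0xF5678, 0xF9ABC))

    print(len(seq.encode("utf-8")))     # 3 x 4 bytes = 12
    print(len(seq.encode("utf-16-be")))  # 3 surrogate pairs x 4 bytes = 12
    print(len(seq.encode("utf-32-be")))  # 3 x 4 bytes = 12
    ```
    
    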

    This sequence will still be compatible with SCSU and other Unicode
    compression schemes without difficulty, as well as with any other encoding
    (GB18030?) that has a round-trip mapping with the existing Unicode PUA code
    points, does not change their relative encoding order, and does not truncate
    long sequences of PUA characters.

    In other words, there is no need to support more than 17 planes for the
    encoding of standard characters, and there is ample space for encoding any
    number of private entities (I won't call them characters, because that would
    conflict with Unicode's definition of a character) with the PUA code points
    that have already been allocated.

    The difficulty, then, is not in private-use characters, but in resisting the
    encoding of many new characters within the set of standard characters
    (within one of the first 15 planes) when this is not justified. A defensive
    way to protect the UCS from being filled with many ideographs would be to
    create a standard compositional encoding for the personal-name ideographs
    used in China and Taiwan.

    Imagine what would happen if each of the billion Chinese citizens wanted his
    own ideograph for his signature, and these signatures were made legal in the
    PRC or Taiwan! A stronger compositional model, based on the initial
    principles of the existing IDS as defined in TUS (but *with extensions* such
    as those used at the IRG), would help make the scheme workable for the very
    long term.
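    For readers unfamiliar with IDS: an Ideographic Description Sequence
    describes a CJK character's layout in terms of its components, using the
    description operators U+2FF0..U+2FFB defined in TUS. A small illustration
    (the decomposition of 好 shown here is a standard textbook example, not part
    of this post):

    ```python
    # An IDS is itself plain Unicode text: an operator followed by components.
    IDC_LEFT_RIGHT = "\u2FF0"  # U+2FF0 ⿰ : left-to-right composition

    # 好 (U+597D) can be described as ⿰女子: 女 (woman) beside 子 (child).
    ids = IDC_LEFT_RIGHT + "\u5973" + "\u5B50"
    print(ids)        # three code points describing one composed character
    print(len(ids))   # 3
    ```
    
    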

    ********

    Another way to handle this would be to define a core subset of the SVG
    graphics format for creating custom glyphs that remain comparable and
    assignable to a URI (such as a URI to a record in a national public registry
    containing the definition of the SVG file), and then define a way to safely
    transport this URI in documents: this requires no encoding in Unicode, and
    it would permit any number of personal names.

    One way to encode this in plain text would be to register characters in
    plane 14 for the encoding of these special glyph references (we already have
    language tags starting with E0001; we could reuse the rest of these special
    tag characters, using another prefix tag to indicate that what follows is a
    glyph reference).
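    A sketch of what such a tagged glyph reference might look like. The tag
    characters U+E0020..U+E007E (which mirror ASCII 0x20..0x7E) and U+E007F
    CANCEL TAG are real; the glyph-reference prefix U+E0002 and the registry URI
    are purely hypothetical assumptions for illustration:

    ```python
    # Hypothetical plane-14 glyph-reference scheme (NOT a standard mechanism).
    # U+E0001 is the real language-tag prefix; U+E0002 is assumed here as a
    # glyph-reference prefix, and U+E007F CANCEL TAG ends the tag run.
    GLYPH_TAG_PREFIX = "\U000E0002"  # assumed, not assigned by Unicode
    CANCEL_TAG = "\U000E007F"

    def tag_encode(uri: str) -> str:
        """Shift a printable-ASCII URI into the plane-14 tag-character range."""
        assert all(0x20 <= ord(c) <= 0x7E for c in uri)
        return GLYPH_TAG_PREFIX + "".join(chr(0xE0000 + ord(c)) for c in uri) + CANCEL_TAG

    def tag_decode(s: str) -> str:
        """Recover the ASCII URI from a tagged run."""
        body = s[1:-1]  # strip prefix tag and CANCEL TAG
        return "".join(chr(ord(c) - 0xE0000) for c in body)

    ref = tag_encode("https://registry.example/glyph/12345")  # example URI
    assert tag_decode(ref) == "https://registry.example/glyph/12345"
    # The whole run is invisible default-ignorable plane-14 text to any
    # application that does not know the convention.
    assert all(0xE0000 <= ord(c) <= 0xE007F for c in ref)
    ```
    
    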

    (A security risk is the possible creation of confusables, but this could be
    avoided by using only URIs hosted by certified national registries, where
    even the stored SVG may be tagged, and possibly redirected for unification
    purposes.)

    The national registry could also store properties in the XML entry pointed
    to by this URI, such as an IDS description string (to help in recognizing
    confusables) and the expected usage (the bearer's lifetime, frequency, dates
    of creation and expected end of usage, linguistic and semantic information,
    possible transcriptions and vocal spelling...).

    Philippe.



    This archive was generated by hypermail 2.1.5 : Wed Oct 31 2007 - 09:22:32 CST