Re: Unicode Myths

From: Mark Davis (mark@macchiato.com)
Date: Tue Apr 09 2002 - 18:43:24 EDT


Thanks for your feedback; you raise some good issues (although I
disagree with some of your comments). My comments are below.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "David Hopwood" <david.hopwood@zetnet.co.uk>
To: <unicode@unicode.org>
Sent: Tuesday, April 09, 2002 13:05
Subject: Re: Unicode Myths

>
> Mark Davis wrote:
> > Thanks to the many people who suggested Myths. I have posted a new
> > version on
> >
> > http://www.macchiato.com/slides/UnicodeMyths.ppt
> >
> > with new ones included after slide 8.
>
> Here are my comments.
>
> Slide 5

Part of the problem with the slides below is that you are missing my
commentary.

> - Unassigned characters are characters (clause C6 in Chapter 3 of
>   the Standard notwithstanding).

That is a bit like saying that chocolate rabbits are rabbits...

> Search the standard for "unassigned character"; it occurs several
> times. Also, several clauses and definitions are incorrect or
> incomplete if unassigned code points do not correspond to characters:
> at least C9, C10, C13, D1, D6, D7, D9 (which should not restrict to
> "graphic characters"), D11, D13, D14, D17..D24, D28, and the note
> after D29.

We have been gradually tightening up the language of the standard in a
number of places. That will continue in 4.0.

> - Format control characters are also characters.
They are formally characters. In lay terms, of course, they are not.

> - Private-use characters are definitely characters.
These are also an edge case. A layman would ask, what does U+E000 look
like, if it is a character?

> - The values D800..DFFF are not valid code point values, they are
>   UTF-16 code unit values (the valid Unicode code point space is
>   0..D7FF union E000..10FFFF).

Code points are from 0 to 10FFFF, inclusive. D800..DFFF are surrogate
code points: they are code points, though they are not Unicode scalar
values and are never assigned to abstract characters.
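
To make that distinction concrete, here is a minimal sketch (mine, in
Python; not from the slides) separating code points from Unicode scalar
values, which are the values the encoding forms can actually represent:

    def is_code_point(v: int) -> bool:
        # Code points: every integer in the range 0..10FFFF, inclusive.
        return 0 <= v <= 0x10FFFF

    def is_scalar_value(v: int) -> bool:
        # Unicode scalar values: code points minus the surrogate range
        # D800..DFFF, which UTF-8/16/32 cannot represent as characters.
        return is_code_point(v) and not (0xD800 <= v <= 0xDFFF)

    assert is_code_point(0xD800) and not is_scalar_value(0xD800)
    assert is_scalar_value(0xE000) and is_scalar_value(0x10FFFF)
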

>
> In computer jargon, "characters" are, by definition, the things that
> are enumerated in coded-character-sets (regardless of whether or not
> they are displayed as simple spacing glyphs, have control functions,
> are not yet assigned, or have any other strange properties). Apart
> from the unfortunate "noncharacter" terminology (which would have
> been better called "internal-use characters"), all valid Unicode
> code points *do* correspond to characters in this sense.

There are *many* different uses of the term 'character', including
'byte', 'grapheme cluster', 'code point', 'code unit', etc. That is
why it is best avoided in formal contexts, and why in informal use one
must be very clear about which sense is intended.

>
> Note that there is no conflict between this jargon meaning of
> "character", and its original meaning as a unit of text. The
> following definition may be helpful:
>
> An "orthocoding" is a set of rules for representing texts in some
> language or symbology as sequences of characters. I.e. an
> orthocoding relates texts to sequences of characters, in the same
> way that an orthography relates texts to arrangements of glyphs.

I assume this is your own home-brew definition. A search on Google
only gives a match on:
http://www.electronic-medical-records.com/medical%20records%20coding%20discussion%20lists.htm ;-)

I don't think it clarifies much to use this definition. Nor do I agree
with the characterization of text in terms of glyphs.
Which of the glyphs for the character A is permissible in English?
(See http://www.macchiato.com/utc/glyph_variation.html)

>
> When someone asks the FAQ about how to encode Indic half-forms in
> Unicode, for instance, they have in mind an orthocoding in which the
> half-forms are characters, and are asking how that orthocoding
> relates to the Unicode orthocodings for Indic scripts (even if they
> don't think of the question in that way). Input methods/keyboard
> layouts also effectively define an orthocoding in which each
> keystroke corresponds to a "character".
>
> The fact that orthocodings not designed for computers normally don't
> use control characters (apart possibly for "new line", "new
> paragraph", etc.) does not mean that the controls in Unicode, ASCII,
> etc. are not characters. The practical effect of defining controls
> not to be characters would just be to require awkward constructions
> like "sequence of characters or controls" all the time, instead of
> just "sequence of characters".

Formally, it is a sequence of code points. If none of the code points
are unassigned (the typical case), then one can speak of a sequence of
characters. And informally, one would just say characters.
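
As a purely illustrative check (assuming Python's standard unicodedata
module, whose data tracks whichever Unicode version the interpreter
ships with), unassigned code points show up with general category Cn,
so a string can be scanned for them before one speaks loosely of a
"sequence of characters":

    import unicodedata

    def unassigned_code_points(text):
        # General category "Cn" marks code points that are unassigned
        # (noncharacters also fall into Cn) in the character database.
        return [hex(ord(ch)) for ch in text
                if unicodedata.category(ch) == "Cn"]

    # U+0378 is unassigned in the Unicode versions shipped with current
    # Python releases; the Greek text itself is fully assigned.
    print(unassigned_code_points("Θαλῆς\u0378"))   # ['0x378']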

>
> While we're on this subject, it's also redundant to say "abstract
> character": *all* characters are abstractions, and the definition of
> this term (D3 in Chapter 3 of the Unicode Standard) doesn't mean
> anything different to plain "character", as defined above.

No, there is a very significant difference. While the term 'character'
is, of course, an abstraction (as with about 90% of the defined terms
in the Unicode Standard), the term 'abstract character' is used to
single out one particular sense of 'character', so that it is not
confused with the others.

>
> Slide 8
> - the first bullet should say "grapheme cluster != code point".

Good, thanks.

>
> Slide 13
> - the International currency symbol is a poor analogy to a
>   hypothetical "decimal point" character: that symbol has a defined
>   appearance different to any specific currency symbol, and currency
>   symbols shouldn't be automatically changed according to locale
>   anyway, unlike decimal points (e.g. €1.50 never means the same as
>   $1.50, but "1,5" can mean the same as "1.5").

The decimal point character, as originally proposed, would appear like
'.' in English and ',' in French. That is what is being referred to
here.
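
As an aside, the locale-sensitive rendering that such a character was
meant to supply is normally handled at the formatting layer instead; a
rough sketch, assuming Python's standard locale module and that locales
like en_US.UTF-8 and fr_FR.UTF-8 happen to be installed on the system:

    import locale

    value = 1.5
    for loc in ("en_US.UTF-8", "fr_FR.UTF-8"):
        try:
            locale.setlocale(locale.LC_NUMERIC, loc)
        except locale.Error:
            continue            # that locale is not installed here
        # The same number renders as "1.50" or "1,50" depending on the
        # locale; no special "decimal point" character is stored.
        print(loc, locale.format_string("%.2f", value))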

> - a better argument against encoding a "decimal point" character is
>   that it isn't distinguished on keyboards (so ',' and '.' would
>   still have to be interpreted according to context anyway).
>
> Slide 17
> - the definition of "compatibility composite" is incorrect.

Good catch, I need to update that.

> - support for round-trip transcoding does *not* require encoding of
>   compatibility characters, in most cases (because only strings need
>   to be round-tripped, not individual characters). Most
>   compatibility characters are unnecessary and should not have been
>   encoded.

No, it does. If we hadn't encoded half-width katakana, then when you
took SJIS to Unicode and back, you would not get the same text you
started with. If you have markup available, then you could preserve
the distinction without the compatibility characters.
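
To illustrate the round-trip point, a small example (mine, assuming
Python's standard shift_jis codec and unicodedata module): the
single-byte half-width katakana of Shift-JIS map to the
U+FF6x..U+FF9x compatibility characters, and folding those away
changes the bytes you get back:

    import unicodedata

    sjis = b"\xb1\xb2"                  # half-width "ｱｲ" in Shift-JIS
    text = sjis.decode("shift_jis")     # U+FF71 U+FF72, compatibility chars
    assert text.encode("shift_jis") == sjis      # lossless round trip

    folded = unicodedata.normalize("NFKC", text)     # full-width "アイ"
    print(folded.encode("shift_jis"))   # b'\x83A\x83C' -- width distinction lost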

> - what security risks? Any risks due to encoding of "look-alike"
>   characters have nothing to do with *compatibility* characters
>   per se; they occur also for non-compatibility look-alike
>   characters (e.g. Latin U+0061 'a' and Cyrillic U+0430 'а', which
>   obviously cannot be sensibly unified).

The risk due to confusables increases the more of them there are.
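
For illustration (a sketch of mine, assuming Python's standard
unicodedata module): the two letters are distinct code points that most
fonts render identically, which is exactly what makes them confusable,
and normalization does not unify them:

    import unicodedata

    latin_a, cyrillic_a = "\u0061", "\u0430"
    print(latin_a == cyrillic_a)               # False: distinct code points
    print(unicodedata.name(latin_a))           # LATIN SMALL LETTER A
    print(unicodedata.name(cyrillic_a))        # CYRILLIC SMALL LETTER A
    # Detecting confusables needs dedicated mapping data; the standard
    # normalization forms deliberately leave these two distinct.
    print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)   # False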

>
> In any case, IMHO compatibility decomposables are *almost* all bad.
> (The decomposable CJK radicals, for example, are not bad, but should
> not have compatibility decompositions in the first place.) Whether
> or not you agree with this, it is a supportable opinion, not a myth.
> I'm sure that I can defend this point of view if anyone wants to
> discuss it in more detail.

My point in the slide is that it *is* debatable: there are pluses and
minuses.

>
> Slide 24
> - a pure 16-bit design was *not* possible, even without composed and
>   compatibility characters. This should have been recognised from
>   the start. RFC 373 (published in 1972!) correctly estimated that
>   17 bits were required for a universal coded-character-set. (It
>   might not sound as though there is very much difference between
>   16 and 17 bits, but there is.)

I disagree. Of course, the design goal changed over time, and clearly
the change to include so many characters that could otherwise have been
excluded makes this entire discussion moot. But do the math: if you
exclude all decomposables, and if Han is restricted to a base set of
25K, with other Han represented with variation selectors and/or an IDS
mechanism, then the characters in modern use fit easily. And our
original design was to use the PUA for archaic characters.

>
> Arguably, the 16-bit design was a serious mistake - a variable length
> ASCII-compatible encoding (say 1 to 3 bytes, which allows a code
> space of ~18 bits with alphabetic scripts in the 2-byte subspace),
> would have fit much better into existing practice, and could have
> been treated as just another "codepage" extending ASCII, rather than
> requiring completely new APIs. (Think about how long it took for
> Windows 9x to properly support two API sets.) At the very least,
> 16 bits was always going to impose undesirable constraints and
> compromises.

UTF-8 was around for most of that time, but was not adopted as widely,
by any means. Allowing people to shift first to a 16-bit fixed-width
encoding, then to supplementaries has allowed people to evolve their
support over time. We can all see the flaws at this point, and see
where the design could have been better (I have my own favorite list),
but trying to predict what 'would have happened' is almost as hard as
predicting 'what will happen'. If we had had a radically different
design, we might not have been as successful -- or successful at all.

>
> One of the advantages of having a code space larger than the number
> of characters that are actually required, is that it provides enough
> room to designate the most important properties (for example, major
> category, case, combining class, bidirectional class, and
> line/word/grapheme break properties) to sufficiently large
> unassigned ranges. Unicode doesn't do this (except for default
> bidirectional class), but it could have done if the *original*
> design had had a large enough code space (>= 18 bits), and that
> would have had many advantages.

I think that (in hindsight) we could have done some of this, but there
are a lot of problems; look at the number of times we have changed the
category of characters based on new information. Part of this is
hindsight; we really didn't have as 'full-fledged' a set of properties
then, nor know as much about what properties we needed.

>
> Slide 29
> - there are 1,112,064 valid Unicode code points, not 1,114,112.
> (D800..DFFF are not valid code points.)

See above.

Note: If we could do it all over again, I would have had no surrogate
code points and no shortest form UTF-8 issue. One would just start
from zero, and keep the code points within 24 bits, using something
like the following (I haven't checked the figures):

UTF-16:
00000..0F83F - 16 bits
0F840..FF840 - 32 bits

UTF-8:
00000..0007F - 8 bits
00080..0087F - 16 bits
00880..1087F - 24 bits
10880..FF840 - 32 bits

The code is just as simple for conversion as what we have now, and the
oddities of shortest forms and surrogate code points wouldn't exist.
But this is all hindsight.
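
To make the offset idea concrete, here is a small sketch (mine, not the
actual proposal; the range boundaries simply follow the rough,
unchecked figures above for the UTF-8-style form): each longer byte
form starts counting where the shorter forms stop, so every value has
exactly one encoding and non-shortest forms cannot arise.

    # Illustrative only: a UTF-8-like "offset" encoding in which the
    # 2-, 3- and 4-byte forms encode (value - start_of_range) rather
    # than the raw value, so the length classes are disjoint and there
    # is no shortest-form ambiguity.
    RANGES = [
        (0x00000, 0x0007F, 1),    # 7 payload bits
        (0x00080, 0x0087F, 2),    # 11 payload bits, offset 0x80
        (0x00880, 0x1087F, 3),    # 16 payload bits, offset 0x880
        (0x10880, 0x21087F, 4),   # 21 payload bits, offset 0x10880
    ]

    def encode(cp):
        for start, end, nbytes in RANGES:
            if start <= cp <= end:
                payload = cp - start            # remove the offset
                if nbytes == 1:
                    return bytes([payload])
                trail = []
                for _ in range(nbytes - 1):     # 6 payload bits per trail byte
                    trail.append(0x80 | (payload & 0x3F))
                    payload >>= 6
                lead_prefix = (0xFF << (8 - nbytes)) & 0xFF   # 110..., 1110..., 11110...
                return bytes([lead_prefix | payload]) + bytes(reversed(trail))
        raise ValueError("value out of range")

    def decode(b):
        # Decodes one encoded value (b holds exactly one encoded code point).
        if b[0] < 0x80:
            return b[0]
        nbytes = len(b)
        payload = b[0] & ((1 << (7 - nbytes)) - 1)
        for t in b[1:]:
            payload = (payload << 6) | (t & 0x3F)
        return payload + RANGES[nbytes - 1][0]  # add the offset back

    for cp in (0x7F, 0x80, 0x87F, 0x880, 0x1087F, 0x10880):
        assert decode(encode(cp)) == cp
        print(f"{cp:06X} -> {encode(cp).hex(' ')}")

Decoding just gathers the payload bits and adds the offset back, so the
conversion code is no harder than ordinary UTF-8.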

>
> Slide 33
> - the first bullet point is correct - human writing systems are
>   complex. However, the second bullet point is extremely dubious:
>   there was no need for multiple byte orders (it would have been
>   perfectly reasonable to specify a fixed byte order for external
>   representation), multiple encoding and normalisation forms, or for
>   the vast majority of compatibility characters. There is a
>   considerable amount of *unnecessary* complexity in Unicode, that
>   is not imposed by the problem of defining a universal
>   coded-character-set. Some of that complexity would have been
>   difficult to avoid without the benefit of hindsight, but some
>   could and should have been avoided.

This is in hindsight. Again, who knows whether Microsoft would have
adopted Unicode had we enforced a big-endian byte order?

> - "Yen vs backslash" doesn't belong in the list because it is not a
> complexity of Unicode. U+005C is unambiguously backslash; the
fact that
> 0x5C can mean either yen or backslash in, e.g. the IANA
"Shift_JIS"
> charset, is a problem with the definition of that charset (which
could
> only be fixed by changing the IANA charsets registry), and has
nothing
> to do with Unicode.

While Yen vs. Backslash is not formally a problem with Unicode, it is
an issue that programmers using Unicode need to deal with.

>
> --
> David Hopwood <david.hopwood@zetnet.co.uk>
>
> Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
> RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
> Nothing in this message is intended to be legally binding. If I
> revoke a public key but refuse to specify why, it is because the
> private key has been seized under the Regulation of Investigatory
> Powers Act; see www.fipr.org/rip


