Ken has given an excellent overview of the issues, as usual.
There is an alternative that could make it unnecessary to enforce
Ken's
>Principle 1: Post-Unicode 3.0, do not encode any new combining
marks that are intended to represent decomposable pieces of
already existing encoded characters. And if such a combining
mark does get encoded, despite everyone's best intentions,
*NEVER* *EVER* use it in a canonical decomposition in
UnicodeData.txt, even if it confuses people not to do so.
and
>Principle 2: Post-Unicode 3.0, do not encode any new
precomposed characters that are already representable by
sequences of base character plus one or more combining marks.
To do so would be superfluous; processing depending on
normalization will decompose
it anyway into the combining character sequence that was
already valid, so encoding it as a precomposed character does
nothing but add another equivalence to the already overburdened
tables, without accomplishing what the encoding proposer
presumably intended.
(Actually, I'd want to see Principle 2 kept regardless, but
Principle 1 does seem like another unfortunate Unicode
compromise to make things actually work.)
But this alternative probably won't seem too attractive to
many:
Preamble:
We already know that unadorned data is meaningless - we can
only assign an interpretation to a byte sequence if we begin
making some assumptions about it, e.g. that it's text, that it
has a certain encoding, etc. We can apply certain heuristics so
that the assumptions are at least educated guesses, but we
often need to resort to enriching the data in some way to
provide clues to the interpretation. E.g., we suffix the
filename with ".txt" so that we know the data is text and not a
spreadsheet. But we still don't know what encoding/character
set to use to interpret the byte sequence of the "plain text",
and so we prefix the byte sequence with 0xFF 0xFE (a byte order
mark).
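For illustration, here's a minimal sketch in Python of that
BOM-sniffing step; the function name and the None fallback are my
own invention, not any standard API:

    import codecs

    def guess_encoding(data):
        """Guess a Unicode encoding from a leading byte order mark."""
        if data.startswith(codecs.BOM_UTF8):       # 0xEF 0xBB 0xBF
            return "utf-8"
        if data.startswith(codecs.BOM_UTF16_LE):   # 0xFF 0xFE
            return "utf-16-le"
        if data.startswith(codecs.BOM_UTF16_BE):   # 0xFE 0xFF
            return "utf-16-be"
        return None  # no BOM: fall back to heuristics or other metadata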
Since we're already prepared to take these minimal steps to
help us identify our data as plain text and to begin making
sense of it, let's recognise that even this isn't necessarily
meaningful: I may be able to determine (given the assumptions I
made) that the text contains the character sequence "chat", but
I still don't know any of the things that are of real interest
to me, such as the intended pronunciation and/or meaning of
that string. *I don't know what language this is!* Again, I can
apply heuristics to make an educated guess if my string is even
a little longer, and in a lot of cases the intended reader is
able to identify the language and make sense of it. But there
may be a definite benefit in tagging runs of text to identify
the language. (I note that some commercial apps, such as MS
Word, are already doing this - it facilitates use of proofing
tools for multiple languages.) And maybe there could be
benefits to enriching our byte sequence with other additional
information.
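To make the "chat" ambiguity concrete, a trivial sketch; the tuple
layout and the tag values here are hypothetical, not any particular
standard:

    # Hypothetical language-tagged runs: the same character
    # sequence, two very different interpretations.
    tagged_runs = [
        ("chat", "fr"),   # French: a cat
        ("chat", "en"),   # English: an informal conversation
    ]
    for text, lang in tagged_runs:
        print(f"{text!r} tagged as {lang}")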
Proposed alternative:
All Unicode text data is marked to indicate its normalisation
state; if the data is normalised, this information would include
the Unicode version that was assumed when normalisation was
performed.
With this additional help, Normix and Ny-Normix can make much
more informed decisions about what to do with text.
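A minimal sketch of what such marking might look like; the
structure and field names are purely illustrative assumptions on my
part:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TaggedText:
        # Text carrying its normalisation status alongside the characters.
        text: str
        form: Optional[str] = None             # e.g. "NFD"/"NFC"; None = not normalised
        unicode_version: Optional[str] = None  # version assumed when normalised

    # A string normalised (Form D) by Normix under Unicode 3.0 tables:
    u3_datum = TaggedText("cha\u0302teau", form="NFD", unicode_version="3.0")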
>Now consider some possible Unicode 4.0 data:
>LATIN SMALL LETTER A, COMBINING GRACKLE, COMBINING ACUTE
ACCENT
>Ny-Normix will normalize this (Form D) to:
>LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, COMBINING
GRACKLE
>Normix, of course, could not do this reordering...
No change here.
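COMBINING GRACKLE is of course hypothetical, but the same
reordering can be shown today with two real marks whose combining
classes differ:

    import unicodedata

    # COMBINING DOT BELOW (ccc=220) must sort before COMBINING ACUTE
    # ACCENT (ccc=230) in canonical order, so Form D reorders them.
    s = "a\u0301\u0323"   # a + ACUTE ACCENT + DOT BELOW, in the "wrong" order
    nfd = unicodedata.normalize("NFD", s)
    print([hex(ord(c)) for c in nfd])   # ['0x61', '0x323', '0x301']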
>...The critical issue is that Ny-Normix not report that
Unicode 3.0 data normalized by Normix is *not* in fact
normalized -- or what amounts to the same thing, that a Unicode
3.0 string normalized by Normix will compare not equal if
normalized again by Ny-Normix...
In this case, Ny-Normix would report that the data is U3
normalized, but not U4 normalized. A process that compares a
Unicode 4.0 string (that must be what Ken meant) normalized by
Normix with the same string normalized by Ny-Normix will know
(because the strings are so marked) that it's comparing a
U3-normalized string with a U4-normalized string, and if it's
*really* smart, it will be able to examine any differences and
determine whether they are significant or not.
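Continuing the TaggedText sketch from above, a rough rendition of
that comparison; reducing the "really smart" difference analysis to
re-normalising both sides under the current tables is just one
strategy I'm assuming, not the only one:

    import unicodedata

    def compare_tagged(a, b):
        # Same form and same version: the tags say a direct
        # comparison is safe.
        if a.form == b.form and a.unicode_version == b.unicode_version:
            return a.text == b.text
        # Mixed status (e.g. U3-normalized vs U4-normalized): bring
        # both sides to a common form under this runtime's own tables.
        form = a.form or b.form or "NFD"
        return (unicodedata.normalize(form, a.text)
                == unicodedata.normalize(form, b.text))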
>Second example:
>CYRILLIC SMALL LETTER EN WITH DESCENDER
>This is a Unicode 3.0 letter. Normix will normalize it as
itself...
>The problem is what will Ny-Normix, working with Unicode 4.0
data tables, do. If CYRILLIC SMALL LETTER EN WITH DESCENDER has
a canonical decomposition in the new tables, using the newly
encoded COMBINING CYRILLIC DESCENDER, then Ny-Normix *will*
normalize this letter differently than Normix did (for both
Form D and Form C). That is bad, since it means that upgrading
from Normix to Ny-Normix would invalidate already normalized
data (that could be stored anywhere by that time).
In this case, upgrading from Normix to Ny-Normix doesn't
invalidate the data; it just means that not all the data has
identical status. The data is marked, however, to tell us its
normalization status, and this additional info provides what's
needed to handle the data. There would be various possible
alternatives for implementation: leave existing data as it is
and add more knowledge to algorithms that compare strings;
leave existing data as is but examine a datum when used and
re-normalize it if its normalization status reflects an old
version; ...
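A sketch of that second alternative, again assuming the TaggedText
structure from above; unicodedata.unidata_version reports which
Unicode version the runtime's own tables implement:

    import unicodedata

    CURRENT_VERSION = unicodedata.unidata_version   # e.g. "3.2.0"

    def refresh(datum):
        # Leave stored data untouched; when a datum is actually used,
        # re-normalize it if its tag reflects an older Unicode version.
        if datum.form and datum.unicode_version != CURRENT_VERSION:
            datum.text = unicodedata.normalize(datum.form, datum.text)
            datum.unicode_version = CURRENT_VERSION
        return datum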
(I won't re-examine Ken's third example since I wouldn't want
to encourage addition of precomposed characters, and by now you
know what comments would be made anyway.)
Of course, strictly enforcing this proposal across the board
means we'd be giving up something: plain text. And that's
probably too strict a requirement - besides being impossible to
guarantee. Note that these issues wouldn't arise with true plain
text: processes that compare strings could never assume a string
is normalized, since the only available info is just the plain
text itself - you'd have to normalize in order to tell whether it
was already normalized - so any comparison would have to
normalize both strings first, and the same normalization would be
applied to both. But the proposal requires introducing new
infrastructure, that infrastructure moves further from the
universal ideal of plain text, and problems could arise for any
existing implementations that don't already track this
normalization status info.
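That normalize-both-sides comparison is essentially one line; Form
D here is an arbitrary choice:

    import unicodedata

    def plain_text_equal(a, b):
        # With untagged plain text there is nothing to trust: normalize
        # both sides the same way, then compare the results.
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    print(plain_text_equal("caf\u00e9", "cafe\u0301"))   # True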
And for reasons such as these, I'll guess that most of you who
read this far will already have rejected the idea.
Don't mind me - I'm just exercising my brain.
Peter