Re: Possible problem going forward with normalization

From: peter_constable@sil.org
Date: Thu Dec 23 1999 - 01:20:08 EST


       Ken has given an excellent overview of the issues, as usual.
       There is an alternative that could avoid the need to enforce
       Ken's

>Principle 1: Post-Unicode 3.0, do not encode any new combining
       marks that are intended to represent decomposable pieces of
       already existing encoded characters. And if such a combining
       mark does get encoded, despite everyone's best intentions,
       *NEVER* *EVER* use it in a canonical decomposition in
       UnicodeData.txt, even if it confuses people not to do so.

       and

>Principle 2: Post-Unicode 3.0, do not encode any new
       precomposed characters that are already representable by
       sequences of base character plus one or more combining marks.
       To do so would be superfluous; processing depending on
       normalization will decompose
       it anyway into the combining character sequence that was
       already valid, so encoding it as a precomposed character does
       nothing but add another equivalence to the already overburdened
       tables, without accomplishing what the encoding proposer
       presumably intended.

       (Actually, I'd want to see Principle 2 kept regardless, but
       Principle 1 does seem like another unfortunate Unicode
       compromise to make things actually work.)

       But this alternative probably won't seem too attractive to
       many:

       Preamble:
       We already know that unadorned data is meaningless - we can
       only assign an interpretation to a byte sequence if we begin
       making some assumptions about it, e.g. that it's text, that it
       has a certain encoding, etc. We can apply certain heuristics so
       that the assumptions are at least educated guesses, but we
       often need to resort to enriching the data in some way to
       provide clues to the interpretation. E.g., we suffix the
       filename with ".txt" so that we know the data is text and not a
       spreadsheet. But we still don't know what encoding/character
       set to use to interpret the byte sequence of the "plain text",
       and so we prefix the byte sequence with xFF xFE.
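
       (As a concrete aside - not in the original argument - here is a
       minimal Python sketch of that kind of prefix check; the function
       name and the "unknown" fallback are just illustrative:)

       import codecs

       def guess_encoding(data: bytes) -> str:
           # Use the byte-order mark / signature, if present, to decide
           # how to interpret the "plain text" byte sequence.
           if data.startswith(codecs.BOM_UTF16_LE):   # xFF xFE
               return "utf-16-le"
           if data.startswith(codecs.BOM_UTF16_BE):   # xFE xFF
               return "utf-16-be"
           if data.startswith(codecs.BOM_UTF8):       # xEF xBB xBF
               return "utf-8"
           return "unknown"   # fall back to heuristics or other metadata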

       Since we're already prepared to take these minimal steps to
       help us identify our data as plain text and to begin making
       sense of it, let's recognise that even this isn't necessarily
       meaningful: I may be able to determine (given the assumptions I
       made) that the text contains the character sequence "chat", but
       I still don't know any of the things that are of real interest
       to me, such as the intended pronunciation and/or meaning of
       that string. *I don't know what language this is!* Again, I can
       apply heuristics to make an educated guess if my string is even
       a little longer, and in many cases the intended reader is
       able to identify the language and make sense of it. But there
       could be a definite benefit in tagging runs of text to identify
       the language. (I note that some commercial apps, such as MS
       Word, already do this - it facilitates the use of proofing
       tools for multiple languages.) And maybe there could be similar
       benefits to enriching our byte sequence with other kinds of
       additional information.

       Proposed alternative:
       All Unicode text data is marked to indicate its normalisation
       state; this information would include the Unicode version
       assumed when normalisation was performed, if the data is
       normalised.

       With this additional help, Normix and Ny-Normix can make much
       more informed decisions about what to do with text.
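
       (To make this concrete, here is a rough Python sketch - purely
       illustrative, names mine, not part of the proposal itself - of
       text carrying its normalisation form and the Unicode version of
       the tables that produced it:)

       import unicodedata
       from dataclasses import dataclass
       from typing import Optional

       @dataclass
       class TaggedText:
           text: str
           form: Optional[str] = None             # "NFC", "NFD", ... or None
           unicode_version: Optional[str] = None  # version of the tables used

       def normalize_and_tag(text: str, form: str = "NFD") -> TaggedText:
           # Normalize with whatever tables this library carries, and
           # record both the form and that version alongside the text.
           return TaggedText(unicodedata.normalize(form, text),
                             form,
                             unicodedata.unidata_version)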

>Now consider some possible Unicode 4.0 data:

>LATIN SMALL LETTER A, COMBINING GRACKLE, COMBINING ACUTE
       ACCENT

>Ny-Normix will normalize this (Form D) to:

>LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, COMBINING
       GRACKLE

>Normix, of course, could not do this reordering...

       No change here.
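
       (COMBINING GRACKLE is hypothetical, of course; for a real-character
       illustration of the same Form D reordering, COMBINING ACUTE ACCENT
       has combining class 230 and COMBINING CEDILLA has class 202, so in
       Python:)

       import unicodedata

       s = "a\u0301\u0327"    # a, COMBINING ACUTE ACCENT, COMBINING CEDILLA
       nfd = unicodedata.normalize("NFD", s)
       assert nfd == "a\u0327\u0301"   # cedilla (202) now precedes acute (230)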

>...The critical issue is that Ny-Normix not report that
       Unicode 3.0 data normalized by Normix is *not* in fact
       normalized -- or what amounts to the same thing, that a Unicode
       3.0 string normalized by Normix will compare not equal if
       normalized again by Ny-Normix...

       In this case, Ny-Normix would report that the data is U3
       normalized, but not U4 normalized. A process that compares a
       Unicode 4.0 string (that must be what Ken meant) that has been
       normalized by Normix with the same string normalized by
       Ny-Normix will know (because the strings are so marked) that
       it's comparing a U3-normalized string with a U4-normalized
       string, and if it's *really* smart, it will be able to examine
       any differences and determine whether they are significant or
       not.
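
       (One way such a *really* smart comparison might begin, as a Python
       sketch reusing the TaggedText idea above - re-normalising with the
       current tables is just one possible policy, not the only one:)

       import unicodedata

       def tagged_equal(a: TaggedText, b: TaggedText) -> bool:
           if (a.form is not None and a.form == b.form
                   and a.unicode_version == b.unicode_version):
               # Same form, same table version: a direct compare is safe.
               return a.text == b.text
           # Otherwise fall back to normalising both with the current
           # tables (or examine the differences more carefully).
           form = a.form or b.form or "NFD"
           return (unicodedata.normalize(form, a.text)
                   == unicodedata.normalize(form, b.text))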

>Second example:

>CYRILLIC SMALL LETTER EN WITH DESCENDER

>This is a Unicode 3.0 letter. Normix will normalize it as
       itself...

>The problem is what will Ny-Normix, working with Unicode 4.0
       data tables, do. If CYRILLIC SMALL LETTER EN WITH DESCENDER has
       a canonical decomposition in the new tables, using the newly
       encoded COMBINING CYRILLIC DESCENDER, then Ny-Normix *will*
       normalize this letter differently than Normix did (for both
       Form D and Form C). That is bad, since it means that upgrading
       from Normix to Ny-Normix would invalidate already normalized
       data (that could be stored anywhere by that time).

       In this case, upgrading from Normix to Ny-Normix doesn't
       invalidate the data; it just means that not all the data has
       identical status. The data is marked, however, to tell us its
       normalization status, and this additional info provides what's
       needed to handle the data appropriately. There would be various possible
       alternatives for implementation: leave existing data as it is
       and add more knowledge to algorithms that compare strings;
       leave existing data as is but examine a datum when used and
       re-normalize if the normalization status reflects an old
       version; ...
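
       (The second alternative might look roughly like this in Python,
       again reusing the TaggedText sketch; the function name is mine:)

       import unicodedata

       def ensure_current(t: TaggedText, form: str = "NFD") -> TaggedText:
           # If the datum was normalised in the desired form with the
           # current tables, use it as stored; otherwise re-normalise
           # on use and update its recorded status.
           if t.form == form and t.unicode_version == unicodedata.unidata_version:
               return t
           return TaggedText(unicodedata.normalize(form, t.text),
                             form,
                             unicodedata.unidata_version)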

       (I won't re-examine Ken's third example since I wouldn't want
       to encourage addition of precomposed characters, and by now you
       know what comments would be made anyway.)

       Of course, strictly enforcing this proposal across the board
       means we'd be giving up something: plain text. And that's
       probably too strict a requirement - besides being impossible to
       guarantee. Admittedly, these issues wouldn't arise with true
       plain text (processes that compare strings could never assume a
       string is normalized, since the only available info is just the
       plain text: you'd have to normalize in order to tell whether it
       was already normalized, so any comparison would have to
       normalize both strings first, and the same normalization would
       be applied). But the proposal requires introducing new
       infrastructure, that infrastructure moves further from the
       universal ideal of plain text, and problems could arise for any
       existing implementations that don't already track this
       normalization status info.

       And for reasons such as these, I'll guess that most of you who
       read this far will already have rejected the idea.

       Don't mind me - I'm just exercising my brain.

       Peter


