L2/99-044

From: Martin J. Duerst [duerst@w3.org]
Sent: Monday, February 01, 1999 4:45 AM
To: Multiple Recipients of Unicore
Cc: unicore@unicode.org
Subject: NORMALIZATION

Hello Mark,

I guess you must be very busy, and I'm late with these comments and very
sorry about it. I was a bit out of the loop over Christmas/New Year, but
now I'm back.

Here is my feedback on the normalization draft technical report
(http://www.unicode.org/unicode/reports/tr15/tr15-10.html). It's in three
parts: first, things that I think might affect Unicode 3.0; second,
details on your TR; third, some general considerations. I will also send
a copy of this mail to the W3C I18N WG.

Things affecting Unicode 3.0
============================

Sorry, I only got the code charts and not the actual text, so I can only
comment on previous discussions and on what's in 2.0/2.1. This is mainly
about Korean canonicalization.

What we want is for decomposition to result in simple three-code
L(eading)V(owel)T(railing) or two-code LV sequences. We therefore have to
move the Jamo subcomposition from being a compatibility decomposition to
being some separate kind of decomposition, as Ken suggested and as we
discussed in December. In the database, this can easily be taken into
account by just giving these decompositions a separate label, instead of
the one we have currently.

Another thing that we have to consider in this context is the interaction
of modern hangul precompositions (johab) and ancient jamo, basically for
cases such as KAKR. The problem here is that KR is listed as a trailing
jamo (U+11C3), but it is not a modern one, and therefore KAKR is not in
the johab list. KAKR is representable at least as:

- K - A - KR   (favor two/three-code jamo sequences)
- KA - KR      (use precomposition, but keep individual syllable
                components together)
- KAK - R      (do greedy composition from the start)

The first variant is what I described in my early internet draft. It seems
to be the most natural solution in this case, but I know "natural" is very
subjective. Some of the current text about how to construct johab affects
this. We should try to make sure that:

- the text describing johab composition and the C form of Hangul match;
- the two things above match general expectations, if there is anything
  like that.

For the vowels, the same problem appears. For the leading component, e.g.
NKAK (NK is U+1113), there seems to be only one reasonable representation,
namely NK - A - K. Or is there another one?

We should also make clear that (or whether?) the insertion of fillers, as
described on page 3-12 of Unicode 2.0, is part of canonicalization.

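To make the johab arithmetic concrete, here is a minimal sketch (Python
and the function name are mine, purely for illustration; the base and
count constants are the usual ones from the standard) of the decomposition
that yields exactly these two-code LV and three-code LVT sequences for
modern syllables:

# Arithmetic decomposition of a precomposed (johab) Hangul syllable
# into L V or L V T conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = L_COUNT * N_COUNT          # 11172

def decompose_hangul(cp: int) -> list:
    """Decompose a precomposed syllable into [L, V] or [L, V, T]."""
    s_index = cp - S_BASE
    if not (0 <= s_index < S_COUNT):
        return [cp]                  # not a precomposed syllable
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return [l, v] if t_index == 0 else [l, v, T_BASE + t_index]

# U+AC01 ("KAK" in the romanization above) -> U+1100 U+1161 U+11A8 (L V T)
print([hex(c) for c in decompose_hangul(0xAC01)])

For KAKR there is no precomposed syllable to decompose, so the first
variant above simply spells it directly as U+1100 U+1161 U+11C3.
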
- The sections "labels", "notation", and "definitions" are difficult to distinguish. If one has a very close look, one sees why the distinction is made, but most readers won't go that deep, and will be rather confused. This can either be improved by changing the titles, or changing the structure even more. E.g. the section on labels could be integrated with the Introduction (which already mentions all four forms) by just saying that a table gives an overview of these four forms, the labels that are defined for them,... - The "table of contents" should be moved to before the Intro. - The specification section takes a first go ("Basically,..."), and then starts again. As this is the specification, I propose to really only just have the actual definitions. In general, beware of the word "basically". - In the specification section, change the subtitle e.g. to "Normalization Form C" only, then add a sentence such as "The Normalization Form C for a string S is obtained by applying the following process, or any other process that leads to the same result". - Part of the "notation" is only used in the examples section. Please move that there. Why use ND(), NC(),...? Why not just D()? The parentheses make clear whether it's the normalization form label or the function. - "Primary canonical compositions": This is an interesting concept. But I think we should turn things around. Instead of marking the usual case as "primary", we should mark the exception case as something like "non-recomposing". It turns out that this case at least contains the following: - One-to-one compositions such as Kelvin and Angstrom - Special cases such as Hebrew - Post v3.0 precomposed characters Because we need the second and the third case in the database, we don't need to introduce definitions for the first case. [In the database file, I propose to use a label such as ; this would make the v3.0 database file slightly incompatible with the v2.x database file, but the v2.x one could be produced easily with a sed/awk/perl expression such as "s///".] Single-character decompositions would then just be listed as , and wouldn't be a special case anymore. With this, the section on definitions could easily be integrated with the specification section. The specifications themselves would then become a bit longer, but I think this is desirable. Trying to factor out everything from the specification itself, and making the specification as short and crisp as possible, is an interesting intellectual exercise, but it means that an implementer has to assemble all the factored-out pieces, and will easily overlook one of these pieces. - Goals: Please say (ideally in the title) whether these are the goals of the algorithm design or the goals of the compositions, or what. Ideally, move that section to immediately after the intro, or to an appendix. Now it stands between sections that the reader needs to implement the algorithm. - I think the second design goal is difficult to understand. "stability" is a very general term. - Conformance should come after the specification itself. - In the actual specification, it is not clear whether once one has found a C-B pair that combines, one has to start searching again all the B between the C and B just found or not. Please make this explicit. (It also might depend on the tables used). 
General Considerations
======================

I have contacted several influential and knowledgeable people in the IETF
and presented them with the following list of ways to recombine after
decomposition and canonical ordering:

1) Stay with the decomposed form for everything.
2) Use a precomposed character if there is one that matches the full
   string; otherwise use the fully decomposed form.
3) Use a precomposed character for the longest possible initial match, and
   represent the rest with combining characters.
4) Pairwise absorb combining characters into the base character, checking
   the combining characters in order. Stop when all are checked. Leave
   those that have not been absorbed as combining characters.
5) Check all possible combinations of precomposed characters and combining
   characters that are equivalent; choose the shortest one; use some
   tiebreaker if there is more than one.

Everybody I presented this list to chose 3). At worst, this means that 4)
(the current proposal) will not be accepted by the IETF; that would be too
bad, because it would rule 4) out. At best, it is a strong indication that
4) is quite difficult to understand, and that we have to make a great
effort to describe it very clearly, to explain its advantages, and to
provide implementations. (A small sketch contrasting 3) and 4) follows at
the end of this mail.) I'll try to probe further to find out which of the
above applies. If you want to be part of the discussion, please tell me;
I guess that might help on both sides.

A completely different point, but nevertheless important: does the Unicode
Consortium have any policies about IPR? In particular, does it have any
disclosure policies for patent claims? This is a 120% hypothetical example
(I hope), but suppose somebody files a patent on a certain way to encode
Mongolian variants. That party collaborates in designing the Unicode
Mongolian solution and successfully manages to get their ideas adopted.
Later, the patent is granted, and this party starts asking everybody
implementing Unicode to pay them licence fees for Mongolian. Any
information on this is greatly appreciated.

Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium  #-#-#
mailto:duerst@w3.org   http://www.w3.org

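To make the difference between 3) and 4) concrete, a small sketch in
Python (purely illustrative: the pair table and combining-class table are
tiny hand-made excerpts, the function names are mine, and a real
implementation would derive its tables from the database). Both functions
assume one base character followed by combining marks that are already in
canonical order; the input is chosen so that the two strategies differ.

# Tiny excerpts: a + acute -> á (U+00E1), a + ogonek -> ą (U+0105);
# combining classes for acute, ogonek, and long solidus overlay.
PRECOMPOSED = {"a\u0301": "\u00e1", "a\u0328": "\u0105"}
CCC = {"\u0301": 230, "\u0328": 202, "\u0338": 1}

def strategy3(s: str) -> str:
    """3) Use a precomposed character for the longest initial match;
    leave the rest as combining characters."""
    for end in range(len(s), 1, -1):
        if s[:end] in PRECOMPOSED:
            return PRECOMPOSED[s[:end]] + s[end:]
    return s

def strategy4(s: str) -> str:
    """4) Pairwise absorb marks into the base, in order; marks blocked by
    an earlier kept mark of equal or higher class, or without a
    precomposed pair, stay as combining characters."""
    base, kept, highest_kept_ccc = s[0], [], 0
    for mark in s[1:]:
        if CCC[mark] > highest_kept_ccc and base + mark in PRECOMPOSED:
            base = PRECOMPOSED[base + mark]        # absorbed
        else:
            kept.append(mark)
            highest_kept_ccc = CCC[mark]
    return base + "".join(kept)

s = "a\u0338\u0301"      # a, long solidus overlay (class 1), acute (230)
print(strategy3(s))      # a + U+0338 + U+0301: no initial match exists
print(strategy4(s))      # á (U+00E1) + U+0338: the acute is still absorbed

With 3) nothing recomposes, because there is no precomposed character for
the initial a + U+0338; with 4), the acute is absorbed past the overlay
(which has a lower combining class), so the two strategies give different
results on exactly the kind of sequence where the debate matters.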