L2/07-126

From: Markus Scherer
Date: Apr 23, 2007 10:48 AM
Subject: Comments on Unicode Format for Network Interchange
To: discuss@apps.ietf.org


Dear Mr. Klensin and Mr. Padlipsky et al.,

I have reviewed and discussed your draft-klensin-net-utf8-03 with some
colleagues. We welcome the standardization on UTF-8 as the default
internet charset.

Our suggestions follow, interleaved with quotes from the internet-draft.
Each suggestion starts with *** and ends with *** ***.

[...]

2.  Net-Unicode

2.1.  Definition

  The Network Unicode (Net-Unicode) format is defined as follows:

  1.  Characters MUST be coded in UTF-8 as defined in [RFC3629].

  2.  Line-endings MUST be indicated by the sequence Carriage-Return
      (U+000D) followed by Line-Feed (U+000A).

*** Suggested change:
  2.  Line-endings MUST be indicated by the sequence Carriage-Return
      (U+000D) followed by Line-Feed (U+000A), or by a single
      Carriage-Return (U+000D), or by a single Line-Feed (U+000A).

Justification: We believe that single CR and LF are common because of
implementation practice on a variety of platforms, and that it is both
unrealistic and unnecessary to try to legislate them away.
Applications already commonly handle all of CR, LF and CR+LF, and some
support even more characters according to the Unicode Newline
Guidelines.
*** ***
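
For illustration only (this is not proposed draft text), here is a minimal
Python sketch of how a recipient might accept all three line-ending forms.
The helper names split_lines and to_crlf are hypothetical, chosen for this
example.

```python
import re

# CR LF is listed first so that the pair is consumed as ONE line ending;
# a lone CR or a lone LF is matched only when no pair is present.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list[str]:
    """Split text on CR LF, a single CR, or a single LF (hypothetical helper)."""
    return _LINE_END.split(text)


def to_crlf(text: str) -> str:
    """Rewrite every accepted line ending as CR LF (hypothetical helper)."""
    return _LINE_END.sub("\r\n", text)


if __name__ == "__main__":
    sample = "first\r\nsecond\rthird\nfourth"
    print(split_lines(sample))
    print(repr(to_crlf(sample)))
```

Python's built-in str.splitlines() is not used here because it also breaks
on further characters (for example U+0085 and U+2028), which goes beyond
the three forms in the suggested text.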

  3.  Before transmission, all character sequences MUST be normalized
      according to Unicode method "NFC" (see Section 3).

*** Suggested change:
  3.  Before transmission, all character sequences SHOULD be normalized
      according to Unicode method "NFC" (see Section 3).

Justification: With the MUST language in the draft, we see the following issues:
* The draft later says that recipients should not just assume
 that incoming text is normalized. Therefore, recipients must
 already be prepared to at least check for normalization.
 -> We believe that the MUST is not useful.
* The normalization requirement is the reason for the Unicode versioning
 and stability discussion below which complicates this internet-draft
 considerably.
 -> We believe that the MUST is not necessary.
* The normalization stability restricts this specification to Unicode versions
 3.2 and above (see section 4).
 -> We believe that this is too restrictive.
    Unicode applications normally handle text from Unicode 2.0 and above.
* We believe that the MUST is unenforceable.
 Moreover, if recipients must check, it doesn't make
 any difference whether it is enforced.

(With this change, much of the following text of the internet-draft
can be simplified significantly. In particular, the discussions of
unassigned/unknown characters, stabilized forms, etc. can and should
be dropped.)
*** ***
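
As a sketch of the recipient-side check mentioned in the first bullet, the
following Python uses the standard unicodedata module. The function name
ensure_nfc is hypothetical, and unicodedata.is_normalized assumes
Python 3.8 or later.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC (hypothetical helper).

    Input that is already normalized is returned unchanged, so the
    common case costs only the check.
    """
    if unicodedata.is_normalized("NFC", text):  # available since Python 3.8
        return text
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print(unicodedata.is_normalized("NFC", decomposed))
    print(ensure_nfc(decomposed) == "\u00e9")
```

Because the recipient runs this check regardless of what the sender
promised, the result is the same under MUST and under SHOULD.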

  4.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
      ("BOM") signature MUST NOT appear at the beginning of these text
      strings.

*** Suggested change:
  4.  The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of
      U+FEFF, sometimes called Byte Order Mark ("BOM")), when it appears
      at the beginning of the text, SHOULD be deleted by the recipient.
      If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD
      be used instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.

Justification: We believe that the draft text is unnecessarily strong,
and at the same time not sufficiently specific for implementers.
*** ***
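
To show how specific the recipient behavior can be made, here is a minimal
Python sketch that deletes the signature only at the very beginning of the
byte stream. The names UTF8_SIGNATURE and strip_utf8_signature are
hypothetical.

```python
# EF BB BF is the UTF-8 encoding of U+FEFF.
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop one leading UTF-8 signature, if present (hypothetical helper).

    Occurrences of EF BB BF elsewhere in the data are left untouched.
    """
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


if __name__ == "__main__":
    print(strip_utf8_signature(b"\xef\xbb\xbfhello"))
    print(strip_utf8_signature(b"hello"))
```

In Python, decoding with the "utf-8-sig" codec has the same effect for a
leading signature.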

[...]

2.2.  The ASCII NVT Definition

  [...]

  1.  The "defined but not required" codes -- BEL, BS, HT, VT, FF --
      and the undefined control codes ("C0") SHOULD NOT be used unless
      required by exceptional circumstances.

*** Suggested change:
  1.  Control codes from both the "C0" (U+0000..U+001F, U+007F)
      and "C1" (U+0080..U+009F) ranges,
      with the exception of HT (09), LF (0A) and CR (0D),
      SHOULD NOT be used unless required by exceptional circumstances.

Justification: The draft should define explicitly, with code point
values, which C0 and C1 control codes should and should not be used.
Only HT, LF and CR are very widely used.
*** ***
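
With explicit code point values, the rule becomes easy to check
mechanically. The Python sketch below is one possible reading of the
suggested text; the names ALLOWED_CONTROLS, is_discouraged_control and
find_discouraged_controls are hypothetical.

```python
# HT, LF and CR are the only control codes exempted by the suggested text.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_discouraged_control(ch: str) -> bool:
    """True for C0 (U+0000..U+001F, U+007F) and C1 (U+0080..U+009F)
    control codes other than HT, LF and CR (hypothetical helper)."""
    cp = ord(ch)
    is_c0 = cp <= 0x1F or cp == 0x7F
    is_c1 = 0x80 <= cp <= 0x9F
    return (is_c0 or is_c1) and cp not in ALLOWED_CONTROLS


def find_discouraged_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, "U+XXXX") for each discouraged control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_discouraged_control(ch)
    ]


if __name__ == "__main__":
    print(find_discouraged_controls("a\tb\x07c\r\n\x85"))
```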

  2.  CR MUST NOT appear except when immediately followed by either NUL
      or LF, with the latter (CR LF) designating the "new line"
      function.  Because page layout is better done in other ways and
      to avoid other types of confusion, CR NUL SHOULD preferably be
      avoided.

  3.  LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
      sequences (e.g., CR LF CR LF).

*** Suggested change:
Remove points 2. and 3.

Justification: The other suggested changes permit CR and LF.
*** ***

[...]

4.  Versions of Unicode

  In retrospect, one of the advantages of ASCII [X3.4-1978] when it was
  chosen was that the code space was full when the Standard was first
  published.  There was no practical way to add characters or change
  code point assignments without being obviously incompatible.  Unicode
  does not have that property: there are large blocks of space reserved
  for future expansion and new versions, with new characters and code
  point assignments, appear at regular intervals.

  While there are some security issues if people deliberately try to
  trick the system (see Section 6), Unicode version changes should not
  have a significant impact on the text stream specification of this
  document for the following reasons:

  o  The transformation between Unicode code table positions and the
     corresponding UTF-8 code is algorithmic; it does not depend on
     whether a code point has been assigned or not.

  o  The normalization specified here, NFC (see Section 3), performs a
     very limited set of mappings, much more limited than those of the
     more extensive NFKC used in, e.g., nameprep [RFC3491].

*** Suggested change:
Drop this second bullet and the following paragraph.

Justification: They become unnecessary once NFC is changed from MUST to SHOULD.
*** ***
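
The first bullet above, which we propose to keep, can be illustrated with
a short Python sketch of the UTF-8 encoding rules from RFC 3629. The
function name utf8_encode is hypothetical. The function consults no
character database, so its result does not depend on whether a code point
is assigned in any particular Unicode version.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper).

    Purely arithmetic: only the numeric value of the code point matters.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0xFEFF, 0x10FFFF):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
```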

  The NFC tables may be updated over time as new characters are added,
  but the Unicode Consortium has guaranteed the stability of all NFC
  strings.  That is, if a string does not contain any unassigned
  characters, and it is normalized according to NFC, it will always be
  normalized according to all future versions of the Unicode Standard.
  The stability of the Net-Unicode format is thus guaranteed when any
  implementation that converts text into Net-Unicode format does not
  permit unassigned characters.

  Were Unicode to be changed in a way that violated these assumptions,
  i.e., that either invalidated the string order of RFC 3629 or that
  that changed the stability of NFC as stated above, this specification
  would not apply.  Put differently, this specification applies only to
  versions of Unicode starting with version 3.2 and extending to, but
  not including, any version for which no changes are made in either
  the UTF-8 definition or to NFC stability.

*** Suggested change:
Modify the paragraph above, removing references to NFC.

Justification: As a result, this specification will then apply to
versions of Unicode starting with version 2.0.
*** ***

[...]

5.2.  The Unicode Applicability Dilemma

[...]

*** Suggested change:
Add an item for a fifth way to get around the problem: Strongly
encourage use of normalization form NFC in interchanged text, but do
not require it.

Justification: This is the alternative discussed here.
*** ***

9.1.  Normative References

*** Suggested change:
Please add a reference for [RFC3629], "UTF-8, a transformation format
of ISO 10646".

Justification: Missing reference.
*** ***

Best regards,
Markus Scherer
Google Software Internationalization
ICU Project Developer