Comments on Public Review Issues

L2/08-282

Comments on Public Review Issues
(May 6, 2008 - August 7, 2008)

The sections below contain comments received on the open Public Review Issues as of August 7, 2008, since the previous cumulative document was issued prior to UTC #115 (May 2008).

In this review period, several longer comments have been pulled out into separate documents: L2/08-275, L2/08-278, L2/08-279, and L2/08-280. They are mentioned below in the relevant sections.

111 Proposed Update to UTS #18 Unicode Regular Expressions
120 Draft UTR #45 U-Source Ideographs
121 Recommended Practice for Replacement Characters
122 Proposal for Additional Deprecated Characters
123 Bengali Currency Numerator Values
124 Proposed Update UTR #23: The Unicode Character Property Model
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues

111 Proposed Update to UTS #18 Unicode Regular Expressions

Date/Time: Thu Jul 31 05:04:49 CDT 2008
Contact: asmus@unicode.org
Name: Asmus Freytag

Just a few nits:

There's a duplicated "intersection intersection" in the text.

There's the use of a 1 and 2 in the left column of a table (u+-grave). Hard to see with the coloring, but I suspect it would be less cluttered without the numbering.

120 Draft UTR #45 U-Source Ideographs

See also document L2/08-279.

Date/Time: Wed Jun 4 19:45:50 CDT 2008
Contact: from.unicore@jdlh.com
Name: Jim DeLaHunt
Opt Subject: UTR #45, links in Front Matter point to UTR #41

This is a purely editorial comment about Public Review issue UTR #45, "U-source Ideographs", at http://www.unicode.org/reports/tr45/tr45-1.html.

In the front matter for this UTR, there are references named "[Feedback]", "[Unicode]", etc., which are hyperlinked. From the formatting you would think that the names would be explained in the "References" section. But no, those names don't appear in the References. The hyperlinks all point to the References section of UTR #41, e.g. http://www.unicode.org/reports/tr41/tr41-3.html#Feedback .

This gets the job done, but it does seem a bit odd. An option for the editor to consider would be to copy all these references from UTR #41 to the next draft of UTR #45, and make these hyperlinks point locally, to the References section of UTR #45.

I hope this feedback is helpful.

Date/Time: Wed Jun 4 19:38:47 CDT 2008
Contact: from.unicore@jdlh.com
Name: Jim DeLaHunt
Opt Subject: UTR #45, describing [Glyphs]

This is feedback on a Public Review item, draft UTR #45 (U-source Ideographs) at http://www.unicode.org/reports/tr45/tr45-1.html. It comes purely from reading the draft, without any knowledge of what the IRG or other intended audience may require from the finished document.

The draft has a very nice description of the Text File Data. However, it doesn't say anything at all about the Glyphs document, except where to find the current version. Description of the Glyphs document that might be useful for this UTR:

- One instance of each glyph is shown.
- Glyphs are all in the same typeface and in one typeface only (thus distinctions which are visible only in some typefaces may not be revealed).
- There is no space for commentary about the glyph, and in particular how it may be similar to or different from another character's glyph.

It seems that it might be useful to some readers to add a section describing the Glyphs document, perhaps addressing some of the points above.

Feel free to weigh this feedback according to your more extensive knowledge of what the intended audience already knows and what they need to learn from this UTR. As I mentioned, I am completely ignorant of that.

I hope this submission is helpful.

Date/Time: Thu Jun 5 03:51:19 CDT 2008
Contact: dominikus@scherkl.de
Name: Dominikus Scherkl

The data contain several characters with status "D" for which a code point is nevertheless given, almost all in the range U+Fxxx. This seems senseless to me. These glyphs should have status U or maybe C, but not D.

Date/Time: Wed Jun 11 14:37:01 CDT 2008
Contact: rick@unicode.org
Name: Rick McGowan
Opt Subject: UTR #45 status paragraph

The status paragraph of the latest draft of UTR #45 is non-conforming. It should be updated to the latest template.

121 Recommended Practice for Replacement Characters

See also document L2/08-280.

Date/Time: Thu Jul 24 15:36:58 CDT 2008
Contact: behdad@behdad.org
Name: Behdad Esfahbod
Opt Subject: Feedback on Replacement Character

Hi,

I like option 3 more than option 2 as it exactly shows how many invalid codepoints were present.

Let me also explain how I handle ill-formed sequences in Pango. It's pretty much like option 3, but instead of REPLACEMENT CHARACTER I actually use the value U-FFFFFFFF. The reason I do this is that my text engine knows how to handle that value specially, and using REPLACEMENT CHARACTER would have been losing information.

The beauty of my approach is that my UTF-8-to-UTF-32 conversion produces valid UTF-32 if and only if the input was valid UTF-8.

Cheers,

behdad
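A minimal sketch of the idea described in the comment above, written in Python for illustration and not taken from Pango. It assumes that "option 3" means one marker per ill-formed byte. The function names and the use of Python's surrogateescape error handler are assumptions; only the marker value 0xFFFFFFFF comes from the comment.

```python
from typing import List

SENTINEL = 0xFFFFFFFF  # out-of-range marker value mentioned in the comment


def utf8_to_utf32_with_sentinel(data: bytes) -> List[int]:
    """Hypothetical helper: decode UTF-8 into a list of UTF-32 code values."""
    # surrogateescape maps every undecodable byte 0xNN to the lone surrogate
    # U+DCNN, so the number of bad bytes is preserved.
    text = data.decode("utf-8", errors="surrogateescape")
    return [SENTINEL if 0xDC80 <= ord(ch) <= 0xDCFF else ord(ch) for ch in text]


def is_valid_utf32(values: List[int]) -> bool:
    """True when every value is a Unicode scalar value."""
    return all(v <= 0x10FFFF and not 0xD800 <= v <= 0xDFFF for v in values)


good = utf8_to_utf32_with_sentinel("a\u00e9".encode("utf-8"))
bad = utf8_to_utf32_with_sentinel(b"a\xe1\x80b")  # truncated 3-byte sequence

print(is_valid_utf32(good), is_valid_utf32(bad))
```

Because the marker lies outside the Unicode code space, the output is valid UTF-32 exactly when the input was valid UTF-8, which is the property the comment points out.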

Date/Time: Tue Jul 29 05:03:05 CDT 2008
Contact: duerst@it.aoyama.ac.jp
Name: Martin J. Dürst
Opt Subject: Public Review Issue 121

Based on the 'native' implementation of character encoding conversion in Ruby (String#encode, transcode.c), I agree that option #2 is the one to lean towards, because it's the most natural from a user's viewpoint and probably the easiest to implement.

However, the "lean towards" should be a very weak lean, because the reason for standards is to reduce errors, not to discuss how to deal with them. In essence, a user should not be able to expect anything more than "garbage in, garbage out".

As for wording, I think it would help a lot if the text avoided introducing new long-winded terms such as "maximal subpart of the ill-formed subsequence" only to define them a few lines later by a slightly longer phrase. This creates unnecessary overhead for implementers and other readers.

Date/Time: Sat Aug 2 13:55:13 CDT 2008
Contact: markus.icu@gmail.com
Name: Markus Scherer
Opt Subject: PRI #121: agree with preference for option #2

For what it's worth, most ICU functions implement option #2, "Replace each maximal subpart of the ill-formed subsequence". It seems the most natural, indicating each single error, each single ill-formed set of bytes. In the example in the PRI, the most natural assumption is that there are three multi-byte character byte sequences, each with one of their trail bytes missing.

It's also easy to implement, by handling each error right when it's detected, and by handling each error by one "event" (replacement character or callback function call or similar). In addition, an API that reports ill-formed byte sequences need deal with only fixed-maximum-length sequences; for UTF-8 input, at most 6 bytes (to handle the 5- and 6-byte sequences of the original UTF-8 definition) -- rather than the ability to handle ill-formed subsequences which could potentially be the entire input text. Further, using replacement characters with option #2 does not increase the text size as much as option #3 (but of course more than #1).

However, I don't think the Unicode standard should discourage the other options overly strongly. As long as well-formed subsequences of the text are handled properly and ill-formed ones detected properly, there should be freedom in the details of error handling for ill-formed subsequences.
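The difference between options #2 and #3 can be sketched in Python. This is an illustration only, not ICU code. It relies on the assumption that CPython 3's built-in "replace" handler emits one U+FFFD per maximal subpart (option #2); the handler name replace_each_byte and the sample bytes are hypothetical, chosen to match the description of three multi-byte sequences that are each missing a trail byte.

```python
import codecs


def replace_each_byte(exc):
    """Hypothetical 'option #3' handler: one U+FFFD per ill-formed byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.start..exc.end is the ill-formed span reported by the decoder.
    return ("\ufffd" * (exc.end - exc.start), exc.end)


codecs.register_error("replace_each_byte", replace_each_byte)

# Assumed sample: "a", then F1 80 80, E1 80 and C2 (each truncated), then "b".
data = b"a\xf1\x80\x80\xe1\x80\xc2b"

option2 = data.decode("utf-8", errors="replace")
option3 = data.decode("utf-8", errors="replace_each_byte")

print(option2.count("\ufffd"), option3.count("\ufffd"))
```

On CPython 3 this should print 3 and 6: each error is one event under option #2, while option #3 grows the output by one replacement character per byte.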

Date/Time: Mon Aug 4 05:56:24 CDT 2008
Contact: texin@netapp.com
Name: Tex Texin
Opt Subject: #121 Recommended Practice for Replacement Characters

The choice of how to deal with ill-formed sequences should be made in the context of what is going to be done with the information.

If the intent is to indicate to the end-user that information is lost or corrupted, then it is perhaps unnecessary to issue more than a very few replacement characters to represent a sequence of bad values. There is something to be said for indicating how many bytes or characters have been replaced by emitting longer sequences of replacement characters, but it is marginal in value.

On the other hand, if it is expected to use the information to diagnose what might have occurred, then one per byte (for UTF-8) makes sense, and even better is an algorithm that converts the bytes to representable values (such as hex digits).

The exact choice should also be a function of the text format. If the text occurs in html or xml, there are more options for indicating there are invalid values and providing the original information should the user or a programmer want to see it.

The document does not explore the many different types of corruption and the possible consequences of continuing using replacement characters rather than rejecting the material. For example, if the document is markup language or some other format, or a source file for a programming language, replacing corrupt code units with the replacement character and continuing may lead to a more catastrophic failure or a security break (since the corruption may have affected the markup or program syntax).

The recommendations are very weak. The UTC is not improving the standard or best practices by offering this recommendation. If this is the extent of the recommendation, it is better to leave it to users to decide what they want to do and not publish this as a recommendation.
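The contrast drawn in this comment between signalling loss to an end user and preserving the bytes for diagnosis can be sketched with Python's standard error handlers. The sample input is an assumption; "replace" and "backslashreplace" are existing handler names in Python 3.5 and later.

```python
# Assumed sample: a 0xC2 lead byte whose trail byte is missing.
data = b"ab\xc2cd"

# For an end user: a single replacement character signals the loss.
for_end_user = data.decode("utf-8", errors="replace")

# For diagnosis: the offending byte is kept as a visible hex escape.
for_diagnosis = data.decode("utf-8", errors="backslashreplace")

print(for_end_user)
print(for_diagnosis)
```

The second form keeps the original byte value recoverable, at the cost of output that is not meant for display to end users.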

122 Proposal for Additional Deprecated Characters

See documents L2/08-275, L2/08-278

123 Bengali Currency Numerator Values

Date/Time: Thursday August 7, 2008
Contact: cowan@ccil.org
Name: John Cowan

I support this proposal: it makes sense.

124 Proposed Update UTR #23: The Unicode Character Property Model

No feedback was received via the reporting form this period.

Other Reports

Date/Time: Tue Jun 10 01:58:56 CDT 2008
Contact: srikrishnan2003@yahoo.co.in
Name: Srikrishnan
Opt Subject: error in NamedSequencesProv.txt

Hi,

I think there is a textual error in the NamedSequencesProv.txt file at the following web path: "http://unicode.org/Public/5.1.0/ucd/NamedSequencesProv.txt"

The following line:

TAMIL SYLLABLE SHRII; 0BB6 0BCD 0BB0 0BC0

should be

TAMIL SYLLABLE SHRII; 0BB8 0BCD 0BB0 0BC0

because only the second line correctly generates the Tamil letter SHRII.

Maybe I am wrong. Please check.

Regards, Srikrishnan
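To make the difference between the two sequences in this report easier to see, the character names can be listed with Python's unicodedata module. This is only a lookup aid and does not settle which sequence is correct; the helper name `names` is hypothetical.

```python
import unicodedata

listed = "0BB6 0BCD 0BB0 0BC0"      # sequence as given in the file
suggested = "0BB8 0BCD 0BB0 0BC0"   # sequence proposed in the report


def names(sequence: str):
    """Hypothetical helper: character names for space-separated hex code points."""
    return [unicodedata.name(chr(int(cp, 16))) for cp in sequence.split()]


for label, seq in (("listed", listed), ("suggested", suggested)):
    print(label, names(seq))
```

The two sequences differ only in the first code point: U+0BB6 is TAMIL LETTER SHA and U+0BB8 is TAMIL LETTER SA.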

Date/Time: Thu Jun 12 23:18:55 CDT 2008
Contact: hardy_paul@hotmail.com
Name: Paul Hardy
Opt Subject: 5.1 Code Chart errors

The main code chart for the Unicode 5.1 Saurashtra script shows glyphs that are cropped, notably at the top. The tiny glyphs in the detailed description for each glyph that follows show the glyphs correctly, without cropped edges.

Date/Time: Thu Jun 19 16:05:22 CDT 2008
Contact: asyropoulos@gmail.com
Name: Apostolos Syropoulos
Report Type: Error Report

The file http://unicode.org/Public/UNIDATA/UnicodeData.txt defines the uppercase version of each character that is included in Unicode. The problem is that the uppercase form of a Greek letter with accents is the corresponding letter without any accents. If you ask a native speaker to write something (a word, a sentence) in uppercase, he/she will not use any accents at all. In Greek, accents are there to indicate stress of the voice, and no one uses accents in an all-capitals word. In addition, when sorting, an accented letter does not precede the corresponding letter without accents.
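The behavior described in this report can be illustrated in Python, whose str.upper() uses the default Unicode case mappings. The helper upper_without_tonos is a hypothetical sketch of the accent-free uppercasing the report asks for. It assumes monotonic Greek and removes only U+0301 (the combining mark that the tonos decomposes to), so it is not a complete Greek uppercasing rule.

```python
import unicodedata


def upper_without_tonos(text: str) -> str:
    """Hypothetical helper: uppercase, then drop the combining acute (tonos)."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    stripped = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", stripped)


word = "\u03ac\u03bb\u03c6\u03b1"  # Greek "alpha" in lowercase, with tonos

default_upper = word.upper()               # default mapping keeps the tonos
accent_free = upper_without_tonos(word)    # what the report describes

print(default_upper, accent_free)
```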

Date/Time: Thu Aug 7 17:32:45 CDT 2008
Contact: markus.icu@gmail.com
Name: Markus Scherer
Opt Subject: buggy Joiner contexts in UAX #31 2.3

In http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

UAX #31 Unicode Identifier and Pattern Syntax
2.3 Layout and Format Control Characters
A2. Allow ZWNJ in the following context:

- A Letter, followed by a Virama, followed by a ZWNJ, followed by an Letter
- This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWNJ/

The regex does not have the trailing $L corresponding to the verbal description. I don't know whether the description or the regex is correct. In http://unicode.org/review/pr-96.html the regex does have the trailing $L.

markus
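The practical effect of the missing trailing $L can be sketched with Python's re module. The character classes below are small Devanagari-only stand-ins chosen for illustration, not the actual $L and $V definitions from UAX #31, and the variable names are hypothetical.

```python
import re

# Assumption: tiny stand-ins for the $L and $V classes (Devanagari only).
L = "[\u0915-\u0939]"   # DEVANAGARI LETTER KA..HA
V = "\u094D"            # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"         # ZERO WIDTH NON-JOINER

without_trailing = re.compile(f"{L}{V}{ZWNJ}")      # regex as printed in UAX #31
with_trailing = re.compile(f"{L}{V}{ZWNJ}{L}")      # regex matching the prose

text_mid = "\u0915\u094D\u200C\u0937"   # KA, VIRAMA, ZWNJ, SSA
text_end = "\u0915\u094D\u200C"         # KA, VIRAMA, ZWNJ at end of identifier

for text in (text_mid, text_end):
    print(bool(without_trailing.search(text)), bool(with_trailing.search(text)))
```

Under these assumptions the two patterns agree when a letter follows the ZWNJ and differ only when the ZWNJ is the last character, which is the case the report asks to have clarified.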

Closed Public Review Issues

No feedback was received via the reporting form this period.
