L2/04-026

Comments on Public Review Issues

The sections below contain comments received on the open Public Review Issues as of January 28, 2004.

13 Unicode 4.0.1 Beta

Date/Time: Wed Jan 21 20:54:23 EST 2004
Contact: Markus Scherer

UCD-4.0.1d5b.html says
Sentence_Terminal
B
I
Used in UAX #9. Marks characters that generally terminate sentences.

It's UAX #29, not #9.

markus

Date/Time: Sun Jan 25 04:08:22 EST 2004
Contact: Matitiahu Allouche

This is a comment on Public Review issue #13, and more specifically on the "Proposed Update UAX #9 Bidirectional Algorithm".

I have already expressed my view on the Unicode discussion lists, but I will repeat here as a reminder for the next UTC meeting.

I disagree with item 1 of section 4.3 "Higher-Level Protocols" which allows overriding the Bidi properties of characters. In my opinion, this compromises the interchangeability of Unicode text, since the same text may be seen in n different ways by n users, with the differences affecting word ordering, i.e. meaning and not only cosmetics.

In order to avoid any confusion, I propose to suppress this paragraph, and to add in the conformance clauses that the Bidi properties of characters are normative and *not* subject to overriding.

Matitiahu Allouche

20 Draft UTR #31 Identifier and Pattern Syntax

No comments were received via the public reporting form.

25 Proposed Update UTR #17 Character Encoding Model

Date/Time: Fri Nov 28 12:27:48 EST 2003
Contact: Peter Constable

The draft text for TR17, section 5 says, "A simple character encoding scheme is a mapping of each code unit of a CCS into a unique serialized byte sequence." It goes on to define a compound CES. While not stated explicitly, Unicode's CESs do not fit the definition of a compound CES, and so the definition for simple CES must apply.

The problem is that this definition cannot accommodate all seven Unicode CESs. Since it defines a CES as a mapping from each code unit, there are only two possible byte-order-dependent mappings for 16- and 32-bit code units. In other words, the distinction between UTF-16BE and UTF-16 data that is big-endian cannot be a CES distinction because individual code units are mapped in exactly the same way in both cases.

A definition for simple CES must, at a minimum, refer to a mapping of *streams* of code units if it is to include details about a byte-order mark that may or may not occur at the beginning of a stream.

I would suggest that, in order to accommodate the UTF-16 and UTF-32 CESs, an appropriate definition should actually be a level of abstraction away from "a mapping": a CES is a specification for mappings. Any mapping is necessarily deterministic, giving a specific output for each input. A mapping itself cannot serialize "in either big-endian or little-endian format"; it must be one or the other, unambiguously. On the other hand, a specification for how to map into byte sequences can be ambiguous in this regard. Thus, the UTF-16 CES can be considered a specification for mapping into byte sequences that allows a little-endian mapping or a big-endian mapping.
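The distinction drawn above can be illustrated with a short sketch. It uses Python's built-in codec names ("utf-16", "utf-16-be", "utf-16-le") as stand-ins for the Unicode CESs; that correspondence, and the sample text, are assumptions made for illustration and are not part of the comment.

```python
# Illustrative only: Python codec names stand in for the Unicode CESs.
text = "A\u00e9"  # two UTF-16 code units: 0x0041, 0x00E9

be = text.encode("utf-16-be")   # big-endian, no byte order mark
le = text.encode("utf-16-le")   # little-endian, no byte order mark
plain = text.encode("utf-16")   # byte order mark, then the platform's byte order

BOM_BE, BOM_LE = b"\xfe\xff", b"\xff\xfe"

# Per code unit there are only two possible byte mappings.
print(be.hex(" "))   # 00 41 00 e9
print(le.hex(" "))   # 41 00 e9 00

# The "utf-16" output is one of those two mappings behind a BOM, so the
# difference shows up only at the level of the whole stream.
print(plain in (BOM_BE + be, BOM_LE + le))

# A "utf-16" decoder accepts either serialization of the same text.
same_text = (BOM_BE + be).decode("utf-16") == (BOM_LE + le).decode("utf-16") == text
print(same_text)
```

Here the per-code-unit mapping is identical for "utf-16-be" and for big-endian "utf-16" output; only the stream-initial mark differs.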

Date/Time: Thu Dec 11 04:16:51 EST 2003
Contact: jhi -at- iki.fi

Comment on the "Proposed Update Unicode Technical Report #17 Character Encoding Model", http://www.unicode.org/reports/tr17/tr17-3.3.html.

Given that "The CES must take into account the byte-order serialization of all code units wider than a byte that are used in the CEF." how can UTF-16 and UTF-32 then be listed later as CESs?

"The Unicode Standard has 7 character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE."

My understanding was that "UTF-16" and "UTF-32" apply only when speaking of the "in-memory" formats, but when serialization via I/O is considered, the "BEness" or "LEness" must be explicitly specified. To summarize:

UTF-8: both a CEF and a CES
UTF-16: a CEF
UTF-16BE, UTF-16LE: both CES
UTF-32: a CEF
UTF-32BE, UTF-32LE: both CES

Right?

Date: Tue, 6 Jan 2004 18:02:01 -0800
From: Mark Davis

Here is feedback on UTR #17. It has already been supplied to Asmus, but so that it doesn't fall through the cracks I'll submit it as a document. The last item is the most important.

In addition to the five individual levels, there is the useful concept of a
Character Map (CM), which is an operation that bridges all five levels. It is
defined as a mapping from an abstract character repertoire to a serialized
sequence of bytes or octets. (In ISO standards the term octet is used for an
8-bit byte).

Add: For more information on character mappings, see [CharMapML]

Some other character sets use a limited notion of open repertoires. For
example, Microsoft has on occasions extended the repertoire of its Windows
character sets by adding a handful of characters to an existing repertoire. This
occurred when the EURO SIGN was added to the repertoire for a number of Windows
character sets, for example.

Add: For more information on the recommended mappings of unassigned characters, see [CharMapML]

The sequence represents a valid code point, but is unassigned.

Add same note.

We conclude that the correspondence ...

Strange phrasing, more like an academic paper. We don't conclude. It just is. Change to: Thus the correspondence ...

A simple character encoding scheme is a mapping of each code unit of a CCS into
a unique serialized byte sequence.
The CES may involve two or more CCSs, and may include code units (e.g. single
shifts, SI/SO, or escape sequences) that are not part of the CCS per se, but
which are defined by the character encoding architecture and which may require
an external registry of particular values (as for the ISO 2022 escape
sequences). In such a case, the CES is called a compound CES.
Both of these types are commonly referred to as CES.

This really needs to be fixed. It is *not* that they are commonly referred to. It is their definition! And the way it is phrased, a compound is not well formed, since it is in terms of simple CESs, not CCS -- the latter don't map to bytes!! Must be changed to something like:

A character encoding scheme (CES) is a mapping of code units to bytes in one of
two ways:

a. A simple CES is a mapping of each code unit of a CCS into a unique serialized
byte sequence.

b. A compound CES uses two or more simple CESs, plus a mechanism to shift
between them. This mechanism includes bytes (e.g. single shifts, SI/SO, or
escape sequences) that are not part of the simple CESs per se, but which are
defined by the character encoding architecture and which may require an external
registry of particular values (as for the ISO 2022 escape sequences).
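As a rough illustration of the compound case, the sketch below encodes a short string with Python's "iso2022_jp" codec and checks for the ESC byte that introduces ISO 2022 escape sequences. The choice of codec and sample text is an assumption for illustration; the proposed definition does not prescribe any particular implementation.

```python
# Illustrative only: Python's "iso2022_jp" codec as an example of a compound CES.
ESC = b"\x1b"  # introduces ISO 2022 escape sequences

ascii_only = "abc".encode("iso2022_jp")       # stays in the initial character set
mixed_text = "abc\u3042"                      # ASCII plus HIRAGANA LETTER A
mixed = mixed_text.encode("iso2022_jp")       # must shift to another character set

print(ESC in ascii_only)   # no shift needed for ASCII-only text
print(ESC in mixed)        # shift bytes appear that belong to neither character set
print(mixed.decode("iso2022_jp") == mixed_text)
```

The escape bytes are part of the serialized output but not of either underlying code unit mapping, which is the property item (b) describes.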

Date/Time: Sat Jan 10 21:16:32 EST 2004
Contact: Doug Ewell

Here are my comments on Public Review Issue #25, "Proposed Update UTR #17 Character Encoding Model." All of these comments are editorial in nature, and refer to Revision 3.3 dated 2003-11-24.


New Section 5.1, "Byte Order"

The sentence starting "In Unicode" is a bit awkward. I suggest the following revision:

"In Unicode, the character at code point U+FEFF is defined as the byte order mark, while its byte-reversed counterpart is a non-character (U+FFFE) in UTF-16, or outside the code space (0xFFFE0000) in UTF-32."
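The byte-reversal property in the suggested sentence can be checked with a small sketch. The helper name detect_utf16_order is hypothetical and is introduced here only for illustration.

```python
import struct

BOM = 0xFEFF
MAX_CODE_POINT = 0x10FFFF

# Serialize the BOM big-endian, then misread it as little-endian.
swapped16 = struct.unpack("<H", struct.pack(">H", BOM))[0]   # 16-bit code unit
swapped32 = struct.unpack("<I", struct.pack(">I", BOM))[0]   # 32-bit code unit

def detect_utf16_order(data: bytes) -> str:
    """Hypothetical helper: report the byte order implied by a leading BOM."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "unmarked"

print(hex(swapped16))                  # 0xfffe
print(hex(swapped32))                  # 0xfffe0000
print(swapped32 > MAX_CODE_POINT)      # True: outside the code space
print(detect_utf16_order("hi".encode("utf-16-le")))   # unmarked
```

Because neither misread value is a valid character, a reader can tell from the first code unit which byte order was used.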


Section 7, "Transfer Encoding Syntax (TES)"

In the new passage about BOCU-1 and Punycode:

"Like BOCU-1, Punycode, defined in [RFC3942], is unique only on a string basis and is therefore properly understood as a TES."

the reference should be RFC3492, not 3942. The References section is OK.

Also, "Unicode Technical Note#6" should have a space after "Note."


Section 8.1, "Strings"

Replace "an sequence" with "a sequence" throughout the following paragraph:

"A string datatype is simply a sequence of code units. Thus a Unicode 8-bit string is an sequence of 8-bit Unicode code units, a Unicode 16-bit string is an sequence of 16-bit code units, and a Unicode 32-bit string is an sequence of 32-bit code units."

Date/Time: Tue Jan 20 08:35:24 EST 2004
Contact: Kent Karlsson

Re the public review issue 25, on UTR 17:

Section 5.1 on "byte" (octet!) order should state that this is a general problem for data that aren't already just octet sequences. There is also a general solution, which has been adopted for many years: it's called network byte (octet) order, which is defined to be big-endian. This way there is no need for any byte order mark, and numeric, sound, image, record, as well as character data can be transferred over the network without endian-ness issues (there may still be alignment issues; but all padding is removed; however, that is not directly relevant for UTR 17). Unix/POSIX has a built-in API for octet order correction:

	uint32_t htonl(uint32_t hostlong);
	uint16_t htons(uint16_t hostshort);
	uint32_t ntohl(uint32_t netlong);
	uint16_t ntohs(uint16_t netshort);

(see http://www.opengroup.org/onlinepubs/007904975/functions/htonl.html)

The use of network octet order is both superior and more general than the use of a byte order mark, and should therefore be strongly preferred over using a byte order mark.
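For readers without a C toolchain at hand, the following sketch shows the same idea using Python's struct and socket modules, which expose network byte order and the htons/ntohs calls. The sample code units are arbitrary and chosen only for illustration.

```python
import socket
import struct

code_units = [0x0041, 0x05D0, 0xD835, 0xDD38]   # arbitrary 16-bit code units

# "!" selects network (big-endian) order regardless of the host platform.
fmt = "!%dH" % len(code_units)
wire = struct.pack(fmt, *code_units)
received = list(struct.unpack(fmt, wire))

# The socket module wraps the htons/ntohs functions listed above.
round_trip = socket.ntohs(socket.htons(0xFEFF))

print(wire.hex(" "))            # 00 41 05 d0 d8 35 dd 38
print(received == code_units)   # True
print(hex(round_trip))          # 0xfeff
```

With a fixed wire order, sender and receiver need no mark in the data itself, which is the trade-off the comment argues for.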


Quote: "However, the size of the data type must correspond to the size of the code unit, or the results can be unpredictable, as when a byte oriented strcpy is used on UTF-16 data which may contain embedded NUL bytes."

Comment: Yes, but memcpy and memmove (in C), with an appropriate (octet length) value for the length argument, still work, even though they are octet oriented. The problem here is the misinterpretation of NULL (the character, both in char and in wchar_t) in some parts of the standard C library API, which is not related to data type size per se. It would be better to say something about that, since that is the problem here. Indeed, the size of the datatype can easily be LARGER than the size of the code unit, as long as that is taken into account for I/O and similar operations. Having the datatype smaller than the code unit necessitates sequence handling also for individual code units, but as long as the "reassembly" is done properly, that can work too.
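The difference between NUL-terminated and length-counted handling can be sketched with Python's ctypes buffers, whose value attribute stops at the first NUL byte (as strcpy would) while raw returns every byte (as memcpy with an explicit length would). This is an analogy under that assumption, not a reproduction of the C library calls themselves.

```python
import ctypes

data = "AB".encode("utf-16-le")                 # b"A\x00B\x00": NUL bytes inside
buf = ctypes.create_string_buffer(data, len(data))

nul_terminated = buf.value    # stops at the first NUL byte
length_counted = buf.raw      # all len(data) bytes, NULs included

print(nul_terminated)                           # b'A'
print(length_counted == data)                   # True
print(length_counted.decode("utf-16-le"))       # AB
```

The truncation comes from treating a zero byte as a terminator, not from the element size of the buffer.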

Date/Time: Thu Jan 22 13:06:57 EST 2004
Contact: Peter Constable

Re the proposed update to UTR#17:

I continue to maintain that the notion of a CES cannot be defined as a mapping of "CCS" code units (shouldn't that be CEF code units?) into byte sequences, particularly "a unique serialized byte sequence" since the UTF-16 and UTF-32 CESs do *not* determine unique byte sequences for any CEF code unit. Each defines two possible byte sequences that correspond to each CEF code unit -- neither unique, nor deterministic (a property generally assumed to be true in most uses of the term "mapping" within IT literature).

Moreover, given the role of the byte-order mark in the UTF-16 and UTF-32 CESs, if these are to be considered mappings, they must be mappings of *sequences or streams* of code units since that part of the mapping output is defined in terms of the entire output.

I maintain, therefore, that a CES must be defined as a specification for defining mappings of CEF code unit sequences into byte sequences.

BTW, the definition of CES in UTR17 isn't in sync with D38, D42 and D45, which appear to have incorporated this feedback (which I also provided when the text of TUS4.0 was being prepared).

26 Update properties for Ethiopic and Tamil non-decimal digits

Date/Time: Fri Nov 28 13:36:56 EST 2003
Contact: Peter Constable

PRI#26: It is my understanding that Ethiopic numerals do not use a decimal radix.

Most sources describing Ethiopic script will list the characters representing tens, 100 and 10000. The existence of these characters, which are not combinations made from sequences of digits for 0 - 9, already indicates that this is not a decimal-radix system.

Traditionally, each syllabic character has a numeric value associated. This is described on page 8 of http://www.intelligirldesign.com/paper_gabriella.pdf, which shows Arabic decimal values, and http://www.library.cornell.edu/africana/Writing_Systems/Numeric.html, which shows traditional Ethiopic numerals. By comparison of these two documents, one can get an idea of how the numbers work.

The following are some other useful discussions of Ethiopic numbering:

http://www.geez.org/Numerals/
http://www.abyssiniacybergateway.net/fidel/sera-faq_4.html
http://www.ethiopic.com/ethiopic/numerals.pdf

The last of these proposes the addition of a digit 0 in order to allow decimal-radix numbers in Ethiopic. I have no idea whether this has caught on at all or not, but it is not the traditional system.
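As a small illustration of the point about tens, 100 and 10000, the sketch below reads the numeric values of three Ethiopic number characters from Python's unicodedata module. The helper additive_value is hypothetical, and the assumption that a tens sign followed by a units sign is read as their sum (for values below 100) is ours, made only for this sketch.

```python
import unicodedata

TEN, HUNDRED, TEN_THOUSAND = "\u1372", "\u137b", "\u137c"
signs = (TEN, HUNDRED, TEN_THOUSAND)

# Each character carries a whole number value rather than a digit 0-9.
values = {unicodedata.name(c): unicodedata.numeric(c) for c in signs}

# None of them has a decimal-digit value.
decimal_values = [unicodedata.decimal(c, None) for c in signs]

def additive_value(numeral: str) -> int:
    """Hypothetical helper: sum a tens sign and a units sign (below 100 only)."""
    return int(sum(unicodedata.numeric(c) for c in numeral))

twelve = additive_value("\u1372\u136a")   # NUMBER TEN followed by DIGIT TWO

for name, value in values.items():
    print(name, value)
print(decimal_values)
print(twelve)
```

A positional decimal parser has no use for characters valued 10, 100 or 10000, which is why their existence points away from a decimal-radix system.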

Date/Time: Thu Jan 15 04:13:57 EST 2004
Contact: Jeroen Hellingman

With regard to Tamil numbers.

The issue of whether these are decimal digits or other numerals seems to me to be difficult to decide.

The most common, and traditional use of Tamil numbers is not as decimal digits, but together with the indication of the power of ten with the symbols provided for this in Unicode. This also explains the lacking zero in Tamil.

However, in a few places, Tamil numerals have been used (as can be seen on the bank-notes of Mauritius) as decimal digits, which also use a "Tamil" zero, which actually looks the same as the western zero.

In daily practice in Tamil, western numbers are used.

I would suggest considering the decimal use of Tamil numerals an exceptional case, and removing the "Decimal Digit" status from these characters. If you decide otherwise, I would suggest adding a "Tamil Zero" to Unicode.

Kind Regards,

Jeroen Hellingman.
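The traditional usage described in the comment above, digits combined with signs for powers of ten, can be sketched as follows. The function traditional_tamil_value is hypothetical, and the rule it applies (a digit multiplies the power sign that follows it, and a bare power sign counts once) is an assumption made for illustration.

```python
import unicodedata

# TAMIL NUMBER TEN, ONE HUNDRED, ONE THOUSAND
POWERS = {"\u0bf0": 10, "\u0bf1": 100, "\u0bf2": 1000}

def traditional_tamil_value(numeral: str) -> int:
    """Hypothetical helper: evaluate digit-times-power notation."""
    total, pending = 0, None
    for ch in numeral:
        if ch in POWERS:
            # A preceding digit multiplies the power sign; alone it counts once.
            total += (pending if pending is not None else 1) * POWERS[ch]
            pending = None
        else:
            if pending is not None:
                raise ValueError("two digits in a row: not the traditional form")
            pending = int(unicodedata.numeric(ch))
    return total + (pending or 0)

# 2 x 1000 + 3 x 100 + 4 x 10 + 5
sample = "\u0be8\u0bf2\u0be9\u0bf1\u0bea\u0bf0\u0beb"
print(traditional_tamil_value(sample))   # 2345
```

In this notation no zero is ever needed, which matches the observation that Tamil traditionally lacks one.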

Date/Time: Fri Jan 23 16:45:09 EST 2004
Contact: John Cowan

Issue 26:

In response to Jeroen Hellingman's proposal for a TAMIL ZERO, if in fact Tamil digits are occasionally used as decimal digits with a European zero, then this European zero should be encoded as DIGIT ZERO and nothing else.

However, I support removing Nd status from the Tamil digits, as this is not the traditional use of them.

27 Joiner/Nonjoiner in Combining Character Sequences

Date/Time: Tue Jan 20 08:39:35 EST 2004
Contact: Kent Karlsson

Re the public review issue 27, on ZWJ/ZWNJ:

I strongly prefer to make them Mn (as I've suggested before!). This fits the Unicode model much better than adding lots of exception text to "almost" have them as Mn but actually keeping them as Cf.

The response I got when I suggested this was that it would disturb the BiDi algorithm. Apparently, closer investigation (as per the current proposal) has shown that making this change causes no problem for the BiDi algorithm.

From: Maurice Bauhahn
Subject: RE: Khmer "syllables" (PRI #27)

Hello Michael (and Ken and Mark),

I wrote a small Perl programme (attached...take off the .txt I added to get it past security software) that creates all conceivable combinations...but it ran to hundreds of megabytes with just one base letter of the alphabet!

At the moment I'm deeply into Unicode-encoding the Chuon Nath dictionary (and two versions of the Khmer Bible), so I expect sometime next month I should be able to generate all combinations of cluster used in those three tomes. This also involves some complicated transcoding from legacy fonts where there was ambiguous use of vowel U/register shifters; ambiguous subscript DA and TA, and ambiguous colon/YUUKALEAPINTU, not to mention all those misplaced ZWSP/duplicated diacritics; and automated word break insertion. However, my biggest problem has been a computer language: Perl!!!!

Thank you for writing. I'm a bit puzzled on how to respond to Public Review Issue #27 (Joiner/Nonjoiner in Combining Character Sequences). There are three conceivable issues:

(1) "Note: for Khmer, canonical reordering is not an issue, since all the marks are ccc=0." What with all the complicated interactions between 'diacritics' in Khmer, I certainly do not want there to be any tolerance of these being outside of a canonical order. What does this phrase mean?

(2) I'm a bit puzzled by the suggested placement of ZWJ only before the Vowel (p. 281, Unicode 4.0). This presumably relates to preventing Consonant/Vowel ligatures (a mildly interesting variation...for example, the main entries of Chuon Nath dictionary do not have ligatures [except BA - AA]...whereas most of the text does). Microsoft and I are using ZWNJ quite regularly, however, before the Register Shifters [MUUSIKATOAN and TRIISAP] if they have exceptions to the rule of moving to a subscript form when there is a superscript vowel. This ZWNJ forces them to stay in a superscript position regardless.

(3) One issue that has not been resolved to my understanding is how to handle the 'final consonant' type of subscript that historically and very rarely appeared after a dependent vowel (violating the Unicode accepted rules of CONSONANT-ROBAT?-REGISTER SHIFTER?-(COENG CONSONANT/INDEPENDENT VOWEL)*-VOWEL?-SIGNS*). I was not aware of this until last year. Should we use a ZWJ after the vowel so that this subscript would be explicitly tagged as not 'out of place'?

Sincerely,

Maurice
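The cluster rule quoted in item (3) above can be written as a regular expression. The sketch below is a simplified reading of that rule; the code point ranges assigned to each class are assumptions made for illustration and have not been checked against the Khmer chapter of the standard.

```python
import re

# Code point ranges are simplified assumptions for illustration only.
CONSONANT = "[\u1780-\u17a2]"
INDEPENDENT_VOWEL = "[\u17a5-\u17b3]"
ROBAT = "\u17cc"
REGISTER_SHIFTER = "[\u17c9\u17ca]"          # MUUSIKATOAN, TRIISAP
COENG = "\u17d2"
VOWEL = "[\u17b6-\u17c5]"
SIGN = "[\u17c6-\u17c8\u17cb\u17cd-\u17d1]"

# CONSONANT ROBAT? SHIFTER? (COENG (CONSONANT | INDEPENDENT VOWEL))* VOWEL? SIGN*
CLUSTER = re.compile(
    f"{CONSONANT}{ROBAT}?{REGISTER_SHIFTER}?"
    f"(?:{COENG}(?:{CONSONANT}|{INDEPENDENT_VOWEL}))*"
    f"{VOWEL}?{SIGN}*"
)

regular = "\u1780\u17d2\u179a\u17b6"          # KA, COENG RO, vowel AA
final_subscript = "\u1780\u17b6\u17d2\u179a"  # KA, vowel AA, then COENG RO

print(CLUSTER.fullmatch(regular) is not None)
print(CLUSTER.fullmatch(final_subscript) is not None)
```

Under this reading the second sequence, a subscript following a dependent vowel, falls outside the rule, which is the case the question about ZWJ is meant to address.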

Date/Time: Thu Jan 22 10:55:10 EST 2004
Contact: Cathy Wissink

From: Peter Constable

We definitely want this issue resolved, as it affects what we have already worked on for Khmer and is also needed for some issues in Biblical Hebrew.

As indicated, the choice between solutions A and B has no impact on Bidi, Line break or normalization.

Re identifiers, changing ZW(N)J to Mn would allow Khmer words distinguished by these to be distinct identifiers, but I wonder if there wouldn't be undesirable implications. The sequence < c, ZWJ, t > would become a valid identifier and be distinct from "ct". Do we really want that? My guess is not. I doubt anyone would be upset if the Khmer distinctions in question could not be used in identifiers.

Re script analysis, default grapheme cluster etc., regular expressions, I don't think either solution would have bad implications.

Re collation, as indicated it isn't particularly affected either way.

Re definitions: I don't think either solution has any negative impact. B would mean that a combining character sequence could contain controls within it, but it would only be these two, and I don't think anyone would ever have supposed that the function of ZW(N)J might interfere in a base-mark relationship.

Both options involve some work; B feels like less work in that it doesn't affect another standard, namely ISO 10646. It also seems less radical to keep calling the joiners control characters and adjusting where they can occur (which is really what is needed) than to fudge the concepts and refer to things that have control functions as combining marks.

So, my inclination is toward B.
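The identifier question raised above can be probed in any language whose identifier rules are based on Unicode properties. The sketch below uses Python as one such example; what it reports depends on the Unicode data shipped with the interpreter, and it is offered only as an illustration of the distinction between < c, ZWJ, t > and "ct".

```python
import unicodedata

ZWJ = "\u200d"
plain = "ct"
joined = "c" + ZWJ + "t"

category = unicodedata.category(ZWJ)          # general category in this Python build
distinct = plain != joined                    # the code point sequences differ
survives_nfkc = unicodedata.normalize("NFKC", joined) == joined
valid_identifier = joined.isidentifier()      # Python's own identifier rule

print(category)
print(distinct, survives_nfkc)
print(plain.isidentifier(), valid_identifier)
```

With ZWJ treated as a format control rather than a combining mark, the joined sequence is rejected as an identifier and so cannot become a distinct identifier alongside "ct".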


From: Paul Nelson

1. The comment inaccurately singles out Khmer. The ZWJ/ZWNJ are also documented as having an impact in Indic scripts that also affects combining marks with regard to preventing or forcing conjunct forms.

2. As Peter has pointed out, there has been a significant amount of discussion with regard to ZWJ/ZWNJ with combining marks of Biblical Hebrew text.

3. I would prefer that approach "B" be used.

28 BIDI Boundary_Neutral Property Value

Date/Time: Mon Jan 26 20:15:04 EST 2004
Contact: Asmus Freytag

For the purposes of BIDI, characters that are needed in *rendering* should not be considered ignorable, and should therefore not become BN.

Characters that are ignorable, but have a definite location in the code stream, both before and after reordering - for which, in other words, it's possible to assign one of the regular bidi classes - should IMHO not be made BNs. They are less of a burden to bidi processing if they just flow with the stream. And the rendering process needs to be able to handle them (or ignore them, as the case may be) anyway.

In summary, I am against making this change, but in favor of extending the definition of BN to cover SHY, non-characters and unassigned format characters.

Here are my comments on each of the characters or ranges proposed to become BN

U+00AD # SOFT HYPHEN

The existing bidi class for SHY, ON (other neutral), is clearly wrong. It happens to work out, since most SHYs are between pairs of strong letters of the same directionality, but this is not a required feature of SHY. Changing this to BN will solve that issue and at the same time not have any adverse effects on using SHY for linebreak since all SHYs that are active will have been interpreted *before* bidi. (Bidi is run on a per line basis). Also, it would be awkward to decide what other directional class to give a SHY, since it could, in principle, be used nearly anywhere.

U+034F # COMBINING GRAPHEME JOINER

If this can affect rendering of adjacent characters, it must not be BN. Treating this as a combining mark should give the correct result. (Applying a CGJ between an L and an N, or L and R makes little sense, so the default action of inheriting the Bidi class of the base character seems to be reasonable - no need to change).

U+115F..U+1160 # HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
U+3164 # HANGUL FILLER
U+17B4..U+17B5 # KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA

Of these, U+115F and U+1160 might have an effect on rendering. They clearly belong at specific positions in the character (code) stream, even after reordering. They are stand-ins for characters that are L, so there are no issues with resolving them. Having Bidi take them out when the rendering system otherwise has to contend with them anyway makes little sense.

This is true for 17B4 and 17B5 which are explicitly L. They only apply to characters that are L, so making them BN does not improve things.

U+180B..U+180D # MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
U+FE00..U+FE0F # VARIATION SELECTOR-1..VARIATION SELECTOR-16

There seems no reason to insist in the bidi algorithm that variation selectors be applied *before* reordering. Therefore, they should not become BN. Variation selectors should behave more like combining marks, inheriting the directionality of their base letter.

U+E0100..U+E01EF # VARIATION SELECTOR-17..VARIATION SELECTOR-256

These were not listed as DCIP. What bidi class would these get? Ah, I see, they've been 'hidden' in this range: U+E0080..U+E0FFF # .., which contrary to the comment, does not purely contain unassigned code points.

Eighteen ranges of non-characters.

These should become BN.

Unassigned characters

Making unassigned future Cf's be BN by default seems like a good thing. However, as they become assigned, they need to be evaluated and, if necessary, given explicit bidi classes.
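To see which classes a given implementation currently assigns to the characters discussed above, one can query its Unicode data directly. The sketch below does this with Python's unicodedata module; the helper bidi_report and the selection of code points are ours, and the output reflects whatever Unicode version the interpreter ships, not the 4.0.1 beta under review.

```python
import unicodedata

# A selection of the code points discussed above.
CANDIDATES = [0x00AD, 0x034F, 0x115F, 0x1160, 0x3164,
              0x17B4, 0x17B5, 0x180B, 0xFE00, 0xE0100]

def bidi_report(code_points):
    """Hypothetical helper: (code point, name, general category, bidi class)."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((f"U+{cp:04X}",
                     unicodedata.name(ch, "<unnamed>"),
                     unicodedata.category(ch),
                     unicodedata.bidirectional(ch)))
    return rows

print("Unicode data version:", unicodedata.unidata_version)
for row in bidi_report(CANDIDATES):
    print(*row, sep="\t")
```

Comparing such a table across data versions is one way to track which of the proposed changes were eventually adopted.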