L2/04-308

Technical Report Recommendations

Mark Davis, 2004-07-23

Latest Version: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/utc/technical_report_recommendations.html

The following are recommended changes for certain technical reports in the U4.1 timeframe. The proposal is to authorize the posting of Proposed Updates incorporating the following changes between now and November, to allow more time for public feedback.

UAX #15 Unicode Normalization Forms

Needs updating for U4.1 to incorporate the corrigendum, moving identifier section to TR31 (a stub will be left to point to it), plus editing to have definitions more consistent with #23 and #30. The following have been reported as problems; they need to be reviewed and fixed if so.

"(In NFKC and NFKD, a K is used to stand for compatibility to avoid confusion with the C standing for canonical.)"
Here, 'canonical' should be changed to 'composition', because that's what the C in NFC and NFKC stands for.
Change the contents listing to the same style as in the template.
Replace the two PPT slides by tables. On some of my screens the anti-aliasing makes them completely unreadable.
Change the notation section into a table, so that the notation being described is in the left hand column. That makes it much easier to locate something.
Consider italicizing variables to set them off from text.
The sentence "This can be written as P in X=Q in Y." is really hard to read. consider putting the formula on a line by itself.
Annex 13 should not be an annex. The definition of respcting Can, Equiv. should be a formal definition, not hidden in the text.
There's a problem with the opening paragraph in Annex13. As defined, canonical equivalence *only* applies to strings. Therefore your distinction between preserving and respecting makes no sense (as worded).
This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonical equivalent inputs always produce canonically equivalent outputs. For functions that map strings to strings, this is often called preserving canonical equivalence. There are a number of important aspects to this concept:
"The canonically equivalent inputs or outputs are not just limited to strings, but are also relevant to the offsets within strings, since those play a fundamental role in Unicode string processing." is also a pocket definition, this time of 'canonically equivalent offset'. Should be a formal definition.

UAX #29 Text Boundaries (UAX #14 Line Breaking Properties)

1. We should modify word selection so that it has the same 'escape hatch' as line break, for Thai/Lao. It would thus be parallel to Line Break's LB 1, and add the character classes that are described there.

LB 1 Assign a line breaking class to each character of the input. Resolve AI, CB, SA, SG, XX into other line breaking classes depending on criteria outside the scope of this algorithm.

2. In a related matter, we need to incorporate the LineBreak corrigendum into LineBreak, and modify the TR to remove LB 7a.

LB 7a In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.

And document that NBSP is the preferred base character for showing combining marks in isolation.

3. There are other "special" rules in LineBreak.

LB 6 Don’t break a Korean Syllable Block, and treat it as a single unit of the same LB class as a Hangul Syllable in all the following rules

Treat a Korean Syllable block as if it were ID

LB 7b Don't break a combining character sequence and treat it as if it has the LB class of the base character in all of the following rules.

Treat X CM* as if it were X

These rules in general are difficult for regular expression implementations and for pair tables. They complicate regular expressions because they affect every instance where any characters could match; they complicate pair tables since they require prehandling in code, outside of the pair table. If they are present, they should be the top rules, since they should be 'handled' by changes all down the line. Both of the rules in LineBreak end up (because of other rules) having the same effect as the rules used in TR29: to treat a grapheme cluster as the base. This is because of the effect of other rules in UAX#14, that keep the differences between LineBreak=CM and Grapheme_Extend = true from surfacing. However, I have gotten the feedback from our implementers that it would be ideal they were unified, if UAX#14 had, instead of these two rules, the one rule used in #29:

Treat a grapheme cluster as if it were a single character: the first character of the cluster.

"I also think that it would help make things simpler if this grapheme cluster rule could be put as close to the top of the list of rules as possible, so that we don't have some rules looking within grapheme clusters, and others only at the boundaries. It would have to be a bug if a line break were to break a grapheme cluster, so it should be possible to say, from the very beginning, that line break rules work on grapheme cluster boundaries, and be done with any further consideration of combining marks (except for unattached ones)"

4. Failing adoption of #3 by the committee, LB6 should be replaced by ordinary rules. It only has an effect in 5 rules:

ID × IN

ID × PO

PR × ID

ALL ÷

÷ ALL

We can safely replace it by adding the following rules. They can go anyplace before the ALL rules, and can be put in logical locations. The first three rules correspond to the first three above; the last three disallow breaking in the middle of a Hangul Syllable (as described in Chapter 3).

L | V | T | LV | LVT × IN

L | V | T | LV | LVT × PO

PR × L | V | T | LV | LVT

L × L | V | LV | LVT

V | LV × V | T

T | LVT × T

As Asmus noted off-line, a common tailoring is to change Hangul Syllables to AL, but because sequences of AL don't divide either, it is safe to add the above rules: the tailoring just changes all of L | V | T | LV | LVT to AL.

5. Line Break Rule LB 18B can be dropped altogether. It has no effect on the results; anything that it would break will also be broken by LB20.

HY ÷

÷BB

6. Deborah identified some cases that are missed if the regular expression for numbers is used, rather than the list of pairs of rule 18:

Original LB18: PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?

Updated: PR ? ( OP | HY ) ? NU (NU | IS | SY ) * CL ? PO ?

PR × AL

PR × ID

There were no changes if the rules are being used, only if the big regular expression is being used as an alternative to the rules.

UTS #18 Unicode Regular Expressions

Needs updating for U4.1 for to account for changes in foldings, properties, names, scripts, and also the implications of Pattern_Whitespace and Pattern_Syntax. Other items:

Also add descriptions of the following so that they can be referenced in other UTRs, especially #30. These would be in a special section, since there are no real Unicode implications; they are really for our reference.
1. grouping syntax, e.g. (ab) cd
2. references, following Perl: $0, $1, etc in replacement strings, and back references \1, \2 in the pattern, used to refer to what a group matches.
3. variables, ie defining $xyz = [[:greek:]&[:lowercase:]], then using it in multiple regular expressions
Since any character could occur as a literal in a regular expression, when regular expression syntax is embedded within other syntax it can be difficult to determine where the end of the regex expression is. Add a note describing the common practice, which is to have a delimiter like /ab(c)*/, where the delimiter can be chosen to be some character not in the particular regular expression.
Consider adding a conformance clause for http://www.unicode.org/reports/tr18/#Compatibility_Properties, so that if people want to claim conformance to them they can. Also consider having a second column that follows the POSIX partitioning constraints.
principle use => principal use
\p{gc=Decimal_Number} ...
"Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0+, this is the same as gc = Decimal_Number (Nd)." =>
"Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0 and U4.1+, this is the same as Numeric_Type = Decimal (nt=De)."
[Since we slipped up in 4.0.1, and Nd was wrong.]
Note: ZWSP, while a Z character, is for line break control and should not be included.
[Remove: It recently (U4.0.1) became a Cf.]

UTS #23 Character Properties

I have scanned through a number of the TRs, and found other useful definitions that we should centralize so that they can be used consistently. Asmus may have already incorporated some these into his draft for the meeting; if so, skip over those that are. This is not a request that these definitions be added verbatim; they may need wordsmithing and changes for consistency.

By convention, toX(a) is notation used for a function that produces a result of the form X. Thus toLowercase(a) produces a lowercase form from a. There is a related function isX(a), coordinated with toX, whereby isX(a) is true if and only if toX(a) = a.
- It is important to distinguish being in a format from converting to that format, because mixing them up is fairly common and leads to misunderstandings.
- Used in UTS #15, Chapter 3.
Define the closure of a string S under a folding F to be the set of all strings that fold to it. Eg,
- closure('aa', toLowercase) => {'aa', 'Aa', 'aA', 'AA'}
- closure('ώ', toNFC) => {'ώ', 'ώ', 'ώ'}
  - The difference in the latter two is between U+0341 COMBINING ACUTE TONE MARK and U+0301 COMBINING ACUTE ACCENT
Define the closure of a set of strings SS under a set of foldings SF to be the union of the closures of the strings in SS. Eg for the toLowercase function,
- closure({'aa', 'AA', 'ß'}, {toLowercase, toUppercase, toTitlecase}) => {'aa', 'Aa', 'aA', 'AA', 'ß', 'ss', 'sS', 'SS', 'Ss'}
- used in case folding
Define the fold of a set of strings SS to be the union of the set of foldings of the elements of SS. E.g.
1. toLowercase({'aa', 'Aa', 'aA', 'AA', 'b', 'B'}) => {'aa', 'b'}
Define what it is to preserve a relation. A transform F preserves a relation R, when R(a,b) implies R(F(a), F(b))
- Thus a transform preserves a boolean function isX, when isX(a) => isX(F(a))
- This is used in UTS #15 and other places,
  - e.g. toNFC preserves canonical equivalence, because if isCE(a,b) then isCE( NFC(a), toNFC(b) )
  - a function preserves a normalization form NFC: isNFC(a) implies that isNFC(F(a))
- When F preserves R, this can also be expresses as saying R is closed under F. For example, you can say that programming identifiers are closed under NFC, meaning isIdentifier(S) implies isIdentifier(toNFC(S))
Explain how you can change a transform F so that it preserves a relation. Use the example of normalization.
1. In the general case, the new transform is defined as F'(s) = toNFC(F(toNFC(s))
2. If F preserves canonical equivalence, then it can avoid a step: F'(s) = toNFC(F(x))
Also, explain that from a partition one can generate a folding, by generating a function that picks one element of each partition to be the element that all and only the others in that partition fold to. That is what is done in generating CaseFolding, for example.
Add a section after Canonical Equivalence (in Asmus's current draft) called Normalized Input
If the input is guaranteed to be in NFD, then Step 1 is simpler. Additional rules do not have to be generated; instead, the matching part of the rule just needs to be transformed into NFD. Thus instead of generating new rules, one simply replaces a rule like:

<A-acute> <dot_under> -> Z

with:

A <dot_under> <acute> -> Z

Step 2 will be the same; however, it will need to be applied to fewer cases, since fewer rules will result from Step 1.
Add a section after Canonical Equivalence (in Asmus's current draft) called Normalized Output
The above two steps ensure that the folding preserves canonical equivalence. However, they do not guarantee that the folding preserves normalization. If normalization is required, then it must be applied as an additional step. This is typically an issue whenever the result of a rule contains combining marks. If normalization is to be applied after the each rule is applied, there are implementation techniques described in [Normalization] for ways to optimize this process. However, if there are any sizable number of changes, it is more efficient -- and certainly simpler -- to simply normalize the entire text once all of the rules have been applied.

Related Items

These are copied from email on property topics:

U4.0.1 changes U+002F Solidus from ES to CS but leaves U+FF0F Fullwidth Solidus at ES. U+FF0F should probably have the same bidi class as its regular sibling.
There's another misclassification, this one for a relatively new character that we didn't think through correctly when adding: NNBSP needs to be changed from WS to CS (in analogy to NBSP). The reason is that in all scripts but Mongolian it acts like NBSP except for its width. In Mongolian it may be recognized in shaping (details to be forthcoming).
We should also address explicitly the situation of <WJ, SP, WJ> we say this is equivalent to NBSP. Clearly, that's not true for its bidi properties. We need to check the text to make sure that discrepancy is noted, for example by stating on page 387 "for purposes of line breaking" or some such qualifier, and noting: "However, NBSP has a specific bidirectional behavior, which is different from this sequence (see UAX#9)."
I have gotten the request for a "one-stop shopping" for properties. It would be the contents of http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt in a formatted table, but with a short description of each property and of each property value. One can piece this information together, but it is rather painful.