L2/04-308

Technical Report Recommendations

Mark Davis, 2004-07-23

Latest Version: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/utc/technical_report_recommendations.html

The following are recommended changes for certain technical reports in the U4.1 timeframe. The proposal is to authorize the posting of Proposed Updates incorporating the following changes between now and November, to allow more time for public feedback.

UAX #15 Unicode Normalization Forms

Needs updating for U4.1 to incorporate the corrigendum, move the identifier section to TR31 (a stub will be left to point to it), and edit the definitions to be more consistent with #23 and #30. The following have been reported as problems; they need to be reviewed and, where confirmed, fixed.

  1. "(In NFKC and NFKD, a K is used to stand for compatibility to avoid confusion with the C standing for canonical.)"
    Here, 'canonical' should be changed to 'composition', because that's what the C in NFC and NFKC stands for.
  2. Change the contents listing to the same style as in the template.
  3. Replace the two PPT slides by tables. On some of my screens the anti-aliasing makes them completely unreadable.
  4. Change the notation section into a table, so that the notation being described is in the left hand column. That makes it much easier to locate something.
  5. Consider italicizing variables to set them off from text.
  6. The sentence "This can be written as P in X=Q in Y." is really hard to read. Consider putting the formula on a line by itself.
  7. Annex 13 should not be an annex. The definition of respecting canonical equivalence should be a formal definition, not hidden in the text.
  8. There's a problem with the opening paragraph in Annex 13. As defined, canonical equivalence *only* applies to strings. Therefore your distinction between preserving and respecting makes no sense (as worded).

    This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonical equivalent inputs always produce canonically equivalent outputs. For functions that map strings to strings, this is often called preserving canonical equivalence. There are a number of important aspects to this concept:

  9. "The canonically equivalent inputs or outputs are not just limited to strings, but are also relevant to the offsets within strings, since those play a fundamental role in Unicode string processing." is also a pocket definition, this time of 'canonically equivalent offset'. Should be a formal definition.
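As an illustration of the wording quoted in item 8, the following is a minimal Python sketch of what "respecting canonical equivalence" means for a string-to-string function. It uses the standard `unicodedata` module; the helper names `canonically_equivalent`, `respects_on`, and `drop_first` are hypothetical and appear only for this example. It checks a single pair of inputs, so it can refute the property but not prove it.

```python
import unicodedata

def canonically_equivalent(a, b):
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def respects_on(f, a, b):
    # For one pair of canonically equivalent inputs, check whether the
    # outputs of f are canonically equivalent as well.
    if not canonically_equivalent(a, b):
        raise ValueError("inputs must be canonically equivalent")
    return canonically_equivalent(f(a), f(b))

composed = "\u00e9"      # e with acute, precomposed
decomposed = "e\u0301"   # e followed by combining acute

def drop_first(s):
    # Hypothetical code-point-based operation: remove the first code point.
    return s[1:]

upper_ok = respects_on(str.upper, composed, decomposed)
drop_ok = respects_on(drop_first, composed, decomposed)
```

Uppercasing keeps the two results equivalent, while dropping the first code point leaves an empty string in one case and a lone combining mark in the other.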

UAX #29 Text Boundaries (UAX #14 Line Breaking Properties)

1. We should modify word selection so that it has the same 'escape hatch' as line break, for Thai/Lao. It would thus be parallel to Line Break's LB 1, and add the character classes that are described there.

LB 1  Assign a line breaking class to each character of the input. Resolve AI, CB, SA, SG, XX into other line breaking classes depending on criteria outside the scope of this algorithm.

2. In a related matter, we need to incorporate the LineBreak corrigendum into LineBreak, and modify the TR to remove LB 7a.

LB 7a  In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.

And document that NBSP is the preferred base character for showing combining marks in isolation.

3. There are other "special" rules in LineBreak.

LB 6  Don’t break a Korean Syllable Block, and treat it as a single unit of the same LB class as a Hangul Syllable in all the following rules

Treat a Korean Syllable block as if it were ID

LB 7b  Don't break a combining character sequence and treat it as if it has the LB class of the base character in all of the following rules.

Treat X CM* as if it were X

These rules in general are difficult for regular expression implementations and for pair tables. They complicate regular expressions because they affect every instance where any characters could match; they complicate pair tables since they require prehandling in code, outside of the pair table. If they are present, they should be the top rules, since they should be 'handled' by changes all down the line. Both of the rules in LineBreak end up (because of other rules) having the same effect as the rule used in TR29: to treat a grapheme cluster as the base. This is because other rules in UAX #14 keep the differences between LineBreak=CM and Grapheme_Extend=true from surfacing. However, I have gotten feedback from our implementers that it would be ideal if the two were unified, that is, if UAX #14 had, instead of these two rules, the one rule used in #29:

Treat a grapheme cluster as if it were a single character: the first character of the cluster.
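The kind of prehandling this implies can be sketched in Python. The sketch works on a list of line break class names rather than on characters and implements only the simpler "Treat X CM* as if it were X" form. The function name `collapse_combining` is hypothetical, and the sketch assumes that every non-CM class can act as a base, ignoring any exceptions the final rules may make.

```python
def collapse_combining(classes):
    # Drop each CM that follows a base, so later rules see only the base class.
    collapsed = []
    for cls in classes:
        if cls == "CM" and collapsed:
            continue  # attached mark: absorbed into the preceding unit
        collapsed.append(cls)  # base class, or an unattached leading CM
    return collapsed

sample = collapse_combining(["AL", "CM", "CM", "SP", "ID", "CM"])
unattached = collapse_combining(["CM", "AL"])
```

Running this step first means that no later rule has to mention CM, except for marks that have no base.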

"I also think that it would help make things simpler if this grapheme cluster rule could be put as close to the top of the list of rules as possible, so that we don't have some rules looking within grapheme clusters, and others only at the boundaries. It would have to be a bug if a line break were to break a grapheme cluster, so it should be possible to say, from the very beginning, that line break rules work on grapheme cluster boundaries, and be done with any further consideration of combining marks (except for unattached ones)"

4. Failing adoption of #3 by the committee, LB6 should be replaced by ordinary rules. It only has an effect in 5 rules:

ID × IN

ID × PO

PR × ID

ALL ÷

÷ ALL

We can safely replace it by adding the following rules. They can go anyplace before the ALL rules, and can be put in logical locations. The first three rules correspond to the first three above; the last three disallow breaking in the middle of a Hangul Syllable (as described in Chapter 3).

L | V | T | LV | LVT × IN

L | V | T | LV | LVT  × PO

PR × L | V | T | LV | LVT

L  × L | V | LV | LVT

V | LV × V | T

T | LVT × T

As Asmus noted off-line, a common tailoring is to change Hangul Syllables to AL, but because sequences of AL don't divide either, it is safe to add the above rules: the tailoring just changes all of L | V | T | LV | LVT to AL.
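The six replacement rules can be read as a small pair lookup. The following Python sketch assumes the classes are already assigned and passed in as strings; the names `HANGUL`, `NO_BREAK_AFTER`, and `hangul_rule_prohibits_break` are hypothetical. It covers only these six rules, not the rest of the line break algorithm.

```python
HANGUL = {"L", "V", "T", "LV", "LVT"}

# Last three rules: pairs inside a Hangul syllable that must not be separated.
NO_BREAK_AFTER = {
    "L": {"L", "V", "LV", "LVT"},
    "V": {"V", "T"},
    "LV": {"V", "T"},
    "T": {"T"},
    "LVT": {"T"},
}

def hangul_rule_prohibits_break(before, after):
    # First two rules: Hangul x IN and Hangul x PO.
    if before in HANGUL and after in {"IN", "PO"}:
        return True
    # Third rule: PR x Hangul.
    if before == "PR" and after in HANGUL:
        return True
    # Last three rules: no break in the middle of a syllable.
    return after in NO_BREAK_AFTER.get(before, set())
```

A pair such as T followed by L is not covered by any of the six rules, so the sketch returns False for it and the decision falls to the remaining rules.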

5. Line Break Rule LB 18B can be dropped altogether. It has no effect on the results; anything that it would break will also be broken by LB20.

HY ÷

÷ BB

6. Deborah identified some cases that are missed if the regular expression for numbers is used, rather than the list of pairs of rule 18:

Original LB18: PR ? ( OP | HY ) ? NU (NU | IS) * CL ?  PO ?

Updated: PR ? ( OP | HY ) ? NU (NU | IS | SY ) * CL ?  PO ?

PR × AL

PR × ID

Nothing changes if the rules are being used; the update matters only if the big regular expression is being used as an alternative to the rules.
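To make the difference between the two expressions concrete, here is a Python sketch that matches them against a sequence of line break class names joined by spaces. The helper names `number_regex` and `is_number_run` are hypothetical, and the sketch covers only the number expression, not the added pairs PR × AL and PR × ID.

```python
import re

def number_regex(allow_sy):
    # PR ? ( OP | HY ) ? NU (NU | IS [| SY]) * CL ? PO ?
    inner = "NU|IS|SY" if allow_sy else "NU|IS"
    return re.compile(
        r"(?:PR )?(?:(?:OP|HY) )?NU(?: (?:" + inner + r"))*(?: CL)?(?: PO)?"
    )

ORIGINAL = number_regex(allow_sy=False)
UPDATED = number_regex(allow_sy=True)

def is_number_run(regex, classes):
    # True when the whole class sequence is one number according to regex.
    return regex.fullmatch(" ".join(classes)) is not None

with_slash = ["NU", "SY", "NU"]            # e.g. digits separated by a solidus
plain = ["PR", "NU", "IS", "NU", "PO"]
```

The sequence `with_slash` is accepted as a single unit only by the updated expression, while `plain` is accepted by both.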

UTS #18 Unicode Regular Expressions

Needs updating for U4.1 to account for changes in foldings, properties, names, scripts, and also the implications of Pattern_Whitespace and Pattern_Syntax. Other items:

  1. Also add descriptions of the following so that they can be referenced in other UTRs, especially #30. These would be in a special section, since there are no real Unicode implications; they are really for our reference.
    1. grouping syntax, e.g. (ab) cd
    2. references, following Perl: $0, $1, etc in replacement strings, and back references \1, \2 in the pattern, used to refer to what a group matches.
    3. variables, i.e. defining $xyz = [[:greek:]&[:lowercase:]], then using it in multiple regular expressions
  2. Since any character could occur as a literal in a regular expression, when regular expression syntax is embedded within other syntax it can be difficult to determine where the end of the regular expression is. Add a note describing the common practice, which is to have a delimiter like /ab(c)*/, where the delimiter can be chosen to be some character not in the particular regular expression.
  3. Consider adding a conformance clause for http://www.unicode.org/reports/tr18/#Compatibility_Properties, so that if people want to claim conformance to them they can. Also consider having a second column that follows the POSIX partitioning constraints.
  4. principle use => principal use
  5. \p{gc=Decimal_Number} ...
    "Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0+, this is the same as gc = Decimal_Number (Nd)." =>
    "Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0 and U4.1+, this is the same as Numeric_Type = Decimal (nt=De)."
    [Since we slipped up in 4.0.1, and Nd was wrong.]
  6. Note: ZWSP, while a Z character, is for line break control and should not be included.
    [Remove: It recently (U4.0.1) became a Cf.]
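Items 1 and 2 of this list (grouping, references, variables, delimiters) can be illustrated with Python's `re` module. This is a sketch under assumptions: Python writes `\1` in replacement strings where Perl writes `$1`, it has no built-in `$name` variables or `[[:greek:]&[:lowercase:]]` set syntax, so the variable is a plain character class, and `expand_variables` and `strip_delimiters` are hypothetical helpers written for this example.

```python
import re

# Grouping and a back reference in the pattern: \1 repeats what group 1 matched.
back_ref = re.fullmatch(r"(ab)\1cd", "ababcd") is not None

# Group references in a replacement string.
swapped = re.sub(r"(ab)(cd)", r"\2\1", "abcd")

def expand_variables(pattern, variables):
    # Replace each $name with its definition before compiling the pattern.
    return re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], pattern)

variables = {"vowel": "[aeiou]"}
expanded = expand_variables("$vowel+x", variables)

def strip_delimiters(text):
    # The first character is the delimiter; it must also close the expression.
    if len(text) < 2 or text[0] != text[-1]:
        raise ValueError("missing closing delimiter")
    return text[1:-1]

slash_form = strip_delimiters("/ab(c)*/")
hash_form = strip_delimiters("#a/b#")
```

Choosing `#` as the delimiter in the last line lets the expression itself contain a literal `/`.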

UTS #23 Character Properties

I have scanned through a number of the TRs, and found other useful definitions that we should centralize so that they can be used consistently. Asmus may have already incorporated some of these into his draft for the meeting; if so, skip over those that are. This is not a request that these definitions be added verbatim; they may need wordsmithing and changes for consistency.

  1. By convention, toX(a) is notation used for a function that produces a result of the form X. Thus toLowercase(a) produces a lowercase form from a. There is a related function isX(a), coordinated with toX, whereby isX(a) is true if and only if toX(a) = a.
  2. Define the closure of a string S under a folding F to be the set of all strings that fold to it. Eg,
  3. Define the closure of a set of strings SS under a set of foldings SF to be the union of the closures of the strings in SS. Eg for the toLowercase function,
  4. Define the fold of a set of strings SS to be the union of the set of foldings of the elements of SS. E.g.
    1. toLowercase({'aa', 'Aa', 'aA', 'AA', 'b',  'B'}) => {'aa', 'b'}
  5. Define what it is to preserve a relation. A transform F preserves a relation R, when R(a,b) implies R(F(a), F(b))
  6. Explain how you can change a transform F so that it preserves a relation. Use the example of normalization.
    1. In the general case, the new transform is defined as F'(s) = toNFC(F(toNFC(s)))
    2. If F preserves canonical equivalence, then it can avoid a step: F'(s) = toNFC(F(s))
  7. Also, explain that from a partition one can generate a folding, by generating a function that picks one element of each partition to be the element that all and only the others in that partition fold to. That is what is done in generating CaseFolding, for example.
  8. Add a section after Canonical Equivalence (in Asmus's current draft) called Normalized Input

    If the input is guaranteed to be in NFD, then Step 1 is simpler. Additional rules do not have to be generated; instead, the matching part of the rule just needs to be transformed into NFD. Thus instead of generating new rules, one simply replaces a rule like:

    <A-acute> <dot_under> -> Z

    with:

    A <dot_under> <acute> -> Z

    Step 2 will be the same; however, it will need to be applied to fewer cases, since fewer rules will result from Step 1.

  9. Add a section after Canonical Equivalence (in Asmus's current draft) called Normalized Output

    The above two steps ensure that the folding preserves canonical equivalence. However, they do not guarantee that the folding preserves normalization. If normalization is required, then it must be applied as an additional step. This is typically an issue whenever the result of a rule contains combining marks. If normalization is to be applied after each rule is applied, there are implementation techniques described in [Normalization] for ways to optimize this process. However, if there are any sizable number of changes, it is more efficient -- and certainly simpler -- to simply normalize the entire text once all of the rules have been applied.
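Several of the definitions in this list can be sketched in Python with the standard `unicodedata` module. The names `is_x`, `fold_set`, `make_preserving`, and `rule_to_nfd` are hypothetical and stand in for isX, the fold of a set, the transform F', and the NFD rewriting of a rule's matching part. This is an illustration of the definitions under those assumptions, not proposed wording.

```python
import unicodedata

def is_x(to_x, a):
    # isX(a) is true if and only if toX(a) = a.
    return to_x(a) == a

def fold_set(to_x, strings):
    # The fold of a set of strings: the set of foldings of its elements.
    return {to_x(s) for s in strings}

def make_preserving(f):
    # General case: F'(s) = toNFC(F(toNFC(s))).
    def f_prime(s):
        nfc = unicodedata.normalize("NFC", s)
        return unicodedata.normalize("NFC", f(nfc))
    return f_prime

def rule_to_nfd(match_part, result):
    # Normalized Input: transform only the matching part of a rule into NFD.
    return (unicodedata.normalize("NFD", match_part), result)

folded = fold_set(str.lower, {"aa", "Aa", "aA", "AA", "b", "B"})

def drop_first(s):
    # A transform that does not preserve canonical equivalence by itself.
    return s[1:]

safe_drop = make_preserving(drop_first)

# <A-acute> <dot_under> -> Z becomes A <dot_under> <acute> -> Z
nfd_rule = rule_to_nfd("\u00c1\u0323", "Z")
```

With the wrapper, the precomposed and decomposed spellings of e-acute give the same result, because both are brought to the same NFC string before `drop_first` runs.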

Related Items

These are copied from email on property topics: