RE: New version of TR29:

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Aug 14 2002 - 07:18:29 EDT


Mark Davis wrote:
> There is a new version of Unicode Technical Report #29: Text
> Boundaries on <http://www.unicode.org/reports/tr29/>,
> [...]
> Feedback that is received before the UTC meeting (starting
> August 20) can be
> made available for the discussion of TR29 at that meeting.

I think that the following comment by António Martins-Tuválkin, from the
thread titled "Is U+0140 (l with middle dot) ever used?", is relevant for
TR29:

| As for the nature of the middle dot, short of a specific code point
| attributed to LATIN LETTER CATALAN MIDDLE DOT, there should be
| something ensuring that this character can be treated as a letter
| for all things referring to word delimitation (smart select, line
| break, word count, etc.).
|
| I imagine that with 9 million native speakers Catalan may appear
| as a weak lobby to push for such a change in the standard, but note
| that while other uses of (non-letter) middle dot are marginal and
| scarcely content-bearing, Catalan middle dot is central and
| essential to quality textual content representation and encoding
| -- which AFAIK Unicode is all about.

I suggest that U+00B7 (·) be added to the "MidLetter" character class in
Table 2 ("Default Word Boundaries"). It would be inconvenient if U+0140
U+006C (ŀl) and U+006C U+00B7 U+006C (l·l) behaved differently.

A proper Catalan behavior is desirable for Catalan itself, of course, but
also for any other language occasionally using Catalan loanwords or proper
names.

A web search on Google for "Paral·lel" (the most famous road in Barcelona)
shows that this name is more common in English web pages (about 16,000) than
in Catalan ones (about 7,500).

Moreover, as Martins-Tuválkin says, non-Catalan uses of U+00B7 are too
unusual and uninteresting to be taken as the default.

BTW, notice that the most important of these non-Catalan uses still work as
expected if U+00B7 is a MidLetter:

 1) It works OK as a bullet: it correctly splits because it would never be
preceded by a letter;

 2) It works OK as a Greek semicolon: it correctly splits because it
would always be followed by a space;

 3) It works OK as a CJK separator for Western personal names: it correctly
splits because MidLetter is not involved in rules with katakana or
ideographs;

 4) It works OK as a hyphenating separator in dictionaries: it correctly
joins as it does in Catalan.
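
To make the four cases above concrete, here is a rough Python sketch of the
letter-related word boundary rules only (ALetter × ALetter, ALetter ×
MidLetter ALetter, ALetter MidLetter × ALetter, Katakana × Katakana). It is
not the TR29 algorithm: the character classes are crude approximations, the
MidLetter set contains only U+00B7 (the real class has other members), and
the names char_class, is_boundary and split_words are made up for this
illustration.

```python
import unicodedata

# Assumption: U+00B7 is treated as MidLetter, as proposed in this message.
# The real MidLetter class has other members, omitted here.
MID_LETTER = {"\u00B7"}


def char_class(ch):
    """Very rough stand-in for the word-break classes (illustrative only)."""
    if ch in MID_LETTER:
        return "MidLetter"
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "Katakana"
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "HIRAGANA")):
        return "Other"  # ideographs and hiragana are not ALetter
    if ch.isalpha():
        return "ALetter"
    return "Other"


def is_boundary(classes, i):
    """True if there is a word boundary between positions i-1 and i."""
    prev, nxt = classes[i - 1], classes[i]
    if prev == "ALetter" and nxt == "ALetter":
        return False
    if prev == "Katakana" and nxt == "Katakana":
        return False
    # ALetter x MidLetter ALetter
    if (prev == "ALetter" and nxt == "MidLetter"
            and i + 1 < len(classes) and classes[i + 1] == "ALetter"):
        return False
    # ALetter MidLetter x ALetter
    if (prev == "MidLetter" and nxt == "ALetter"
            and i >= 2 and classes[i - 2] == "ALetter"):
        return False
    return True  # otherwise, break everywhere


def split_words(text):
    """Split text into segments at every boundary found by is_boundary."""
    classes = [char_class(c) for c in text]
    segments, start = [], 0
    for i in range(1, len(text)):
        if is_boundary(classes, i):
            segments.append(text[start:i])
            start = i
    if text:
        segments.append(text[start:])
    return segments


print(split_words("Paral\u00B7lel"))   # Catalan: one word
print(split_words("\u00B7 item"))      # bullet: dot stands alone
print(split_words("λογος\u00B7 και"))  # Greek semicolon: dot splits off
print(split_words("ジョン\u00B7スミス"))  # katakana name: dot splits off
```

Under these simplified rules the dot only joins when it has a letter on both
sides, which is why the bullet, Greek and katakana cases split while
"Paral·lel" and a dictionary form such as "dic·tion·ar·y" stay whole.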

_ Marco


