Dots as far as the eye can see (formerly: Re: New version of TR29:)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Aug 14 2002 - 11:30:36 EDT


I had just added this to the list of possible property changes for discussion at the next UTC meeting. I agree that the right status is MidLetter, for the reasons Marco cites. Note that there is also a dot explicitly used for a hyphenation point, one that should also be included as a MidLetter.

U+00B7 ( · ) {MIDDLE DOT}
U+2027 ( ‧ ) {HYPHENATION POINT}

(Note to all, especially those familiar with non-Latin written languages: be sure to look over the character classes for words and sentences in http://www.unicode.org/reports/tr29/ to see if there are any other punctuation characters that should be included.)

I agree with Marco that on balance it is better to respect the Catalan use, especially since as a MidLetter the character will not interfere with the most common other usages (e.g. as a bullet). There is a character explicitly used for a mathematical dot:

U+22C5 ( ⋅ ) {DOT OPERATOR}

Note that we have a gazillion other dots already:

U+002E ( . ) {FULL STOP}
U+02D9 ( ˙ ) {DOT ABOVE}
U+2024 ( ․ ) {ONE DOT LEADER}
U+22C5 ( ⋅ ) {DOT OPERATOR}
U+FE52 ( ﹒ ) {SMALL FULL STOP}
U+FF0E ( ． ) {FULLWIDTH FULL STOP}
U+FF65 ( ･ ) {HALFWIDTH KATAKANA MIDDLE DOT}

U+2801 ( ⠁ ) {BRAILLE PATTERN DOTS-1}
U+2802 ( ⠂ ) {BRAILLE PATTERN DOTS-2}
U+2804 ( ⠄ ) {BRAILLE PATTERN DOTS-3}
U+2808 ( ⠈ ) {BRAILLE PATTERN DOTS-4}
U+2810 ( ⠐ ) {BRAILLE PATTERN DOTS-5}
U+2820 ( ⠠ ) {BRAILLE PATTERN DOTS-6}
U+2840 ( ⡀ ) {BRAILLE PATTERN DOTS-7}

U+0307 ( ◌̇ ) {COMBINING DOT ABOVE}
U+0323 ( ◌̣ ) {COMBINING DOT BELOW}

U+05C1 ( ◌ׁ ) {HEBREW POINT SHIN DOT}
U+05C2 ( ◌ׂ ) {HEBREW POINT SIN DOT}
U+05C4 ( ◌ׄ ) {HEBREW MARK UPPER DOT}
U+05B9 ( ◌ֹ ) {HEBREW POINT HOLAM}
U+05B4 ( ◌ִ ) {HEBREW POINT HIRIQ}
U+05BC ( ◌ּ ) {HEBREW POINT DAGESH OR MAPIQ}

U+302E ( ◌〮 ) {HANGUL SINGLE DOT TONE MARK}

U+073C ( ◌ܼ ) {SYRIAC HBASA-ESASA DOTTED}
U+073F ( ◌ܿ ) {SYRIAC RWAHA}
U+0740 ( ◌݀ ) {SYRIAC FEMININE DOT}
U+0741 ( ◌݁ ) {SYRIAC QUSHSHAYA}
U+0742 ( ◌݂ ) {SYRIAC RUKKAKHA}

U+093C ( ◌़ ) {DEVANAGARI SIGN NUKTA}
U+09BC ( ◌় ) {BENGALI SIGN NUKTA}
U+0A3C ( ◌਼ ) {GURMUKHI SIGN NUKTA}
U+0ABC ( ◌઼ ) {GUJARATI SIGN NUKTA}
U+0B3C ( ◌଼ ) {ORIYA SIGN NUKTA}

And these are just the obvious ones found with a quick search (and just for the single dots). There are probably more hiding out in little corners of scripts (it's a bit like "Where's Waldo" looking for them). Moreover, I believe we may even be adding more dots for UPA (http://www.unicode.org/unicode/alloc/Pipeline.html).
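A quick search of this kind can be approximated today with Python's unicodedata module. This is only a sketch, not the search actually used for the list above: the function name single_dot_candidates is made up for illustration, the filter (the word DOT in the character name) is an assumption, and the hits depend on the Unicode version bundled with the interpreter. It misses dots whose names do not contain the word DOT, such as FULL STOP or HYPHENATION POINT.

```python
import sys
import unicodedata

def single_dot_candidates():
    """Return (code point, name) pairs whose character name contains the word DOT."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed or unassigned code points yield the empty default.
        name = unicodedata.name(chr(cp), "")
        if "DOT" in name.split():
            hits.append((cp, name))
    return hits

hits = single_dot_candidates()
for cp, name in hits[:10]:
    print(f"U+{cp:04X} {{{name}}}")
print(len(hits), "candidates in this Unicode version")
```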

Perhaps we should have reserved a plane just for the darned dots; who knows how many we will end up with...

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: "'John Cowan'" <jcowan@reutershealth.com>
Cc: <mark.davis@jtcsv.com>; <unicore@unicode.org>; <unicode@unicode.org>; <antonio@tuvalkin.web.pt>
Sent: Wednesday, August 14, 2002 06:23
Subject: RE: New version of TR29:

John Cowan wrote:
> Marco Cimarosti scripsit:
>
> > Moreover, as Martins-Tuválkin says, non-Catalan uses of U+00B7 are too
> > unusual and uninteresting to be taken as the default.
>
> You omit, however, its very common use as a sign of multiplication.

Actually, I don't see it very often.

> > BTW, notice that the most important of these non-Catalan usages work as
> > expected also if U+00B7 is a MidLetter:
>
> However, it prevents a·b (a times b) from being correctly split.

In algebra, multiplication operators are normally omitted in such cases: a
times b is spelled "ab"; twice a is spelled "2a".

A dot-shaped multiplication operator is only used when both operands are
numbers (in which case it would split correctly), or when both operands are
alphabetic but at least one of them is longer than one letter, e.g.:

x·sin 3

But it seems to me that these borderline cases are too rare to take
priority over the proper spelling of Catalan or the common notation of
hyphenation in dictionaries.

Moreover, TR29 can be customized for special needs, and math applications
already have lots of things to customize.

_ Marco

 
----- Original Message -----
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: "'Mark Davis'" <mark.davis@jtcsv.com>; <unicore@unicode.org>; <unicode@unicode.org>
Cc: "'António Martins-Tuválkin'" <antonio@tuvalkin.web.pt>
Sent: Wednesday, August 14, 2002 04:18
Subject: RE: New version of TR29:

Mark Davis wrote:
> There is a new version of Unicode Technical Report #29: Text
> Boundaries on <http://www.unicode.org/reports/tr29/>,
> [...]
> Feedback that is received before the UTC meeting (starting August 20) can be
> made available for the discussion of TR29 at that meeting.

I think that the following comment by António Martins-Tuválkin, from the
thread titled "Is U+0140 (l with middle dot) ever used?", is relevant for
TR29:

| As for the nature of the middle dot, short of a specific code point
| attributed to LATIN LETTER CATALAN MIDDLE DOT, there should be
| something ensuring that this character can be treated as a letter
| for all things referring to word delimitation (smart select, line
| break, word count, etc.).
|
| I imagine that with 9 million native speakers Catalan may appear
| as a weak lobby to push to such a change in the standard, but note
| that while other uses of (non-letter) middle dot are marginal and
| scarcely content-bearing, Catalan middle dot is central and
| essential to quality textual content representation and encoding
| -- which AFAIK Unicode is all about.

I suggest that U+00B7 (·) be added to the "MidLetter" character class in
Table 2 ("Default Word Boundaries"). It would be inconvenient if U+0140
U+006C (ŀl) and U+006C U+00B7 U+006C (l·l) worked differently.
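The relationship between the two spellings can be checked with Python's unicodedata module. A minimal sketch, with illustrative variable names: U+0140 carries a compatibility decomposition to a lowercase l (U+006C) followed by U+00B7, so the two sequences become identical under NFKC normalization, which is one more reason to expect them to behave alike.

```python
import unicodedata

precomposed = "\u0140l"   # ŀ followed by l
spelled_out = "l\u00b7l"  # l, MIDDLE DOT, l

# Compatibility decomposition recorded for U+0140 in the Unicode data.
decomposition = unicodedata.decomposition("\u0140")
print(decomposition)

# After NFKC normalization both spellings are the same string.
same_after_nfkc = unicodedata.normalize("NFKC", precomposed) == spelled_out
print(same_after_nfkc)
```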

A proper Catalan behavior is desirable for Catalan itself, of course, but
also for any other language occasionally using Catalan loanwords or proper
names.

A web search on Google for "Paral·lel" (the most famous road in Barcelona)
shows that this name is more common in English web pages (about 16,000) than
in Catalan ones (about 7,500).

Moreover, as Martins-Tuválkin says, non-Catalan uses of U+00B7 are too
unusual and uninteresting to be taken as the default.

BTW, notice that the most important of these non-Catalan usages work as
expected also if U+00B7 is a MidLetter:

 1) It works OK as a bullet: it correctly splits because it would never be
preceded by a letter;

 2) It works OK as a Greek semicolon: it correctly splits because it
would always be followed by a space;

 3) It works OK as a CJK separator for Western personal names: it correctly
splits because MidLetter is not involved in rules with katakana or
ideographs;

 4) It works OK as a hyphenating separator in dictionaries: it correctly
joins as it does in Catalan.
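The four cases above can be tried out with a rough Python sketch. It is not the TR29 algorithm: it keeps only a letter-run rule, a digit-run rule and the two MidLetter rules, approximates the character classes with str.isalpha() and character names, and assumes that U+00B7 and U+2027 are the only MidLetter characters. All function and constant names are made up for illustration.

```python
import unicodedata

MID_LETTER = {"\u00b7", "\u2027"}  # assumption: only the two dots under discussion

def char_class(ch):
    """Very rough stand-in for the word-break classes."""
    if ch in MID_LETTER:
        return "MidLetter"
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        name = unicodedata.name(ch, "")
        # Kana and ideographs take no part in the MidLetter rules.
        if name.startswith(("KATAKANA", "HIRAGANA", "CJK")):
            return "KanaIdeo"
        return "ALetter"
    return "Other"

def no_break(text, i):
    """True if there is no boundary between text[i-1] and text[i]."""
    prev, cur = char_class(text[i - 1]), char_class(text[i])
    if prev == cur and prev in ("ALetter", "KanaIdeo", "Numeric"):
        return True
    if prev == "ALetter" and cur == "MidLetter":   # ALetter x MidLetter ALetter
        return i + 1 < len(text) and char_class(text[i + 1]) == "ALetter"
    if prev == "MidLetter" and cur == "ALetter":   # ALetter MidLetter x ALetter
        return i >= 2 and char_class(text[i - 2]) == "ALetter"
    return False

def segments(text):
    """Split text at every boundary found by no_break()."""
    if not text:
        return []
    out, start = [], 0
    for i in range(1, len(text)):
        if not no_break(text, i):
            out.append(text[start:i])
            start = i
    out.append(text[start:])
    return out

def words(text):
    """Keep only the segments that contain at least one letter."""
    return [s for s in segments(text) if any(c.isalpha() for c in s)]

print(words("Paral\u00b7lel"))                          # Catalan: one word
print(words("\u00b7 item"))                             # 1) bullet
print(words("\u03bd\u03b1\u03b9\u00b7 \u03ba\u03b1\u03b9"))  # 2) Greek semicolon
print(words("\u30b8\u30e7\u30f3\u00b7\u30b9\u30df\u30b9"))   # 3) katakana name
print(words("dic\u00b7tion\u00b7ar\u00b7y"))            # 4) dictionary hyphenation
print(segments("2\u00b73"), segments("a\u00b7b"))       # numbers split, a·b does not
```

The last line shows the trade-off discussed in this thread: a dot between two numbers still splits, while a·b stays joined.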

_ Marco



This archive was generated by hypermail 2.1.2 : Wed Aug 14 2002 - 09:44:33 EDT