TR#29: Inconsistency in breaking sentences and words

From: Konstantin Ritt <ritt.ks_at_gmail.com>
Date: Sat, 6 Oct 2012 15:03:28 +0300

Hi list,

As of Unicode 5.1, the MidNumLet Word_Break property value
(apostrophe-like and dot-like characters) causes sequences like
< (ALetter)+ MidNumLet (ALetter)+ > to be treated as a single word.
While this seems to be an improvement in handling words like "can’t" or
"aujourd’hui", it also causes a regression in handling words separated
by a dot: domain names (see
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/63311),
mistyped text such as "hi.there", code navigation over
"struct.member" (yeah, I know this is out of scope of the default
algorithm, but still), and so on.
Worst of all, the default algorithm now specifies a sentence break in
the middle of a word. For example, "mr.Hamster" is two sentences due to
rule SB8 but still a single word due to rules WB6-WB7.
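
To make the mismatch concrete, here is a minimal Python sketch. It is
not the UAX #29 algorithm: the character classes are a toy subset, the
helper names (split_words, split_sentences) are made up for this note,
and the sentence rules only look at the characters immediately around
the dot.

```python
# Toy subset of MidNumLet (illustrative only, not the real property data).
MID_NUM_LET = {".", "'", "\u2019", "\u2024", "\uFE52", "\uFF0E"}


def word_class(ch):
    """Very rough stand-in for the Word_Break property."""
    if ch.isalpha():
        return "ALetter"
    if ch in MID_NUM_LET:
        return "MidNumLet"
    return "Other"


def split_words(text):
    """Break everywhere except where simplified WB5, WB6 or WB7 apply."""
    if not text:
        return []
    cls = [word_class(c) for c in text]

    def at(i):
        return cls[i] if 0 <= i < len(cls) else None

    words, start = [], 0
    for i in range(1, len(text)):
        p2, p, c, n = at(i - 2), at(i - 1), at(i), at(i + 1)
        keep = (
            (p == "ALetter" and c == "ALetter")                            # WB5
            or (p == "ALetter" and c == "MidNumLet" and n == "ALetter")    # WB6
            or (p2 == "ALetter" and p == "MidNumLet" and c == "ALetter")   # WB7
        )
        if not keep:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


def split_sentences(text):
    """Break after '.' unless a simplified SB6, SB7 or SB8 forbids it."""
    if not text:
        return []
    out, start = [], 0
    for i in range(len(text) - 1):
        if text[i] != ".":
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1]
        if nxt.isdigit():                       # SB6: ATerm x Numeric
            continue
        if prev.isupper() and nxt.isupper():    # SB7: Upper ATerm x Upper
            continue
        if nxt.islower():                       # SB8, adjacent Lower only
            continue
        out.append(text[start:i + 1])
        start = i + 1
    out.append(text[start:])
    return out
```

With these simplifications, split_words("mr.Hamster") yields one word
while split_sentences("mr.Hamster") yields two pieces, so the sentence
boundary falls inside the word.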

A possible (and simple) solution is to map some or all of those
dot-like characters (FULL STOP, ONE DOT LEADER, SMALL FULL STOP, and
FULLWIDTH FULL STOP) back to the MidNum Word_Break property value.
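
A rough Python sketch of what that remapping would change, under the
same caveats as any toy model: the classes are a small subset of the
real Word_Break values, MidLetter is left out, and split_words and its
dots_as parameter are hypothetical names used only here.

```python
# FULL STOP, ONE DOT LEADER, SMALL FULL STOP, FULLWIDTH FULL STOP
DOT_LIKE = {".", "\u2024", "\uFE52", "\uFF0E"}
APOSTROPHE_LIKE = {"'", "\u2019"}   # toy subset, stays MidNumLet
AL_NUM = {"ALetter", "Numeric"}


def classify(ch, dots_as):
    """Rough Word_Break stand-in; dots_as picks the class for dot-likes."""
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    if ch in APOSTROPHE_LIKE:
        return "MidNumLet"
    if ch in DOT_LIKE:
        return dots_as
    return "Other"


def split_words(text, dots_as="MidNumLet"):
    """Simplified WB5-WB12; dots_as="MidNum" models the proposed remapping."""
    if not text:
        return []
    cls = [classify(c, dots_as) for c in text]

    def at(i):
        return cls[i] if 0 <= i < len(cls) else None

    letter_mid = {"MidNumLet"}              # MidLetter omitted in this sketch
    number_mid = {"MidNum", "MidNumLet"}
    words, start = [], 0
    for i in range(1, len(text)):
        p2, p, c, n = at(i - 2), at(i - 1), at(i), at(i + 1)
        keep = (
            (p in AL_NUM and c in AL_NUM)                                  # WB5, WB8-WB10
            or (p == "ALetter" and c in letter_mid and n == "ALetter")     # WB6
            or (p2 == "ALetter" and p in letter_mid and c == "ALetter")    # WB7
            or (p2 == "Numeric" and p in number_mid and c == "Numeric")    # WB11
            or (p == "Numeric" and c in number_mid and n == "Numeric")     # WB12
        )
        if not keep:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

In this model, split_words("hi.there", dots_as="MidNum") breaks around
the dot, while "3.14" and "can’t" each remain a single word, because
MidNum only joins digits and the apostrophe keeps its MidNumLet value.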

Another possible solution I see is to split ALetter into something
like Upper, Lower, and OLetter, to map those dot-like characters to
some new Term Word_Break property value (mostly mirroring the
Sentence_Break property values), and to extend the word breaking rules
so that no breaks are allowed within sequences like < Upper x Term
x Upper (Term)? > and < Lower x Term x Lower (Term)? > (possibly
surrounded by < (¬(Upper | Lower | OLetter))* > ?).
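
One possible reading of these rules, sketched in Python. Everything here
is an assumption made for illustration: the class names follow the
proposal, but the adjacency-only matching, the handling of the optional
trailing Term, and the function names are guesses, and apostrophes and
digits are ignored.

```python
# Dot-like characters mapped to the hypothetical Term class.
DOT_LIKE = {".", "\u2024", "\uFE52", "\uFF0E"}
LETTERS = {"Upper", "Lower", "OLetter"}
CASED = {"Upper", "Lower"}


def case_class(ch):
    """Hypothetical case-aware word-break class for one character."""
    if ch in DOT_LIKE:
        return "Term"
    if ch.isupper():
        return "Upper"
    if ch.islower():
        return "Lower"
    if ch.isalpha():
        return "OLetter"
    return "Other"


def split_words(text):
    """Keep X Term X (Term)? together only when both X share the same case."""
    if not text:
        return []
    cls = [case_class(c) for c in text]

    def at(i):
        return cls[i] if 0 <= i < len(cls) else None

    words, start = [], 0
    for i in range(1, len(text)):
        p3, p2, p, c, n = at(i - 3), at(i - 2), at(i - 1), at(i), at(i + 1)
        keep = (
            (p in LETTERS and c in LETTERS)                     # letters stick together
            or (p in CASED and c == "Term" and n == p)          # X x Term (X follows)
            or (p2 in CASED and p == "Term" and c == p2)        # X Term x X
            or (p3 in CASED and p2 == "Term" and p == p3
                and c == "Term")                                # X Term X x Term
        )
        if not keep:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

Under this reading, "mr.Hamster" splits at the dot (Lower, Term, Upper),
whereas "U.S.A." and "e.g." stay whole. Note that "hi.there" also stays
whole, since both neighbours of the dot are Lower.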

What do you think?

P.S. I really wish unicode.org had a bug tracker so that one could
report, search, and watch issues like this.

Kind regards,
Konstantin
Received on Sat Oct 06 2012 - 07:10:14 CDT
