Re: Question about the Sentence_Break property from Philippe Verdy on 2015-02-20 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 21 Feb 2015 00:56:14 +0100

2015-02-20 6:14 GMT+01:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
> One thing that is missing is mention of the convention that a single
> newline character (or CRLF pair) is a line break whereas a doubled
> newline character denotes a paragraph break.
>

In that case, CR or LF characters alone are not "paragraph separators" by
themselves unless they are grouped together. Like NEL, they should just be
considered line separators, and the terminology used in UAX 29 rule SB4
is effectively incorrect if what matters here is just the line break
property. Also in that case, rule SB4 should effectively include
NEL (from the C1 subset).

But as SB4 is only related to sentence breaking, it would be a problem,
because simple line breaks are used extremely frequently in the middle of
sentences.

What the sentence break algorithm should say is that there should first be
a preprocessing step separating line breaks from paragraph breaks (creating
custom entities, similar to collation elements but encoded internally with
a code point outside the standard space) that rule SB4 would use
instead of "Sep | CR | LF". That custom entity should be "Sep", but without
the rule defining it, as there are various ways to represent paragraph
breaks.
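
A preprocessing step of this kind could be sketched roughly as follows, in
Python. This is only an illustration under stated assumptions, not anything
defined by UAX 29: the names LINE_BREAK, PARA_BREAK and preprocess are
hypothetical, the sentinel values are simply integers just past U+10FFFF,
and the convention assumed is that one newline sequence (CR, LF, CRLF, NEL
or U+2028) is a line break, while two or more in a row, or U+2029, form a
paragraph break.

```python
import re

# Hypothetical sentinels just outside the Unicode range (max is 0x10FFFF).
LINE_BREAK = 0x110000
PARA_BREAK = 0x110001

# One newline sequence: CRLF counts as a single unit, then lone CR, LF, NEL, LS.
_NEWLINE = r"(?:\r\n|[\r\n\x85\u2028])"
# Either an explicit PARAGRAPH SEPARATOR or a run of newline sequences.
_BREAK_RUN = re.compile(rf"\u2029|{_NEWLINE}+")
_ONE_NEWLINE = re.compile(_NEWLINE)


def preprocess(text):
    """Return a list of code points with break runs replaced by sentinels."""
    out = []
    pos = 0
    for m in _BREAK_RUN.finditer(text):
        # Copy the ordinary characters before this break run.
        out.extend(ord(ch) for ch in text[pos:m.start()])
        run = m.group()
        # Assumed convention: U+2029 or a doubled newline is a paragraph break.
        if run == "\u2029" or len(_ONE_NEWLINE.findall(run)) >= 2:
            out.append(PARA_BREAK)
        else:
            out.append(LINE_BREAK)
        pos = m.end()
    # Copy whatever follows the last break run.
    out.extend(ord(ch) for ch in text[pos:])
    return out
```

A sentence segmenter built on top of this would then test for PARA_BREAK
where SB4 currently tests for "Sep | CR | LF", and treat LINE_BREAK like
ordinary white space.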

Received on Fri Feb 20 2015 - 17:57:59 CST