Re: PRI #231: Bidi Parenthesis Algorithm from Konstantin Ritt on 2012-06-07 (Unicode Mail List Archive)

From: Konstantin Ritt <ritt.ks_at_gmail.com>
Date: Thu, 7 Jun 2012 19:42:52 +0300

After some investigation, I'm taking my question/proposal back )

There are two major problems I found while implementing BPA-like
(relaxed) rules for the quotation marks:
1) Paired quotation marks can also be used alone as apostrophes. With
the information currently provided by the UCD, it is not possible to
tell whether a quotation mark indicates the start/end of a quoted
block or an apostrophe (example: he said: 'I don't use apostrophes',
where the left/right single quotation marks (U+2018/U+2019) are used
both to quote what he said and to indicate an apostrophe). A heuristic
could use word boundaries to skip mid-word apostrophes, but a
language-tailored word-breaking implementation requires the text to be
itemized into script runs first -- too complicated;
2) In some languages, the paired quotation marks can be swapped or
used unpaired (example: »…» or »…«). Without additional information,
it is easy to misinterpret where a quoted block begins and ends
(example: »Danish«and»Polish»).
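
A minimal sketch of the word-boundary heuristic from item 1, assuming
the simplest possible (untailored) rule: U+2019 counts as an apostrophe
only when it has a letter on both sides. The function name and labels
are hypothetical, and this is not the BPA or any tailored word breaker.

```python
import unicodedata

LSQ, RSQ = "\u2018", "\u2019"  # left/right single quotation marks


def is_letter(ch):
    # Any general category starting with "L" (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")


def classify_single_quotes(text):
    """Label each U+2018/U+2019 as 'open', 'close' or 'apostrophe'.

    Assumption: a U+2019 with a letter on both sides is a mid-word
    apostrophe; everything else is treated as a quotation mark.
    """
    labels = []
    for i, ch in enumerate(text):
        if ch not in (LSQ, RSQ):
            continue
        prev_is_letter = i > 0 and is_letter(text[i - 1])
        next_is_letter = i + 1 < len(text) and is_letter(text[i + 1])
        if ch == RSQ and prev_is_letter and next_is_letter:
            labels.append((i, "apostrophe"))
        elif ch == LSQ:
            labels.append((i, "open"))
        else:
            labels.append((i, "close"))
    return labels


sample = "he said: \u2018I don\u2019t use apostrophes\u2019"
print([label for _, label in classify_single_quotes(sample)])
```

The sample from item 1 comes out as open, apostrophe, close. A
word-final apostrophe (as in "the dogs\u2019 bowl") is still labelled
"close", which is the kind of case that needs language knowledge.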

Perhaps this should be clarified somehow in the respective sections of
UAX#24, "General Punctuation", etc.
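
For reference, the per-character data discussed in this thread (general
category and the Bidi_Mirrored flag) can be inspected with Python's
standard unicodedata module, as sketched below. The helper name is
hypothetical. None of these properties says whether a particular
occurrence opens a quote, closes one, or acts as an apostrophe.

```python
import unicodedata


def quote_properties(ch):
    """Return (general category, Bidi_Mirrored as bool) for one character."""
    return unicodedata.category(ch), bool(unicodedata.mirrored(ch))


for ch in "\u00ab\u00bb\u2018\u2019":
    gc, mirrored = quote_properties(ch)
    print(f"U+{ord(ch):04X} gc={gc} mirrored={mirrored} {unicodedata.name(ch)}")
```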

Konstantin

2012/6/7 CE Whitehead <cewcathar_at_hotmail.com>:
> Hi.
>
> From: Konstantin Ritt <ritt.ks_at_gmail.com>
> Date: Thu, 7 Jun 2012 13:06:04 +0300
>> Yep, forgot to mention that the difference is that some paired
>> quotation characters might be used alone in place of an apostrophe,
>> etc., so the BPA rules could be relaxed for the quotation marks.
>> I don't know about their mirroring in all languages. I thought
>> BidiMirroring.txt is supposed to list (language-independent)
>> characters and their respective mirrored counterparts.
>
>> UAX#24 section 2.2 "Handling Characters with the Common Script Property"
>> states:
>>> In determining the boundaries of a run of text in a given script,
>>> programs must resolve any of the special script property values,
>>> such as Common, based on the context of the surrounding characters.
>>> A simple heuristic uses the script of the preceding character, which
>>> works well in many cases. However, this may not always produce
>>> optimal results. For example, in the text "... gamma (γ) is ...",
>>> this heuristic would cause matching parentheses to be in different
>>> scripts.
>>>
>>> Generally, paired punctuation, such as brackets or quotation marks,
>>> belongs to the enclosing or outer level of the text and should
>>> therefore match the script of the enclosing text. In addition,
>>> opening and closing elements of a pair resolve to the same script
>>> property values, where possible. The use of quotation marks is
>>> language dependent; therefore it is not possible to tell from the
>>> character code alone whether a particular quotation mark is used as
>>> an opening or closing punctuation. For more information, see
>>> Section 6.2, General Punctuation, of [Unicode].
>>>
>>> Some characters that are normally used as paired punctuation may
>>> also be used singly. An example is U+2019 right single quotation
>>> mark, which is also used as apostrophe, in which case it no longer
>>> acts as an enclosing punctuation. An example from physics would be
>>> <ψ| or |ψ>, where the enclosing punctuation characters may not form
>>> consistent pairs.
>
>> IIUC, this is the same problem as the one PRI #231 is intended to solve.
>
>> For cases like "a«b»", one would expect the UBA and the script
>> itemization to give similar results.
>
>> Konstantin
>
> 2012/6/7 Philippe Verdy <verdy_p_at_wanadoo.fr>:
>>> Their pairing and mirroring is not appropriate for all languages using
>>> them.
>>>
>>> 2012/6/7 Konstantin Ritt <ritt.ks_at_gmail.com>:
>>>> Actually, they have respective entries in BidiMirroring.txt:
>>>> 00AB; 00BB # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
>>>> 00BB; 00AB # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
>>>> and are mapped to gc=Pi and gc=Pf.
>>>> Even without the per-language tailoring, it seems like a good basic
>>>> approximation, no?
>
> Philippe is correct; Wikipedia gives some examples of language-specific
> variation in opening and closing quotation marks:
> http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
>
> (Also, of course, as Konstantin notes, the single quotation marks are
> used in some languages as apostrophes to indicate possession.)
>
> I have not tried, say, French-style quotation marks on Facebook, where
> parentheses get displayed in the wrong places when used in mixed
> right-to-left and left-to-right text. So I don't yet know what happens
> to quotation marks in mixed-directionality text.
>
> Best,
>
> --C. E. Whitehead
> cewcathar_at_hotmail.com
>
Received on Thu Jun 07 2012 - 11:48:45 CDT
