Re: RTL PUA? from Philippe Verdy on 2011-08-22 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 22 Aug 2011 15:29:53 +0200

2011/8/22 Peter Constable <petercon_at_microsoft.com>:
> From: verdyp_at_gmail.com [mailto:verdyp_at_gmail.com] On Behalf Of Philippe Verdy
>
>>> As I explained in an earlier message, the layout engine doesn't use
>>> the "default" property value but the resolved bidi level.
>>
>> Once again, you refuse to understand my arguments.
>
> I don't think I'm refusing to understand anything. I'm merely taking your assertions _as stated_ and evaluating whether I think they are accurate or not. Perhaps what you intend to convey assumes things not clear in what you've stated, since you think I'm not understanding you.
>
>
>> What I'm saying is that OpenType CANNOT resolve the bidi level of
>> PUAs (with the exception where we use additional BiDi controls,
>
> Of course _OpenType_ cannot, but any rendering engine that uses OpenType _must_ resolve the bidi level of _all_ characters in a sequence that it is given to render. Given our current situation, a default rendering implementation would resolve PUA characters to an even (LTR) level unless, of course, bidi control characters -- particularly RLO -- are used to override the directionality of the character, as you mention.
>
>> which remains a hack, because it adds unnecessary invisible markup
>> around the encoded texts, and complicates the use of strings and
>> substrings).
>
> Well, depending on how you define "hack", some might reasonably suggest that any usage of PUA is "a hack". (Of course, some who may not use the term in the same way might argue that it is certainly not "a hack".)
>
> You can turn the problem as you want, but PUAs (as well as unknown
> characters) still have default properties that, in fine, will get used in absence of a more precise definition (i.e. an explicit override) of the actual BiDi property needed for the character.

So now I understand your position:

- You don't want the solution proposed by Michael Everson (simply
adding a range of RTL PUA characters). I also think it is not
necessary, but it is clearly a possible solution.

- You propose to use BiDi overrides. I think, like Michael Everson,
that this is an impractical hack. Michael has to work with old
scripts and with many new unencoded characters to be added to
existing scripts (notably Arabic): he tries to encode them, finds
various ways to represent them, and *tests* his solutions. He will
certainly think that embedding each occurrence of a PUA substring in
BiDi controls, including in the middle of Arabic words, is a very bad
hack.

- He must certainly think (and so do I) that PUA characters are NOT
hacks. They are architectural to the well-being of the UCS, and
essential in various situations to preserve software conformance
with the standard. In fact, for old and rare scripts, using PUAs will
remain essential for a long time, because those scripts will now need
more and more time to get encoded, requiring more extensive research
and more collaboration with less technically aware people, who cannot
understand why they have to test the proposed solutions using test
fonts and test input methods that require them to enter BiDi controls
around all those PUA characters.

The only problem here is the strong LTR property of all existing PUAs,
as if they were only needed for rare Han sinograms, or for symbols.
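
For illustration, this default can be checked with the standard
unicodedata module of Python. This is only a sketch: the helper name
pua_bidi_classes is mine, and the result reflects whatever UCD
version the Python build ships with.

```python
import unicodedata

# First and last code points of the three private-use ranges
# (BMP PUA, plane 15, plane 16).
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def pua_bidi_classes():
    """Return the set of Bidi_Class values found at the range boundaries."""
    classes = set()
    for first, last in PUA_RANGES:
        for cp in (first, last):
            classes.add(unicodedata.bidirectional(chr(cp)))
    return classes

print(pua_bidi_classes())
```

A renderer that only consults this default data has no way to learn
that a given PUA character is meant to be RTL.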

Note that, when using PUAs for rare letters found in Arabic, it is
impossible to embed the whole Arabic text in BiDi overrides: this
would completely break the normal behavior of the non-PUA characters
found in the text, notably sequences of Arabic digits, because the
BiDi overrides effectively disable the BiDi algorithm, so that it
returns a single RTL run for all the text between these controls. If
BiDi controls are used, they have to be inserted ONLY around the
subranges containing the PUAs, and only those.
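
To make the cost of this approach concrete, here is a minimal sketch
of what an input method or a conversion tool would have to do: wrap
each maximal run of PUA characters, and nothing else, between RLO
(U+202E) and PDF (U+202C). The function names are hypothetical, and
the sketch treats every PUA character as RTL, which a real PUA
agreement may not do.

```python
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def is_pua(ch):
    """True if ch lies in one of the three private-use ranges."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def wrap_pua_runs(text):
    """Enclose each maximal run of PUA characters in RLO ... PDF."""
    out = []
    in_run = False
    for ch in text:
        if is_pua(ch) and not in_run:
            out.append(RLO)   # a PUA run starts here
            in_run = True
        elif not is_pua(ch) and in_run:
            out.append(PDF)   # the PUA run ended before this character
            in_run = False
        out.append(ch)
    if in_run:
        out.append(PDF)       # close a run that reaches the end of text
    return "".join(out)

# Arabic letter, two PUA characters, then two Arabic-Indic digits:
# only the PUA pair is wrapped, so the digits keep their normal behavior.
sample = "\u0628\uE000\uE001\u0661\u0662"
wrapped = wrap_pua_runs(sample)
```

Every PUA run grows by two invisible characters, and any substring
operation that cuts between an RLO and its PDF leaves an unbalanced
control behind.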

The solution proposed by Michael (a new block of RTL PUAs, probably in
plane 14) still has an advantage: no BiDi controls are needed at all.
The BiDi algorithm does not have to be disabled. All other aspects of
RTL scripts (or mixed RTL/LTR scripts) are preserved, including
mirroring behaviors for "auto-LTR" characters (at the beginning of
paragraphs) and for characters whose directionality depends on the
resolved direction of the preceding text.

I don't think this is necessary though: I see no reason why
implementations *have to* keep the strong LTR property of existing
PUAs. This strong LTR property is only the *default* value for those
PUAs, and applications should not be restricted from changing this
property as they see fit, especially for PUAs.

But to change this property value, we need an explicit PUA agreement
about their usage, expressed in such a way that it can be understood
by a computer. This means an external source of character properties.
My opinion is that such a source is most often sufficient if it solves
just the problem of correct display order. Given that the encoded
texts (using those existing strong LTR PUAs to which we want to give
RTL behavior) do not explicitly encode the PUA agreement, the source
of the PUA agreement cannot be the encoded text (BiDi controls
definitely do not express such a PUA agreement).

For me, it would be simple to embed this PUA agreement, for computer
use, in a font suitable for displaying those PUAs (let's remember that
those PUAs will still need such a specific font that *must* match the
same PUA agreement). All that is needed then, in such a PUA font, is
that it indicates which PUA characters (that are "cmap'ed" in it) are
RTL or not. This just requires a new very small table in the PUA font,
to help the text layout engine to correctly resolve the direction of
text runs when it implements the BiDi algorithm.
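
As a rough sketch of how a layout engine could consume such
information: before running the BiDi algorithm, it would look up each
character's class, preferring the font-supplied value for PUA
characters. No such OpenType table exists today, so FONT_RTL_PUA below
is a purely hypothetical stand-in for it, and bidi_class is my own
name, not an existing API.

```python
import unicodedata

# Hypothetical stand-in for the small per-font table: the PUA code
# points that the font's PUA agreement declares to be right-to-left.
FONT_RTL_PUA = frozenset(range(0xE000, 0xE040))

def bidi_class(ch, rtl_pua=FONT_RTL_PUA):
    """Return the Bidi_Class to feed into the BiDi algorithm for ch.

    Only private-use characters (General_Category Co) may be
    overridden; every other character keeps its standard UCD value.
    """
    if ord(ch) in rtl_pua and unicodedata.category(ch) == "Co":
        return "R"
    return unicodedata.bidirectional(ch)

classes = [bidi_class(c) for c in "\u0628\uE000\uE100A"]
```

The BiDi algorithm itself is untouched: only its input classes
change, and only for the PUA characters that the font declares.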

Then, if needed, the standard "rtlm" and "ltrm" features (for
glyph-level mirroring, in the case of OpenType layout) can be used
reliably (the "rtla" and "ltra" features for typographic variants may
also be used, but they are probably much less essential for PUA
characters, which are likely to be represented in the PUA font with
just a single typographic variant per "cmap'ped" glyph).

If the PUA font does not have such information about which of its
cmapped PUAs are RTL, all of them will resolve as LTR, unless a third
data source is used and associated with the document and its PUA font
to specify their effective BiDi properties. This is in practice more
complex to manage, notably for plain-text documents: it would
require that plain-text editors can load a separate properties file,
in addition to loading the plain-text document and selecting the
appropriate PUA font.
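
To give an idea of what such a separate properties file could look
like, here is a sketch of a loader. The file format (code point or
range, semicolon, Bidi_Class, optional "#" comment) is only an
assumption of mine, modeled on the layout of the UCD data files; no
standard defines such a file, and parse_pua_properties is a
hypothetical name.

```python
def parse_pua_properties(text):
    """Parse lines like 'E000..E002 ; R' into {code point: Bidi_Class}."""
    overrides = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        code_range, bidi = (field.strip() for field in line.split(";"))
        first, _, last = code_range.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo     # single code point or range
        for cp in range(lo, hi + 1):
            overrides[cp] = bidi
    return overrides

# Hypothetical content of a properties file shipped with a document.
SAMPLE = """\
# PUA agreement for an unencoded RTL script
E000..E002 ; R
E010 ; AL  # a single code point
"""

overrides = parse_pua_properties(SAMPLE)
```

The parsing is trivial; the difficulty is that a plain-text document
has no place to say which properties file applies to it.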

When editing or viewing a rich-text document (e.g. HTML in an HTML
editor, a word-processor document, or an online Wiki document) in
WYSIWYG mode, that rich-text document will need to supply a reference
to that source of information in some metadata field, just as it can
also store the font to select, in order to render the text in the
expected order. This can be done automatically, without user
interaction each time the document is loaded, but it won't be as
easy for plain-text documents (including when editing the plain-text
HTML or Wiki source code).

So there are only two options:

- (1) the solution advocated by Michael Everson (a new RTL PUA block,
say in plane 14); it does not require a change in the BiDi algorithm
itself, but renderers must implement the new version of the UCD.

- (2) an external source to override the strong LTR property of
existing PUA blocks, and use of this information in renderers. (I
advocate placing this information directly in PUA fonts, which in my
opinion are the perfect place for it; some font formats already do
this, but not OpenType for now).

In both cases, the text renderers have to be modified (including
renderers for OpenType, whose specifications will need to be updated,
to change the BiDi algorithm implementation): this requires an
approval either by the UTC & WG2 (solution 1) or by the OpenType
working group (solution 2).

-- Philippe.
Received on Mon Aug 22 2011 - 08:33:31 CDT
