Re: RTL PUA? from Philippe Verdy on 2011-08-21 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 22 Aug 2011 00:51:41 +0200

2011/8/21 Doug Ewell <doug_at_ewellic.org>:
> For once, I am in strong agreement with something Philippe had to say:
>
>> We really need a reliable way to transport a PUA agreement in such a
> way that it can be understood by a computer.
>
> I don't necessarily agree that fonts, or (especially) any particular font technology, are the one and only way to accomplish this, because there's more to character handling than display. Maybe some sort of open format could be devised that could be used as a plug-in to a variety of existing components.

Yes, but without display support, at least, none of the other needs
will ever be addressed, because there will be no encoded text to work
with. Don't even dream, for example, of performing plain-text search
if you have no encoded texts to search in! Collation is then a
secondary target. Proper display is an immediate need (one that even
comes before the development of easy input methods, or the later
development of spell checkers, content indexers, semantic analyzers,
and the localization of software to use a given script in its UI).

For proper display of PUAs, all that is needed is a minimum set of
character properties. I have argued, contrary to what Peter Constable
thinks, that OpenType cannot handle RTL characters in the PUA, because
it has absolutely no source of information to know whether a run of
text is RTL or LTR when implementing the BiDi algorithm.
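
To make the problem concrete, here is a minimal Python sketch. Unicode
assigns the default Bidi_Class L to the Private Use Areas, so a
first-strong direction check treats a PUA-only word as LTR unless some
external table says otherwise. The function first_strong_direction and
its overrides argument are illustrative assumptions, not part of any
existing layout engine, and the check is a simplification of rules
P2/P3 of the BiDi algorithm (it ignores isolates and embeddings).

```python
import unicodedata

def first_strong_direction(text, overrides=None):
    """Return "rtl" or "ltr" from the first strong character in text.

    overrides is a hypothetical {code point: Bidi_Class} table standing
    in for externally supplied PUA properties.
    """
    overrides = overrides or {}
    for ch in text:
        cls = overrides.get(ord(ch), unicodedata.bidirectional(ch))
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # no strong character found: fall back to LTR

# Three PUA code points standing for a word in an unencoded RTL script.
pua_word = "\uE000\uE001\uE002"

# With the standard properties only, the word is seen as LTR.
default_direction = first_strong_direction(pua_word)

# With an external table marking these code points as R, it becomes RTL.
pua_overrides = {0xE000: "R", 0xE001: "R", 0xE002: "R"}
overridden_direction = first_strong_direction(pua_word, pua_overrides)

print(default_direction, overridden_direction)
```

The only thing that changes between the two calls is the property
table, which is exactly the information a font alone does not carry.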

OK, the mirroring property is probably not essential, because most
mirrored characters today are punctuation marks, and the existing
ones already cover a very wide range. If needed, additional PUA
punctuation marks may be added, and even coded at two mirrored code
positions, even if they are not automatically mirrored according to
their context: for such rare cases, using BiDi format controls around
them, or equivalent CSS embedding styles in HTML, or similar
techniques, will be enough.

But for most RTL text using PUAs in long runs, or mixed within other
sequences of standard RTL characters (for example in the middle of
words), format controls are clearly not the solution: they do not
work reliably in HTML, for example, if you have to split words across
separate spans, and inserting those controls in the middle of words is
really a nightmare. In addition, it completely defeats the plain-text
searchability and editability of encoded texts. This will slow down
the production of encoded texts so much that, in fact, almost no work
will be done with those PUAs. As a consequence, most texts will wait
indefinitely for some encoding effort.
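
The searchability cost can be shown with a small Python sketch. It
wraps every run of PUA characters in RIGHT-TO-LEFT OVERRIDE (U+202E)
and POP DIRECTIONAL FORMATTING (U+202C), which is one possible way to
force RTL ordering, and then runs a naive substring search. The helper
names is_pua and wrap_pua_runs are assumptions made for illustration.

```python
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def is_pua(ch):
    """True if ch lies in one of the three Private Use Areas."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def wrap_pua_runs(text):
    """Surround each maximal run of PUA characters with RLO ... PDF."""
    out = []
    in_run = False
    for ch in text:
        if is_pua(ch) and not in_run:
            out.append(RLO)
            in_run = True
        elif not is_pua(ch) and in_run:
            out.append(PDF)
            in_run = False
        out.append(ch)
    if in_run:
        out.append(PDF)
    return "".join(out)

# A word mixing standard Hebrew letters with one PUA letter in the middle.
word = "\u05D0\uE000\u05D1"
wrapped = wrap_pua_runs(word)

# A substring that straddles the PUA letter and the following letter.
needle = "\uE000\u05D1"
found_in_plain = needle in word
found_in_wrapped = needle in wrapped

print(found_in_plain, found_in_wrapped)
```

The three-letter word grows to five code points, and a search for two
adjacent letters no longer matches unless the search tool is taught to
skip the controls.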

The need will become even more urgent now that the UTC and WG2 will
spend most of their time discussing scripts that are rarely used,
where the cultural knowledge will be difficult to find. If we don't
have an easy way to experiment with their encodings, at least with
PUAs, for extended periods (because a long research period will be
needed, with conflicting experiments), those scripts will remain
unencoded in the UCS for a very long time. And in fact I doubt that
even WG2 or the UTC will have the resources to provide all this
effort without committing many critical errors that will plague the
long-term future.

We absolutely need a transition mechanism, and PUAs can be part of
this transition. For the same reason, the ability to support external
character properties would certainly help keep the resources needed
at the UTC and WG2 at a low level. This applies to characters that
are not encoded, to those encoded in separate efforts via PUAs, and
to those that will later be encoded but see little implementation and
deployment for many years. Most of the experiments could then be
performed independently, without depending on the release of a
putative version of the UCS that finally accepts the script.

But even in this case, for historic scripts, the encoding effort will
be hard to finalize: it is highly probable that those scripts will be
encoded progressively, starting with a minimum subset on which most
people agree, with many other characters left over that need longer
experiments or research. Those scripts will then need to support,
for a long time, a mix of standard assignments and PUAs at the same
time, for distinct small communities that will need to share and
discuss their agreements.

The current problem is that there is absolutely no transition
mechanism in the UCS encoding process: a character gets fully encoded
with most of its essential properties becoming normative, some of them
impossible to change later (even if there was an error or an
unexpected caveat that the interested communities had no chance to
experiment with before the character was finally approved by the UTC
and WG2).

Unicode should not interfere with what users want to do with PUAs.
After all, PUAs were made specifically for that. If users need to
assign their own property values to PUAs, they must be able to do
that. And these properties must find a way to be representable in the
current technology frameworks.
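
As one possible shape for such a representation, here is a Python
sketch of a machine-readable PUA agreement. The file layout (a code
point or range, a semicolon, a Bidi_Class value, with # comments) is a
hypothetical format that merely borrows the conventions of the UCD
data files; it is not an existing standard, and the names
parse_agreement and bidi_class are assumptions.

```python
import unicodedata

# Bidi_Class values a user-defined agreement is allowed to assign here.
VALID_BIDI = {"L", "R", "AL", "EN", "ES", "ET", "AN", "CS",
              "NSM", "BN", "B", "S", "WS", "ON"}

SAMPLE = """\
# hypothetical agreement for an experimental RTL script
E000..E01F ; R    # letters
E020..E029 ; AN   # digits
E02A       ; NSM  # combining mark
"""

def parse_agreement(text):
    """Parse the hypothetical agreement format into {code point: class}."""
    table = {}
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'range ; class'")
        rng, bidi = fields
        if bidi not in VALID_BIDI:
            raise ValueError(f"line {lineno}: unknown Bidi_Class {bidi!r}")
        first, _, last = rng.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        for cp in range(lo, hi + 1):
            # Only private-use code points (General_Category Co) may be
            # redefined; standard characters keep their properties.
            if unicodedata.category(chr(cp)) != "Co":
                raise ValueError(f"line {lineno}: U+{cp:04X} is not PUA")
            table[cp] = bidi
    return table

def bidi_class(ch, table):
    """Bidi_Class of ch, preferring the agreement over the default."""
    return table.get(ord(ch), unicodedata.bidirectional(ch))

table = parse_agreement(SAMPLE)
print(bidi_class("\uE000", table), bidi_class("A", table))
```

Restricting the table to private-use code points keeps such an
agreement from altering the normative properties of encoded
characters.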

If those frameworks refuse all changes (e.g. UTC/WG2 reject the
assignment of new RTL PUAs, or the UTC rejects a change of properties
for some PUA ranges, or the OpenType promoters don't want to integrate
custom character properties for PUAs assigned in fonts, or OpenType
layout engine implementers refuse to include a way to use an external
set of properties), there will be no other way than to create another
technology that does not require prior approval from existing
non-collaborative standards bodies (or implementers) whose strong
requirements cannot be satisfied, even with PUAs. This also means
that there will be independent developments of non-compliant UCS
implementations, which will later be hard to reconcile with the
current standard framework.

UCS promoters and designers must admit that they themselves had to
offer a transition mechanism in order to facilitate transition and
adoption. This is in fact what happened when the UTC and WG2 started
independent (and technically very different) efforts to create a
universal character set: they also had to accept transition
mechanisms with round-trip compatibility with the prior encodings
(both standard encodings at ISO and its NBs, and proprietary ones that
had been opened for more general use and integrated by the UTC in
what it supports). Transition mechanisms were also added later to
coordinate the efforts of the UTC and WG2 into a common UCS.

With all these mechanisms, we did not move suddenly from legacy
encodings to the UCS. In fact this is not limited to the encoding
work at the UTC and WG2: it still exists today in the OpenType format
(multiple "cmap" tables, multiple glyph formats, multiple typographic
feature formats). This means integrating optional features, and
designing a set of priorities that can enforce some common usage
policy, in order to converge later on a more stable situation and
wider adoption.
Received on Sun Aug 21 2011 - 17:55:28 CDT