Re: Proposed new characters updated in Pipeline Table

From: Philippe Verdy <>
Date: Sun, 14 Aug 2011 14:57:18 +0200

Isn't there also a new data file with beta status that lists the usage
of characters shared by multiple scripts? If so, it should also cover
the Arabic-Syriac number sign currently being proposed as a format
control in an Arabic block. That sign is actually an abbreviated
ligature of the Arabic word for year, with a subtending stroke that
can extend below several digits appearing on its right, but it is to
be encoded in text before a sequence of Arabic-Indic, Arabo-Persian,
or Syriac digits. The same applies to the "serpentine" format control
proposed in the same Arabic block: it is another "sort of" diacritic
that spans several digits of the Syriac or Arabic number encoded
after it, and so would be encoded the same way, as a format control.
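
As a rough illustration of what such a data file provides, here is a
minimal Python sketch that reads lines in the usual layout of Unicode
data files (a code point or range, a semicolon, then a list of script
codes, with "#" starting a comment). The name ScriptExtensions comes
from the quoted message below; the function name
parse_script_extensions and the sample entries are hypothetical and
are not copied from any published file.

```python
def parse_script_extensions(lines):
    """Map each code point to the set of script codes listed for it.

    Assumes the 'XXXX[..YYYY] ; Scr1 Scr2 # comment' layout.
    """
    table = {}
    for raw in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        cps, scripts = (field.strip() for field in line.split(";", 1))
        first, _, last = cps.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = set(scripts.split())
    return table


# Hypothetical entries, for illustration only (not real property data).
SAMPLE = [
    "# illustrative sample",
    "0730 ; Arab Syrc # single code point",
    "0733..0734 ; Arab Syrc # range",
]

table = parse_script_extensions(SAMPLE)
print(sorted(table[0x0730]))
```

With entries like these, a renderer or a spoof checker could ask
whether a character is expected in Arabic text, in Syriac text, or in
both, instead of treating it as usable with any script.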

Anyway, I don't know how fonts can support those format controls,
except for a limited number of digits, by splitting the glyph into
several parts (a fixed leading glyph, a variable number of middle
glyphs, and a trailing glyph) and using complex contextual
substitution rules to reorganize the list of glyphs. With OpenType
alone, given that OpenType renderers offer very limited control over
glyph order and no easy way to represent this transform, I fear that
it will not work. It could be possible, though, with a SIL Graphite
feature extension to TrueType or OpenType/TT (and a Graphite-enabled
renderer) and/or an Apple AAT feature extension to OpenType/CFF (on
Mac OS X only).
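
To make the splitting idea concrete, here is a minimal Python sketch
of the transform a layout engine would have to perform: a prefix sign
followed by N digits becomes one leading piece, N middle pieces, and
one trailing piece, followed by the digit glyphs. This only models
the glyph list; it is not OpenType, Graphite, or AAT code. U+0601
ARABIC SIGN SANAH is used purely as a stand-in for the proposed
characters, and all glyph names (sign.lead, sign.mid, sign.trail,
sign.isolated) are hypothetical.

```python
import unicodedata

SIGN = "\u0601"  # ARABIC SIGN SANAH, a stand-in for the proposed signs


def segment_glyphs(text, sign=SIGN):
    """Return a list of (hypothetical) glyph names for `text`.

    A sign followed by N decimal digits yields
    lead + N middle pieces + trail, then the N digit glyphs.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == sign:
            # Collect the run of decimal digits encoded after the sign.
            j = i + 1
            while j < len(text) and unicodedata.category(text[j]) == "Nd":
                j += 1
            digits = text[i + 1:j]
            if digits:
                out.append("sign.lead")
                out.extend("sign.mid" for _ in digits)
                out.append("sign.trail")
            else:
                out.append("sign.isolated")
            out.extend(f"digit.{unicodedata.digit(d)}" for d in digits)
            i = j
        else:
            out.append(f"char.{ord(ch):04X}")
            i += 1
    return out


# The sign followed by the Arabic-Indic digits 1, 2, 3.
print(segment_glyphs("\u0601\u0661\u0662\u0663"))
```

The number of middle pieces depends on the length of the digit run,
which is exactly the part that is hard to express with a fixed set of
substitution rules: each supported run length needs its own rule,
hence the "limited number of digits" caveat above.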

This requires some research, because it could ultimately influence the
encoding and the properties needed to get working support for these
newly proposed characters (which are not really format controls but
true characters with an uncommon layout that does not fit very well
with the current Unicode character encoding model, which has the
same caveat with the representation of double-width diacritics, and
which offers absolutely no layout support at all for Egyptian

Shouldn't the Unicode character encoding model be updated to better
take those complex layouts into account? (Let's remember how Hangul
was encoded to avoid the issue, assigning a LOT of characters to the
same logical characters and duplicating the consonants, or how the
Han repertoire keeps growing almost without limit, even though the
script has an evident structure for the complex layout of its
grapheme clusters, based on simpler base sinograms...)

Also, some research on integrating into the ISO standard for OpenType
some of the capabilities currently offered only in Graphite and/or AAT
would be welcome: the need to create three font flavors and to
maintain compatibility for documents rendered on several platforms
should come to an end, even if this means deprecating some older
features of OpenType, AAT and Graphite. As it stands, the existing
differences force documents to be tweaked in their encoding, and this
is a problem for the stability of the Unicode encoding itself, across
platforms and even across versions of their text renderers and layout
engines on the same platform...

If we continue like this, we'll end up with too many characters in
Unicode that are maintained for backward compatibility but do not
work effectively and are later replaced with new confusables... I do
think that newly proposed characters SHOULD take into account the
cross-platform compatibility of the proposed technical encoding,
within a CLEAR and UNIVERSAL character model. The character encoding
process at ISO and UTC is not enough; we now need more serious
cooperation with implementors of font technologies, especially for
scripts needing complex layouts (encoding with "format controls" will
not solve any
2011/8/13 Peter Constable <>:
> From: [] On Behalf Of mmarx
>> While I am at it:
>> should Unicode acknowledge that U+0730, U+0733, U+0736, U+073A
>> and U+073D are used with the ARAB script or does this lie outside its
>> competence and jurisdiction?
> Unicode has character properties declaring script associations for characters. Currently, all of those characters have a Script property of "Syriac". If characters are known to be used in multiple scripts, they generally should have a Script property of "Common"; there is also a provisional data file called ScriptExtensions that can be used to provide more information about Script = Common characters (e.g., the character is used for Syriac and Arabic scripts, but not generally _any_ script).
> Properties for the characters you mention could be changed. That will not happen unless a document is submitted to UTC that outlines the specific changes proposed and provides rationale for making those changes. The rationale should provide evidence in terms of existing usage scenarios and should also consider any potential issues that may arise for existing implementations (if applicable). The information should also be adequate for making changes in the Arabic and Syriac block descriptions in the text of the Standard so that implementers have a chance of learning about the scenarios and requirements; the proposal document could include draft text for insertion in the block descriptions.
> Submitting a doc to UTC is a basic requirement. The issue also needs to make it onto the agenda of a UTC meeting, and it helps to have a champion to make sure that happens and that can be available to discuss the issue with the UTC. These things are much easier if you are a member of the consortium (cost is as little as $35/yr for students).
> Peter
Received on Sun Aug 14 2011 - 07:59:51 CDT
