conflicting property value aliases for scripts (Qaac, Qaai)

From: verdy_p (
Date: Fri Jan 23 2009 - 19:36:09 CST

    I'm not writing about ISO 15924 (which is fine as it is), but about Unicode's "PropertyValueAliases.txt" file,
    which continues to suggest non-conforming preferred values and accepts some aliases that are not meant to represent
    those scripts.

    Two entries are problematic:

    sc ; Copt ; Coptic ; Qaac
    sc ; Qaai ; Inherited

    The first one still accepts "Qaac" (only as a secondary alias, though) even though it is not the preferred form. I
    can't see any place in the rest of the UCD where this code (meant for private use only) is used, so why is it kept
    there? I see absolutely no value in keeping this alias. It may have been used in preliminary encodings made before
    ISO 15924 even existed, but in that case "Qaac" could not have been referenced for Coptic either, and the preferred
    value is now "Copt" from the ISO 15924 standard (only "Coptic" could exist from past Unicode versions).

    The second one is problematic because TUS says that this is the "preferred" form. However, I can't see the
    rationale that leads us to prefer a non-interoperable private-use code here, even if "Inherited" script
    properties are needed for some Unicode algorithms. My opinion is that this "Qaai" code should really be removed
    there as well, and replaced by a stable (interoperable) code, like "Ziii", to be allocated in ISO 15924 (in a way
    similar to the line "sc ; Zyyy ; Common"). If no new ISO 15924 code is allocated, I think that "Inherited" should
    still be the preferred value, and "Qaai" should be removed from this list of aliases.
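
    As an illustration only, the concern about these two entries can be checked mechanically. The sketch below
    assumes the semicolon-separated layout of the "sc" lines quoted in this message and the ISO 15924 private-use
    range "Qaaa".."Qabx"; the names SAMPLE, is_private_use and private_use_aliases are hypothetical and are not part
    of the UCD or of any library.

```python
import re

# Sample input: the `sc` lines quoted in this message.
SAMPLE = """\
sc ; Copt ; Coptic ; Qaac
sc ; Qaai ; Inherited
sc ; Zyyy ; Common
"""

# Shape of an ISO 15924 alphabetic code: one capital, three lowercase letters.
CODE_RE = re.compile(r"[A-Z][a-z]{3}")


def is_private_use(code):
    """True if `code` is a four-letter code in the range Qaaa..Qabx."""
    return bool(CODE_RE.fullmatch(code)) and "Qaaa" <= code <= "Qabx"


def private_use_aliases(text):
    """Map each script's long name to its private-use aliases, if any."""
    flagged = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if fields[0] != "sc" or len(fields) < 3:
            continue  # not a script alias line
        long_name = fields[2]
        hits = [field for field in fields[1:] if is_private_use(field)]
        if hits:
            flagged[long_name] = hits
    return flagged


print(private_use_aliases(SAMPLE))
# prints {'Coptic': ['Qaac'], 'Inherited': ['Qaai']}
```

    Only the two entries discussed here are flagged; "Zyyy" is outside the private-use range and passes.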

    Note that the current use in the UCD conflicts with ISO 15924, where the codes "Qaaa".."Qabx" are private-use
    codes. For me, "Coptic" is not the same as "Private-Use", and "Inherited" is certainly not "Private-Use" but an
    effective script.

    I've just seen a bug caused by these two undesirable aliases: valid text was rejected because it appeared to use
    private-use scripts. To fix it, I had to drop these aliases from the implementation. This has no major effect on
    Coptic, but it meant several changes in various locations to use "Inherited" instead of "Qaai", or to make sure
    that "Qaai" does not get propagated to the rest of the application. The aliases also caused rendering problems
    elsewhere:
    * For Coptic, the text was rendered using incorrect fallbacks even though there was a matching font for it.
    * For Inherited, the effect was similar, with some diacritics being rendered separately from the base character.

    If no action is taken in the UCD to remove these aliases from PropertyValueAliases.txt, then I suggest that a note
    be added to the Annex of TUS describing the UCD, to warn users about these values:
    * "Qaac" should be strongly NOT recommended and marked as DEPRECATED/OBSOLETE, and the note should suggest that
    the preliminary applications still using it be updated to remove it. New applications should ignore this alias
    completely (this includes applications like Unicode Regular Expressions).
    * "Qaai" should not be used in the rest of the database, but if it is, applications should make sure that this code
    is not output from interfaces that query character properties. These interfaces should return "Inherited"
    instead, or any other convenient numeric mapping that applications may be using internally or in their interface,
    such as through enumeration datatypes. In such enumerations, the identifier used should not be Qaai, as
    application users will still need it for their own private properties. This identifier should also disappear
    from the public interfaces of libraries that are distributed with development tools. "Qaai" should then be
    completely "blackboxed" within the library implementing Unicode from the Unicode data files (but my opinion is
    that its internal database had better remove it as well).
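
    A minimal sketch of such "blackboxing", under stated assumptions: the internal table is taken to hold the "sc"
    entries quoted earlier, and the names _INTERNAL_LONG_NAMES, _HIDDEN_ALIASES, public_script_name and
    lookup_public_alias are hypothetical, not the API of any existing Unicode library.

```python
# Hypothetical wrapper around an internal script-property table.
# Every name below is an illustrative assumption.

# Internal data, as it might be loaded from the `sc` lines quoted earlier.
_INTERNAL_LONG_NAMES = {"Copt": "Coptic", "Qaai": "Inherited", "Zyyy": "Common"}

# Aliases the library keeps out of its public interface.
_HIDDEN_ALIASES = frozenset({"Qaac", "Qaai"})


def public_script_name(internal_code):
    """Return the long name for an internal code, never a hidden alias."""
    name = _INTERNAL_LONG_NAMES.get(internal_code, internal_code)
    if name in _HIDDEN_ALIASES:
        raise ValueError(f"no public name for {internal_code!r}")
    return name


def lookup_public_alias(alias):
    """Resolve a caller-supplied alias, ignoring the hidden ones.

    Returns the long name, or None when the library does not recognise the
    alias, which leaves "Qaac" and "Qaai" free for the caller's private use.
    """
    if alias in _HIDDEN_ALIASES:
        return None
    if alias in _INTERNAL_LONG_NAMES:
        return _INTERNAL_LONG_NAMES[alias]
    if alias in _INTERNAL_LONG_NAMES.values():
        return alias
    return None
```

    With this shape, public_script_name("Qaai") yields "Inherited", while lookup_public_alias("Qaai") and
    lookup_public_alias("Qaac") both yield None, so neither private-use code crosses the library boundary.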

    I also don't see the rationale that would forbid a specific formal allocation of "Ziii" in ISO 15924 for
    "Inherited", given that it accepts other values not used in Unicode. These represent differences that have been
    unified in Unicode into single scripts, or represent multiple scripts; "Latf" and "Latg" are good examples, but
    there are other examples for scripts that have been rejected from encoding in Unicode, like Tengwar. The ISO 15924
    standard is more tolerant than Unicode and ISO 10646, because it encodes scripts independently of the characters
    used to encode them, and the technical need for a standard ISO 15924 code for use in Unicode's Inherited property
    value aliases is enough for me to justify this addition. Note that "Common" makes no sense in ISO 15924 in the
    domain of bibliographic applications, and is present precisely for technical reasons (ISO 15924 is not meant to
    be reserved for librarians only).

    --- unrelated topic: defining scopes in ISO 15924 ---

    Also, I think that ISO 15924 should contain an additional "scope" field, similar to ISO 639-3:
    * "A" for single generic alphabetic scripts (including all alphabets, abjads, abugidas, ideo-phonographic and
    ideographic scripts): this "scope" should be set in most ISO 15924 codes.
    * "V" for single scripts with variant forms ("Hans", "Hant", "Latf", "Latg", "Syre", "Syrj", "Syrn") used in some
    languages with specific orthographic conventions that are not applicable to all texts and all languages using the
    generic alphabetic script. Such "multiple" scripts exist only because parts of their repertoires are shared and
    partly unified. My opinion is that they are not really scripts, but represent orthographic conventions added on
    top of the language+script tuple, and that, in the context of BCP 47 locales, they should be remapped to variant
    subtags for these orthographic conventions. However, this still represents a challenge for existing renderers and
    library standards, which still don't fully and correctly implement BCP 47, so these "variant" scripts are just
    legacy script codes used for technical reasons (they work in a way quite similar to ISO 639-5 language families
    and ISO 639-3 languages with "collection" scope).
    * "M" for codes representing multiple scripts that can be used simultaneously in the same text with the same
    language and orthographic convention ("Jpan", "Kore"). It is expected that more codes in this scope will be needed
    for bibliographic classification or localization purposes (unless the scripts that borrow subsets of some other
    script are extended to include their own local copy of the borrowed characters, as has been done in Latin for
    characters borrowed from Greek or Cyrillic and reencoded as "Latin"). These codes are not very useful for renderers
    except as legacy technical codes, but may still be needed by librarians for classification purposes (they work in
    a way quite similar to ISO 639-3 languages with "macrolanguage" scope); in fact, they should be replaced by the
    list of script codes they actually represent, if possible.
    * "N" for other notational scripts that require information beyond the encoded text to produce meaningful
    content, or that can't be converted simply into normal text for a language without the help of complex conversion
    rules and possibly user preferences in their locale ("Brai", "Zmth", "Zsym"). These codes can become
    significant as "scripts" if they are used along with other codes like a language code (in the subtag of a locale
    code), but become mapped to other scripts according to the convention.
    * "P" for all private-use codes ("Qaaa".."Qabx"): they are not meant for global interchange but are specific to
    local implementations within well-defined and restricted domains for only some users. They should not be part of,
    referenced directly by, or needed by any international standard, and those standards could include restrictions
    forbidding their use completely in some or all of the defined interoperable interfaces.
    * "S" for special codes needed for some technical applications where none of the scopes above can be significant or
    when no other code can be precisely determined (like "Zxxx", "Zyyy", "Zzzz", and... "Ziii": "code for characters
    with contextually inherited script", "codet pour caractères à écriture héritée du contexte"). They may be used and
    referenced in international technical standards (like Unicode). The ISO 15924 standard should list the standards
    that need these special codes and justify their existence, by referencing or linking to the appropriate documents
    that define them precisely and define their meaning and usage policy.
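
    To make the proposed field concrete, here is a rough sketch of how the six scopes could be represented. It is not
    an existing ISO 15924 field. The example codes are the ones listed above, "Ziii" is the code proposed in this
    message rather than an allocated one, and defaulting every unlisted code to "A" is an assumption made only for
    this illustration. The names Scope and scope_of are hypothetical.

```python
import re
from enum import Enum


class Scope(Enum):
    """The six scope values proposed in the list above."""
    ALPHABETIC = "A"
    VARIANT = "V"
    MULTIPLE = "M"
    NOTATIONAL = "N"
    PRIVATE = "P"
    SPECIAL = "S"


# Example codes taken from the list above; "Ziii" is only a proposal.
_EXAMPLES = {
    Scope.VARIANT: ("Hans", "Hant", "Latf", "Latg", "Syre", "Syrj", "Syrn"),
    Scope.MULTIPLE: ("Jpan", "Kore"),
    Scope.NOTATIONAL: ("Brai", "Zmth", "Zsym"),
    Scope.SPECIAL: ("Zxxx", "Zyyy", "Zzzz", "Ziii"),
}
_SCOPE_BY_CODE = {
    code: scope for scope, codes in _EXAMPLES.items() for code in codes
}

# Shape of an ISO 15924 alphabetic code: one capital, three lowercase letters.
_CODE_RE = re.compile(r"[A-Z][a-z]{3}")


def scope_of(code):
    """Return the proposed scope of a four-letter script code."""
    if not _CODE_RE.fullmatch(code):
        raise ValueError(f"not a four-letter script code: {code!r}")
    if "Qaaa" <= code <= "Qabx":  # private-use range
        return Scope.PRIVATE
    # Assumption: any code not listed above is a plain "A" script.
    return _SCOPE_BY_CODE.get(code, Scope.ALPHABETIC)
```

    Under this sketch, scope_of("Qaai") and scope_of("Qaac") both fall in scope "P", which is exactly the conflict
    with the UCD aliases described at the start of this message.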

    In this context, the "Zxxx" code for unwritten documents (with scope "S") needs a more formal definition, because
    its existence cannot be justified by external standards but only by the ISO 15924 standard itself; it may not be
    precise enough for effective applications that would need more precise codes (a code for "aural"/"vocal"
    documents, a code for "photographic" documents, a code for drawings and diagrams, a code for artistic "graphic"
    documents which have no reading at all but just interpretations, a code for other contents that can't even be
    reproduced correctly on printed paper, such as architectural designs, artistic objects, ...).

    The absence of any content (textual or not) would also merit its own ISO 15924 code (there's ambiguity in this case
    between "Zxxx", "Zyyy", and "Zzzz") to encode that NO script is actually present because there's simply no content
    at all; it could be anything if the content is added later and such addition is permitted (in that case this
    content will have another code):
    * Using "Zxxx" just encodes that if there's content, it cannot be text, but it does not really specify that other
    contents do not exist at all;
    * Using "Zyyy" for undetermined scripts is also inappropriate (look at the definition of the "Common" script type
    in Unicode, which is aliased to it) because it indicates that there can exist some encoded text;
    * Using "Zzzz" (to which "Unknown" is mapped) does not really encode the total absence of text, just the fact that
    no text could be encoded correctly (for example, the text could contain characters not yet encoded in Unicode that
    can be represented only using private-use characters or other means like graphics or facsimiles, those characters
    being possible candidates for encoding in a future script with its own code in ISO 15924 and then with its own
    characters in Unicode);
    * Some applications (or other users than me) may have their own interpretation and may opt to define their own
    policy about using one of the three codes above;
    * For this case, a code like "Zero" would be convenient to encode such emptiness and total absence of content
    (textual or not).
