RE: PH technical issues (was RE: Why Fraktur is irrelevant

From: Peter Constable
Date: Fri May 28 2004 - 16:58:09 CDT


    > From: Peter Kirk
    > Sent: Friday, May 28, 2004 1:40 PM

    > Well, I understood the semantic content of a text to be the meaning of
    > the words

    Unicode encodes characters, not languages, not morphemes, not senses of
    words. The character semantics of "Sally" and of "Sally" transliterated
    into Hebrew are not the same.

    > >Her "Phoenician words" in this case are probably something like her
    > >name, or a transliteration of English words.

    > Is it really in the scope of Unicode to encode such trivialities? I have
    > a key ring with my name "written" in an Egyptian hieroglyphic
    > pseudo-alphabet. Will such abuse of Egyptian hieroglyphs have to be
    > taken into account in the possible Unicode proposal for this script?

    Why is that an abuse of hieroglyphs any more than Hebrew text
    transliterated or transcribed in Latin characters, or Arabic text
    transcribed in Hangul characters? Unicode is uninterested in what the
    content of the text is; it encodes characters, not text. It is up to
    users and implementers to decide what texts those characters can

    So, absolutely, it is in the scope of Unicode.

    > Children invent all kinds of alphabets in which to write their names;
    > will all of these have to be encoded in Unicode?

    The scenario did not involve children inventing an alphabet; it involved
    students making a history presentation that touched on, among other
    things, the Phoenician script.

    > Well, if anyone has another scenario to propose, let's see it.


    Scenario (undesirable):

    The editor of a UCLA journal on ancient Indo-European linguistics
    receives submissions from numerous sources for publication in the
    journal. Certain formatting requirements are specified for submissions
    wrt the kinds of document elements used, paragraph formatting and
    overall page layout. As is often the case in similar situations,
    however, no constraints are placed on fonts used. Submissions are
    accepted in various file formats, including Word DOC, RTF, and certain
    XML or SGML languages. Once approved for publication, submissions will
    be converted to one common file format and will be typeset using one
    collection of fonts.

    Submissions regularly contain characters in a variety of scripts /
    writing systems: Latin, Cyrillic, Old Italic, IPA, various Latin
    transliteration schemes, etc. Very often, the submitted text is
    formatted using fonts that the editor does not herself have. In some
    cases, the submission is formatted but not consistently marked up; in
    other cases, the text is marked up to identify document elements but not
    formatted at all. Markup does not always identify the language of text
    as in some cases the language may be unknown, or the text is an
    analytical reconstruction and not in any actual, known language; and
    because authors cannot be assumed to know how to do this with their

    With some regularity, a submission makes reference to Phoenician
    characters or includes examples in Phoenician-script text. Also, on rare
    occasion, submissions will cite Hebrew-language words, which are
    intended to be presented with square Hebrew glyphs. The Phoenician
    characters have exactly the same encoded representation as the square
    Hebrew text. As a result, fallback or default formatting will cause all
    such text to appear with square Hebrew glyphs, and therefore before the
    editor can provide a draft to her panel of reviewers, she must go
    through a laborious process to carefully read each submission to ensure
    that what she provides to reviewers has the intended presentation as
    either Phoenician glyphs or square Hebrew glyphs. This adds to her
    workload in reviewing all submissions, and especially so for any
    submissions that contain either Hebrew or Phoenician. On some occasions,
    this leads to costly delays in publication. On some occasions, incorrect
    glyphs are not spotted in proofs until after publication, requiring
    additional work to add corrigenda to subsequent editions, and detracting
    from the perceived quality of the journal as a whole.

    Alternate scenario (desirable):

    The editor receives submissions as described above. Because Phoenician
    script and Hebrew script are encoded distinctly, there is never any
    concern as to how text provided to reviewers will appear. She saves many
    hours of work both in preparing submissions for reviewers and in final
    typesetting. Embarrassing errors and the need to publish corrigenda are
    significantly reduced.

    Now tell me that's an unrealistic or trivial scenario.
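
The difference between the two scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Phoenician code points used here (U+10900..U+1091F) are the block assigned in a later version of the Unicode Standard, after this message was written, and the helper name `scripts_used` is hypothetical. With distinct encodings, the intended script is recoverable from the plain text alone, without fonts or markup.

```python
# Block ranges as published in the Unicode Standard. The Phoenician block
# was assigned after this message was written; it is assumed here only to
# illustrate the "desirable" scenario.
HEBREW_BLOCK = range(0x0590, 0x0600)        # U+0590..U+05FF
PHOENICIAN_BLOCK = range(0x10900, 0x10920)  # U+10900..U+1091F


def scripts_used(text):
    """Return the set of script names (Hebrew, Phoenician) found in text.

    Hypothetical helper: it looks only at code points, never at fonts
    or markup, which is the point of the scenario.
    """
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in HEBREW_BLOCK:
            found.add("Hebrew")
        elif cp in PHOENICIAN_BLOCK:
            found.add("Phoenician")
    return found


# The same two letters (alef/alf, bet) in each encoding.
hebrew_word = "\u05D0\u05D1"
phoenician_word = "\U00010900\U00010901"

print(scripts_used(hebrew_word))
print(scripts_used(phoenician_word))
```

If both words were encoded with the Hebrew code points, as in the undesirable scenario, the function would report Hebrew for both and the editor would have to rely on fonts or markup to tell them apart.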

    > Well, I have used Shoebox and Toolbox. I have also used your company's
    > products, which at least allow me to add a script name field to my
    > database but don't allow me to tailor collations. But I was thinking in
    > terms of tailored collation weights for the Unicode collation
    > These are much more complex than setting up a new language
    > for Shoebox or Toolbox.

    I suspect few Semitic paleographers are using MS database products.
    Also, from what I have seen, it is not at all uncommon for researchers
    in academia to have access to technology-support staff, including
    programmers. Not necessarily in every case, but every time I've
    interacted with someone associated with a university on such issues,
    they have had access to some kind of support of this type. (That's one
    of the things their funding requests are for.) Moreover, unless I'm
    mistaken, the collation weights in this case would *not* be difficult to
    deal with, and in addition there have already been offers to do that

    Moreover, the Semitic paleographers have indicated that their preference
    is to encode all of their text using the square Hebrew characters, so
    the character-folding issue is at best an occasional concern that many
    will never actually have to deal with.

    I'm still completely unconvinced that the need for character folding is
    a significant impediment.
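
To give a sense of scale, here is a minimal sketch of what such character folding might look like for searching. It rests on assumptions that are not part of this thread: the Phoenician code points are those of the block assigned in a later version of Unicode, the letter-by-letter pairing of the 22 Phoenician letters with the 22 non-final Hebrew letters is my own illustrative mapping rather than anything the standard specifies, and the names `FOLD` and `fold` are hypothetical.

```python
# The 22 non-final Hebrew letters, alef through tav, in alphabetical order.
# The five final forms (U+05DA, U+05DD, U+05DF, U+05E3, U+05E5) are skipped.
HEBREW_BASE = [
    0x05D0, 0x05D1, 0x05D2, 0x05D3, 0x05D4, 0x05D5, 0x05D6, 0x05D7,
    0x05D8, 0x05D9, 0x05DB, 0x05DC, 0x05DE, 0x05E0, 0x05E1, 0x05E2,
    0x05E4, 0x05E6, 0x05E7, 0x05E8, 0x05E9, 0x05EA,
]

# Assumed pairing: the n-th Phoenician letter (starting at U+10900)
# folds to the n-th non-final Hebrew letter.
FOLD = {0x10900 + i: cp for i, cp in enumerate(HEBREW_BASE)}

# Also fold Hebrew final forms to their base letters so that both
# spellings of a word compare equal after folding.
FOLD.update({
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
})


def fold(text):
    """Fold Phoenician letters and Hebrew final forms to base Hebrew letters."""
    return text.translate(FOLD)


# "mlk" written in Phoenician letters and in square Hebrew (with final kaf).
phoenician_mlk = "\U0001090C\U0001090B\U0001090A"
hebrew_mlk = "\u05DE\u05DC\u05DA"

print(fold(phoenician_mlk) == fold(hebrew_mlk))
```

A search tool could apply `fold` to both the query and the indexed text, so that a query in either script matches text in either script.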

    Peter Constable
    Globalization Infrastructure and Font Technologies
    Microsoft Windows Division
