Re: character set : imprecise terminology to use for general audience

From: Philippe Verdy (
Date: Mon Mar 06 2006 - 21:28:05 CST

    Do you mean that the definitions need to be reformulated in a separate document for Semitists, using terminology that they already know and use today? That would require writing definitions that use only the terms they know but remain compatible with the Unicode definitions, if you want to expose the encoding problems.

    The main problem will arise immediately with the distinction between character and glyph. What is a character for Semitists? Do they also have a concept of abstract characters that span several writing styles, but make additional distinctions that Unicode has unified in the abstract character model it uses for encoding?

    If so, the only way to solve the problem is to show how Semitists' characters translate into Unicode sequences of codes. (Don't use the term "character" for these encoded sequences or you'll confuse the Semitists; use the term "code" instead, and keep their concept of "character" clean and separate from the Unicode definitions.)

    You will also need to include a glossary that converts Semitists' character names into sequences of Unicode character names (but, to avoid confusion here, say "code name", not "character name"). Also use "code point" for the numeric values associated with codes, because I don't think they use this terminology; or maybe "Unicode code position", if they regard Unicode as just another encoding, comparable to other encodings that use "code position" for the numeric values they assign.
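
    As a rough illustration of such a glossary, here is a minimal Python sketch that lists the "code point" and "code name" of every code in a sequence. The function name code_glossary is only my invention for this example, and the sample sequence (what a Semitist might call "bet with dagesh") is an assumption about their usage, not something taken from a real glossary.

```python
import unicodedata

def code_glossary(text):
    """Return (code point, code name) pairs for each code in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Hypothetical Semitist "character": bet with dagesh, encoded as two codes.
for point, name in code_glossary("\u05D1\u05BC"):
    print(point, name)
```

    One Semitist "character" comes out here as two glossary lines, which is exactly the mismatch the glossary has to make visible.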

    For Hebrew Semitists, avoid speaking about "combining classes", because the Unicode concept does not match anything in their language. Use "Unicode normalized reordering classes" instead, because that clearly shows the caveats.
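
    To show why "reordering classes" is the more honest name, the following Python sketch prints the class of each code and then lets normalization reorder the marks. The helper name reordering_classes and the sample sequence are assumptions chosen for this example; the class values come from Python's unicodedata module.

```python
import unicodedata

def reordering_classes(text):
    """Return (code point, canonical combining class) for each code."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

# Entered in the order bet, dagesh, patah.
typed = "\u05D1\u05BC\u05B7"
print(reordering_classes(typed))

# Normalization sorts the marks after the letter by ascending class,
# so patah (class 17) moves in front of dagesh (class 21).
normalized = unicodedata.normalize("NFD", typed)
print(reordering_classes(normalized))
```

    The classes say nothing about how the marks combine visually; they only fix the order in which normalization stores them.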

    The strangest part for them will be the "CGJ" code (U+034F COMBINING GRAPHEME JOINER), and the hard part will be finding a way to explain what it means. For them this "character" does not even exist, even though it is needed to encode some Hebrew texts with Unicode codes. So it will be helpful to describe an algorithm that translates their characters into lists of codes, and then an algorithm that removes the unnecessary CGJ codes and reorders the remaining codes into a normalized form.
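
    The second of these algorithms could be sketched in Python as follows. This is only one possible reading of "unnecessary": I assume here that a CGJ has no role other than blocking the reordering of marks, so it can be dropped whenever the text stays unchanged under normalization without it. The function name strip_unneeded_cgj and the choice of NFD as the normalized form are assumptions of this example.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def strip_unneeded_cgj(text):
    """Normalize to NFD, dropping each CGJ whose removal reorders nothing."""
    text = unicodedata.normalize("NFD", text)
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            candidate = text[:i] + text[i + 1:]
            # If the text is still in normalized form without this CGJ,
            # the CGJ was not protecting any order: drop it.
            if unicodedata.normalize("NFD", candidate) == candidate:
                text = candidate
                continue  # re-examine the code now at index i
        i += 1
    return text

# lamed, patah, CGJ, hiriq: the CGJ keeps patah (17) before hiriq (14).
needed = "\u05DC\u05B7\u034F\u05B4"
# lamed, hiriq, CGJ, patah: already in class order, CGJ is superfluous.
unneeded = "\u05DC\u05B4\u034F\u05B7"

print(strip_unneeded_cgj(needed) == needed)
print(strip_unneeded_cgj(unneeded) == "\u05DC\u05B4\u05B7")
```

    A CGJ that serves another purpose (for example to influence sorting) would be wrongly removed by this sketch, so a real description for Semitists would have to state which uses of CGJ it considers.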

    For such an algorithm, it will be helpful to create "named sequences of codes", with names matching those used by Semitists. This can be summarized in a simple conversion table as well, even if those sequences are not yet part of the standard "Unicode named sequences" (or never will be standardized by Unicode, because such a table could contain codes that are not part of Unicode but of a rich-text format that can describe some complex character layouts).
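
    Such a conversion table could be as simple as the Python sketch below. The names in the table are placeholders that I made up for the example; a real table would use the names that Semitists actually use, and the function name encode is hypothetical too.

```python
# Hypothetical Semitist-facing names mapped to sequences of Unicode codes.
NAMED_SEQUENCES = {
    "bet": "\u05D1",
    "bet with dagesh": "\u05D1\u05BC",
    # Needs a CGJ (U+034F) so that patah stays before hiriq.
    "lamed with patah then hiriq": "\u05DC\u05B7\u034F\u05B4",
}

def encode(names):
    """Translate a list of named sequences into one string of codes."""
    return "".join(NAMED_SEQUENCES[name] for name in names)

encoded = encode(["bet with dagesh", "bet"])
print([f"U+{ord(ch):04X}" for ch in encoded])
```

    An unknown name raises KeyError here, which is probably what you want while the table is still being collected from Semitists.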

    ----- Original Message -----
    From: "E. Keown" <>
    To: <>
    Cc: <>
    Sent: Monday, March 06, 2006 5:16 PM
    Subject: character set : imprecise terminology to use for general audience

    > March 2006
    > Hello:
    > Thanks to all who wrote in.
    > I think that Tim Greenwood understood me the best,
    > so I used his note in the Subject line.
    > During the dreadful P-debate in 2004 (? how time flies
    > when you're having fun), I became more and more aware
    > that Semitists and Unicoders can't communicate. At
    > all, pretty much, I think.
    > So I'm not sure that even *one* Semitic epigrapher
    > understood the issues involved.
    > That makes it very difficult to collect accurate
    > responses from real Semitists.
    > I'm grateful to learn that Unicode has a terminology
    > section, but I'm sure that it would have to be heavily
    > edited to produce a glossary that a total outsider
    > could understand.
    > Elaine Keown

    This archive was generated by hypermail 2.1.5 : Mon Mar 06 2006 - 21:31:34 CST