Re: Representative glyphs for combining kannada signs

From: Philippe Verdy (
Date: Mon Mar 27 2006 - 11:11:15 CST

  • Next message: Antoine Leca: "How to encode abbreviations [Was: Representative glyphs for combining kannada signs]"

    From: "Antoine Leca" <>
    >> Definitely. In this particular case one may debate whether to use
    >> markup or to (ab)use U+1D50 MODIFIER LETTER SMALL M and
    > Put it in clear: to write the French equivalent of Mrs, I can:
    > - either write the slightly incorrect Mme
    > - or write the more "correct" M[][] (where [] represent the empty box that
    > everybody except four cats will effectively see).
    > Somewhere I am thinking this is *not* a working solution.

    I think the same way. Using modifier letters that don't have the semantics of ordinary Latin letters is a bad choice. They were created mostly for IPA notation, not for denoting general superscripts, so they carry other information and should preferably reproduce IPA usage.

    There even exist languages in which those letters are considered completely distinct from the non-superscript Latin letters. Unicode gives these letters the general category "Lm", and they are NOT cased as they would need to be to transcribe the French abbreviation "Mme", where the last two letters are semantically a normal "m" and a normal "e", merely written in superscript to denote the abbreviation by showing that they are the final letters of the abbreviated word (the leading letters are not superscripted in this abbreviation rendering style).
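    [Editor's note: the properties described above are easy to verify with Python's standard unicodedata module; this small check is not part of the original message. It shows that U+1D50 MODIFIER LETTER SMALL M is category "Lm", has no case mapping, and only folds back to a plain "m" under compatibility normalization.]

```python
import unicodedata

# U+1D50 MODIFIER LETTER SMALL M: a modifier letter, not a cased Latin letter
sup_m = "\u1D50"
print(unicodedata.category(sup_m))   # Lm  (modifier letter)
print(unicodedata.category("m"))     # Ll  (lowercase letter)

# Modifier letters carry no case mappings, so upper() leaves them unchanged
print(sup_m.upper() == sup_m)        # True: no uppercase form exists
print("m".upper())                   # M

# Only compatibility normalization (NFKC) folds it back to a plain 'm',
# destroying the superscript presentation in the process
print(unicodedata.normalize("NFKC", "M" + sup_m + "e"))  # Mme
```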

    The superscript notation is very general and usable for any kind of abbreviation, including letters that have no superscript-styled encoding in Unicode. So if we start using them, we'll soon need to re-encode the whole Latin alphabet as superscripts (including diacritics...).

    The only reasonable way to keep the semantics of the letters would be to have formatting controls in Unicode that specify which letters denote the final or initial letters of an abbreviation. (Note: I'm not proposing such an addition; I'm sure plenty of people would oppose it.) It would be something like this, with combining controls:

    * "M"
    * "me"

    Or it could be something like this, using combining alternate diacritics to mark the abbreviation letters:
    * "M"

    With those additional rules:
    * The combining abbreviation marks would have combining class 0 and should therefore be encoded at the end of each combining sequence that encodes the abbreviation; they would have general category "Mc" (Mark, spacing combining) or possibly "Cf" (Other, format).
    * They would be ignorable in the default Unicode collation table, but the table could be tailored if one needs to distinguish ordinary letters from letters explicitly marked as part of an abbreviation.
    * Processes that don't know the semantics of these marks would be allowed to treat them like other formatting controls.

    It would be much better than using the existing superscripts, which don't have the needed semantics, because:
    * It would allow encoding arbitrary abbreviations, without limitation on the usable letters;
    * It would not require that the effective rendering use a superscript style (it could just as well be rendered with only a smaller font size, aligned on the baseline);
    * Full-text search for the common simple "Mme" notation (on the baseline, with the same font style) would remain possible in documents that use such marks;
    * Text effects would also remain possible (for example, rendering the whole text with small capitals instead of lowercase letters), using only the standard case mappings defined on lowercase letters plus a font-size adjustment, or the built-in small-caps support of fonts or renderers...
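    [Editor's note: a minimal sketch of the full-text-search point above, not part of the original message. Since no such abbreviation mark is actually encoded, a Private Use code point (U+E000) stands in for it here, chosen purely for illustration. A process that treats the mark as ignorable finds the plain baseline spelling "Mme" directly.]

```python
# Hypothetical stand-in for the proposed combining abbreviation mark.
# U+E000 is a Private Use code point used only for this sketch; no such
# character is encoded in Unicode.
ABBR_MARK = "\uE000"

def fold_abbreviation_marks(text: str) -> str:
    """Treat the hypothetical mark as default-ignorable: drop it for matching."""
    return text.replace(ABBR_MARK, "")

# "Mme" encoded as M + m<mark> + e<mark>, per the scheme described above
encoded = "M" + "m" + ABBR_MARK + "e" + ABBR_MARK

# A search for the plain baseline spelling still succeeds
print("Mme" in fold_abbreviation_marks(encoded))  # True
print(fold_abbreviation_marks(encoded))           # Mme
```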

    Because such extra marks (I don't know which would be better: combining marks or formatting controls) would also be needed for other kinds of out-of-band semantic annotation of text, maybe a whole set of such controls or marks should be provided.

    But let's not go that far: if we start down this path, we are reinventing the wheel, namely a markup scheme for annotated texts, which is exactly what the core set of XHTML 1.0 Strict defines in its structural text module (quotations, citations, paragraphs, headers, spans of text for abbreviations, keyboard input, source code...). Is it the job of Unicode to step on the toes of other standards that already depend on Unicode to safely implement such markup without conflicts of interpretation?

    This archive was generated by hypermail 2.1.5 : Mon Mar 27 2006 - 11:16:36 CST