RE: Re[2]: marks

From: Philippe Verdy (
Date: Thu Sep 27 2007 - 07:27:47 CDT

    > -----Original Message-----
    > From: [] On behalf
    > of Dmitry Turin
    > Sent: Thursday, September 27, 2007 08:11
    > To:
    > Subject: Re[2]: marks
    > William,
    > WJP> can be done quite nicely
    > WJP> using markup, e.g.: <font case="upper">foo</font> or whatever.
    > (1) You version is markup language (like HTML) instead of simple text.
    > I wrote about usual case.
    > (2) My proposal not only economize mark-place in table of encoding
    > (what is important itself), but also simplifies comparison
    > of various variants of spelling (all letters are lower-case,
    > first letter is upper-case, all letters are upper-case),
    > because comparison is reduced to comparison in one variant
    > of spelling (all letters are lower-case).

    For (2), your option is not needed. All the solutions are already
    standardized in the Unicode standard itself. There's nothing wasted in the
    Unicode standard due to the encoding of capitals.

    You also seem to assume that capitals have the same semantics as small
    letters. Maybe this is true in Russian, but it does not apply to the many
    languages that have strict rules about the use of capitals, some of which
    even attach a difference in meaning to them. If you ignore capitals in
    those languages, you'll find matches that are unrelated (in Italian, for
    example, "uno" is not a synonym of "UNO"). Even in proper names, your
    assumption that only one leading capital is needed is WRONG: there may be
    NO capital on the first letter (for example with prefixes), and/or a
    required capital in the middle of a proper name, with NO separator or
    space between those parts of the name.
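
    A minimal sketch of that last point, in Python. The scheme below is only
    my reading of the proposal (lowercase text plus a single case mark); the
    names encode_with_case_mark, decode_with_case_mark and survives are
    hypothetical, made up for this illustration:

```python
# Hypothetical scheme: store every word in lowercase plus ONE case mark
# ("lower", "title" or "upper"). This is an assumed reading of the proposal,
# not anything defined by Unicode.

def encode_with_case_mark(word):
    lowered = word.lower()
    if word == lowered:
        return "lower", lowered
    if word == word.upper():
        return "upper", lowered
    return "title", lowered  # anything else is forced into "first letter capital"

def decode_with_case_mark(mark, lowered):
    if mark == "upper":
        return lowered.upper()
    if mark == "title":
        return lowered[:1].upper() + lowered[1:]
    return lowered

def survives(word):
    """True if the word comes back unchanged after encode + decode."""
    return decode_with_case_mark(*encode_with_case_mark(word)) == word

for name in ["rome", "Rome", "ROME", "McDonald", "d'Alembert"]:
    print(name, survives(name))

# Folding case also erases the distinction between "uno" and "UNO":
print("uno".casefold() == "UNO".casefold())
```

    The first three words survive; "McDonald" and "d'Alembert" do not, because
    one mark per word cannot record a capital in the middle of a name or a
    lowercase prefix in front of it.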

    Really, your suggestion will just complicate things. Capitals have long
    been considered separate letters, and have always been encoded separately
    (except possibly in the early period of telegraphy, with very reduced
    alphabets in which ONLY capitals could be used; all lowercase letters had
    to be capitalized, which made the texts difficult to read, and there was
    not even support for other needed distinctions such as accents).

    Your suggestion looks as if you wanted to return to the age of the
    telegraph. In that case, you don't need Unicode at all, and not even 7-bit
    ASCII: use the 6-bit or 5-bit Baudot-like encodings! And then try to
    transport meaningful texts for many languages... You'll lose much more.

    Stop your suggestions here and consider the layered approach that
    simplifies all these problems. Unicode has encoded a number of characters
    only to offer roundtrip compatibility with widely used legacy encodings
    (they would not be accepted if they were requested today without use in
    prior standards), but all the rest is encoded according to principles, sets
    of rules, and usage algorithms that make it work without needing to encode
    too many characters.
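
    To make "roundtrip compatibility" concrete, here is a small Python sketch.
    The helper name roundtrips is hypothetical; the codecs and the unicodedata
    calls are from the Python standard library. KOI8-R and Latin-1 are picked
    only as examples of legacy encodings, and U+FB01 as one example of a
    character carrying a compatibility decomposition:

```python
import unicodedata

def roundtrips(codec):
    """True if every byte value survives bytes -> Unicode -> bytes unchanged."""
    original = bytes(range(256))
    return original.decode(codec).encode(codec) == original

# Two legacy 8-bit encodings that define all 256 byte values.
print(roundtrips("koi8_r"))
print(roundtrips("latin_1"))

# U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition to "f" + "i":
# it is encoded as its own character, yet the rules relate it to plain letters.
ligature = "\ufb01"
print(unicodedata.decomposition(ligature))
print(unicodedata.normalize("NFKC", ligature))
```

    The point of the sketch is only that text in such an encoding can go into
    Unicode and come back byte for byte, while normalization supplies the link
    between a compatibility character and the ordinary letters.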

    Consider also the encoded capitals: how many will you find? Not so many.
    They are a very small part of the Unicode assigned code points, and their
    number doesn't grow much, because capitals presuppose a bicameral
    alphabetic script, and there are not so many scripts with such a feature:
    chiefly Latin, Greek, and Cyrillic.
    Case mappings are already working perfectly with those scripts, as well as
    collation. There's no difficulty with case-insensitive searches, the
    algorithms are extremely simple and fast in their implementation. There's
    much more difficulty when handling letter variants (like those with accents,
    diacritics, and contractions in collations like digraphs in some languages).
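
    As a rough illustration in Python (a sketch, not a full collation
    implementation): str.casefold() applies Unicode case folding, which is all
    a simple caseless match needs. The helpers caseless_equal and strip_marks
    are hypothetical names, and stripping combining marks is a crude, lossy
    shortcut that is assumed here only to show where the extra work starts:

```python
import unicodedata

def caseless_equal(a, b):
    """Simple caseless match using Unicode case folding."""
    return a.casefold() == b.casefold()

def strip_marks(text):
    """Crude accent removal: decompose (NFD), then drop combining marks.
    Lossy and language-dependent; real collation needs tailored rules."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Case is the easy part, in any bicameral script:
print(caseless_equal("МОСКВА", "москва"))   # True
print(caseless_equal("ΑΘΗΝΑ", "αθηνα"))     # True
print(caseless_equal("Straße", "STRASSE"))  # True

# Accents are not handled by case folding at all:
print(caseless_equal("résumé", "resume"))               # False
print(caseless_equal(strip_marks("résumé"), "resume"))  # True
```

    Contractions such as digraphs that sort as one letter are not touched by
    either helper; they need language-specific collation rules.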

    Your suggestion does not solve any problem that is not already solved; it
    just adds more complexity (it does not work with roundtrip compatibility,
    and there are many other reasons why your solution is even more
    complicated than what is already encoded now). You have completely
    forgotten the goals of the Unicode standard.

    This archive was generated by hypermail 2.1.5 : Thu Sep 27 2007 - 07:30:02 CDT