RE: Re[4]: marks

From: Philippe Verdy (
Date: Fri Sep 28 2007 - 13:47:16 CDT

    Dmitry Turin wrote:
    > Sent: Friday, 28 September 2007 11:52
    > To:
    > Subject: Re[4]: marks
    > Philippe,
    > >> (2) My proposal not only economize mark-place in table of encoding
    > >> (what is important itself), but also simplifies comparison
    > >> of various variants of spelling (all letters are lower-case,
    > >> first letter is upper-case, all letters are upper-case),
    > >> because comparison is reduced to comparison in one variant
    > >> of spelling (all letters are lower-case).
    > PV> There's nothing wasted in the
    > PV> Unicode standard due to the encoding of capitals.
    > +
    > PV> case-insensitive searches, the
    > PV> algorithms are extremely simple and fast in their implementation
    > These algorithms are unnecessary in general.

    Unnecessary ?!?!?

    These algorithms are used and implemented everywhere (at least in their most
    basic form, for handling the Basic Latin subset, but this is still an
    implementation of the algorithm, widely understood, and found in almost all
    applications, libraries and OSes handling text data, written decades ago
    and still used in every computer today!)
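
    To show how small the basic case is, here is a minimal sketch in Python
    (the language and the function names are chosen for illustration only)
    of a case-insensitive comparison restricted to the Basic Latin letters:

```python
def ascii_lower(ch: str) -> str:
    """Map 'A'..'Z' to 'a'..'z'; leave every other character untouched."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch


def ascii_equal_ignore_case(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for Basic Latin letters only."""
    if len(a) != len(b):
        return False
    return all(ascii_lower(x) == ascii_lower(y) for x, y in zip(a, b))
```

    For text beyond Basic Latin, Python's built-in str.casefold() is the
    usual starting point instead of a hand-written table.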

    Really, you may want a revolution, but then you need to consider the huge
    cost of the conversion, and handle the transition for users who have always
    made distinctions between capitals and small letters,
    including linguists who need them in their standardized orthographies,
    where there are even strict rules about their usage (though not in all
    languages; in some, their usage may be quite liberal).

    Then handle the tricky things that will appear in technical notations making
    STRICT distinctions between lowercase and uppercase letters: think about
    the Base64 representation of binary data, and what such reencoding would mean
    for encapsulating these chains of data in other protocols like Email and
    networking protocols. Think about numeric parsers that will have to handle
    simple hexadecimal data, and parse an additional unneeded symbol.
    Think about phonetic transcriptions that would no longer be searchable if
    you remove the distinction between small letters and capitals, and that
    would need to be parsed contextually by looking for some prior "symbol" or
    control somewhere at an unknown distance.
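
    A short Python sketch of the Base64 and hexadecimal points. The base64
    module and int() are standard; parse_hex_with_marker is a hypothetical
    helper that only illustrates the extra step a "#"-aware parser would need:

```python
import base64

# Base64 is case-sensitive: "D" and "d" are different 6-bit digits,
# so these two strings decode to different bytes.
decoded_upper = base64.b64decode("QUJD")
decoded_lower = base64.b64decode("QUJd")


def parse_hex(text: str) -> int:
    """Ordinary hexadecimal parsing; accepts both 'ff' and 'FF'."""
    return int(text, 16)


def parse_hex_with_marker(text: str) -> int:
    """Hypothetical parser that must first strip the '#' capital marker."""
    return int(text.replace("#", ""), 16)
```

    An existing parser given "#f#f" simply rejects it (int("#f#f", 16) raises
    ValueError), so every such parser would have to be changed.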

    Think about those algorithms that try to extract substrings, including text
    parsers used for linguistic analysis: what is the rule for inserting your
    proposed control? How many controls will you need?

    Think about concatenation with your notation: what is the result of "#o"
    plus "#nu": "#o#nu" or "#onu"? Your notation introduces new unexpected
    equivalents that applications would need to recognize, instead of just
    having to handle the concatenation of "O" plus "NU" as "ONU", from which it
    is simple to extract substrings... Now think about the effect of word
    breakers, line breakers, and layout rendering: what is the
    scope of application of your "#" control? If such scope is to be unambiguous, then
    the only safe choice would be to limit this scope to the next
    character only, so that you'll always need to write "#o#n#u" and not "#onu".
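
    The scope problem can be sketched in Python as follows. Both decoders are
    assumptions made for illustration (one reading where "#" covers the rest
    of the word, one where it covers only the next character), and a literal
    "#" in the text is not handled:

```python
MARK = "#"


def encode_char_scope(text: str) -> str:
    """Encode each Basic Latin capital as MARK + the small letter."""
    return "".join(MARK + ch.lower() if "A" <= ch <= "Z" else ch for ch in text)


def decode_char_scope(text: str) -> str:
    """MARK capitalizes only the single character that follows it."""
    out = []
    capitalize_next = False
    for ch in text:
        if ch == MARK:
            capitalize_next = True
        else:
            out.append(ch.upper() if capitalize_next else ch)
            capitalize_next = False
    return "".join(out)


def decode_word_scope(text: str) -> str:
    """MARK capitalizes every following letter up to the next non-letter."""
    out = []
    in_caps = False
    for ch in text:
        if ch == MARK:
            in_caps = True
        elif ch.isalpha():
            out.append(ch.upper() if in_caps else ch)
        else:
            in_caps = False
            out.append(ch)
    return "".join(out)
```

    With the word-scope reading, "#o#nu" and "#onu" both decode to "ONU", so
    two different spellings exist for one text. With the character-scope
    reading, encoding "O" and "NU" separately and joining the results gives
    exactly the encoding of "ONU", namely "#o#n#u".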

    Your proposal is also inconsistent: you propose two distinct controls for
    encoding all-caps (I'll write it as "*") and leading-cap (I'll write it as "#", like
    you did). This means that you now have "#o#n#u" and "*onu" encoding the same
    text, where capitals are encoded differently. Now extract the initial letter
    of both strings: is it "#o" or "*o"? There's no way to decide, since in
    both cases it is the initial capital letter of the same word
    "Organisation"... And it's illogical to encode the same capital letter in
    different ways.
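
    A rough Python sketch of this inconsistency, assuming "#" capitalizes the
    next letter and "*" capitalizes the rest of the word (decode and
    first_encoded_letter are hypothetical names, not part of any proposal):

```python
def decode(text: str) -> str:
    """Decode '#' (next letter capital) and '*' (rest of word capital)."""
    out = []
    cap_next = False
    cap_word = False
    for ch in text:
        if ch == "#":
            cap_next = True
        elif ch == "*":
            cap_word = True
        elif ch.isalpha():
            out.append(ch.upper() if (cap_next or cap_word) else ch)
            cap_next = False
        else:
            # A non-letter ends the scope of both controls.
            cap_next = False
            cap_word = False
            out.append(ch)
    return "".join(out)


def first_encoded_letter(text: str) -> str:
    """Return the leading control (if any) together with the first letter."""
    return text[:2] if text[:1] in ("#", "*") else text[:1]
```

    Both "#o#n#u" and "*onu" decode to "ONU", yet the extracted initial is
    "#o" in one case and "*o" in the other.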

    In conclusion, the "*" proposal (next word in all capitals) is superfluous
    and just complicates things. So if only your "#" proposal remains (next
    letter only in capital), this means that you have reencoded all existing
    individual capitals from "A" to "#a", and... doubled the size of texts that
    use capitals only. What is the benefit, given that Unicode will still maintain
    the encoding of all existing capital letters?

    Now suppose that Unicode accepted your "#" proposal only (the only one
    producing consistent results for text algorithms, and whose effect is to
    modify only the next letter): it would have to become a format control (using the
    Unicode terminology), usable separately and ignorable in some conditions. But
    then what would be the meaning of "#1" or "#!"? Not all subsequent
    characters would be letters of a bicameral script!
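
    To make the "#1" question concrete, here is a Python sketch. The
    keep_orphan_marker flag is a hypothetical policy switch: it shows that a
    decoder is forced to pick one of at least two behaviours when the
    character after "#" has no case:

```python
def is_cased(ch: str) -> bool:
    """True if the character changes under upper/lower case mapping."""
    return ch.lower() != ch.upper()


def decode(text: str, keep_orphan_marker: bool) -> str:
    """Decode '#' + letter as a capital; apply a policy to '#' + non-letter."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "#" and i + 1 < len(text):
            nxt = text[i + 1]
            if is_cased(nxt):
                out.append(nxt.upper())
            elif keep_orphan_marker:
                out.append("#" + nxt)  # policy A: keep the control visible
            else:
                out.append(nxt)        # policy B: silently drop the control
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

    Two implementations choosing different policies would disagree on the
    same input: "#a#1" becomes "A#1" under one and "A1" under the other.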

    Conclusion: your proposal has not solved any problem; it has just introduced more
    complexity, breaking too many text-handling algorithms used in every
    computer and almost all text-handling applications, and even in many widely
    used and standardized networking protocols (so you'll break interoperability
    everywhere; you could even say goodbye to the Internet, with so many
    protocols to fix). Such a proposal is not worth the cost, given the huge task
    it would mean for others to adapt to your encoding scheme.

    But then, if your encoding is just optional (meaning that a capital A could
    continue to be encoded as "A" or optionally as "#a"), what is the point of
    making such a change, except within your own local applications? If
    you need such a transform for your local search algorithm, then transform
    texts locally in your system, by reencoding them using your own control
    (you can do that using a PUA code point), and look at the new caveats that such a
    conversion will imply: database size, data field length constraints, and
    interoperability with the rest of the world, because you'll need constant
    conversions between your local encoding scheme and the rest of the world.
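
    A minimal Python sketch of such a local transform, assuming U+E000 as the
    private-use control (an arbitrary choice for illustration) and handling
    only Basic Latin capitals:

```python
CAP = "\uE000"  # hypothetical private-use control meaning "next letter is capital"


def to_local(text: str) -> str:
    """Reencode Basic Latin capitals as CAP + small letter (local form)."""
    return "".join(CAP + ch.lower() if "A" <= ch <= "Z" else ch for ch in text)


def from_local(text: str) -> str:
    """Convert the local form back to ordinary text for interchange."""
    out = []
    capitalize_next = False
    for ch in text:
        if ch == CAP:
            capitalize_next = True
        else:
            out.append(ch.upper() if capitalize_next else ch)
            capitalize_next = False
    return "".join(out)


def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))
```

    The storage cost shows up immediately: U+E000 takes three bytes in UTF-8,
    so "ONU" grows from 3 bytes to 12 in the local form, and every exchange
    with the outside world needs a from_local() or to_local() call.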

    This archive was generated by hypermail 2.1.5 : Fri Sep 28 2007 - 13:51:16 CDT