RE: Marks

From: Philippe Verdy (
Date: Wed Sep 26 2007 - 18:22:13 CDT


    There is strong evidence that you have not read a single line of the
    Unicode standard. You are mixing up every concept in your message, and
    attempting to remove all the abstractions already established, along with
    the goals they serve.

    Not only will your various proposals to the list not work as intended (you
    forget MANY things) and not be accepted (you lose all interoperability
    features); your encoding also completely changes the way text is handled.
    It makes text parsable only through contextual rules that embed concepts
    at unlimited nesting levels and over unbounded distances, so that safely
    extracting a substring becomes ambiguous and nearly impossible without
    parsing the WHOLE text it is taken from, FROM THE BEGINNING. And finally,
    your proposals are not even needed.
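
    As a rough illustration of the substring problem, here is a minimal Python
    sketch. It assumes a toy version of the prefix-mark scheme: the marker
    characters "#" and "^" and the decode() helper are hypothetical, invented
    only for this example, and are not part of Unicode or of any real
    encoding.

```python
# Hypothetical marker characters, chosen only for this sketch.
NAME_MARK = "#"   # assumption: the next letter is upper-case
ABBR_MARK = "^"   # assumption: all letters up to the next blank are upper-case


def decode(marked: str) -> str:
    """Expand the toy prefix marks into ordinary cased text."""
    out = []
    upper_next = False   # state set by NAME_MARK
    upper_word = False   # state set by ABBR_MARK
    for ch in marked:
        if ch == NAME_MARK:
            upper_next = True
        elif ch == ABBR_MARK:
            upper_word = True
        elif ch == " ":
            # A blank ends the scope of both marks.
            upper_next = upper_word = False
            out.append(ch)
        elif upper_word or upper_next:
            out.append(ch.upper())
            upper_next = False
        else:
            out.append(ch)
    return "".join(out)


marked = "#anna works for ^unesco"
plain = decode(marked)

# Slicing ordinary cased text keeps the case of every letter.
plain_tail = plain[-4:]

# Slicing the marked text drops the mark that governed these letters,
# so the same four letters decode differently.
marked_tail = decode(marked[-4:])
```

    In this sketch the tail of the plain text is "ESCO", while the same slice
    of the marked text decodes to "esco": the meaning of a substring depends
    on state that lies outside it.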

    The notations you propose are ALREADY implemented in upper-level layers or
    standards that DON'T break compatibility with the lower-layer plain-text
    representation using Unicode.

    Nothing in your proposal is needed. You are trying to define a new
    rich-text format, which is something very different from the goal of
    Unicode plain text: Unicode concentrates on the semantics of *text-only*
    elements, not on their rendering.

    None of your proposals will even work with existing font technologies. By
    trying to mix every possible concept into a single merged layer, you
    create havoc that will soon become non-interoperable and unmanageable. The
    right approach is to separate the problems, i.e. to encode text only with
    Unicode, and everything else in other upper-layer, out-of-band standards,
    based for example on XML (such as HTML, DocBook, MathML, ...) or on other
    legacy formats (RTF, MSDOC, Postscript...), where extra out-of-band
    semantics can also be added on top of the represented text as annotations,
    properties, grouping behaviour, structuring elements...

    So before you continue your proposals to this list, please read the
    standard, and notably the first chapters, which explain the goals,
    formalize the concepts, and discuss the conformance requirements and what
    the Unicode standard is and IS NOT.

    You also need to read it simply to use the right terminology and concepts.
    This will spare you many errors, such as writing "byte" where you mean
    "code point". The Unicode standard does not mandate a single binary
    representation. It represents characters by assigning them code points,
    which have several binary representations independent of the architecture
    or transport layer, and by assigning them a collection of properties that
    support many text-handling algorithms. Not every algorithm can be built
    from these properties alone: many depend on context or application, or on
    things that are NOT encoded in the text itself but elsewhere, such as the
    user's locale (as opposed to the text writer's locale) or the upper-layer
    protocols (XML-based formats, or other networking and file-format
    protocols) that embed Unicode text and map other properties on top of it.
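
    To make the distinction between a code point and its bytes concrete, here
    is a small Python sketch using only the standard library. The choice of
    U+00E9 as the sample character is arbitrary, and the variable names are
    mine; treat it as an illustration, not as a statement of how any
    particular application should store text.

```python
import unicodedata

ch = "\u00e9"  # one character, one code point: U+00E9

# The code point is an abstract number, not a byte sequence.
code_point = ord(ch)

# The same code point has several binary representations.
utf8 = ch.encode("utf-8")        # 2 bytes
utf16 = ch.encode("utf-16-be")   # 2 bytes, big-endian
utf32 = ch.encode("utf-32-be")   # 4 bytes, big-endian

# Properties are attached to the code point, whatever the encoding.
name = unicodedata.name(ch)
category = unicodedata.category(ch)  # "Ll" means lower-case letter

# Case properties already support case-insensitive comparison without
# changing how the text is encoded.
same_word = "UNO".casefold() == "Uno".casefold() == "uno".casefold()
```

    The three encoded forms differ in length and content, yet all decode back
    to the same code point, and the name, category, and case mappings are
    looked up from that code point rather than from any byte pattern.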

    Don't forget that Unicode-encoded text is used in many applications other
    than text input forms and rendering for display or print. Almost
    everything you propose has no meaning in those other contexts, because it
    is NOT plain text; it would have to be completely ignored or discarded,
    which makes your proposed characters needless pollution that complicates
    the implementation of the many upper-layer protocols (many of them
    standardized too!) that accept embedded Unicode-encoded plain text:

    For example, what would be the meaning of a fraction or other mathematical
    formula within a domain name? Or within a variable name or API name in a
    computer language? Please rethink the problem and consider the layered
    approach. Not every concept needs to be formalized at the plain-text
    level.

    If you want to transport text documents that include some advanced
    features, use a format other than plain *.txt files: OpenDocument, for
    example, is based on XML and offers such capabilities. If you want an
    exact rendering, use PDF. If you want to publish your text for rendering
    on the web in browsers, use HTML... Not every concept needs to map into a
    plain-text format, where it is acceptable that things like fraction bars,
    radicals, emphasis, and italic/bold presentation are awkward to represent.
    Choose your format according to the use of the text that the author
    intends for some specific purpose.

    > -----Original Message-----
    > From: [] On behalf
    > of Dmitry Turin
    > Sent: Wednesday, 26 September 2007 07:59
    > To:
    > Subject: Marks
    > I repeated postings, because it was not come into my mailbox.
    > Spending mark-place in coding table for capital letters - is
    > inadmissible spending.
    > Let all letters will be lower case: when there is a own name or beginning
    > of sentence,
    > one prefix-byte before a word is enough to specify, that first letter is
    > upper-case.
    > Let's name this prefix-byte as 'mark "own name"'. It works so: #anna ->
    > Anna
    > (where # is this prefix-byte).
    > It's necessary to tell the same about abbreviations. One prefix-byte
    > before a word
    > is enough to specify, that all letters to symbol "blank" are upper-case.
    > Let's name this byte as 'mark "abbreviation"'. It works so: #uno -> UNO.
    > User himself puts prefix-bytes by pressing keys "Shift" and "Caps Lock".
    > So comparison of various variants of spelling (all letters are lower-
    > case,
    > first letter is upper-case, all letters are upper-case) is reduced to
    > comparison in one variant of spelling (all letters are lower-case) at
    > search of similar word.
    > Widespread error is equating of designation of a letters (__coding__)
    > and their graphic images (__font__). It's absolutely different things.
    > Pictures of prefix-bytes in in
    > Dmitry Turin
    > Unicode2 (2.1.0)
    > HTML6 (6.4.1)
    > SQL4 (4.3.0)
    > Computer2 (2.0.3)
