RE: marks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 28 2007 - 16:01:51 CDT


    Otto Stolz wrote:
    > Dmitry Turin wrote:
    > > but also simplifies comparison of various variants of spelling
    > > (all letters are lower-case, first letter is upper-case, all
    > > letters are upper-case), because comparison is reduced to
    > > comparison in one variant of spelling (all letters are lower-case).
    >
    > This is plainly wrong. For, e. g., a case-invariant comparison,
    > your proposition requires removal of your “marks”, whilst the
    > Unicode way requires case folding. Both are commensurably cheap
    > operations, on contemporary computers.

    +1 for this argument: the proposal does not simplify anything, given that
    processing is still needed for case-insensitive searches (and assuming
    that this processing is simple is false, given the actual linguistic
    rules)!
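
    A minimal sketch of the case-folding route, assuming Python 3 and its
    standard str.casefold() and unicodedata.normalize(); the helper name
    caseless_equal is hypothetical and not taken from this thread. It
    folds both strings before comparing, with NFD normalisation applied
    before and after folding:

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding (hypothetical helper)."""

    def fold(s: str) -> str:
        # Normalise, fold case, then normalise again, because folding
        # can produce sequences that are no longer normalised.
        decomposed = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", decomposed.casefold())

    return fold(a) == fold(b)


if __name__ == "__main__":
    print(caseless_equal("Unicode", "UNICODE"))
    print(caseless_equal("Stra\u00DFe", "STRASSE"))
    # Plain lowercasing is not enough for the German sharp s:
    print("Stra\u00DFe".lower() == "STRASSE".lower())
```

    Folding is used rather than lower(), because lowercasing alone leaves
    the sharp s and "ss" distinct.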

    > Believe me, computer users are quite a conservative lot:
    > they want their data to be readable, editable, and processable,
    > for decades, if not for centuries.

    +1 for this argument too. That's what I meant when I spoke about the goals
    of the Unicode and ISO 10646 standards: preserving roundtrip compatibility
    with existing standards (i.e. encodings), ending their proliferation, which
    compromised the interoperability of systems (these constantly needed
    updates to support and interpret more encodings), and creating a framework
    where no newer encoding would ever need to be created to be
    interoperable, even for characters and scripts that are still not encoded
    (so that Unicode-based implementations would continue to work reasonably,
    with lots of immediately supported features, with characters and scripts
    encoded in the future, as well as with scripts and languages still unknown
    to existing software writers).

    > You have also written:
    > > "Widespread error is equating of designation of a letters (_coding_) and
    > > their graphic images (_font_). It’s absolutely different things".
    >
    > That error is definitely not widespread among the addressees of your
    > remark; rather, they are used to the notions of “character” vs. “glyph”.
    > However, most of them will agree that a capital A, a small a, a capital
    > Αλφα, a small αλφα, a capital Аз, and a small аз are six different
    > letters.
    >
    > But this has nothing to do with the encoding of those letters.
    > It was a deliberate decision, based on a history of about 30 years of
    > character encoding (before Unicode, as we know it), to assign six
    > different code positions to those six characters, and not three or even
    > only one.

    Another thing to note: although the Greek Alpha looks like the Latin or
    Cyrillic A, it behaves differently in association with combining characters,
    and Greek offers several conventions for the placement of these characters
    (see for example the special case of the iota subscript, notably in relation
    to uppercase mapping, which depends on the Greek convention in use:
    historic or modern).
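
    To make this concrete, here is a small sketch, assuming Python 3 and
    its standard unicodedata module (the variable names are illustrative
    only). It lists the three look-alike capitals as distinct characters,
    then shows how the small alpha with iota subscript decomposes and
    uppercases:

```python
import unicodedata

# Three capitals that look alike but are distinct characters.
look_alikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in look_alikes]
for ch, name in zip(look_alikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Small alpha with iota subscript (ypogegrammeni).
alpha_iota = "\u1FB3"
decomposed = unicodedata.normalize("NFD", alpha_iota)  # base letter + combining mark
uppercased = alpha_iota.upper()                        # full (one-to-many) case mapping

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in uppercased])
```

    Python's upper() applies the default Unicode mapping, which turns this
    single character into the two capitals U+0391 U+0399. It has no
    switch for choosing between the Greek conventions mentioned above.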

    So before proposing something else, Dmitry has to prove that his proposal
    will support AT LEAST all the special case mappings that Unicode already
    supports, and prove that it offers superior capabilities to handle even more
    critical cases. Our argument is that it is not even needed, given that the
    existing algorithms are already widely implemented and do work, and that
    Dmitry has not demonstrated anything regarding interoperability (which
    Unicode has smartly and very conveniently preserved).

    > † Armenian, Cyrillic, (Georgian), Greek, Latin; where Georgian
    > has not a fully developed case system,
    > cf. <http://www.unicode.org/versions/Unicode5.0.0/ch07.pdf>.

    In fact, Georgian is not bicameral at all in its modern script. It was
    bicameral, but used two separate alphabets for this, and Unicode now
    considers them as two distinct scripts, where the modern script is
    unicameral; the extremely rare use of a secondary alphabet from the
    historical script to make it bicameral also makes it ambiguous (due to
    the swapped meaning of some historical letter forms).

    I forgot Armenian as a bicameral script. This does not change things a lot.
    Even Armenians do not always use their capitals; in many cases they use
    them only as a stylistic option, to write text in all-caps style for
    titling or monumental inscriptions. The two sets of Armenian letters do
    not match exactly: some capitals are missing, and the corresponding
    lowercase letters need a complex mapping rule. The same happens with the
    German Ess-tsett, and with the historical Latin long s, which was used for
    initial or medial forms but not for final forms: it carried a distinction
    when the final form of s was used instead of the usual long s in the
    middle of a compound word, or to show a difference between a prefix and a
    longer radical. Greek has some similar distinctions for letters in final
    form.

    In general, within the 5 multicameral scripts, the capitalisation of text
    removes some semantic differences that cannot always be inferred back
    correctly by reconverting the text to small letters. There are more small
    letters than capitals, simply because in those scripts small letters are the
    most modern forms and the most widely used for normal text, so newer
    distinctive letter forms have been added to the set of small letters without
    necessarily being added to the historic set of capitals, which is now less
    often used except for some limited cases like initials or titling.

    Writing titles in capitals only is not correct in all languages,
    because it removes these letter differences (in addition to losing the
    minuscule/capital distinction for proper names). That's a good reason
    why an alternate "capital-like" style was added later for writing minuscule
    letters in titling, i.e. "small capitals", which are NOT capitals but an
    alternate glyphic representation of linguistic minuscule letters, distinct
    from capitals, whose use should remain limited to a few cases.

    It's true that we can often see texts written in "all capitals" style, but
    this is not a good practice (it seems to work reliably in English for
    example, but not in other languages; and anyway it is difficult to read,
    looks like SHOUTING, and it makes accents and diacritics difficult to read,
    so this should also be limited to short parts of texts).

    For these reasons (and many others), case conversions should be used with
    care: they are not recommended, and should be absolutely avoided when storing
    texts, as they are lossy even if you implement them correctly to minimize
    the semantic losses according to a reference language (if you actually
    know in which language the text is written, something that is not always
    indicated and that you cannot easily infer).
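
    The loss can be sketched with a simple round trip, assuming Python 3's
    language-independent str.upper() and str.lower(); the helper name
    survives_round_trip and the sample list are hypothetical:

```python
def survives_round_trip(text: str) -> bool:
    """True if uppercasing then lowercasing gives back the original text."""
    return text.upper().lower() == text


samples = {
    "plain": "plain",
    "German sharp s": "stra\u00DFe",
    "Armenian ligature ech yiwn": "\u0587",
    "Latin long s": "\u017F",
}

if __name__ == "__main__":
    for label, text in samples.items():
        restored = text.upper().lower()
        print(f"{label}: {text!r} -> {text.upper()!r} -> {restored!r}")
```

    Only the first sample comes back unchanged. These calls take no
    language parameter, so language-specific rules are not applied at all.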

    In other words, capital letters are NOT simply equivalent to lower case
    letters. They are NOT stylistic glyph variants of the associated small
    letters, even if they are closely related.


