Re: Slots for Cyrillic Accented Vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 20 2011 - 17:00:09 CDT


    Christoph Päper <christoph.paeper@crissov.de> wrote:
    > Doug Ewell:
    >>
    >> Text editing and processing with combining marks is not "very difficult and erroneous."
    >
    > The biggest problem with precomposed versus combined characters in text editors and word processors is that they are in fact treated differently.
    >
    > Input:
    >
    > Some accented letters are found on keys of their own on relevant national keyboard variants.
    > Others can easily be produced by a combination of base letter and dead-key diacritic mark, although they have to be pressed in a different order than they are coded.
    > Finally, some accented letters need a special kind of assistive input system, often visual character maps (though these are often ordered in a not too helpful way, i.e. by Unicode position).

    It does not matter how they are entered. The purpose of the input
    method is to generate whatever correctly encoded sequence of
    characters is appropriate for representing the selected character.
    It does not matter whether that input method generates one
    character or several.
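
    A minimal Python sketch (using only the standard unicodedata
    module; the sample strings are my own illustration) of what "one
    character or several" means in practice:

        import unicodedata

        # Two input methods may emit different, but canonically
        # equivalent, sequences for the same selected character.
        precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
        combining   = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

        print(precomposed == combining)               # False: different code points
        print(unicodedata.normalize("NFC", combining)
              == precomposed)                         # True: same abstract character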

    But of course, a simple input method that just presents a
    character map in which some combinations of letters and diacritics
    cannot be found, or cannot even be generated, is just a software
    deficiency, not a problem of the encoding in the UCS itself. Such
    software can always be enhanced to match what users want to input
    and see. Sometimes this involves not only the keyboard driver or
    input method editor, but also the handling in the software when
    editing existing documents or correcting them, as you've noted
    below:

    > It might be useful if computers offered their users a standard way to access and change diacritics on base letters, no matter how they were entered in the first place or how they are encoded. For instance, I could write “resume”, hit the one special key, e.g. ‘^’, and get an inline drop-down list to change the ‘e’ to ‘é’ (because that is a variant of the word in this instance that was found in the dictionary) or ‘è’, ‘ê’ etc. (shown in a standard fixed order by frequency / probability).
    >
    > Delete:
    >
    > The backspace (leftwards delete key) and (rightwards) delete keys should always delete one visual entity perceived as a single character by users, i.e. a combination of base letter and accent(s).

    I fully agree there. The normal behaviour in editors is to use the
    simplest editing method, working with the default grapheme clusters
    (but it should be noted that this level is too coarse for users
    working in languages where diacritics are optional or added as
    supplementary notations, such as Hebrew and Arabic, at least for a
    large subset of the diacritics used in those scripts, or for users
    working with Indic abugidas, because they still spell at least the
    vowels distinctly).
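
    A minimal sketch of such a grapheme-aware Backspace, in Python with
    the standard unicodedata module (this only approximates UAX #29
    default grapheme clusters: it handles a base letter plus trailing
    combining marks, not Hangul jamo or ZWJ sequences):

        import unicodedata

        def backspace(text):
            """Delete the last user-perceived character: the last base
            character together with any combining marks that follow it."""
            if not text:
                return text
            i = len(text) - 1
            # Step back over trailing combining marks to their base letter.
            while i > 0 and unicodedata.combining(text[i]):
                i -= 1
            return text[:i]

        # Decomposed "é" (e + combining acute): one keystroke removes both.
        assert backspace("resume\u0301") == "resum"
        # Precomposed "é" behaves identically.
        assert backspace("resum\u00e9") == "resum"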

    > The software could offer a key combination to free selected or adjacent base letters of all their diacritics, though, e.g. [Ctrl+Shift+Del/BS].

    As long as this remains an advanced editing feature, notably one
    not needed for entering text correctly the first time and still
    allowing normal corrections, this will be fine (it would typically
    be used when handling files that were incorrectly encoded in the
    first place, using unsuitable input editors, or incorrectly
    generated by poor software). But for this mode, I would much prefer
    another, more technical, graphical presentation, where in fact you
    would inspect each character, and all grapheme clusters would be
    broken into their individual parts, including visible controls.
    This type of rendering would be mostly for debuggers or for data
    analysis and parsing, i.e. mainly for software developers, not for
    the most frequent uses by most people.
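
    The strip-diacritics operation quoted above could be sketched like
    this in Python with the standard unicodedata module (a rough
    illustration: it only removes canonically decomposable marks, so
    letters such as 'ø' or 'đ' are left untouched):

        import unicodedata

        def strip_diacritics(text):
            """Decompose to NFD, drop the combining marks, recompose."""
            decomposed = unicodedata.normalize("NFD", text)
            stripped = "".join(c for c in decomposed
                               if not unicodedata.combining(c))
            return unicodedata.normalize("NFC", stripped)

        print(strip_diacritics("re\u0301sum\u00e9"))  # -> "resume"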

    Such adaptation is in fact not a problem of encoding, of
    translation, or of internationalization, but part of the work that
    developers must do when localizing their software according to
    users' expectations. There will never be any encoding that is
    perfect for all uses.

    > Storage:
    >
    > I believe it would help if input immediately was transformed to and text was saved in NFD, because this would make the need for uniform treatment more obvious.
    >
    > It would be cool if there was an ASCII-compatible encoding with variable length like UTF-8 that supported only NFD (or NFKD) and was optimized for a small storage footprint, e.g. from U+00C0–017F only a handful would have to be coded separately. Sadly, though, it is unrealistic to have a unique single byte code for each combining diacritic, because there are so many of them: even just ranges U+0300–036F and U+1DC0–1DFF are 176 positions together, although some are still unassigned; that is more than you can encode with 7 bits or less.

    The most common software practice has long been to use the NFC
    form. NFD is just for some internal technical uses, but in fact it
    is no longer justified, given the way that most software now
    communicates across heterogeneous systems.
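
    For illustration, the two forms compared in Python (standard
    unicodedata module; the sample string is my own):

        import unicodedata

        s = "Pâté"
        nfc = unicodedata.normalize("NFC", s)
        nfd = unicodedata.normalize("NFD", s)

        print(len(nfc), len(nfc.encode("utf-8")))   # 4 code points, 6 UTF-8 bytes
        print(len(nfd), len(nfd.encode("utf-8")))   # 6 code points, 8 UTF-8 bytes
        print(nfc == nfd)                           # False: different code points
        print(unicodedata.normalize("NFC", nfd)
              == nfc)                               # True: canonically equivalent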

    Forget NFKD (and NFKC) completely. They are definitely not for text
    input or editing (and probably not for rendering either), but are
    only needed as a compatibility layer across interfaces with old
    software modules (most of them not Unicode aware), notably as a
    helper for transcoding purposes, to find a few possible fallbacks.
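
    For example (my own sample characters), the compatibility
    decompositions are lossy fallbacks:

        import unicodedata

        # NFKD folds compatibility characters into plain fallbacks,
        # losing distinctions that matter for editing and rendering.
        print(unicodedata.normalize("NFKD", "\ufb01"))    # "ﬁ"  -> "fi"
        print(unicodedata.normalize("NFKD", "x\u00b2"))   # "x²" -> "x2"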

    >> The one use case that Plamen mentioned (a user manually deleting a base letter) is easily trained.
    >
    > Changing people is harder than changing software, in general.

    And I don't see why a single keystroke on the Backspace key should
    delete something different from a single keystroke on the Delete
    key, or why either should delete only part of a grapheme cluster,
    suddenly causing two separate grapheme clusters to be partly joined
    into a single one under the normal text rendering, where grapheme
    clusters (and all other joining forms or ligatures) are rendered as
    a whole. For normal use, if you delete a base letter, you must also
    delete the diacritics encoded after it. The same is true for mouse
    and keyboard selections and for normal navigation in the text
    (using the arrow keys, possibly with modifier keys).

    And the editor should work and behave equivalently whether the text
    in its background working buffers is encoded in NFC form, NFD form,
    or any other canonically equivalent, non-normalized form.
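
    One way to check that behaviour (a sketch in Python with the
    standard unicodedata module; the counting rule is my own
    simplification of grapheme clusters):

        import unicodedata

        def perceived_length(text):
            """Count user-perceived characters (base characters only,
            ignoring combining marks); the result must not depend on
            the normalization form of the buffer."""
            return sum(1 for c in text if not unicodedata.combining(c))

        word = "re\u0301sume\u0301"
        nfc = unicodedata.normalize("NFC", word)
        nfd = unicodedata.normalize("NFD", word)
        print(perceived_length(nfc), perceived_length(nfd))  # 6 6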

    But for lots of reasons, editors that save output for heterogeneous
    environments should all offer an option to normalize the whole text
    when saving (most probably NFC by default; NFD is once again for
    some technical interfaces, but those same interfaces can implement
    the conversion to NFD themselves if they really depend on it),
    simply because NFC will work with much Unicode-unaware legacy
    software.
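
    A minimal normalize-on-save hook might look like this (Python,
    standard library only; the function name and default form are my
    own choices):

        import unicodedata

        def save_text(path, text, form="NFC"):
            """Normalize the whole buffer (NFC by default) before
            writing it out as UTF-8."""
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(unicodedata.normalize(form, text))

        save_text("resume.txt", "re\u0301sume\u0301")  # stored as NFC "résumé"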

    The size of the encoded data is no longer much of an issue. Storage
    today is cheap, bandwidth keeps getting cheaper too, and
    general-purpose compression schemes are now used so efficiently in
    so many domains that compression often happens transparently,
    without significant performance cost or additional security risk
    (when it uses standard open algorithms that have long been applied
    to gigantic amounts of data worldwide). It just works very well;
    that's why, for example, UTF-8 was so widely adopted even though on
    the surface it is a bit less efficient than many legacy encodings,
    which were hardly interoperable or stable.
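
    A rough illustration of that point (Python standard library; the
    sample text and repetition factor are mine, and real documents will
    compress differently):

        import zlib

        text = ("Text with accented letters like é, à and ü costs one "
                "extra byte per accent in UTF-8 compared to Latin-1. ") * 200

        utf8 = text.encode("utf-8")
        latin1 = text.encode("latin-1")
        print(len(utf8), len(latin1))       # UTF-8 is somewhat larger uncompressed
        print(len(zlib.compress(utf8)),
              len(zlib.compress(latin1)))   # nearly the same once compressed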

    -- Philippe.


