Re: Version linking? from Philippe Verdy via Unicode on 2017-08-24 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Fri, 25 Aug 2017 01:24:36 +0200

2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
unicode_at_unicode.org>:

> Thus, at the level of undisputable text, in Indic scripts there appears
> to be no provision for the ordering of multiple left matras that are
> to be stored in logical order (i.e. backing order) after the onset
> consonants. (Thus, this is not a problem for the Thai script.)
> Fortunately, there is no good evidence that the occurrence of multiple
> distinct left matras is anything but a typing error, though I can easily
> see how it might be used as a lexicographical convention on the fuzzy
> edge of plain text.
>
> In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> ),
> but I'm not sure what the legitimate encodings of the example word
> കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
>

Even if these were typing errors, the input method should either signal
them visually to the user (by applying canonical reordering), or let the
user cancel this reordering (e.g. CTRL+Z to undo it); the input method
could then maintain the typed order by automatically inserting combining
joiners, even if the user did not enter them directly.
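
A minimal sketch of that last step, assuming the "combining joiner" meant
here is U+034F COMBINING GRAPHEME JOINER (CGJ). The helper name
preserve_typed_order is hypothetical, and Latin combining marks are used
only because their non-zero combining classes make the reordering easy to
observe with Python's standard unicodedata module:

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0


def preserve_typed_order(base, marks):
    """Hypothetical helper: join the typed marks with CGJ so that
    normalization cannot reorder them."""
    return base + CGJ.join(marks)


# The user typed: acute (ccc 230) first, then dot below (ccc 220).
typed_marks = ["\u0301", "\u0323"]

plain = "a" + "".join(typed_marks)
guarded = preserve_typed_order("a", typed_marks)

# Without the joiner, canonical reordering swaps the two marks.
reordered = unicodedata.normalize("NFD", plain)
# With the joiner, the typed order survives normalization.
kept = unicodedata.normalize("NFD", guarded)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in kept])
```

The joiner is invisible in the rendered text, so an editor taking this
approach would have to manage it on the user's behalf.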

The joiners should preferably be removed transparently by the text editor,
without requiring the user to perform complex selections or to press
BACKSPACE multiple times, as I don't see any use for these joiners at the
end of graphemes, or for multiple joiners in a sequence.
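
The cleanup policy described above could look roughly like this. It is
only a sketch under assumptions: the joiner is taken to be CGJ (U+034F),
"end of grapheme" is approximated as "not followed by a combining mark",
and tidy_joiners is a hypothetical name, not an existing editor API:

```python
import unicodedata

CGJ = "\u034f"  # assumed to be the joiner the editor inserts and hides


def tidy_joiners(text):
    """Drop CGJs that are repeated or that end a cluster.

    A CGJ is kept only when the next character is a combining mark
    other than CGJ itself, i.e. when it still separates two marks.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CGJ:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == "" or nxt == CGJ:
                continue  # trailing joiner, or first of a run
            if not unicodedata.category(nxt).startswith("M"):
                continue  # joiner at the end of a cluster
        out.append(ch)
    return "".join(out)


cleaned = tidy_joiners("a\u0301\u034f\u034f\u0323")
print([f"U+{ord(c):04X}" for c in cleaned])
```

CGJ also has uses between base letters (for example to affect
collation), so a real editor would probably need a way to exempt such
cases from this automatic removal.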

Then the user can even click in the middle of an uncommon sequence of
matras to insert a missing consonant if needed: here too, the joiner
that is encoded but hidden there would be dropped automatically.

If there are specific sequences requiring other uses of joiners to make a
useful distinction in some pairs of letters or diacritics, the input editor
could offer a way to enter the sequence directly, or to switch the encoding
of that pair between the forms with and without the joiner in the middle.
Having to retype the matra completely (using BACKSPACE, which deletes the
joiners transparently, or normal text selection over full clusters) should
be the exception.

If such special sequences requiring joiners are frequent, there should be a
way to enter them directly for the target language. The input editor could
propose them with a point-and-click/touch palette, with some
function/control keys, or with a contextual menu when the user selects a
candidate occurrence where alternate encodings are possible and known
(possibly registered by the user himself in his own input preferences, or
in his personal lexical file of alternate words, where they would have been
recorded when they deviate from the most common orthographic rules). Which
UI widget or function key is used by the input editor is left to the system
or application UI.

But the system should not decide alone that a sequence is invalid for some
orthographic system, when Unicode provides valid ways to deviate from any
orthographic system and to bypass the common canonical equivalences by
adding some transparent controls.

Even for Latin, one can freely enter SHY controls at any place within
words, even where they are not at correct syllable boundaries: this will
affect the rendering if there are line breaks, but this is done on purpose,
and it is still easy to correct if it was done by mistake (a spell checker
could also help locate these uncommon errors in existing texts, but would
not correct them automatically without an instruction from the user, and a
user can also choose to ignore/discard these signals and store the text as
is).
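
As a small illustration of the SHY case, the following sketch shows how a
checker might locate soft hyphens (U+00AD) and offer to strip them without
changing anything on its own. The function names are hypothetical:

```python
SHY = "\u00ad"  # SOFT HYPHEN


def soft_hyphen_positions(text):
    """Report where soft hyphens occur, leaving the text untouched."""
    return [i for i, ch in enumerate(text) if ch == SHY]


def strip_soft_hyphens(text):
    """Apply the correction, only when the user asks for it."""
    return text.replace(SHY, "")


# A soft hyphen entered at a deliberately odd position inside a word.
word = "ty" + SHY + "pography"

positions = soft_hyphen_positions(word)  # signalled to the user
corrected = strip_soft_hyphens(word)     # applied only on request
print(positions, corrected)
```

Reporting and correcting are kept as separate steps so that the stored
text changes only when the user accepts the suggestion.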

Whether the text with uncommon sequences will be easy to render correctly
is not the problem: the editor will just attempt to give a best-effort
representation, and if this approximate representation occurs too often,
fonts and renderers will be updated later to support and correctly render
the "uncommon" sequence (without even needing to change the Unicode
standard itself). But inputting such text will not be blocked.

The case of confusable two-part vowels in Indic scripts, however, causes a
problem of interpretation. It is not reasonable to expect that users will
use one sequence rather than the other when both render the same under the
existing typographic rules implemented in renderers, yet they collate
differently. This may be a problem for plain-text searches that look for
distinctions, or for sorting, but it can be fixed by defining collation
strengths or search flags that apply or skip some collation equivalences,
by enabling or disabling some tailorings. This can in turn help set up a
spell checker to signal or ignore some suggested corrections.
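
A search flag of that kind can be sketched as an optional folding step
applied before comparison. Everything here is an assumption made for
illustration: the table CONFUSABLE_FOLDS and the function names are
hypothetical, and the single entry (Devanagari <U+0905, U+093E> folded to
U+0906, two spellings that look alike but are not canonically equivalent)
is an example chosen here, not one taken from this thread. A real
implementation would more likely use a collation library with tailorings:

```python
import unicodedata

# Hypothetical tailoring table: variant sequence -> preferred sequence.
CONFUSABLE_FOLDS = {
    "\u0905\u093e": "\u0906",  # A + vowel sign AA  ->  letter AA
}


def search_key(text, fold_confusables):
    """Build a comparison key; folding is controlled by the flag."""
    key = unicodedata.normalize("NFC", text)
    if fold_confusables:
        for variant, preferred in CONFUSABLE_FOLDS.items():
            key = key.replace(variant, preferred)
    return key


def matches(a, b, fold_confusables=True):
    """Compare two strings with or without the extra equivalences."""
    return search_key(a, fold_confusables) == search_key(b, fold_confusables)


variant_spelling = "\u0905\u093e\u092e"
usual_spelling = "\u0906\u092e"

print(matches(variant_spelling, usual_spelling))
print(matches(variant_spelling, usual_spelling, fold_confusables=False))
```

With the flag enabled the two spellings are treated as the same word;
with it disabled the distinction is kept, which is what a spell checker
would need in order to signal the variant.
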
Received on Thu Aug 24 2017 - 18:26:32 CDT
