Re: Editing Sinhala and Similar Scripts from Richard Wordingham on 2014-03-23 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 23 Mar 2014 13:07:27 +0000

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2014-03-23 1:16 GMT+01:00 Richard Wordingham <
> richard.wordingham_at_ntlworld.com>:

>> The use of 'COMPOSE' indicates that one is not simply entering a
>> sequence of characters. 'COMPOSE, C, CEDILLA, ACUTE' should mean
>> an input process different to simply 'C, COMBINING CEDILLA,
>> COMBINING ACUTE'.

> Here again you reinterpret what I did not say. When I used DEADKEY or
> COMPOSE, I was evidently referring to keystrokes, not characters.

That is how I understood it.

> So I
> did not imply any encoding of characters (I was clear enough to say
> that these sequences of keystrokes were allowed to generate any
> canonically equivalent encoding), so instead I described the input
> (on keyboard or IME)

That is how I understood it.

> and the expected output (an encoded text that
> should be canonically equivalent).

I think you mean that you have only specified the generated character
output up to equivalence. An actual implementation would have to choose
one specific sequence, though there might conceivably be a mechanism to
select this sequence.

> I have NOWHERE intended to force the use of CGJ (you seem to imply
> that these keys will generate separate combining diacritics/joiners,
> one or two, for each key...

The input method and the editing of backing store are generally done by
separate processes. For IPA and Tai Tham input I have written my own
input methods. If I frequently had to use a process editing backing
store as you recommend, I would be strongly tempted to write a variant
that protected marks with non-zero combining class by inserting CGJ.
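
As a rough illustration of what such a variant might do, here is a
minimal Python sketch. It is an assumption about one possible policy, not
the behaviour of any existing input method: the function name
protect_marks is hypothetical, and it inserts CGJ only where canonical
reordering would otherwise move a mark.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0


def protect_marks(text: str) -> str:
    """Insert CGJ before any mark that canonical reordering would move.

    Hypothetical policy: a mark with a non-zero combining class that is
    lower than the class of the character before it gets a CGJ in front.
    """
    out = []
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Normalization would swap this mark with its predecessor;
        # a class-0 CGJ between them blocks the swap.
        if ccc != 0 and prev_ccc > ccc:
            out.append(CGJ)
        out.append(ch)
        prev_ccc = ccc
    return "".join(out)


typed = "c\u0301\u0327"           # c, COMBINING ACUTE, COMBINING CEDILLA
protected = protect_marks(typed)  # c, ACUTE, CGJ, CEDILLA

# Without CGJ, NFD reorders the marks; with CGJ, the typed order survives.
reordered = unicodedata.normalize("NFD", typed)
preserved = unicodedata.normalize("NFD", protected)
```

Here reordered comes out as c, CEDILLA, ACUTE, while preserved is
unchanged from protected.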

> This is wrong, the IME or keyboard driver handles the state of
> keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does
> not matter, and so it won't feed the encoded text with streams of
> characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't
believe it is true of Tavultesoft Keyman for Windows either. Feeding
tentative output to the application does require that the input method
have a way of cancelling previously provided input. Now, if you use a
method with a COMPOSE key
or a DEAD key, you are generally unlikely to get tentative entries.
However, one could write an input method that simulated a dead key but
actually generated an output for it so as to imitate a typewriter
differently.
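
A minimal Python sketch of that idea follows. It assumes a toy backing
store that treats U+0008 as 'delete the previous code point', and the
outputs chosen for the two keystrokes are hypothetical; it is not how
Keyman or KMFL is implemented.

```python
BACKSPACE = "\b"  # U+0008


def feed(store, emitted):
    """Apply input-method output to a toy backing store (list of code points)."""
    for ch in emitted:
        if ch == BACKSPACE:
            if store:
                store.pop()  # cancel the previously provided character
        else:
            store.append(ch)


store = []

# Keystroke 1, the ACUTE key: emit a tentative spacing acute accent,
# as a typewriter would leave the accent on the paper.
feed(store, "\u00b4")
after_accent = "".join(store)

# Keystroke 2, the E key: cancel the tentative output, then emit
# the finished letter.
feed(store, BACKSPACE + "\u00e9")
after_letter = "".join(store)
```

After the first keystroke the store holds the spacing accent; after the
second it holds only U+00E9.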

> In fact this input with a compose key does not work:
> COMPOSE, C, CEDILLA, ACUTE
> simply because the composed sequence is already terminated after the
> cedilla modifier key. So when you would type the acute modifier key it
> would not be associated.

I would not be at all surprised to find that someone has it working.

> That's another reason why dead keys are
> working: the state is not complete as long as you have not *finally*
> input the base letter. But let's suppose that the driver must
> generate something, then for the ACUTE key it would need to output
> the combining character, possibly with a preceding CGJ if the intent
> is to have the acute accent ordered relatively with the cedilla (this
> is very unusual).

Another method would be to generate, one character at a time, the
sequence <U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA, U+0008,
U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE>. The NFD
decomposition of U+1E08 is <U+0043, U+0327, U+0301>. The use of CGJ
would apply to 'COMPOSE, C, ACUTE, CEDILLA', for which I would again
expect to see the output U+1E08.
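
The normalization facts here can be checked with Python's standard
unicodedata module. This is only a sketch of the character-level
behaviour; the variable names are illustrative, and it says nothing about
how any particular input method produces its output.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def codepoints(s):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


# NFD of U+1E08 puts the cedilla (ccc 202) before the acute (ccc 230).
nfd = codepoints(unicodedata.normalize("NFD", "\u1e08"))

# C, ACUTE, CEDILLA entered in that order still composes to U+1E08,
# because canonical reordering swaps the two marks first.
typed = "C\u0301\u0327"
composed = codepoints(unicodedata.normalize("NFC", typed))

# With CGJ between the marks, the typed order is frozen and the
# cedilla no longer combines with the base letter.
frozen = "C\u0301" + CGJ + "\u0327"
frozen_nfc = codepoints(unicodedata.normalize("NFC", frozen))
```

So nfd is U+0043, U+0327, U+0301; composed is the single character
U+1E08; and frozen_nfc is U+0106, U+034F, U+0327.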

> The effect of Backspace entered just after it would delete
> simultaneously CGJ and the diacritic characters. It does not need to
> depend on the input state of the driver or the IME. In all cases,
> nothing in the keyboard mapping or IME will generate a CGJ character
> in isolation; it will always be followed by something.

If backspace is not modified by the input method - and Marc Durdin has
suggested that the input method should sometimes modify it - its
effect will depend on the process controlling the backing store, which
in general will work with multiple input methods, even during the
course of a single editing session. You might not write an input
method that generates a single CGJ, but I do. Do you insist on a soft
hyphen when writing 'Llangollen' so that it will collate after
'Llanberis' in Welsh? (I typed the place names in English; the names
are spelt the same way in English and Welsh in hardcopy, though of
course the letter counts differ.)

> But what would happen if you would type the compose sequence
> generating CGJ with COMPOSE where you forget to press the initial
> base letter, or type COMPOSE after the base letter ?
> C, COMPOSE, ACUTE
> you get the characters <C, CGJ, combining ACUTE> you cannot type
> another CEDILLA after it without pressing COMPOSE again before it, to
> get <C, CGJ, combining ACUTE, CGJ, combining CEDILLA>.
> The result is clearly abusing the use of CGJ when the input output
> should just be canonically equivalent to
> <C, combining ACUTE, combining CEDILLA> (i.e. without any CGJ at all)

Lower case specimenː c͏́͏̧ (this was in NFD as I edited it)
Actually, I would prefer to avoid the first, unnecessary CGJ.
Lower case specimen: ć͏̧ (the was in NFD as I edited it)

> Your system would be even less meaningful, it would break in most
> renderers

Some, not all. It renders fine in Firefox, though one can of course
set up input forms so that not even Thai renders properly.

> and spell checkers.

Most of the stuff I currently write with two combining marks of
non-zero ccc already fails with spell checkers.

> It would break in IDNA domain names.

No, it wouldn't. If you consult Table B.1 in
http://tools.ietf.org/html/rfc3454#appendix-B 'Stringprep', you will
see that CGJ is stripped out. For example, the URL
http://www.c%cd%8f%cc%81%cd%8f%cc%a7.com, using the first specimen above,
successfully reached http://www.%e1%b8%89.com/ when I used Firefox.
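
The same result can be reproduced with Python's standard stringprep
module, which exposes the RFC 3454 tables. The sketch below is a
simplification, not a full Nameprep or IDNA implementation: it applies
only the Table B.1 'map to nothing' step and NFKC, and the function name
strip_b1_and_nfkc is hypothetical.

```python
import stringprep
import unicodedata
from urllib.parse import quote


def strip_b1_and_nfkc(label: str) -> str:
    """Drop characters listed in RFC 3454 Table B.1, then apply NFKC."""
    kept = "".join(ch for ch in label if not stringprep.in_table_b1(ch))
    return unicodedata.normalize("NFKC", kept)


# First specimen: c, CGJ, COMBINING ACUTE, CGJ, COMBINING CEDILLA
label = "c\u034f\u0301\u034f\u0327"

cgj_is_stripped = stringprep.in_table_b1("\u034f")
prepared = strip_b1_and_nfkc(label)

# Percent-encoded UTF-8 forms, as they would appear in a URL.
encoded_before = quote(label).lower()
encoded_after = quote(prepared).lower()
```

With both CGJs removed, NFKC reorders and composes the remainder to the
single character U+1E09, whose UTF-8 form percent-encodes as %e1%b8%89.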

> would not match in plain text search unless they are tuned so that
> their collators discard the CGJs to look for fuzzy matches (fuzzy
> matches would also look for strings that are compatibility equivalent
> under NFKD, or could search at collation levels 2, or at collation
> level 1 ignoring all diacritics and CGJ wherever they are).

Collation Level 3 searches would work for what I type. Level 2 can
have a problem with diacritics frozen in the wrong order.

> So compose keys cause more confusion to native users than dead keys
> that are smarter as they can record more internal states and also
> allow arbitrary order of input for unordered diacritics (like acute
> plus cedilla : you can press their dead key in any order, the IME or
> driver handles the case and generates them, preferably in canonical
> order with growing combining classes; the driver or IME also generates
> them in an input state where it also knows the base letter to output,
> it can precombine the diacritics and so it will output C WITH
> CEDILLA, followed by COMBINING ACUTE, as expected, and still without
> needing any CGJ).

A better easy solution is for backspace just to delete the previous
character, so the user will often find what he wants. There is
then no need for the extra CGJ. Commands to step into a cluster would
be helpful, but are more difficult.

One thing that bothers me is that no-one has come forward with the
conventions that an application must follow to work with Tavultesoft
Keyman and its derivatives and imitations.

Richard.

Received on Sun Mar 23 2014 - 11:05:38 CDT