Re: Counting Codepoints from Richard Wordingham on 2015-10-13 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 13 Oct 2015 20:04:49 +0100

On Tue, 13 Oct 2015 12:17:43 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-10-13 8:36 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> > For
> > example, a MSKLC keyboard will deliver a supplementary character in
> > two WM_CHAR messages, one for the high surrogate and one for the low
> > surrogate.

> I have not tested the actual behavior in 64-bit versions of Windows:
> does the 64-bit version of the API still require delivering two
> WM_CHAR messages rather than a single one, if the message field has
> been extended to 64 bits?

In Unicode applications, WM_CHAR still delivers one UTF-16 code unit
per message. I suspect it delivers just one byte in multibyte 'ANSI'
encodings. There is a WM_UNICHAR message that delivers whole Unicode
characters, but reportedly Microsoft does not use it.
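
To make the two-message delivery concrete, here is a minimal sketch of how
a receiver could reassemble a supplementary character from two successive
16-bit values. It is plain Python and does not touch the Windows API; the
class name SurrogateAssembler and its feed() method are hypothetical names
chosen for illustration, and the handling of unpaired surrogates is one
possible policy, not a description of what Windows does.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogate range


class SurrogateAssembler:
    """Hypothetical helper: rebuilds code points from UTF-16 code units
    that arrive one at a time, as with successive WM_CHAR messages."""

    def __init__(self):
        self._pending_high = None

    def feed(self, unit):
        """Return a complete code point, or None if more input is needed."""
        if HIGH_START <= unit <= HIGH_END:
            # Remember the high surrogate and wait for its partner.
            self._pending_high = unit
            return None
        if LOW_START <= unit <= LOW_END and self._pending_high is not None:
            high = self._pending_high
            self._pending_high = None
            # Standard UTF-16 decoding arithmetic for a surrogate pair.
            return 0x10000 + ((high - HIGH_START) << 10) + (unit - LOW_START)
        # BMP code unit, or an unpaired low surrogate passed through as-is.
        # Any pending high surrogate is silently dropped (an assumed policy).
        self._pending_high = None
        return unit


assembler = SurrogateAssembler()
first = assembler.feed(0xD83D)    # high surrogate of U+1F600
second = assembler.feed(0xDE00)   # low surrogate of U+1F600
```

Here the first call yields nothing and the second yields the whole code
point, which is the bookkeeping an application has to do when only WM_CHAR
is available.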

> The actual behavior is also tricky as the basic layouts built with
> MSKLC will have their character data translated "transparently" to
> other "OEM" encodings according to the current input code page of the
> console (using one of the codepage mapping tables installed
> separately): the transcoder will also need to translate the 16-bit
> Unicode input from WM_CHAR messages into the 8-bit input stream used
> by the console, and this translation will need to read both
> surrogates at once before sending any output.

This only applies to 'ANSI' applications. I am not aware of any ANSI
codepages that contain supplementary characters. For a Unicode
application, no translation from Unicode occurs.
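
As a rough check of the point about codepages, the sketch below tries to
encode one supplementary character into a handful of legacy codepages. It
uses Python's own codec tables as a stand-in, which is an assumption: they
are not guaranteed to match the conversion tables Windows installs, and
the list of codecs is only a sample, not an exhaustive survey.

```python
# U+10400 DESERET CAPITAL LETTER LONG I, a supplementary-plane character.
SUPPLEMENTARY = "\U00010400"

# A sample of legacy codepages as named by Python's codec registry.
CODECS = ["cp1252", "cp1251", "cp932", "cp936", "cp949", "cp950"]


def encodable(text, codec):
    """Return True if every character of text has a mapping in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Maps each sampled codec name to whether it can represent the character.
results = {codec: encodable(SUPPLEMENTARY, codec) for codec in CODECS}

# In UTF-16 the same character occupies two 16-bit code units (4 bytes).
utf16_length = len(SUPPLEMENTARY.encode("utf-16-le"))
```

None of the sampled codecs has a mapping for the character, while UTF-16
carries it as a surrogate pair with no translation step.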

Richard.
Received on Tue Oct 13 2015 - 14:05:49 CDT
