Re: UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 26 2011 - 13:22:26 CST

Next message: Thomas Cropley: "UTF-c, UTF-i"

Previous message: Philippe Verdy: "Re: UTF-c"
In reply to: William_J_G Overington: "Re: UTF-c"
Next in thread: William_J_G Overington: "Re: UTF-c"

2011/2/26 William_J_G Overington <wjgo_10009@btinternet.com>:
> Philippe Verdy <verdy_p@wanadoo.fr> wrote:
>
>> Note that the scalar values range 0xD800..0xDFFF reserved for surrogates code points MUST be excluded to be a conforming UTF (these code points must not be representable, to allow full bidirectional compatibility with UTF-16 ; this is unlike all other codepoints assigned to non-characters which SHOULD still be representable).
>
> How do you arrive at the conclusion about the surrogates please?

Surrogates have no scalar values. A conforming UTF should be able to
encode every Unicode code point that has a scalar value (all of them,
including those assigned to non-characters), and only those.
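
To make the distinction concrete, here is a minimal Python sketch. The
helper names is_scalar_value and utf8_can_encode are mine, chosen only
for illustration. It compares the scalar-value test with what Python's
strict UTF-8 codec accepts; UTF-8 stands in here as an example of a
conforming UTF, not as a model of UTF-c.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: any code point
    in 0..0x10FFFF except the surrogates 0xD800..0xDFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf8_can_encode(cp: int) -> bool:
    """True if Python's strict UTF-8 codec accepts the code point.
    Only call this with 0 <= cp <= 0x10FFFF (chr() rejects others)."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for lone surrogates, which have no scalar value.
        return False


# Non-characters such as U+FFFE and U+FFFF are scalar values and
# are encodable; surrogates are neither.
samples = (0x0000, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFE, 0xFFFF, 0x10FFFF)
for cp in samples:
    print(f"U+{cp:04X}", is_scalar_value(cp), utf8_can_encode(cp))
```

For every sample the two columns agree: the codec refuses exactly the
code points that have no scalar value.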

> Is it because there are some rules somewhere that require that a surrogate pair copied from a UTF16 sequence must first be combined to produce one codepoint and then that codepoint must be compressed, rather than that the two codepoints be each individually compressed?

No. It's independent of that (see BOCU-1, which also does not allow
encoding isolated or incorrectly paired surrogates, but still
encodes all other code points in any plane as an unbreakable sequence).

> If so, do those rules necessarily apply to utf-c2? If so, would they apply if the format were denoted by a name that does not include the sequence utf?
> Would compressing the surrogate codes separately make the design of the format simpler?

No. It would make it longer and still not simpler.

> Could sequences starting 10.000000 and 10.000001 be used for switching codes?

No. You have misread the format. They encode non-final bytes of
characters (or special codes) encoded as multi-byte sequences. The
sequence must still be parsed as a whole in order to compute the
offset scalar value that it represents. From this value (which may
be altered by the BASE), you can deduce whether it is the scalar
value of a valid code point (in 0x0000..0xD7FF or 0xE000..0x10FFFF) or
a special code (any other value: this includes switch codes and
all other unassigned and reserved values).
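
As a rough illustration of that last step only, here is a Python sketch
of the classification applied once the whole sequence has been parsed.
The function name classify_decoded_value is hypothetical, and the
sketch assumes the value passed in is the final one, after any
adjustment by the BASE; it does not model the byte-level parsing or the
BASE itself.

```python
def classify_decoded_value(value: int) -> str:
    """Classify a fully decoded value (assumed already adjusted by
    the BASE) as a valid code point or a special code."""
    if 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF:
        return "code point"
    # Anything else: switch codes and all other unassigned or
    # reserved values, including the surrogate range.
    return "special code"


for v in (0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(hex(v), classify_decoded_value(v))
```

The point is that the decision is made on the computed value, not on
the leading bits of any single byte in the sequence.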


This archive was generated by hypermail 2.1.5 : Sat Feb 26 2011 - 13:25:20 CST