Re: "A Programmer's Introduction to Unicode"

From: Steffen Nurpmeso <steffen_at_sdaoden.eu>
Date: Wed, 15 Mar 2017 11:40:54 +0100

"Doug Ewell" <doug_at_ewellic.org> wrote:
 |Philippe Verdy wrote:
 |>>> Well, you do have eleven bits for flags per codepoint, for example.
 |>>
 |>> That's not UCS-4; that's a custom encoding.
 |>>
 |>> (any UCS-4 code unit) & 0xFFE00000 == 0
 |
 |(changing to "UTF-32" per Ken's observation)
 |
 |> Per definition yes, but UTC-4 is not Unicode.
 |
 |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
 |held in 1989?
 |
 |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
 |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
 |> would allow 32 planes instead of just the 17 first ones).
 |
 |I used bitwise arithmetic strictly to address Steffen's premise that the
 |11 "unused bits" in a UTF-32 code unit were available to store metadata
 |about the code point. Of course UTF-32 does not allow 0x110000 through
 |0x1FFFFF either.
 |
 |> I suppose he meant 21 bits, not 11 bits which covers only a small part
 |> of the BMP.
 |
 |No, his comment "you do have eleven bits for flags per codepoint" pretty
 |clearly referred to using the "extra" 11 bits beyond what is needed to
 |hold the Unicode scalar value.

It surely is a weak argument for a general string encoding.  But
sometimes, for local use cases, it surely is valid.  You could,
for example, store the wcwidth(3) result plus a grapheme cluster
codepoint count in these bits of the first codepoint of a cluster,
and then hide that storage detail behind an access method interface.
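
A minimal sketch of such an access interface might look like the
following.  The bit layout (2 bits of column width, 9 bits of
codepoint count) and all cell_* names are assumptions made up for
illustration; nothing here is defined by Unicode or taken from an
existing library, and a wcwidth(3) result of -1 would need separate
handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, purely local to one program:
 *   bits  0..20  Unicode scalar value (0x000000..0x10FFFF)
 *   bits 21..22  column width of the cluster, 0..2
 *   bits 23..31  number of codepoints in the cluster, 0..511
 */
#define CELL_CP_MASK     UINT32_C(0x001FFFFF)
#define CELL_WIDTH_SHIFT 21
#define CELL_WIDTH_MASK  UINT32_C(0x3)
#define CELL_COUNT_SHIFT 23
#define CELL_COUNT_MASK  UINT32_C(0x1FF)

typedef uint32_t cell_t;

/* Pack a codepoint with its per-cluster metadata into one 32-bit unit. */
static inline cell_t cell_pack(uint32_t cp, unsigned width, unsigned count)
{
    assert(cp <= UINT32_C(0x10FFFF));
    assert(width <= CELL_WIDTH_MASK);
    assert(count <= CELL_COUNT_MASK);
    return cp | ((uint32_t)width << CELL_WIDTH_SHIFT)
              | ((uint32_t)count << CELL_COUNT_SHIFT);
}

/* Accessors: callers never see the bit layout. */
static inline uint32_t cell_codepoint(cell_t c)
{
    return c & CELL_CP_MASK;
}

static inline unsigned cell_width(cell_t c)
{
    return (unsigned)((c >> CELL_WIDTH_SHIFT) & CELL_WIDTH_MASK);
}

static inline unsigned cell_count(cell_t c)
{
    return (unsigned)((c >> CELL_COUNT_SHIFT) & CELL_COUNT_MASK);
}
```

A packed value with any of the upper bits set no longer satisfies
(unit & 0xFFE00000) == 0, so it is not a UTF-32 code unit; anything
that leaves the program would have to go through cell_codepoint()
first.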

--steffen
Received on Wed Mar 15 2017 - 05:41:28 CDT
