Re: "A Programmer's Introduction to Unicode" from Alastair Houghton on 2017-03-14 (Unicode Mail List Archive)

From: Alastair Houghton <alastair_at_alastairs-place.net>
Date: Tue, 14 Mar 2017 08:51:18 +0000

On 14 Mar 2017, at 02:03, Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:
>
> On Mon, 13 Mar 2017 19:18:00 +0000
> Alastair Houghton <alastair_at_alastairs-place.net> wrote:
>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over
> what string lengths mean.

Yet the same problem exists for UCS-4: UCS-4 based code can very easily overlook the handling of combining characters. As for string lengths, lengths in code points are no more meaningful than lengths in UTF-16 code units. They don’t tell you anything about the number of user-visible characters; or anything about the width the string will take up if rendered on the display (even in a fixed-width font); or anything about the number of glyphs that a given string might be transformed into by glyph mapping. The *only* thing the length of a Unicode string will tell you is the number of code units.
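
To make the point about lengths concrete, here is a minimal Python sketch. The sample string is my own assumption, chosen purely for illustration: an "e" followed by a combining acute accent, then one emoji from outside the BMP. A reader would see two characters, yet none of the counts below is 2.

```python
import unicodedata

# Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT, then U+1F600,
# which lies outside the BMP and so needs a surrogate pair in UTF-16.
s = "e\u0301\U0001F600"

# Python's str length counts code points (what a UCS-4 length would report).
code_points = len(s)

# Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids emitting a BOM.
utf16_units = len(s.encode("utf-16-le")) // 2

# Each UTF-8 code unit is 1 byte.
utf8_units = len(s.encode("utf-8"))

# Code points with a non-zero canonical combining class attach to a
# preceding base character rather than standing alone.
combining = sum(1 for ch in s if unicodedata.combining(ch) != 0)

print(code_points, utf16_units, utf8_units, combining)
```

The code point count here is 3 and the UTF-16 count is 4, so moving to UCS-4 removes the surrogate pair but still overcounts because of the combining mark. Getting the user-visible count needs grapheme cluster segmentation (UAX #29), which the Python standard library does not provide as far as I know.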

Kind regards,

Alastair.

--
http://alastairs-place.net

Received on Tue Mar 14 2017 - 03:51:37 CDT
