Re: Origin of Ellipsis from Bjoern Hoehrmann on 2013-09-16 (Unicode Mail List Archive)

From: Bjoern Hoehrmann <derhoermi_at_gmx.net>
Date: Mon, 16 Sep 2013 17:53:26 +0200

* Philippe Verdy wrote:
>2013/9/16 Stephan Stiller <stephan.stiller_at_gmail.com>
>> That's exactly what happens when people confuse "code point" with "scalar
>> value" ;-) Hmm, whom might we blame? :-)
>
>Actually you never count scalar values. You are confusing them with code
>units. Twitter was originally counting UTF-16 code units, but now counts
>code points.
>
>Scalar values are unrelated, they are properties assigned to code points so
>that all code points have a scalar value but the reverse is true only with
>the valid range 0 to 0x1FFFFF. Scalar values are only used if you need to
>perform arithmetic to compute code points from others. This generally does
>not work well within the UCS except in a few very small ranges (like
>decimal digits). The scalar value is also needed to convert from one
>standard UTF to another.

Well,

  UTF-16 code unit: integer in 0 .. 0x00FFFF
  Unicode code point: integer in 0 .. 0x10FFFF
  Surrogate code point: integer in 0xD800 .. 0xDFFF
  Unicode scalar value: integer in 0 .. 0xD7FF or 0xE000 .. 0x10FFFF

When you say "code point", it is usually unclear whether values like
0xDEAD or 0xAFFFE are included and, in your counting example, whether
the string contains proper surrogate pairs and whether each pair would
count as 1 or 2 toward the total. Such confusion is less likely when
referring to scalar values.
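
To make the difference concrete, here is a rough Python sketch of the
four ranges listed above and of the three ways one might count a
string. The function names are made up for illustration, and it
assumes Python 3.4 or later, where a str is a sequence of code points
that may include lone surrogates:

```python
def is_code_point(n):
    # Unicode code point: integer in 0 .. 0x10FFFF
    return 0 <= n <= 0x10FFFF

def is_surrogate(n):
    # Surrogate code point: integer in 0xD800 .. 0xDFFF
    return 0xD800 <= n <= 0xDFFF

def is_scalar_value(n):
    # Unicode scalar value: any code point that is not a surrogate
    return is_code_point(n) and not is_surrogate(n)

def count_all(s):
    # "surrogatepass" lets lone surrogates through the encoder, so
    # each one is counted as a single UTF-16 code unit.
    code_units = len(s.encode("utf-16-le", "surrogatepass")) // 2
    code_points = len(s)
    scalar_values = sum(1 for ch in s if is_scalar_value(ord(ch)))
    return code_units, code_points, scalar_values

print(is_scalar_value(0xAFFFE))   # noncharacter, but still a scalar value
print(is_scalar_value(0xDEAD))    # surrogate code point, not a scalar value
print(count_all("a\U0001F600"))   # one supplementary character
print(count_all("a\udead"))       # one lone surrogate
```

With a supplementary character the code unit count exceeds the other
two, and with a lone surrogate the scalar value count falls below the
code point count, which is exactly where "counts code points" becomes
ambiguous.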

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Mon Sep 16 2013 - 10:55:31 CDT
