Re: Unicode String Models

From: David Starner <prosfilaes_at_gmail.com>
Date: Fri, 20 Jul 2012 15:01:42 -0700

On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ <mark_at_macchiato.com> wrote:
> I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome.
> Macchiato »
> Many programming languages (and most modern software) have moved to Unicode model of text. Text coming into the system might be in legacy encodings like Shift-JIS or Latin-1, and text being pushed out...

I had a few comments for general discussion:

"That means that it is best to optimize for BMP characters (and as a
subset, ASCII and Latin-1), and fall into a ‘slow path’ when a
supplementary character is encountered."

I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese. I see the argument in theory and
practice, but it's a tough line to walk, especially if you're not
familiar with i18n.
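
To make the concern concrete, here is a minimal sketch of the kind of
fast path / slow path split the quoted sentence describes, applied to
counting code points in UTF-16 code units. The function names
(count_code_points, utf16_units) are hypothetical and not taken from
the notes or from any particular library. The point is only that the
slow branch never runs on ASCII or Latin-1 test data.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (helper for the demo)."""
    data = text.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units."""
    # Fast path: no surrogates, so every code unit is exactly one code point.
    if not any(0xD800 <= u <= 0xDFFF for u in units):
        return len(units)
    # Slow path: a high surrogate followed by a low surrogate is one code point.
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1  # a lone surrogate is counted as one unit
        count += 1
    return count


print(count_code_points(utf16_units("abc")))
print(count_code_points(utf16_units("a\U00020000\u9f9c")))
```

Timing only the first call and checking correctness only with the
second is exactly the split in testing I'm worried about.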

For example, I can see a loop like

    for i in range (1, 1000) do a := " "; a +:= "龜"; done

being way slower than necessary, especially in non-trivial cases that
the compiler can't simply optimize away.
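
As a rough illustration, assuming CPython 3.3 or later (whose strings
pick a per-string storage width; other implementations may behave
differently), the sketch below compares the memory footprint of an
ASCII string before and after a single non-Latin-1 character is
appended. The function name widen_cost is hypothetical, and the exact
byte counts vary by interpreter version, so none are claimed here.

```python
import sys


def widen_cost(n=1000):
    """Return (ascii_size, widened_size) in bytes for an n-character ASCII
    string before and after appending one non-Latin-1 BMP character."""
    ascii_text = " " * n              # pure ASCII: narrowest storage in CPython
    widened = ascii_text + "\u9f9c"   # U+9F9C (龜) does not fit the narrow storage
    return sys.getsizeof(ascii_text), sys.getsizeof(widened)


ascii_size, widened_size = widen_cost()
print(ascii_size, widened_size)
```

On such an implementation the second number should be roughly double
the first, since every character in the string is re-stored at the
wider width, not just the one that was appended.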

"Interfacing with most software libraries can avoid conversions in and out"

I'm curious about this. I won't dismiss it offhand, but besides ICU,
what libraries are we talking about that haven't already been
rewritten for GTK, Java, Python, take your pick?

"The string class is indexed by code unit, and is UTF-32. Used by: glibc?"

I haven't poked at it, but Ada 2012 (in the pre-standard,
editorial-changes-only stage) has Latin-1, UCS-2 (the standard is not
clear here about UTF-16 vs. UCS-2) and UTF-32 (UCS-4; it mentions
2147483648 code points) strings. There are functions in the standard
to store a Unicode string in the Latin-1 strings as UTF-8 and in the
UCS-2 strings as UTF-16, but there is also the choice of using
straight UTF-32.
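
For a sense of what the three storage choices cost, this small Python
sketch counts the code units each encoding form needs for a few sample
characters. It uses only the standard str.encode codecs; the helper
name code_units is hypothetical, and this says nothing about how Ada
itself implements these strings.

```python
def code_units(text):
    """Count the code units needed to store text in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1-byte code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2-byte code units, no BOM
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4-byte code units, no BOM
    }


# ASCII, Latin-1, BMP (CJK), and supplementary-plane samples
for sample in ("a", "\u00e9", "\u9f9c", "\U00020000"):
    print("U+%04X" % ord(sample), code_units(sample))
```

Only UTF-32 needs a single code unit for every sample, so it is the
only one of the three where indexing by code unit is also indexing by
code point.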

"The question of whether to allow non-ASCII characters in variables is open."

I don't see why. Yes, a lot of organizations will use ASCII only, but
not all programming is done by large international organizations. For
personal hacking, or small mononational organizations, Unicode
variables may be much more convenient. It's not like Chinese variables
with Chinese comments are going to be much harder to debug for the
English speaker than English variables (or bad English variables) with
Chinese comments, and ASCII-romanized Chinese variables may be the
worst of all worlds.

--
Kie ekzistas vivo, ekzistas espero.
Received on Fri Jul 20 2012 - 17:05:44 CDT
