RE: Unicode String Models

From: Murray Sargent <murrays_at_exchange.microsoft.com>
Date: Fri, 20 Jul 2012 23:16:17 +0000

Mark wrote: “I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome.”

Nice article as far as it goes, and additions are forthcoming. In addition to multiple code units per character in UTF-8 and UTF-16, there are variation selectors, combining marks, ligatures, and clusters, all of which imply handling variable-length sequences even for UTF-32. Handling the variable-length encoding of code points in UTF-8 and UTF-16 is actually considerably easier than dealing with these other sources of variable length. In all cases, you need to be able to find "character entity" boundaries for an arbitrary code-unit index.
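To illustrate the difference in difficulty, here is a minimal Python sketch. The function names are made up for this example, and the "character entity" rule is a deliberate simplification (a base character followed by combining marks or variation selectors); real cluster segmentation follows the rules in UAX #29 and also has to cope with ligatures and script-specific clusters. Backing up to a code-point boundary in UTF-16 or UTF-8 needs only a look at the code units themselves, whereas backing up to an entity boundary needs character property data.

```python
import unicodedata

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    b = s.encode("utf-16-le")
    return [int.from_bytes(b[i:i + 2], "little") for i in range(0, len(b), 2)]

def code_point_start_utf16(units, i):
    """Back up from code-unit index i to the start of its code point."""
    # A trail surrogate preceded by a lead surrogate belongs to the same code point.
    if 0xDC00 <= units[i] <= 0xDFFF and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1
    return i

def code_point_start_utf8(data, i):
    """Back up from byte index i to the start of its code point."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def is_extender(ch):
    """Simplified assumption: combining marks and variation selectors extend an entity."""
    cp = ord(ch)
    return (unicodedata.category(ch).startswith("M")
            or 0xFE00 <= cp <= 0xFE0F
            or 0xE0100 <= cp <= 0xE01EF)

def entity_start(s, i):
    """Back up from code-point index i to the start of a simplified 'character entity'."""
    while i > 0 and is_extender(s[i]):
        i -= 1
    return i

# U+1D11E needs two UTF-16 code units and four UTF-8 bytes.
clef = "a\U0001D11Eb"
units = utf16_units(clef)
data = clef.encode("utf-8")

# Even indexed by code point (as in UTF-32), these sequences are variable length.
accented = "ae\u0301\u0323b"   # e + combining acute + combining dot below
heart = "\u2764\ufe0f"         # heavy black heart + variation selector-16
```

With these definitions, `code_point_start_utf16(units, 2)` backs up one code unit, `code_point_start_utf8(data, 3)` backs up two bytes, and `entity_start(accented, 3)` backs up two code points to the base letter.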

My latest blog post “Ligatures, Clusters, Combining Marks and Variation Sequences<http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>” discusses some of these complications.

One amusing thing is that where I work it’s common to use cp to mean “character position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s post, cp is used for codepoint.

Murray


Received on Fri Jul 20 2012 - 18:21:35 CDT
