L2/00-262

Draft Unicode Technical Report #xx

Handling Surrogates

Revision

1

Authors

Andrew Hodgson (lyricom@magma.ca)

Date

This Version

http://www.unicode.org/unicode/reports/tr22/tr22-2

Previous Version

http://www.unicode.org/unicode/reports/tr22/tr22-1

Latest Version

http://www.unicode.org/unicode/reports/tr22/

Summary

This document describes the surrogates area of the Unicode standard and provides guidelines for the proper support of these characters.

Status

This document contains material which has been considered and approved by the Unicode Technical Committee for publication as a Proposed Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative material or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

Contents


1 Introduction

1. The surrogates area of the Unicode standard has been defined since Unicode version 2.0, but up to and including Unicode 3.0 there have been no characters assigned to these code points. This has meant those developing Unicode-based products could ignore these code points, and the difficulties in dealing with them, entirely.

2. This situation will change with Unicode 3.1 and following releases. New scripts will be placed in the surrogates area and all Unicode-compliant applications will have to provide at least a minimal level of support for these characters.

2 Contents

1. The proposed contents of the surrogates area as of Unicode release 4.0 are given in the following table. Encodings are shown in both UTF-32 and UTF-16 formats.

Description                                              Count    UTF-32       UTF-16

Several archaic scripts
  Old Italic Script (for Etruscan, Oscan, Umbrian, ...)  48       10300…1032F  D800:DF00…D800:DF2F
  Gothic Script (for Gothic language)                    32       10330…1034F  D800:DF30…D800:DF4F
  Deseret Script (formerly used by the Mormon church)    80       10400…1044F  D801:DC00…D801:DC4F

Two sets of musical symbols
  Byzantine Musical Symbols                              256      1D000…1D0FF  D834:DC00…D834:DCFF
  Western Musical Symbols                                224      1D100…1D1DF  D834:DD00…D834:DDDF

Mathematical symbols
  Mathematical Alphanumeric Symbols                      1024     1D400…1D7FF  D835:DC00…D835:DFFF

Han Characters
  Vertical Extension B (~42,800 unified Han characters)  ~42,800  20000…2A7??  D840:DC00…D869:DF??
  CNS Compatibility (527 compatibility Han characters)   535      2F800…2FA16  D87E:DC00…D87E:DE16

Other
  Plane 14 Tag Characters                                128      E0000…E007F  DB40:DC00…DB40:DC7F
  Private Use Area                                       65,536   F0000…FFFFF  DB80:DC00…DBFF:DFFF

3 Character Representations

1. Surrogates can be represented in code in a number of different ways. In all cases, however, more than 16 bits are required.

3.1 UTF-8

2. UTF-8 representation requires the use of four bytes. The transformation between UTF-8 and UTF-16 is described in appendix A2 of the Unicode standard (version 2 - not sure where this is in V3 yet).
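
The four-byte form can be sketched as follows. This is a minimal illustration, assuming the input is a code point in the range 0x10000..0x10FFFF; the function name EncodeUTF8Supplementary is hypothetical and does not come from the standard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the four-byte UTF-8 form of a code point
   in 0x10000..0x10FFFF into out[0..3]. Returns 4 on success, or 0 if
   the code point lies outside that range (out is then left untouched). */
size_t EncodeUTF8Supplementary( unsigned long cp, unsigned char out[4] )
{
  assert( out != NULL );
  if ( cp < 0x10000UL || cp > 0x10FFFFUL )
  {
    return 0;
  }
  out[0] = (unsigned char) ( 0xF0 | ( cp >> 18 ) );            /* 11110xxx */
  out[1] = (unsigned char) ( 0x80 | ( ( cp >> 12 ) & 0x3F ) ); /* 10xxxxxx */
  out[2] = (unsigned char) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );  /* 10xxxxxx */
  out[3] = (unsigned char) ( 0x80 | ( cp & 0x3F ) );           /* 10xxxxxx */
  return 4;
}
```

For example, the code point 10330 (GOTHIC LETTER AHSA) is written as the bytes F0 90 8C B0.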

3.2 UTF-16

3. UTF-16 representation is accomplished by using two UTF-16 values, the first taken from the high-surrogate range (U+D800 to U+DBFF) and the second from the low-surrogate range (U+DC00 to U+DFFF). These two values combine to represent a single code-point.
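
As a sketch of how the two values are derived, the code point is reduced by 0x10000 and the resulting 20-bit number is split into two 10-bit halves. The function name EncodeSurrogatePair and its return convention are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits a code point in 0x10000..0x10FFFF into a
   high and a low surrogate. Returns 1 on success, 0 if the code point
   cannot be represented as a surrogate pair. */
int EncodeSurrogatePair( unsigned long cp,
                         unsigned short *high, unsigned short *low )
{
  unsigned long offset;
  assert( high != NULL && low != NULL );
  if ( cp < 0x10000UL || cp > 0x10FFFFUL )
  {
    return 0;
  }
  offset = cp - 0x10000UL;                                 /* 20-bit value */
  *high = (unsigned short) ( 0xD800 + ( offset >> 10 ) );  /* top 10 bits  */
  *low  = (unsigned short) ( 0xDC00 + ( offset & 0x3FF ) );/* low 10 bits  */
  return 1;
}
```

With this sketch, 10330 becomes D800:DF30 and 1D100 becomes D834:DD00, matching the table in section 2.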

3.3 UTF-32

4. All UTF-32 representation is done using 4-byte values. The transformation from UTF-16 to and from UTF-32 is described in section 3.7, paragraph D28 of the Unicode standard.
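
The direction from UTF-16 to UTF-32 can be sketched in the same spirit. The function name DecodeSurrogatePair and the choice to report failure through the return value are assumptions of this example, not part of the standard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combines a high and a low surrogate into one
   UTF-32 value. Returns 1 and stores the result in *cp when the two
   code units form a valid pair, otherwise returns 0. */
int DecodeSurrogatePair( unsigned short high, unsigned short low,
                         unsigned long *cp )
{
  assert( cp != NULL );
  if ( high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF )
  {
    return 0;   /* not a high surrogate followed by a low surrogate */
  }
  *cp = 0x10000UL
      + ( (unsigned long) ( high - 0xD800 ) << 10 )
      + (unsigned long) ( low - 0xDC00 );
  return 1;
}
```

The largest pair, DBFF:DFFF, yields 10FFFF, which is the last code point reachable through surrogates.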

4 Support Requirements

4.1 Who Should Support Surrogates

1. Many of the characters being assigned to the surrogates area are of a specialized nature. These include archaic scripts as well as mathematical and musical symbols. These symbols are not needed in all text applications and some applications may choose to provide only minimal support for surrogates.

2. The Han Vertical Extension B contains characters that are actively used in Asia today. Any Asian application that will need to deal with proper names will have to be able to fully support these characters.

3. Any general-purpose application development tools that support Unicode must support these characters.

4.2 Minimum Support Level

1. Applications are required to conform to the standards outlined in section 2.7 of the Unicode Standard. There are some additional requirements for the handling of surrogates.

2. Surrogate characters must be treated as complete units. Any transformations or editing operations must be applied to the surrogate as a whole.

3. Editing and transformation operations must never separate a high-surrogate from the following low-surrogate. If your application already deals with the need to avoid insertions in the middle of combining character sequences or inside CR LF sequences, then this should not be a difficult addition.
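
One way to honor this rule is to adjust every proposed edit position before using it. The following is a minimal sketch, assuming text held as an array of 16-bit code units; the unichar typedef and the function name SafeEditPosition are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unichar;   /* assumed 16-bit code unit type */

/* Hypothetical helper: returns pos unchanged unless it falls between a
   high surrogate and the low surrogate that follows it, in which case
   the position just before the high surrogate is returned instead. */
size_t SafeEditPosition( const unichar *text, size_t length, size_t pos )
{
  assert( text != NULL && pos <= length );
  if ( pos > 0 && pos < length
       && text[pos - 1] >= 0xD800 && text[pos - 1] <= 0xDBFF
       && text[pos]     >= 0xDC00 && text[pos]     <= 0xDFFF )
  {
    return pos - 1;   /* step back so the pair stays together */
  }
  return pos;
}
```

An insertion, deletion boundary, or line break requested at index 2 of the sequence 0041 D800 DF30 0042 would be moved to index 1. Whether to move backward or forward is a design choice; this sketch always moves backward.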

5 Handling Character Properties

5.1 Information Table Format

1. Information on these new characters will be added to the appropriate data properties files as they are defined. Code points in the character properties files will be represented in UTF-32. Four digits will be used for values in the BMP, 5 digits for Planes 1..15, and 6 digits for Plane 16. Thus you will see entries in UnicodeData.txt such as:
...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
...
10330;GOTHIC LETTER AHSA;Lo;0;L;;;;;N;;;;;
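
A parser that previously assumed exactly four hexadecimal digits in the first field needs to accept four to six. The following is a rough sketch of such a check; the function name ParseCodePointField and the use of -1 as an error value are assumptions of this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: reads the first field of a UnicodeData.txt style
   line. Returns the code point, or -1 if the line does not begin with
   four to six hexadecimal digits followed by ';', or if the value is
   above 10FFFF. */
long ParseCodePointField( const char *line )
{
  long value = 0;
  int digits = 0;
  assert( line != NULL );
  while ( isxdigit( (unsigned char) line[digits] ) && digits < 7 )
  {
    char ch = line[digits];
    /* ASCII only: '0'..'9' map to 0..9, letters are folded to lower case */
    int d = ( ch <= '9' ) ? ( ch - '0' ) : ( ( ch | 0x20 ) - 'a' + 10 );
    value = value * 16 + d;
    digits++;
  }
  if ( digits < 4 || digits > 6 || line[digits] != ';' || value > 0x10FFFFL )
  {
    return -1L;
  }
  return value;
}
```

Applied to the two sample entries above, the sketch returns 0x0061 and 0x10330 respectively.
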
5.2 Handling Property Tables

Property information can be efficiently held in a two-level table structure known as a "trie". The paper "Efficient Storage and Use of Unicode Property Data" by Lloyd Honomichl, which has been presented at several Unicode conferences, describes this technique.

The addition of space for an additional 1,048,576 character codes would, at first, seem to create a huge problem for implementations that need access to detailed character properties. However, in the immediate future it is expected that 95% of the space will remain unassigned. Furthermore, over 95% of the space that is used is assigned to Han characters. This suggests that compression techniques such as the "trie" will continue to be very successful in the foreseeable future.

 

6 Existing Application Support

6.1 Java

6.2 Others

6.3 Specification of These Characters

The latest draft C and C++ standards are introducing the following methods of specifying these characters: "\uhhhh" for characters within the BMP and "\Uhhhhhhhh", capable of specifying any Unicode character. The 'h' stands for a hexadecimal digit. The first form must hold exactly 4 hex digits. The second must hold exactly 8.

Perl and Kermit are using a variable length hexadecimal specification delimited by braces: "\x{hh..h}".

XML supports the specification of these characters through numeric character references such as "&#x10330;", and HTML has supported them since version 4.0.
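
As an illustration of the two C and C++ forms, the following sketch produces the escape text for a given code point. The function name FormatCEscape is hypothetical. The sketch rejects the surrogate code points D800..DFFF because the C standard does not permit universal character names in that range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "\uhhhh" for a BMP code point or
   "\Uhhhhhhhh" for any other code point into buf. Returns the number
   of characters written (excluding the terminating NUL), or -1 if the
   code point is not allowed or the buffer is too small. */
int FormatCEscape( unsigned long cp, char *buf, size_t size )
{
  int n;
  assert( buf != NULL );
  if ( cp > 0x10FFFFUL || ( cp >= 0xD800UL && cp <= 0xDFFFUL ) )
  {
    return -1;
  }
  if ( cp <= 0xFFFFUL )
  {
    n = snprintf( buf, size, "\\u%04lX", cp );   /* exactly 4 hex digits */
  }
  else
  {
    n = snprintf( buf, size, "\\U%08lX", cp );   /* exactly 8 hex digits */
  }
  return ( n < 0 || (size_t) n >= size ) ? -1 : n;
}
```

With this sketch, 00E9 is written as \u00E9 and 10330 as \U00010330.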

6.4 Fonts

6.5 Input Methods

Modification History

Acknowledgments

Thanks to Kenneth Whistler, Peter Constable, Michel Suignard, Asmus Freytag, Markus Scherer and Michael Everson for providing content and feedback for this document.

Annex

I indicated that there were some approaches which were fairly simple extensions of a unichar-oriented lookup system. I thought it might be helpful to provide a few C code snippets to illustrate what I was thinking of. Perhaps these will give people some ideas about how to adapt their own table lookups.

First, an example of a completely vanilla 8-bit/8-bit trie table lookup, using a single unichar parameter (UCS-2):

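/* One second-level table holds the properties for a block of 256 code points. */
typedef PROPTYPE PROPTABLE[256];
PROPTYPE GetProperty( unichar c )
{
  PROPTABLE *tbl;
  /* The high byte selects the second-level table. */
  tbl = (PROPTABLE *) firstLevelTrie[ c >> 8 ];
  if ( tbl == NULL )
  {
    return ( UNASSIGNED );
  }
  else
  {
    /* tbl points to a whole table, so dereference it before indexing
       with the low byte. */
    return ( (*tbl)[ c & 0x00FF ] );
  }
}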

This is extremely simple, but breaking up the table this way already gives you good compression, since the tbl pointer calculated at the first level can be collapsed for all the blocks in Han (CJK Unified Ideographs and CJK Unified Ideographs Extension A) as well as Hangul and Yi, where entire blocks of 256 characters have the same properties. And performance is excellent. At the cost of one additional PROPTABLE filled with the UNASSIGNED values, it's possible to define the following single line macro version of GetProperty that doesn't even use branching.

#define GetProperty(c)  ((*(PROPTABLE *) firstLevelTrie[ (c) >> 8 ])[ (c) & 0x00FF ])

/******************************************************/

Because of the way the character tables in Unicode are arranged, you will commonly find that the contents of a PROPTABLE look something like this: 000000xyxxyzz00zxxxyyxxxyyuu000000000, where 0 stands for a default value, or the value applied to unassigned characters. In other words, the start and end of each PROPTABLE are often quite redundant. By storing the default value as well as the start and end of the range of interest in the PROPTABLE, you can get much better compression of the table at the second level of the trie. And by storing default values in the second level of the trie, you can also play further compression games and/or get more robust behavior for an implementation when it hits future versions of Unicode.

This next example shows that the code does not get that much more complex:

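typedef struct PROPTABLE
{
    PROPTYPE defaultValue;    /* value for offsets outside startOffset..finalOffset */
    int      startOffset;     /* first low-byte offset stored in values[] */
    int      finalOffset;     /* last low-byte offset stored in values[] */
    PROPTYPE values[1];       /* variable length array follows here */
} PROPTABLE;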
 
PROPTYPE GetProperty( unichar c )
{
  PROPTABLE *tbl;
  int lsb;
  tbl = (PROPTABLE *) firstLevelTrie[ c >> 8 ];
  lsb = c & 0x00FF;
  if ( ( lsb < tbl->startOffset ) || ( lsb > tbl->finalOffset ) )
  {
    return ( tbl->defaultValue );
  }
  else
  {
    return ( tbl->values [ lsb - tbl->startOffset ] );
  }
}
 

Full UTF-16 surrogate support

Now let's take the second version, and adapt it for full UTF-16 surrogate support.

First of all, the assignments of all 1024 high surrogates can be compressed into a 32 x 32-bit integer array, using one bit for each high surrogate character:

unsigned int surrogateAssignments[32] = { ... };
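
A sketch of how such a bit array might be filled and queried follows. It assumes that unsigned int is at least 32 bits wide and that the most significant bit of each word stands for the first of its 32 high surrogates; the two function names are hypothetical, and the array starts out empty here rather than being initialized from a generated table.

```c
#include <assert.h>

/* One bit per high surrogate D800..DBFF; all bits start out clear. */
static unsigned int surrogateAssignments[32];

/* Hypothetical helper: records that at least one character is assigned
   under the given high surrogate. */
void MarkHighSurrogateAssigned( unsigned int highSurrogate )
{
  unsigned int index;
  assert( highSurrogate >= 0xD800 && highSurrogate <= 0xDBFF );
  index = highSurrogate & 0x03FF;              /* 0..0x3FF */
  surrogateAssignments[ index >> 5 ] |= ( 0x80000000U >> ( index & 0x1F ) );
}

/* Hypothetical helper: returns 1 if the bit for the given high
   surrogate is set, 0 otherwise. */
int IsHighSurrogateAssigned( unsigned int highSurrogate )
{
  unsigned int index;
  assert( highSurrogate >= 0xD800 && highSurrogate <= 0xDBFF );
  index = highSurrogate & 0x03FF;
  return ( surrogateAssignments[ index >> 5 ]
           & ( 0x80000000U >> ( index & 0x1F ) ) ) != 0;
}
```

The whole array occupies 128 bytes, so the test for an unassigned high surrogate costs one load, one shift, and one mask.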

Next, the basic property API needs to take a pointer to unichar, rather than a unichar parameter. This enables it to look ahead to pick up the low surrogate, when it detects a high surrogate. (Of course, the appropriate preconditions have to be met by the calling code, so you don't end up with access violation bugs; this particular approach works well for code that is using 16-bit null terminated Unicode strings, a la C strings, since the function can always determine an unmatched high surrogate at the end of a string by finding the terminating 16-bit null following it. Tune for your particular approach to Unicode strings.)

PROPTYPE GetPropertyForUTF16( unichar* c )
{
  PROPTABLE *tbl;
  int lsb;
  tbl = (PROPTABLE *) firstLevelTrie[ *c >> 8 ];
  if ( tbl->highSurrogate )
  {
    int surrogateBits;  /* a 1K index, 0..0x3FF */
    int assignedSurrogateBlock;
    unsigned int temp;
    surrogateBits = *c & 0x03FF;
    temp = surrogateAssignments [ ( surrogateBits & 0x03E0 ) >> 5 ];
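    /* Bit 31 of each word stands for the first of its 32 high surrogates.
       The mask must be parenthesized because != binds tighter than &. */
    assignedSurrogateBlock = ( ( temp & ( 0x80000000U >> ( surrogateBits & 0x1F ) ) ) != 0 );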
  
    if ( assignedSurrogateBlock )
    {
      return ( GetSurrogateProperty ( surrogateBits, *( c + 1 ) ) );
    }
    else
    {
      return ( tbl->defaultValue );
    }
  }
  lsb = *c & 0x00FF;
  if ( ( lsb < tbl->startOffset ) || ( lsb > tbl->finalOffset ) )
  {
    return ( tbl->defaultValue );
  }
  else
  {
    return ( tbl->values [ lsb - tbl->startOffset ] );
  }
}

Note that for the main (UCS-2) path through this function, the only difference is a test on a highSurrogate boolean that can be stored with the tables. (You can do the same thing with a range check on 0xD8..0xDB, of course.) So the efficiency of this property lookup function is effectively no different from the second version for UCS-2 only -- and since the BMP characters are going to constitute 99.99% of data for nearly all applications, this approach will get surrogate support at virtually no performance cost.

Inside the surrogate branch, I've just shown some sketches of how you would approach the processing. The first three lines show the bit accessing using the low ten bits of the high surrogate code point to determine whether a high surrogate block is assigned at all. GetSurrogateProperty() could, of course, be inlined, with appropriate table definitions. But I just show it here as an abstract call, to emphasize the point that effectively you are just entering a new trie space, with two 10-bit numbers (the low 10 bits of the high surrogate, and the low 10 bits of the low surrogate character that is presumed to follow the high surrogate). In the example code above, I pass the entire following character to GetSurrogateProperty(), so that that function can do the appropriate error processing for detection of a non-low-surrogate or an anomalous end-of-string (0x0000) condition.

After that, deciding what to do with the two 10-bit numbers is up to one's ingenuity. The simple approach is to simply set up a 10-bit/10-bit trie, with a table of 1024 pointers to tables with 1024 entries. Even this simple approach gets you 99.32% compression of surrogate code space, since you only need 7 second-level tables to deal with everything so far in the pipeline for Planes 1, 2, and 14. (1 high surrogate for the 3 alphabets, 1 for the musical symbols, 1 for the math alphanumerics, 2 for vertical extension B -- 1 for all the full blocks and one for the non-full block, 1 for the compatibility Han, and 1 for Plane 14 tags.) But you can really do much better than that, since you don't need a full 1024 table pointers at the first level, and since the second-level tables themselves can be compressed with the startOffset and finalOffset tricks that apply to the BMP. And since the blocks of stuff encoded in Planes 1, 2, and 14 contain little "texture" -- they tend to be swatches of similar stuff, it really should be no big problem to exceed 99.9% compression for the surrogate code space property tables, with a little ingenuity.
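
The simple 10-bit/10-bit approach might look like the following sketch. The typedefs for unichar and PROPTYPE, the UNASSIGNED value, and the names SURROGATETABLE and surrogateTrie are all assumptions made so that the example stands alone; only the GetSurrogateProperty() call shape comes from the code above.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unichar;        /* assumed 16-bit code unit type */
typedef unsigned char  PROPTYPE;       /* assumed property value type   */
#define UNASSIGNED ( (PROPTYPE) 0 )    /* assumed default property      */

/* One second-level table per high surrogate, indexed by the low 10 bits
   of the low surrogate. */
typedef PROPTYPE SURROGATETABLE[1024];

/* First level: one pointer per high surrogate; NULL means nothing is
   assigned under that high surrogate. */
static SURROGATETABLE *surrogateTrie[1024];

PROPTYPE GetSurrogateProperty( int surrogateBits, unichar low )
{
  SURROGATETABLE *tbl;
  assert( surrogateBits >= 0 && surrogateBits <= 0x03FF );
  if ( low < 0xDC00 || low > 0xDFFF )
  {
    /* Unpaired high surrogate, including the end-of-string 0x0000 case. */
    return ( UNASSIGNED );
  }
  tbl = surrogateTrie[ surrogateBits ];
  if ( tbl == NULL )
  {
    return ( UNASSIGNED );
  }
  return ( (*tbl)[ low & 0x03FF ] );
}
```

Several first-level entries may point to the same second-level table, which is how a run of blocks with identical properties can share one table.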

UTF-32 access API's can be even simpler, of course. They can build their trie strategy directly off the bit twiddling of the code point, without having the extra fiddling with surrogate pairs. But in either case, support of characters off the BMP is just a matter of a little ingenuity in table design. It doesn't have to be a major cost either in performance or complexity for the main API's that deal with character properties.

D R A F  T

PROPTYPE GetPropertyForUTF32( utf32char c )
{
  PROPTABLE *tbl;
  int lsb;
  /* Values above 0x10FFFF are outside the range reachable through
     surrogates and are treated as unassigned. */
  if ( c > 0x10FFFF )
  {
    return ( UNASSIGNED );
  }
  /* Assumption: firstLevelTrie32 is a hypothetical first-level table with
     one entry per block of 256 code points across Planes 0..16 (0x1100
     entries), each pointing to a valid PROPTABLE. */
  tbl = (PROPTABLE *) firstLevelTrie32[ c >> 8 ];
  lsb = (int) ( c & 0x00FF );
  if ( ( lsb < tbl->startOffset ) || ( lsb > tbl->finalOffset ) )
  {
    return ( tbl->defaultValue );
  }
  else
  {
    return ( tbl->values [ lsb - tbl->startOffset ] );
  }
}


Copyright © 2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.