UTF-8 and UTF-16 issues

From: OLeary, Sean (NJ) (oleary@msmail.awii.com)
Date: Fri Jun 16 2000 - 16:45:44 EDT


The following is from a document I had put together following the last San
José Unicode conference. I would be interested in writing a more complete
document with more issues added. Please send me any recommendations you
might have.
Sean

============================================================================
UTF-8 and UTF-16 - When to use each
Document history
1999-09-07 Sean O'Leary - original document
Introduction
At the 15th Unicode Conference in San José, there was a panel discussion on
when to use UTF-8 encoding and when to use UTF-16. The panel consisted of:
* Rick McGowan - Apple Computers
* Tex Texin - Progress Software
* Paul Hoffman - Internet Mail Consortium
* Carl Hoffman - Basis Technology
and was chaired by Sean O'Leary of Automated Wagering International.

The panel was formed in response to a series of emails on the Unicode list
that had strong proponents for either UTF-8 or UTF-16. The consensus of the
panel is that both UTF-8 and UTF-16 have their good and bad points and it is
up to the individual design teams to decide which form is correct for their
own projects. Towards that goal, the panel presented some criteria to use in
choosing either UTF-8 or UTF-16. After a brief statement from each panel
member, questions were fielded from the audience. This document is a summary
of the items discussed.

Basic definitions
From the Unicode Standard 2.0, page A-7: "The term UTF-8 stands for the UCS
Transformation Format, 8-bit form."
This encoding allows all plane 0 characters to be encoded in anywhere from 1
to 3 bytes. Characters in planes 1 through 16, which UTF-16 represents with
surrogate pairs, are encoded in exactly 4 bytes.
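
As a rough illustration of the byte counts described above, the following
Python sketch encodes one sample character from each range. The sample
characters and the helper name utf8_length() are chosen here for
illustration only; they are not from the panel discussion.

```python
# Sample characters picked for illustration (an assumption, not from the panel).
SAMPLES = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, Latin-1 Supplement",
    "\u4e2d": "U+4E2D, CJK ideograph in plane 0",
    "\U00010400": "U+10400, a plane 1 character",
}


def utf8_length(ch):
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in SAMPLES.items():
    print(description, "->", utf8_length(ch), "byte(s)")
```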

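UTF-16 is the 16-bit encoding of Unicode that includes the use of
surrogates. Plane 0 characters take a single 16-bit unit, so for them this is
essentially a fixed width encoding; characters in planes 1 through 16 take a
surrogate pair of two 16-bit units.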
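
The surrogate mechanism can be sketched in a few lines of Python. The
function name to_surrogate_pair() is made up for this example; the
arithmetic follows the usual UTF-16 definition (subtract 0x10000, then split
the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point):
    """Split a code point from planes 1 through 16 into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only planes 1 through 16 need a surrogate pair")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))
```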

It is important to remember that both of these encodings allow the use of
exactly the same characters, that is, the full set of Unicode characters.

Encoding attributes
Some of the attributes of the encodings are:
UTF-8
Pros:
* ASCII transparency - All ASCII characters are encoded in their usual 8 bit
form.
* "Internet ready" - UTF-8 is the most common non-ASCII encoding recognized
by Internet pages.
* No issues of endian-ness. This allows safe file transfer between different
computer platforms.
* Some standard string library functions will work correctly, e.g. strlen(),
although it returns the length in bytes rather than in characters.
Cons:
* Multibyte encodings require more processing especially when doing things
like pointer increments to the next character.
* CJK characters are stored less efficiently than in UTF-16.
* Most characters need to be expanded into a UTF-16 form prior to table
lookups for character properties or codepage mappings.

UTF-16
Pros:
* Fixed width of 16 bits makes most character processing easier.
* CJK characters are stored more efficiently than in UTF-8.
Cons:
* When exchanging data between different computer platforms, big-endian and
little-endian issues need to be dealt with.

Both encodings have to deal with issues like surrogates and combining marks,
so there is no clear advantage to either on this issue. Although UTF-16
first appears to be a fixed width encoding, surrogates and combining marks
force the software to deal with multiple "character element" issues.
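
To make the "character element" point concrete, here is a small Python
sketch that counts code points and 16-bit UTF-16 units for a string. The
helper name count_units() and the two sample strings are assumptions made
for this example. In both samples a reader sees one character on screen, yet
the software has to handle two elements.

```python
def count_units(text):
    """Return (code points, 16-bit UTF-16 units) for a Python string."""
    code_points = len(text)
    utf16_units = len(text.encode("utf-16-le")) // 2
    return code_points, utf16_units


combining = "e\u0301"          # 'e' followed by a combining acute accent
supplementary = "\U00010400"   # one plane 1 character

print(count_units(combining))      # two code points, two 16-bit units
print(count_units(supplementary))  # one code point, two 16-bit units
```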

Basic design issues
When designing a software system that uses Unicode, the first two questions
that come up are usually:
1. How is Unicode to be used internally to the application?
This includes the processing done inside of the application as well as the
file storage outside of the application. It is certainly feasible to use
UTF-16 inside an application and use UTF-8 as the file storage encoding.

2. How is Unicode to be used externally to the application?
This area usually involves the data exchange between different computer
systems. As an example, the Internet has established UTF-8 as the de facto
encoding method. If your application needs to send data "over the wire",
then you should probably be using UTF-8. If, however, your application only
talks to similar computers, you might decide that UTF-16 provides for better
performance despite the endian issues.
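
The endian issue can be seen directly in the bytes. The sketch below, with a
sample string chosen only for illustration, encodes the same text in both
UTF-16 byte orders and in UTF-8, and then shows what happens when
big-endian data is read as little-endian.

```python
text = "A\u4e2d"   # sample text: U+0041 followed by U+4E2D

big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(big_endian.hex())      # most significant byte of each unit first
print(little_endian.hex())   # least significant byte of each unit first
print(utf8.hex())            # a byte stream, the same on every platform

# Reading big-endian data with the wrong byte order gives different text.
misread = big_endian.decode("utf-16-le")
print(misread == text)
```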

Other Design Issues
After reviewing the basic design issues, a more detailed analysis should be
done. Here are some issues to consider:
1. Architecture of the legacy code and database.
* If there is a substantial amount of legacy code written for 8-bit
characters, it might be easier to add multibyte processing rather than using
UTF-16. In this case, char pointer increments and decrements would need to
be reviewed as well as your current string library support of multibyte
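chars. Some functions would still work correctly, strlen() for example
(although it returns a count of bytes, not characters). Other functions
would need to be updated or replaced, strchr() for example, because it
cannot search for a character that occupies more than one byte.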
* If you are working on a new code base, it is usually easier to use UTF-16
characters as your basic character type. Under some environments, like
Windows NT, turning on UNICODE support can result in almost invisible
processing of 16-bit Unicode characters.
* If the existing database is 8-bit ASCII, then it may not be cost effective
to convert it to UTF-16.
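
The pointer increment issue for legacy 8-bit code can be sketched as
follows. This is Python rather than C, and the helper names
is_continuation(), next_char_index() and count_chars() are invented for the
example. It assumes the input is well-formed UTF-8 and does no validation.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_char_index(data, index):
    """Return the index of the first byte of the next character."""
    index += 1
    while index < len(data) and is_continuation(data[index]):
        index += 1
    return index


def count_chars(data):
    """Count characters by stepping from lead byte to lead byte."""
    count = 0
    index = 0
    while index < len(data):
        index = next_char_index(data, index)
        count += 1
    return count


data = "na\u00efve".encode("utf-8")   # sample word with one 2-byte character
print(len(data))          # byte count, the figure a strlen()-style call gives
print(count_chars(data))  # character count
```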
 
2. Data handling issues like:
* Data transfer between computer systems or databases. The big issue here is
usually big-endian versus little-endian. Use of UTF-8 can eliminate endian
issues.
* Data compression is usually more effective with 8-bit characters rather
than 16-bit because you have better distribution of the byte patterns.
* Conversion of existing databases may be a deciding factor, especially if
disk space is already limited. If your existing database holds Latin based
characters and your new database will have a significant amount of CJK,
then conversion to UTF-16 might make sense. If the data is mostly ASCII,
then UTF-8 is usually the more cost effective encoding.
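
A quick way to estimate the storage trade-off for your own data is to encode
a representative sample both ways and compare sizes. The sketch below uses a
hypothetical helper, encoded_sizes(), and two tiny sample strings; real
data should be sampled from the database in question.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


ascii_sample = "hello"
cjk_sample = "\u4e2d\u6587"   # two CJK ideographs

print(encoded_sizes(ascii_sample))   # UTF-8 is half the size of UTF-16
print(encoded_sizes(cjk_sample))     # UTF-16 is the smaller of the two
```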

3. Interface to third party tools and APIs.
* If you are adding in third party string library support for Unicode, you
may be limited to either UTF-8 or UTF-16 based on what the vendors offer. If
you are locked into using third party library support, then you need to
evaluate the tools early in the decision process because the tools' APIs can
affect your coding dramatically.
* Resource file editing tools may only be available in UTF-16 or UTF-8.

4. Coding for UTF-16 versus UTF-8.
* In most cases, coding for fixed width UTF-16 characters is easier than
coding for multibyte UTF-8 characters. It's true that surrogates and
combining characters complicate this, but you need to check for these
conditions anyway. Fixed width characters simplify all your character
pointer manipulation as well as table lookups.
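
The surrogate check mentioned above might look like the following when
walking a buffer of 16-bit units. The function name
code_points_from_utf16() is invented for this sketch, and passing an
unpaired surrogate through unchanged is an assumption made here, not a
recommendation from the panel.

```python
def code_points_from_utf16(units):
    """Convert a list of 16-bit units into a list of code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the pair back into one code point above U+FFFF.
            pair = ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            result.append(0x10000 + pair)
            i += 2
        else:
            # Plane 0 character, or an unpaired surrogate passed through.
            result.append(unit)
            i += 1
    return result


units = [0x0041, 0xD801, 0xDC00, 0x4E2D]
print([hex(cp) for cp in code_points_from_utf16(units)])
```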

Some comments about Byte Order Mark (BOM)
Whenever there are discussions about UTF-16, the issue of dealing with the
BOM always comes up. Here are some points to remember about BOM processing:
* The BOM was developed strictly as a file level marker that could be used
to resolve whether a file is in big-endian or little-endian order. As such,
the BOM was intended to be used in 16-bit encodings like UTF-16, not in
UTF-8.
* The BOM was never intended to be used on the record level. Therefore, the
BOM should never be stored in a database's record. The endian-ness of a
UTF-16 database should be stored at a higher level within the database
itself.
* UTF-8 encoding has no endian-ness issues, therefore the use of a leading
BOM sequence in a UTF-8 file is discouraged. A possible exception to this is
in a UTF-8 encoded file that is known to contain non-ASCII characters. A
UTF-8 enabled application should be able to handle a BOM sequence. If it
cannot, then chances are that it also cannot handle the non-ASCII
characters. It may be useful to the application to immediately fail but this
use is outside of the original intention of the BOM.
* If a BOM sequence is found anywhere other than the first characters of a
file, then the byte sequence should be interpreted as a
Zero-Width-No-Break-Space character, not a BOM.
* Code should only add a BOM at the beginning of a file if it is absolutely
needed. The practice of adding a BOM to files has broken many applications
that worked correctly without the BOM.
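
As one possible way to follow the points above when reading a file, the
Python sketch below checks for a BOM only at the very start of the data and
leaves everything after it untouched. The helper name split_bom() is
invented for this example, and it deliberately ignores other encodings that
share a BOM prefix (UTF-32, for instance), which is a simplifying
assumption.

```python
import codecs

# BOM byte sequences from the standard codecs module, with the encoding
# each one implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def split_bom(data):
    """Return (encoding or None, data without a leading BOM)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data   # no BOM: the caller must know the encoding


encoding, payload = split_bom(b"\xfe\xff\x00A")
print(encoding, payload.decode(encoding))
```

Because only the start of the data is examined, a U+FEFF that appears later
in the file stays in the decoded text as a Zero-Width-No-Break-Space.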
============================================================================

Sean O'Leary
oleary@awii.com

Automated Wagering International
1255 Broad Street
Clifton, NJ 07013-4219

973-594-5077


