Comments on ASCII, Unicode

From: Hart, Edwin F. (Edwin.Hart@jhuapl.edu)
Date: Thu Jun 03 1999 - 10:15:51 EDT


Gerard Rinckes wrote:

I'm Gerard Rinckes of the Meertens Institute in Amsterdam, the Netherlands.
We're doing research on the Dutch language and culture. Therefore, we
maintain large databases. To maintain this information for over 20 years
we're converting the databases to SGML.
My question is: how long will ASCII be the standard code? Can we expect
major changes in the standard code? Or: is the information safe in ASCII?

I don't know whether you're the right organisation to ask these questions. If
not, sorry for the time.

- - - - - - - - - -

Mr. G. Adam Stanislav replied:

No one can know how long a standard will last. However, ASCII is included
in various standards and is so widely used that any changes to it would
break a lot of software currently in existence.

What can happen is that while the codes of ASCII will remain in common use,
it is conceivable that more and more programs will store text as 16-bit or
32-bit data. Even if that happens, it would be quite elementary to write a
conversion program.

It is more likely that whatever hardware you are using to store your
database on will be a piece of museum history 20 years from now. I would be
more concerned about making sure it is easy to transfer your data to
different storage media than about ASCII being still around. I would also
make sure several up-to-date copies of the database always exist in
different parts of the world so it is not destroyed in some natural
disaster or war.

- - - - - - - - - -

Dear Mr. Rinckes,

I personally recommend that you consider using the 16-bit/32-bit ISO/IEC
10646-1 standard or at least one of the parts of 8-bit ISO/IEC 8859 to
encode the data. The 10646 standard encodes a much richer set of characters
than can be found in 7-bit ASCII or 8-bit ISO/IEC 8859-1. Moreover, if you
use the UTF-8 form of 10646, you will likely find little additional overhead
to transmit and store the characters you need.
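
As a rough illustration of that overhead, here is a minimal Python sketch that compares the storage size of a short Dutch-language sample in ISO/IEC 8859-1, UTF-8, and UTF-16. The sample string and the function name encoded_sizes are made up for this example; they are not taken from your databases, and real text with more accented letters would shift the numbers somewhat.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding."""
    return {
        "latin-1": len(text.encode("latin-1")),    # ISO/IEC 8859-1
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),   # big-endian, no byte-order mark
    }

# Hypothetical sample: "België en Nederland", i.e. mostly ASCII letters
# plus one accented letter (U+00EB, e with diaeresis).
sample = "Belgi\u00eb en Nederland"
sizes = encoded_sizes(sample)
print(sizes)  # {'latin-1': 19, 'utf-8': 20, 'utf-16': 38}
```

In this sample, UTF-8 costs one extra byte for the single accented letter, while a 16-bit encoding doubles the size of the text.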

The Unicode Standard Version 2.0 is a compliant profile of ISO/IEC 10646-1
(UTF-16, Level 3, full character set (BMP)). The Unicode Consortium
contributes to the development and maintenance of ISO/IEC 10646 and is
committed to keeping the Unicode Standard compliant with ISO/IEC 10646. The
Unicode Standard Version 2.0 book provides details necessary for
implementation (for example, character property tables, equivalences, and an
algorithm for handling bi-directional text). Both ISO and Unicode plan to
publish new editions of their respective standards by the end of 1999.
However, you will likely find the Unicode Standard Version 3.0 book to be
much less costly than the ISO standard.

Let me return to your original question. The organization responsible for
the US 7-bit ASCII standard (ANSI X3.4-1986) is the L2 Committee of the
National Committee for Information Technology Standards (NCITS/L2). I am on
the L2 committee and the editor for the 7-bit ASCII standard. (The Unicode
Consortium and many of the members of Unicode are also members of the L2
Committee.)

My personal opinion is that 7-bit ASCII will be supported for a long time
because the standard is so widely implemented. Mr. Stanislav noted the
inclusion of ASCII as part of other standards (C Programming Language, SMTP
email standard, etc.). In 1991 and again in 1996, the L2 Committee voted to
reaffirm X3.4-1986 and to make no changes. At the last meeting (February,
1999), the L2 Committee voted to stabilize the standard with the next vote
for reaffirmation in 2001.

The 7-bit ASCII code positions are also firmly entrenched in other
coded-character-set standards. The International Reference Version of
ISO/IEC 646:1991 is 7-bit ASCII without the control characters. In 1992,
the US adopted ISO 8859-1:1987 as the 8-bit ASCII standard. The graphic
characters of 7-bit ASCII constitute the lower half (code positions 0x20 to
0x7F) of all parts of ISO 8859. (I believe that ISO published a second
edition of 8859-1 in 1998. For all practical purposes, the new edition is
the same as the 1987 version.) Finally, the 7-bit ASCII characters also are
the first set of graphic characters in ISO/IEC 10646-1 (code positions
0x0020 to 0x007F). Thus, conversion of the characters in 7-bit ASCII to and
from all of the parts of ISO/IEC 8859 and ISO/IEC 10646-1 is merely a matter
of adding or removing zeroes (0) in the high-order bits.
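
To make the "adding or removing zeroes" point concrete, here is a minimal Python sketch. It assumes the 16-bit form is stored big-endian without a byte-order mark; the function names widen_ascii and narrow_to_ascii are hypothetical and exist only for this illustration. For the 8-bit ISO/IEC 8859 parts no byte-level change is needed at all, since the high-order bit of each byte is simply zero.

```python
def widen_ascii(data: bytes) -> bytes:
    """Convert 7-bit ASCII bytes to 16-bit big-endian code units by
    prefixing each byte with a zero high-order byte."""
    if any(b > 0x7F for b in data):
        raise ValueError("input is not 7-bit ASCII")
    out = bytearray()
    for b in data:
        out += bytes((0x00, b))
    return bytes(out)


def narrow_to_ascii(data: bytes) -> bytes:
    """Inverse operation: drop the zero high-order byte of each 16-bit unit."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    highs, lows = data[0::2], data[1::2]
    if any(highs) or any(b > 0x7F for b in lows):
        raise ValueError("contains characters outside 7-bit ASCII")
    return bytes(lows)


ascii_text = b"SGML"
wide = widen_ascii(ascii_text)

# The hand-rolled widening agrees with Python's own UTF-16 (big-endian) codec.
assert wide == ascii_text.decode("ascii").encode("utf-16-be")
# In ISO/IEC 8859-1 the ASCII bytes are unchanged.
assert ascii_text.decode("ascii").encode("latin-1") == ascii_text
# Removing the zero bytes recovers the original ASCII.
assert narrow_to_ascii(wide) == ascii_text
```

The checks on the high-order bits matter in the narrowing direction: text that has picked up characters outside 7-bit ASCII cannot be converted back this way without loss, so the sketch raises an error rather than silently dropping data.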

Best regards,
Ed Hart

Edwin F. Hart
Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel, MD 20723-6099
+1-240-228-6926 (from Washington, DC area)
+1-443-778-6926 (from Baltimore area)
+1-240-228-1093 (fax)
edwin.hart@jhuapl.edu


