Re: Chapter on character sets

From: Keld Jørn Simonsen ([email protected])
Date: Thu Jun 15 2000 - 17:26:55 EDT

Next message: Markus Scherer: "Re: Multilingual Support with Servlet,jdbc"
Previous message: Keld Jørn Simonsen: "Re: The mother of all collation schemes"
Maybe in reply to: Keld Jørn Simonsen: "Re: Chapter on character sets"
Next in thread: Antoine Leca: "Re: Chapter on character sets"

On Thu, Jun 15, 2000 at 09:49:14AM -0800, Mike Brown wrote:
>
> The ANSI X3.4 "ASCII" standard from 1968 defines character assignments for
> hex numbers 20 through 7E, and is pretty much just the things you see on
> American keyboards, minus the control functions like Shift, Enter, etc. In
> an 8-bit encoding scheme, the byte sequences used to represent the ASCII
> numbers are single bytes with the same value as the numbers themselves.

Well, AFAIK the 1968 version of ASCII also included the control characters
00-1F and 7F.
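
As a rough illustration of that split, here is a minimal Python sketch that
partitions the 7-bit code space into the control positions (00-1F and 7F)
and the graphic positions (20-7E). The names `control` and `graphic` are
only illustrative.

```python
# Partition the 7-bit code space (00-7F) into control and graphic positions.
control = [c for c in range(0x80) if c <= 0x1F or c == 0x7F]
graphic = [c for c in range(0x80) if 0x20 <= c <= 0x7E]

# Every 7-bit position falls into exactly one of the two groups.
assert sorted(control + graphic) == list(range(0x80))

print(len(control), "control positions;", len(graphic), "graphic positions")
```

Counting SPACE (20) as a graphic position is a choice made for this sketch.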

>
> The ISO 646 standard was formalized in 1972, and provided variants of ASCII
> for different countries (ISO 646-XX, where XX is one of about a dozen
> country codes).

ISO 646 did not include national variants, but provided for them. 12
positions were unassigned. Then a number of national bodies made their
own national variants of the 646 standard.

> In addition to the 20 through 7E range, it also includes the
> C0 control set for non-displayable characters assigned to 00 through 1F, and
> the delete character at 7F. If the ECMA-6 standard is as equivalent to ISO
> 646 as I am led to believe, then some leeway is allowed for currency
> symbols: hex position 23 can be # or £, and 24 can be $ or ¤.

Yes, 646 had positions 23 and 24 as you describe; no other
characters could be assigned to these positions.

There was an International Reference Version (IRV) which was almost
identical to ASCII in the 1972 version, and then was changed to
be exactly ASCII in (say) 1991.
>
> The character set defined by the ISO 646-US standard is now known as
> "US-ASCII" due to its IANA registration for use on the Internet. It defines
> hex position 23 to be # and 24 to be $. It is a subset of all character
> encodings except IBM's EBCDIC, which is an encoding for mainframes that was
> supposedly easier to read on punch cards.

The subset claim is not true. There are charsets in use today,
like the national 646 variants, that differ from US-ASCII (in the 12
unassigned positions). They are not much used, but I still get some emails
in these encodings. Japanese and Chinese 14-bit encodings also have
discrepancies (mostly the ¥).
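
To make the difference concrete, here is a minimal Python sketch of how the
same bytes read under US-ASCII and under a national 646 variant. Python's
standard library has no codecs for the national 646 variants as far as I
know, so the table below is hand-written: `NATIONAL_USE` is my assumption of
the 12 positions in question, and `DANISH_OVERRIDES` is a partial,
illustrative table covering only the six Danish letters, not a full
definition of any standard. `decode_646` is a hypothetical helper.

```python
# Assumed list of the 12 positions left open for national use (hex).
NATIONAL_USE = [0x23, 0x24, 0x40, 0x5B, 0x5C, 0x5D, 0x5E, 0x60,
                0x7B, 0x7C, 0x7D, 0x7E]

# Partial, illustrative overrides for a Danish 646 variant (letters only).
DANISH_OVERRIDES = {0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å",
                    0x7B: "æ", 0x7C: "ø", 0x7D: "å"}


def decode_646(data: bytes, overrides: dict) -> str:
    """Decode 7-bit data, replacing national-use positions via `overrides`."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"not a 7-bit code: {b:#04x}")
        out.append(overrides.get(b, chr(b)))
    return "".join(out)


raw = b"bl}b{rgr|d"
print(raw.decode("us-ascii"))             # read as US-ASCII
print(decode_646(raw, DANISH_OVERRIDES))  # read as the Danish variant
```

The bytes are identical in both cases; only the interpretation of the
national-use positions changes, which is why such a variant is not a
superset of US-ASCII.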

>
> The ISO/IEC 8859-1 "Latin-1" standard defines character assignments for hex
> number A0 through FF, covering the characters used in the major (Western)
> European languages and that are not already covered by ASCII, and a few
> international symbols. This includes characters with diacritical
> marks/accents, «French quotation marks», non-breaking space, copyright
> symbol, etc.

8859-1 also defines 20-7E.
>
> "ISO-8859-1" (note the extra hyphen) is the IANA-registered character set
> that covers hex positions 00 to FF, subsetting US-ASCII and the C1 control
> set (80 to 9F).

Only C0 and C1 (and 7F) are added.
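
A quick way to check the resulting 00-FF layout is the minimal Python
sketch below. It assumes that Python's "iso-8859-1" codec follows the
IANA-registered charset, so that it also covers the C0 and C1 ranges.

```python
# Decode every possible byte value with the IANA "ISO-8859-1" charset.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

# Each byte maps to the character with the same number, controls included.
code_values = [ord(ch) for ch in text]
assert code_values == list(range(256))

# The 00-7F half agrees with US-ASCII.
ascii_part = bytes(range(0x80))
same_as_ascii = ascii_part.decode("us-ascii") == ascii_part.decode("iso-8859-1")
print("00-7F identical to US-ASCII:", same_as_ascii)
```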
>
> The ISO/IEC 10646-1 "Universal Character Set" standard, which from a user's
> standpoint is equivalent to The Unicode Standard, defines character
> assignments for hex numbers 00 through 10FFFF, although not in a completely
> contiguous range. Since the range goes beyond FF, it cannot simply imply 1
> byte per character like its predecessors. Thus, among other things, it
> introduces a distinction between the assignment of characters to numbers,
> and the conversion of numbers to sequences of bytes or other fixed-bit-width
> code values.

There are a number of quite old character sets that had 2 bytes
per character, as in the East Asian charsets, and a family of 8/16-bit
charsets, like ISO 6937, T.61 and some bibliographic charsets.
RFC 1345 has a number of these.
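
As a small illustration of such older multi-byte encodings, here is a
minimal Python sketch using three East Asian codecs that ship with Python
(gb2312, euc_kr, shift_jis). ISO 6937 and T.61 are not in the standard
library as far as I know, so they are not shown. The sample characters are
my own choice.

```python
# One sample character per East Asian codec; each needs two bytes.
samples = {"gb2312": "中", "euc_kr": "한", "shift_jis": "あ"}

lengths = {}
for codec, ch in samples.items():
    encoded = ch.encode(codec)
    lengths[codec] = len(encoded)
    print(codec, ch, encoded.hex(), len(encoded), "bytes")

# A plain Latin letter still takes a single byte in each of these codecs.
single = {codec: "A".encode(codec) for codec in samples}
```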

10646 is not defined the way you describe it; that distinction is
not described in 10646. There is a distinction between (abstract)
characters and their encoding, though, but that was always present in
ISO coded character sets.
>
> The UTF-8 amendment to the ISO/IEC 10646-1 standard defines an algorithm for
> converting the ISO 10646-1 character numbers to sequences of 1 to 4 8-bit
> bytes. It has also been formalized in the IETF's RFC 2279. "UTF-8" is also
> an IANA-registered character set.

I believe UTF-8 is 1 to 6 octets.
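
The two figures can be reconciled with a minimal Python sketch of the
length rule as given in RFC 2279, which covers code values up to 7FFFFFFF.
The function name `utf8_len_rfc2279` is hypothetical. For values up to
10FFFF the answer never exceeds 4, which is presumably where "1 to 4"
comes from.

```python
def utf8_len_rfc2279(code: int) -> int:
    """Number of octets UTF-8 (per RFC 2279) uses for a code value."""
    if code < 0:
        raise ValueError("code value must be non-negative")
    # Upper limit of the code range for 1, 2, 3, 4, 5 and 6 octets.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for octets, limit in enumerate(limits, start=1):
        if code <= limit:
            return octets
    raise ValueError("code value above 7FFFFFFF")


for code in (0x41, 0xE9, 0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
    print(f"{code:08X} -> {utf8_len_rfc2279(code)} octet(s)")
```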
>
>
> In other ways, UTF-8 is problematic, because most people aren't aware that
> ISO 8859-1 range characters don't enjoy the same 1-to-1 byte mapping, and
> they end up having problems when they try to work with those characters and
> their ISO-8859-1 byte values. It would help people to understand character
> encoding issues better if every UTF-8 sequence were always gibberish. This
> is actually the case for the people who don't use ASCII at all.

I think that is an 8859-1-centric view. Anyway, UTF-8 was explicitly
made to preserve all codes in US-ASCII. There are encodings of 10646
around that look more like gibberish in 8859-1; you could use one of those
if this is a goal (...).
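
A minimal Python sketch of both effects: the US-ASCII part of a string
survives UTF-8 unchanged, while a character from the 8859-1 upper range
becomes two octets that read as something else when taken for 8859-1.
UTF-16 is used here only as my own example of an encoding that looks like
gibberish throughout; it is an assumption about the kind of encoding meant.

```python
text = "café"

utf8 = text.encode("utf-8")
# The US-ASCII letters keep their single-octet values.
assert utf8[:3] == b"caf"

# Reading the UTF-8 octets as ISO-8859-1 garbles only the non-ASCII part.
misread_utf8 = utf8.decode("iso-8859-1")
print(repr(misread_utf8))

# UTF-16 (big-endian, no byte order mark) puts a 00 octet before each letter,
# so even the ASCII part no longer reads cleanly as ISO-8859-1.
utf16 = text.encode("utf-16-be")
misread_utf16 = utf16.decode("iso-8859-1")
print(repr(misread_utf16))
```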

CEN/TC304 has made a guide on character sets with some more info
on both 7/8-bit and 10646, but it is a bit Europe-centric, with
not much mention of encodings used in East Asia and other parts
of the world. It will be available via http://ww.stri.is/tc304/
but I am not sure it is there yet.

Kind regards
Keld


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT