Re: Can a single text document use multiple character encodings? from Doug Ewell on 2013-08-28 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Wed, 28 Aug 2013 19:31:56 -0600

Richard Wordingham wrote:

> Just to complicate matters, most documents encoded using ISO/IEC 2022
> rely on default initial settings, and so to interpret them it is not
> enough to say it is in an ISO/IEC 2022 encoding, but instead one must
> specify the particular encoding, which then defines the initial
> states.

ISO 2022 does require a particular initial state, but the ones Richard
is talking about are specific to ISO 2022-based encodings, such as
ISO-2022-CN or ISO-2022-JP. Those are really different encodings from
generic ISO 2022; in addition to the secret magic initial state, they
may also allow certain shortcuts in the switching sequences which
aren't allowed in fully conformant ISO 2022.
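
To illustrate the point about an encoding-specific initial state, here is a minimal sketch. It assumes Python's built-in "iso2022_jp" codec is a fair stand-in for ISO-2022-JP, and it says nothing about generic ISO 2022. The helper name initial_state_demo is made up for this example.

```python
# Sketch only: relies on CPython's built-in "iso2022_jp" codec.
ESC = b"\x1b"

def initial_state_demo():
    # No escape sequence at all: the decoder starts out in ASCII,
    # because ISO-2022-JP defines that as its initial state.
    plain = b"abc".decode("iso2022_jp")

    # ESC $ B designates JIS X 0208; ESC ( B switches back to ASCII.
    # The two bytes 0x24 0x22 in between are one JIS X 0208 character.
    stream = b"a" + ESC + b"$B" + b"\x24\x22" + ESC + b"(B" + b"b"
    switched = stream.decode("iso2022_jp")
    return plain, switched

plain, switched = initial_state_demo()
```

A decoder that only knew "this is some ISO 2022 stream" would have no basis for reading the first three bytes as ASCII; that knowledge comes from the specific encoding.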

Asmus Freytag wrote:

> ISO 2022 allows switching among sets in mid stream, but as far as I
> remember (haven't had to think about this since Unicode came around)
> the code unit is still a byte, except that sometimes pairs of bytes
> are used. As I remember, ISO 2022 was still far from widely supported
> in the late 80's and practically not at all on the fast growing PC
> sector.

ISO 2022 code units are indeed bytes, even for the double- or
(theoretical) triple-byte sets, and it was indeed almost never used on
PCs.
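
A small sketch of the "code units are bytes" point, again assuming Python's built-in "iso2022_jp" codec as the example encoding. The function name code_units is hypothetical.

```python
# Sketch only: relies on CPython's built-in "iso2022_jp" codec.
def code_units(text):
    # Encode and return the raw byte values, one integer per code unit.
    return list(text.encode("iso2022_jp"))

units = code_units("a\u3042")  # "a" followed by HIRAGANA LETTER A

# Every code unit is a single 7-bit byte, including the escape
# sequences and the pair of bytes that encodes the double-byte character.
all_seven_bit = all(u < 0x80 for u in units)
```

The double-byte character shows up as a pair of ordinary byte values between two escape sequences, not as a single wider unit.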

I think it's important to remember that Roger's original question to the
list was "Can a single text document use multiple character encodings?"
He didn't ask if such a practice was common, or confusing, or a good
idea, though perhaps those were underlying questions.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

Received on Wed Aug 28 2013 - 20:33:36 CDT