Re: Can a single text document use multiple character encodings?

From: Ilya Zakharevich
Date: Fri, 30 Aug 2013 06:05:41 -0700

On Wed, Aug 28, 2013 at 07:07:23PM +0000, Costello, Roger L. wrote:

> For example, can some text be encoded as UTF-8 while other text is encoded as UTF-16 - within the same document?

I think it is a very interesting question. A Perl program is
(obviously) a text document. On the other hand, in two minutes I
could deduce a few ways to mix many different encodings into the same
document. My current record is 5 different encodings; some of them
are arbitrary, some of them should satisfy certain compatibility
requirements (something like
 =cut CR
 =pod CR
being encoded the same in two encodings). And, on top of this, there
is yet another way to mix encodings arbitrarily.
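
The compatibility requirement can be checked mechanically: a marker line
can serve as a switching point between two encodings only if both
encodings turn it into the same bytes. Below is a minimal Python sketch
of that check. The helper names and the choice of codecs are assumptions
made for illustration, and "CR" is read literally as a carriage return.

```python
# Hypothetical helpers (not from the original post): test whether two
# encodings agree byte-for-byte on the POD marker lines.

# Assumption: "CR" in the post is taken literally as a carriage return.
MARKERS = ("=cut\r", "=pod\r")


def same_bytes(text, enc_a, enc_b):
    """Return True if `text` encodes to identical bytes in both encodings."""
    return text.encode(enc_a) == text.encode(enc_b)


def compatible(enc_a, enc_b, markers=MARKERS):
    """Return True if every marker line is encoded identically."""
    return all(same_bytes(m, enc_a, enc_b) for m in markers)


if __name__ == "__main__":
    pairs = [("utf-8", "latin-1"), ("latin-1", "koi8_r"), ("utf-8", "utf-16-le")]
    for enc_a, enc_b in pairs:
        print(enc_a, enc_b, compatible(enc_a, enc_b))
```

Any two ASCII-compatible encodings pass this check, because the markers
contain only ASCII characters. A pair such as UTF-8 and UTF-16 fails it.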

The tricks are threefold:

    ◌ First, a Perl program is actually a mixture of 3 different
      documents: the program stream, the data-for-the-program stream,
      and the documentation stream. There are certain rules for
      interleaving them (except for DATA, which must be at the end!),
      and there are documented ways to specify the encodings of the
      individual streams.

    ◌ Second, the string and regular-expression literals are
      “interpreted” by the lexer: there is a way for the program to
      specify a way to “massage” the literals before they are handed
      to the interpreter. This gives yet another way to put strings
      and/or regular expressions in a different encoding. (Note
      that this may lead to “doubly encoded” phenomena if the
      “ambient” encoding is not “raw”.)

    ◌ Third, there is a way to switch the encoding of a Perl program
      on the fly (at the end-of-line of current encoding).
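
The first trick can be illustrated at the byte level. The Python sketch
below is not a Perl parser. It splits a byte string into program, POD
and DATA parts using a rough approximation of Perl's rules (assumptions:
POD runs from a line starting with "=" plus a letter up to a "=cut"
line, and DATA is everything after a "__DATA__" or "__END__" line), then
decodes each part with its own codec. All function names and the sample
content are hypothetical.

```python
# Simplified simulation: one file, three streams, three encodings.

def split_streams(raw):
    """Split raw bytes into (code, pod, data) byte strings."""
    code, pod, data = [], [], []
    in_pod = False
    lines = raw.split(b"\n")
    for i, line in enumerate(lines):
        # Outside POD, an end marker hands the rest of the file to DATA.
        if not in_pod and line.rstrip(b"\r") in (b"__DATA__", b"__END__"):
            data = lines[i + 1:]
            break
        if in_pod:
            pod.append(line)
            if line.startswith(b"=cut"):
                in_pod = False
        elif line[:1] == b"=" and line[1:2].isalpha() and not line.startswith(b"=cut"):
            in_pod = True
            pod.append(line)
        else:
            code.append(line)
    return b"\n".join(code), b"\n".join(pod), b"\n".join(data)


def decode_streams(raw, code_enc, pod_enc, data_enc):
    """Decode each stream with its own encoding."""
    code, pod, data = split_streams(raw)
    return code.decode(code_enc), pod.decode(pod_enc), data.decode(data_enc)


def build_sample():
    """Build a byte string whose three parts use three different encodings."""
    return (
        'my $s = "é";\n'.encode("utf-8")
        + "=pod\n\nПривет\n\n=cut\n".encode("koi8_r")
        + b"__DATA__\n"
        + "naïve\n".encode("latin-1")
    )


if __name__ == "__main__":
    for part in decode_streams(build_sample(), "utf-8", "koi8_r", "latin-1"):
        print(repr(part))
```

The sample as a whole is not valid UTF-8, yet each stream decodes
cleanly once it is separated and given its own encoding. The marker
lines are pure ASCII, so they read the same in all three encodings.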

To be honest, I should have tested all this better before
posting, but I did not. On the practical side, how is this useful?
Having different encodings for DATA and the program, and/or for the
documentation and the program, may be quite widely used. The other
hacks may have been used at least in the (enormous!) Perl test suite.

Received on Fri Aug 30 2013 - 08:08:55 CDT
