Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

From: Martin J. Dürst (duerst@it.aoyama.ac.jp)
Date: Wed Nov 10 2010 - 21:54:57 CST

Next message: Bjoern Hoehrmann: "Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?"

Previous message: Jim Monty: "Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?"
In reply to: Mark Davis ☕: "Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?"
Next in thread: Johannes Rössel: "Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?"
Reply: Johannes Rössel: "Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?"

On 2010/11/11 6:28, Mark Davis ☕ wrote:

> That is actually not the case. There are superset relations among some of
> the CJK character sets, and also -- practically speaking -- between some of
> the windows and ISO-8859 sets. I say practically speaking because in general
> environments, the C1 controls are really unused, so where a non ISO-8859 set
> is same except for 80..9F you can treat it pragmatically as a superset.

Yes, except that the terms superset/subset (and set in general)
shouldn't be used unless you are speaking strictly about the repertoire
of characters, and not the encoding itself. So e.g. the repertoire of
iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1
is not a subset of UTF-8, not just because you can't label some text
encoded as iso-8859-1 as UTF-8, but because subset relationships among
the encodings themselves don't make sense.
Also, US-ASCII is not a subset of UTF-8, because when you just use the
names of the character encodings, you mean the character encodings, and
character encodings don't have subset relationships.
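
To make the repertoire/encoding distinction concrete, here is a minimal
Python sketch. It assumes Python's built-in "latin-1" codec as a stand-in
for iso-8859-1 (that codec maps all 256 byte values, C1 range included);
the names latin1_repertoire and fits_in_utf8 are made up for illustration.

```python
# Repertoire of iso-8859-1: the set of characters it can represent.
# Assumption: Python's "latin-1" codec stands in for iso-8859-1.
latin1_repertoire = {bytes([value]).decode("latin-1") for value in range(256)}

# Every character in that repertoire can also be encoded in UTF-8,
# so the repertoire is a subset of UTF-8's repertoire.
fits_in_utf8 = all(ch.encode("utf-8") for ch in latin1_repertoire)

# The encodings themselves still differ: the same character is
# represented by different bytes.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("latin-1")   # one byte: 0xE9
utf8_bytes = e_acute.encode("utf-8")       # two bytes: 0xC3 0xA9
```

The subset statement holds for the sets of characters, while the byte
sequences for anything above U+007F are different in the two encodings.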

It may well be possible to use (create?) the term sub-encoding,
saying that an encoding A is a sub-encoding of encoding B if all (legal)
byte sequences in encoding A are also legal byte sequences in encoding B
and are interpreted as the same characters in both cases. In this sense,
US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding
of many other encodings. You can also say that iso-8859-1 is a
sub-encoding of windows-1252 if the former is interpreted as not
including the C1 range.
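
The definition above can be sketched roughly in Python. This is only an
approximation under stated assumptions: it checks single bytes, which is
adequate when encoding A is a single-byte encoding, and it relies on
Python's codecs "ascii", "latin-1" and "cp1252" as stand-ins for US-ASCII,
iso-8859-1 and windows-1252. The function name is_sub_encoding, the skip
parameter and C1_RANGE are hypothetical helpers, not established terms.

```python
def is_sub_encoding(a, b, skip=frozenset()):
    """Return True if every single byte that is legal in encoding `a`
    is also legal in encoding `b` and decodes to the same character.
    Byte values listed in `skip` are treated as not part of `a`."""
    for value in range(256):
        if value in skip:
            continue
        raw = bytes([value])
        try:
            char_a = raw.decode(a)
        except UnicodeDecodeError:
            continue  # not a legal byte in A, so it imposes no constraint
        try:
            char_b = raw.decode(b)
        except UnicodeDecodeError:
            return False  # legal in A but illegal in B
        if char_a != char_b:
            return False  # legal in both, but interpreted differently
    return True


# The C1 range, 0x80..0x9F, which iso-8859-1 may be read as not including.
C1_RANGE = frozenset(range(0x80, 0xA0))

ascii_in_utf8 = is_sub_encoding("ascii", "utf-8")
latin1_in_utf8 = is_sub_encoding("latin-1", "utf-8")
latin1_in_cp1252 = is_sub_encoding("latin-1", "cp1252")
latin1_no_c1_in_cp1252 = is_sub_encoding("latin-1", "cp1252", skip=C1_RANGE)
```

Under these assumptions the check succeeds for US-ASCII against UTF-8,
fails for iso-8859-1 against UTF-8, and succeeds for iso-8859-1 against
windows-1252 only when the C1 range is left out.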

Regards, Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp


This archive was generated by hypermail 2.1.5 : Wed Nov 10 2010 - 21:59:24 CST