From: Martin J. Dürst (email@example.com)
Date: Wed Nov 10 2010 - 21:54:57 CST
On 2010/11/11 6:28, Mark Davis ☕ wrote:
> That is actually not the case. There are superset relations among some of
> the CJK character sets, and also -- practically speaking -- between some of
> the windows and ISO-8859 sets. I say practically speaking because in general
> environments, the C1 controls are really unused, so where a non ISO-8859 set
> is same except for 80..9F you can treat it pragmatically as a superset.
Yes, except that the terms superset/subset (and set in general)
shouldn't be used unless you are speaking strictly about the repertoire
of characters, and not the encoding itself. So e.g. the repertoire of
iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1
is not a subset of UTF-8 (not because you can't label some text encoded
as iso-8859-1 as UTF-8, but because subset relationships among the
encodings themselves don't make sense).
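To make the repertoire/encoding distinction concrete, here is a small sketch using Python's built-in codecs. The helper name can_encode is just an illustration, and Python's "iso-8859-1" codec (which maps all 256 byte values, including the C1 range) is assumed to stand in for the encoding as discussed here.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character is in the repertoire of `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Repertoire of iso-8859-1: the characters its 256 byte values decode to.
latin1_repertoire = [bytes([b]).decode("iso-8859-1") for b in range(256)]

# Repertoire subset: every one of those characters is also encodable in UTF-8.
repertoire_subset = all(can_encode(ch, "utf-8") for ch in latin1_repertoire)

# The encodings themselves differ: only some characters get the same bytes.
same_bytes = [
    ch for ch in latin1_repertoire
    if ch.encode("iso-8859-1") == ch.encode("utf-8")
]

print(repertoire_subset)   # whether the repertoire is a subset
print(len(same_bytes))     # how many characters share a byte representation
```

The repertoire check succeeds, but only the US-ASCII half of the characters is represented by the same bytes in both encodings, which is why a statement about the repertoires says little about the encodings.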
Also, US-ASCII is not a subset of UTF-8, because when you just use the
names of the character encodings, you mean the character encodings, and
character encodings don't have subset relationships.
It may well be possible to use (create?) the term sub-encoding,
saying that an encoding A is a sub-encoding of encoding B if all (legal)
byte sequences in encoding A are also legal byte sequences in encoding B
and are interpreted as the same characters in both cases. In this sense,
US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding
of many other encodings. You can also say that iso-8859-1 is a
sub-encoding of windows-1252 if the former is interpreted as not
including the C1 range.
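The sub-encoding definition above could be checked roughly as follows. This is only a sketch: the function name is_sub_encoding and its skip parameter are made up for illustration, and the test looks at single bytes only, so it is meaningful only when encoding A is a stateless single-byte encoding (it says nothing reliable about stateful encodings such as iso-2022-jp).

```python
def is_sub_encoding(a: str, b: str, skip=frozenset()) -> bool:
    """Per-byte check: every byte that is legal in encoding `a` must be
    legal in encoding `b` and decode to the same character.

    `skip` lists byte values deliberately left out of the comparison,
    e.g. the C1 range 0x80..0x9F.
    """
    for value in range(256):
        if value in skip:
            continue
        raw = bytes([value])
        try:
            char_a = raw.decode(a)
        except UnicodeDecodeError:
            continue  # not a legal byte in A, so it imposes no constraint
        try:
            char_b = raw.decode(b)
        except UnicodeDecodeError:
            return False  # legal in A but illegal in B
        if char_a != char_b:
            return False  # legal in both, but a different character
    return True


C1_RANGE = frozenset(range(0x80, 0xA0))

print(is_sub_encoding("ascii", "utf-8"))
print(is_sub_encoding("iso-8859-1", "utf-8"))
print(is_sub_encoding("iso-8859-1", "windows-1252"))
print(is_sub_encoding("iso-8859-1", "windows-1252", skip=C1_RANGE))
```

With the C1 range included, iso-8859-1 fails against windows-1252 because bytes 80..9F are interpreted differently; with that range skipped, the check passes, matching the pragmatic reading in the quoted message.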
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:firstname.lastname@example.org