Re: Terminology: does the term "codepoint" apply to non-Unicode character sets?

From: Bjoern Hoehrmann <derhoermi_at_gmx.net>
Date: Tue, 01 Jan 2013 23:21:38 +0100

* Costello, Roger L. wrote:
>Does the term "codepoint" apply to non-Unicode character sets?
>
>For example, are there codepoints in iso-8859-1? In Windows-1252?

There is no "the term". The term "boot" might refer to footwear or it
might refer to a part of a vehicle, or a number of other things. Such
ambiguities are usually resolved by way of mathematical rigor: you
reference a particular definition for the term or otherwise specify a
context (like "British English" for the example above).

RFC 6365 offers this:

   code point

      A value in the codespace of a repertoire. For all common
      repertoires developed in recent years, code point values are
      integers (code points for ASCII and its immediate descendants were
      defined in terms of column and row positions of a table).

while http://www.w3.org/TR/charmod/#def-CCS has

  Each character in the repertoire is then associated with a
  (mathematical, abstract) non-negative integer, the code point

And http://en.wikipedia.org/wiki/Codepoint currently has

  In character encoding terminology, a code point or code position
  is any of the numerical values that make up the code space.

So, is a "code point" necessarily a non-negative integer? No, it depends
on which definition you use. Personally I would expect some people to be
confused when you say something like "The code points in Windows-1252":
some will assume that refers to the repertoire of "characters" that can
be encoded using Windows-1252, while others will assume it refers to,
essentially, indices in a mapping table from Windows-1252 bytes to
Unicode scalar values. A related discussion would be

  http://lists.w3.org/Archives/Public/public-i18n-core/2012JanMar/0107

where the definition starts out with "A code point is a Unicode code
point ...". Under that definition your question becomes "For example,
are there Unicode code points in iso-8859-1?", and answering it in
context would require saying how "iso-8859-1" relates to "non-Unicode
character set", for which we would first need a definition of
"character set". So I think yours is not a good question to ask. If you
write new text that does not need to talk about "code points" that are
not "Unicode code points", then avoid the term.

That aside, sure, people use the term for "non-Unicode character sets"
and talk of code points "in Windows-1252".
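
To make the "index in a mapping table" reading concrete, here is a
minimal Python sketch. It assumes Python's built-in "cp1252" and
"iso-8859-1" codecs as stand-ins for the mapping tables; the variable
names are only illustrative and nothing here settles which definition
of "code point" is the right one.

```python
# One reading: a "code point in Windows-1252" is the byte value itself,
# i.e. an index into the table that maps bytes to Unicode scalar values.
byte_value = 0x80

# Decode the single byte through Python's cp1252 codec (assumed here to
# stand in for the Windows-1252 mapping table).
char = bytes([byte_value]).decode("cp1252")

# The Unicode code point of the resulting character is a different number.
unicode_code_point = ord(char)
print(f"byte 0x{byte_value:02X} -> U+{unicode_code_point:04X}")

# In ISO-8859-1 the two numbers coincide for every byte value, which is
# part of why the two readings are easy to conflate there.
latin1_same = all(
    ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256)
)
print(latin1_same)
```

For byte 0x80 the two readings give different numbers in Windows-1252
(0x80 versus U+20AC), so it matters which one a text means.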

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Tue Jan 01 2013 - 16:25:56 CST
