EURO SIGN in IBM coded character sets

From: Uma Umamaheswaran (umavs@ca.ibm.com)
Date: Mon Aug 24 1998 - 11:17:47 EDT


There has been a flurry of notes on the unicode discussion list regarding EURO
SIGN in IBM Code pages etc. and I would like to forward some explanatory notes
and points to keep in mind (from IBM NLTC) for your information.

===============================================================================================
Note 1:

According to

  http://ps.boulder.ibm.com/pbin-usa-ps/getobj.pl?/pdocs-usa/euro.html

IBM has changed code pages 850 and 857 to include the euro sign.

Therefore I suggest updating, in the mapping files

  ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT
  ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP857.TXT

the lines

  0xd5 0x0131 #LATIN SMALL LETTER DOTLESS I
  0xd5 #UNDEFINED

which should both be replaced by

  0xd5 0x20AC #EURO SIGN

and the equivalent change should be made in CP857.TXT.

Markus

--
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>

Response:

Not necessarily. I believe what it means to say (but doesn't) about CP850 is that a new code page, CP858, was created by copying CP850 and replacing dotless i with the Euro symbol. I don't think IBM has ever changed, or will ever change, the definition of a code page.

Similarly for CP857, but I don't know the CP number of its Euro twin.

CP923 is identical to ISO 8859-15. CP924 has the same repertoire as CP923 but with EBCDIC encoding. Other EBCDIC code page replacements include:

  Current  New   Code  Countries
  37       1140  9F    USA, Canada, Netherlands, Portugal, Brazil, Australia, NZ
  273      1141  9F    Austria, Germany
  277      1142  5A    Denmark, Norway
  278      1143  5A    Finland, Sweden
  280      1144  9F    Italy
  284      1145  9F    Spain, Latin America (Spanish)
  285      1146  9F    UK
  297      1147  9F    France
  500      1148  9F    Belgium, Canada, Switzerland
  871      1149  9F    Iceland

("Code" is the hex code point of the Euro symbol).
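For readers who want the list above in machine-readable form, here is a minimal sketch in Python. The names EURO_TWINS and euro_twin are hypothetical, chosen only for illustration. The spot check assumes that the cp1140 codec bundled with Python follows the IBM definition of code page 1140.

```python
# Hypothetical lookup table transcribed from the list above:
# old EBCDIC code page -> (euro replacement code page, euro code point).
EURO_TWINS = {
    37: (1140, 0x9F),
    273: (1141, 0x9F),
    277: (1142, 0x5A),
    278: (1143, 0x5A),
    280: (1144, 0x9F),
    284: (1145, 0x9F),
    285: (1146, 0x9F),
    297: (1147, 0x9F),
    500: (1148, 0x9F),
    871: (1149, 0x9F),
}


def euro_twin(code_page):
    """Return (new_code_page, euro_byte), or None if no twin is listed."""
    return EURO_TWINS.get(code_page)


# Spot check against one pair: decode the listed euro byte with Python's
# cp1140 codec (assumed to match IBM code page 1140).
new_cp, euro_byte = euro_twin(37)
decoded = bytes([euro_byte]).decode(f"cp{new_cp}")
print(new_cp, hex(euro_byte), decoded == "\u20ac")
```

Only the 37 / 1140 pair is checked here; the other rows are taken on trust from the list.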

This according to documentation shipped by IBM with OS/2 Fix Paks. I'm sure the info can also be found somewhere in the maze of IBM web pages.

- Frank

Response from Lisa Moore:

Hold on to your hats (or something!), IBM has indeed changed code page 850. But we have also created a new code page which is 858 (850 with the dotless i, 0xD5, replaced by euro). The OS/2 platform wants to stick to using 850, but some of our middleware products want to have the new code page to help maintain data integrity. 857 is still 857 because adding the euro did not replace an existing character.

It's a brave new world,

Lisa

======================================================
Comment from National Language Technical Centre, IBM Toronto Lab. (Umamaheswaran and Rick Pond)

We understand the confusion that has arisen concerning the new euro code pages.

It is true that IBM does not change the characters in a code page, although we will add characters to previously unassigned locations in a code page. The CCSID (Coded Character Set Identifier - see IBM's Character Data Representation Architecture - CDRA - publication) is used to unambiguously label a code page and character set combination.

IBM's code page definition practice is to assign a new code page identifier if the content of the code page is CHANGED / MODIFIED in such a manner that a currently assigned code position gets a new / replacement character. If the code page content GROWS - i.e. a previously unassigned position gets a new character assignment - the same code page identifier can be used -- this is the case with code page 857 and several of Microsoft's 125x code pages.

Distinguishing between OLD and NEW code page contents is NOT possible from the code page identifier alone. IBM's CDRA addresses this problem and has defined CCSIDs to help identify more precisely the content of coded character sets. Some products such as DOS had started using the code page identifier before the CDRA's CCSID was defined. In the simplest of cases - i.e. code pages whose content is FULL and never changes - this was more than sufficient. OS/2 and Windows built on top of DOS's code page identification. On such operating systems other ingenious means have to be employed if there is a need to distinguish the old code page from the new.
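The changed-versus-grown rule described above can be sketched as a small check. This is an illustration only, not IBM's registration procedure: the function name needs_new_identifier and the one-entry tables are hypothetical, and each table is modelled as a dict from code position to character, with unassigned positions simply absent.

```python
def needs_new_identifier(old, new):
    """Return True if any position assigned in `old` has a different
    character (or no character) in `new`, i.e. the content CHANGED.
    Return False if `new` only fills positions unassigned in `old`,
    i.e. the content merely GREW."""
    for position, character in old.items():
        if new.get(position) != character:
            return True
    return False


# Toy fragments for illustration (not complete code page tables).
old_850 = {0xD5: "\u0131"}  # dotless i at 0xD5
new_858 = {0xD5: "\u20ac"}  # euro replaces dotless i
old_857 = {}                # 0xD5 unassigned
new_857 = {0xD5: "\u20ac"}  # euro added to a free position

print(needs_new_identifier(old_850, new_858))
print(needs_new_identifier(old_857, new_857))
```

The first call models the 850 to 858 case (a replacement, so a new identifier is called for); the second models 857 (growth, so the identifier can be kept).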

For the Latin-1 OS/2 and DOS, IBM has registered a new code page, 858, which is identical to 850 except the dotless i at 0xd5 is replaced by the euro sign.
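One way to see the relationship between 850 and 858 is to compare two conversion tables position by position. The sketch below uses the cp850 and cp858 codecs that ship with Python, on the assumption that they follow the IBM definitions. The helper name differing_positions is hypothetical.

```python
def differing_positions(codec_a, codec_b):
    """Decode every single-byte value with both codecs and return
    (byte, char_a, char_b) for each position where the results differ."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs


for position, old, new in differing_positions("cp850", "cp858"):
    print(f"0x{position:02X}: U+{ord(old):04X} -> U+{ord(new):04X}")
```

Undefined positions decode to U+FFFD under errors="replace", so a position that is unassigned in both tables is not reported as a difference.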

Because of the urgency of supporting the euro, it was necessary for OS/2 and DOS -2000 to support this new code page USING THE OLD IDENTIFIER, 850. Hence, there is some difficulty when using data from these operating systems in determining whether it is really code page 850 or code page 858. Other IBM products are using various means to minimize any data integrity exposures. For some time there will be systems without the EURO SIGN support installed and these will have the CP 850 as it is today.

As to identifying conversion tables from code page 850 in the Unicode database, our suggestion is to leave the 850 to UCS conversion tables alone, define new 858 to UCS conversion tables, and select the right one depending on knowledge of which flavour of 850 is employed as source and target. IBM's CCSIDs can help with at least the proper identification, leaving the selection to some process of defaulting or to some other way of finding out which version is relevant. Ironically, one of the ways of checking whether EURO SIGN is supported is to check the local operating system-provided mapping between 850 and UCS for code position x20AC of UCS; if it returns Undefined, EURO SIGN is NOT supported.
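The x20AC check mentioned above can be approximated in a few lines. This is a sketch under a stated assumption: it probes the conversion tables bundled with Python rather than the mapping provided by the local operating system, so it only shows the shape of the test. The function name supports_euro is hypothetical.

```python
def supports_euro(codec_name):
    """Return True if the named codec can encode U+20AC EURO SIGN.
    An unknown codec name raises LookupError, which is left to the caller."""
    try:
        "\u20ac".encode(codec_name)
    except UnicodeEncodeError:
        return False
    return True


for name in ("cp850", "cp858", "cp037", "cp1140"):
    print(name, supports_euro(name))
```

A real check on OS/2 or DOS would have to ask the platform's own 850 to UCS table, since that is where the old and new flavours of 850 differ.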

The problem of uniquely identifying what you have is not restricted to 850. It applies to other cases as well -- including telling apart pre- and post-Euro code pages (where the code page id is retained, and these are all growing code pages with still more room to grow), UCS versions / amendments etc. Currently we have implementations of Unicode 2.0 progressing to Unicode 2.1 (to get the Euro Sign support), and the x20AC mapping depends on which Unicode version is being mapped to, which fixpack or version of the operating system is installed etc. Over time Unicode 2.1 will be replaced by Unicode 3.0 as the next stable target, and the mappings (for newer characters) will have to be identified from the NEW UCS to the others -- a code page identifier by itself is not sufficient to recognize the difference (in this case, between character set repertoires of UCS). In addition, an appropriate level of tagging, with tags conveying sufficient information (or reliable equivalents of tagging), is needed to manage the interchanges unambiguously and without loss of information.

The information given in a previous note on the unicode list, concerning new code pages 923, 924 and the 114x series is correct.

For further information about IBM's code pages, or CCSID definitions, please send an inquiry to nltc@ca.ibm.com.

I hope the above explanations help in understanding (though the solutions are not easy) the situation.

V.S. Umamaheswaran, Ph.D.
National Language Technical Centre, IBM Toronto Lab.
3R/D979, 1150 Eglinton Ave. E, Toronto, ON, Canada, M3C 1H7
+1 416 448 3474 (TL778); Fax: 448 4414
Internet: umavs@ca.ibm.com; Notes: umavs@ibmca; VM: umavs@torolab6


