L2/00-248

Character encoding mappings and related files

This file consists of tables with links to the available mapping data files. For the most current information, please refer to the Unicode ftp site for mapping data (ftp://ftp.unicode.org/Public/MAPPINGS/).

This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been provided on optical media by Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.

Unicode, Inc. hereby grants the right to freely use the information supplied in this file in the creation of products supporting the Unicode Standard, and to make copies of this file in any form for internal or external distribution as long as this notice remains attached.

  Date of last update: 2000-08-03

  Revision history:

    1. 1999-10-07: created for Unicode 3.0

    2. 1999-10-08: editorial changes

    3. 1999-10-09: further changes

    4. 2000-08-03: addition of GSM/7-bit and fix of table headings

1. ASCII based

1.1 Unicode, ISO/IEC 10646

Various preexisting line ending conventions are used with these, but use of PARAGRAPH SEPARATOR and LINE SEPARATOR is recommended. All commonly occurring line ending conventions should be properly interpreted (even if mixed in the same file). See also UTR 13 (Unicode Newline Guidelines) regarding line/paragraph ending/separation, and UTR 14 (Line Breaking Properties) and its associated Unicode database file regarding line breaking, as well as UTR 9 (The Bidirectional Algorithm).
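As a minimal sketch of interpreting mixed line ending conventions in a single pass, the following Python function splits text on CARRIAGE RETURN + LINE FEED, LINE FEED, CARRIAGE RETURN, NEXT LINE, LINE SEPARATOR and PARAGRAPH SEPARATOR. The name split_lines and the exact set of terminators are assumptions made for illustration only; UTR 13 remains the authoritative guidance.

```python
import re

# CR LF is listed first so that it counts as one terminator, not two.
_LINE_END = re.compile("\r\n|[\n\r\u0085\u2028\u2029]")


def split_lines(text: str) -> list[str]:
    """Split text on any of the common line or paragraph terminators."""
    parts = _LINE_END.split(text)
    # A terminator at the very end does not start a further (empty) line.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


mixed = "a\r\nb\rc\nd\u0085e\u2028f\u2029g"
lines = split_lines(mixed)
```

Empty lines in the middle of the text are preserved, so the caller can still tell a blank line from a line break.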

Character encoding | Mapping to Unicode | Date of last update | Remark

Unicode/UTF-8 (UTF-8) | Given by algorithm (normative) | | In Unicode UTF-8 is limited to planes 0-16
Unicode/UTF-16 (UTF-16, UTF-16BE) | Given by algorithm (normative) | | Identical to ISO/IEC 10646/UTF-16
Unicode/UTF-16 (UTF-16LE) | Byte pair swap if serialised into octets | | Does not conform to 10646 if used in serialisation into octets
Unicode/SCSU | Given by algorithm (UTR 6) | | Standard Compression Scheme for Unicode; Big endian
Unicode/UTF-32 (UTF-32, UTF-32BE) | Given by algorithm (UTR 19) | | UCS-4 limited to planes 0-16
Unicode/UTF-32 (UTF-32LE) | Byte quartet reversal if serialised into octets; then given by algorithm (UTR 19) | | Does not conform to 10646 if used in serialisation into octets
[Unicode/UTF-7 WITHDRAWN] | Was given by algorithm in Unicode 2.0. Not included in Unicode 3.0. | | Was intended only for e-mail. Withdrawn and obsolescent.

ISO/IEC 10646/UTF-8 | Given by algorithm (normative) for planes 0-16 | | Suitable for "8-bit clean" ASCII oriented programs
ISO/IEC 10646/UTF-16 | Given by algorithm | | UCS-2 extended to planes 0-16; Big endian when serialised into octets
ISO/IEC 10646/UCS-2 | Identity | | UTF-16 restricted to plane 0 (BMP); Big endian when serialised into octets; Stepping stone to UTF-16
ISO/IEC 10646/UCS-4 | Given by algorithm (normative) for planes 0-16 | | Big endian when serialised into octets
[ISO/IEC 10646/UTF-1 WITHDRAWN] | Was given by algorithm | | Withdrawn and obsolete.
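The little-endian entries in the table above describe a mechanical transformation of the big-endian serialisation: the octets are reversed within each 16-bit or 32-bit code unit. The sketch below illustrates this, using Python's built-in codecs as the reference. The helper name reverse_units is hypothetical and not part of any standard.

```python
def reverse_units(data: bytes, width: int) -> bytes:
    """Reverse the octet order inside each code unit of `width` octets."""
    if len(data) % width != 0:
        raise ValueError("data length is not a multiple of the code unit width")
    return b"".join(data[i:i + width][::-1] for i in range(0, len(data), width))


# One ASCII character, one Latin-1 character, one character from plane 1.
text = "A\u00e5\U00010400"

# Big-endian serialisation first, then swap octets per code unit.
utf16_le = reverse_units(text.encode("utf-16-be"), 2)
utf32_le = reverse_units(text.encode("utf-32-be"), 4)
```

For UTF-16 the swap applies to each 16-bit code unit separately, so the two halves of a surrogate pair keep their relative order.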

1.2 Other character encodings from ISO, IEC, ISO/IEC, ECMA, ETSI

See also iso8859/readme.txt.

Line ending convention for these is often LINE FEED.

Character encoding | Mapping to Unicode | Date of last update | Remark

ISO/IEC 646:1991-IR | (By implicit algorithm) | | '7-bit' ASCII; US-ASCII.

ETSI 03.38 '7-bit' default alphabet | ETSI/GSM0338.txt | | GSM/SMS (UCS-2 can also be used for GSM/SMS)

ISO/IEC 8859-1:1998 | iso8859/8859-1.txt | 1999 July 27 | Latin-1 (Western Europe, no Euro, not French)
ISO/IEC 8859-2:1999 | iso8859/8859-2.txt | 1999 July 27 | Latin-2 (Central Europe)
ISO/IEC 8859-3:1999 | iso8859/8859-3.txt | 1999 July 27 | Latin-3
ISO/IEC 8859-4:1998 | iso8859/8859-4.txt | 1999 July 27 | Latin-4
ISO/IEC 8859-5:1999 | iso8859/8859-5.txt | 1999 July 27 | Latin/Cyrillic
ISO/IEC 8859-6:1999 | iso8859/8859-6.txt | 1999 July 27 | Latin/Arabic; L-to-R storage?
ISO/IEC 8859-7:1987 | iso8859/8859-7.txt | 1999 July 27 | Latin/Greek
ISO/IEC 8859-8:1999 | iso8859/8859-8.txt | 1999 July 27 | Latin/Hebrew; L-to-R storage?
ISO/IEC 8859-9:1999 | iso8859/8859-9.txt | 1999 July 27 | Latin-5
ISO/IEC 8859-10:1998 | iso8859/8859-10.txt | 1999 July 27 | Latin-6
ISO/IEC 8859-11 | | | Latin/Thai
12 | | | Unused 8859 part number
ISO/IEC 8859-13:1998 | iso8859/8859-13.txt | 1999 July 27 | Latin-7
ISO/IEC 8859-14:1998 | iso8859/8859-14.txt | 1999 July 27 | Latin-8
ISO/IEC 8859-15:1999 | iso8859/8859-15.txt | 1999 July 27 | Latin-9
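The mapping data files listed above are plain text. The following is a minimal sketch of reading one, assuming each data line holds a hexadecimal code (0xXX), whitespace, the hexadecimal Unicode value (0xXXXX), and an optional comment introduced by '#'. That layout is an assumption here; check the header of each file before relying on it. The function name parse_mapping and the sample lines are hypothetical.

```python
def parse_mapping(lines):
    """Return {encoded value: Unicode code point} from '0xXX 0xXXXX # name' lines."""
    table = {}
    for line in lines:
        body = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not body:
            continue
        fields = body.split()
        if len(fields) < 2:
            continue  # a code listed without a Unicode value is left unmapped
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical excerpt written for this example, not copied from a real file.
SAMPLE = """\
# Example only
0x41\t0x0041\t#\tLATIN CAPITAL LETTER A
0xA0\t0x00A0\t#\tNO-BREAK SPACE
0xE9\t0x00E9\t#\tLATIN SMALL LETTER E WITH ACUTE
"""

table = parse_mapping(SAMPLE.splitlines())
```

A dictionary keyed by the encoded value is enough for single-byte encodings; multi-byte encodings can use the same structure with larger keys.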

1.3 Mac OS

Mac OS 8.5 and onwards is Unicode enabled. See also vendors/apple/readme.txt.

Line ending convention for these is often CARRIAGE RETURN.

Character encoding | Mapping to Unicode | Date of last update | Remark

Mac OS Arabic | vendors/apple/arabic.txt | 1999-Sep-22 | Reading order storage?
Mac OS Central European | vendors/apple/centeuro.txt | 1999-Sep-22 | CP 10029
Mac OS Chinese Simplified | vendors/apple/chinsimp.txt | 1999-Sep-22 |
Mac OS Chinese Traditional | vendors/apple/chintrad.txt | 1999-Sep-22 |
Mac OS Croatian | vendors/apple/croatian.txt | 1999-Sep-22 |
Mac OS Cyrillic | vendors/apple/cyrillic.txt | 1999-Sep-22 | CP 10007
Mac OS Devanagari | vendors/apple/devanaga.txt | 1999-Sep-22 |
Mac OS Farsi | vendors/apple/farsi.txt | 1999-Sep-22 |
Mac OS Greek | vendors/apple/greek.txt | 1999-Sep-22 | CP 10006
Mac OS Gujarati | vendors/apple/gujarati.txt | 1999-Sep-22 |
Mac OS Gurmukhi | vendors/apple/gurmukhi.txt | 1999-Sep-22 |
Mac OS Hebrew | vendors/apple/hebrew.txt | 1999-Sep-22 | Reading order storage?
Mac OS Icelandic | vendors/apple/iceland.txt | 1999-Sep-22 | CP 10079
Mac OS Japanese | vendors/apple/japanese.txt | 1999-Sep-22 | Apple Shift-JIS
Mac OS Korean | vendors/apple/korean.txt | 1999-Sep-22 |
Mac OS Roman | vendors/apple/roman.txt | 1999-Sep-22 | CP 10000
Mac OS Romanian | vendors/apple/romanian.txt | 1999-Sep-22 |
Mac OS Thai | vendors/apple/thai.txt | 1999-Sep-22 |
Mac OS Turkish | vendors/apple/turkish.txt | 1999-Sep-22 | CP 10081
Mac OS Ukrainian | vendors/apple/ukraine.txt | 1999-Sep-22 | See vendors/apple/cyrillic.txt

CP 10007 MacCyrillic | vendors/micsft/mac/cyrillic.txt | 04/24/96 | See vendors/apple/cyrillic.txt
CP 10006 MacGreek | vendors/micsft/mac/greek.txt | 04/24/96 | See vendors/apple/greek.txt
CP 10079 MacIcelandic | vendors/micsft/mac/iceland.txt | 04/24/96 | See vendors/apple/iceland.txt
CP 10029 MacLatin2 | vendors/micsft/mac/latin2.txt | 04/24/96 | See vendors/apple/centeuro.txt
CP 10000 MacRoman | vendors/micsft/mac/roman.txt | 04/24/96 | See vendors/apple/roman.txt
CP 10081 MacTurkish | vendors/micsft/mac/turkish.txt | 04/24/96 | See vendors/apple/turkish.txt

NEXTSTEP Encoding | vendors/next/nextstep.txt | 1999 September 23 | Line ending convention: LF

1.4 Windows

Windows NT is Unicode enabled. Windows 95 and onwards can output Unicode text.

Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.

Character encoding | Mapping to Unicode | Date of last update | Remark

CP 874 | vendors/micsft/windows/cp874.txt | 02/28/98 | Latin/Thai
CP 932 | vendors/micsft/windows/cp932.txt | 04/15/98 | MS Shift-JIS
CP 936 | vendors/micsft/windows/cp936.txt | 04/15/98 | MS Chinese (Simpl.)
CP 949 | vendors/micsft/windows/cp949.txt | 04/15/98 | MS Korean
CP 950 | vendors/micsft/windows/cp950.txt | 04/15/98 | MS Big-5 (Trad. Chinese)
CP 1250 | vendors/micsft/windows/cp1250.txt | 04/15/98 | Central Europe
CP 1251 | vendors/micsft/windows/cp1251.txt | 04/15/98 | Latin/Cyrillic
CP 1252 | vendors/micsft/windows/cp1252.txt | 04/15/98 | Extends on ISO/IEC 8859-1 Latin-1
CP 1253 | vendors/micsft/windows/cp1253.txt | 04/15/98 | Latin/Greek
CP 1254 | vendors/micsft/windows/cp1254.txt | 04/15/98 | Turkish
CP 1255 | vendors/micsft/windows/cp1255.txt | 04/15/98 | Latin/Hebrew; Reading order storage?
CP 1256 | vendors/micsft/windows/cp1256.txt | 01/5/99 | Latin/Arabic; Reading order storage?
CP 1257 | vendors/micsft/windows/cp1257.txt | 04/15/98 | Baltic
CP 1258 | vendors/micsft/windows/cp1258.txt | 04/15/98 | Vietnamese
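The remark that CP 1252 extends ISO/IEC 8859-1 can be illustrated with a small sketch. It relies on Python's built-in "cp1252" and "latin-1" codecs, which are maintained separately from the mapping files listed here; treating them as equivalent to those files for the code points shown is an assumption.

```python
# The octets 0xA0-0xFF decode identically under both encodings.
upper = bytes(range(0xA0, 0x100))
same_upper = upper.decode("cp1252") == upper.decode("latin-1")

# The octet 0x80 shows the difference: a graphic character in CP 1252,
# a C1 control code in ISO/IEC 8859-1.
euro_cp1252 = b"\x80".decode("cp1252")
c1_latin1 = b"\x80".decode("latin-1")
```

Text labelled as ISO/IEC 8859-1 that contains octets in the range 0x80-0x9F is therefore worth checking, since it may actually be CP 1252.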

1.5 DOS

Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.

See also the IBM README file (vendors/ibm/readme.txt) on encoding mappings.

Character encoding | Mapping to Unicode | Date of last update | Remark

CP 437 Latin (US) | vendors/micsft/pc/cp437.txt | 04/24/96 | Obsolescent
CP 737 Greek (A) | vendors/micsft/pc/cp737.txt | 04/24/96 | Obsolescent
CP 775 BaltRim | vendors/micsft/pc/cp775.txt | 04/24/96 | Obsolescent
CP 850 Latin (A) | vendors/micsft/pc/cp850.txt | 04/24/96 | Obsolescent
CP 852 Latin (B) | vendors/micsft/pc/cp852.txt | 04/24/96 | Obsolescent
CP 855 Cyrillic (A) | vendors/micsft/pc/cp855.txt | 04/24/96 | Obsolescent
CP 857 Turkish | vendors/micsft/pc/cp857.txt | 04/24/96 | Obsolescent
CP 860 Portuguese | vendors/micsft/pc/cp860.txt | 04/24/96 | Obsolescent
CP 861 Icelandic | vendors/micsft/pc/cp861.txt | 04/24/96 | Obsolescent
CP 862 Hebrew | vendors/micsft/pc/cp862.txt | 04/24/96 | Obsolescent; Reading order storage?
CP 863 Canada F | vendors/micsft/pc/cp863.txt | 04/24/96 | Obsolescent
CP 864 Arabic | vendors/micsft/pc/cp864.txt | 04/24/96 | Obsolescent; Reading order storage?
CP 865 Nordic | vendors/micsft/pc/cp865.txt | 04/24/96 | Obsolescent
CP 866 Cyrillic (B) | vendors/micsft/pc/cp866.txt | 04/24/96 | Obsolescent
CP 869 Greek (B) | vendors/micsft/pc/cp869.txt | 04/24/96 | Obsolescent
CP 874 Thai | vendors/micsft/pc/cp874.txt | 04/15/98 | See vendors/micsft/windows/cp874.txt

1.6 Other ASCII-based

Non-ISO encodings on Unixes, Adobe's encoding, non-MS PC encodings, non-Apple Mac encodings, RDS&DAB encodings, ...

Character encoding | Mapping to Unicode | Date of last update | Remark

Adobe Standard Encoding | vendors/adobe/stdenc.txt | 30 March 1999 | vendors/adobe/readme.txt

IBM CP 1006 | vendors/misc/cp1006.txt | 1999 July 27 | ASCII+Arabic; Reading order storage?
CP 856 | vendors/misc/cp856.txt | 1999 July 27 | ASCII+Hebrew; Reading order storage?
KOI 8-R (RFC 1489) | vendors/misc/koi8-r.txt | 18 August 1999 | ASCII+Cyrillic

JIS X 0201 (1976) | eastasia/jis/jis0201.txt | 8 March 1994 |
Shift-JIS | eastasia/jis/shiftjis.txt | 8 March 1994 |
Johab | eastasia/ksc/johab.txt | 08/16/99 |

2. EBCDIC based

See also: vendors/ibm/readme.txt.

Except for Unicode, line ending convention for these is often NEXT LINE.
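As a small illustration of how EBCDIC differs from the ASCII-based encodings, and of NEXT LINE as a line terminator, the sketch below round-trips a string through Python's built-in "cp037" codec. That codec is maintained separately from the mapping files listed here, so its agreement with vendors/micsft/ebcdic/cp037.txt is assumed rather than verified.

```python
# Two lines separated by NEXT LINE (U+0085), the usual EBCDIC line terminator.
text = "HELLO\u0085WORLD"

ebcdic = text.encode("cp037")        # one octet per character
round_trip = ebcdic.decode("cp037")  # back to the original string

# The same letter has different octet values in EBCDIC and in ASCII.
a_ebcdic = "A".encode("cp037")
a_ascii = "A".encode("ascii")
```

Because NEXT LINE has its own octet in the EBCDIC code page, converting to Unicode preserves it as U+0085 rather than as LINE FEED, and the receiving program must recognise it as a line ending.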

Character encoding | Mapping to Unicode | Date of last update | Remark

Unicode/UTF-EBCDIC | Given by algorithm (UTR 16) | | Only for use where EBCDIC is required.

IBM EBCDIC CP 424 (Hebrew) | vendors/misc/cp424.txt | 1999 July 27 | L-to-R storage?

CP 037 IBM US Canada | vendors/micsft/ebcdic/cp037.txt | 04/24/96 |
CP 500 IBM International | vendors/micsft/ebcdic/cp500.txt | 04/24/96 |
CP 875 IBM Greek | vendors/micsft/ebcdic/cp875.txt | 04/24/96 |
CP 1026 IBM Latin-5 Turkish | vendors/micsft/ebcdic/cp1026.txt | 04/24/96 |

3. Others

East Asian without ASCII/EBCDIC, symbol, dingbat, private use area/corporate zone, character entities, cross-references, ...

Character encoding | Mapping to Unicode | Date of last update | Remark

IBM PC memory-mapped video graphics | vendors/misc/ibmgraph.txt | 1999 July 27 | Obsolescent

SGML character entities | vendors/misc/sgml.txt | 25 July 1997 |

Adobe Symbol Encoding | vendors/adobe/symbol.txt | 30 March 1999 | vendors/adobe/readme.txt
Adobe Zapf Dingbats Encoding | vendors/adobe/zdingbat.txt | 30 March 1999 |

Registry of Apple use of Unicode corporate-zone | vendors/apple/corpchar.txt | 1999-Sep-22 | Registry, not a mapping
Mac OS Dingbats | vendors/apple/dingbats.txt | 1999-Sep-22 |
Mac OS Symbol | vendors/apple/symbol.txt | 1999-Sep-22 |

TCVN-NSCII HyperCard stack | EASTASIA/TCVN/TCV-SEA.HQX | | eastasia/tcvn/readme.txt
Unicode Han Character Cross-Reference | eastasia/cjkxref.txt | 14 March 1994 |
Unihan database | eastasia/unihan.txt | 23 September 1996 |

Korean Hangul Encoding Conversion | eastasia/ksc/hangul.txt | Oct 04, 1995 |
KS C 5601 | eastasia/ksc/old5601.txt | 6 December 1993 | Note: For Unicode 1.1! Obsolete!
Unified Hangeul (KS C 5601-1992) | eastasia/ksc/ksc5601.txt | 07/24/95 | For Unicode 2.0 and onwards.
Unified Hangul (KS X 1001) | eastasia/ksc/ksx1001.txt | 08/16/99 |

JIS X 0208 (1990) | eastasia/jis/jis0208.txt | 8 March 1994 |
JIS X 0212 (1990) | eastasia/jis/jis0212.txt | 8 March 1994 |

GB 12345-80 | eastasia/gb/gb12345.txt | 6 December 1993 |
GB 2312-80 | eastasia/gb/gb2312.txt | 6 December 1993 |

BIG5 | eastasia/other/big5.txt | 11 February 1994 |
CNS 11643-1986 | eastasia/other/cns11643.txt | 21 October 1994 |

The 'conscript' registry has a number of unofficial registrations of possible use of the private use areas, for those interested in constructed writing systems. The private use areas can be used for any experimental, temporary, or 'private' characters. There can by definition be no standard use of the private use areas. Non-standardised use of code points that are not designated as private use violates Unicode and ISO/IEC 10646 conformity. The "corporate zone" is part of the private use area in the BMP, but is not excluded from use by anyone.
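A program that needs to tell private use code points from standardised ones can test against the private use ranges directly. The sketch below assumes the three ranges U+E000..U+F8FF (BMP), U+F0000..U+FFFFD (plane 15) and U+100000..U+10FFFD (plane 16). The function name is_private_use is hypothetical, and the sketch does not single out the corporate zone, whose boundaries are not given in this file.

```python
# Private use ranges as inclusive (first, last) code point pairs.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # private use area in the BMP
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the private use ranges."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


samples = [0x0041, 0xE000, 0xF8FF, 0xF900, 0xF0000, 0x10FFFD, 0x10FFFE]
flags = [is_private_use(cp) for cp in samples]
```

The last two code points of planes 15 and 16 are excluded on purpose: they are not private use characters.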