MARBI CHARACTER SET SUBCOMMITTEE
Interim Report to MARBI
July 15, 1995
The Character Set Subcommittee was appointed in June 1994 (following 
MARBI discussion of Discussion Paper #73) with the following charge:
        *	To review the character set issues related to mapping between 
                USMARC and Unicode; 
        *	To formulate a proposal for review and comment by LC, MARBI, 
                and the USMARC Advisory Group; 
        *	To identify other issues related to character sets which should 
                be addressed by MARBI and/or the library community.
Members of the Subcommittee are:
        Joan Aliprand - RLG
        Randy Barry - LC
        Candy Bogar - DRA
        John Espley - VTLS
        Robyn Greenlund - Microlif
        Sally McCallum - LC
        Gary Smith - OCLC
        Paul Weiss - University of New Mexico
        Larry Woods - University of Iowa, Chair
The Subcommittee established five working principles to guide the 
mapping:
        1.	Round-trip mapping will be provided between USMARC characters 
                and Unicode characters in every possible case.
        2.	Transliteration tables will remain unchanged unless there is no 
                Unicode equivalent for a diacritical mark, in which case 
                a change to the transliteration table may be considered 
                by the Library of Congress.
        3.	Accented letters (and vocalized consonants in Hebrew and 
                Arabic) will continue to be encoded as a base letter 
                and non-spacing marks. Use of precomposed accented 
                letters is not sanctioned at this stage.
        4.	Punctuation in the USMARC Hebrew, Cyrillic, and Arabic 
                character sets, and digits in the Hebrew and Cyrillic 
                sets, will be "unified" by being mapped to the characters in 
                the ASCII block of the Unicode standard (under further 		
                consideration).
        5.	Codes in the Private Use Area will be used only if necessary to 
                facilitate round-trip mapping.
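
As an illustration of principle 1, the following is a minimal Python 
sketch of a round-trip check. The table is only an excerpt of pairs from 
Appendix 1, and the names USMARC_TO_UNICODE, invert and round_trips are 
hypothetical, chosen for this example; they are not part of any USMARC or 
Unicode specification.

```python
# Excerpt only: a few USMARC (ANSEL) code -> Unicode value pairs
# copied from Appendix 1. The names used here are illustrative.
USMARC_TO_UNICODE = {
    0xA1: 0x0141,  # uppercase Polish L
    0xA3: 0x0110,  # uppercase D with crossbar
    0xB3: 0x0111,  # lowercase D with crossbar
    0xBA: 0x00F0,  # lowercase eth
    0xE2: 0x0301,  # acute
    0xEB: 0xFE20,  # ligature, first half
    0xEC: 0xFE21,  # ligature, second half
}


def invert(table):
    """Build the reverse table; fail if two source codes share a target."""
    inverse = {}
    for src, dst in table.items():
        if dst in inverse:
            raise ValueError(
                f"U+{dst:04X} is the target of both "
                f"{inverse[dst]:02X} and {src:02X}"
            )
        inverse[dst] = src
    return inverse


UNICODE_TO_USMARC = invert(USMARC_TO_UNICODE)


def round_trips(code):
    """True if USMARC -> Unicode -> USMARC returns the original code."""
    return UNICODE_TO_USMARC[USMARC_TO_UNICODE[code]] == code


print(all(round_trips(code) for code in USMARC_TO_UNICODE))
```

A mapping in which two USMARC codes shared one Unicode value would make 
invert() raise an error, which is the situation principle 5 (Private Use 
Area codes) is meant to cover.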
The Subcommittee has completed mappings for the following USMARC 
character sets:
        *	Basic Latin (ASCII) and Extended Latin (ANSEL) except for one 
                character (the Right cedilla which is used in the 
                transliteration of Thai);
        *	Greek Symbols (the Greek lowercase letters Alpha, Beta and 
                Gamma);
        *	Subscript Characters; and
        *	Superscript Characters.
The agreed-upon mappings are listed in Appendix 1.
For the most part the mappings were straightforward and non-controversial. 
A few engendered discussion, and some recommendations were not unanimous. 
Those mappings are listed here along with a summary of the discussion:
        A3	D with crossbar uppercase
        to
        0110	Latin capital D with stroke
The USMARC Latin character A3 (Uppercase D with crossbar) is used to 
encode both Croatian and Vietnamese letters, transliterated Macedonian 
and Serbian, and is also considered to be the uppercase form of the Eth. 
The Unicode standard includes three "crossed D" characters. 
Because the Eth is generally regarded as a lowercase letter, the 
Subcommittee chose to map A3 to U+0110, on the basis of the most common 
usage (Croatian, Vietnamese, etc.).
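
The effect of this choice on case pairing can be examined with a short 
Python sketch. The report does not list the three "crossed D" characters; 
the sketch assumes they are U+00D0, U+0110 and U+0189, and uses the 
character data shipped with Python, which is later than Unicode 1.1. The 
names CROSSED_D and describe are hypothetical.

```python
import unicodedata

# Assumption: these are the three "crossed D" capitals meant in the text.
CROSSED_D = ["\u00D0", "\u0110", "\u0189"]


def describe(ch):
    """Return (code, name, lowercase code, lowercase name) for ch."""
    lower = ch.lower()
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        f"U+{ord(lower):04X}",
        unicodedata.name(lower),
    )


for ch in CROSSED_D:
    print(describe(ch))
```

Under this data, U+0110 lowercases to U+0111, the value Appendix 1 
assigns to USMARC B3 (lowercase D with crossbar), while U+00D0 lowercases 
to U+00F0, the value assigned to USMARC BA (lowercase eth).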
        AA	Subscript patent mark
        to
        00AE	Registered trademark sign
It was felt that the loss of subscriptedness (U+00AE is not a subscripted 
character) was not crucial for this character.
        EB	Ligature, first half
        to
        FE20	Combining ligature, left half
        EC	Ligature, second half
        to
        FE21	Combining ligature, right half
        FA	Double tilde, first half
        to
        FE22	Combining double tilde, left half
        FB	Double tilde, second half
        to
        FE23	Combining double tilde, right half
There were two possible mappings for these four characters: to a single 
character (which extends over two letters) or to a pair of characters 
corresponding to the "halves". Mapping to the "halves" was chosen.
        F7	Left hook with tail
        to
        0326	Combining comma below
This character is used in Latvian, Romanian, and Polish. The issue was 
whether the mapping should be based on the appearance of the character or 
on its function. A majority of the Subcommittee accepted a mapping based 
on function, supported by a reference to the use of a comma-like 
descender in Romanian typography. Other members felt that the graphic 
appearance was more important.
        F8	Right Cedilla
        to
        ?
This is still being investigated with assistance from Thailand. It is 
used only in Thai romanization.
The Subcommittee recommended mapping the three Greek letters in USMARC to 
the corresponding Greek script characters in Unicode rather than try to 
retain the "latinness" of those characters by some other mapping (e.g. to 
values in the Private Use Area).
A Proposal on the mapping outlined in Appendix 1 will be brought to MARBI 
at Midwinter 1996.
Work on Basic and Extended Cyrillic, Hebrew, and Basic and Extended 
Arabic is continuing and will be followed by work on the East Asian 
Character Code (EACC).
For Cyrillic, Hebrew and Arabic USMARC characters, the Subcommittee 
plans to address mapping issues in three phases:
        1.	Mapping of Cyrillic, Hebrew and Arabic letters and Arabic 
                (traditional "Hindi") digits, all of which are 
                non-controversial;
        2.	ASCII "clones" in each character set (punctuation and digits in 
                Cyrillic and Hebrew, punctuation in Arabic);
        3.	Other items:
                a.	Hebrew Holam, which serves in USMARC as both the vowel 
                        point holam and the sin dot. In Unicode, the holam 
                        and the sin dot are separate characters.
                b.	Several Arabic letters which are in the USMARC Extended 
                        Arabic character set but not in the Unicode standard.
The items in (1) should be straightforward. The items in (2) and (3a) 
will require research by the Subcommittee during the Fall of 1995. The 
Arabic letters in (3b) should be proposed as additions to the Unicode 
Standard and to ISO/IEC 10646. Documentation to support their addition 
needs to be gathered.
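
The difficulty in (3a) can be sketched in Python: when two Unicode values 
map to one USMARC code, the reverse direction is ambiguous and round-trip 
mapping fails without further rules. In this sketch HOLAM_CODE is a 
placeholder, not the actual USMARC Hebrew code value, and the function 
name ambiguous_sources is hypothetical.

```python
from collections import defaultdict

# Placeholder only: the real USMARC code value is not given in this report.
HOLAM_CODE = "USMARC-HOLAM"

UNICODE_TO_USMARC = {
    0x05B9: HOLAM_CODE,  # HEBREW POINT HOLAM
    0x05C2: HOLAM_CODE,  # HEBREW POINT SIN DOT
}


def ambiguous_sources(table):
    """Return each USMARC code that more than one Unicode value maps to."""
    by_target = defaultdict(list)
    for ucs, marc in table.items():
        by_target[marc].append(ucs)
    return {
        marc: sorted(values)
        for marc, values in by_target.items()
        if len(values) > 1
    }


for marc, values in ambiguous_sources(UNICODE_TO_USMARC).items():
    print(marc, [f"U+{v:04X}" for v in values])
```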
Glossary and Conventions:
UCS = Universal Character Set (the proper title of International 
Standard ISO/IEC 10646).
U+nnnn = An individual Unicode value, where nnnn is a four-digit number 
expressed in hexadecimal notation.
Private Use Area = Unicode values in the range U+E000 through U+F8FF. 
Codes in this range are for the use of software developers and end users 
who need a special set of characters for their applications. The code 
points in this area do not have defined, interpretable semantics except 
by private agreement.
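
The two conventions above can be expressed as a short Python sketch. The 
range limits are those given in the glossary; the function names 
is_private_use, u_plus and parse_u_plus are hypothetical and serve only 
this illustration.

```python
import string

# Private Use Area limits as stated in the glossary.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def is_private_use(value):
    """True if the Unicode value lies in U+E000 through U+F8FF."""
    return PUA_FIRST <= value <= PUA_LAST


def u_plus(value):
    """Format a value in U+nnnn notation (four hexadecimal digits)."""
    return f"U+{value:04X}"


def parse_u_plus(text):
    """Parse U+nnnn notation back into an integer value."""
    digits = text[2:]
    if (
        not text.startswith("U+")
        or len(digits) != 4
        or any(c not in string.hexdigits for c in digits)
    ):
        raise ValueError(f"not in U+nnnn form: {text!r}")
    return int(digits, 16)


print(u_plus(0x0110), is_private_use(parse_u_plus("U+E000")))
```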
Appendix 1
========================================================================
        Author: Joan Aliprand
        Revised: 9/12/92
        Revised: 12/17/93
        Revised: 5/26/94
        Revised: 6/25/95
        Revised: 6/29/95
                                        
 
 
Mapping of USMARC Characters to Unicode/UCS Values
 
Sources:
 
USMARC sources:
        USMARC Specifications for Record Structure, Character Sets, and 
        Exchange Media. 1994 edition.  Washington, D.C., Library of Congress, 
        1994.
        
        MARBI Proposal No. 93-10, as approved in February 1994.
 
Unicode sources:
        The Unicode Standard, Version 1.0.  Vol. 1, 1991.
        The Unicode Standard, Version 1.1.  Prepublication edition.  1993.
 
The Unicode Standard, Version 1.1 and the Basic Multilingual Plane (BMP) 
of International Standard ISO/IEC 10646-1:1993 are identical in character
repertoire and code-point assignment.  The Unicode standard is a profile 
of UCS-2, the two-octet form of the Universal Character Set.
 
Previous versions of this mapping used this UCS source: 
ISO DIS 10646-1.2.
Both USMARC and Unicode/UCS names should properly be in uppercase 
letters. Upper and lowercase have been used in the following table for 
ease of reading. Any amendments to UCS names after publication of 
ISO/IEC 10646-1 have not been included.
 
ASCII (BASIC LATIN) AND ANSEL (EXTENDED LATIN) CHARACTER SETS
 
USMARC	Character			Unicode/UCS	Character
Code	Name				Code		Name
 
 1B	ESCAPE				001B  ESCAPE
 1D	RECORD TERMINATOR		001D  GROUP SEPARATOR
 1E	FIELD TERMINATOR		001E  RECORD SEPARATOR
 1F	SUBFIELD DELIMITER		001F  UNIT SEPARATOR
 
 20	SPACE (BLANK)			0020  SPACE
 21	EXCLAMATION MARK		0021  EXCLAMATION MARK
 22	QUOTATION MARK			0022  QUOTATION MARK
 23	NUMBER SIGN			0023  NUMBER SIGN
 24	DOLLAR SIGN			0024  DOLLAR SIGN
 25	PERCENT SIGN			0025  PERCENT SIGN
 26	AMPERSAND			0026  AMPERSAND
 27	APOSTROPHE			0027  APOSTROPHE
 28	OPENING PARENTHESIS		0028  LEFT PARENTHESIS
 29	CLOSING PARENTHESIS		0029  RIGHT PARENTHESIS
 2A	ASTERISK			002A  ASTERISK
 2B	PLUS SIGN			002B  PLUS SIGN
 2C	COMMA				002C  COMMA
 2D	HYPHEN-MINUS			002D  HYPHEN-MINUS
 2E	PERIOD (DECIMAL POINT)		002E  FULL STOP
 2F	SLASH 				002F  SOLIDUS
 
 30	DIGIT ZERO			0030  DIGIT ZERO
   THROUGH                                      THROUGH
 39	DIGIT NINE			0039  DIGIT NINE
 3A	COLON				003A  COLON
 3B	SEMICOLON			003B  SEMICOLON
 3C	LESS-THAN SIGN			003C  LESS-THAN SIGN
        (OPENING ANGLE BRACKET)
 3D	EQUALS SIGN			003D  EQUALS SIGN
 3E	GREATER-THAN SIGN		003E  GREATER-THAN SIGN
        (CLOSING ANGLE BRACKET)
 3F	QUESTION MARK			003F  QUESTION MARK
 40	COMMERCIAL AT			0040  COMMERCIAL AT
 41	CAPITAL A			0041  LATIN CAPITAL A
   THROUGH                                      THROUGH
 5A	CAPITAL Z			005A  LATIN CAPITAL Z
 5B	OPENING SQUARE BRACKET		005B  LEFT SQUARE BRACKET
 5C	REVERSE SLASH 			005C  REVERSE SOLIDUS
 5D	CLOSING SQUARE BRACKET 		005D  RIGHT SQUARE BRACKET
 5E	SPACING CIRCUMFLEX		005E  CIRCUMFLEX ACCENT
 5F	SPACING UNDERSCORE		005F  LOW LINE
 60	SPACING GRAVE			0060  GRAVE ACCENT 
 61	SMALL A				0061  LATIN SMALL A
   THROUGH                                      THROUGH
 7A	SMALL Z 			007A  LATIN SMALL Z
 7B	OPENING CURLY BRACKET		007B  LEFT CURLY BRACKET
 7C	VERTICAL BAR (FILL)		007C  VERTICAL LINE
 7D	CLOSING CURLY BRACKET		007D  RIGHT CURLY BRACKET
 7E	SPACING TILDE			007E  TILDE
 
 A1	UPPERCASE POLISH L 		0141  LATIN CAPITAL LETTER L WITH STROKE
 A2	UPPERCASE SCANDINAVIAN O 	00D8  LATIN CAPITAL LETTER O WITH STROKE
 A3	UPPERCASE D WITH CROSSBAR 	0110  LATIN CAPITAL LETTER D WITH STROKE
 A4	UPPERCASE ICELANDIC THORN	00DE  LATIN CAPITAL LETTER THORN 
                                              (Icelandic)
 A5	UPPERCASE DIGRAPH AE		00C6  LATIN CAPITAL LIGATURE AE
 A6	UPPERCASE DIGRAPH OE		0152  LATIN CAPITAL LIGATURE OE
 A7	SOFT SIGN (PRIME) 		02B9  MODIFIER LETTER PRIME
 A8	DOT IN MIDDLE OF LINE		00B7  MIDDLE DOT
 A9	MUSICAL FLAT 			266D  MUSIC FLAT SIGN
 AA	SUBSCRIPT PATENT MARK		00AE  REGISTERED SIGN
 AB	PLUS OR MINUS			00B1  PLUS-MINUS SIGN
 AC	UPPERCASE O-HOOK		01A0  LATIN CAPITAL LETTER O WITH HORN
 AD	UPPERCASE U-HOOK		01AF  LATIN CAPITAL LETTER U WITH HORN
 AE	ALIF				02BE  MODIFIER LETTER RIGHT HALF RING
 
 B0	AYN				02BF  MODIFIER LETTER LEFT HALF RING
 B1	LOWERCASE POLISH L		0142  LATIN SMALL LETTER L WITH STROKE
 B2	LOWERCASE SCANDINAVIAN O	00F8  LATIN SMALL LETTER O WITH STROKE
 B3	LOWERCASE D WITH CROSSBAR	0111  LATIN SMALL LETTER D WITH STROKE
 B4	LOWERCASE ICELANDIC THORN	00FE  LATIN SMALL LETTER THORN 
                                              (Icelandic)
 B5	LOWERCASE DIGRAPH AE		00E6  LATIN SMALL LIGATURE AE
 B6	LOWERCASE DIGRAPH OE		0153  LATIN SMALL LIGATURE OE
 B7	HARD SIGN (DOUBLE PRIME)	02BA  MODIFIER LETTER DOUBLE PRIME
 B8	LOWERCASE TURKISH I		0131  LATIN SMALL LETTER DOTLESS I
 B9	BRITISH POUND			00A3  POUND SIGN
 BA	LOWERCASE ETH			00F0  LATIN SMALL LETTER ETH (Icelandic)
 BC	LOWERCASE O-HOOK		01A1  LATIN SMALL LETTER O WITH HORN
 BD	LOWERCASE U-HOOK		01B0  LATIN SMALL LETTER U WITH HORN
 
 C0	DEGREE SIGN			00B0  DEGREE SIGN
 C1	LOWERCASE SCRIPT L		2113  SCRIPT SMALL L
 C2	PHONO COPYRIGHT MARK		2117  SOUND RECORDING COPYRIGHT
 C3	COPYRIGHT MARK			00A9  COPYRIGHT SIGN 
 C4	SHARP				266F  MUSIC SHARP SIGN
 C5	INVERTED QUESTION MARK		00BF  INVERTED QUESTION MARK
 C6	INVERTED EXCLAMATION MARK	00A1  INVERTED EXCLAMATION MARK
 
 E0	PSEUDO QUESTION MARK		0309  COMBINING HOOK ABOVE
 E1	GRAVE				0300  COMBINING GRAVE ACCENT (Varia)
 E2	ACUTE				0301  COMBINING ACUTE ACCENT (Oxia)
 E3	CIRCUMFLEX			0302  COMBINING CIRCUMFLEX ACCENT
 E4	TILDE				0303  COMBINING TILDE
 E5	MACRON				0304  COMBINING MACRON
 E6	BREVE				0306  COMBINING BREVE (Vrachy)
 E7	SUPERIOR DOT			0307  COMBINING DOT ABOVE
 E8	UMLAUT (DIAERESIS)		0308  COMBINING DIAERESIS (Dialytika)
 E9	HACEK				030C  COMBINING CARON 
 EA	CIRCLE ABOVE (ANGSTROM)		030A  COMBINING RING ABOVE
 EB	LIGATURE, FIRST HALF		FE20  COMBINING LIGATURE LEFT HALF
 EC	LIGATURE, SECOND HALF		FE21  COMBINING LIGATURE RIGHT HALF
 ED	HIGH COMMA, OFF CENTER		0315  COMBINING COMMA ABOVE RIGHT
 EE	DOUBLE ACUTE			030B  COMBINING DOUBLE ACUTE ACCENT
 EF	CANDRABINDU			0310  COMBINING CANDRABINDU
 
 F0	CEDILLA				0327  COMBINING CEDILLA
 F1	RIGHT HOOK (OGONEK)		0328  COMBINING OGONEK
 F2	DOT BELOW			0323  COMBINING DOT BELOW
 F3	DOUBLE DOT BELOW		0324  COMBINING DIAERESIS BELOW
 F4	CIRCLE BELOW			0325  COMBINING RING BELOW
 F5	DOUBLE UNDERSCORE 		0333  COMBINING DOUBLE LOW LINE
 F6	UNDERSCORE			0332  COMBINING LOW LINE
 F7	LEFT HOOK (COMMA BELOW)		0326  COMBINING COMMA BELOW
 F8	RIGHT CEDILLA			(No recommendation yet)
 F9	UPADHMANIYA			032E  COMBINING BREVE BELOW
 FA	DOUBLE TILDE, FIRST HALF 	FE22  COMBINING DOUBLE TILDE LEFT HALF
 FB	DOUBLE TILDE, SECOND HALF	FE23  COMBINING DOUBLE TILDE RIGHT HALF
 
 FE	HIGH COMMA, CENTERED		0313  COMBINING COMMA ABOVE (Psili)
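
The codes E0 through FE in the table above are non-spacing marks. The 
following Python sketch shows one way a converter might apply them. It 
relies on background not stated in this report: in USMARC data a 
non-spacing mark precedes the letter it modifies, whereas in Unicode a 
combining mark follows its base letter. The table is an excerpt, and the 
names COMBINING and ansel_to_unicode are hypothetical.

```python
# Excerpt of non-spacing marks from the table above
# (USMARC code -> Unicode combining character).
COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE8: "\u0308",  # umlaut (diaeresis)
    0xEB: "\uFE20",  # ligature, first half
    0xEC: "\uFE21",  # ligature, second half
    0xF0: "\u0327",  # cedilla
}


def ansel_to_unicode(data):
    """Convert bytes to a string, moving each mark after its base letter.

    Assumption: marks come before the base letter in the input.
    Only ASCII bytes and the marks listed in COMBINING are handled.
    """
    out = []
    pending = []  # marks read but not yet attached to a base letter
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"code {byte:02X} is not in this excerpt")
    if pending:
        raise ValueError("non-spacing mark with no base letter")
    return "".join(out)


# "ts" joined by a ligature: EB before "t", EC before "s".
result = ansel_to_unicode(bytes([0xEB, 0x74, 0xEC, 0x73]))
print([f"U+{ord(ch):04X}" for ch in result])
```

The output keeps the base letter plus non-spacing marks, in line with 
working principle 3; no precomposed letters are produced.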
 
 
GREEK LETTERS
 
USMARC	Character			Unicode/UCS	Character
Code	Name				Code		Name
 
 61	ALPHA				03B1  GREEK SMALL LETTER ALPHA
 62	BETA 				03B2  GREEK SMALL LETTER BETA
 63	GAMMA				03B3  GREEK SMALL LETTER GAMMA
 
 
SUBSCRIPTS
 
USMARC	Character			Unicode/UCS	Character
Code	Name				Code		Name
 
 28	SUBSCRIPT OPENING PARENTHESIS	208D  SUBSCRIPT LEFT PARENTHESIS
 29	SUBSCRIPT CLOSING PARENTHESIS	208E  SUBSCRIPT RIGHT PARENTHESIS
 2B	SUBSCRIPT PLUS			208A  SUBSCRIPT PLUS SIGN
 2D	SUBSCRIPT MINUS 		208B  SUBSCRIPT HYPHEN-MINUS
 30	SUBSCRIPT 0 			2080  SUBSCRIPT 0
 31	SUBSCRIPT 1			2081  SUBSCRIPT 1
 32	SUBSCRIPT 2			2082  SUBSCRIPT 2
 33	SUBSCRIPT 3			2083  SUBSCRIPT 3
 34	SUBSCRIPT 4			2084  SUBSCRIPT 4
 35	SUBSCRIPT 5			2085  SUBSCRIPT 5
 36	SUBSCRIPT 6			2086  SUBSCRIPT 6
 37	SUBSCRIPT 7			2087  SUBSCRIPT 7
 38	SUBSCRIPT 8			2088  SUBSCRIPT 8
 39	SUBSCRIPT 9			2089  SUBSCRIPT 9
 
 
SUPERSCRIPTS
 
USMARC	Character			Unicode/UCS	Character
Code	Name				Code		Name
 
 28	SUPERSCRIPT OPENING PARENTHESIS	207D	SUPERSCRIPT LEFT PARENTHESIS
 29	SUPERSCRIPT CLOSING PARENTHESIS	207E	SUPERSCRIPT RIGHT PARENTHESIS
 2B	SUPERSCRIPT PLUS		207A	SUPERSCRIPT PLUS SIGN
 2D	SUPERSCRIPT MINUS		207B	SUPERSCRIPT HYPHEN-MINUS
 30	SUPERSCRIPT 0			2070	SUPERSCRIPT 0
 31	SUPERSCRIPT 1			00B9	SUPERSCRIPT 1
 32	SUPERSCRIPT 2			00B2	SUPERSCRIPT 2
 33	SUPERSCRIPT 3			00B3	SUPERSCRIPT 3
 34	SUPERSCRIPT 4			2074	SUPERSCRIPT 4
 35	SUPERSCRIPT 5			2075	SUPERSCRIPT 5
 36	SUPERSCRIPT 6			2076	SUPERSCRIPT 6
 37	SUPERSCRIPT 7			2077	SUPERSCRIPT 7
 38	SUPERSCRIPT 8			2078	SUPERSCRIPT 8
 39	SUPERSCRIPT 9			2079	SUPERSCRIPT 9
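
The digit rows of the two tables above follow a pattern that can be 
sketched in Python: subscript digits are contiguous from 2080, while 
superscripts 1, 2 and 3 fall outside the 2070 block. The function names 
are hypothetical, and the check uses the character data shipped with 
Python rather than Unicode 1.1 itself.

```python
import unicodedata

# Superscripts 1-3 are not in the 2070 block (see the table above).
SUPERSCRIPT_EXCEPTIONS = {1: 0x00B9, 2: 0x00B2, 3: 0x00B3}


def _digit_value(code):
    """Digit value of a USMARC subscript/superscript code 30 through 39."""
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"code {code:02X} is not a digit code")
    return code - 0x30


def subscript_digit(code):
    """Unicode character for a USMARC subscript digit code."""
    return chr(0x2080 + _digit_value(code))


def superscript_digit(code):
    """Unicode character for a USMARC superscript digit code."""
    d = _digit_value(code)
    return chr(SUPERSCRIPT_EXCEPTIONS.get(d, 0x2070 + d))


# Cross-check each result against the digit value Python records for it.
for code in range(0x30, 0x3A):
    assert unicodedata.digit(subscript_digit(code)) == code - 0x30
    assert unicodedata.digit(superscript_digit(code)) == code - 0x30
```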