L2/01-423

2001-10-31

Suggested reference name and annotation updates

 

Kent Karlsson

 

 

Reference names for C0 and C1 control characters in UCD

 

The Fifth edition of ECMA-48 (1991), a.k.a. ISO/IEC 6429:1992 (Third edition), changed several of the names for the C0 and C1 “control” characters so as to internationalise them somewhat.  In particular, references to horizontal, vertical, up, and down were changed to refer to character, line, backward, and forward respectively.  Further, the file, group, record, and unit separators appear to have been generalized (I don’t have access to the fourth edition) to information separator four, three, two, and one respectively, so as not to always imply a hierarchy, or at least not the particular hierarchy of files, groups, records, and units.

 

The Fifth edition of ECMA-48 is the only edition of ECMA-48 now available (online), and the Third edition of 6429 is the only edition now available (from ISO), and the name changes referred to above even predate the first edition of 10646-1. Therefore, the current ECMA-48/6429 names are the names that should be used in the UnicodeData.txt data file and other UCD data files.

 

Annotations for C0 and C1 control characters in NamesList.txt

 

The Fourth edition names for C0 and C1 control characters, which are now used in UnicodeData.txt, should be preserved as alias names in NamesList.txt.
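
As an illustration of the two sets of names, the following minimal Python sketch pairs each renamed control character with its current ECMA-48 name and its older name, both copied from the excerpt in this document. The table name `CONTROL_NAMES` and the helper `describe` are hypothetical names introduced only for this example. It relies on the standard `unicodedata` module having no character name for control characters.

```python
import unicodedata

# (current ECMA-48 name, older name) per code point, copied from the
# suggested NamesList.txt excerpt in this document. Illustrative only.
CONTROL_NAMES = {
    0x0009: ("CHARACTER TABULATION", "HORIZONTAL TABULATION"),
    0x000B: ("LINE TABULATION", "VERTICAL TABULATION"),
    0x001C: ("INFORMATION SEPARATOR FOUR", "FILE SEPARATOR"),
    0x001D: ("INFORMATION SEPARATOR THREE", "GROUP SEPARATOR"),
    0x001E: ("INFORMATION SEPARATOR TWO", "RECORD SEPARATOR"),
    0x001F: ("INFORMATION SEPARATOR ONE", "UNIT SEPARATOR"),
    0x008B: ("PARTIAL LINE FORWARD", "PARTIAL LINE DOWN"),
    0x008C: ("PARTIAL LINE BACKWARD", "PARTIAL LINE UP"),
}


def describe(cp):
    """Return (formal name, current reference name, older alias) for cp."""
    # Control characters have no character name, so the default is returned.
    formal = unicodedata.name(chr(cp), "<control>")
    current, older = CONTROL_NAMES.get(cp, (None, None))
    return formal, current, older


if __name__ == "__main__":
    for cp in sorted(CONTROL_NAMES):
        print(f"{cp:04X}", *describe(cp), sep="  ")
```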

 

Annotations for some other characters in NamesList.txt

 

Some new cross references and short explanations for some other characters are also included below, in particular in relation to UAX 13, soft hyphen, the scan line characters (outdated in themselves, but a new addition to Unicode), and the new word joiner character.

 

Suggested NamesList.txt excerpt

 

The new parts are marked with bold red below.

 

 

 

0009       <control>

           = CHARACTER TABULATION

           = HORIZONTAL TABULATION (HT)

           * the name was changed in 1991 to a more international name (lines may be vertical)

 

000A       <control>

           = LINE FEED (LF)

           = new line (nl), end of line

           * see UAX 13

           x (carriage return - 000D)

           x (next line - 0085)

           x (line separator - 2028)

           x (paragraph separator - 2029)

 

000B       <control>

           = LINE TABULATION

           = VERTICAL TABULATION (VT)

           * the name was changed in 1991 to a more international name (lines may be vertical)

           * see UAX 13

           x (line separator - 2028)

 

000C       <control>

           = FORM FEED (FF)

           = next page, end of page

           * see UAX 13

           x (line separator - 2028)

 

000D       <control>

           = CARRIAGE RETURN (CR)

           * see UAX 13

           x (line feed - 000A)

           x (next line - 0085)

           x (line separator - 2028)

           x (paragraph separator - 2029)

 

001A       <control>

           = SUBSTITUTE

           * used in the place of a character that has been found to be invalid or in error

           * intended to be introduced by automatic means

           x (replacement character - FFFD)

 

001C       <control>

           = INFORMATION SEPARATOR FOUR

           = FILE SEPARATOR

 

001D       <control>

           = INFORMATION SEPARATOR THREE

           = GROUP SEPARATOR

 

001E       <control>

           = INFORMATION SEPARATOR TWO

           = RECORD SEPARATOR

 

001F       <control>

           = INFORMATION SEPARATOR ONE

           = UNIT SEPARATOR
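
The information separators above can be used to delimit nested data. The following is a minimal Python sketch that uses only INFORMATION SEPARATOR ONE between fields and INFORMATION SEPARATOR TWO between records, which matches the older unit/record reading of the names. The function names `encode_records` and `decode_records` are hypothetical, and this two-level layout is just one possible convention, not something the names themselves require.

```python
IS1 = "\x1f"  # INFORMATION SEPARATOR ONE (formerly UNIT SEPARATOR)
IS2 = "\x1e"  # INFORMATION SEPARATOR TWO (formerly RECORD SEPARATOR)


def encode_records(records):
    """Join fields with IS1 and records with IS2 (assumed convention)."""
    return IS2.join(IS1.join(fields) for fields in records)


def decode_records(data):
    """Inverse of encode_records for non-empty input."""
    return [record.split(IS1) for record in data.split(IS2)]


if __name__ == "__main__":
    rows = [["Kent", "Karlsson"], ["L2", "01-423"]]
    packed = encode_records(rows)
    print(repr(packed))
    print(decode_records(packed))
```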

 

 

 

 

0020       SPACE

           * sometimes considered a control code

           * other space characters: 2000-200A

           x (no-break space - 00A0)

           x (zero width space - 200B)

           x (ideographic space - 3000)

           x (zero width no-break space - FEFF)

           x (word joiner - 2060)

 

 

 

 

0082       <control>

           = BREAK PERMITTED HERE

           * used to indicate a point where a line break may occur when text is formatted

           * zero width (no stretch)

           x (zero width space - 200B)

           x (soft hyphen - 00AD)

           x (mongolian todo soft hyphen - 1806)

 

0083       <control>

           = NO BREAK HERE

           * used to indicate a point where a line break shall not occur when text is formatted

           x (zero width no-break space - FEFF)

           x (word joiner - 2060)

 

0085       <control>

           = NEXT LINE (NEL)

           * see UAX 13

           x (line feed - 000A)

           x (carriage return - 000D)

           x (line separator - 2028)

           x (paragraph separator - 2029)

 

008B       <control>

           = PARTIAL LINE FORWARD

           = PARTIAL LINE DOWN

           * the name was changed in 1991 to a more international name (lines may be vertical)

 

008C       <control>

           = PARTIAL LINE BACKWARD

           = PARTIAL LINE UP

           * the name was changed in 1991 to a more international name (lines may be vertical)
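
The "see UAX 13" annotations above concern characters that can end a line. As a rough Python sketch, the splitter below treats every character annotated that way in this excerpt (000A, 000B, 000C, 000D, 0085, 2028, 2029) as a plain line break, with CR LF counted as a single break. Treating all of them alike is a simplifying assumption made for this example, and the names `LINE_BREAK` and `split_lines` are hypothetical.

```python
import re

# CR LF is listed first so that the pair is consumed as one break.
LINE_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at each of the line-ending characters listed above."""
    return LINE_BREAK.split(text)


if __name__ == "__main__":
    sample = "one\r\ntwo\x0bthree\x85four\u2028five\u2029six"
    print(split_lines(sample))
```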

 

 

 

 

 

 

 

00A0       NO-BREAK SPACE

           x (space - 0020)

           x (figure space - 2007)

           x (narrow no-break space - 202F)

           x (zero width no-break space - FEFF)

           x (word joiner - 2060)

           # <noBreak> 0020

 

00AD       SOFT HYPHEN

           = discretionary hyphen

           * zero width, unless there is an (automatic or explicit) line break after it, in which case it is imaged as a hyphen

           * when zero width, a soft hyphen may suppress the display of the following character in some cases for some languages (e.g. webb<SHY>bläddrare displays as webbläddrare, and remiss<SHY>svar as remissvar)

           x (mongolian todo soft hyphen - 1806)

           x (hyphen - 2010)

           x (non-breaking hyphen - 2011)
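
The soft hyphen behaviour described above can be sketched in Python as follows. The rule used for suppressing the following character (drop it when it repeats a doubled letter immediately before the soft hyphen) is a hypothetical simplification chosen only to reproduce the two examples given in the annotation; real language-specific rules may differ. The function names are likewise assumptions made for this example.

```python
SHY = "\u00ad"  # SOFT HYPHEN


def render_unbroken(text):
    """Display form when no line break occurs at any soft hyphen."""
    out = []
    skip_next = False
    for i, ch in enumerate(text):
        if skip_next:
            skip_next = False
            continue
        if ch == SHY:
            # Hypothetical rule: "bb" + SHY + "b" displays as "bb".
            if 2 <= i < len(text) - 1 and text[i - 2] == text[i - 1] == text[i + 1]:
                skip_next = True
            continue  # the soft hyphen itself is zero width
        out.append(ch)
    return "".join(out)


def render_broken(text, shy_index):
    """Display form when the line breaks at the soft hyphen at shy_index."""
    if text[shy_index] != SHY:
        raise ValueError("shy_index must point at a soft hyphen")
    head = render_unbroken(text[:shy_index]) + "-"  # imaged as a hyphen
    tail = render_unbroken(text[shy_index + 1:])
    return head, tail


if __name__ == "__main__":
    word = "webb" + SHY + "bläddrare"
    print(render_unbroken(word))
    print(render_broken(word, 4))
```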

 

 

 

 

00B7       MIDDLE DOT

           = midpoint (in typography)

           = Georgian comma

           = Greek middle dot

           x (greek ano teleia - 0387)

           x (bullet - 2022)

           x (one dot leader - 2024)

           x (hyphenation point - 2027)

           x (bullet operator - 2219)

           x (dot operator - 22C5)

           x (katakana middle dot - 30FB)

 

 

 

 

2010       HYPHEN

           x (hyphen-minus - 002D)

           x (soft hyphen - 00AD)

 

2011       NON-BREAKING HYPHEN

           x (hyphen-minus - 002D)

           x (soft hyphen - 00AD)

           # <noBreak> 2010

 

 

 

 

 

 

2028       LINE SEPARATOR

           * may be used to represent this semantic unambiguously

           * see UAX 13

 

2029       PARAGRAPH SEPARATOR

           * may be used to represent this semantic unambiguously

           * see UAX 13

 

 

 

2060       WORD JOINER

           * does not join multiple words, but joins inside words

           * unambiguous replacement for FEFF ZERO WIDTH NO-BREAK SPACE

           x (zero width no-break space - FEFF)

 

 

 

 

23BA       HORIZONTAL SCAN LINE-1

           * the scan line numbers here refer to old low-resolution technology for terminals, with only 9 scan lines per fixed-size (ASCII) character glyph

 

23BB       HORIZONTAL SCAN LINE-3

 

23BC       HORIZONTAL SCAN LINE-7

 

23BD       HORIZONTAL SCAN LINE-9

 

 

 

 

FEFF       ZERO WIDTH NO-BREAK SPACE

           = BYTE ORDER MARK (BOM)

           * may be used to detect byte order by contrast with FFFE which is not a character

           x (<not a character> - FFFE)

           x (zero width space - 200B)

           x (word joiner - 2060)

           x (no break here - 0083)
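
The byte order detection mentioned in the FEFF annotation can be sketched in Python as below. The sketch assumes the input is known to be UTF-16, so only the two leading bytes are inspected; the function name `detect_utf16_byte_order` is hypothetical. Because FFFE is not a character, seeing the bytes in that order indicates the opposite byte order rather than a different character.

```python
def detect_utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from a leading FEFF, if present."""
    prefix = data[:2]
    if prefix == b"\xfe\xff":
        return "big-endian"     # bytes read as FEFF in big-endian order
    if prefix == b"\xff\xfe":
        return "little-endian"  # reading big-endian would give FFFE
    return None                 # no byte order mark at the start


if __name__ == "__main__":
    print(detect_utf16_byte_order("\ufeffabc".encode("utf-16-be")))
    print(detect_utf16_byte_order("\ufeffabc".encode("utf-16-le")))
```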