CEN/TC304 N858

Subject: Fallback 1st draft

Date: 2 November 1998

Source: CEN/TC304 Fallback PT

Status: A first draft to be presented to TC304 meetings in Brussels on Tuesday 24 November and in plenary meetings. Also to be distributed on paper to members of CEN/TC304.

Action: To be discussed in Brussels. Comments are invited.

EUROPEAN PRESTANDARD First WORKING DRAFT

PRÉNORME EUROPÉENNE prEN _____

EUROPÄISCHE VORNORM

ICS: 35.040

Descriptors: Data processing, information interchange, text processing, text communication, graphic characters, character sets, representation of characters, coded character sets, conversion, fallback

English version

Information Technology -

Character repertoire and coding transformations:

European fallback rules

Technologies de l'information -

Character repertoire and coding transformations:

European fallback rules - N° 1

Informationstechnologie -

Character repertoire and coding transformations:

European conversion and fallback rules - Nr. 1

NOTE: THIS DRAFT IS THE FIRST WORKING DRAFT. IT IS SUBMITTED TO MEMBERS OF TC304 TO COMMENT. AFTER ADOPTION OF COMMENTS THE DOCUMENT WILL BE SENT TO FORMAL VOTE FOR AN EN.

This draft European Standard is submitted to CEN members for Formal Vote. It has been drawn up by the Technical Committee CEN/TC 304.

If this draft becomes a European Standard, CEN members are bound to comply with the CEN/CENELEC Internal Regulations, which stipulate the conditions for giving this European Standard status of a national standard without any alteration.

This draft European Standard was established by CEN in one official version (English). A version in another language made by translation under the responsibility of a CEN member into its own language and notified to the Central Secretariat has the same status as the official version.

CEN members are the national bodies of Austria, Belgium, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, and United Kingdom.

CEN

European Committee for Standardisation

Comité Européen de Normalisation

Europäisches Komitee für Normung

Central Secretariat: rue de Stassart 36, B-1050 Brussels

© CEN 1999 Copyright reserved to all CEN members

Ref.No. prEN _____:1999 E

Contents

Foreword

0. Introduction

0.1 Rationale for the standard provision of fallback rules

0.2 Requirements

0.3 Satisfying the requirements

1. Scope and field of application

2. Normative references

3. Definitions and abbreviations

4. Conformance

5. Fallback rules specification

ANNEX I (Normative) The list of fallback specification per character

ANNEX II (Informative) Bibliography

 

 

Foreword

 

 

This European standard was prepared by CEN/TC304 European Localisation Requirements. It was approved in the month of _____ 1999. No organisation other than CEN took part in the preparation of this standard.

This standard does not cancel or replace another standard.

There is no known identical national standard in Europe.

According to the CEN/CENELEC Common Rules the following countries are bound to announce the existence of this European standard: Austria, Belgium, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, and United Kingdom.

 

0. Introduction

 

0.1 Rationale for the standard provision of fallback rules

Although computers can process larger repertoires of graphic characters than ever before, there are cases where not all the characters of a processing repertoire can be rendered on an output device, because they are not all available in the output repertoire. To cater for these situations, a widely applicable standard method of character substitution (fallback) is required, which allows an approximate rendition of the unsupported characters of the processing repertoire to be output.

There are key applications, such as search engines, which make use of "fuzzy" search techniques. These rely on search terms in which diacritical marks are missing and in which common substitutions are made for less frequently used letters of the Latin alphabet, such as eth (ð), thorn (Þ), æ, ž and the German sharp s (ß). A standard set of substitutions would be useful for such applications. The same applies to the other two scripts used in Europe: Cyrillic and Greek.

Users who are trying to write text in a language which is not their mother language often wish to write that text using a character repertoire which does not contain all the letters needed for that language, especially those with diacritic marks. A standard method of character substitution would be useful for such users.

It is believed that individual European countries will wish to promulgate local standards for fallback. A European standard fallback specification can be used as a default in all relevant situations. It can also be used as the basis for national standards, with local preferences being used for specific substitutions defined by a particular nation. It is expected that national standards for fallback will be registered in the international cultural registry as part of the national locales.

 

0.2 Requirements

It is desirable that a standard fallback specification have a very large field of application so that it may be used across a wide range of platforms. To achieve this, a fallback specification is needed for a processing repertoire which is a superset of a large proportion of existing processing repertoires. Also, a fallback repertoire is needed which is a subset of a large proportion of existing output repertoires.

The characters contained in the collection Multilingual European Subset No.1 (MES-1), specified in ISO/IEC 10646, are widely used across Europe. In many cases, MES-1 will become the processing repertoire of choice. Because MES-1 is a large repertoire, a fallback specification for it will be needed.

 

0.3 Satisfying the requirements

MES-1 has been designed to be a superset of a wide range of processing repertoires for commercial and administrative applications in office environments across the EEA.

It is a valid assumption that the minimum output repertoire implemented in computer systems is the invariant repertoire of ISO/IEC 646. Therefore a substitution with one or more characters from the invariant repertoire of ISO/IEC 646 can always be rendered on an output device. The invariant repertoire of ISO/IEC 646 contains neither letters with diacritic marks nor the additional letters of national alphabets used in Europe.

This standard satisfies the requirements by providing a fallback specification which can represent each character of MES-1 with one or more characters of the invariant repertoire of ISO/IEC 646.

 

1. Scope and field of application

 

1.1 Scope

This standard specifies the representation of the characters of the collection Multilingual European Subset No.1 (MES-1) with the characters from the repertoire of the invariant part of ISO/IEC 646 (83 graphic characters). Where a character of MES-1 is not available in the invariant set of ISO/IEC 646, a fallback representation for rendering is specified.
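As an illustration of the target repertoire, the following Python sketch builds the invariant set. It assumes that the set consists of code positions 2/0 (SPACE) to 7/14 of ISO/IEC 646, minus the twelve positions that may differ between national versions; under that assumption the count is the 83 characters mentioned above. The names VARIANT_POSITIONS, INVARIANT_REPERTOIRE and invalid_entries are hypothetical and are not defined by this standard.

```python
# Assumption: these are the characters that the IRV places in the twelve
# code positions of ISO/IEC 646 that national versions may redefine.
VARIANT_POSITIONS = set("#$@[\\]^`{|}~")

# Positions 0x20 (SPACE) to 0x7E, minus the variant positions.
INVARIANT_REPERTOIRE = frozenset(chr(c) for c in range(0x20, 0x7F)) - VARIANT_POSITIONS


def invalid_entries(table):
    """Return the entries of a fallback table whose replacement is empty
    or uses a character outside the invariant repertoire."""
    return {
        src: dst
        for src, dst in table.items()
        if not dst or any(ch not in INVARIANT_REPERTOIRE for ch in dst)
    }
```

A check of this kind could be run over a candidate fallback table before it is used: an entry such as "£" mapped to "#" would be reported, because "#" occupies one of the variant positions.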

 

1.2 Field of application

This standard specifies a fallback conversion function which operates in the User Transformation Layer of the model for conversion defined in EN nnnn-1.

 

2. Normative references

2.1 ISO/IEC 10646-1:1993 Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane.

2.2 ISO/IEC 10646-1:1993/Cor.1:1996 Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane, Technical Corrigendum 1.

2.3 ISO/IEC 646:1991 Information technology - ISO 7-bit coded character set for information interchange.

 

3. Definitions and abbreviations

 

3.1 Basic definitions

For the purposes of this standard the basic definitions of ISO/IEC 10646-1 section 4 apply. The following are reproduced here for ease of reference:

3.1.1 character: A member of a set of elements used for organisation, control, or representation of data.

3.1.2 coded character: A character together with its coded representation.

3.1.3 coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation.

3.1.4 repertoire: A specified set of characters that are represented in a coded character set.

3.1.5 presentation; to present: The process of writing, printing, or displaying a graphic symbol.

3.1.6 graphic symbol: The visual representation of a graphic character or of a composite sequence.

3.1.7 graphic character: A character, other than a control function, that has a visual representation, normally hand-written, printed, or displayed.

 

3.2 Other definitions

Also, the following definitions apply:

3.2.1 diacritical mark: Any of a number of recurring graphical structures placed over, under, next to, or through a basic letter which does not significantly modify the shape of the basic letter itself and which in combination with that basic letter is a graphic character of the identified repertoire of MES-1.

3.2.2 letter with diacritical mark: A letter which is constructed as the combination of a basic letter and a diacritical mark.

3.2.3 basic letter: A letter of the repertoire of the IRV of ISO/IEC 646.

3.2.4 character stream: A series of coded characters in sequence, sometimes referred to as a stream.

3.2.5 character classification: The classification of the characters of the repertoire into letters, digits and special characters.

3.2.6 processing repertoire: The repertoire used by an application for processing character based information.

3.2.7 output repertoire: The repertoire used by a computer system for external representation of character based information.

3.3 Abbreviations

The following abbreviations apply:

 

3.3.1 MES-1: Multilingual European Subset No.1

 

4. Conformance

A claim of conformance to this standard shall imply that, during a fallback operation, the characters of MES-1 specified in Table 1, column I are represented by the characters specified in Table 1, column III.

 

5. Fallback rules specification

 

5.1 Basic concepts

This standard specifies how a source stream of coded characters from a processing repertoire is represented in a target stream of an output repertoire. The worst case that is covered is where the processing repertoire is MES-1 and the output repertoire is the invariant repertoire of ISO/IEC 646. The coding of the processing repertoire and the coding of the output repertoire are outside the scope of this standard.

Characters in the source stream which occur in the output repertoire are transferred directly to the target stream without substitution. Characters in the source stream which do not occur in the output repertoire are subject to substitution.
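The pass-through and substitution behaviour described above can be sketched in Python as follows. This is a minimal illustration, not the normative procedure: the function name fallback_stream and the three-entry SAMPLE_TABLE are hypothetical, and the entry for ß is only an assumed example of a commonly used fallback. The output repertoire is passed in as a parameter because its coding is outside the scope of this standard.

```python
import string

# Assumed listing of the invariant repertoire of ISO/IEC 646:
# letters, digits, SPACE and the punctuation common to all versions.
INVARIANT = set(string.ascii_letters + string.digits + " !\"%&'()*+,-./:;<=>?_")

# Hypothetical excerpt of a fallback table (illustrative values only).
SAMPLE_TABLE = {"é": "e", "Æ": "AE", "ß": "ss"}


def fallback_stream(source, table, output_repertoire):
    """Copy characters that occur in the output repertoire unchanged and
    substitute all others using the fallback table."""
    target = []
    for ch in source:
        if ch in output_repertoire:
            target.append(ch)          # transferred directly
        else:
            target.append(table[ch])   # KeyError if the table has no entry
    return "".join(target)
```

With these sample values, fallback_stream("café", SAMPLE_TABLE, INVARIANT) gives "cafe". Raising an error for a character without a table entry is a design choice of this sketch; a complete table for the processing repertoire would make that case impossible.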

There are two types of substitution. In the first type, the target characters are represented in a way that prevents the target stream from being transformed back to the source stream, because information is lost. A very common example is the presentation of Latin small letter e with acute (é) as Latin small letter e (e). The second type introduces special symbols that preserve the information about the original graphic symbol, so that the character stream can be transformed back to the original encoding. An example is the use of SGML entity references (e.g. &eacute; in the case above). This second type of substitution is outside the scope of this standard.

Substitution with loss of information can take several forms, but two main classes are always recognised as basic:

- one to one: one graphic character of the source stream is substituted with one graphic character of the output repertoire in the target stream. This type of presentation is required in applications where the number of characters in the data entries or fields (e.g. in databases or application forms) is fixed. It is anticipated that this class will have only minority application.

- one to many: one graphic character of the source stream is substituted with more than one graphic character from the output repertoire in the target stream. An example of this type of presentation is Latin capital letter Æ presented as AE. This class is recommended for general use.

This standard provides two mappings for substitution: a mapping which is strictly one to one, and a mapping which has some one to many substitutions. In arriving at the one to many substitutions, each candidate character is treated on its merits and according to common practice.
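The difference between the two mappings can be illustrated with a short Python sketch. The tables below are hypothetical excerpts with assumed values, not the normative tables of Annex I; the one to one entry for Æ in particular is an assumption made for illustration. The point shown is that a one to one mapping preserves the length of a fixed-width field, while a one to many mapping may lengthen it.

```python
# Hypothetical excerpts; the values are illustrative assumptions.
ONE_TO_ONE = {"é": "e", "Æ": "A"}
ONE_TO_MANY = {"é": "e", "Æ": "AE"}


def substitute(text, table):
    """Replace every character that has a table entry; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


field = "Ægir é"
fixed = substitute(field, ONE_TO_ONE)     # same number of characters
general = substitute(field, ONE_TO_MANY)  # may be longer than the source
```

Here fixed has the same length as field, so it still fits a fixed-width data entry, whereas general is one character longer.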

5.2 Specification of the general fallback rules

The following general fallback rules are followed to arrive at the fallback tables.

Every Latin letter with a diacritical mark is presented as the basic letter, for both one to one and one to many substitutions. An example of this type of presentation is Latin small letter e with acute (é) presented as Latin small letter e (e).
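One way an implementation might approximate this rule, shown here only as a sketch, is canonical decomposition: the text is normalised to form NFD and the combining marks are then dropped. This technique is not taken from this standard, which specifies the fallbacks per character in a table, and the function name strip_diacritics is hypothetical. Letters without a canonical decomposition, such as ø or Æ, are left unchanged by this approach and would still need table entries.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose each character (NFD) and drop the combining marks,
    leaving the basic letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, strip_diacritics("Dvořák") yields "Dvorak", while strip_diacritics("ø") returns "ø" unchanged.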

Latin ligatures and digraphs (e.g. the Latin letter Œ) are substituted with one letter in the one to one case (for Œ this would be O) and with two letters in the one to many case (for Œ this would be OE). Here, the one to one case is not believed to be very useful.

There are cases where a letter does not have a diacritic mark, but is not in the invariant repertoire of ISO/IEC 646. Examples are thorn (þ) and the German sharp s (ß). In these cases, commonly used fallbacks will be used. Again, the one to one substitution may not be very useful.

All special graphic characters from MES-1 that are not in the invariant repertoire of ISO/IEC 646 are substituted with the character SPACE in the one to one case. In the one to many case, they are treated on a case by case basis. For instance, ® is substituted with (R).

The letters of Cyrillic and Greek are substituted with Latin characters from the repertoire of ISO/IEC 646 IRV according to the national transliteration specifications where available; otherwise, known schemes are applied.

 

ANNEX I (Normative) The list of fallback specification per character

Table I GENERAL FALLBACK TABLE - LETTERS

MES-1 CHARACTER NAME FALLBACK PRESENTATION

 

Table II ONE TO MANY FALLBACKS FOR LIGATURES AND DIGRAPHS

MES-1 CHARACTER NAME FALLBACK PRESENTATION

Table III ONE TO MANY FALLBACKS FOR SPECIAL CHARACTERS

MES-1 CHARACTER NAME FALLBACK PRESENTATION

 

ANNEX II (Informative) Bibliography

1. ISO 8879:1986 Information processing - Text and office systems - Standard Generalized Markup Language (SGML)