Character Repertoire and Coding Transformations - Part 1: General model

CEN/TC304 N888

Title: General model for graphic character transformations

Status: Second draft for an EN, for approval of the CEN/TC304 plenary in Tübingen 21-23 April 1999-03-08.

Action: Comments on this draft are invited. NSBs that wish to submit formal comments in time for the open meeting, so that they may be incorporated in the document presented to the TC plenary, are asked to send them to the TC304 secretariat by email or fax ([email protected] +354 520 7171) before 12 April 1999.

Note: Enclosed please find the second draft of the Model EN Character Repertoire and Coding Transformations - Part 1: General model for graphic character transformations. It has been revised according to comments received in TC304 meetings in Brussels, November 1998. Comments on this draft are invited so that a final version may be forwarded for approval of the TC at the Tübingen plenary meeting.

The Project Team on the Model EN plans to present this draft and any written comment received in a meeting on 21 April 1999 in Tübingen Germany. Based on the results of that meeting, the PT plans to present a final document to the TC plenary on 23 April. Based on the level of consensus, a TC resolution will be proposed for a CEN enquiry to be launched. The CEN enquiry will be preceded by a translation into French and German. After a six-month CEN enquiry, comments can be incorporated before a final CEN ballot is issued.

Character Repertoire and Coding Transformations - Part 1: General model for graphic character transformations

Scope

General principles

This European Standard specifies a general model of the conceptual stages involved in the interchange of data composed of graphic characters between two end users. It identifies those aspects of this communication process that are amenable to further standardization and it provides terminology that permits such standards to specify their roles within this model. It is not intended as a guide to the implementation of such standards as in many cases the conceptual stages do not correspond to the practical stages involved in an efficient implementation.

The general model addresses both situations in which the intention is the interchange of data without alteration and situations in which transformation of the data during interchange is either required or acceptable. Examples of the latter situation are required transliteration and acceptable fallback.

The general model covers both transformations that affect the character content of the data and those that affect its coded representation. It addresses in particular the issues that arise when some system involved in the communication process is unable to handle all the characters that the end users wish to convey. The general model is not concerned with the meaning of the character data that is being communicated, such as its language, or with its rendition attributes such as font, size and weight. The general model is also applicable when the interchange takes place within a single system with the primary aim of data transformation, such as transliteration or language translation.

Character environments

A character is an abstract concept which is represented in different ways in different environments. To identify the character corresponding to some particular representation, it is necessary to know the environment concerned. The glyph, or symbol, "A" is no more the character with name LATIN CAPITAL LETTER A than is the hexadecimal code value "41". The former represents this character in the Latin script and the latter does so in the registered character set ISO-IR 6 (ASCII), but they equally represent GREEK CAPITAL LETTER ALPHA in the Greek script and ISO-IR 150 respectively, and CYRILLIC CAPITAL LETTER A in the Cyrillic script and ISO-IR 146 respectively.
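The dependence of a representation on its environment can be sketched in Python as a lookup keyed by environment. This is a minimal illustration only: the table holds just the three assignments named in the paragraph above, and the names ENVIRONMENTS and identify are assumptions of this example, not part of any standard.

```python
# Illustrative sketch: an environment is modelled as a correspondence
# between representations (here, code values) and character names.
# Only the three assignments mentioned in the text are included.
ENVIRONMENTS = {
    "ISO-IR 6": {0x41: "LATIN CAPITAL LETTER A"},
    "ISO-IR 150": {0x41: "GREEK CAPITAL LETTER ALPHA"},
    "ISO-IR 146": {0x41: "CYRILLIC CAPITAL LETTER A"},
}


def identify(code_value: int, environment: str) -> str:
    """Return the name of the character that a code value represents
    in the given environment."""
    return ENVIRONMENTS[environment][code_value]


# The same code value identifies a different character in each environment.
names = {env: identify(0x41, env) for env in ENVIRONMENTS}
```

The code value alone is not enough to name the character; the environment has to be supplied as a second argument.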

The general model for graphic character transformations concerns the processes that are involved in the interchange of graphic character data between two users by means of a transport process that provides a transparent transfer of binary data. The model may be applied to transformation within a single system by treating the binary transport process as an internal interface. The model identifies the following hierarchy of environments as being involved in the communication process:

User environment;

Application environment;

Interchange environment.

In a user environment, characters are normally presented as glyphs. In both application and interchange environments, characters are represented by bit combinations according to some encoding scheme. In an interchange of character data, in general both the user environment and the application environment of the receiving system will differ from the corresponding environments of the sending system, but both systems will generally have the same interchange environment. The underlying transport process is then used to transfer between the two systems those bit combinations that represent characters in their common interchange environment. If the interchange environments differ, transparent transfer of binary data will not result in transparent transfer of character data between the two environments. Other elements within the end systems concerned may, however, be able to compensate for the distortion of the character data so produced.
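The effect of differing interchange environments can be sketched in Python. The choice of ISO/IEC 8859-2 for the sender and ISO/IEC 8859-1 for the receiver is an assumption made for illustration; the transport is modelled simply as passing the bytes through unchanged.

```python
# Illustrative sketch: transparent transfer of bytes is not transparent
# transfer of characters when the interchange environments differ.
text = "vzrušující"

# Sending system encodes in its interchange environment (assumed 8859-2).
sent = text.encode("iso8859_2")

# Transparent binary transport: the bytes arrive unchanged.
received = bytes(sent)

# Receiver sharing the sender's interchange environment.
same_env = received.decode("iso8859_2")

# Receiver with a different interchange environment (assumed 8859-1).
other_env = received.decode("iso8859_1")
```

In this sketch same_env equals the original string, while other_env has the same length but a different character in the position of the letter s with caron.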

Code characteristics

The difference between application and interchange environments lies in the encoding schemes that are used. In an application environment it is potentially possible to use every available bit combination to represent a graphic character, so that the SPACE character and 255 other graphic characters can all be encoded in an 8-bit code. As an example, this potential is in fact realized in proprietary PC code pages. In such an environment, formatting and other control information is recorded separately from the graphic character data so that there is no need to reserve any code positions for control data. In an interchange environment, however, it is normally necessary to encode graphic characters and control characters together in a single binary stream. This requirement leads to the use in interchange environments of coded character sets in which certain code positions are reserved for control characters. The 8-bit codes used in such an environment generally follow the character code structure specified in ISO/IEC 2022, which reserves the hexadecimal code positions 00-1F and 80-9F for control characters with 20 being allocated to the SPACE character and 7F to the DELETE character. This leaves only 190 code positions available for graphic characters other than SPACE. The various parts of ISO/IEC 8859 specify coded graphic character sets with this structure, all of which have a common assignment of the "left hand" part, i.e. of the code positions 20-7E, in accordance with ISO-IR 6 (ASCII).
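The count of 190 follows directly from the reserved positions listed above. The short Python sketch below reproduces the arithmetic; the variable names are illustrative only.

```python
# Illustrative arithmetic for the 8-bit code structure described above.
C0_CONTROLS = set(range(0x00, 0x20))   # positions 00-1F
C1_CONTROLS = set(range(0x80, 0xA0))   # positions 80-9F
SPACE, DELETE = 0x20, 0x7F

all_positions = set(range(256))

# Positions left for graphic characters other than SPACE.
graphic = all_positions - C0_CONTROLS - C1_CONTROLS - {SPACE, DELETE}

left_part = {p for p in graphic if p < 0x80}    # 21-7E
right_part = {p for p in graphic if p >= 0xA0}  # A0-FF
```

Here graphic contains 190 positions: 94 in the left hand part and 96 in the right hand part.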

The application environment, however, normally has a requirement for fixed-length codes, i.e. coded character sets in which every character is represented by the same number of bits. Such codes simplify random access of stored data since the location of each coded character within a sequentially stored sequence is independent of the previous characters in the sequence. The interchange environment has no such requirement, so permitting characters with diacritical marks, for example, to be coded by the addition of further bits to the bit pattern that represents the base character. In this way the coded character set specified in ISO/IEC 6937 encodes 333 graphic characters including SPACE, yet it uses the 8-bit code values 20-7E for the same characters as does ISO/IEC 8859.
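The random-access property of fixed-length codes can be sketched in Python. The sketch uses UTF-32 as a fixed-length code and UTF-8 as a variable-length code; both are stand-ins chosen for this example and are not the codes named in the paragraph above.

```python
# Illustrative sketch: locating the n-th character in a fixed-length code
# versus a variable-length code. UTF-32 and UTF-8 are stand-ins.
text = "aéz"

fixed = text.encode("utf-32-be")      # every character occupies 4 bytes
variable = text.encode("utf-8")       # "é" occupies 2 bytes, the others 1


def nth_fixed(data: bytes, n: int, width: int = 4) -> str:
    """Fixed-length code: the n-th character sits at byte offset n * width,
    independent of the preceding characters."""
    return data[n * width:(n + 1) * width].decode("utf-32-be")


# Variable-length code: each offset depends on all preceding characters,
# so the offsets have to be accumulated by scanning from the start.
offsets = []
position = 0
for ch in text:
    offsets.append(position)
    position += len(ch.encode("utf-8"))
```

In the fixed-length case the offset is a simple multiplication; in the variable-length case the offsets here are 0, 1 and 3, which can only be found by walking the preceding data.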

Modelling the environments

The differences in the natures of the user, application and interchange environments lead to differences in the character repertoires that they are capable of representing. This in turn leads to difficulties when character data is passed sequentially from one environment to another in the course of its transmission from one end user to another. The above descriptions of the different environments are, however, purely illustrative. The general model specified in this European Standard recognises the existence of these environments and the differences in their repertoires but the detailed features of the environments that lead to these repertoire differences are outside the scope of the standard. In particular, the character encoding used in the application environment is outside the scope of the standard; it is only the repertoire of this environment that enters the general model. The character encoding used in the interchange environment, however, is within the scope of the model since it is this encoding that provides the transformation from characters in the interchange environment to the binary data transferred by the transport process.

Definitions

For the purposes of this European standard, the following definitions apply. Unless otherwise specified, where a definition is followed by reference to an International Standard, it has been taken verbatim from that standard.

application environment: A system environment in which characters are represented by bit combinations for the purposes of an application process.

application process: An element within a real system which performs the information processing for a particular application.

NOTE: This differs from the definition in ISO/IEC 7498-1:1994, in which "real system" is strengthened to "real open system", since the real system here need not comply with the requirements of OSI standards in its communication with other real systems.

bit combination: An ordered set of bits used for the representation of characters. [ISO/IEC 2022:1994]

byte: A bit string that is operated upon as a unit. (Note: Each bit has the value either ZERO or ONE.) [ISO/IEC 2022:1994]

CC-data-element (Coded-Character-Data-Element): An element of interchanged information that is specified to consist of a sequence of coded representations of characters, in accordance with one or more identified standards for coded character sets. [ISO/IEC 2022:1994, ISO/IEC 10646-1:1993]

character: A member of a set of elements used for the organization, control, or representation of data. [ISO/IEC 2022:1994, ISO/IEC 10646-1:1993]

character string: A sequence of characters selected from a specified repertoire.

character transformation: A process which maps character strings of some source repertoire into character strings of a target repertoire. A character transformation yields a unique string of characters in the target repertoire from every string of characters in the source repertoire. It need not, however, act on the source string on a character-by-character basis and it need not preserve the number of characters in the string. It may, but need not, be reversible.

NOTE: Both transcription and transliteration are character transformations in this sense, as would be any deterministic scheme of language translation.

coded character: A character together with its coded representation. [ISO/IEC 10646-1:1993]

coded character set; code: A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations. [ISO/IEC 2022:1994]

combining character: A member of an identified subset of a coded character set, intended for combination with the preceding or following graphic character, or with a sequence of combining characters preceded or followed by a non-combining character. [ISO/IEC 2022:1994]

composite sequence: A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters. [ISO/IEC 10646-1:1993]

device: A component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. (It may be an input/output device in the conventional sense, or a process such as an application program or gateway function.) [ISO/IEC 10646-1:1993]

environment: A correspondence between the characters of a specified repertoire and a set of objects used for the presentation or representation of those characters; examples of possible objects are glyphs, bit combinations and Braille patterns.

glyph: A recognisable abstract graphic symbol which is independent of any specific design. [ISO/IEC 9541-1:1991]

graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed. [ISO/IEC 10646-1:1993]

graphic symbol: The visual representation of a graphic character or of a composite sequence. [ISO/IEC 10646-1:1993]

interchange environment: A system environment in which characters are represented by bit combinations for the purposes of interchange.

interchange: The transfer of character coded data from one user to another, using telecommunication means or interchangeable media. [ISO/IEC 10646-1:1993]

presentation; to present: The process of writing, printing or displaying a graphic symbol. [ISO/IEC 10646-1:1993]

real system: A set of one or more computers, the associated software, peripherals, terminals, human operators, physical processes, information transfer means, etc. that forms an autonomous whole capable of performing information processing and/or information transfer. [ISO/IEC 7498-1:1994]

repertoire: A specified set of characters that are each represented by one or more bit combinations of a coded character set. [ISO/IEC 2022:1994]

NOTE: The characters contained in a repertoire need not all be represented in the same coded character set. The code extension techniques of ISO/IEC 2022, in particular, permit the construction of a CC-data-element which makes use of coded representations selected from any number of coded character sets and which therefore can represent the characters of any repertoire.

script: A set of graphic characters used for the written form of one or more languages. [ISO/IEC 10646-1:1993]

system environment: An environment determined by the capabilities of the computers, software, human operators etc. that form part of a real system. The system environments of a particular real system describe the maximum capabilities of that system to handle characters for one or more of input, output and processing, as appropriate. Where the same environment is concerned with more than one of these aspects of a real system, it describes the common capabilities of all aspects with which it is concerned.

transcription: The process whereby the pronunciation of a given language is noted by the system of signs of a conversion language. A transcription system is of necessity based on the orthographical conventions of the conversion language. Transcription is not strictly reversible. [ISO 3602:1989]

transliteration: The process which consists of representing the characters of an alphabetical or syllabic writing system by the characters of a conversion alphabet. In principle, this conversion should be made character by character. [ISO 3602:1989]

NOTE: Transliteration is a reversible process.

user: A person or other entity that invokes the service provided by a device. (This entity may be a process such as an application program if the "device" is a code converter or a gateway function, for example.) [ISO/IEC 10646-1:1993]

user environment: A system environment in which characters are represented by, or presented as, objects capable of identification by a particular user; examples for such objects are glyphs when the user is a person or an optical character reader, and bit combinations when the user is an application process.

Abbreviations

For the purposes of this European Standard, the following abbreviations apply.

	ASCII	American Standard Code for Information Interchange
	HTML	Hypertext Markup Language
	OSI	Open Systems Interconnection
	SGML	Standard Generalized Markup Language
	URL	Uniform Resource Locator
	UCS	Universal Multiple-Octet Coded Character Set

Layer structure

The three layer stack

The general model specified by this European Standard represents each real system of the communication process as a three layer stack composed of

a User Transformation Layer;

an Application Transformation Layer;

an Interchange Transformation Layer.

This is illustrated in figure 1, which also shows the underlying binary transport service and the peer-level transformations described in clause 5 below. All layers are permitted to have an internal sub-layer structure which represents their overall transformation as the result of a sequence of separately specified transformations.

Figure 1 - Layer structure of the transformation model

Character data is conceptually passed through the stack of the sending system from top to bottom, emerging as a byte stream which is transported transparently to the receiving system, where it is passed through the receiving stack from bottom to top to emerge once again as character data.
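The passage of data through the two stacks can be sketched in Python as a chain of functions. This is a conceptual sketch only: the function names, the identity defaults and the use of UTF-8 for the Interchange Transformation Layer are assumptions of the example, and the binary transport is reduced to a function that returns its input unchanged.

```python
from typing import Callable

Transform = Callable[[str], str]


def identity(s: str) -> str:
    """Trivial transformation: the string passes through unchanged."""
    return s


def send(text: str, user: Transform = identity,
         application: Transform = identity, codec: str = "utf-8") -> bytes:
    """Sending stack, top to bottom."""
    in_application = user(text)                    # User Transformation Layer
    in_interchange = application(in_application)   # Application Transformation Layer
    return in_interchange.encode(codec)            # Interchange Transformation Layer


def transport(data: bytes) -> bytes:
    """Transparent binary transport: bytes are delivered unchanged."""
    return bytes(data)


def receive(data: bytes, application: Transform = identity,
            user: Transform = identity, codec: str = "utf-8") -> str:
    """Receiving stack, bottom to top."""
    in_interchange = data.decode(codec)            # Interchange Transformation Layer
    in_application = application(in_interchange)   # Application Transformation Layer
    return user(in_application)                    # User Transformation Layer


delivered = receive(transport(send("Tübingen")))
```

With trivial transformations in the upper layers and a common codec at the bottom, the string delivered to the receiving user equals the one submitted by the sending user.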

A single real system normally possesses devices both for input and output and has a bidirectional communications system. Such a system is modelled with common application and interchange transformation layers for sending and receiving but with a separate user transformation layer for each device. The user transformation layers will share a common application environment, as illustrated in figure 2. In modelling a particular real system there may be ambiguity as to whether a sublayer representing some aspect of the system should be considered as part of the user or the application transformation layer. If the system is bidirectional, this can be resolved, with reference to figure 2, by considering whether the sublayer is invoked in both, or only one of, sending and receiving.

Figure 2 - Layer structure of a bidirectional system

Nature of binary transport

The processes involved in the binary transport service may involve compression and encryption, unreliable links and associated error recovery processes, and any other processes which are performed without regard to the interpretation of the bytes being transported. All such processes are outside the scope of this Standard, as are means of addressing within any network that may be involved. If any intermediate system involved in the transport process, however, performs operations that depend on the interpretation of the byte stream as characters then that intermediate system should be considered as a relay system composed conceptually of two linked end systems, each of which falls within the scope of the model. The link between these end systems is through a common Application Environment which passes characters transparently between the two Application Transformation Layers as illustrated in figure 3. The User Transformation Layers are absent in such a relay. The content dependent processing that occurs in the relay is then represented within the model through the sending and receiving stacks of the model operating according to different standards. In particular the interchange environments of the two systems comprising the relay will generally differ, in contrast to those of two systems communicating as in figure 1 where the interchange environments are generally identical.

Figure 3 - Relay system within the binary transport service

The overall effect of such a relay is the conversion of character data from one coded representation to another, and as such it may be referred to as a coding transformation. An implementation of such a transformation will normally involve a direct conversion of one binary data stream to another through the use of look-up tables, and as such it will not be broken down into the separate stages modelled in figure 3. This does not detract from the fact that the model represents the processes logically involved. In particular the character strings represented by the binary input and output streams may differ as a result of the input and output codes not having identical repertoires. The character transformation involved forms a distinct part of the model in figure 3 even though in an implementation it will not normally be separable from the code conversion process.
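The relationship between the staged model and a direct look-up table can be sketched in Python. The example assumes a relay from ISO/IEC 8859-2 to ISO/IEC 8859-1 and uses replacement by "?" as the character transformation for characters missing from the output repertoire; both choices, and the function names, are assumptions made for illustration.

```python
def staged_relay(data: bytes, src: str, dst: str) -> bytes:
    """Relay modelled as separate conceptual stages."""
    chars = data.decode(src)   # receiving side: bytes to characters
    # Common application environment: characters pass through unchanged.
    # Sending side: characters outside the output repertoire are replaced
    # by "?" (the character transformation), then encoded.
    return chars.encode(dst, errors="replace")


def build_table(src: str, dst: str) -> bytes:
    """Collapse the same stages into one 256-entry byte look-up table.
    Assumes both codes are single-byte and define all 256 positions."""
    return bytes(
        bytes([code]).decode(src).encode(dst, errors="replace")[0]
        for code in range(256)
    )


data = "vzrušující".encode("iso8859_2")
table = build_table("iso8859_2", "iso8859_1")

direct = data.translate(table)                          # implementation view
staged = staged_relay(data, "iso8859_2", "iso8859_1")   # model view
```

Both routes give the same output bytes. The table hides the character transformation (here the loss of the s with caron), but it is still logically present.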

User transformation layer

The highest layer is the User Transformation Layer. This layer models the real input and output devices, such as keyboards and display screens, that convert between the binary representation of characters and their presentation by means such as glyphs that are recognisable to a human user. It encompasses any changes that occur between the character string that the end user intends to send, or interprets as being received, and the corresponding character string as coded for the use of the application process. Such changes may have been made either by the end user or by the device and its associated software; if both occur then this layer may be modelled as two sublayers, the upper one representing the transformation performed by the end user and the lower one representing that performed by the device.

In the sending system this layer transforms character strings in the user environment into character strings in the application environment, and vice versa in the receiving system. A standardized specification of this layer should specify

a source repertoire;

a target repertoire;

whether the specification is applicable to a sending system, a receiving system, or both;

the mapping function of the transformation;

the field of application of the specification.

For a specification to be applied to a sending (respectively receiving) system its source repertoire must be a superset of the repertoire of the user (respectively application) environment of the system and its target repertoire must be a subset of the repertoire of the application (respectively user) environment of the system. The mapping function shall meet the requirements for being a character transformation. In the simplest situation, when the user is a person the mapping of the User Transformation Layer will either be performed by that person during the input or output process or by software associated with the device. A sub-layer structure is required to model both of these within a single system. The manner of operation of the user device is outside the scope of the model.

NOTE 1: The repertoire of the application environment consists of those characters which can be both handled by the user devices and processed by the application process. If the device can generate characters for input that cannot be handled by the application process then it may possess software that implements a user transformation layer with a target repertoire that can in its entirety be handled by the application process. Suppose for example that the character "é" (LATIN SMALL LETTER E WITH ACUTE) is in the repertoire of the user environment and is able to be generated by some keyboard device but the application process can only handle ISO-IR 6 (ASCII). The character is therefore not in the repertoire of the application environment. Software associated with the device could convert this character to, say, the sequence "e/" in accordance with a suitable transformation. The manner in which the user generates the "é", however, is outside the scope of the model. There may be a keystroke specifically for this character or a sequence of keystrokes such as <CONTROL + /> <E> may be used; the latter case is not one of a graphic character transformation, it is just part of the operating procedure of the device. If the keyboard device were not able to generate "é" but all else remained the same, the human keyboard operator could perform the same transformation and explicitly enter the sequence "e/". The end effect is the same in both cases but one transformation has been performed by software and the other by a person.
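The repertoire conditions for applying a layer specification, together with the example of NOTE 1, can be sketched in Python with repertoires modelled as sets of characters. The class LayerSpec and its method names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical model of a User Transformation Layer specification."""
    source: FrozenSet[str]            # source repertoire
    target: FrozenSet[str]            # target repertoire
    mapping: Callable[[str], str]     # mapping function

    def applicable_to_sender(self, user_rep: FrozenSet[str],
                             app_rep: FrozenSet[str]) -> bool:
        # Source must cover the user environment; target must fit
        # inside the application environment.
        return self.source >= user_rep and self.target <= app_rep

    def applicable_to_receiver(self, user_rep: FrozenSet[str],
                               app_rep: FrozenSet[str]) -> bool:
        # Roles of the two environments are exchanged on the receiving side.
        return self.source >= app_rep and self.target <= user_rep


# Application environment limited to the graphic characters of ASCII;
# user environment additionally contains "é".
ascii_rep = frozenset(chr(c) for c in range(0x20, 0x7F))
user_rep = ascii_rep | {"é"}

# Specification corresponding to NOTE 1: "é" becomes the sequence "e/".
spec = LayerSpec(source=user_rep, target=ascii_rep,
                 mapping=lambda s: s.replace("é", "e/"))

# A specification whose source repertoire lacks "é" cannot serve this sender.
too_narrow = LayerSpec(source=ascii_rep, target=ascii_rep, mapping=lambda s: s)

converted = spec.mapping("café")
```

The mapping changes the length of the string, which the definition of a character transformation permits, and every character of the result lies in the target repertoire.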

A common situation is for a sending user to restrict usage to characters present in the application environment and for a receiving user to be prepared to accept all characters in that environment. In these cases the transformation in this layer is normally trivial, i.e. character strings are passed through unchanged. This layer addresses the problems that arise when the user wishes to send or receive characters not present in the application environment. There is an additional possibility on the receiving side, however, namely that the receiving user may not be prepared to accept all characters of the application environment, such as those of a script unknown to the user. This situation can also be addressed by this layer, such as by a sublayer that performs transcription. There is no equivalent sending situation since a user will not attempt to send unfamiliar characters. Bidirectional transcription can be modelled within the Application Transformation Layer.

NOTE 2: Where a transformation is needed and is to be performed on input by a person, there are benefits in it being simple and mnemonic if possible. As an example, suppose the repertoire of the user environment is Latin Alphabet No.2 (repertoire of ISO/IEC 8859-2) but that of the application environment is Latin Alphabet No.1 (repertoire of ISO/IEC 8859-1). In a suitable mnemonic scheme the Czech word "vzrušující" in the user environment could be input as "vzrus<ující" to the application environment, preserving the accented letters common to both alphabets and mnemonically transforming those that are not in common. Now suppose the repertoire of the application environment of the receiver is Latin Alphabet No.7 (repertoire of ISO/IEC 8859-13), which contains the character LATIN SMALL LETTER S WITH CARON that caused problems in the sender but which does not contain LATIN SMALL LETTER I WITH ACUTE. Then the word may finish up displayed as "vzrušuji/ci/" and interpreted by the user, in the knowledge of the mnemonic scheme, as "vzrušující", so completing a transparent transfer of the data from one user to the other.
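A mnemonic scheme of the kind described in NOTE 2 can be sketched in Python. The sketch tests membership of a repertoire by attempting to encode in the corresponding ISO/IEC 8859 part, and it derives the base letter through Unicode canonical decomposition; both are implementation shortcuts assumed for this example, and the two-entry MNEMONIC table covers only the marks that occur in the example word.

```python
import unicodedata

# Mnemonic signs for the two diacritical marks used in the example.
MNEMONIC = {
    "\u030c": "<",   # COMBINING CARON
    "\u0301": "/",   # COMBINING ACUTE ACCENT
}


def restrict(text: str, codec: str) -> str:
    """Keep characters that the target repertoire contains; rewrite the
    others as base letter followed by a mnemonic sign."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)          # repertoire membership test
            out.append(ch)
        except UnicodeEncodeError:
            base, *marks = unicodedata.normalize("NFD", ch)
            out.append(base + "".join(MNEMONIC[m] for m in marks))
    return "".join(out)


# Sender: application environment restricted to Latin Alphabet No.1.
sender_side = restrict("vzrušující", "iso8859_1")

# Receiver: application environment restricted to Latin Alphabet No.7.
receiver_side = restrict("vzrušující", "iso8859_13")
```

The two results correspond to the strings quoted in the note: the sender's form keeps the i with acute and rewrites the s with caron, while the receiver's form does the opposite.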

Application Transformation Layer

The middle layer is the Application Transformation Layer. In the sending system this transforms character strings in the application environment into character strings in the interchange environment, and vice versa in the receiving system. A standardized specification of this layer should specify

a source repertoire;

a target repertoire;

whether the specification is applicable to a sending system, a receiving system, or both;

the mapping function of the transformation;

the field of application of the specification.

For a specification to be applied to a sending (respectively receiving) system its source repertoire must be a superset of the repertoire of the application (respectively interchange) environment of the system and its target repertoire must be a subset of the repertoire of the interchange (respectively application) environment of the system. The mapping function shall meet the requirements for being a character transformation. This layer is intended to handle incompatibilities between the capabilities of the application process and the requirements for interchange, regardless of whether it is the application environment or the interchange environment that is the more restrictive. In a sending system with a restrictive application environment, this layer may for example perform the inverse of the transformation of the User Transformation Layer, so resulting in a transparent passage of characters from the user environment to the interchange environment within that system. This would be possible, for example, if transliteration or a reversible mnemonic scheme is used in the User Transformation Layer. In a sending system with a restrictive interchange environment the two communicating end systems may use the same reversible transformation, for example by means of the SGML entities of ISO 8879, so resulting in a transparent passage of characters from the application environment of the sender to that of the receiver. The means of negotiating a common transformation is outside the scope of this standard.

NOTE: In its interchange environment the only requirement for a real system is to be able to process the coded representations of characters. In its application environment, however, greater capabilities are normally required, such as the ability to display glyph images to the user. It is therefore reasonable to construct real systems that can handle characters, or even entire scripts, in their interchange environment that they cannot handle in their application environment.

Sub-layers of the User and Application Transformation Layers

The User and Application Transformation Layers may both be modelled with an internal structure of sub-layers, each with its own mapping function. Such sub-layers introduce intermediate environments whose repertoires have no external visibility. When such sub-layers are introduced, the mapping function for the layer will be the composition of the mapping functions for the various sub-layers. A standardized specification of either layer may define such sub-layers and their associated environments and may use them in specifying the mapping function for the layer in terms of a composition of mappings of the sub-layers.

NOTE 1: Since the repertoire of an intermediate environment has no external visibility, it permits auxiliary coded characters to be introduced that are not part of any standardized coded character set. This may be useful, for example, as an intermediate step in the specification of a transliteration or transcription mapping.

Alternatively two or more layer standards may be combined by an implementation by treating each standard as a sub-layer within the overall layer actually implemented.

NOTE 2: This permits an implementation to have an upper sub-layer of its Application Transformation Layer related to its User Transformation Layer and a lower sub-layer related to its Interchange Transformation Layer. In a sending system the upper sub-layer could be, for example, the inverse of a mnemonic transformation in the User Transformation Layer and the lower sub-layer could be a mapping of characters to the corresponding SGML entities of ISO 8879. To continue the example of note 2 of 4.3, the character string "vzrus<ující" in the application environment could therefore be transformed to "vzru&scaron;ující" in the interchange environment via an intermediate environment that reconstructs the original string "vzrušující". Although SGML entities are also mnemonic, this character string is less easily interpreted by a person than are those in note 2 of 4.3. This, however, is irrelevant in the interchange environment as it is not intended for human readability. The Application Transformation Layer of the receiver, aware both that it is receiving SGML entities and of the restriction of its application environment to Latin Alphabet No.7, could then transform this received string into "vzrušuji/ci/" for presentation to the user, again via an intermediate environment in which the original string is reconstructed.
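The sending side of this example can be sketched in Python as a composition of two sub-layer mappings. All function names are hypothetical; the one-entry ENTITY_NAMES table is an assumption covering only the character needed here, and the interchange repertoire is assumed to be that of ISO/IEC 8859-1.

```python
import re
import unicodedata

# Mnemonic signs and the combining marks they stand for.
COMBINING = {"<": "\u030c", "/": "\u0301"}   # caron, acute accent

# Hypothetical entity table: only the entry needed for the example.
ENTITY_NAMES = {"š": "scaron"}


def undo_mnemonic(text: str) -> str:
    """Upper sub-layer: rebuild accented letters from letter + sign.
    Limitation: a genuine "<" or "/" after a letter would also be rewritten."""
    return re.sub(
        r"([A-Za-z])([</])",
        lambda m: unicodedata.normalize("NFC", m.group(1) + COMBINING[m.group(2)]),
        text,
    )


def to_entities(text: str, codec: str = "iso8859_1") -> str:
    """Lower sub-layer: replace characters outside the interchange
    repertoire by entity references."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("&" + ENTITY_NAMES[ch] + ";")
    return "".join(out)


def compose(*sublayers):
    """Mapping function of a layer as the composition of its sub-layers."""
    def run(text: str) -> str:
        for sublayer in sublayers:
            text = sublayer(text)
        return text
    return run


application_layer = compose(undo_mnemonic, to_entities)

intermediate = undo_mnemonic("vzrus<ující")      # intermediate environment
interchange = application_layer("vzrus<ující")   # interchange environment
```

The intermediate string exists only between the two sub-layers; the layer as a whole maps the application-environment string directly to the interchange-environment string.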

Interchange Transformation Layer

The lowest layer is the Interchange Transformation Layer. This provides a reversible mapping between character strings and byte sequences. A standardized specification of this layer shall specify

a source repertoire;

the mapping function of the transformation;

the field of application of the specification.

For a specification to be applied to a sending (respectively receiving) system its source repertoire must be a superset (respectively subset) of the repertoire of the interchange environment of the system. The only requirement on the mapping function is that it shall provide a one-to-one mapping between character strings in the repertoire of the interchange environment and byte sequences for interchange, i.e. each character string has exactly one representation as a byte sequence and no two character strings yield the same byte sequence. The mapping need not operate on a character-by-character basis, so that the byte sequence need not be the concatenation of coded representations of individual characters. Although for meaningful communication both the sending and receiving systems should normally use the same mapping function, the means of negotiation to achieve this are outside the scope of this Standard.

NOTE 1: This specification permits state dependence in the encoding scheme, such as the use of locking shifts. In particular it permits the use of the 8-bit code structure of ISO/IEC 4873, which in turn makes use of the code extension techniques of ISO/IEC 2022.

This layer separates specification of the repertoire of the interchange environment from the coding system used for interchange. The Full-Latin-8 repertoire specified in EN 1923, for example, may be represented either in the 8-bit code structure of ISO/IEC 4873 or in the UCS multiple-octet code structure of ISO/IEC 10646-1. In the latter case it would also be necessary to specify which of the UCS coding schemes is to be used. Both the fixed-length UCS canonical forms UCS-2 (two octets) and UCS-4 (four octets) and the variable-length UCS transformation formats UTF-8 and UTF-16 satisfy the requirements of this Standard for the mapping function of the Interchange Transformation Layer.
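The reversibility requirement can be illustrated informally with a short sketch. The codec names are Python's, used here as assumptions standing in for the UCS coding schemes: `utf-16-be` coincides with UCS-2 for text that lies within the Basic Multilingual Plane, and `utf-32-be` is used in place of UCS-4.

```python
text = "\u00deorvar\u00f0ur"  # "Þorvarður"

# Python codec names standing in for the UCS coding schemes named above.
schemes = ["utf-8", "utf-16-be", "utf-32-be"]

encoded = {name: text.encode(name) for name in schemes}

for name, data in encoded.items():
    # Reversibility: decoding recovers exactly the original character string.
    assert data.decode(name) == text
    print(name, len(data), data.hex())
```

The same character string yields three different byte sequences, one per coding scheme, while the repertoire of the interchange environment is unchanged.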

NOTE 2: Even if the application and interchange environments use the same character coding scheme, such as UCS-2, they need not use the same repertoire. If the application process makes use of characters with diacritical marks, such as LATIN SMALL LETTER E WITH ACUTE, and the transport system requires the use of equivalent composite sequences, such as LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, this would be described in the general model of this Standard as a difference in repertoire between the application and interchange environments. The Application Transformation Layer would map such characters into the equivalent composite sequences and the latter would then be coded by the Interchange Transformation Layer.
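A minimal sketch of the situation in this note is given below. It assumes that Unicode canonical decomposition (NFD), as provided by Python's `unicodedata` module, is an acceptable stand-in for the mapping to composite sequences; this Standard does not prescribe that particular mapping.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Application Transformation Layer (sender): map to the composite sequence.
composite = unicodedata.normalize("NFD", precomposed)
names = [unicodedata.name(ch) for ch in composite]

# Interchange Transformation Layer: code the composite sequence, here as UCS-2
# (big-endian UTF-16 is identical for these characters).
octets = composite.encode("utf-16-be")

print(names)
print(octets.hex())
```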

Sub-layers of the Interchange Transformation Layer

The Interchange Transformation Layer may be trivial in an implementation, as characters will already be represented in the interchange environment in their coded form. The existence of this layer in the model represents the fact that the encoding scheme used in the interchange environment lies within the scope of the model and is subject to standardization, whereas that used in the application environment lies outside the scope of the model and does not need standardization in order for there to be effective interchange of character data.

This layer may, however, perform further processing on the coded representation of the character data before passing it to the binary transport process. If this further processing is specified directly as a transformation of one byte stream to another, as is the case with the UCS transformation formats UTF-8 and UTF-16, then it can be modelled simply as part of the specification of the mapping function of the Interchange Transformation Layer. If the specification of this further processing involves interpretation of the byte stream as characters, then it is best represented as a sub-layer structure by the integration into the Interchange Transformation Layer of a relay system modelled as in 4.2 above. An example of this situation, taken from actual practice, is given in A.2 of annex A below.

Peer-level transformations

The environment transformations

The general model is specified in terms of transformations which occur between two different environments of a single real system, or in the case of the lowest layer, between an environment and the interface to the binary transport process. But the motivation for specifying such transformations normally arises from considering their overall effect on character data passed between one environment in the sending system and the same environment in the receiving system. These transformations are said to be peer-level transformations and are known as User, Application or Interchange Environment Transformations as appropriate. They are composites of the transformations in both the sending and receiving systems of all layers below the environment concerned.

The peer-level transformations are denoted in figure 1 by horizontal arrows drawn between the environments involved. Conceptually one may follow these arrows as short-cuts, so instead of following the character data through all the layers one may, for example, take character data from the user environment of the sender successively downwards through the sending User Transformation Layer, across to the receiver by the Application Environment Transformation and then upwards through the receiving User Transformation Layer.

User Environment Transformation

The User Environment Transformation describes the relationship between the character data as perceived by the users of the two communicating end systems, regardless of what manipulations may be made to it within each system. It may, but need not, be the intention that character data is conveyed transparently between these users. Other desirable aims include transliteration, transcription, or even language translation. All of these are possible within the model. An automatic translator within the communications system can be modelled as a relay system in which the translation occurs within the Application Transformation Layer of one of the two end systems within the relay.

It is also possible for there to be acceptable, or even deliberate, loss of information in the User Environment Transformation. Acceptable loss may occur when a fallback transformation, such as “accent dropping”, is required due to limitations within the receiving system. Deliberate loss through such a fallback transformation may occur when data is being sent to an application, such as a search engine, with the intention of matching all character strings with the same fallback representation.
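An accent-dropping fallback and its use for matching might be sketched as follows. The function names are illustrative, and the approach (canonical decomposition followed by removal of combining marks) is only one possible way of realising such a fallback.

```python
import unicodedata


def drop_accents(text: str) -> str:
    """Fallback transformation: remove combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, candidate: str) -> bool:
    # Deliberate loss: compare fallback representations, ignoring case.
    return drop_accents(query).casefold() == drop_accents(candidate).casefold()


print(drop_accents("K\u00e1ri \u00d3lafsson"))
print(matches("Olafsson", "\u00d3lafsson"))
```

Letters that have no decomposition, such as LATIN SMALL LETTER ETH, pass through this sketch unchanged and would need a separate transliteration rule.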

Application Environment Transformation

The Application Environment Transformation describes the relationship between coded character data in the application environments of the two communicating end systems. This transformation takes account of the limitations of the real devices and application processes in the two systems.

This transformation is not, however, concerned with any limitations in the capabilities of the binary transport process since it is the role of the Application Transformation Layer to meet any problems caused by such limitations.

Interchange Environment Transformation

The Interchange Environment Transformation describes the relationship between coded character data in the interchange environments of the two communicating end systems. Meaningful communication normally requires this to be an identity transformation, i.e. characters are transferred unchanged between the two interchange environments. This does not necessarily require the two environments to use the same encoding scheme, but if they do not then the repertoires of these environments need to be restricted to sub-repertoires that are encoded in the same way.
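The restriction to a commonly encoded sub-repertoire can be illustrated with a small sketch. The choice of Latin-1 for the sender and UTF-8 for the receiver is an assumption made purely for the example.

```python
def interchange_environment_transformation(text, sender_codec, receiver_codec):
    """Encode in the sender, decode in the receiver; return the received string."""
    return text.encode(sender_codec).decode(receiver_codec, errors="replace")


ascii_only = "CEN/TC304"
with_accent = "K\u00e1ri"

# Different schemes, but the repertoire is restricted to the common ASCII part.
same = interchange_environment_transformation(ascii_only, "latin-1", "utf-8")

# Outside the common sub-repertoire the transformation is no longer the identity.
changed = interchange_environment_transformation(with_accent, "latin-1", "utf-8")

print(same == ascii_only)
print(changed == with_accent)
```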

NOTE: It is not an absolute requirement that the Interchange Environment Transformation is the identity, since if it is a known reversible transformation then a sub-layer in the Application Transformation Layer of the receiver could use the reverse transformation to recover the data originally sent. An example would be the use, in the Interchange Transformation Layers of the two systems, of different national variants of ISO/IEC 646. Provided that each system was aware of the variant being used by the other system and the application environment of each system included the characters concerned then the changes generated by the difference in variants could be reversed.

Annex A

(informative)

Examples

HTML

The general model described in this European Standard is applicable to text conveyed within an HTML document. The HTML markup notation itself is outside the scope of the model. Text within an HTML document makes use of symbolic SGML entities to represent characters that are not present in the repertoire of ASCII, and also to represent certain characters in ASCII that have special significance within the HTML markup syntax.

If I type into my word processor the sentence

When CEN/TC304 was created, Þorvarður Kári Ólafsson of the Icelandic standardization body STRÍ was appointed as Secretary.

and I save it as an HTML document, I can read the saved document as a text file to discover that it has been converted into the symbolic text

When CEN/TC304 was created, &THORN;orvar&eth;ur K&aacute;ri &Oacute;lafsson of the Icelandic standardization body STR&Iacute; was appointed as Secretary.

The original text is written in the repertoire of the application environment and is stored within my system in some encoding that is irrelevant to the conversion process. It has been converted in the Application Transformation Layer to a form that uses only the smaller repertoire of the interchange environment.

This converted text is stored within my system in the same encoding as is used for the original text. If it is accessed as an HTML document by a remote user of the World Wide Web, it will be encoded in ASCII by the Interchange Transformation Layer of my system to produce a byte stream. This will be passed by the Binary Transport System to the remote user and decoded as ASCII text by the remote Interchange Transformation Layer to recover the symbolic converted text given above. The remote Web browser provides the Application Transformation Layer that reconstitutes my original text before it is displayed to the remote user.
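The chain of transformations in this example might be sketched as follows. The sketch assumes the named-entity table in Python's `html.entities` module and the standard `html.unescape` function; the helper `to_symbolic` is hypothetical and is not a description of how any particular word processor or browser works.

```python
import html
from html.entities import codepoint2name


def to_symbolic(text: str) -> str:
    """Application Transformation Layer (sender): ASCII plus named entities."""
    out = []
    for ch in text:
        if ch in "&<>" or ord(ch) > 0x7F:
            name = codepoint2name.get(ord(ch))
            # Fall back to a numeric reference when no named entity exists.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)


original = "\u00deorvar\u00f0ur K\u00e1ri \u00d3lafsson, STR\u00cd"
symbolic = to_symbolic(original)

octets = symbolic.encode("ascii")    # sending Interchange Transformation Layer
received = octets.decode("ascii")    # receiving Interchange Transformation Layer
restored = html.unescape(received)   # receiving Application Transformation Layer

print(symbolic)
print(restored == original)
```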

In this example the User Transformation Layers of both end systems are considered to be transparent, in that I can enter the original text into my word processor and the system of the remote user can display that text to its user. It does not matter that I used a combination of keyboard input and character selection from a menu by means of a mouse, as the operation of the input devices is outside the scope of the model. If, however, I had a simpler system which did not have the capability of editing HTML documents, I could have entered the symbolic text with its SGML entities directly. In this case the same character transformation would take place but this time it would be in my User Transformation Layer and the Application Transformation Layer would be transparent.

RFC 2130 Architecture

The Internet document RFC 2130 provides an architectural model for the handling of character data that comprises three elements:

a) a Coded Character Set

A Coded Character Set (CCS) is defined here as a mapping from a set of abstract characters to a set of integers. The standards ISO/IEC 10646 and the parts of ISO/IEC 8859 are considered as being coded character sets in this sense.

b) a Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a CCS or several CCS's to a set of octets. The code extension techniques of ISO/IEC 2022 and the ISO/IEC 10646 transformation format UTF-8 are considered to be character encoding schemes in this sense.

c) a Transfer Encoding Syntax

A Transfer Encoding Syntax (TES) is a transformation applied to character data encoded using a CCS and possibly a CES to allow it to be transmitted. This may be absent, i.e. the identity transformation, but two non-identity transformations are specified in RFC 2045, namely quoted-printable and base64.

In the model of this European Standard the Coded Character Set element corresponds to the repertoire of the Interchange Environment and the Character Encoding Scheme corresponds to the Interchange Transformation Layer, or to an upper sub-layer of it if a TES is present. A non-identity Transfer Encoding Syntax corresponds to a relay system integrated into the Interchange Transformation Layer as described in 4.7 above. The first two of these identifications in terms of the model are straightforward but that for the TES perhaps needs further elaboration.

The specifications of both the quoted-printable and base64 TES's are given in terms of the transformation of an arbitrary octet string into a character string from the repertoire of ASCII. This character string is then encoded using ASCII for onward transmission. This corresponds directly to a relay system in which the transformation to the character string is performed by a receiving Interchange Transformation Layer that is specific to the TES concerned and the re-encoding using ASCII is performed by the corresponding layer of the sending system within the relay. The Application Transformation Layers within the relay pass characters through transparently. The operation of the receiving Interchange Transformation Layer in each of these TES's is in outline as follows, certain refinements and exceptions being omitted for simplicity.

In the quoted-printable TES, any octet other than part of a CRLF line break may be converted into the three-character sequence consisting of an EQUALS SIGN followed by the hexadecimal code value of the octet, e.g. "=3D". Octets with a hexadecimal value in the ranges 21 to 3C and 3E to 7E may alternatively (and would normally) be converted into the character represented in ASCII by that code value; octets outside these ranges have only the three-character form available.
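The outline above can be sketched in a few lines. This is a deliberately simplified illustration: line-length limits, CRLF handling and the special treatment of spaces in RFC 2045 are omitted, and the function name `qp_encode` is hypothetical. The standard-library `quopri` decoder is used only as a cross-check.

```python
import quopri


def qp_encode(octets: bytes) -> str:
    """Simplified quoted-printable: no line-length or CRLF handling."""
    out = []
    for b in octets:
        if 0x21 <= b <= 0x7E and b != 0x3D:
            out.append(chr(b))        # literal ASCII character
        else:
            out.append(f"={b:02X}")   # EQUALS SIGN + two hexadecimal digits
    return "".join(out)


data = "a=\u00e9".encode("latin-1")   # octets 61 3D E9
encoded = qp_encode(data)
print(encoded)

# Cross-check with the standard library decoder.
print(quopri.decodestring(encoded.encode("ascii")) == data)
```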

In the base64 TES three successive octets are concatenated into a 24-bit sequence, which is then divided into four successive 6-bit groups. These are then translated into characters according to a table that forms part of the base64 specification.
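The regrouping of bits can be sketched as follows. The sketch handles only input whose length is a multiple of three octets, so the padding rules of the base64 specification are left out; the function name is hypothetical and the standard-library `base64` module serves as a cross-check.

```python
import base64
import string

# The base64 alphabet: A-Z, a-z, 0-9, "+", "/".
TABLE = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_encode_groups(octets: bytes) -> str:
    """Encode input whose length is a multiple of three (no padding case)."""
    if len(octets) % 3:
        raise ValueError("this sketch omits the padding rules")
    out = []
    for i in range(0, len(octets), 3):
        # Concatenate three octets into one 24-bit value.
        word = int.from_bytes(octets[i:i + 3], "big")
        # Split it into four 6-bit values, most significant first.
        for shift in (18, 12, 6, 0):
            out.append(TABLE[(word >> shift) & 0x3F])
    return "".join(out)


print(b64_encode_groups(b"Man"))
print(b64_encode_groups(b"CEN/TC") == base64.b64encode(b"CEN/TC").decode("ascii"))
```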

The Multipurpose Internet Mail Extensions (MIME) follow this architecture and include tags in the message header to specify the CCS, CES and TES that have been used in encoding the message. One tag specifies the CCS and CES jointly, with a separate tag being used for the TES.
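The two tags can be observed with Python's standard `email` package, as in the sketch below. Which TES the library selects for a given character set is an implementation choice of that library and is not asserted here.

```python
from email.mime.text import MIMEText

# The charset argument selects the CCS/CES pair; the library chooses the TES.
msg = MIMEText("\u00deorvar\u00f0ur K\u00e1ri \u00d3lafsson", "plain", "utf-8")

print(msg["Content-Type"])               # names the CCS and CES jointly
print(msg["Content-Transfer-Encoding"])  # names the TES

# Receiving side: undo the TES, then the CES.
octets = msg.get_payload(decode=True)
print(octets.decode(msg.get_content_charset()))
```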

RFC 2130 Presentation Options and RFC 1345

In addition to the architectural model described in A.2 above, the Internet document RFC 2130 also considers the need for alternative representations for the input and output of characters when the equipment in use has limited capabilities, although it elaborates only on the situation for output. It states that possible options for the representation of characters that cannot be displayed include:

a) transliteration;

b) use of the short identifiers of the form u+xxxx specified in ISO/IEC 10646-1:Amd.9;

c) use of the representations given in RFC 1345.

It is implicit that for the successful use of such representations, not only must they be generated but also the user must understand their meaning. In the case a) of transliteration the end user may need, or may be satisfied with, the transliterated text. There is then no need for further interpretation. In the cases b) and c) the end user will normally wish to recover the characters that could not be displayed directly.

The transformation to the symbolic form, in accordance with any of the above options, would normally be modelled as a sub-layer of the User Transformation Layer of the receiver when it is software in the receiving system that is dealing with the limitations of its display capabilities. It would be possible, however, for this transformation to occur in the sending system if the receiving system had no means of circumventing its display limitations, and this could be performed either by its user in the User Transformation Layer or by software in the Application Transformation Layer. The interpretation of the display by its human user, to recover the original text, constitutes an upper sub-layer of the User Transformation Layer of the receiving system with the recovered text, which may exist only in the mind of the user, being in the User Environment.

NOTE: If the receiving system is in fact bidirectional and possesses input devices that share the limitations of its display device, the same means for representation of characters can be used for both input and output. In this case the transformation to symbolic form on receiving, and from symbolic form on sending, would be modelled within the Application Transformation Layer in accordance with 4.1 above.

The degree of complexity of transformations intended for automated processing is of little significance, but transformations for use in the User Transformation Layer are intended to be performed by human operators. Such transformations benefit substantially from being mnemonic, i.e. designed to be easily memorised, in so far as this is possible. Scheme b) above does not meet this criterion since it depends on knowledge of the coding system of ISO/IEC 10646. Transliteration schemes as in a) above meet this criterion in their particular sphere of applicability. Scheme c) uses the system given in RFC 1345, which is applicable to a large part of the repertoire of ISO/IEC 10646 and which is intended to be mnemonic in so far as most alphabetic and special characters are concerned.

The character representations of RFC 1345 use only the repertoire of the invariant subset of ISO/IEC 646, with AMPERSAND "&" being reserved as an introducer to signify that the following string is composed of character representations in accordance with the scheme. The characters covered by RFC 1345 are divided into two groups:

a) A group with two-character mnemonics, primarily intended for alphabetic scripts like Latin, Greek, Cyrillic, Hebrew and Arabic, and special characters;

b) A group with variable-length representations that are not mnemonic to any reasonable extent, primarily intended for ideographic scripts but also used for some accented letters and special characters.

The alphabetic characters with two-character mnemonics are represented by a base letter as the first letter, followed by a second character that represents an accent or relation to a non-Latin script. Non-Latin letters are transliterated to Latin letters, following transliteration standards as closely as possible. This is also done with certain Latin letters such as ETH, THORN, and A WITH RING ABOVE. The second characters have the following significance:

Exclamation mark	!	Grave accent
Apostrophe	'	Acute accent
Greater-Than sign	>	Circumflex accent
Question Mark	?	Tilde
Hyphen-Minus	-	Macron
Left parenthesis	(	Breve
Full Stop	.	Dot Above
Colon	:	Diaeresis
Comma	,	Cedilla
Underline	_	Underline
Solidus	/	Stroke
Quotation mark	"	Double acute accent
Semicolon	;	Ogonek
Less-Than sign	<	Caron
Zero	0	Ring above
Two	2	Hook
Nine	9	Horn
Equals sign	=	Cyrillic
Asterisk	*	Greek
Percent sign	%	Greek/Cyrillic special
Plus	+	smalls: Arabic, capitals: Hebrew
Three	3	some Latin/Greek/Cyrillic letters
Four	4	Bopomofo
Five	5	Hiragana
Six	6	Katakana

Under this scheme, for example, the representations a!, a= and a* correspond to LATIN SMALL LETTER A WITH GRAVE, CYRILLIC SMALL LETTER A and GREEK SMALL LETTER ALPHA respectively.
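The accent rows of the table lend themselves to a small sketch. It covers only those second characters that have a canonical Unicode composition, derives the result from Unicode composition rather than from the RFC 1345 tables themselves, and may therefore accept combinations that RFC 1345 does not define. The function name is hypothetical.

```python
import unicodedata

# Second character -> combining mark, for the accent rows of the table above
# that have a canonical Unicode composition. Script rows are not covered.
COMBINING = {
    "!": "\u0300",   # grave
    "'": "\u0301",   # acute
    ">": "\u0302",   # circumflex
    "?": "\u0303",   # tilde
    "-": "\u0304",   # macron
    "(": "\u0306",   # breve
    ".": "\u0307",   # dot above
    ":": "\u0308",   # diaeresis
    ",": "\u0327",   # cedilla
    '"': "\u030b",   # double acute
    ";": "\u0328",   # ogonek
    "<": "\u030c",   # caron
    "0": "\u030a",   # ring above
}


def mnemonic_to_char(mnemonic: str) -> str:
    """Resolve one two-character mnemonic such as "a!" or "s<"."""
    base, modifier = mnemonic
    composed = unicodedata.normalize("NFC", base + COMBINING[modifier])
    if len(composed) != 1:
        raise ValueError(f"no precomposed character for {mnemonic!r}")
    return composed


print(mnemonic_to_char("a!"), mnemonic_to_char("s<"), mnemonic_to_char("o\""))
```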

The Netherlands Transformation Scheme

Part 5 of the manual of Standards for the electronic exchange of personal data, produced by the Netherlands Ministry of the Interior, is devoted to character sets. Annex 8 of that document specifies a transformation scheme whose aim is similar to that of the presentation options of RFC 2130 described in A.3 above. Like that of RFC 1345, it is intended as a mnemonic scheme but its scope is restricted to the characters of the Latin alphabets of the various parts of ISO/IEC 8859. It is intended for input as well as presentation and so may operate in the User Transformation Layers of both sending and receiving systems. It represents letters not in the basic 26-letter Latin alphabet by a base letter preceded by a special character, in contrast to RFC 1345 in which the modifier character follows the base letter. Its special characters and their significance are as follows:

Solidus	/	Acute accent
Reverse Solidus	\	Grave accent
Greater-Than sign	>	Circumflex accent
Percent sign	%	Diaeresis
Tilde	~	Tilde
Asterisk	*	Caron
Number sign	#	Breve
Plus sign	+	Double acute accent
Commercial AT	@	Ring above, or Dot above, or Dotless I
Equals sign	=	Macron, or Stroke (but not on O)
Dollar sign	$	Cedilla, or Ogonek, or Stroke (only on O)
Ampersand	&	Ligature or special form

These special characters are doubled when they are required to be interpreted literally as themselves. The document includes tables which give the complete representation for all characters to which the scheme is applicable, including punctuation marks and other special characters not covered by the above summary.
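A prefix scheme of this kind, including the doubling rule, might be decoded as in the sketch below. Only the rows of the summary table with a single meaning are handled; the rows for Commercial AT, Equals sign, Dollar sign and Ampersand depend on the full tables of the source document and are omitted. The function is hypothetical and is not taken from the Netherlands manual.

```python
import unicodedata

# Unambiguous rows of the table above: special character -> combining mark.
PREFIX = {
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    ">": "\u0302",   # circumflex
    "%": "\u0308",   # diaeresis
    "~": "\u0303",   # tilde
    "*": "\u030c",   # caron
    "#": "\u0306",   # breve
    "+": "\u030b",   # double acute
}


def decode(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PREFIX and nxt == ch:
            out.append(ch)               # doubled special character: literal
            i += 2
        elif ch in PREFIX and nxt:
            out.append(unicodedata.normalize("NFC", nxt + PREFIX[ch]))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode("vzru*suj/ic/i"))
print(decode("50%% of K/ari"))
```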

The TERENA C3 Code Conversion System

The Trans-European Research and Education Networking Association (TERENA) has under development the C3 System for coded character set conversion between a source and a target character set chosen independently from a number of coded character sets, 13 in the current implementation. Since the repertoires of these character sets differ, the system has to address the fact that not all characters have a straightforward conversion between the two sets concerned. It makes available three distinct types of conversion to meet this problem:

a) Conversion type 1

This is a one-to-one conversion that preserves the length of lines and data fields by always transforming one source character to one target character, for example:

	LATIN CAPITAL LIGATURE AE	→ A
	COPYRIGHT SIGN	→ C
	MICRO SIGN	→ u
	AMPERSAND	→ &

b) Conversion type 2

This is a legible conversion that gives the user as much information about the original character as possible by means of a legible representation, sometimes using more than one character, for example:

	LATIN CAPITAL LIGATURE AE	→ AE
	COPYRIGHT SIGN	→ (c)
	MICRO SIGN	→ u
	AMPERSAND	→ &

c) Conversion type 3

This is a reversible conversion which uses a one-to-many representation of characters not available in the target character set, designed to make reverse conversion to the original character set possible without any information loss or distortion, for example:

	LATIN CAPITAL LIGATURE AE	→ &AE
	COPYRIGHT SIGN	→ &(C
	MICRO SIGN	→ &mu
	AMPERSAND	→ &&
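The three conversion types can be compared side by side in a short sketch. The tables contain only the four sample characters listed above, and the functions `convert` and `reverse_type3` are hypothetical helpers, not part of the C3 System.

```python
AE, COPYRIGHT, MICRO = "\u00c6", "\u00a9", "\u00b5"

# Only the four sample characters listed above; the real tables are far larger.
TABLES = {
    1: {AE: "A", COPYRIGHT: "C", MICRO: "u", "&": "&"},
    2: {AE: "AE", COPYRIGHT: "(c)", MICRO: "u", "&": "&"},
    3: {AE: "&AE", COPYRIGHT: "&(C", MICRO: "&mu", "&": "&&"},
}


def convert(text: str, conversion_type: int) -> str:
    table = TABLES[conversion_type]
    return "".join(table.get(ch, ch) for ch in text)


def reverse_type3(text: str) -> str:
    """Undo conversion type 3 for the sample table (hypothetical helper)."""
    inverse = {v: k for k, v in TABLES[3].items()}
    out, i = [], 0
    while i < len(text):
        for token, ch in inverse.items():
            if text.startswith(token, i):
                out.append(ch)
                i += len(token)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


sample = f"{AE}sir & {COPYRIGHT} 5 {MICRO}m"
for kind in (1, 2, 3):
    print(kind, convert(sample, kind))
print(reverse_type3(convert(sample, 3)) == sample)
```

Type 1 preserves the length of the string, type 2 favours legibility, and only type 3 allows the original to be recovered, which is why the ampersand itself has to be doubled.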

In the model of this European Standard the conversions of the C3 system are modelled as a relay system integrated within the Interchange Transformation Layer, as for the Transfer Encoding Syntax in A.2 above. In this case, however, the Application Transformation Layers are not both transparent. The Interchange Transformation Layer in the receiving system of the relay decodes the source binary data using the source coded character set, so recovering, in the Interchange Environment, the character string that it represents. This character string is passed transparently to the common Application Environment within the relay. The sending Application Transformation Layer performs the appropriate conversion of the type requested, to produce, in the Interchange Environment of the sending system, a character string within the repertoire of the target coded character set. This is encoded, using the target code, by the Interchange Transformation Layer of the sender to produce the output binary data of the conversion.

In this application of the model it is the Application Transformation Layer of the relay sender that carries the key aspect of the conversion, namely how characters are handled that are not common to the repertoires of both source and target codes. The Interchange Transformation Layers are almost trivial, in that they are merely straightforward uses of the source and target codes themselves. Note how this contrasts with the relay system that models the Transfer Encoding Syntax in A.2 above. There it is the Interchange Transformation Layer of the relay receiver that carries the key aspect of the conversion.

Annex B

(informative)

Bibliography

prEN 1923, Information technology – European character repertoires and their coding – 8 bit single byte encoding.

ISO/IEC 646:1991, Information technology – ISO 7-bit coded character set for information interchange.

ISO/IEC 2022:1994, Information technology – Character code structure and extension techniques.

ISO 2375:1985, Data processing – Procedure for registration of escape sequences.

NOTE: The Registration Authority for ISO 2375 is the Information Processing Society of Japan/Information Technology Standards Commission of Japan (IPSJ/ITSCJ), Room 308-3, Kikai Shinko Kaikan Building, 3-5-8 Shiba-koen, Minato-ku, Tokyo 105-0011, Japan, which maintains the ISO International Register of Coded Character Sets to be used with Escape Sequences available at the URL http://www.itstc.ipsj.or.jp/ISO-IR/.

ISO 3602:1989, Documentation – Romanization of Japanese (kana script).

ISO/IEC 4873:1991, Information technology – ISO 8-bit code for information interchange – Structure and rules for implementation.

ISO/IEC 6937:1994, Information technology – Coded graphic character set for text communication – Latin alphabet.

ISO/IEC 7498-1:1994, Information technology – Open Systems Interconnection – Basic Reference Model: The Basic Model.

ISO/IEC 8859-1:1998, Information technology – 8-bit single-byte coded graphic character sets – Part 1: Latin alphabet No. 1.

ISO/IEC DIS 8859-2, Information technology – 8-bit single-byte coded graphic character sets – Part 2: Latin alphabet No. 2.

ISO/IEC DIS 8859-13, Information technology – 8-bit single-byte coded graphic character sets – Part 13: Latin alphabet No. 7.

ISO 8879:1986, Information processing – Text and office systems – Standard Generalized Markup Language (SGML).

ISO/IEC 10646-1:1993, Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane.

ISO/IEC TR 15285:1998, Information technology – An operational model for characters and glyphs.

RFC 1345: 1992, Request for Comments: Character Mnemonics & Character Sets.

RFC 2045: 1996, Multipurpose Internet Mail Extensions (MIME). Part One: Format of Internet Message Bodies.

RFC 2130: 1997, Request for Comments: Report of the IAB Character Set Workshop held 29 February - 1 March, 1996.