Christmas gift from the IETF: stringprep

From: Stephane Bortzmeyer (bortzmeyer@nic.fr)
Date: Thu Dec 26 2002 - 04:15:08 EST

Next message: Michael (michka) Kaplan: "Re: Coptic II?"

Previous message: Michael Everson: "Re: Coptic II?"

Those who want to actually use it may want to look at the libstringprep
library <URL:http://www.josefsson.org/libstringprep/>.

Network Working Group                                         P. Hoffman
Request for Comments: 3454                                    IMC & VPNC
Category: Standards Track                                    M. Blanchet
                                                                Viagenie
                                                           December 2002

Preparation of Internationalized Strings ("stringprep")

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

Abstract

   This document describes a framework for preparing Unicode text
   strings in order to increase the likelihood that string input and
   string comparison work in ways that make sense for typical users
   throughout the world. The stringprep protocol is useful for protocol
   identifier values, company and personal names, internationalized
   domain names, and other text strings.

   This document does not specify how protocols should prepare text
   strings. Protocols must create profiles of stringprep in order to
   fully specify the processing options.

Table of Contents

   1. Introduction
     1.1 Terminology
     1.2 Using stringprep in protocols
   2. Preparation Overview
   3. Mapping
     3.1 Commonly mapped to nothing
     3.2 Case folding
   4. Normalization
   5. Prohibited Output
     5.1 Space characters
     5.2 Control characters
     5.3 Private use
     5.4 Non-character code points
     5.5 Surrogate codes
     5.6 Inappropriate for plain text
     5.7 Inappropriate for canonical representation
     5.8 Change display properties or deprecated
     5.9 Tagging characters
   6. Bidirectional Characters
   7. Unassigned Code Points in Stringprep Profiles
     7.1 Categories of code points
     7.2 Reasons for difference between stored strings and queries
     7.3 Versions of applications and stored strings
   8. References
     8.1 Normative references
     8.2 Informative references
   9. Security Considerations
     9.1 Stringprep-specific security considerations
     9.2 Generic Unicode security considerations
   10. IANA Considerations
   11. Acknowledgements
   A. Unicode repertoires
     A.1 Unassigned code points in Unicode 3.2
   B. Mapping Tables
     B.1 Commonly mapped to nothing
     B.2 Mapping for case-folding used with NFKC
     B.3 Mapping for case-folding used with no normalization
   C. Prohibition tables
     C.1 Space characters
       C.1.1 ASCII space characters
       C.1.2 Non-ASCII space characters
     C.2 Control characters
       C.2.1 ASCII control characters
       C.2.2 Non-ASCII control characters
     C.3 Private use
     C.4 Non-character code points
     C.5 Surrogate codes
     C.6 Inappropriate for plain text
     C.7 Inappropriate for canonical representation
     C.8 Change display properties or are deprecated
     C.9 Tagging characters
   D. Bidirectional tables
     D.1 Characters with bidirectional property "R" or "AL"
     D.2 Characters with bidirectional property "L"
   Authors' Addresses
   Full Copyright Statement

1. Introduction

   Application programs can display text in many different ways.
   Similarly, a user can enter text into an application program in a
   myriad of fashions. Internationalized text (that is, text that is
   not restricted to the narrow set of US-ASCII characters) has many
   input and display behaviors that make it difficult to compare text in
   a consistent fashion.

   This document specifies a framework of processing rules for Unicode
   text. Other protocols can create profiles of these rules; these
   profiles will allow users to enter internationalized text strings in
   applications and have the highest chance of getting the content of
   the strings correct. In this case, "correct" means that if two
   different people enter what they think is the same string into two
   different input mechanisms, the strings should match on a character-
   by-character basis.
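
A small illustration of the "correct" criterion above, as a sketch only:
two input mechanisms may produce different code point sequences for what
the user considers the same string. The sample word and the helper name
prepare are assumptions made for this example; NFKC is the normalization
form named in the table of contents (section 4 and appendix B.2).

```python
import unicodedata

# Two ways an input method might deliver the same visible word.
precomposed = "caf\u00e9"    # 'e with acute' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT


def prepare(text):
    """Hypothetical stand-in for a profile: apply NFKC normalization only."""
    return unicodedata.normalize("NFKC", text)


# The raw strings differ code point by code point...
print(precomposed == decomposed)
# ...but match character by character once both have been prepared.
print(prepare(precomposed) == prepare(decomposed))
```

A real profile would do more than normalize; this only shows why a raw
comparison is not enough.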

   This framework does not describe how data is transcoded from other
   character sets into Unicode. In systems that use non-Unicode
   character sets, the transcoding algorithm is a critical part of
   enabling secure and "correct" operation of internationalized text
   strings.

   In addition to helping string matching, profiles of stringprep can
   also exclude characters that should not normally appear in text that
   is used in the protocol. The profile can prevent such characters by
   changing the characters to be excluded to other characters, by
   removing those characters, or by causing an error if the characters
   would appear in the output. For example, because the backspace
   character can cause unpredictable display results, a profile can
   specify that a string containing a backspace character would cause an
   error.
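
The three ways of excluding a character can be sketched with Python's
standard stringprep module, which exposes the RFC 3454 tables. The
function name classify and the choice of tables B.1, B.2 and C.2.1 are
assumptions made for this illustration; an actual profile chooses its
own tables.

```python
import stringprep


def classify(ch):
    """Say what a hypothetical profile would do with one character."""
    if stringprep.in_table_b1(ch):       # B.1: commonly mapped to nothing
        return "removed"
    if stringprep.in_table_c21(ch):      # C.2.1: ASCII control characters
        return "error"
    mapped = stringprep.map_table_b2(ch)  # B.2: case folding used with NFKC
    if mapped != ch:
        return "mapped to " + repr(mapped)
    return "kept"


for ch in ["\u00ad", "\b", "A", "a"]:
    print(repr(ch), "->", classify(ch))
```

Here the soft hyphen is removed, the backspace triggers an error, and
an uppercase letter is changed to another character.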

   A profile of stringprep converts a single string of input characters
   to a string of output characters, or returns an error if the output
   string would contain a prohibited character. Stringprep profiles
   cannot both emit a string and return an error.
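
The string-or-error contract can be sketched as a function that either
returns the prepared string or raises, never both. This is not any
published profile: the name prepare, the exception class, and the
selection of tables (B.1, B.2, C.1.2, C.2.1, C.2.2) are assumptions
made for the example. The step order follows the table of contents:
map, normalize, then check for prohibited output.

```python
import stringprep
import unicodedata

# The tables are defined against Unicode 3.2, so use that database version.
ucd = unicodedata.ucd_3_2_0


class ProhibitedCharacter(ValueError):
    """Raised instead of returning a string (hypothetical name)."""


def prepare(text):
    # Step 1, mapping: drop B.1 characters, case-fold the rest with B.2.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # Step 2, normalization: NFKC.
    normalized = ucd.normalize("NFKC", mapped)
    # Step 3, prohibited output: fail on the first prohibited character.
    for ch in normalized:
        if stringprep.in_table_c12(ch) or stringprep.in_table_c21_c22(ch):
            raise ProhibitedCharacter("U+%04X is prohibited" % ord(ch))
    return normalized


print(prepare("Ex\u00adAMPLE"))
try:
    prepare("a\bb")
except ProhibitedCharacter as exc:
    print("error:", exc)
```

Raising an exception is one way to guarantee that a caller never sees a
partial output string alongside an error.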

   Stringprep profiles cannot account for all of the variations that
   might occur or that a user might expect. In particular, a profile
   will not be able to account for choice of spellings in all languages
   for all scripts because the number of alternative spellings of words
   and phrases is immense. Users would probably expect all spelling
   equivalents to be made equivalent, or none of them to be. Examples
   of spelling equivalents include "theater" vs. "theatre", and
   "hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
   Other examples are simplified Chinese spellings of names (for
   example, "<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
   Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
   Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
   which are sometimes considered equivalent in German, may not be
   considered equivalent in other languages.
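
To make the limitation concrete, the sketch below applies case folding
and NFKC to the spelling pairs mentioned above. The helper name fold is
an assumption, and Python's built-in casefold stands in for the RFC's
own mapping tables. None of the pairs becomes equal, because no
character-level rule relates them.

```python
import unicodedata


def fold(text):
    """Hypothetical helper: case folding followed by NFKC normalization."""
    return unicodedata.normalize("NFKC", text.casefold())


pairs = [
    ("theater", "theatre"),
    ("hemoglobin", "h\u00e6moglobin"),
    ("\u7edf\u4e00\u7801", "\u7d71\u4e00\u78bc"),  # simplified vs. traditional
    ("Aepfel", "\u00c4pfel"),
]

for left, right in pairs:
    # Each comparison is False: spelling variants survive preparation.
    print(left, right, fold(left) == fold(right))
```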

...


This archive was generated by hypermail 2.1.5 : Thu Dec 26 2002 - 04:54:47 EST