Christmas gift from the IETF: stringprep

From: Stephane Bortzmeyer (bortzmeyer@nic.fr)
Date: Thu Dec 26 2002 - 04:15:08 EST

    Those who want to actually use it can look at the libstringprep
    library <URL:http://www.josefsson.org/libstringprep/>.

    Network Working Group                                         P. Hoffman
    Request for Comments: 3454                                    IMC & VPNC
    Category: Standards Track                                    M. Blanchet
                                                                    Viagenie
                                                               December 2002

            Preparation of Internationalized Strings ("stringprep")

    Status of this Memo

       This document specifies an Internet standards track protocol for the
       Internet community, and requests discussion and suggestions for
       improvements. Please refer to the current edition of the "Internet
       Official Protocol Standards" (STD 1) for the standardization state
       and status of this protocol. Distribution of this memo is unlimited.

    Copyright Notice

       Copyright (C) The Internet Society (2002). All Rights Reserved.

    Abstract

       This document describes a framework for preparing Unicode text
       strings in order to increase the likelihood that string input and
       string comparison work in ways that make sense for typical users
       throughout the world. The stringprep protocol is useful for protocol
       identifier values, company and personal names, internationalized
       domain names, and other text strings.

       This document does not specify how protocols should prepare text
       strings. Protocols must create profiles of stringprep in order to
       fully specify the processing options.

    Table of Contents

       1. Introduction
         1.1 Terminology
         1.2 Using stringprep in protocols
       2. Preparation Overview
       3. Mapping
         3.1 Commonly mapped to nothing
         3.2 Case folding
       4. Normalization
       5. Prohibited Output
         5.1 Space characters
         5.2 Control characters
         5.3 Private use
         5.4 Non-character code points
         5.5 Surrogate codes
         5.6 Inappropriate for plain text
         5.7 Inappropriate for canonical representation
         5.8 Change display properties or deprecated
         5.9 Tagging characters
       6. Bidirectional Characters
       7. Unassigned Code Points in Stringprep Profiles
         7.1 Categories of code points
         7.2 Reasons for difference between stored strings and queries
         7.3 Versions of applications and stored strings
       8. References
         8.1 Normative references
         8.2 Informative references
       9. Security Considerations
         9.1 Stringprep-specific security considerations
         9.2 Generic Unicode security considerations
       10. IANA Considerations
       11. Acknowledgements
       A. Unicode repertoires
         A.1 Unassigned code points in Unicode 3.2
       B. Mapping Tables
         B.1 Commonly mapped to nothing
         B.2 Mapping for case-folding used with NFKC
         B.3 Mapping for case-folding used with no normalization
       C. Prohibition tables
         C.1 Space characters
           C.1.1 ASCII space characters
           C.1.2 Non-ASCII space characters
         C.2 Control characters
           C.2.1 ASCII control characters
           C.2.2 Non-ASCII control characters
         C.3 Private use
         C.4 Non-character code points
         C.5 Surrogate codes
         C.6 Inappropriate for plain text
         C.7 Inappropriate for canonical representation
         C.8 Change display properties or are deprecated
         C.9 Tagging characters
       D. Bidirectional tables
         D.1 Characters with bidirectional property "R" or "AL"
         D.2 Characters with bidirectional property "L"
       Authors' Addresses
       Full Copyright Statement

    1. Introduction

       Application programs can display text in many different ways.
       Similarly, a user can enter text into an application program in a
       myriad of fashions. Internationalized text (that is, text that is
       not restricted to the narrow set of US-ASCII characters) has many
       input and display behaviors that make it difficult to compare text in
       a consistent fashion.

       This document specifies a framework of processing rules for Unicode
       text. Other protocols can create profiles of these rules; these
       profiles will allow users to enter internationalized text strings in
       applications and have the highest chance of getting the content of
       the strings correct. In this case, "correct" means that if two
       different people enter what they think is the same string into two
       different input mechanisms, the strings should match on a character-
       by-character basis.
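
       As a rough illustration of that notion of "correct" (not part of
       the RFC text), the Python sketch below compares two encodings of
       what a user would call the same word. The sample word and the
       variable names are assumptions chosen for the example; only the
       standard unicodedata module is used.

```python
import unicodedata

# Two ways of entering what a user would consider the same word.
precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Compared code point by code point, the raw strings differ.
raw_match = precomposed == decomposed

# After NFKC normalization, both inputs become the same sequence.
nfkc_match = (unicodedata.normalize("NFKC", precomposed)
              == unicodedata.normalize("NFKC", decomposed))

print(raw_match, nfkc_match)
```

       Normalization is only one of the steps a profile can require;
       the example leaves out mapping and prohibition checks.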

       This framework does not describe how data is transcoded from other
       character sets into Unicode. In systems that use non-Unicode
       character sets, the transcoding algorithm is a critical part of
       enabling secure and "correct" operation of internationalized text
       strings.

       In addition to helping string matching, profiles of stringprep can
       also exclude characters that should not normally appear in text that
       is used in the protocol. The profile can prevent such characters by
       changing the characters to be excluded to other characters, by
       removing those characters, or by causing an error if the characters
       would appear in the output. For example, because the backspace
       character can cause unpredictable display results, a profile can
       specify that a string containing a backspace character would cause an
       error.
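
       The three ways of excluding a character described above could be
       sketched as follows. This is a minimal illustration, not taken
       from the RFC: the function names and the choice of a space as
       the replacement character are assumptions made for the example.

```python
BACKSPACE = "\u0008"

def exclude_by_replacing(text, replacement=" "):
    """Change the excluded character to another character."""
    return text.replace(BACKSPACE, replacement)

def exclude_by_removing(text):
    """Remove the excluded character from the string."""
    return text.replace(BACKSPACE, "")

def exclude_by_error(text):
    """Refuse any string in which the excluded character appears."""
    if BACKSPACE in text:
        raise ValueError("prohibited character U+0008 in output")
    return text
```

       A real profile picks one treatment per character; the three
       functions are shown side by side only for comparison.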

       A profile of stringprep converts a single string of input characters
       to a string of output characters, or returns an error if the output
       string would contain a prohibited character. Stringprep profiles
       cannot both emit a string and return an error.
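
       The "string or error, never both" contract can be sketched with
       Python's standard stringprep module, which exposes the RFC 3454
       tables. The prepare() function, the exception class, and the
       particular choice of tables (B.1, B.2, C.2.1 and C.2.2) are
       assumptions for illustration; they do not correspond to any
       published profile.

```python
import stringprep
import unicodedata

class ProhibitedCharacterError(ValueError):
    """Raised instead of returning a string (hypothetical name)."""

def prepare(text):
    # Map: drop code points listed in table B.1 and case-fold the
    # remaining ones with table B.2.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # Normalize: NFKC, using the Unicode 3.2 data bundled with Python.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    # Prohibit: any control character (tables C.2.1 and C.2.2) in the
    # output makes the whole call fail; no partial string is returned.
    for ch in normalized:
        if stringprep.in_table_c21_c22(ch):
            raise ProhibitedCharacterError(
                "prohibited code point U+%04X" % ord(ch))
    return normalized
```

       With this sketch, prepare("Example") yields a folded string,
       while an input containing a backspace raises the error and
       returns nothing.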

       Stringprep profiles cannot account for all of the variations that
       might occur or that a user might expect. In particular, a profile
       will not be able to account for choice of spellings in all languages
       for all scripts because the number of alternative spellings of words
       and phrases is immense. Users would probably expect all spelling
       equivalents to be made equivalent, or none of them to be. Examples
       of spelling equivalents include "theater" vs. "theatre", and
       "hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
       Other examples are simplified Chinese spellings of names (for
       example, "<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
       Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
       Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
       which are sometimes considered equivalent in German, may not be
       considered equivalent in other languages.
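
       To make the limitation concrete, the sketch below applies
       generic case folding and NFKC normalization to the spelling
       pairs quoted above and checks that each pair still differs. The
       fold() helper is a stand-in assumed for the example; it is not a
       stringprep profile.

```python
import unicodedata

def fold(text):
    # Stand-in for character-level preparation: case folding
    # followed by NFKC normalization.
    return unicodedata.normalize("NFKC", text.casefold())

pairs = [
    ("theater", "theatre"),
    ("hemoglobin", "h\u00e6moglobin"),
    ("\u7edf\u4e00\u7801", "\u7d71\u4e00\u78bc"),  # simplified vs. traditional
    ("Aepfel", "\u00c4pfel"),
]

# Character-level processing leaves every pair unequal.
still_different = [fold(a) != fold(b) for a, b in pairs]
print(all(still_different))
```

       Making such pairs compare equal would need language-specific
       knowledge, which character-level mapping and normalization do
       not carry.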

    ...


