L2/09-039 - Mark Davis

IDNAbis - Label Categorization

The following is a non-overlapping categorization of all labels made up of characters from [\-A-Za-z0-9], with examples. It is an elaboration of the distinctions made in defs.


| # | Label Term | Pattern | Definition | Examples |
|---|------------|---------|------------|----------|
| 1 | A-Label | xn--* | The * is valid Punycode, passes IDN tests | xn--bcker-gra ("bäcker") |
| 2 | Fails-IDN5 | xn--* | The * is valid Punycode <= 59 long, fails IDN Domain Name Lookup Protocol (Sec 5) | xn--g6h ("♥"), xn--bcker-gra ("Bäcker") |
| 3 | Fails-IDN4-only | xn--* | The * is valid Punycode <= 59 long, fails IDN Registration Protocol (Sec 4) but not Domain Name Lookup (Sec 5) | xn--a-0hc ("aא") |
| 4 | Overlong Punycode | xn--* | The * is valid Punycode but 60 bytes or more (invalid DNS) | xn--o39a20gda89ku8a4mt2wnra67lzvaw9qrno41a245bf6am0w14sdib7zvppbz309c6da ("가낗나뇲다댯라럈마먔ᄇ뱟사샷악얐ᄌ쟛차챴카컀") |
| 5 | Invalid Punycode | xn--* | The * is invalid Punycode | xn--a, xn-- |
| 6 | Invalid ACE Prefix | !x*--*, *!n--*, !x!n--* | The pattern has hyphens in positions 3 and 4, but doesn't start with "xn" | ab--g6h |
| 7 | Valid LDH | RFC 952, except above | length < 64, ... | abc |
| 8 | Other ASCII | all but above | | $a3& |

Names for various subgroupings are also useful. For example, Terms 1-5 are all "putative A-Labels" or "ACE Prefix" labels. Terms 4-6 could be called "Broken IDN". Terms 2-6 could be called "Invalid IDN".
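
The purely syntactic part of this categorization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name coarse_category and its return strings are hypothetical, the ACE prefix is matched case-insensitively, and leading digits are allowed in LDH labels (the RFC 1123 relaxation of RFC 952). It does not check Punycode validity or run any IDN tests, so it cannot tell Terms 1, 2, 3 and 5 apart, and it reports an overlong label without confirming that the rest is valid Punycode.

```python
import re

ACE_PREFIX = "xn--"
MAX_LABEL_BYTES = 63  # DNS label limit, which leaves 59 bytes after "xn--"

# Letters, digits and hyphens, with no leading or trailing hyphen.
# Assumption: a leading digit is accepted (RFC 1123 relaxation of RFC 952).
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def coarse_category(label: str) -> str:
    """Return a coarse, syntax-only category for an ASCII label."""
    if label.lower().startswith(ACE_PREFIX):
        if len(label) > MAX_LABEL_BYTES:
            # Term 4 if the rest is valid Punycode, otherwise Term 5.
            return "overlong putative A-Label"
        # Terms 1, 2, 3 or 5; deciding needs Punycode and IDN checks.
        return "putative A-Label"
    if label[2:4] == "--":
        # Hyphens in positions 3 and 4 without the "xn" start (Term 6).
        return "Invalid ACE Prefix"
    if _LDH.fullmatch(label) and len(label) <= MAX_LABEL_BYTES:
        return "Valid LDH"  # Term 7
    return "Other ASCII"  # Term 8


for example in ("xn--bcker-gra", "ab--g6h", "abc", "$a3&"):
    print(example, "->", coarse_category(example))
```

The order of the checks mirrors the "except above" and "all but above" wording of the table: each later category is only reached when none of the earlier patterns matched.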

Relation between Unicode and Punycode

All Unicode strings are mapped (reversibly) by Punycode to one of the following (adding the ACE prefix):

  • A-Label
  • Fails-IDN5
  • Fails-IDN4-only
  • Overlong Punycode

Thus for each of 1-4 there is a corresponding Unicode String (Label):
  1. U-Label
  2. Unicode-Fails-IDN5
  3. Unicode-Fails-IDN4-only
  4. Overlong-Unicode.

Note that apparent Punycode strings might not map to Unicode, such as the "a" in "xn--a".
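
The reversible mapping described above can be tried out with the "punycode" codec from the Python standard library. The helper names to_ace and from_ace are hypothetical and exist only for this sketch. The codec performs the Punycode transformation alone, with no IDNA checks, and its notion of what counts as valid Punycode may not match the "Invalid Punycode" category used here.

```python
ACE_PREFIX = "xn--"


def to_ace(unicode_label: str) -> str:
    """Punycode-encode a label and prepend the ACE prefix (no IDNA checks)."""
    return ACE_PREFIX + unicode_label.encode("punycode").decode("ascii")


def from_ace(ace_label: str) -> str:
    """Strip the ACE prefix and Punycode-decode the remainder."""
    if not ace_label.lower().startswith(ACE_PREFIX):
        raise ValueError("label does not carry the ACE prefix")
    return ace_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


# Strings taken from the examples for Terms 1-3; "\u05d0" is HEBREW LETTER ALEF.
for text in ("bäcker", "♥", "a\u05d0"):
    ace = to_ace(text)
    assert from_ace(ace) == text  # the mapping is reversible
    print(ace)
```

All three strings round-trip, even though only the first is a U-Label: whether the result is an A-Label, Fails-IDN5 or Fails-IDN4-only is decided by the IDN tests, not by Punycode itself.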

Inconsistency in current defs

The term "LDH label" is defined in:


2.3.1.2. LDH-label and Internationalized Label
 These specifications use the term "LDH-label" strictly to refer to an
all-ASCII label that obeys the preferred syntax (often known as
"hostname" (from RFC 952 [RFC0952]) or "LDH") conventions and that is
not an IDN.

That implies that an LDH-label is any valid LDH label that is not an A-Label. However, the diagram below (section 2.3.1.6 in defs) shows LDH-label as being neither an A-label nor a Broken IDN.


 _______________________     _______________________
|     ASCII Labels      |   |       Non-ASCII       |
|                       |   |                       |
|   ____________________|   |   ____________________|
|  | LDH-conforming (1) |   |  | U-label (2)        |
|  |                    |   |  |____________________|
|  |   _________________|   |  |                    |
|  |  | LDH-label       |   |  | Binary Label       |
|  |  |_________________|   |  | (including         |
|  |  | A-label         |   |  | high bit on)       |
|  |  |_________________|   |  |____________________|
|  |  |                 |   |  |                    |
|  |  | Broken IDN      |   |  | Bit String         |
|  |  | e.g., xn--?,    |   |  | Label              |
|  |  | abc--def        |   |  |____________________|
|  |  |_________________|   |_______________________|
|  |____________________|
|   ____________________|
|  | Not-LDH-Conforming |
|  |                    |
|  |   _________________|
|  |  | SRV & SRV-like  |
|  |  | e.g., _tcp      |
|  |  |_________________|
|  |  | Leading or      |
|  |  | trailing        |
|  |  | hyphens         |
|  |  |_________________|
|  |  | Other non-LDH   |
|  |  | ASCII chars     |
|  |  | e.g., #$%&_     |
|  |  |_________________|
|  |____________________|
|_______________________|


Inconsistency in protocol

The following statement says "U-label". This is incorrect: applying sections 5.1-5.5 does not guarantee that the result is a U-Label, since those sections do not require the application of the BIDI or Context rules. Similarly, we can't use the term "A-Label" (Sec 5.6, 5.7), since the putative A-Label may not be one.

5.6. Punycode Conversion

 The validated string, a U-label, is converted to an A-label using the
Punycode algorithm with the ACE prefix added.