Network Working Group A. Costello
Request for Comments: 3492 Univ. of California, Berkeley
Category: Standards Track March 2003
Punycode: A Bootstring encoding of Unicode
for Internationalized Domain Names in Applications (IDNA)
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
Punycode is a simple and efficient transfer encoding syntax designed
for use with Internationalized Domain Names in Applications (IDNA).
It uniquely and reversibly transforms a Unicode string into an ASCII
string. ASCII characters in the Unicode string are represented
literally, and non-ASCII characters are represented by ASCII
characters that are allowed in host name labels (letters, digits, and
hyphens). This document defines a general algorithm called
Bootstring that allows a string of basic code points to uniquely
represent any string of code points drawn from a larger set.
Punycode is an instance of Bootstring that uses particular parameter
values specified by this document, appropriate for IDNA.
Table of Contents
1. Introduction...............................................2
1.1 Features..............................................2
1.2 Interaction of protocol parts.........................3
2. Terminology................................................3
3. Bootstring description.....................................4
3.1 Basic code point segregation..........................4
3.2 Insertion unsort coding...............................4
3.3 Generalized variable-length integers..................5
3.4 Bias adaptation.......................................7
4. Bootstring parameters......................................8
5. Parameter values for Punycode..............................8
6. Bootstring algorithms......................................9
Costello Standards Track [Page 1]
RFC 3492 IDNA Punycode March 2003
6.1 Bias adaptation function.............................10
6.2 Decoding procedure...................................11
6.3 Encoding procedure...................................12
6.4 Overflow handling....................................13
7. Punycode examples.........................................14
7.1 Sample strings.......................................14
7.2 Decoding traces......................................17
7.3 Encoding traces......................................19
8. Security Considerations...................................20
9. References................................................21
9.1 Normative References.................................21
9.2 Informative References...............................21
A. Mixed-case annotation.....................................22
B. Disclaimer and license....................................22
C. Punycode sample implementation............................23
Author's Address.............................................34
Full Copyright Statement.....................................35
1. Introduction
[IDNA] describes an architecture for supporting internationalized
domain names. Labels containing non-ASCII characters can be
represented by ACE labels, which begin with a special ACE prefix and
contain only ASCII characters. The remainder of the label after the
prefix is a Punycode encoding of a Unicode string satisfying certain
constraints. For the details of the prefix and constraints, see
[IDNA] and [NAMEPREP].
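As an informal illustration of the prefix-plus-Punycode structure described above, the following Python sketch uses the standard library's "punycode" codec. The label "bücher" is an arbitrary example, and the prefix value "xn--" is the one assigned by [IDNA]; it is not defined by this document. The helper name is_ace_label is hypothetical.

```python
# Sketch only: relies on Python's built-in "punycode" codec.
label = "b\u00fccher"  # "bücher", an arbitrary example label

# Punycode encoding of the Unicode string, without any prefix.
encoded = label.encode("punycode").decode("ascii")

# ACE prefix as assigned by [IDNA]; assumed here, not defined in this document.
ACE_PREFIX = "xn--"
ace_label = ACE_PREFIX + encoded


def is_ace_label(s: str) -> bool:
    """Hypothetical helper: True if s begins with the ACE prefix.

    The comparison ignores case, as is usual for host name labels.
    """
    return s.lower().startswith(ACE_PREFIX)


print(encoded)    # bcher-kva
print(ace_label)  # xn--bcher-kva
```

This sketch skips the constraints that [IDNA] and [NAMEPREP] place on the Unicode string before encoding; it shows only how the ACE label is assembled.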
Punycode is an instance of a more general algorithm called
Bootstring, which allows strings composed from a small set of "basic"
code points to uniquely represent any string of code points drawn
from a larger set. Punycode is Bootstring with particular parameter
values appropriate for IDNA.
1.1 Features
Bootstring has been designed to have the following features:
* Completeness: Every extended string (sequence of arbitrary code
points) can be represented by a basic string (sequence of basic
code points). Restrictions on what strings are allowed, and on
length, can be imposed by higher layers.
* Uniqueness: There is at most one basic string that represents a
given extended string.
* Reversibility: Any extended string mapped to a basic string can
be recovered from that basic string.
* Efficient encoding: The ratio of basic string length to extended
string length is small. This is important in the context of
domain names because RFC 1034 [RFC1034] restricts the length of a
domain label to 63 characters.
* Simplicity: The encoding and decoding algorithms are reasonably
simple to implement. The goals of efficiency and simplicity are
at odds; Bootstring aims at a good balance between them.
* Readability: Basic code points appearing in the extended string
are represented as themselves in the basic string (although the
main purpose is to improve efficiency, not readability).
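Several of the features listed above can be checked mechanically. The following Python sketch does so for a few arbitrary sample strings, using the standard library's "punycode" codec as a stand-in for an encoder and decoder; the function name check_features and the sample strings are illustrative assumptions, not part of this specification.

```python
# Sketch only: uses Python's built-in "punycode" codec as the encoder/decoder.

def check_features(extended: str) -> bytes:
    """Encode an extended string and check three of the listed features."""
    basic = extended.encode("punycode")  # extended string -> basic string

    # Completeness: the result consists of ASCII code points only.
    assert all(byte < 0x80 for byte in basic)

    # Reversibility: decoding recovers the original extended string.
    assert basic.decode("punycode") == extended

    # Readability: the ASCII code points of the input appear literally,
    # in their original order, at the start of the basic string.
    literal = "".join(ch for ch in extended if ord(ch) < 0x80)
    assert basic.decode("ascii").startswith(literal)

    return basic


# Arbitrary samples: two mixed strings and one with no ASCII code points.
for sample in ("b\u00fccher", "ma\u00f1ana", "\u4f8b\u3048"):
    print(ascii(sample), "->", check_features(sample).decode("ascii"))
```

Uniqueness and efficiency are properties of the encoding as a whole and are not demonstrated by spot checks like these.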
Punycode can also support an additional feature that is not used by
the ToASCII and ToUnicode operations of [IDNA]. When extended
strings are case-folded prior to encoding, the basic string can use
mixed case to tell how to convert the folded string into a mixed-case
string. See appendix A "Mixed-case annotation".
1.2 Interaction of protocol parts
Punycode is used by the IDNA protocol [IDNA] for converting domain
labels into ASCII; it is not designed for any other purpose. It is
explicitly not designed for processing arbitrary free text.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119
[RFC2119].
A code point is an integral value associated with a character in a
coded character set.
As in the Unicode Standard [UNICODE], Unicode code points are denoted
by "U+" followed by four to six hexadecimal digits, while a range of
code points is denoted by two hexadecimal numbers separated by "..",
with no prefixes.
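For readers who want to produce this notation programmatically, a minimal Python sketch follows. The function names u_plus and code_range are hypothetical, and the upper bound 10FFFF is the limit of the Unicode code space rather than something stated in this document.

```python
def u_plus(code_point: int) -> str:
    """Format a code point as "U+" followed by four to six hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # 04X pads to at least four digits; larger values use five or six.
    return f"U+{code_point:04X}"


def code_range(first: int, last: int) -> str:
    """Format a range as two hex numbers separated by "..", no prefixes."""
    return f"{first:X}..{last:X}"


print(u_plus(ord("\u00fc")))  # U+00FC
print(u_plus(0x1F600))        # U+1F600
print(code_range(0x0, 0x7F))  # 0..7F
```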
The operators div and mod perform integer division; (x div y) is the
quotient of x divided by y, discarding the remainder, and (x mod y)
is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses
these operators only with nonnegative operands, so the quotient and
remainder are always nonnegative.
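The definition above can be sketched in Python as follows. The function names div and mod mirror the operators; the explicit operand checks are an assumption added here to reflect the statement that Bootstring uses only nonnegative operands.

```python
def div(x: int, y: int) -> int:
    """Quotient of x divided by y, discarding the remainder."""
    if x < 0 or y <= 0:
        raise ValueError("expected x >= 0 and y > 0")
    return x // y


def mod(x: int, y: int) -> int:
    """Remainder of x divided by y."""
    if x < 0 or y <= 0:
        raise ValueError("expected x >= 0 and y > 0")
    return x % y


x, y = 17, 5
print(div(x, y), mod(x, y))               # 3 2
print(div(x, y) * y + mod(x, y) == x)     # True
```

Python's // and % round toward negative infinity, whereas C's / and % truncate toward zero. The two conventions agree for nonnegative operands, which is the only case Bootstring needs.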
The break statement jumps out of the innermost loop (as in C).
An overflow is an attempt to compute a value that exceeds the maximum
value of an integer variable.
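Python integers have no maximum value, so the sketch below simulates one in order to illustrate the definition. MAXINT and checked_mul_add are hypothetical names, and the 31-bit limit is chosen only for illustration. The check is arranged so that no intermediate value exceeds the limit, which is what a fixed-width implementation would require.

```python
MAXINT = 2**31 - 1  # hypothetical maximum value, for illustration only


def checked_mul_add(a: int, b: int, c: int) -> int:
    """Return a + b * c, raising OverflowError if it would exceed MAXINT.

    All operands must be nonnegative. The comparison uses only values
    that are themselves no larger than MAXINT.
    """
    if a < 0 or b < 0 or c < 0:
        raise ValueError("operands must be nonnegative")
    # b * c <= MAXINT - a  is equivalent to  b <= (MAXINT - a) // c  for c > 0.
    if a > MAXINT or (c != 0 and b > (MAXINT - a) // c):
        raise OverflowError("a + b * c would exceed MAXINT")
    return a + b * c


print(checked_mul_add(1, 2, 3))  # 7
```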
3. Bootstring description
Bootstring represents an arbitrary sequence of code points (the
"extended string") as a sequence of basic code points (the "basic
string"). This section describes the representation. Section 6
"Bootstring algorithms" presents the algorithms as pseudocode.
Sections 7.2 "Decoding traces" and 7.3 "Encoding traces" trace the
algorithm