Jump to content

Percent-encoding

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 41.46.84.123 (talk) at 16:38, 16 July 2025 (fg). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). Consequently, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests. Percent-encoding is not case-sensitive.

Types

Percent-encoding in a URI

Types of URI characters

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or, more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.

RFC 3986 section 2.2 Reserved Characters (January 2005)
! # $ & ' ( ) * + , / : ; = ? @ [ ]
RFC 3986 section 2.3 Unreserved Characters (January 2005)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ ~ .

Other characters in a URI must be percent-encoded.

Reserved characters

When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign (%) as an escape character, are then used in the URI in place of the reserved character. (A non-ASCII character is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)

The reserved character /, for example, if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /.

Reserved characters after percent-encoding
! # $ & ' ( ) * + , / : ; = ? @ [ ]
%21 %23 %24 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.

In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally considered not equivalent (denoting the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination is dependent upon the rules established for reserved characters by individual URI schemes.

Unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers should not treat %41 differently from A or %7E differently from ~, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.

Percent character

Because the percent character ( % ) serves to indicate percent-encoded octets, it must itself be percent-encoded as %25 to be used as data within a URI.

Arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.[1] Byte value 0x0F, for example, should be represented by %0F, but byte value 0x41 can be represented by A, or %41. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.

b protocols.

Current standard

The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. For example, the 13th edition of ECMA-262 includes an escape function that uses this syntax.[2] However, this behavior is not specified by any RFC, and has been rejected by the W3C.[3]

hhh

References

  1. ^ RFC 1738 §2.2; RFC 2396 §2.4; RFC 3986 §1.2.1, 2.1, 2.5.
  2. ^ "ECMAScript 2017 Language Specification (ECMA-262, 8th edition, June 2017)". Ecma International. Archived from the original on 2018-07-02. Retrieved 2018-06-20.
  3. ^ rejected

The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other: