Unicode equivalence
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
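A minimal sketch in Python (using the standard unicodedata module) that illustrates the canonical equivalences just described, with the same code points as in the examples above:

```python
import unicodedata

# "n" followed by a combining tilde vs. the precomposed "ñ"
decomposed = "n\u0303"   # U+006E U+0303
precomposed = "\u00f1"   # U+00F1

# The raw code point sequences differ...
assert decomposed != precomposed
# ...but both normalize to the same canonical forms.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# A precomposed Hangul syllable decomposes into its conjoining jamo.
syllable = "\ud55c"                     # 한 (U+D55C)
jamo = unicodedata.normalize("NFD", syllable)
print([hex(ord(c)) for c in jamo])      # ['0x1112', '0x1161', '0x11ab']
```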
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
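The distinction can be seen by comparing canonical and compatibility normalization of the ligature example above (a short Python sketch):

```python
import unicodedata

ligature = "\ufb00"   # "ﬀ" (LATIN SMALL LIGATURE FF)

# Canonical normalization leaves the ligature untouched...
print(unicodedata.normalize("NFC", ligature))   # 'ﬀ'
# ...while compatibility normalization replaces it with two "f" letters.
print(unicodedata.normalize("NFKC", ligature))  # 'ff'
assert unicodedata.normalize("NFKC", ligature) == "ff"
```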
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).
These traits combine to give the four normal forms described below, any of which can be used in text processing.
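As an illustration, the following Python sketch prints all four normal forms of one sample string ("ﬁné", chosen here only for illustration); the canonical forms preserve the ligature while the compatibility forms replace it:

```python
import unicodedata

s = "\ufb01n\u00e9"   # "ﬁné": fi-ligature, n, precomposed é
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, [f"U+{ord(c):04X}" for c in out])
# NFC  ['U+FB01', 'U+006E', 'U+00E9']
# NFD  ['U+FB01', 'U+006E', 'U+0065', 'U+0301']
# NFKC ['U+0066', 'U+0069', 'U+006E', 'U+00E9']
# NFKD ['U+0066', 'U+0069', 'U+006E', 'U+0065', 'U+0301']
```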
Sources of equivalence
Character duplication
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character "Å" can be encoded as U+00C5 (standard name "LATIN CAPITAL LETTER A WITH RING ABOVE", a letter of the alphabet in Swedish and several other languages) or as U+212B ("ANGSTROM SIGN"). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (like "V" for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
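The Å/angstrom duplication can be observed directly; in the sketch below (Python, unicodedata), both code points normalize to the same form because the ANGSTROM SIGN has a singleton canonical decomposition to the Swedish letter:

```python
import unicodedata

letter = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"   # ANGSTROM SIGN

assert letter != angstrom
# The two are canonically equivalent: NFC maps the angstrom sign to the letter.
assert unicodedata.normalize("NFC", angstrom) == letter
```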
Combining and precomposed characters
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).
In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
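A small Python illustration of composition and decomposition as defined above; unicodedata.decomposition exposes the canonical mapping recorded in the Unicode Character Database:

```python
import unicodedata

# Decomposition: the precomposed "Å" maps to the base letter plus a combining ring above.
print(unicodedata.decomposition("\u00c5"))                            # '0041 030A'
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00c5")])  # ['0x41', '0x30a']

# Composition: NFC recombines the base letter with its combining mark.
print(unicodedata.normalize("NFC", "A\u030a") == "\u00c5")            # True
```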
Example
NFC character | A | m | é | l | i | e | |
---|---|---|---|---|---|---|---|
NFC code point | 0041 | 006d | 00e9 | 006c | 0069 | 0065 | |
NFD code point | 0041 | 006d | 0065 | 0301 | 006c | 0069 | 0065 |
NFD character | A | m | e | ◌́ | l | i | e |
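The rows of this table can be reproduced with a short Python snippet (a sketch using the standard unicodedata module):

```python
import unicodedata

name = "Am\u00e9lie"   # "Amélie" in NFC
for form in ("NFC", "NFD"):
    out = unicodedata.normalize(form, name)
    print(form, " ".join(f"{ord(c):04x}" for c in out))
# NFC 0041 006d 00e9 006c 0069 0065
# NFD 0041 006d 0065 0301 006c 0069 0065
```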
Typographical non-interaction
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are in general canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
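For example (a Python sketch), a dot below (combining class 220) and a dot above (combining class 230) attached to the same base letter do not interact typographically, and the two storage orders normalize to the same canonical sequence:

```python
import unicodedata

# "q" with a dot below (U+0323) and a dot above (U+0307), stored in either order.
a = "q\u0323\u0307"   # dot below first
b = "q\u0307\u0323"   # dot above first

assert a != b
# Canonical ordering puts the lower combining class (220) before the higher (230),
# so both sequences normalize to the same form.
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b) == a
```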
Typographic conventions
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the double-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
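This compatibility (but not canonical) relationship can be seen by comparing NFC and NFKC for a few such characters (Python sketch):

```python
import unicodedata

samples = ["\u2460",   # ① CIRCLED DIGIT ONE
           "\u00b2",   # ² SUPERSCRIPT TWO
           "\uff76"]   # ｶ HALFWIDTH KATAKANA LETTER KA

for s in samples:
    print(repr(s),
          "NFC:", repr(unicodedata.normalize("NFC", s)),
          "NFKC:", repr(unicodedata.normalize("NFKC", s)))
# NFC leaves each character unchanged; NFKC maps them to '1', '2' and 'カ',
# discarding the circled, superscript and half-width presentation.
```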
Encoding errors
UTF-8 and UTF-16 (and some other Unicode encodings) do not allow all possible sequences of code units. Different software converts invalid sequences into Unicode characters using varying rules, some of which are very lossy (i.e., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as other normalization forms.
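As one concrete illustration (Python's built-in UTF-8 decoder with the common "replace" policy; other software may follow different rules):

```python
# Invalid UTF-8: 0xC0 0x80 is an overlong encoding and is rejected.
data = b"abc\xc0\x80def"

# Python's 'replace' error handler maps each invalid byte to U+FFFD;
# other decoders may drop the bytes or abort instead, which is why such
# conversions behave like a (lossy) normalization step.
print(data.decode("utf-8", errors="replace"))   # 'abc\ufffd\ufffddef'
```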
Normal forms

Text-processing software that implements Unicode string searching and comparison must take the presence of equivalent code point sequences into account. Otherwise, users searching for a particular code point sequence would be unable to find other, visually indistinguishable glyphs that have a different but canonically equivalent code point representation.

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criterion can be either canonical or compatibility. Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique. To compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved.
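A minimal sketch of this kind of normalization-aware comparison (the helper name equivalent is chosen here only for illustration):

```python
import unicodedata

def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print("\u00f1" == "n\u0303")              # False: raw code points differ
print(equivalent("\u00f1", "n\u0303"))    # True: canonically equivalent
```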
Errors due to normalization differences
When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software. Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.[1][2] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
See also
- Complex text layout
- Diacritic
- IDN homograph attack
- ISO 14651
- Ligature (typography)
- Precomposed character
- The uconv tool can convert to and from NFC and NFD Unicode normalization forms.
- Unicode
- Unicode compatibility characters
Notes
- ^ "Sourceforge.net". Sourceforge.net. Retrieved 20 November 2014.
- ^ "rsync, samba, UTF8, international characters, oh my!". 2009. Archived from the original on January 9, 2010.