Unicode bidirectional algorithm
| "Unicode Bidirectional Algorithm" (Unicode 17.0.0 (Revision 51, 13 August 2025)) | |
|---|---|
The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.
Background
Most writing systems display text from left to right, but several scripts, including Arabic, Hebrew, Thaana, and Syriac, are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, the correct display order of the characters is ambiguous.
The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.
Directional Formatting Characters
The UBA defines several categories of special control characters used to influence text direction:
Implicit Directional Marks
Lightweight, zero-width characters that have no visible glyph of their own but act as strong directional anchors for the surrounding text:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRM | U+200E | LEFT-TO-RIGHT MARK |
| RLM | U+200F | RIGHT-TO-LEFT MARK |
| ALM | U+061C | ARABIC LETTER MARK |
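As an illustration, the marks in the table can be inspected with Python's standard `unicodedata` module. This is a minimal sketch: the dictionary and function names are hypothetical, and the reported values depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The three implicit directional marks listed in the table above.
IMPLICIT_MARKS = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "ALM": "\u061c",
}

def describe_marks():
    """Return (abbreviation, code point, name, bidi class) for each mark."""
    rows = []
    for abbr, ch in IMPLICIT_MARKS.items():
        rows.append((
            abbr,
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.bidirectional(ch),  # strong type the mark behaves as
        ))
    return rows

if __name__ == "__main__":
    for row in describe_marks():
        print(*row, sep=" | ")
```

Each mark reports a strong bidirectional class (L, R, or AL), which is why it can anchor the direction of neighboring neutral characters.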
Explicit Directional Embeddings
Signal that a piece of text is to be treated as embedded in a given direction:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRE | U+202A | LEFT-TO-RIGHT EMBEDDING |
| RLE | U+202B | RIGHT-TO-LEFT EMBEDDING |
Explicit Directional Overrides
Force characters to be treated as strongly directional, overriding their implicit types:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE |
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE |
Explicit Directional Isolates
Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRI | U+2066 | LEFT-TO-RIGHT ISOLATE |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE |
| FSI | U+2068 | FIRST STRONG ISOLATE |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE |
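One common use of isolates is wrapping text of unknown direction (for example a user name) before inserting it into a larger string. The sketch below assumes that use case; the helper names `isolate` and `format_greeting` are hypothetical and not part of any standard API.

```python
# Isolate control characters from the table above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(text, direction=None):
    """Wrap text in an isolate.

    direction is "ltr", "rtl", or None; None uses FSI, which takes the
    direction from the first strong character inside the isolate.
    """
    opener = {"ltr": LRI, "rtl": RLI, None: FSI}[direction]
    return f"{opener}{text}{PDI}"

def format_greeting(user_name):
    """Insert a name of unknown direction without disturbing the sentence."""
    return f"Hello, {isolate(user_name)}!"
```

Using FSI by default means the caller does not need to know the direction of the inserted text in advance.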
Terminating Characters
| Abbreviation | Code Point | Name | Terminates |
|---|---|---|---|
| PDF | U+202C | POP DIRECTIONAL FORMATTING | LRE, RLE, LRO, RLO |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE | LRI, RLI, FSI |
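The pairing of openers and terminators can be checked with a simple stack. The following is a simplified sketch, not the full explicit-level rules (X1–X10): the function name is hypothetical, and it treats a PDI as also ending any embeddings or overrides opened inside the isolate it closes.

```python
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"

def unterminated_openers(text):
    """Return opening characters never closed by a PDF or PDI."""
    stack = []
    for ch in text:
        if ch in EMBEDDING_OPENERS or ch in ISOLATE_OPENERS:
            stack.append(ch)
        elif ch == PDF:
            # PDF only closes the most recent embedding or override.
            if stack and stack[-1] in EMBEDDING_OPENERS:
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest isolate opener, together with any
            # embeddings or overrides opened inside that isolate.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i] in ISOLATE_OPENERS:
                    del stack[i:]
                    break
    return stack
```

An empty result means every opener in the string was explicitly terminated; a non-empty result lists the characters whose effect would otherwise run to the end of the paragraph.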
The Algorithm
The UBA processes text in four main phases:
1. Paragraph Separation
Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.
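A minimal sketch of this step, assuming Python's `unicodedata` module as the source of bidirectional types. The function name is hypothetical; each separator is kept with the paragraph it ends, and a CR LF pair is kept together.

```python
import unicodedata

def split_paragraphs(text):
    """Split text after every character whose bidi class is B."""
    paragraphs, current = [], []
    for ch in text:
        # Keep a CR LF pair together instead of starting a new paragraph.
        if ch == "\n" and not current and paragraphs and paragraphs[-1].endswith("\r"):
            paragraphs[-1] += ch
            continue
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```

Each returned string can then be passed through the remaining phases on its own.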
2. Initialization
Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
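The initialization step might look like the sketch below, which again relies on `unicodedata` for the character types. The function name and the choice to start every level at the paragraph level are assumptions made for illustration.

```python
import unicodedata

def initialize(paragraph, paragraph_level=0):
    """Return parallel lists of bidi types and initial embedding levels."""
    # unicodedata.bidirectional returns "" for code points without an
    # entry in the interpreter's copy of the Unicode Character Database.
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    levels = [paragraph_level] * len(paragraph)
    return types, levels
```

For the input `"a1 \u05d0\u0627"` (a Latin letter, a digit, a space, a Hebrew letter, and an Arabic letter) the types come out as L, EN, WS, R, and AL.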
3. Resolving Embedding Levels
A series of rules resolves the embedding level of each character:
- P1–P3: Determine the paragraph embedding level (0 for LTR, 1 for RTL).
- X1–X10: Assign explicit embedding levels based on directional formatting characters.
- W1–W7: Resolve weak types (e.g., European numbers, separators).
- N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
- I1–I2: Resolve implicit embedding levels.
The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.[1]
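The first of these steps, choosing the paragraph embedding level from the first strong character, can be sketched as follows. This is an illustration of rules P2–P3 only, with a hypothetical function name; it skips characters inside isolates, as the rules require, and does not cover any later rule.

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}

def paragraph_level(paragraph):
    """Return 0 (LTR) or 1 (RTL) based on the first strong character."""
    isolate_depth = 0
    for ch in paragraph:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ISOLATE_INITIATORS:
            isolate_depth += 1          # start skipping isolate content
        elif bidi == "PDI":
            if isolate_depth > 0:
                isolate_depth -= 1      # matching PDI ends the skipped span
        elif isolate_depth == 0:
            if bidi == "L":
                return 0
            if bidi in ("R", "AL"):
                return 1
    return 0  # no strong character found: default to left-to-right
```

A paragraph that begins with an isolated Hebrew word followed by Latin text therefore still resolves to level 0, because the isolated content is ignored when searching for the first strong character.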
4. Reordering
Rules L1–L4 reorder characters on each line for display:
- L1: Resets trailing whitespace and separators to the paragraph embedding level.
- L2: Reverses contiguous sequences of characters at the highest embedding levels, progressively down to the lowest odd level.
- L3: Reorders combining marks relative to their base characters.
- L4: Applies glyph mirroring to characters with the Bidi_Mirrored property when their resolved direction is right-to-left (e.g., "(" becomes ")").
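Rule L2 is the core of the reordering phase and can be sketched in a few lines, assuming the embedding levels have already been resolved. The function name is hypothetical, and rules L1, L3, and L4 are not covered.

```python
def reorder_line(chars, levels):
    """Apply rule L2: reverse runs from the highest level down to the lowest odd level."""
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return chars  # purely left-to-right line: nothing to reverse
    highest, lowest_odd = max(levels), min(odd_levels)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                # Find the maximal run of positions at this level or higher.
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return chars

if __name__ == "__main__":
    # Uppercase letters stand in for right-to-left characters.
    text = "car means CAR."
    levels = [0] * 10 + [1] * 3 + [0]
    print("".join(reorder_line(text, levels)))
```

The `levels` list is indexed by position and is not permuted. That is sufficient here because every run reversed at a higher level lies entirely inside the runs found at lower levels, so run boundaries never change.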
Bidirectional Character Types
Characters are classified into the following categories:
| Category | Type | Description |
|---|---|---|
| Strong | L | Left-to-Right (e.g., Latin, Han) |
| Strong | R | Right-to-Left (e.g., Hebrew) |
| Strong | AL | Right-to-Left Arabic (e.g., Arabic, Syriac) |
| Weak | EN | European Number |
| Weak | ES | European Number Separator |
| Weak | ET | European Number Terminator |
| Weak | AN | Arabic Number |
| Weak | CS | Common Number Separator |
| Weak | NSM | Nonspacing Mark |
| Neutral | B | Paragraph Separator |
| Neutral | S | Segment Separator |
| Neutral | WS | Whitespace |
| Neutral | ON | Other Neutrals |
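The grouping in the table can be expressed as a lookup over the type reported by `unicodedata.bidirectional`. The mapping and function name below are illustrative; types outside the table, such as the explicit formatting characters, are reported as "Other" in this sketch.

```python
import unicodedata

# Category for each type listed in the table above.
BIDI_CATEGORY = {
    "L": "Strong", "R": "Strong", "AL": "Strong",
    "EN": "Weak", "ES": "Weak", "ET": "Weak",
    "AN": "Weak", "CS": "Weak", "NSM": "Weak",
    "B": "Neutral", "S": "Neutral", "WS": "Neutral", "ON": "Neutral",
}

def bidi_category(ch):
    """Return (type, category) for a single character."""
    bidi = unicodedata.bidirectional(ch)
    return bidi, BIDI_CATEGORY.get(bidi, "Other")
```

For example, `bidi_category("5")` gives the weak type EN, while `bidi_category("!")` gives the neutral type ON.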
Conformance
A conforming implementation must:
- Display all visible characters in the order described by the UBA (UAX9-C1).
- Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).
Higher-Level Protocols
The UBA permits six higher-level protocol overrides (HL1–HL6), including:
- HL1: Override the paragraph embedding level.
- HL3: Emulate explicit directional formatting characters via markup (e.g., the HTML dir attribute).
- HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
- HL6: Apply additional glyph mirroring beyond the standard Bidi_Mirrored property.
HTML and CSS Equivalents
On web pages, Unicode directional formatting characters can be replaced by HTML5 and CSS3 markup:
| Unicode | HTML | CSS |
|---|---|---|
| RLI...PDI | dir="rtl" | direction:rtl; unicode-bidi:isolate |
| LRI...PDI | dir="ltr" | direction:ltr; unicode-bidi:isolate |
| FSI...PDI | <bdi>, dir="auto" | unicode-bidi:plaintext |
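As a rough illustration of the mapping in the table, the sketch below rewrites isolate characters in a string as HTML markup. The function name and the choice of `<span>` as the carrier element are assumptions; the sketch escapes other characters with `html.escape` and closes any isolate left open at the end of the string.

```python
import html

LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

# Opening and closing markup for each isolate initiator.
OPEN_TAGS = {
    LRI: ('<span dir="ltr">', "</span>"),
    RLI: ('<span dir="rtl">', "</span>"),
    FSI: ("<bdi>", "</bdi>"),
}

def isolates_to_html(text):
    """Replace LRI/RLI/FSI ... PDI with equivalent HTML elements."""
    out, closers = [], []
    for ch in text:
        if ch in OPEN_TAGS:
            open_tag, close_tag = OPEN_TAGS[ch]
            out.append(open_tag)
            closers.append(close_tag)
        elif ch == PDI:
            if closers:                 # ignore a PDI with no open isolate
                out.append(closers.pop())
        else:
            out.append(html.escape(ch))
    out.extend(reversed(closers))       # close isolates left open
    return "".join(out)
```

Moving the directional information into markup keeps invisible control characters out of the page text, where they are easy to lose or duplicate during editing.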
Security Considerations
The misuse of bidirectional formatting characters poses significant security risks, as they can be used to make malicious code or text appear benign. This is documented in Unicode Technical Report #36 (UTR36). Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
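One defensive measure is to scan text or source files for these characters before accepting them. The following is a minimal sketch of such a scan; the function name and the exact set of flagged characters are assumptions, and a real policy might allow some of them in specific contexts.

```python
import unicodedata

# Directional formatting characters described in this article.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
}

def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings
```

Line and column numbers are 1-based so that the findings can be reported in the same form most editors and compilers use.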
History
- Unicode 1.0 (1991): Basic bidirectional support introduced.
- Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
- Unicode 17.0 (2025): Current version (Revision 51).
See Also
- Unicode
- Right-to-left
- Arabic script
- Hebrew alphabet
- Unicode security
- Internationalization and localization
References
1. "Unicode Standard Annex #9: Unicode Bidirectional Algorithm". Unicode Consortium. Retrieved 2025-08-13.