Unicode bidirectional algorithm

From Wikipedia, the free encyclopedia
"Unicode Bidirectional Algorithm" (Unicode 17.0.0 (Revision 51, 13 August 2025))
The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.

Background

Most writing systems display text from left to right, but several scripts—including Arabic, Hebrew, Thaana, and Syriac—are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, ambiguities arise in determining the correct display order of characters.

The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.
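
As a small illustration of logical order, the following Python sketch lists the characters of a Hebrew word in the order in which they are stored. The sample word and the helper name `logical_order` are chosen here for illustration only and do not come from the specification.

```python
import unicodedata

def logical_order(text):
    """Return (index, code point, name) for each character in storage order."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch))
            for i, ch in enumerate(text)]

# The Hebrew word "shalom". Index 0 holds the first letter that is read,
# although a right-to-left display places it at the right-hand end.
word = "\u05e9\u05dc\u05d5\u05dd"
for entry in logical_order(word):
    print(entry)
```

The stored sequence is the same whatever the display direction; only the rendering step decides where each character appears on screen.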

Directional Formatting Characters

The UBA defines several categories of special control characters used to influence text direction:

Implicit Directional Marks

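Invisible, zero-width characters that behave like strong characters of the indicated direction. They have no glyph of their own, but they can change how neighbouring neutral or weak characters are ordered: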

| Abbreviation | Code Point | Name |
|---|---|---|
| LRM | U+200E | LEFT-TO-RIGHT MARK |
| RLM | U+200F | RIGHT-TO-LEFT MARK |
| ALM | U+061C | ARABIC LETTER MARK |
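
The properties of the three marks can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the dictionary `MARKS` and the function `describe_marks` are names introduced here, and the result depends on the Unicode version bundled with the Python interpreter (ALM requires data from Unicode 6.3 or later).

```python
import unicodedata

# Abbreviation -> character, as listed in the table above.
MARKS = {"LRM": "\u200e", "RLM": "\u200f", "ALM": "\u061c"}

def describe_marks():
    """Map each abbreviation to (general category, bidirectional class)."""
    return {
        abbr: (unicodedata.category(ch), unicodedata.bidirectional(ch))
        for abbr, ch in MARKS.items()
    }

for abbr, props in describe_marks().items():
    print(abbr, props)
```

Each mark is a format character (general category Cf) that carries a strong bidirectional class, which is why it can anchor the direction of adjacent text while staying invisible.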

Explicit Directional Embeddings

Signal that a piece of text is to be treated as embedded in a given direction:

| Abbreviation | Code Point | Name |
|---|---|---|
| LRE | U+202A | LEFT-TO-RIGHT EMBEDDING |
| RLE | U+202B | RIGHT-TO-LEFT EMBEDDING |

Explicit Directional Overrides

Force characters to be treated as strongly directional, overriding their implicit types:

| Abbreviation | Code Point | Name |
|---|---|---|
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE |
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE |

Explicit Directional Isolates

Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:

| Abbreviation | Code Point | Name |
|---|---|---|
| LRI | U+2066 | LEFT-TO-RIGHT ISOLATE |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE |
| FSI | U+2068 | FIRST STRONG ISOLATE |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE |
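
A common use of isolates is to wrap a piece of inserted text, such as a user name, so that it cannot disturb the ordering of the sentence around it. The helper below is a minimal sketch; the function name `isolate` and its `direction` parameter are assumptions made for this example.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(text, direction=None):
    """Wrap text in an isolate initiator and a closing PDI.

    direction: "ltr" -> LRI, "rtl" -> RLI, None -> FSI (the direction is
    then taken from the first strong character of the wrapped text).
    """
    initiator = {"ltr": LRI, "rtl": RLI, None: FSI}[direction]
    return f"{initiator}{text}{PDI}"

# A name of unknown direction is inserted into an English sentence.
message = "User " + isolate("\u05d3\u05e0\u05d9 123") + " logged in."
```

FSI is the usual choice when the direction of the inserted text is not known in advance.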

Terminating Characters

| Abbreviation | Code Point | Name | Terminates |
|---|---|---|---|
| PDF | U+202C | POP DIRECTIONAL FORMATTING | LRE, RLE, LRO, RLO |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE | LRI, RLI, FSI |
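
The pairing of initiators and terminators can be checked with a simple stack. The sketch below is an assumption-laden simplification rather than the algorithm's explicit-level rules: it ignores depth overflow, treats the whole string as one paragraph, and uses the hypothetical function name `unterminated`. It reflects two behaviours of the pairing: a PDI that closes an isolate also closes any embeddings or overrides opened inside it, and a PDF cannot reach past an open isolate.

```python
EMBEDDING_INITIATORS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE RLE LRO RLO
ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}              # LRI RLI FSI
PDF, PDI = "\u202c", "\u2069"

def unterminated(text):
    """Return the kinds of initiators still open at the end of the text."""
    stack = []
    for ch in text:
        if ch in EMBEDDING_INITIATORS:
            stack.append("embedding")
        elif ch in ISOLATE_INITIATORS:
            stack.append("isolate")
        elif ch == PDF:
            # PDF only closes an embedding or override on top of the stack.
            if stack and stack[-1] == "embedding":
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest open isolate and everything inside it.
            if "isolate" in stack:
                while stack.pop() != "isolate":
                    pass
    return stack
```

An empty result means that every initiator in the string was explicitly closed.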

The Algorithm

The UBA processes text in four main phases:

1. Paragraph Separation

Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.
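
Paragraph separation can be sketched with `unicodedata.bidirectional`, which returns "B" for paragraph separators. The function name `split_paragraphs` is an assumption of this example. The sketch keeps each separator with the paragraph it ends and, as a simplification, treats every type B character as its own separator, so a CR LF pair produces a paragraph that contains only the LF.

```python
import unicodedata

def split_paragraphs(text):
    """Split text after every character whose bidirectional type is B."""
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:  # trailing text without a final separator
        paragraphs.append("".join(current))
    return paragraphs
```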

2. Initialization

Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
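
The type lookup can be tried out with Python's standard `unicodedata` module, which exposes the Bidi_Class values of the Unicode Character Database version bundled with the interpreter. The function name `bidi_types` is introduced here for illustration.

```python
import unicodedata

def bidi_types(text):
    """Return the bidirectional character type of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, European digit, space, Hebrew letter, Arabic letter,
# Arabic-Indic digit.
sample = "a1 \u05d0\u0627\u0661"
print(bidi_types(sample))
```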

3. Resolving Embedding Levels

A series of rules resolves the embedding level of each character:

  • P1–P3: Split the text into paragraphs (P1) and determine each paragraph's embedding level from its first strong character (P2–P3): 0 for LTR, 1 for RTL.
  • X1–X10: Assign explicit embedding levels based on directional formatting characters.
  • W1–W7: Resolve weak types (e.g., European numbers, separators).
  • N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
  • I1–I2: Resolve implicit embedding levels.
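
The first step of this phase, finding the paragraph embedding level, can be sketched as follows. The code assumes that its input is a single paragraph and that the interpreter's Unicode data includes the isolate types (Unicode 6.3 or later); the function name `paragraph_level` is hypothetical. Characters inside an isolate are skipped when searching for the first strong character.

```python
import unicodedata

def paragraph_level(paragraph):
    """Return 0 (LTR) or 1 (RTL) from the first strong character."""
    isolate_depth = 0
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in ("LRI", "RLI", "FSI"):
            isolate_depth += 1
        elif bidi_type == "PDI":
            if isolate_depth > 0:
                isolate_depth -= 1
        elif isolate_depth == 0 and bidi_type in ("L", "R", "AL"):
            return 1 if bidi_type in ("R", "AL") else 0
    return 0  # no strong character found: default to left-to-right
```

Digits and punctuation are not strong, so a paragraph such as "123 " followed by a Hebrew letter still resolves to level 1.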

The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.[1]
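
Even levels are left-to-right and odd levels are right-to-left, so an explicit right-to-left initiator moves to the next higher odd level and a left-to-right initiator to the next higher even level. The arithmetic, together with the depth limit, can be sketched as below. The names are assumptions of this example, and the sketch leaves out the stack and overflow counters that a full implementation needs.

```python
MAX_DEPTH = 125  # maximum explicit embedding depth

def next_odd_level(level):
    """Least odd level greater than `level` (right-to-left initiators)."""
    return (level + 1) | 1

def next_even_level(level):
    """Least even level greater than `level` (left-to-right initiators)."""
    return (level + 2) & ~1

def is_valid_level(level):
    """A new level is only usable while it does not exceed the limit."""
    return level <= MAX_DEPTH
```

Because 125 is odd, the deepest usable right-to-left level is 125 and the deepest usable left-to-right level is 124.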

4. Reordering

Rules L1–L4 reorder characters on each line for display:

  • L1: Resets trailing whitespace and separators to the paragraph embedding level.
  • L2: Working from the highest embedding level on the line down to the lowest odd level, reverses each contiguous sequence of characters that are at that level or higher.
  • L3: Reorders combining marks relative to their base characters.
  • L4: Applies glyph mirroring to characters with the Bidi_Mirrored property when their resolved direction is right-to-left (e.g., "(" becomes ")").
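
The reversal step (L2) is the core of reordering and can be sketched on its own. The function below takes the characters of one line together with already-resolved embedding levels; it does not resolve the levels itself and does not implement L1, L3, or L4. The name `reorder_line` is hypothetical, and uppercase Latin letters stand in for right-to-left letters to keep the example readable.

```python
def reorder_line(chars, levels):
    """Apply rule L2 to one line, given one resolved level per character."""
    chars, levels = list(chars), list(levels)
    odd_levels = [level for level in levels if level % 2]
    if not odd_levels:
        return "".join(chars)  # purely left-to-right line: nothing to do
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # Reverse the run; levels travel with their characters.
                chars[i:j] = chars[i:j][::-1]
                levels[i:j] = levels[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)

# "CAR" stands for a right-to-left word at level 1 inside level-0 text.
print(reorder_line("car means CAR.", [0] * 10 + [1] * 3 + [0]))
```

Nested levels are handled by the outer loop: a level-2 run inside level-1 text is reversed once on its own and once more as part of the enclosing run, which restores its left-to-right reading order.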

Bidirectional Character Types

Characters are classified into the following categories:

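| Category | Type | Description |
|---|---|---|
| Strong | L | Left-to-Right (e.g., Latin, Han) |
| Strong | R | Right-to-Left (e.g., Hebrew) |
| Strong | AL | Right-to-Left Arabic (e.g., Arabic, Syriac) |
| Weak | EN | European Number |
| Weak | ES | European Number Separator |
| Weak | ET | European Number Terminator |
| Weak | AN | Arabic Number |
| Weak | CS | Common Number Separator |
| Weak | NSM | Nonspacing Mark |
| Neutral | B | Paragraph Separator |
| Neutral | S | Segment Separator |
| Neutral | WS | Whitespace |
| Neutral | ON | Other Neutrals |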
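
The grouping in the table can be expressed as a lookup from type to category. The mapping below covers only the types listed in the table; any other value returned by `unicodedata.bidirectional` (for example the types of the explicit formatting characters) is reported as "Other", which is a label invented for this sketch, as are the names `BIDI_CATEGORY` and `bidi_category`.

```python
import unicodedata

BIDI_CATEGORY = {
    "L": "Strong", "R": "Strong", "AL": "Strong",
    "EN": "Weak", "ES": "Weak", "ET": "Weak",
    "AN": "Weak", "CS": "Weak", "NSM": "Weak",
    "B": "Neutral", "S": "Neutral", "WS": "Neutral", "ON": "Neutral",
}

def bidi_category(ch):
    """Return the table category of a single character."""
    return BIDI_CATEGORY.get(unicodedata.bidirectional(ch), "Other")
```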

Conformance

A conforming implementation must:

  • Display all visible characters in the order described by the UBA (UAX9-C1).
  • Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).

Higher-Level Protocols

The UBA permits six higher-level protocol overrides (HL1–HL6), including:

  • HL1: Override the paragraph embedding level.
  • HL3: Emulate explicit directional formatting characters via markup (e.g., HTML dir attribute).
  • HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
  • HL6: Apply additional glyph mirroring beyond the standard Bidi_Mirrored property.

HTML and CSS Equivalents

On web pages, Unicode directional formatting characters can be replaced by HTML5 and CSS3 markup:

| Unicode | HTML | CSS |
|---|---|---|
| RLI...PDI | dir="rtl" | direction:rtl; unicode-bidi:isolate |
| LRI...PDI | dir="ltr" | direction:ltr; unicode-bidi:isolate |
| FSI...PDI | <bdi>, dir="auto" | unicode-bidi:plaintext |
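
When HTML is generated by a program, the markup equivalents can be emitted in place of the control characters. The helpers below are a minimal sketch using only the standard `html` module; the function names `bdi` and `span_dir` are assumptions of this example, and the choice of a `span` element is illustrative.

```python
from html import escape

def bdi(text):
    """Wrap text of unknown direction, the markup counterpart of FSI...PDI."""
    return f"<bdi>{escape(text)}</bdi>"

def span_dir(text, direction):
    """Wrap text in a span with an explicit dir attribute."""
    if direction not in ("ltr", "rtl", "auto"):
        raise ValueError(f"unsupported direction: {direction!r}")
    return f'<span dir="{direction}">{escape(text)}</span>'

print(bdi("\u05d3\u05e0\u05d9 <admin>"))
```

Escaping the text before wrapping it keeps user-supplied content from being interpreted as markup.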

Security Considerations

The misuse of bidirectional formatting characters poses significant security risks, as they can be used to make malicious code or text appear benign. This is documented in Unicode Technical Report #36 (UTR36). Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
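
One simple defensive measure is to scan text, such as source code under review, for the explicit formatting characters listed earlier in this article. The sketch below reports their positions and names; the names `BIDI_CONTROLS` and `find_bidi_controls` are assumptions of this example, and flagging a character does not by itself mean the text is malicious.

```python
import unicodedata

# U+202A..U+202E (embeddings, PDF, overrides) and U+2066..U+2069 (isolates, PDI)
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}

def find_bidi_controls(text):
    """Return (index, character name) for each explicit bidi control."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# A hidden RLO inside what looks like an ordinary file name.
print(find_bidi_controls("invoice\u202egpj.exe"))
```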

History

  • Unicode 1.0 (1991): Basic bidirectional support introduced.
  • Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
  • Unicode 17.0 (2025): Current version (Revision 51).

References

  1. "Unicode Standard Annex #9: Unicode Bidirectional Algorithm". Unicode Consortium. Retrieved 2025-08-13.