Text normalization

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Nohat (talk | contribs) at 06:15, 2 December 2004 (start page). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Text normalization is a process by which text is transformed to make it consistent in a way that it may not have been before. It is often performed before a text is processed further, for example before generating synthesized speech, performing automated language translation, or storing the text in a database.

Examples of text normalization:

  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
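
The examples listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a definitive implementation: the function names and the small `ABBREVIATIONS` table are assumptions made for this example, and real abbreviation expansion usually needs context (for instance, "St." may mean "street" or "saint").

```python
import string
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace whitespace-separated tokens found in the table (case-insensitive)."""
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())


def strip_diacritics(text):
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize(text):
    """Apply the steps in order; abbreviations are expanded first,
    because their trailing periods would otherwise be removed."""
    text = expand_abbreviations(text)
    text = strip_diacritics(text)
    text = remove_punctuation(text)
    return text.lower()


print(normalize("Dr. Müller lives on Café St."))
```

The order of the steps matters in this sketch: expanding abbreviations after punctuation removal would miss entries such as "Dr." whose key includes the period.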