Truecasing

From Wikipedia, the free encyclopedia

Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The problem commonly arises because of the standard practice (in English and many other languages) of capitalizing the first word of a sentence, which hides whether that word would be capitalized elsewhere. It also arises in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages). Truecasing aids many other NLP tasks, such as named entity recognition, machine translation and Automatic Content Extraction.[1]
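
As an illustration, one simple baseline picks, for each word, the capitalization seen most often in correctly cased training text. The sketch below assumes whitespace-separated tokens with no punctuation, and it is not the method of the cited paper; the names `train_truecaser` and `truecase` are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the surface forms seen for each lowercased token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its capital letter reflects its position
        # in the sentence, not necessarily the word's own casing.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Restore the most frequent casing of each token in `text`."""
    restored = []
    for i, token in enumerate(text.split()):
        forms = counts.get(token.lower())
        # Assumption: unknown words fall back to lowercase.
        best = forms.most_common(1)[0][0] if forms else token.lower()
        if i == 0:
            # Re-apply the sentence-initial capital.
            best = best[:1].upper() + best[1:]
        restored.append(best)
    return " ".join(restored)


corpus = [
    "Yesterday we flew to Paris",
    "She works at IBM in Paris",
]
model = train_truecaser(corpus)
print(truecase("WE FLEW TO PARIS", model))    # We flew to Paris
print(truecase("ibm works in paris", model))  # IBM works in Paris
```

A baseline of this kind ignores context, so it cannot separate words whose casing depends on meaning (for example, "may" the verb and "May" the month).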

Truecasing is unnecessary in languages whose scripts do not distinguish between uppercase and lowercase letters. This includes most languages not written in the Latin, Greek, Cyrillic or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic and Hindi.
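
Whether a piece of text contains cased letters at all can be checked with Unicode general categories. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `has_letter_case` is hypothetical, and the check looks only at the characters present, not at the language.

```python
import unicodedata

# Unicode general categories for uppercase, lowercase and titlecase letters.
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def has_letter_case(text):
    """Return True if any character in `text` is a cased letter."""
    return any(unicodedata.category(ch) in CASED_CATEGORIES for ch in text)


print(has_letter_case("Αθήνα"))   # True  (Greek)
print(has_letter_case("日本語"))  # False (Japanese)
```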

References

  1. ^ Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.