Truecasing
Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The need commonly arises from the standard practice (in English and many other languages) of capitalizing the first word of a sentence, which hides whether that word would be capitalized in another position. It also arises in badly cased or uncased text (for example, all-lowercase or all-uppercase text messages).
Truecasing is unnecessary in languages whose scripts do not distinguish between uppercase and lowercase letters. This includes most languages not written in the Latin, Greek, Cyrillic or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.
Techniques
- Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
- Part-of-speech tagging can be used to identify proper nouns, which must be capitalized. In some cases, the same word can serve as different parts of speech and is capitalized differently in each. For example, Xerox the company, a proper noun, is capitalized, but the verb in "to xerox a document" is not. The common noun in "a xerox" (a copy of a document) can be recognized by the preceding determiner, which proper nouns do not normally take.
- Named entity recognition can be used to identify proper nouns, which must be capitalized.
- A spell checker can be used to identify words that are always capitalized.
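As a rough illustration of how the techniques above might be combined, the following Python sketch lowercases the input, capitalizes the first word of each sentence, restores words from a small always-capitalized list (standing in for a spell checker's dictionary), and uses the preceding word as a crude substitute for part-of-speech tagging on the Xerox example. The function name `truecase` and all word lists are hypothetical and chosen only for this example; a practical system would rely on trained taggers and far larger lexicons.

```python
import re

# Hypothetical stand-in for a spell checker's list of always-capitalized words.
ALWAYS_CAPITALIZED = {"japan": "Japan", "sapporo": "Sapporo", "i": "I"}
# Illustrative words that are capitalized only when used as proper nouns.
AMBIGUOUS = {"xerox": "Xerox"}
# A preceding "to" suggests a verb; a preceding determiner suggests a common noun.
NON_PROPER_CUES = {"to", "a", "an", "the", "this", "that"}


def truecase(text: str) -> str:
    """Restore capitalization with simple rules (a sketch, not a full truecaser)."""
    # Split into words, single punctuation marks, and whitespace runs.
    tokens = re.findall(r"\w+|[^\w\s]|\s+", text.lower())
    out = []
    sentence_start = True
    prev_word = None
    for tok in tokens:
        if tok.isspace():
            out.append(tok)
            continue
        if not tok[0].isalnum():
            out.append(tok)
            # Naive sentence segmentation: ., ! and ? end a sentence.
            if tok in ".!?":
                sentence_start = True
            continue
        if tok in ALWAYS_CAPITALIZED:
            word = ALWAYS_CAPITALIZED[tok]
        elif tok in AMBIGUOUS and prev_word not in NON_PROPER_CUES:
            word = AMBIGUOUS[tok]
        else:
            word = tok
        if sentence_start:
            word = word[0].upper() + word[1:]
            sentence_start = False
        out.append(word)
        prev_word = tok
    return "".join(out)


print(truecase("i want to xerox a document. the xerox was blurry. "
               "she works at xerox in japan."))
```

The previous-word heuristic is deliberately simplistic: it would wrongly lowercase the company name in a phrase such as "a letter to Xerox", which is the kind of ambiguity a real part-of-speech tagger or named entity recognizer is meant to resolve.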
Applications
Truecasing aids in many other NLP tasks, such as named entity recognition, machine translation and Automatic Content Extraction.[1]
References
- Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.