
Language model

From Wikipedia, the free encyclopedia

Statistical language models are probability distributions over sequences of words, P(w1..n). Language modeling has been used in many natural language processing (NLP) applications such as part-of-speech tagging, parsing, speech recognition and information retrieval. Estimating the probability of full sequences directly is impractical, since phrases and sentences can be arbitrarily long and most of them never occur in a corpus, so these models are most often approximated using smoothed N-gram models such as unigram, bigram and trigram models.
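
Concretely, an N-gram model approximates the probability of each word given its full history, P(wi | w1..i-1), by the probability given only the previous N-1 words. The following Python sketch estimates a bigram model with add-one (Laplace) smoothing from a toy two-sentence corpus; the corpus, the smoothing choice and the function names are illustrative assumptions, not part of the article.

```python
from collections import Counter

# Toy corpus with sentence-boundary markers <s> and </s> (illustrative assumption).
corpus = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the dog sat on the log </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word):
    """P(word | prev), estimated with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

def sequence_prob(words):
    """Approximate P(w1..n) as a product of smoothed bigram probabilities."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sequence_prob("the cat sat on the log".split()))
```

Higher-order models (trigram and beyond) condition on longer histories in the same way, at the cost of sparser counts, which is why smoothing is needed in the first place.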

In speech recognition, a language model is a probability distribution capturing the statistics of how word sequences are generated in a language, and is used to predict the next word in a spoken sequence.
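
For example, next-word prediction with a bigram model amounts to choosing the word that maximizes P(word | previous word); the counts in the sketch below are made-up toy values for illustration, not taken from any real recognizer.

```python
from collections import Counter

# Made-up bigram counts (illustrative assumption); a real recognizer would
# estimate these from a large text corpus.
bigram_counts = Counter({
    ("recognize", "speech"): 9,
    ("recognize", "beach"): 1,
})

def predict_next(prev):
    """Return the word maximizing P(word | prev) under the toy counts."""
    candidates = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    return max(candidates, key=candidates.get)

print(predict_next("recognize"))  # -> "speech"
```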

When used in information retrieval, a language model Md is associated with each document d in a collection. Given a query Q, retrieved documents are ranked by the probability that the document's language model would generate the terms of the query, P(Q|Md).
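
A minimal sketch of this query-likelihood ranking follows; the two-document collection, the unigram document models and the Jelinek-Mercer smoothing against the collection model are assumptions chosen for illustration, not prescribed by the article.

```python
import math
from collections import Counter

# Toy document collection (illustrative assumption).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "dogs chase cats in the park".split(),
}
collection = [w for words in docs.values() for w in words]
collection_counts = Counter(collection)
collection_len = len(collection)

def query_log_likelihood(query, doc_words, lam=0.5):
    """log P(Q | Md) under a unigram document model, smoothed against the
    collection model with Jelinek-Mercer interpolation (an assumption).
    Assumes every query term occurs somewhere in the collection."""
    counts = Counter(doc_words)
    n = len(doc_words)
    score = 0.0
    for q in query:
        p_doc = counts[q] / n
        p_coll = collection_counts[q] / collection_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

query = "cat mat".split()
ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d]), reverse=True)
print(ranking)  # documents ranked by query likelihood, e.g. ['d1', 'd2']
```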
