Word n-gram language model


A word n-gram language model is a purely statistical model of language, as opposed to the more recent feedforward neural-network-based models and the latest deep-learning-based large language models.[1] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[2] Special tokens, <s> and </s>, are introduced to denote the start and end of a sentence.
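
As a concrete illustration of the fixed-size context window, the following sketch (in Python, with purely illustrative names such as extract_ngrams) enumerates the bigrams and trigrams of a short sentence, padding it with the start and end markers <s> and </s>; it is a minimal example under these assumptions, not a reference implementation.

    # Minimal sketch of n-gram extraction with sentence boundary markers.
    # All names here are illustrative, not taken from any particular library.
    def extract_ngrams(tokens, n):
        """Return the n-grams of `tokens`, padded with <s> and </s>."""
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    sentence = "I saw the red house".split()
    print(extract_ngrams(sentence, 2))  # bigrams: (<s>, I), (I, saw), ...
    print(extract_ngrams(sentence, 3))  # trigrams: (<s>, <s>, I), (<s>, I, saw), ...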

To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in the corpus, so that some probability mass is reserved for unseen events. Various smoothing methods are used to calculate it, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
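
A minimal sketch of "add-one" (Laplace) smoothing for word probabilities, assuming a chosen vocabulary size V; the toy corpus and all names are illustrative only.

    # Sketch of add-one (Laplace) smoothing: (count + 1) / (total + V),
    # so unseen words still receive a small non-zero probability.
    from collections import Counter

    def laplace_probability(word, counts, vocab_size):
        total = sum(counts.values())
        return (counts.get(word, 0) + 1) / (total + vocab_size)

    counts = Counter("we share a world we share".split())
    V = 10_000  # assumed vocabulary size, including words never seen in the corpus
    print(laplace_probability("share", counts, V))  # seen word
    print(laplace_probability("likes", counts, V))  # unseen word, still > 0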

Unigram model

A special case, with n = 1 (a context window of zero previous words), is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence, and each word's probability in the sequence is simply its probability in the document as a whole.

The model consists of units, each treated as a one-state finite automaton.[3] The words and their probabilities in a document can be illustrated as follows.

Word     Probability in document
a        0.1
world    0.2
likes    0.05
we       0.05
share    0.3
...      ...

The total probability mass distributed across the document's vocabulary is 1.

The probability generated for a specific query is calculated as

$P(\text{query}) = \prod_{\text{word} \in \text{query}} P(\text{word})$

Different documents have different unigram models, with different probabilities for the words in them. The probability distributions of different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. An example of the unigram models of two documents:

Word     Probability in Doc1    Probability in Doc2
a        0.1                    0.3
world    0.2                    0.1
likes    0.05                   0.03
we       0.05                   0.02
share    0.3                    0.2
...      ...                    ...
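
A minimal sketch of ranking the two documents for a query, assuming their unigram models are given as word-to-probability tables like the ones above; the numbers and names are illustrative only.

    # Sketch: ranking documents for a query under their unigram models.
    doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
    doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}

    def query_probability(query_words, unigram_model, unseen_prob=1e-6):
        """P(query | document) as the product of the document's word probabilities.
        Words absent from the model get a small assumed probability instead of 0."""
        p = 1.0
        for word in query_words:
            p *= unigram_model.get(word, unseen_prob)
        return p

    query = "we share a world".split()
    scores = {"Doc1": query_probability(query, doc1),
              "Doc2": query_probability(query, doc2)}
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, score)  # higher query probability = higher rank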

Bigram model

In a bigram word (n = 2) language model, the probability of the sentence I saw the red house is approximated as

$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s\rangle)\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})\, P(\langle/s\rangle \mid \text{house})$

Trigram model

In a trigram (n = 3) language model, the approximation is

$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s\rangle, \langle s\rangle)\, P(\text{saw} \mid \langle s\rangle, \text{I})\, P(\text{the} \mid \text{I}, \text{saw})\, P(\text{red} \mid \text{saw}, \text{the})\, P(\text{house} \mid \text{the}, \text{red})\, P(\langle/s\rangle \mid \text{red}, \text{house})$

Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
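
A minimal sketch of how such a sentence probability would be computed under a bigram model with the <s> and </s> markers; the probability table below is made up purely for illustration.

    # Sketch: sentence probability under a bigram model with boundary markers.
    bigram_prob = {
        ("<s>", "I"): 0.2, ("I", "saw"): 0.1, ("saw", "the"): 0.3,
        ("the", "red"): 0.05, ("red", "house"): 0.4, ("house", "</s>"): 0.6,
    }

    def sentence_probability(words, bigram_prob):
        """Multiply P(w_i | w_{i-1}) over the sentence, padded with <s> and </s>."""
        padded = ["<s>"] + words + ["</s>"]
        p = 1.0
        for prev, cur in zip(padded, padded[1:]):
            # Unseen bigrams get probability 0 here; a real model would smooth them.
            p *= bigram_prob.get((prev, cur), 0.0)
        return p

    print(sentence_probability("I saw the red house".split(), bigram_prob))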

Approximation method

The approximation method calculates the probability $P(w_1, \ldots, w_m)$ of observing the sentence $w_1, \ldots, w_m$ as

$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$

It is assumed that the probability of observing the i-th word $w_i$ in the context window consisting of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context window consisting of the preceding n − 1 words (nth-order Markov property). To clarify, for n = 3 and i = 2 we have $P(w_2 \mid w_1)$.

The conditional probability can be calculated from n-gram model frequency counts:

$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$
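
A minimal sketch of this relative-frequency (maximum-likelihood) estimate for a trigram model, without the smoothing discussed above; the toy corpus and all names are illustrative only.

    # Sketch: P(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1}).
    from collections import Counter

    corpus = ["I saw the red house".split(), "I saw the dog".split()]
    n = 3
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1

    def conditional_probability(word, context):
        """Relative-frequency estimate of P(word | context); 0 if the context is unseen."""
        if context_counts[context] == 0:
            return 0.0
        return ngram_counts[context + (word,)] / context_counts[context]

    print(conditional_probability("the", ("I", "saw")))    # 2/2 = 1.0
    print(conditional_probability("red", ("saw", "the")))  # 1/2 = 0.5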

References

  1. ^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (March 1, 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
  2. ^ Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd ed. draft). Retrieved 24 May 2022.
  3. ^ Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). An Introduction to Information Retrieval. Cambridge University Press. pp. 237–240.