
Word n-gram language model

A word n-gram model was a language model that was used until 2003, when it was superseded by a feedforward neural network (with a single hidden layer and a context length of several words, trained on up to 14 million words with a CPU cluster) developed by Yoshua Bengio with co-authors.[1] It has since been superseded by deep learning-based large language models. The word n-gram model was based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words.

The probabilities were not simply frequency counts, because otherwise the model could not assign any portion of the total probability mass to words not contained in the training dataset. Various smoothing methods were used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
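A minimal sketch of the simplest of these schemes, add-one (Laplace) smoothing, for a bigram model is given below. The toy corpus and the helper name bigram_prob_add_one are illustrative assumptions rather than any particular system; production models of the era relied on the more sophisticated methods mentioned above.

    from collections import Counter

    # Toy corpus; real models were trained on millions of words.
    corpus = "the cat sat on the mat the cat ate".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    vocab_size = len(unigram_counts)

    def bigram_prob_add_one(prev_word, word):
        """Add-one (Laplace) smoothed estimate of P(word | prev_word)."""
        # Every possible bigram gets an extra count of 1, so unseen bigrams
        # receive a small, non-zero share of the probability mass.
        return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

    print(bigram_prob_add_one("the", "cat"))   # seen bigram: relatively high probability
    print(bigram_prob_add_one("the", "dog"))   # unseen bigram: small but non-zero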

If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[2] For example, a bigram language model would assign the probabilities of words in the sentence I saw the red house as

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red}) \, P(\text{</s>} \mid \text{house})

where <s> and </s> are special tokens denoting the start and end of a sentence.

Method

The approximation used in the model is that the probability P(w_1, \ldots, w_m) of observing the sentence w_1, \ldots, w_m is

    P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})

It is assumed that the probability of observing the ith word w_i in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (an (n − 1)th-order Markov property). To clarify, for n = 3 and i = 2 we have P(w_2 \mid w_1).

The conditional probability can be calculated from n-gram model frequency counts:

    P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}
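As an illustration, this maximum-likelihood estimate can be computed directly from n-gram counts. The sketch below uses hypothetical helper names and applies no smoothing; it simply mirrors the count ratio above on a small token sequence.

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all n-grams (as tuples) in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def conditional_prob(tokens, context, word):
        """Unsmoothed estimate P(word | context) = count(context + word) / count(context)."""
        n = len(context) + 1
        numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
        denominator = ngram_counts(tokens, n - 1)[tuple(context)]
        return numerator / denominator if denominator else 0.0

    tokens = "I saw the red house and I saw the dog".split()
    print(conditional_prob(tokens, ("saw", "the"), "red"))  # trigram estimate: 1/2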

Example

In a bigram word (n = 2) language model, the probability of the sentence I saw the red house is approximated as

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red}) \, P(\text{</s>} \mid \text{house})

whereas in a trigram (n = 3) language model, the approximation is

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}, \text{<s>}) \, P(\text{saw} \mid \text{<s>}, \text{I}) \, P(\text{the} \mid \text{I}, \text{saw}) \, P(\text{red} \mid \text{saw}, \text{the}) \, P(\text{house} \mid \text{the}, \text{red}) \, P(\text{</s>} \mid \text{red}, \text{house})

Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
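The padding and end-of-sentence handling described above can be made concrete with a short sketch. The function below is an illustrative assumption: it takes any conditional probability function prob(context, word) and approximates the sentence probability (in log space) after padding with n − 1 start markers and one end marker, so a complete sentence is not automatically outscored by its own prefixes.

    import math

    def sentence_logprob(sentence, n, prob):
        """Approximate log P(sentence) under an n-gram model.

        `prob(context, word)` is assumed to return P(word | context); the
        sentence is padded with n - 1 start markers <s> and one end marker </s>.
        """
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        logp = 0.0
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            logp += math.log(prob(context, tokens[i]))
        return logp

    # Toy usage with a hypothetical uniform model over a 10-word vocabulary.
    uniform = lambda context, word: 0.1
    print(sentence_logprob("I saw the red house", n=3, prob=uniform))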

Unigram model

In a special case, the unigram model (n = 1), the model can be treated as the combination of several one-state finite automata.[3] It assumes that the probabilities of tokens in a sequence are independent, e.g.:

    P(t_1 t_2 t_3) = P(t_1) \, P(t_2) \, P(t_3)

In this model, the probability of each word only depends on that word's own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1. The following is an illustration of a unigram model of a document.

Terms Probability in doc
a 0.1
world 0.2
likes 0.05
we 0.05
share 0.3
... ...

The probability generated for a specific query is calculated as

    P(\text{query}) = \prod_{t \in \text{query}} P(t)

Different documents have different unigram models, with different hit probabilities for the words in them. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. An example of the unigram models of two documents:

Terms Probability in Doc1 Probability in Doc2
a 0.1 0.3
world 0.2 0.1
likes 0.05 0.03
we 0.05 0.02
share 0.3 0.2
... ... ...
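A sketch of how such unigram models could be used to rank the two documents for a query is shown below; the probability tables are copied from the illustration above, and the small floor value for unseen terms is a stand-in for the smoothing a real retrieval system would apply.

    import math

    # Unigram models taken from the illustrative table above (remaining terms omitted).
    doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
    doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}

    def query_logprob(query, model, unseen=1e-6):
        """Log-probability of a query under a unigram document model."""
        # Terms missing from the model get a small floor probability `unseen`.
        return sum(math.log(model.get(term, unseen)) for term in query.split())

    query = "we share a world"
    scores = {name: query_logprob(query, model) for name, model in [("Doc1", doc1), ("Doc2", doc2)]}
    # Rank documents by the likelihood they assign to the query (highest first).
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))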

References

  1. ^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (March 1, 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
  2. ^ Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
  3. ^ Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). An Introduction to Information Retrieval. Cambridge University Press. pp. 237–240.