
Word n-gram language model

A word n-gram model was a language model that was used until 2003, when it was superseded by a feedforward neural network (with a single hidden layer and a context length of several words, trained on up to 14 million words with a CPU cluster) developed by Yoshua Bengio with co-authors.[1] It has since been superseded by deep learning-based large language models. The word n-gram model was based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words.

The probabilities were not simply frequency counts, because otherwise the model could not assign any portion of the total probability mass to words not contained in the training dataset. Various smoothing methods were used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
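A minimal sketch of the simplest of these schemes, add-one (Laplace) smoothing, for a bigram model is given below. The toy corpus and the helper name bigram_prob_add_one are illustrative assumptions rather than any particular system; production models of the era relied on the more sophisticated methods mentioned above.

    from collections import Counter

    # Toy corpus; real models were trained on millions of words.
    corpus = "the cat sat on the mat the cat ate".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    vocab_size = len(unigram_counts)

    def bigram_prob_add_one(prev_word, word):
        """Add-one (Laplace) smoothed estimate of P(word | prev_word)."""
        # Every possible bigram gets an extra count of 1, so unseen bigrams
        # receive a small, non-zero share of the probability mass.
        return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

    print(bigram_prob_add_one("the", "cat"))   # seen bigram: relatively high probability
    print(bigram_prob_add_one("the", "dog"))   # unseen bigram: small but non-zero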

If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[2] For example, a bigram language model would assign the probabilities of words in the sentence I saw the red house as

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red}) \, P(\text{</s>} \mid \text{house})

where <s> and </s> are special tokens denoting the start and end of a sentence.

Method

The approximation used in the model is that the probability P(w_1, \ldots, w_m) of observing the sentence w_1, \ldots, w_m is

    P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})

It is assumed that the probability of observing the ith word w_i in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (an (n − 1)th-order Markov property). To clarify, for n = 3 and i = 2 we have P(w_2 \mid w_1).

The conditional probability can be calculated from n-gram model frequency counts:

    P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}
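As an illustration, this maximum-likelihood estimate can be computed directly from n-gram counts. The sketch below uses hypothetical helper names and applies no smoothing; it simply mirrors the count ratio above on a small token sequence.

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all n-grams (as tuples) in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def conditional_prob(tokens, context, word):
        """Unsmoothed estimate P(word | context) = count(context + word) / count(context)."""
        n = len(context) + 1
        numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
        denominator = ngram_counts(tokens, n - 1)[tuple(context)]
        return numerator / denominator if denominator else 0.0

    tokens = "I saw the red house and I saw the dog".split()
    print(conditional_prob(tokens, ("saw", "the"), "red"))  # trigram estimate: 1/2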

Example

In a bigram word (n = 2) language model, the probability of the sentence I saw the red house is approximated as

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red}) \, P(\text{</s>} \mid \text{house})

whereas in a trigram (n = 3) language model, the approximation is

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \text{<s>}, \text{<s>}) \, P(\text{saw} \mid \text{<s>}, \text{I}) \, P(\text{the} \mid \text{I}, \text{saw}) \, P(\text{red} \mid \text{saw}, \text{the}) \, P(\text{house} \mid \text{the}, \text{red}) \, P(\text{</s>} \mid \text{red}, \text{house})

Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
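The padding and end-of-sentence handling described above can be made concrete with a short sketch. The function below is an illustrative assumption: it takes any conditional probability function prob(context, word) and approximates the sentence probability (in log space) after padding with n − 1 start markers and one end marker, so a complete sentence is not automatically outscored by its own prefixes.

    import math

    def sentence_logprob(sentence, n, prob):
        """Approximate log P(sentence) under an n-gram model.

        `prob(context, word)` is assumed to return P(word | context); the
        sentence is padded with n - 1 start markers <s> and one end marker </s>.
        """
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        logp = 0.0
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            logp += math.log(prob(context, tokens[i]))
        return logp

    # Toy usage with a hypothetical uniform model over a 10-word vocabulary.
    uniform = lambda context, word: 0.1
    print(sentence_logprob("I saw the red house", n=3, prob=uniform))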

Unigram model

In a special case, the unigram model (n = 1), the model can be treated as the combination of several one-state finite automata.[3] It assumes that the probabilities of tokens in a sequence are independent, e.g.:

    P(t_1 t_2 t_3) = P(t_1) \, P(t_2) \, P(t_3)

In this model, the probability of each word only depends on that word's own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1. The following is an illustration of a unigram model of a document.

Terms Probability in doc
a 0.1
world 0.2
likes 0.05
we 0.05
share 0.3
... ...

The probability generated for a specific query is calculated as

    P(\text{query}) = \prod_{t \in \text{query}} P(t)

Different documents have different unigram models, with different hit probabilities for the words in them. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. An example of the unigram models of two documents:

Terms Probability in Doc1 Probability in Doc2
a 0.1 0.3
world 0.2 0.1
likes 0.05 0.03
we 0.05 0.02
share 0.3 0.2
... ... ...
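A sketch of how such unigram models could be used to rank the two documents for a query is shown below; the probability tables are copied from the illustration above, and the small floor value for unseen terms is a stand-in for the smoothing a real retrieval system would apply.

    import math

    # Unigram models taken from the illustrative table above (remaining terms omitted).
    doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
    doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}

    def query_logprob(query, model, unseen=1e-6):
        """Log-probability of a query under a unigram document model."""
        # Terms missing from the model get a small floor probability `unseen`.
        return sum(math.log(model.get(term, unseen)) for term in query.split())

    query = "we share a world"
    scores = {name: query_logprob(query, model) for name, model in [("Doc1", doc1), ("Doc2", doc2)]}
    # Rank documents by the likelihood they assign to the query (highest first).
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))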

References

  1. ^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (March 1, 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
  2. ^ Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
  3. ^ Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). An Introduction to Information Retrieval. Cambridge University Press. pp. 237–240.