Divergence-from-randomness model
In the field of information retrieval, divergence-from-randomness is one type of probabilistic model. It is used to measure the amount of information carried by terms in documents. The idea of the model is that the distribution of 'informative' terms in a document diverges more from a random term-distribution model than that of 'non-informative' terms. It is based on Harter's 2-Poisson indexing model.
Definition
The divergence-from-randomness model is based on the idea that the more a term's frequency within a document diverges from its frequency across the collection, the more information that term carries in the document.[1] When a term does not occur in a document, the term has approximately zero probability of being 'informative' for that document.
- M represents the type of randomness model employed to calculate the probability.
- d is the total number of words in the document.
- t is the number of occurrences of a specific word in d.
- k is a parameter defined by M.
Different urn models can be used as the model of randomness M.
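As a rough illustration of how a randomness model M assigns term weights, the sketch below uses a binomial urn model: the information content of a term is taken as −log₂ of the probability of observing its within-document frequency if tokens were drawn at random from the collection. The function names and the choice of the binomial model are illustrative assumptions, not the definition used by any particular DFR implementation.

```python
import math

def binomial_log2_pmf(tf: int, n: int, p: float) -> float:
    """log2 of the binomial probability of seeing `tf` occurrences
    in `n` trials when each trial succeeds with probability `p`."""
    log2_comb = (math.lgamma(n + 1) - math.lgamma(tf + 1)
                 - math.lgamma(n - tf + 1)) / math.log(2)
    return log2_comb + tf * math.log2(p) + (n - tf) * math.log2(1 - p)

def dfr_information(tf: int, doc_len: int,
                    term_coll_freq: int, coll_len: int) -> float:
    """-log2 P(tf | M): how surprising the observed term frequency is
    under the randomness model. Large values mark 'informative' terms,
    whose distribution diverges from randomness."""
    p = term_coll_freq / coll_len  # chance a random token is this term
    return -binomial_log2_pmf(tf, doc_len, p)
```

For example, a term occurring 10 times in a 100-word document is far more surprising (and so weighted higher) when the term is rare in the collection than when it is common.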
Probability space
Utility-theoretic indexing, developed by Cooper and Maron, is a theory of indexing based on utility theory. Index terms are assigned to documents to reflect the 'value' the documents are expected to have for users. The probability distribution assigns probabilities to all sets of terms of the vocabulary. A basic space (a) can be the set (V) of terms (t); this is the vocabulary of the document collection. Since V is the set of all mutually exclusive events, a can also be the certain event.
In information retrieval, the term experiment refers to the idea that a document can be treated as a sequence of outcomes, or simply as a sample of terms. Since the number of occurrences of a specific word in a document equals the term frequency (tf) of that term in the document, the event space can be constructed as the product of the probability spaces associated with the experiments of the sequence, one per trial. Sometimes, the document length is normalized to a standard length.
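The normalization to a standard length can be sketched as rescaling the raw term frequency as if the document had the average length. The formula below has the shape of one normalization used in DFR implementations such as Terrier (its 'Normalisation 2'); the function name and default parameter `c` are illustrative assumptions.

```python
import math

def normalise_tf(tf: float, doc_len: float,
                 avg_doc_len: float, c: float = 1.0) -> float:
    """Rescale a raw term frequency as if the document had the
    standard (average) length; c controls the normalization strength.
    Illustrative sketch of a DFR-style length normalization."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)
```

A document of exactly average length keeps its raw frequency unchanged (log₂ 2 = 1), while a longer document has its frequencies scaled down.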
If the experiments are designed so that one outcome influences subsequent outcomes, then the probability distribution on V differs at each trial. If instead each word occurrence is treated as an independent 'trial', the probability distribution over the vocabulary is the same at every trial, and all possible configurations of outcomes are considered equally probable. The sample space can then be associated with the set of possible configurations of the outcomes.
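Under the independence assumption above, the probability of a document viewed as a sequence of draws factorizes over its tokens, and every ordering with the same term frequencies is equally probable. The small sketch below, with hypothetical function names, makes both points concrete.

```python
import math
from collections import Counter

def sequence_log2_prob(doc_tokens, vocab_probs):
    """log2-probability of a document treated as independent draws
    ('trials') from a fixed distribution over the vocabulary V."""
    return sum(math.log2(vocab_probs[t]) for t in doc_tokens)

def configuration_count(doc_tokens):
    """Number of orderings (configurations) sharing the same term
    frequencies; under independence, all are equally probable."""
    counts = Counter(doc_tokens)
    total = math.factorial(len(doc_tokens))
    for k in counts.values():
        total //= math.factorial(k)
    return total
```

For instance, the sequence a, b, a under a uniform two-term vocabulary has probability (1/2)³, and three configurations (aab, aba, baa) share its term frequencies.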
References
- ^ "Divergence From Randomness (DFR) Framework". Terrier Team, University of Glasgow.
General references
- Amati, Giambattista (2003). Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness. University of Glasgow.