Second-order co-occurrence pointwise mutual information

From Wikipedia, the free encyclopedia

In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar.

For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning.

The method is called "second-order" because it doesn't look at the direct co-occurrence of the target words (which would be first-order), but at the co-occurrence of their neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI).

History

The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities.[citation needed] The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method.

Methodology

The SOC-PMI algorithm measures the similarity between two words, $w_1$ and $w_2$, in several steps.

Step 1: Score neighboring words with PMI

First, for each target word ($w_1$ and $w_2$), the algorithm identifies its "neighbor" words within a certain text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word and its neighbor is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance.

The PMI between a target word $t$ and a neighbor word $w$ is calculated as:

$$f^{pmi}(t, w) = \log_2 \frac{f^{b}(t, w) \times m}{f^{t}(t) \times f^{t}(w)}$$

where:

  • $f^{b}(t, w)$ is the number of times $t$ and $w$ appear together in the context window.
  • $f^{t}(t)$ is the total number of times $t$ appears in the corpus.
  • $f^{t}(w)$ is the total number of times $w$ appears in the corpus.
  • $m$ is the total number of tokens (words) in the corpus.
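The formula above can be computed directly from raw corpus counts. A minimal sketch in Python (the function name and the example counts are invented for illustration):

```python
import math

def pmi(f_b, f_t1, f_t2, m):
    """Pointwise mutual information from raw corpus counts.

    f_b  -- co-occurrence count of the two words within the window
    f_t1 -- total corpus frequency of the target word
    f_t2 -- total corpus frequency of the neighbor word
    m    -- total number of tokens in the corpus
    """
    return math.log2((f_b * m) / (f_t1 * f_t2))

# A pair co-occurring more often than chance predicts scores positive:
print(pmi(50, 1000, 2000, 1_000_000))  # log2(25), roughly 4.64
```

A pair that co-occurs exactly as often as chance predicts scores 0, and a rarer-than-chance pair scores negative.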

Step 2: Create a semantic 'signature' for each word

For each target word ($w_1$ and $w_2$), the algorithm creates a list of its most significant neighbors. This is done by taking the top $\beta$ neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors, $X^{w}$, acts as a semantic "signature" for the word $w$:

$$X^{w} = \{x_1, x_2, \ldots, x_\beta\}, \quad \text{where } f^{pmi}(x_1, w) \geq f^{pmi}(x_2, w) \geq \cdots \geq f^{pmi}(x_\beta, w) > 0$$

The size of this list, $\beta$, is a parameter of the method.
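The signature-building step can be sketched as follows. The data structures (a nested co-occurrence dictionary and a frequency dictionary) and all counts are illustrative assumptions, not part of the method's specification:

```python
import math

def pmi(f_b, f_t1, f_t2, m):
    # PMI from raw counts, as in Step 1
    return math.log2((f_b * m) / (f_t1 * f_t2))

def signature(word, cooc, freq, m, beta):
    """Top-`beta` neighbors of `word` with positive PMI, in descending
    PMI order.  `cooc[word]` maps neighbors to co-occurrence counts;
    `freq` maps words to corpus frequencies; `m` is the corpus size."""
    scores = {w: pmi(c, freq[word], freq[w], m)
              for w, c in cooc[word].items()}
    ranked = sorted((w for w, s in scores.items() if s > 0),
                    key=lambda w: scores[w], reverse=True)
    return ranked[:beta]

# Toy counts: "the" co-occurs often in absolute terms, but no more
# often than chance predicts, so it is filtered out of the signature.
cooc = {"cemetery": {"buried": 30, "dead": 20, "the": 40}}
freq = {"cemetery": 100, "buried": 50, "dead": 60, "the": 50_000}
print(signature("cemetery", cooc, freq, 100_000, beta=2))
# ['buried', 'dead']
```

Filtering on positive PMI means the signature keeps only neighbors that are genuinely informative about the target word, rather than merely frequent.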

Step 3: Compare the signatures

The algorithm then compares the signatures of $w_1$ and $w_2$. It looks for words that are present in both signatures. The similarity of $w_1$ to $w_2$ is calculated by summing the PMI scores of $w_2$ with every word in $w_1$'s signature list.

The $\beta$-PMI summation function defines this score. The score for $w_1$ with respect to $w_2$ is:

$$f(w_1, w_2, \beta_1) = \sum_{i=1}^{\beta_1} \left( f^{pmi}(x_i^{w_1}, w_2) \right)^{\gamma}$$

where $x_i^{w_1}$ is the $i$-th word in $w_1$'s signature $X^{w_1}$. This sum only includes terms where the PMI value is positive. The exponent $\gamma$ (with a value greater than 1) is used to give more weight to neighbors that are more strongly associated with $w_2$.

This calculation is done in both directions:

  • The similarity of $w_1$ with respect to $w_2$: $f(w_1, w_2, \beta_1)$
  • The similarity of $w_2$ with respect to $w_1$: $f(w_2, w_1, \beta_2)$
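The directional $\beta$-PMI sum might be sketched like this (the helper names, the $\gamma$ default, and all counts are illustrative assumptions):

```python
import math

def pmi(f_b, f_t1, f_t2, m):
    # PMI from raw counts, as in Step 1
    return math.log2((f_b * m) / (f_t1 * f_t2))

def beta_pmi_sum(sig, other, cooc, freq, m, gamma=3.0):
    """Sum (pmi(x, other)) ** gamma over signature words x that also
    have positive PMI with `other`; words that never co-occur with
    `other` contribute nothing."""
    total = 0.0
    for x in sig:
        c = cooc[other].get(x, 0)
        if c:
            s = pmi(c, freq[other], freq[x], m)
            if s > 0:
                total += s ** gamma
    return total

# Toy counts: "graveyard" shares the neighbors "buried" and "dead".
cooc = {"graveyard": {"buried": 25, "dead": 15}}
freq = {"graveyard": 80, "buried": 50, "dead": 60}
sim = beta_pmi_sum(["buried", "dead"], "graveyard", cooc, freq, 100_000)
```

Because $\gamma > 1$, a signature word with a high PMI toward the other target contributes disproportionately more than several weakly associated ones.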

Step 4: Calculate final similarity score

Finally, the total semantic similarity is obtained by combining the two directional scores from the previous step, each normalized by the size of the corresponding signature:

$$\mathrm{Sim}(w_1, w_2) = \frac{f(w_1, w_2, \beta_1)}{\beta_1} + \frac{f(w_2, w_1, \beta_2)}{\beta_2}$$

This score can be normalized to fall between 0 and 1. For example, using this method, the words cemetery and graveyard achieve a high similarity score of 0.986 (with specific parameter settings).
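Putting the four steps together, the whole pipeline can be sketched end to end. Everything below (the data structures, parameter defaults, and toy counts) is an invented illustration, not the parameter settings behind the published 0.986 figure:

```python
import math

def pmi(f_b, f_t1, f_t2, m):
    # Step 1: PMI from raw corpus counts
    return math.log2((f_b * m) / (f_t1 * f_t2))

def signature(word, cooc, freq, m, beta):
    # Step 2: top-beta positive-PMI neighbors, PMI descending
    scores = {w: pmi(c, freq[word], freq[w], m)
              for w, c in cooc[word].items()}
    return sorted((w for w, s in scores.items() if s > 0),
                  key=lambda w: scores[w], reverse=True)[:beta]

def beta_pmi_sum(sig, other, cooc, freq, m, gamma):
    # Step 3: sum positive PMI values with `other`, raised to gamma
    total = 0.0
    for x in sig:
        c = cooc[other].get(x, 0)
        if c:
            s = pmi(c, freq[other], freq[x], m)
            if s > 0:
                total += s ** gamma
    return total

def soc_pmi(w1, w2, cooc, freq, m, beta=2, gamma=3.0):
    # Step 4: combine both directions, normalized by signature size
    sig1 = signature(w1, cooc, freq, m, beta)
    sig2 = signature(w2, cooc, freq, m, beta)
    return (beta_pmi_sum(sig1, w2, cooc, freq, m, gamma) / max(len(sig1), 1)
            + beta_pmi_sum(sig2, w1, cooc, freq, m, gamma) / max(len(sig2), 1))

# Toy corpus statistics: the two near-synonyms share neighbors,
# while the unrelated word shares none.
cooc = {
    "cemetery":  {"buried": 30, "dead": 20},
    "graveyard": {"buried": 25, "dead": 15},
    "banana":    {"yellow": 40, "fruit": 35},
}
freq = {"cemetery": 100, "graveyard": 80, "banana": 120,
        "buried": 50, "dead": 60, "yellow": 70, "fruit": 90}
m = 100_000

print(soc_pmi("cemetery", "graveyard", cooc, freq, m))  # large positive
print(soc_pmi("cemetery", "banana", cooc, freq, m))     # 0.0
```

Note how "cemetery" and "graveyard" never co-occur with each other in the toy data, yet still receive a high score purely through their shared neighbors, while a word with disjoint neighbors scores zero.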
