Sentence extraction
Sentence extraction is a technique used for automatic summarization. In this shallow approach, statistical heuristics are used to identify the most salient sentences of a text. Sentence extraction is a low-cost approach compared to more knowledge-intensive deeper approaches, which require additional knowledge bases such as ontologies or linguistic knowledge. In short, sentence extraction works as a filter that allows only important sentences to pass.
The major downside of applying sentence-extraction techniques to the task of summarization is the loss of coherence in the resulting summary. Nevertheless, sentence extraction summaries can give valuable clues to the main points of a document and are frequently intelligible enough for human readers.
Procedure
Usually, a combination of heuristics is used to determine the most important sentences within the document. Each heuristic assigns a (positive or negative) score to each sentence, and the individual heuristics are weighted according to their importance. After all heuristics have been applied, the x highest-scoring sentences are included in the summary.
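The sketch below illustrates this weighted-scoring procedure in Python. The two heuristics shown (sentence position and overlap with title words), their weights, and the helper names are illustrative assumptions rather than a fixed specification.

```python
def position_score(index, total):
    """Heuristic: reward sentences that appear early in the document."""
    return 1.0 - index / total

def title_word_score(sentence, title_words):
    """Heuristic: reward sentences that share words with the title."""
    words = set(sentence.lower().split())
    return len(words & title_words) / (len(title_words) or 1)

def summarize(sentences, title, x=2, weights=(0.4, 0.6)):
    """Score every sentence with weighted heuristics and keep the x best."""
    title_words = set(title.lower().split())
    scored = []
    for i, sentence in enumerate(sentences):
        score = (weights[0] * position_score(i, len(sentences))
                 + weights[1] * title_word_score(sentence, title_words))
        scored.append((score, i, sentence))
    # Take the x highest-scoring sentences, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:x], key=lambda t: t[1])
    return [sentence for _, _, sentence in best]

sentences = [
    "The cat is a small domesticated carnivorous mammal.",
    "It was a sunny day.",
    "Cats are popular pets and are valued for companionship.",
]
print(summarize(sentences, "About cats", x=2))
```

In this toy run the first and third sentences are selected: one scores highly for its position, the other for sharing a title word.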
Early approaches and some sample heuristics
Seminal papers which laid the foundations for many techniques used today were published by H. P. Luhn in 1958[1] and H. P. Edmundson in 1969.[2]
Luhn proposed assigning more weight to sentences at the beginning of the document or of a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop lists to filter out uninformative words of low semantic content (e.g. most grammatical words such as "of", "the", "a"). He also distinguished between bonus words and stigma words, i.e. words that are likely to occur together with important information (e.g. the word "significant") or with unimportant information. His idea of using key words, i.e. words which occur significantly frequently in the document, is still one of the core heuristics of today's summarizers. With large linguistic corpora available today, the TF/IDF measure, which originated in information retrieval, can be applied to identify the key words of a text: if, for example, the word "cat" occurs significantly more often in the text to be summarized (TF, term frequency) than in the background corpus (IDF, inverse document frequency, where the corpus plays the role of the document collection), then "cat" is likely to be an important word of the text; the text may in fact be a text about cats.
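As a concrete illustration of this key-word heuristic, the following sketch scores words with a simple TF/IDF measure against a small background corpus. The toy corpus, the smoothing of the IDF term, and the function name are illustrative assumptions, not part of the original methods.

```python
import math
from collections import Counter

def tf_idf(text_words, corpus_docs):
    """Score words that are frequent in the text but rare in the background corpus."""
    tf = Counter(text_words)            # term frequency within the text
    n_docs = len(corpus_docs)
    scores = {}
    for word, count in tf.items():
        doc_freq = sum(1 for doc in corpus_docs if word in doc)
        # Smoothed inverse document frequency over the background corpus.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / len(text_words)) * idf
    return scores

# "cat" is frequent in the text but absent from the corpus, so it ranks highly,
# while common words like "the" are discounted.
text = "the cat sat on the mat and the cat purred".split()
corpus = [set("the dog barked at the mailman".split()),
          set("the sun is bright today".split())]
keywords = sorted(tf_idf(text, corpus).items(), key=lambda kv: kv[1], reverse=True)
print(keywords[:3])
```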
References
- ^ H. P. Luhn (1958). "The Automatic Creation of Literature Abstracts" (PDF). IBM Journal: 159–165.
- ^ H. P. Edmundson (1969). "New Methods in Automatic Extracting" (PDF). Journal of the ACM. 16 (2): 264–285. doi:10.1145/321510.321519.
External links
- iResearch Reporter - Commercial text extraction and text summarization system; the free demo site accepts a user-entered query, passes it to the Google search engine, retrieves multiple relevant documents, and produces categorized natural-language summary reports covering the retrieved set, with all extracts linked to the original documents on the Web.
- NewsFeed Researcher - Extensible commercial text extraction and text summarization system; the free demo site uses the Google search engine to produce background summary reports for current items in topical Google news feeds, selecting text extracts from multiple retrieved documents and automatically generating categorized summary reports in natural-language prose, with all extracts linked to their source documents on the Web.