User:SebastianHellmann/Knowledge extraction

Knowledge Extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to Information Extraction (NLP) and ETL (Data Warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It furthermore requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source.

The RDB2RDF W3C group is currently standardizing a language for the extraction of RDF from relational databases. Another popular example of Knowledge Extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).

Examples

Entity Linking

DBpedia Spotlight analyzes free text via Named Entity Recognition and then disambiguates the entities it finds and links them to the DBpedia knowledge repository (web demo).

President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.

Because President Obama is linked to a DBpedia Linked Data resource, further information can be retrieved automatically, and a semantic reasoner can, for example, infer that the mentioned entity is of type Person (using FOAF) and of type Presidents of the United States (using YAGO). Counterexamples are methods that only recognize entities, or that link to Wikipedia articles and other targets that provide neither structured data nor formal knowledge.
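The linking and type lookup described above can be sketched with a toy example. This is not how DBpedia Spotlight works internally: the dictionary lookup below stands in for real recognition and disambiguation, and the tables SURFACE_FORMS and TYPES, the helper names, and the YAGO class URI are illustrative assumptions rather than data retrieved from DBpedia.

```python
import re

# Toy surface-form dictionary (assumption): maps text mentions to resource URIs.
SURFACE_FORMS = {
    "President Obama": "http://dbpedia.org/resource/Barack_Obama",
    "Congress": "http://dbpedia.org/resource/United_States_Congress",
}

# Toy type assertions (assumption): what a knowledge base might state
# about a linked resource.
TYPES = {
    "http://dbpedia.org/resource/Barack_Obama": [
        "http://xmlns.com/foaf/0.1/Person",
        "http://dbpedia.org/class/yago/PresidentsOfTheUnitedStates",
    ],
}


def link_entities(text):
    """Return (offset, surface form, URI) for every dictionary hit in text."""
    links = []
    for surface, uri in SURFACE_FORMS.items():
        for match in re.finditer(re.escape(surface), text):
            links.append((match.start(), surface, uri))
    return sorted(links)


def types_of(uri):
    """Look up the types recorded for a linked resource, if any."""
    return TYPES.get(uri, [])


text = ("President Obama called Wednesday on Congress to extend "
        "a tax break for students.")

for offset, surface, uri in link_entities(text):
    print(offset, surface, uri, types_of(uri))
```

The point of the sketch is the second step: once a mention resolves to a resource that carries formal type information, that information becomes available without further text analysis. A method that stopped after recognizing the string "President Obama" would have nothing to look up.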

RDB2RDF

Triplify, D2R Server and Virtuoso RDF Views are tools that transform relational databases to RDF. They allow existing vocabularies and ontologies to be reused during the conversion. Several degrees of knowledge extraction can be distinguished:

a) use of properties with formally defined semantics. For example, a column called marriedTo in a user table can be treated as a symmetric relation. Likewise, a column emailAddress can be converted to the property foaf:mbox from the FOAF vocabulary, which qualifies it as an inverse functional property.

b) mapping instances to a pre-existing Ontology (Ontology Population)

c) creation of a domain ontology based on the database schema. If this process is (semi-)automated, this step is related to Ontology Learning.
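Degree a) can be illustrated with a minimal sketch in plain Python. It does not use the mapping language of Triplify, D2R Server or Virtuoso; the namespace http://example.org/, the sample rows, and the helper names are hypothetical. The only real vocabulary term used is foaf:mbox.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# Column-to-property mapping: emailAddress reuses an existing FOAF property,
# marriedTo gets a local property that we declare to be symmetric.
COLUMN_TO_PROPERTY = {
    "emailAddress": FOAF + "mbox",
    "marriedTo": EX + "marriedTo",
}
SYMMETRIC = {EX + "marriedTo"}


def user_uri(user_id):
    """Mint a URI for a row of the (hypothetical) user table."""
    return f"{EX}user/{user_id}"


def rows_to_triples(rows):
    """Convert table rows to (subject, predicate, object) triples."""
    triples = set()
    for row in rows:
        subject = user_uri(row["id"])
        if row.get("emailAddress"):
            triples.add((subject, COLUMN_TO_PROPERTY["emailAddress"],
                         "mailto:" + row["emailAddress"]))
        if row.get("marriedTo") is not None:
            triples.add((subject, COLUMN_TO_PROPERTY["marriedTo"],
                         user_uri(row["marriedTo"])))
    return triples


def apply_symmetry(triples):
    """Add the reverse triple for every property declared symmetric."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SYMMETRIC:
            inferred.add((o, p, s))
    return inferred


rows = [
    {"id": 1, "emailAddress": "alice@example.org", "marriedTo": 2},
    {"id": 2, "emailAddress": "bob@example.org", "marriedTo": None},
]

triples = apply_symmetry(rows_to_triples(rows))
for s, p, o in sorted(triples):
    print(f"<{s}> <{p}> <{o}> .")
```

Although only the row for user 1 records the marriage, the symmetric declaration yields the reverse statement for user 2 as well. This is the sense in which the result goes beyond a plain relational export: the formally defined semantics of the chosen properties licenses statements that are not stored in the table.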

Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in this area, especially on transforming relational databases into RDF, entity resolution and ontology learning.

Areas

Knowledge Extraction overlaps with the following areas:

Ontology Learning

Ontology Population


Concept Tagging

Transformation of databases to RDF

Example

Knowledge Tags

Type of processes

References
