User:SebastianHellmann/Knowledge extraction

Knowledge Extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to Information Extraction (NLP) and ETL (Data Warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.

The RDB2RDF W3C group^[1] is currently standardizing a language for the extraction of RDF from relational databases. Another popular example of Knowledge Extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia, Freebase).

Examples

Entity Linking

DBpedia Spotlight analyzes free text via Named Entity Recognition, then disambiguates the entities it finds and links them to the DBpedia knowledge repository.

President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.

Because President Obama is linked to a DBpedia Linked Data resource, further information can be retrieved automatically, and a semantic reasoner can, for example, infer that the mentioned entity is of type Person (using FOAF) and of type Presidents of the United States (using YAGO). Counterexamples are methods that only recognize entities, or that link to Wikipedia articles and other targets that do not provide structured data and formal knowledge.
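The linking step can be illustrated with a minimal sketch. It does not reproduce how DBpedia Spotlight works internally: it assumes a tiny hand-written gazetteer (`GAZETTEER`) and type table (`TYPES`), both hypothetical, and replaces real Named Entity Recognition and disambiguation with a plain longest-match lookup.

```python
# Toy entity linker: surface form -> DBpedia URI -> types.
# GAZETTEER and TYPES are hand-written assumptions for illustration only.

GAZETTEER = {
    "President Obama": "http://dbpedia.org/resource/Barack_Obama",
    "Obama": "http://dbpedia.org/resource/Barack_Obama",
    "Congress": "http://dbpedia.org/resource/United_States_Congress",
}

TYPES = {
    "http://dbpedia.org/resource/Barack_Obama": [
        "foaf:Person",
        "yago:PresidentsOfTheUnitedStates",  # illustrative class name
    ],
}


def link_entities(text):
    """Return (surface form, URI) pairs, preferring longer surface forms."""
    links = []
    taken = []  # character spans that are already linked
    for form in sorted(GAZETTEER, key=len, reverse=True):
        start = text.find(form)
        while start != -1:
            end = start + len(form)
            # Only link if this span does not overlap an earlier, longer match.
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                links.append((start, form, GAZETTEER[form]))
            start = text.find(form, end)
    return [(form, uri) for _, form, uri in sorted(links)]


def types_of(uri):
    """Look up the classes known for a linked resource."""
    return TYPES.get(uri, [])


sentence = "President Obama called Wednesday on Congress to extend a tax break."
for form, uri in link_entities(sentence):
    print(form, "->", uri, types_of(uri))
```

The point of the sketch is the last step: once a mention resolves to a resource with formal types, those types become available to downstream reasoning, which is what separates this from plain entity recognition.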

Relational Databases to RDF

Triplify, D2R Server and Virtuoso RDF Views are tools that transform relational databases to RDF. They allow existing vocabularies and ontologies to be reused during the conversion. When transforming a typical relational table named users, one column (e.g. name) or an aggregation of columns (e.g. first_name and last_name) has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity^[2]. Properties with formally defined semantics are then used (and reused) to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation, and a column homepage can be converted to the FOAF Vocabulary property foaf:homepage, thus qualifying it as an inverse functional property. Each entry of the user table can then be made an instance of the class foaf:Person (Ontology Population). Additionally, domain knowledge (in the form of an ontology) could be created from the status_id, either by manually created rules (if status_id is 2, the entry belongs to the class Teacher) or by (semi-)automated methods (Ontology Learning). Here is an example transformation:

Name	marriedTo	homepage	status_id
Peter	Marry	http://example.org/Peters_page	1
Claus	Eva	http://example.org/Claus_page	2

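@prefix : <http://example.org/> .   # assumed base namespace
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

:Peter :marriedTo :Marry .
:marriedTo a owl:SymmetricProperty .
:Peter foaf:homepage <http://example.org/Peters_page> .
:Peter a foaf:Person .
:Peter a :Student .
:Claus a :Teacher .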
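The transformation above can be sketched programmatically. The following is a minimal illustration using only the Python standard library, not the behaviour of Triplify, D2R Server or Virtuoso; the namespace `BASE`, the rule table `STATUS_CLASSES` and the function names are assumptions introduced for this example.

```python
# Sketch: turn rows of a "users" table into RDF-style triples.
# BASE and STATUS_CLASSES are assumptions made for this example.

BASE = "http://example.org/"  # assumed namespace for generated URIs
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SYMMETRIC = "http://www.w3.org/2002/07/owl#SymmetricProperty"

# Manually created rule: status_id -> class name
STATUS_CLASSES = {1: "Student", 2: "Teacher"}

rows = [
    {"name": "Peter", "marriedTo": "Marry",
     "homepage": "http://example.org/Peters_page", "status_id": 1},
    {"name": "Claus", "marriedTo": "Eva",
     "homepage": "http://example.org/Claus_page", "status_id": 2},
]


def row_to_triples(row):
    """Map one row to (subject, predicate, object) triples of URIs."""
    subject = BASE + row["name"]  # the name column provides the entity URI
    return [
        (subject, BASE + "marriedTo", BASE + row["marriedTo"]),
        (subject, FOAF + "homepage", row["homepage"]),
        (subject, RDF_TYPE, FOAF + "Person"),
        (subject, RDF_TYPE, BASE + STATUS_CLASSES[row["status_id"]]),
    ]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Schema-level statement first, then one group of triples per row.
triples = [(BASE + "marriedTo", RDF_TYPE, OWL_SYMMETRIC)]
for row in rows:
    triples.extend(row_to_triples(row))

print(to_ntriples(triples))
```

Unlike the hand-written listing, the sketch emits all four statements for every row, so Claus also receives marriedTo, foaf:homepage and foaf:Person triples.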

Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding transforming relational databases into RDF, Entity resolution, Knowledge Discovery and Ontology Learning. The general process uses traditional methods from Information Extraction and ETL, which transform the data from the sources into structured formats.

Several degrees of knowledge extraction can be distinguished:

a) use of properties with formally defined semantics. For example, a column in a user table called marriedTo can be considered a symmetric relation. Likewise, a column emailAddress can be converted to the FOAF Vocabulary property foaf:mbox, thus qualifying it as an inverse functional property.

b) mapping instances to a pre-existing Ontology (Ontology Population)

c) creation of a domain ontology based on the database schema. If this process is (semi-)automated, this step is related to Ontology Learning.
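What the formally defined semantics in degree a) make possible can be sketched with two hand-written inference rules. This is a simplified illustration and not an OWL reasoner; the triples, the identifier `P_Smith` and the function names are made up for the example.

```python
# Sketch of two inferences enabled by formally defined property semantics.
# The data and the helper names are made up for illustration.

SYMMETRIC = {"marriedTo"}
INVERSE_FUNCTIONAL = {"foaf:mbox"}

triples = {
    ("Peter", "marriedTo", "Marry"),
    ("Peter", "foaf:mbox", "mailto:peter@example.org"),
    ("P_Smith", "foaf:mbox", "mailto:peter@example.org"),
}


def symmetric_closure(triples):
    """For a symmetric property p, (s, p, o) implies (o, p, s)."""
    inferred = {(o, p, s) for s, p, o in triples if p in SYMMETRIC}
    return triples | inferred


def same_individuals(triples):
    """For an inverse functional property, a shared value implies identity."""
    owners = {}
    for s, p, o in triples:
        if p in INVERSE_FUNCTIONAL:
            owners.setdefault((p, o), set()).add(s)
    return [sorted(group) for group in owners.values() if len(group) > 1]


print(sorted(symmetric_closure(triples) - triples))
print(same_individuals(triples))
```

The first rule adds the reverse marriedTo statement; the second concludes that two identifiers sharing a mailbox denote the same individual. Neither conclusion is available from a plain relational column.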

This complex topic can be categorized according to 1) what kind of data is searched; and 2) in what form the result of the search is represented.

Conversion of relational databases to RDF

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data^[3]. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain and is closely related to it in terms of both methodology and terminology^[4].

The most well-known branch of data mining is knowledge discovery, also known as Knowledge Discovery in Databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further analysis and discovery.

Another promising application of knowledge discovery is in the area of software modernization, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity-relationship model is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery on existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous business value that is key to the evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as database schemas.
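Mining metadata rather than data can be sketched as follows. The example reads only the catalog of an in-memory SQLite database and derives a simple entity-relationship view; it is not an implementation of KDM, and the schema and the function name `extract_model` are assumptions made for illustration.

```python
# Sketch: recover an entity-relationship view from database metadata.
# The schema is a made-up example; only the catalog is read, never the rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        status_id INTEGER REFERENCES status(id)
    );
""")


def extract_model(conn):
    """Return ({table: [columns]}, [(from_table, column, to_table)])."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    entities, relationships = {}, []
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        entities[table] = [col[1] for col in
                           conn.execute(f"PRAGMA table_info({table})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            relationships.append((table, fk[3], fk[2]))
    return entities, relationships


entities, relationships = extract_model(conn)
print(entities)
print(relationships)
```

The resulting model can be queried without touching the stored rows, which is the sense in which software mining works on metadata.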

Ontology Learning

References

[1] RDB2RDF Working Group. Website: http://www.w3.org/2001/sw/rdb2rdf/ ; charter: http://www.w3.org/2009/08/rdb2rdf-charter ; R2RML: RDB to RDF Mapping Language: http://www.w3.org/TR/r2rml/
[2] Tim Berners-Lee. Relational Databases on the Semantic Web. Created: September 1998. http://www.w3.org/DesignIssues/RDB-RDF.html
[3] Frawley, William F. et al. (1992), "Knowledge Discovery in Databases: An Overview", AI Magazine (Vol 13, No 3), 57-70 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1011)
[4] Fayyad, U. et al. (1996), "From Data Mining to Knowledge Discovery in Databases", AI Magazine (Vol 17, No 3), 37-54 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1230)
