Linguistic Linked Open Data

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, and has since become a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.

Linguistic Linked Open Data

Linguistic Linked Open Data publishes data for linguistics and natural language processing using the following principles:^[1]

Data should be openly licensed using licenses such as the Creative Commons licenses.
The elements in a dataset should be uniquely identified by means of a URI.
The URI should resolve, so users can access more information using web browsers.
Resolving an LLOD resource should return results using web standards such as the Resource Description Framework (RDF).
Links to other resources should be included to help users discover new resources and provide semantics.
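
The principles above can be illustrated with a small sketch. It describes a single lexical entry as RDF triples and serializes them as N-Triples using only the Python standard library. The `example.org` URIs, the helper names (`uri`, `literal`, `to_ntriples`), and the chosen license are assumptions made for illustration; only the RDF, RDFS, Dublin Core, OntoLex, and lexvo.org identifiers refer to existing vocabularies. The sketch does not dereference any URI, so it does not demonstrate the principle that URIs should resolve.

```python
# Minimal N-Triples sketch using only the standard library.
# Assumption: every example.org URI below is a placeholder, not a real dataset.
BASE = "http://example.org/lexicon/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_LANGUAGE = "http://purl.org/dc/terms/language"
DCT_LICENSE = "http://purl.org/dc/terms/license"
ONTOLEX_ENTRY = "http://www.w3.org/ns/lemon/ontolex#LexicalEntry"


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str, lang: str = "") -> str:
    """Quote a string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


entry = uri(BASE + "entry/cat")  # hypothetical, uniquely identified entry

TRIPLES = [
    (entry, uri(RDF_TYPE), uri(ONTOLEX_ENTRY)),
    (entry, uri(RDFS_LABEL), literal("cat", "en")),
    # Link to an external resource: the lexvo.org identifier for English.
    (entry, uri(DCT_LANGUAGE), uri("http://lexvo.org/id/iso639-3/eng")),
    # Open license attached to the (hypothetical) dataset as a whole.
    (uri(BASE), uri(DCT_LICENSE), uri("https://creativecommons.org/licenses/by/4.0/")),
]

if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```

In practice an RDF library would typically handle serialization and escaping; plain string formatting is used here only to keep the sketch free of dependencies.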

The primary benefits of LLOD have been identified as:^[2]

Representation: Linked graphs are a more flexible representation format for linguistic data.
Interoperability: Common RDF models can easily be integrated.
Federation: Data from multiple sources can trivially be combined.
Ecosystem: Tools for RDF and linked data are widely available under open source licenses.
Expressivity: Existing vocabularies help express linguistic resources.
Semantics: Common links express what you mean.
Dynamicity: Web data can be continuously improved.
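
The federation benefit listed above can be sketched as follows, under the simplifying assumption that an RDF graph is modeled as a plain Python set of (subject, predicate, object) tuples. Two toy graphs from different hypothetical sources are combined by set union, and a query then follows a link from one source into the other. All `example.org` URIs, the `denotes` predicate, and the data are invented for illustration.

```python
# Toy illustration of federation: graphs are sets of triples, so merging is set union.
# Assumption: all example.org URIs and the data itself are invented placeholders.
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
DENOTES = "http://example.org/vocab/denotes"  # hypothetical linking property

CAT_ENTRY = "http://example.org/lexicon/cat"
CAT_CONCEPT = "http://example.org/concepts/Cat"

# Source 1: a lexicon that links a word to a concept URI.
LEXICON = {
    (CAT_ENTRY, LABEL, "cat"),
    (CAT_ENTRY, DENOTES, CAT_CONCEPT),
}

# Source 2: an encyclopedic dataset describing the same concept URI.
ENCYCLOPEDIA = {
    (CAT_CONCEPT, COMMENT, "A small domesticated carnivore."),
}


def merge(*graphs):
    """Combine any number of triple sets into one graph."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """Return all objects of matching triples, sorted for stable output."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def describe_word(graph, word_uri):
    """Follow the entry-to-concept link and collect comments on the concept."""
    comments = []
    for concept in objects(graph, word_uri, DENOTES):
        comments.extend(objects(graph, concept, COMMENT))
    return comments


if __name__ == "__main__":
    print(describe_word(merge(LEXICON, ENCYCLOPEDIA), CAT_ENTRY))
```

The query only succeeds on the merged graph, because the shared concept URI is what connects the two sources.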

Uses of LLOD

Linguistic Linked Open Data has been applied to a number of scientific research problems:

In all areas of empirical linguistics, computational philology and natural language processing, linguistic annotation and linguistic markup represent central elements of analysis. However, progress in this field is being hampered by interoperability challenges, most notably differences in the vocabularies and annotation schemes used for different resources and tools. Using Linked Data to connect language resources with ontologies and terminology repositories facilitates re-using shared vocabularies and interpreting them against a common basis. Without such links, many users would not be aware of these differences.
In corpus linguistics and computational philology, overlapping markup represents a notorious problem to conventional XML formats. Hence, graph-based data models have been suggested since the late 1990s.^[3] These are traditionally represented by means of multiple, interlinked XML files (standoff XML),^[4] which are poorly supported by off-the-shelf XML technology.^[5] Modeling such complex annotations as Linked Data represents a formalism semantically equivalent to standoff XML,^[6] but eliminates the need for special-purpose technology, and, instead, relies on the existing RDF ecosystem.
Multilingual issues, including the linking of lexical resources such as WordNet, as performed in the Interlingual Index of the Global WordNet Association, and the interconnection of heterogeneous resources such as WordNet and Wikipedia, as was done in BabelNet.
Providing forums for the standardization of linguistic resource information. Such forums are also very common in Asia and Europe.
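
The overlapping-markup problem mentioned above can be made concrete with a small sketch. Two annotations cover token ranges that intersect without one containing the other, so they cannot both be expressed as properly nested XML elements. As a graph, each annotation is simply a node that points to the tokens it covers. The token text, the `example.org` URIs, and the `covers` property are assumptions for illustration and are not taken from any particular annotation vocabulary.

```python
# Sketch: overlapping annotations as a graph of (subject, predicate, object) triples.
# Assumption: all URIs and property names are hypothetical placeholders.
TOKENS = ["To", "be", "or", "not", "to", "be"]

TOKEN_BASE = "http://example.org/doc/token/"
VOCAB = "http://example.org/vocab/"
TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
COVERS = VOCAB + "covers"

# Each annotation node has a label and a token range (start inclusive, end exclusive).
ANNOTATIONS = {
    "http://example.org/anno/line1": ("line", 0, 4),
    "http://example.org/anno/clause1": ("clause", 2, 6),
}


def to_triples(annotations):
    """Turn each annotation into a typed node linked to every token it covers."""
    triples = []
    for node, (label, start, end) in annotations.items():
        triples.append((node, TYPE, VOCAB + label))
        for index in range(start, end):
            triples.append((node, COVERS, f"{TOKEN_BASE}{index}"))
    return triples


def overlaps_without_nesting(a, b):
    """True if two spans intersect but neither contains the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    intersect = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return intersect and not nested


def shared_tokens(triples, node_a, node_b):
    """Tokens covered by both annotation nodes, found by comparing graph edges."""
    covered_a = {o for s, p, o in triples if s == node_a and p == COVERS}
    covered_b = {o for s, p, o in triples if s == node_b and p == COVERS}
    return sorted(covered_a & covered_b)


if __name__ == "__main__":
    line, clause = ANNOTATIONS.values()
    print(overlaps_without_nesting(line, clause))
    print(shared_tokens(to_triples(ANNOTATIONS), *ANNOTATIONS))
```

Because annotations refer to tokens by identifier rather than by enclosing them, any number of overlapping layers can coexist in the same graph.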

Selected LLOD resources

As of October 2018, the 10 most frequently linked resources in the LLOD diagram are (in order of the number of linked datasets):

The Ontologies of Linguistic Annotation (OLiA, linked with 74 datasets) provide reference terminology for linguistic annotations and grammatical metadata;
WordNet (linked with 51 datasets), a lexical database for English and pivot for developing similar databases for other languages, with several editions (Princeton edition linked with 36 datasets; W3C edition linked with 8 datasets; VU edition linked with 7 datasets);
DBpedia (linked with 50 datasets), a multilingual knowledge base of general world knowledge, based on Wikipedia;
lexinfo.net (linked with 36 datasets) provides reference terminology for lexical resources;
BabelNet (linked with 33 datasets), a multilingual lexicalized semantic network, based on the aggregation of various other resources, most notably WordNet and Wikipedia;
lexvo.org (linked with 26 datasets) provides language identifiers and other language-related data. Most importantly, lexvo provides an RDF representation of ISO 639-3 3-letter codes for language identifiers and information about these languages;
The ISO 12620 Data Category Registry (ISOcat; RDF edition, linked with 10 datasets) provides a semistructured repository for various language-related terminology. ISOcat is hosted by The Language Archive (DOBES project) at the Max Planck Institute for Psycholinguistics, but is currently in transition to CLARIN;
UBY (RDF edition lemon-Uby, linked with 9 datasets), a lexical network for English, aggregated from various lexical resources;
Glottolog (linked with 7 datasets) provides fine-grained language identifiers for low-resource languages, in particular many not covered by lexvo.org;
Wiktionary-DBpedia links (wiktionary.dbpedia.org, linked with 7 datasets), Wiktionary-based lexicalizations for DBpedia concepts.
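
A ranking by number of linked datasets, as in the list above, could be computed along the following lines. This is only a sketch: the dataset names and their link sets are toy data invented for illustration and do not reproduce the actual LLOD diagram or its counts.

```python
from collections import Counter

# Toy data (assumption): which hypothetical datasets link to which resources.
LINKS = {
    "dataset-a": {"OLiA", "WordNet"},
    "dataset-b": {"OLiA", "DBpedia"},
    "dataset-c": {"OLiA"},
}


def rank_targets(links):
    """Count, for each target resource, how many datasets link to it.

    Results are ordered by descending count, then alphabetically by name.
    """
    counts = Counter(target for targets in links.values() for target in targets)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for name, count in rank_targets(LINKS):
        print(f"{name} (linked with {count} datasets)")
```

Each dataset is counted at most once per target because its links are stored as a set.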

LLOD cloud development and community activities

The LLOD cloud diagram is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (since 2014 Open Knowledge), an open and interdisciplinary network of experts in language resources.

The OWLG organizes community events, coordinates LLOD developments, and facilitates interdisciplinary communication among LLOD contributors and users.

Several W3C Business and Community Groups focus on specialized aspects of LLOD:

The W3C Ontology-Lexica Community Group develops and maintains specifications for machine-readable dictionaries in the LLOD cloud.
The W3C Best Practices for Multilingual Linked Open Data Community Group gathers information on best practices for producing multilingual linked open data.
The W3C Linked Data for Language Technology Community Group assembles use cases and requirements for language technology applications that use Linked Data.

LLOD development is driven forward by, and documented in, a series of international workshops, datathons, and associated publications. These include, among others:

Linked Data in Linguistics (LDL), annual scientific workshop, since 2012
Multilingual Linked Open Data for Enterprises (MLODE), biennial community meeting (2012 and 2014)
Summer Datathon on Linguistic Linked Open Data (SD-LLOD), biennial datathon, since 2015

Uses and development of LLOD have been the subject of several research projects, including:

LOD2 (11 EU countries + Korea, 2010–2014)
MONNET (5 EU countries, 2010–2013)
LIDER (5 EU countries, 2013–2015)
QTLeap (6 EU countries, 2013–2016)
FREME (6 EU countries, 2015–2017)

References

[1] Open Linguistics Working Group. "Linguistic LOD". linguistic-lod.org. LIDER project. Retrieved 2016-05-24.
[2] Chiarcos, Christian; McCrae, John; Cimiano, Philipp; Fellbaum, Christiane (2013). "Towards open data for linguistics: Lexical Linked Data" (PDF). In: Alessandro Oltramari, Piek Vossen, Lu Qin, and Eduard Hovy (eds.), New Trends of Research in Ontologies and Lexical Resources. Heidelberg: Springer. Retrieved 2016-05-24.
[3] Bird, Steven; Liberman, Mark (1998). "Towards a formal framework for linguistic annotations" (PDF). In: Proceedings of the International Conference on Spoken Language Processing, Sydney, 1998. Retrieved 2016-05-25.
[4] ISO 24612:2012. "Language resource management -- Linguistic annotation framework (LAF)". ISO. Retrieved 2016-05-25.
[5] Eckart, Richard (2008). "Choosing an XML database for linguistically annotated corpora". SDV. Sprache und Datenverarbeitung 32.1/2008: International Journal for Language Data Processing, Workshop Datenbanktechnologien für hypermediale linguistische Anwendungen (KONVENS 2008), Universitätsverlag Rhein-Ruhr, Berlin, Sep 2008. pp. 7–22.
[6] Chiarcos, Christian (2012). "Interoperability of Corpora and Annotations (draft version)" (PDF). In: Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (eds.), Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata. Retrieved 2016-05-25.
