Language resource

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Chiarcos (talk | contribs) at 14:48, 13 May 2020 (fixed headings). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In linguistics and language technology, a language resource is a `[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications'.[1]

According to Bird & Simons (2003),[2] this includes

  1. data, i.e., `any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar',[2]
  2. tools, i.e., `computational resources that facilitate creating, viewing, querying, or otherwise using language data',[2] and
  3. advice, i.e., `any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data'. The latter aspect is usually referred to as `best practices' or `(community) standards'.[2]

In a narrower sense, the term language resource is applied specifically to resources that are available in digital form, `encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management'.[1]

Typology

As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LREMap[3] and META-SHARE[4]). Important classes of language resources include

  1. data
    1. lexical resources, e.g., machine-readable dictionaries,
    2. linguistic corpora, i.e., digital collections of natural language data,
    3. linguistic databases such as the Cross-Linguistic Linked Data collection,
  2. tools
    1. linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools),
    2. applications for search and retrieval over such data (corpus management systems) and for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.),
  3. metadata and vocabularies
    1. vocabularies, repositories of linguistic terminology and language metadata, e.g., META-SHARE (for language resource metadata),[4] the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource),[5] or the Glottolog database (identifiers for language varieties and a bibliographical database).[6]
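
The classes listed above can be read as a simple two-level taxonomy. The following is a minimal illustrative sketch in Python, not part of any of the cited proposals: the names ResourceClass, TYPOLOGY and LanguageResource, as well as the subtype labels, are assumptions introduced here for illustration, and the assignment of each example to a subtype follows the list above only loosely.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceClass(Enum):
    """Top-level classes from the list above (labels are illustrative)."""
    DATA = "data"
    TOOL = "tools"
    METADATA = "metadata and vocabularies"


# Hypothetical subtype labels, loosely paraphrasing the list items above.
TYPOLOGY = {
    ResourceClass.DATA: (
        "lexical resource",
        "linguistic corpus",
        "linguistic database",
    ),
    ResourceClass.TOOL: (
        "annotation tool",
        "corpus management system",
        "automated annotation tool",
    ),
    ResourceClass.METADATA: ("vocabulary",),
}


@dataclass(frozen=True)
class LanguageResource:
    name: str
    subtype: str

    @property
    def resource_class(self) -> ResourceClass:
        """Look up the top-level class that contains this subtype."""
        for cls, subtypes in TYPOLOGY.items():
            if self.subtype in subtypes:
                return cls
        raise ValueError(f"unknown subtype: {self.subtype!r}")


# Examples named in the list above, with assumed subtype assignments.
examples = [
    LanguageResource("Cross-Linguistic Linked Data", "linguistic database"),
    LanguageResource("Toolbox", "annotation tool"),
    LanguageResource("FLEx", "annotation tool"),
    LanguageResource("Glottolog", "vocabulary"),
]

for resource in examples:
    print(f"{resource.name}: {resource.resource_class.value}")
```

Because no standard typology has been established, a flat lookup table like this is only one possible encoding; richer schemes such as the META-SHARE ontology describe resources along many more dimensions.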

References

  1. ^ a b LD4LT (2020), The Metashare Ontology as Created by the LD4LT Community Group, W3C Community Group Linked Data for Language Technology (LD4LT), Development branch, version of Mar 10, 2020
  2. ^ a b c d Bird, Steven; Simons, Gary (2003-11-01). "Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources". Computers and the Humanities. 37 (4): 375–388. doi:10.1023/A:1025720518994. ISSN 1572-8412.
  3. ^ Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., & Soria, C. (2012, May). The LRE Map. Harmonising Community Descriptions of Resources. In LREC (pp. 1084-1089).
  4. ^ a b McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Cham: Springer International Publishing: 271–282. doi:10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
  5. ^ Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
  6. ^ Nordhoff, Sebastian (2012), Chiarcos, Christian; Nordhoff, Sebastian; Hellmann, Sebastian (eds.), "Linked Data for Linguistic Diversity Research: Glottolog/Langdoc and ASJP Online", Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, pp. 191–200, doi:10.1007/978-3-642-28249-2_18, ISBN 978-3-642-28249-2, retrieved 2020-05-13