
Wikipedia:Authority control integration proposal

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Maximiliankleinoclc (talk | contribs) at 18:44, 26 June 2012 (Template details). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This proposed project intends to extend and systematise the use of authority control identifiers, using the {{Authority control}} template, on English Wikipedia articles. Authority control is the term of art in librarianship, archival practice and related fields for the use of unique identifiers to disambiguate objects (people, places, academic subjects, etc.). These fields conceptualise unique identifiers differently from some other fields, because many of the systems in place are backwards-compatible with pre-computerisation systems. This project aims to connect the English Wikipedia to this long tail of identifiers.

The current proposal focuses on biographies, although this may be extended in future to cover other topics, and is built around the use of data from VIAF, a composite system bringing together several major authority files. VIAF algorithmically matches and clusters entries from the individual authority files, and uses data scraped from Wikipedia to aid the process; as a result, there have already been a large number of Wikipedia-VIAF matched pairs identified and this provides a very effective springboard to work from.

The proposal was originally written up here, and discussed on the Village Pump here. It has since been updated to include some of the feedback and commentary received during the discussions. While the Village Pump discussion was broadly favourable, it will soon be formally listed as an RFC in order to ensure clear support from the community before implementation later in 2012.

This plan is being coordinated by Max Klein, the Wikipedian in Residence at OCLC, and Andrew Gray, the Wikipedian in Residence at the British Library. OCLC are the central operating group for VIAF, and have offered to provide technical support for the matching process. If you would like to help work on it, please let us know.

Background

Authority control is a system primarily used in libraries and other metadata services, where a single entity is given a canonical unique identifier. This allows clear disambiguation between different entities with similar names, while also allowing the use of a single identifier for those with multiple variant names.

Currently, around 4,000 articles on the English Wikipedia have some form of embedded authority control identifier. Likewise, on Commons around 45,000 articles contain authority control identifiers. On the German Wikipedia, by comparison, around 220,000 articles have embedded identifiers.

The practical uses for these identifiers are varied. Among other things, they can:

  • help provide access to material written by and about the subject of an article, by linking directly into catalogues;
  • identify alternate names for which we can create redirects;
  • allow direct linking to, and reuse of, Wikipedia articles by external services, without needing to check page titles;
  • support the development of better metadata services, by helping information flow from Wikipedia back to the central data stores;
  • help tie together articles on specific individuals, supporting the development of Wikidata.

This project is not the first effort to merge authority control with Wikipedia, but rather aims to build on previous projects; its main goal is to prepare infrastructure for use in Wikidata and future interoperability with external authority files, and to support the opportunities for future innovation with linked data.

The proposal

This initial proposal focuses on identifiers in biographies; however, it is not intended to be exclusive, and the system can be extended in future to other articles if there is community support for it.

It is built around use of the Virtual International Authority File (VIAF), an international project to merge multiple national authority files into a single master system. VIAF identifiers correspond to identifiers in other systems, and can be used in parallel with, or instead of, these other identifiers.

The process will involve identifying an appropriate VIAF identifier for as many articles as possible, using a number of different methods ranked by probable accuracy. Once the data has been tested to ensure it is consistent and accurate, a bot will add a VIAF identifier to these articles, using an extended version of the {{Authority control}} template. At this stage, it will also be possible to include non-VIAF identifiers, such as LCCN or GND, if desired.

Data sources

There are four available sources of data:

  1. Articles already using {{Authority control}} - some of these will have VIAF numbers. Where they do not, we can use the LCCN/GND numbers to match a VIAF number and include it in the existing template.
  2. Interwikied articles with identifiers - around 220,000 articles in the German Wikipedia have identifiers; some include VIAF, some do not. Where an interwiki to the German Wikipedia exists, we can pull the identifier from the linked page, doing some basic metadata checks to ensure the interwiki linkage is accurate.
  3. VIAF authority file links - as part of the matching process, Wikipedia is used as a source of information to help bring VIAF "clusters" together. OCLC have provided an extracted list of over 250,000 English Wikipedia articles with corresponding VIAF numbers, though these may have to be checked to ensure that pages have not been moved since the matching was carried out.
    (The matching is done with this Python code written by OCLC Research Scientists Thom Hickey and Jenny Toves. During the algorithmic creation of the VIAF file, a Wikipedia link is included in an entry if it is matched with ~98% accuracy. There are currently 266,202 links from VIAF to Wikipedia; these links are available as a tab-delimited text file.)
  4. Automated matching - for articles not covered by any of the above, it may be possible to generate matches to VIAF using metadata such as variant names or birth and death dates. This will need a degree of error-checking and review to determine the accuracy of the matching process, and should not be used if another source is available.
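As a rough illustration of how the tab-delimited list of VIAF-to-Wikipedia links (source #3) might be loaded for further checking, here is a minimal Python sketch. The column layout (VIAF ID first, article title second), the function name `load_viaf_links`, and the sample rows are all assumptions made for this example; the actual file may be laid out differently.

```python
import csv
import io


def load_viaf_links(handle):
    """Read tab-delimited rows into a dict of article title -> set of VIAF IDs.

    Assumed (hypothetical) layout: column 1 is the VIAF ID, column 2 the title.
    """
    links = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        viaf_id, title = row[0].strip(), row[1].strip()
        # Treat "Albert_Einstein" and "Albert Einstein" as the same title.
        title = title.replace("_", " ")
        links.setdefault(title, set()).add(viaf_id)
    return links


# Made-up sample data standing in for the real file.
sample = io.StringIO("1111\tAlbert_Einstein\n\n2222\tExample Person\n")
links = load_viaf_links(sample)
```

Keeping a set of identifiers per title, rather than a single value, means that a title appearing with more than one VIAF number is preserved for later review instead of being silently overwritten.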

Implementation

The implementation will be done in five stages.

  1. Create a list of page titles and associated VIAF cluster IDs. This will be done off-wiki, and will use multiple data sources - those noted as #1-3 above. These can then be checked against each other, and any with discrepancies can be examined manually. (This database will be made available for reuse and to help with ongoing maintenance.) A sample of these will be selected and manually compared to ensure accuracy.
  2. Prior to the bot run, {{Authority control}} will be redeveloped to ensure it scales effectively to the new usage, creating sub-templates for specific identifiers. The documentation for this template, along with Wikipedia:Authority control, will be checked and updated or overhauled where necessary.
  3. A bot will be developed and tested, then approved through the standard bot approval process to ensure there are no technical problems and that it is compliant with this proposal.
  4. This bot will add {{authority control}} along with the VIAF codes from this list, once testing is complete.
  5. Finally, this bot will run periodically and in conjunction with the VIAF update schedule, to reflect any reshuffling that occurs in the file.

Initially, this will be done using data sources #1-3 above - ie, those where we have some form of match between VIAF and Wikipedia to begin with. If the data in source #4 turns out to be usable and accurate, then a second wave of matches will be done using these identifiers.
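The cross-checking described in stage 1 could look something like the following Python sketch. It assumes each data source has already been reduced to a simple mapping from page title to VIAF ID; the function name `reconcile`, the source labels, and the sample identifiers are hypothetical.

```python
def reconcile(sources):
    """Split titles into those where all sources agree and those that conflict.

    sources: dict of source name -> {article title: VIAF ID}.
    Returns (agreed, disputed), where agreed maps title -> VIAF ID and
    disputed maps title -> {source name: VIAF ID} for manual examination.
    """
    candidates = {}
    for source_name, mapping in sources.items():
        for title, viaf_id in mapping.items():
            candidates.setdefault(title, {})[source_name] = viaf_id

    agreed, disputed = {}, {}
    for title, by_source in candidates.items():
        distinct = set(by_source.values())
        if len(distinct) == 1:
            agreed[title] = distinct.pop()
        else:
            disputed[title] = by_source
    return agreed, disputed


# Made-up identifiers for illustration only.
sources = {
    "existing_template": {"Person A": "1001"},
    "dewiki_interwiki": {"Person A": "1001", "Person B": "2002"},
    "viaf_links": {"Person B": "2999", "Person C": "3003"},
}
agreed, disputed = reconcile(sources)
```

In this sample, "Person B" would be set aside for manual examination because two sources disagree, while "Person A" and "Person C" would go forward to the bot's list.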

[Timeline chart: "Towards Deep Authority Control Integration". Phases I to V (onwiki discussion to develop proposal; RfC on finalised proposal; creation of processes and bots, and bot approval; deployment of content; analysis and maintenance, with a possible second phase for non-biographies) plotted against the months June to December.]
  1. Onwiki discussion to develop proposal (by late June)
  2. RfC on finalised proposal (by mid-July)
  3. Creation of processes and bots; bot approval (by end of July)
  4. Deployment of content (through August)
  5. Future: Wikidata integration (part of phase 2 of Wikidata - entirely dependent on that schedule)
  6. Maintenance (...ongoing...)

Template details

The template currently used to handle authority control data is {{Authority control}}; it is placed at the extreme end of the article, just above the categories, and displays a narrow box with the identifiers. These link to an external service. For an example, see Albert Einstein - this uses GND, LCCN, and VIAF codes, and is nested under four collapsed navigational templates following the external links.

It will only be used on "main" articles, and not on subpages or related bibliographies - no two articles should share an identifier.

As part of this project, we will need to rewrite {{authority control}} to form a wrapper for a number of subsidiary templates, each handling a specific identifier. This will make it easier to maintain as well as easier to develop support for other identifiers, without the need for experimentation on a template used on several hundred thousand pages.

Documentation on {{authority control}}, Wikipedia:Authority control, and related pages will be updated accordingly.
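To make the placement rule concrete (the template goes at the end of the article, just above the categories), here is a minimal Python sketch of what the bot's edit might do to a page's wikitext. The parameter syntax `|VIAF=...`, the function names, and the decision to skip pages that already carry the template are assumptions for illustration, not a description of the approved bot.

```python
import re

# Matches the start of a line that begins a category link.
CATEGORY_LINE = re.compile(r"^\[\[Category:", re.MULTILINE)


def build_template(identifiers):
    """Render template wikitext from a dict such as {"VIAF": "1001"}.

    The |NAME=value parameter style is an assumption for this sketch.
    """
    params = "".join(f"|{key}={value}" for key, value in sorted(identifiers.items()))
    return "{{Authority control" + params + "}}"


def add_template(wikitext, identifiers):
    """Insert the template just above the first category, or at the very end."""
    if "{{authority control" in wikitext.lower():
        return wikitext  # already present: leave for separate handling
    template = build_template(identifiers)
    match = CATEGORY_LINE.search(wikitext)
    if match is None:
        return wikitext.rstrip("\n") + "\n\n" + template + "\n"
    return wikitext[:match.start()] + template + "\n\n" + wikitext[match.start():]


# Made-up page text and identifier.
page = "Text.\n\n[[Category:Physicists]]\n"
updated = add_template(page, {"VIAF": "1001"})
```

Running the function a second time on `updated` returns it unchanged, which keeps a periodic bot run from stacking duplicate templates.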

Frequently asked questions

  1. How do I add a subject's VIAF to the article about them (or mine to my user page)?
    Use {{Authority control}}.
  2. Why use VIAF and not another identifier?
    VIAF is a composite of several existing authority control databases, and so includes all the content from many of the other systems. Any entity with, for example, an LCCN should have a corresponding VIAF number as well, but not every entity with a VIAF number will have an LCCN. Adding VIAF does not preclude the inclusion of other identifiers (and may indeed make it easier); this isn't aiming to impose a sole standard.
  3. Why only people?
    The authority control system does cover other things, but for the moment (written 2012) we are only planning to cover people; this is to simplify the initial program, as well as to target the articles where the template is most likely to be useful.
  4. What about errors in VIAF?
    You can report apparent errors in VIAF (or its constituent catalogues) at Wikipedia:VIAF/errors. These are then available to the relevant managing body, and for linkage repair on-Wiki. For the German equivalent noticeboard, see de:WP:PND/F.
  5. What about licensing?
    VIAF is licensed as ODC-BY, which is compatible with Wikipedia licensing; the use of a VIAF URI is sufficient attribution for the terms of the license.
  6. Will this give any control over Wikipedia content to third parties?
    No. While we will be including VIAF identifiers, the content of Wikipedia and VIAF will remain entirely separate. No metadata will be imported automatically from VIAF, nor will Wikipedia need to follow VIAF naming conventions.
  7. What if editors object to the template or the identifier?
    Editors of specific pages will in all cases be free to remove the metadata where it is inaccurate or felt to be editorially inappropriate. For the purposes of Wikipedia:Sanctions, the first revert of an automated or semi-automated addition of authority control information shall not count as a revert.
  8. What about pages covering two people?
    There are many cases where a single article deals with two individuals. If two VIAF identifiers refer to the same article, this will be logged but not added to the article; if it currently contains one but not the other, or a mixture of identifiers referring to both, this will also be flagged.
  9. What about Wikidata?
    Wikidata includes authority identifiers. However, adding the template now allows us to gain the benefit of having this information available before Wikipedia transcludes it from Wikidata; it will also simplify any future work to add these identifiers to Wikidata.
  10. What about cases where several people have the same name?
    The primary purpose of authority control records is to help distinguish between people with the same (or similar) names. As such, identifiers are usually not matched on the name alone; the software is able to take account of other information such as birth and death dates.
  11. I wrote a new biographical article; how do I find the VIAF identifier?
    Thank you for contributing to Wikipedia! You can look up a subject's VIAF at http://viaf.org/. Enter their name as the "Search Terms:", and leave the other parameters at their default values. If there are two or more entries with the same name, check the listed works for a match. If you're not sure which to use, you can ask for advice at Wikipedia talk:Authority control.
  12. I have another question
    Any comments, criticisms, etc. will be gratefully received, again at Wikipedia talk:Authority control.
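The point made in question 10, that matching relies on more than the name alone, can be illustrated with a small Python sketch. This is not the OCLC matching code; the `Person` record, the rule that an unknown year does not rule out a match, and the sample identifiers are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def years_compatible(a, b):
    """Two years are compatible if either is unknown or both are equal."""
    return a is None or b is None or a == b


def find_match(article, candidates):
    """Return the VIAF ID of the only candidate agreeing on name and dates.

    candidates: list of (Person, VIAF ID) pairs.
    Returns None when no candidate, or more than one, is compatible.
    """
    hits = [
        viaf_id
        for record, viaf_id in candidates
        if record.name.casefold() == article.name.casefold()
        and years_compatible(record.birth_year, article.birth_year)
        and years_compatible(record.death_year, article.death_year)
    ]
    return hits[0] if len(hits) == 1 else None


# Made-up records: two people sharing a name but not their dates.
candidates = [
    (Person("John Smith", 1850, 1920), "1001"),
    (Person("John Smith", 1901, 1977), "2002"),
]
with_dates = find_match(Person("John Smith", 1850, 1920), candidates)
name_only = find_match(Person("John Smith"), candidates)
```

With dates available the first record is selected; with the name alone, both records remain possible and the sketch declines to choose, leaving the case for human review.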

If the community RFC decides not to go ahead with the project, we'll still be able to pass the data generated so far to the Wikidata team, and hopefully it can be used there.

Any comments, criticisms, etc. gratefully received.

- Max Klein, OCLC Wikipedian in Residence, and Andrew Gray, British Library Wikipedian in Residence.