Wikipedia:Authority control integration proposal
Towards deep authority control integration in Wikipedia
Executive Summary
Authority control is a library practice of creating official records to disambiguate articles and create connections between them. So far, German Wikipedia has linked author articles to their national library unique identifiers in nearly all appropriate cases; on English Wikipedia an analogous effort is less than one percent complete. The Virtual International Authority File (VIAF) is an international project to merge all national authority files into a single master file. The algorithm that VIAF relies on to match the national files uses Wikipedia for hints and, when a match is made, links to the author's Wikipedia article. Using a bot, we could make about a quarter of a million reciprocal links from Wikipedia to VIAF, using a template that already exists in about 4,000 articles. The debate centers on whether we still think that the {{Authority control}} template is useful and should be scaled up. In the long term this could have strong Wikidata implications, in that links to VIAF could be used to draw in properties about the authors (landmark dates, major works, preferred names, etc.) and propagate authoritative metadata across all language Wikipedias.
Proposal
Introduction
The utility of authority control systems is understood by the library community and Wikipedians alike. For the library community, authority files are indispensable tools for precise cataloging. To Wikipedians, the inclusion of authority control is part of the march towards building a better encyclopedia with more structured data. The Virtual International Authority File is a joint project of 20 national libraries, operated by the Online Computer Library Center (OCLC), that combines their disambiguating authority files into one source. It algorithmically matches and clusters the agreements between the national authority files, and uses data scraped from Wikipedia to aid the process.
This project is not the first effort to merge authority control with Wikipedia; rather, it aims to build on previous projects. Its main goal is to prepare infrastructure for use in Wikidata, for future interoperability with VIAF, and for the linked data opportunities that such a bridge would confer.
What's happening already?
German Wikipedia implemented a comprehensive program to match articles with authority identifiers several years ago. A similar project on the English Wikipedia, using {{Authority control}}, has gained some traction but has covered only ~0.4% (4,308 transclusions) of the project's biographies, against a coverage rate of ~50% (219,428 transclusions) in German. There is also a Gadget that gives human editors an automated way of integrating this information.
Parallel to this, the "Persondata" structured data system has been widely rolled out on the English Wikipedia; during 2011, the proportion of biographies with persondata leapt from under 10% to well over 90%. This wealth of structured data means that there is a good opportunity to try to link English Wikipedia articles en masse in the next few months; if it can be done soon, it will help support the deployment of the cross-project Wikidata later in 2012-13.
There are a number of other templates which use non-standard identifiers for individuals in similar ways - {{OL author}} links to author records on Open Library, {{Gutenberg author}} to index entries on Gutenberg, etc. These may potentially be convertible to a uniform identifier.
How would we get the data?
As mentioned before, VIAF already uses Wikipedia in its algorithm to help it cluster and match the multitude of national authority files. The VIAF entries themselves took data from 788,582 records created from a Wikipedia dump, using Python code written by OCLC research scientists Thom Hickey and Jenny Toves. During the algorithmic creation of the VIAF file, a Wikipedia link is included in an entry if it is matched with ~98% accuracy. Right now there are 266,202 links from VIAF to Wikipedia. Those links are available as a tab-delimited text file.
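As a rough illustration of how a bot might consume that tab-delimited file, the sketch below turns it into a lookup from article title to VIAF identifier. The column layout (VIAF identifier first, Wikipedia URL second), the URL prefix, the function name `load_viaf_links` and the sample identifiers are all assumptions made for the example; the real file may be laid out differently.

```python
from urllib.parse import unquote

# Assumed URL prefix for English Wikipedia links in the file (hypothetical).
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def load_viaf_links(tsv_text):
    """Return {article title: VIAF id} from tab-delimited text.

    Assumes two columns per line: VIAF id, then Wikipedia URL.
    Lines that do not fit that shape are skipped rather than guessed at.
    """
    links = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        viaf_id, url = fields[0].strip(), fields[1].strip()
        if not viaf_id.isdigit() or not url.startswith(WIKI_PREFIX):
            continue
        # Convert the URL form of the title back to the display form.
        title = unquote(url[len(WIKI_PREFIX):]).replace("_", " ")
        links[title] = viaf_id
    return links


# Made-up sample rows; these are not real VIAF identifiers.
sample = (
    "1000001\thttp://en.wikipedia.org/wiki/Example_Author\n"
    "not-a-number\thttp://en.wikipedia.org/wiki/Broken_Row\n"
    "1000002\thttp://en.wikipedia.org/wiki/Another_%28writer%29\n"
)
links = load_viaf_links(sample)
```

Keying the lookup by title lets a bot check, page by page, whether a reciprocal link is available before it makes any edit.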
Another technique that might be useful is to look for {{normdaten}} and {{authority control}} templates that have GND or LCCN variables but no VIAF variable. Since those files are a subset of VIAF, any GND or LCCN identifier corresponds to a VIAF identifier, and conversion can occur between them.
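The template-based approach above could be sketched roughly as follows: find a template that carries a GND or LCCN value but no VIAF value, then look the identifier up in a conversion table. The parameter names (`GND`, `LCCN`, `VIAF`), the `crosswalk` table and the identifiers are assumptions for illustration; how the real GND/LCCN-to-VIAF conversion would be obtained is not specified here.

```python
import re

# Matches {{Authority control|...}} or {{Normdaten|...}} with simple parameters.
# Nested templates inside the parameters are deliberately not handled.
TEMPLATE_RE = re.compile(
    r"\{\{\s*(?:authority control|normdaten)\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def template_params(wikitext):
    """Return the template's named parameters as a dict, or None if absent."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    params = {}
    for part in match.group(1).split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().upper()] = value.strip()
    return params


def viaf_from_other_ids(wikitext, crosswalk):
    """Return a VIAF id derived from GND or LCCN, or None if not applicable.

    `crosswalk` is a hypothetical table {(scheme, identifier): viaf_id}.
    """
    params = template_params(wikitext)
    if not params or params.get("VIAF"):
        return None  # no template, or VIAF already recorded
    for scheme in ("GND", "LCCN"):
        value = params.get(scheme)
        if value and (scheme, value) in crosswalk:
            return crosswalk[(scheme, value)]
    return None


# Made-up identifiers, for illustration only.
crosswalk = {("GND", "000000000"): "1000001"}
page = "Some text.\n{{Authority control|GND=000000000|LCCN=n/00/000000}}\n"
found = viaf_from_other_ids(page, crosswalk)
```

Pages where a VIAF value is already present are left alone, so the lookup only fills gaps.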
Licensing
VIAF data is licensed under ODC-BY, and OCLC considers use of the canonical URI to be suitable attribution. There would therefore appear to be no licensing conflicts for this proposed plan.
Short-term Integration
There are potentially two ways to include the identifiers:
- We do it visibly; we roll out or complete the existing {{Authority control}} template by making reciprocal links to the VIAF->Wikipedia links. A bot will be created for the purpose by a combination of interested community developers and, if necessary, OCLC developer resources.
- Pros: Gains mindshare for the importance of the project, and sets a precedent for linking more library sources. Makes the identifiers easy for readers to use.
- Cons: This will put external links on several hundred thousand pages, which may cause community disputes about which sources to use and whether this is appropriate. The {{Authority control}} template is occasionally challenged as visual clutter and may not be appropriate on some pages.
- We do it invisibly; we either create a new non-displaying template and tracker category, or we leverage an existing one - {{persondata}}, for example, and add an identifier to that.
- Pros: Less controversial and still builds infrastructure for potential Wikidata use. Editors are still able to choose to use {{Authority control}}, but are not forced to do so.
- Cons: Raises no awareness in the short term for the project or the use of VIAF.
Either of these is workable; it's really a matter of what the community RFC chooses as the best way of doing it. My (Andrew's) personal preference is to start by doing it entirely invisibly (possibly in persondata); this would leave the opportunity for people to add visible linkage templates only where it seems editorially appropriate.
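For the visible option, the edit a bot would make to each page is small. The sketch below is a minimal illustration, not a bot design: it works on a wikitext string only, it assumes the template takes a `VIAF=` parameter, it assumes the template belongs just before the category links, and the function name `add_authority_control` and the identifier are made up.

```python
import re

# Detects an existing {{Authority control}} template, with or without parameters.
EXISTING_RE = re.compile(r"\{\{\s*authority control\b", re.IGNORECASE)
# Finds the first category link that starts a line.
CATEGORY_RE = re.compile(r"^\[\[Category:", re.MULTILINE)


def add_authority_control(wikitext, viaf_id):
    """Return wikitext with {{Authority control|VIAF=...}} added if missing.

    Pages that already carry the template are returned unchanged, so that
    existing, possibly hand-curated, values are left for human review.
    """
    if EXISTING_RE.search(wikitext):
        return wikitext
    template = "{{Authority control|VIAF=%s}}" % viaf_id
    match = CATEGORY_RE.search(wikitext)
    if match is None:
        # No categories found: append the template at the end of the page.
        return wikitext.rstrip("\n") + "\n\n" + template + "\n"
    start = match.start()
    return wikitext[:start] + template + "\n\n" + wikitext[start:]


page = "Article text.\n\n[[Category:Writers]]\n"
updated = add_authority_control(page, "1000001")  # made-up identifier
```

Because a page that already has the template is returned unchanged, running the function twice gives the same result as running it once. The invisible option would follow the same pattern, but would write the identifier into a non-displaying template instead.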
Long-Term Wikidata Integration
Wikidata has the potential to be a “game changer” that will “fundamentally alter the way we think about Wikipedia.” We need to imagine a world where each VIAF entity, bibliographic entity, and Wikidata entity has its own Uniform Resource Identifier (URI). Each Wikidata URI for an author or book would link to the VIAF record, could automatically read live data from the linked data, and could be content-negotiated to deliver item properties. These item properties could serve to dynamically generate infoboxes.
Also, VIAF might be a good set of seed data for Wikidata because it represents multilingual linked concepts. Furthermore, once clusters form in Wikidata there may be concepts with GND identifiers but without VIAF identifiers; relating these could contribute more accurate matching back to VIAF.
Thoughts?
The crux of this proposal is to reassess the utility and acceptance of the {{Authority control}} template. Do we want to make approximately a quarter of a million edits in conjunction with this template? And if so, does that also imply that a bot to make these edits should be approved once it is proved to be technically sound?
Ideas from Village Pump Discussion
- Make use of a noticeboard, like de:WP:PND/F, to report inaccuracies
- Include a phase after the bot runs to compare the VIAF entries of deWP and enWP where the pages are linked.
Timeline
- Onwiki discussion to develop proposal (by late June)
- RfC on finalised proposal (by mid-July)
- Creation of processes and bots; bot approval (by end of July)
- Deployment of content (through August)
- Wikidata integration (part of phase 2 of Wikidata - dependent on that schedule)
- Documentation (through August)
- Maintenance (...ongoing...)
If the community RFC decides not to go ahead with the project, we'll still be able to pass the data generated so far to the Wikidata team, and hopefully it can be used there.
Any comments, criticisms, etc. gratefully received.
- Max Klein, OCLC Wikipedian in Residence, and Andrew Gray, British Library Wikipedian in Residence.