Wikipedia:Authority control integration proposal
Introduction
We've been working out a way to match as many Wikipedia articles as possible to an authority identifier, and to embed that identifier in the article in some way. This will help build infrastructure for external services wanting to use Wikipedia, and the identifiers can be pulled into Wikidata when that project starts taking structured data from Wikipedia over the next year.
What's happening already?
Authority control is currently done to a limited extent on en.wp by use of {{Authority control}}, which places a visible set of links in the footer of the article, using a variety of existing authority control identifiers.
At the moment, this template is rarely used: it's in about 4,000 articles out of ~1,000,000 biographies. de.wp has a similar template (Normdaten), which is used in 217,000 articles out of 438,000 biographies. This precedent suggests that en.wp could include some form of identifier in several hundred thousand articles.
What's the point?
There are many uses for authority control services. For example, a tool currently implemented for de.wp allows you to link directly to a Wikipedia article using only an identifier number. This means that any service with a robust authority control system can link automatically to Wikipedia articles without having to check names or deal with disambiguation or transliteration.
How would we get the data?
Almost every biography in en.wp currently has {{persondata}}; this provides basic structured information (name, dates, etc.) about the subject. With this, we can algorithmically compare the metadata to the authority database (at the moment, VIAF) and return a clear match, a possible match, or no result. The clear matches can then be imported into Wikipedia, and the possible matches checked more closely.
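To make the three outcomes concrete, here is a minimal sketch of how such a comparison might be classified. It is only an illustration, not the matching process we would actually run: the `Person` fields, the `score` and `classify` functions, and the rule itself (names must agree, known years must not conflict, and at least one year must agree for a clear match) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    """Minimal metadata, loosely modelled on {{persondata}} (hypothetical fields)."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def normalise(name: str) -> str:
    """Lower-case a name and turn 'Surname, Forename' into 'forename surname'."""
    if "," in name:
        surname, _, forename = name.partition(",")
        name = f"{forename} {surname}"
    return " ".join(name.lower().split())


def score(article: Person, candidate: Person) -> str:
    """Return 'clear', 'possible' or 'none' for one article/candidate pair."""
    if normalise(article.name) != normalise(candidate.name):
        return "none"
    pairs = [(article.birth_year, candidate.birth_year),
             (article.death_year, candidate.death_year)]
    # Only compare years that are known on both sides.
    known = [(a, c) for a, c in pairs if a is not None and c is not None]
    if any(a != c for a, c in known):
        return "none"          # same name, but the dates conflict
    return "clear" if known else "possible"


def classify(article: Person,
             candidates: List[Tuple[str, Person]]) -> Tuple[str, List[str]]:
    """Reduce all candidate records to a single outcome plus the ids to review."""
    results = [(cid, score(article, cand)) for cid, cand in candidates]
    clear = [cid for cid, s in results if s == "clear"]
    possible = [cid for cid, s in results if s == "possible"]
    if len(clear) == 1:
        return "clear", clear
    if clear or possible:
        # Several clear hits, or name-only hits, need a human to look at them.
        return "possible", clear + possible
    return "none", []
```

Under this assumed rule, two authority records that both agree on name and dates would be reported as "possible" rather than "clear", so an ambiguous case is never imported automatically.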
There are some other approaches. For articles with German interwikis, we can pull identifiers from the corresponding de.wp article, and for a number of articles, we can use existing links embedded in the VIAF records themselves.
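As a rough illustration of the German-interwiki approach, the sketch below pulls a VIAF number out of a Normdaten template in a page's wikitext. The parameter names in the sample (`TYP`, `GND`, `VIAF`) and the function name `viaf_from_normdaten` are assumptions for this example, and the regular expression deliberately ignores nested templates; a real run would probably want a proper wikitext parser.

```python
import re
from typing import Optional

# Matches a single, un-nested {{Normdaten|...}} call (assumption: no templates inside).
NORMDATEN_RE = re.compile(r"\{\{\s*Normdaten\s*\|(?P<body>[^{}]*)\}\}", re.IGNORECASE)


def viaf_from_normdaten(wikitext: str) -> Optional[str]:
    """Return the VIAF number from a Normdaten template, or None if absent."""
    match = NORMDATEN_RE.search(wikitext)
    if match is None:
        return None
    for part in match.group("body").split("|"):
        key, sep, value = part.partition("=")
        if sep and key.strip().upper() == "VIAF":
            value = value.strip()
            # Only accept purely numeric values; anything else is left for review.
            return value if value.isdigit() else None
    return None


# Made-up sample wikitext; the identifier values are placeholders.
sample = "Article text\n{{Normdaten|TYP=p|GND=000000000|VIAF=12345678}}\n"
viaf = viaf_from_normdaten(sample)
```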
For the first phase of this project, we would probably use only VIAF records. It is possible we could run it again at a later date with other identifiers; however, many of the other major identifiers (LCCN, etc.) are included in VIAF in some way, and so it covers a lot of bases already.
...imported?
There are potentially two ways to include the identifiers:
a) We do it visibly; we roll out the existing {{Authority control}} template, or something similar, to a lot more pages.
b) We do it invisibly; we either create a new non-displaying template and tracker category, or we leverage an existing one ({{persondata}}, for example) and add an identifier to that.
Either of these is workable; it's really a matter of what the community feels is the best way of doing it. My (Andrew's) personal preference is to do it entirely invisibly (possibly in persondata); this would leave the opportunity for people to add visible linkage templates only when it seems editorially appropriate.
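For the invisible option, a bot edit might look something like the sketch below, which appends an identifier parameter to an existing {{persondata}} call. This is a sketch under stated assumptions only: {{persondata}} has no such parameter today, so the name `VIAF` is hypothetical, as is the function `add_identifier`, and the pattern assumes the template contains no nested templates.

```python
import re

# Captures the template body and its closing braces separately (no nested templates).
PERSONDATA_RE = re.compile(r"(\{\{\s*Persondata\b[^{}]*?)(\s*\}\})",
                           re.IGNORECASE | re.DOTALL)


def add_identifier(wikitext: str, viaf_id: str, param: str = "VIAF") -> str:
    """Append '| VIAF = <id>' to the first {{Persondata}} call, if not already set."""
    def repl(match: "re.Match[str]") -> str:
        head, tail = match.group(1), match.group(2)
        # Leave the page untouched when the (hypothetical) parameter already exists.
        if re.search(r"\|\s*" + re.escape(param) + r"\s*=", head):
            return match.group(0)
        return f"{head}\n| {param} = {viaf_id}{tail}"

    return PERSONDATA_RE.sub(repl, wikitext, count=1)


# Made-up sample; the identifier value is a placeholder.
page = "{{Persondata\n| NAME = Austen, Jane\n| DATE OF BIRTH = 16 December 1775\n}}"
updated = add_identifier(page, "12345678")
```

Running the function a second time on its own output changes nothing, which matters if the matching is re-run at a later date.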
Thoughts?
We'd like to gather some feedback on this draft. Ideally, what we'd hope for is to get a fleshed-out proposal (with full details) within the next couple of weeks, then submit that to a broadly advertised community RFC for approval, including deciding the thorny question of how best to include the identifiers in articles. Once that RFC's closed (perhaps mid-July?) we'll hopefully be ready to go.
During the RFC period, we can experiment with extracting the data and generating/validating matches; if the community RFC decides not to go ahead with the project, we'll still be able to pass the data generated so far to the Wikidata team, and hopefully it can be used there.
Any comments, criticisms, etc. gratefully received.
- Andrew Gray & Max Klein.