Wikipedia:Citations Database

The Wikipedia Citations Database is a comprehensive database of citations (broadly construed) that appear or have appeared on English Wikipedia. It is developed by the Internet Archive. As of 2025, a prototype citations database is online for review, based on a Wikipedia dump from October 2024. The system works by parsing Wikipedia articles and extracting the references they contain. Because extraction depends on parsing, the results may contain errors, especially for older revisions or revisions with broken syntax.
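The extraction step can be pictured with a small Python sketch. This is not the database's actual parser, which is not documented on this page; it is a simplified regular-expression approach, and the pattern, the function name extract_refs, and the sample wikitext are all assumptions made for illustration. It also shows how an unclosed tag can cause more than the citation to be captured.

```python
import re

# Hypothetical pattern: matches <ref ...>...</ref> pairs and skips
# self-closing tags such as <ref name="x" />. A real wikitext parser
# handles many more cases than this.
REF_PATTERN = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def extract_refs(wikitext):
    """Return the inner text of each paired <ref> tag found in wikitext."""
    return [body.strip() for body in REF_PATTERN.findall(wikitext)]


# Illustrative wikitext, not taken from any real article.
clean = (
    "Paris is the capital of France."
    "<ref>{{cite web |url=https://example.org |title=Example}}</ref> "
    'It lies on the Seine.<ref name="seine">{{cite book |title=Rivers}}</ref> '
    'Reused.<ref name="seine" />'
)
clean_refs = extract_refs(clean)

# Broken syntax: the first <ref> is never closed, so the match runs on
# until the next </ref> and swallows the prose in between.
broken = "A.<ref>{{cite web |title=One}} More prose.<ref>{{cite web |title=Two}}</ref>"
broken_refs = extract_refs(broken)
```

In the clean sample the sketch returns two citations and ignores the reused, self-closing tag. In the broken sample it returns a single oversized item containing both templates and the text between them.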

Using the prototype

The prototype is available at wikipediacitations.scatter.red. Enter the full URL of the Wikipedia article (e.g., https://en.wikipedia.org/wiki/Wikipedia). If you select "show raw references," the report shows the exact wikitext of each citation instead of the normalized (cleaned-up) form.

When you load the report, it will show the complete citation history, including older variants of the same underlying citation. To narrow down to the citations that appeared as of a given date, enter a date in the textbox above the table and click "Apply". You may have to wait several seconds for the page to process. If your browser reports that the page is frozen, keep selecting "wait" until processing finishes. The page is still working; it can just take a while.
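One way to think about the date filter is as a range check against each citation's earliest and latest revision timestamps. The sketch below assumes exactly that, which is an approximation: it cannot tell whether a citation was removed and later restored between the two timestamps, and the prototype's real filtering logic is not described here. The field names, the timestamp format, and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a UTC timestamp such as 2024-10-01T00:00:00Z (assumed format)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def citations_as_of(records, as_of):
    """Keep records whose earliest..latest range contains the as_of timestamp."""
    cutoff = parse_ts(as_of)
    return [
        r for r in records
        if parse_ts(r["earliest"]) <= cutoff <= parse_ts(r["latest"])
    ]


# Made-up rows standing in for the report table.
records = [
    {"citation": "A", "earliest": "2010-05-01T00:00:00Z", "latest": "2024-10-01T00:00:00Z"},
    {"citation": "B", "earliest": "2012-03-15T08:30:00Z", "latest": "2015-07-20T10:00:00Z"},
    {"citation": "C", "earliest": "2020-01-10T00:00:00Z", "latest": "2024-09-30T00:00:00Z"},
]

in_2014 = [r["citation"] for r in citations_as_of(records, "2014-01-01T00:00:00Z")]
in_2021 = [r["citation"] for r in citations_as_of(records, "2021-01-01T00:00:00Z")]
```

With these sample rows, citations A and B fall inside the range for the 2014 date, and A and C for the 2021 date.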

The leftmost column includes the content of the citation, raw or normalized. The "earliest revision timestamp" is the first revision where that given citation appears. The "latest revision timestamp" is the most recent revision where it appears. (Note that the database only goes up to early October 2024.) There are also SHA1 hash columns, uniquely identifying the text of the (raw or normalized) citation, as well as a "record" SHA1, which combines the normalized reference hash with the domain and page title.
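The hash columns can be illustrated with Python's standard hashlib module. This is only a sketch: the page does not say how the normalized form is produced or how the three parts of the record hash are joined, so the whitespace-collapsing normalize function, the newline separator, and the sample values below are assumptions and will not reproduce the database's actual hashes.

```python
import hashlib


def sha1_hex(text):
    """Return the SHA1 digest of text as a 40-character hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def normalize(raw):
    # Assumed normalization: collapse runs of whitespace to one space.
    return " ".join(raw.split())


def record_sha1(normalized_hash, domain, page_title):
    # Assumed combination: join the three parts with newlines, then hash.
    return sha1_hex("\n".join([normalized_hash, domain, page_title]))


# Illustrative citation with a double space that normalization removes.
raw = "{{cite web  |url=https://example.org |title=Example}}"

raw_hash = sha1_hex(raw)
normalized_hash = sha1_hex(normalize(raw))

# The same citation on two different pages yields two different record hashes.
record_paris = record_sha1(normalized_hash, "en.wikipedia.org", "Paris")
record_france = record_sha1(normalized_hash, "en.wikipedia.org", "France")
```

Under these assumptions the citation hashes identify the text regardless of where it appears, while the record hash ties that text to one page on one wiki.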

You can toggle the visibility of any column by selecting (or de-selecting) the checkbox corresponding to that column. You can also click on the column header to sort by that column.

This database does not make semantic inferences, e.g., that a given bit of wikitext on an article corresponds to a specific, known source as documented in a bibliographic database. Those inferences will be carried out in a separate step. The goal at the moment is to isolate wikitext.

Feedback

The goal of releasing the prototype database is to get feedback on any errors or opportunities for improvement. If you notice anything that could be improved, you are strongly encouraged to leave your feedback on the talk page.

Known issues

  • Missing pages: if a page is missing, either it was created after October 2024 (so it is not in the dump) or it existed earlier but had not been imported when the database went online (the database is only partially built).
  • Extracting citations from broken wikitext: if the wikitext has an error that interferes with parsing, the extracted text may include more than just the citation.
  • Counting slight variants as separate citations: if two citations point to the same document but are formatted slightly differently (e.g., one contains an ISBN and the other doesn't), the database treats them as separate citations.
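The last issue follows from identifying citations by a hash of their text: any difference that survives normalization produces a different hash. The Python sketch below illustrates this under the assumption that normalization only collapses whitespace. The function name text_sha1 and the citations are made up, and the ISBN is a placeholder rather than a real one.

```python
import hashlib


def text_sha1(text):
    # Assumed normalization: collapse whitespace, then hash the result.
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical citations to the same book; the ISBN is a placeholder.
with_isbn = "{{cite book |title=Rivers of Europe |year=2009 |isbn=0000000000}}"
without_isbn = "{{cite book |title=Rivers of Europe |year=2009}}"
spacing_only = "{{cite book  |title=Rivers of Europe   |year=2009}}"

# A difference in spacing disappears under the assumed normalization...
same_after_normalizing = text_sha1(without_isbn) == text_sha1(spacing_only)

# ...but an extra parameter changes the text, so the hashes differ.
treated_as_separate = text_sha1(with_isbn) != text_sha1(without_isbn)
```

Recognizing that the first two citations refer to the same book would require the kind of semantic inference that the database leaves to a later step.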

Code
