User:Bluerasberry/Wikidata graph split
This is a draft of a potential Signpost article, and should not be interpreted as a finished piece. Its content is subject to review by the editorial team and ultimately by JPxG, the editor in chief. Please do not link to this draft as it is unfinished and the URL will change upon publication. If you would like to contribute and are familiar with the requirements of a Signpost article, feel free to be bold in making improvements!
Wikidata Graph Split and how we address major challenges
If we had a problem, then would we talk about it?
On 20 January 2026, the Wikimedia Foundation finalized the split of Wikidata into two collections of data, or "graphs". This Wikidata Graph Split affects the hundreds of regular contributors and thousands of regular tool users in the WikiCite community, who see value in curating a Wikimedia citation database. Since at least 2015, WikiCite has been among the most popular Wikidata projects in terms of contributor count, content produced, financial investment, institutional partnerships, active discussions, count of non-editor users, and stirring of passion. Also since 2015, WikiCite's popularity has exceeded the limits of Wikidata, or broken Wikidata, and consequently Wikidata has turned away new users, institutional partnerships, financial investments, and major content contribution projects because our infrastructure lacks the capacity to accept the contemporary standard of small data upload projects. All of us Wikipedia editors understand technical limitations throughout the Wikimedia projects, and to me Wikipedia's last-generation janky technology is 💙 cute and endearing. But in the case of Wikidata's limits, the part that seems different to me is that after 10 years of turning away new users, we still have ambiguity about if and when Wikidata's capacity will ever increase. I do Wikidata at my university, and I preach Wikidata as the solution to all kinds of real-world challenges, and I am optimistic about limitations. What is much more difficult for me to manage is long-term perpetual uncertainty in the absence of conversations or planning. I can cool disappointment in people who say that Wikidata is technologically insufficient right now, but I would have liked to give an update at some point in previous years that there was a plan or schedule to update Wikidata, at which point users should come back to do their data project.
In the Wikimedia Movement we track new user registrations, but no one makes reports of how many people or institutions leave in disappointment, or for how many years they have been telling their friends that Wikidata is not viable for common sorts of data curation projects. If we had a major problem with a Wikimedia platform, then do we have the community infrastructure to talk about it?
My feeling is that our Wikidata challenge was not technical, but rather was about interpersonal relationships. For the future, I want confidence and trust that when we Wikimedia editors have major challenges, then we have a community governance system to recognize and discuss them. Look here with me at the circumstances which have slowed Wikidata growth for some years, and be hopeful with me about the success plan to fix things by summer 2027 when the Wikimedia Foundation will migrate Wikidata's backend to a new SPARQL engine.
Why anyone should care about WikiCite or Scholia
TL;DR Universities are in the business of doing research, but they know neither who their own professors are nor what research their own professors publish. This is a crazy insight. WikiCite is a project which creates Wikidata metadata of all researchers, and matches them to their publications. Universities already pay a lot for subscriptions which attempt to give them this information.
For a typical person who is not into library indexing, here is how I explain this: Many universities are in the business of doing research, and the job of professors at research universities is to do research and publish in academic journals. However, right now, despite the Internet, AI, technology, and everything else, universities have no idea who their faculty are or what their faculty do.
There is hardly any university in the world that is able to push a button and get a report of all the research which its faculty published in the last year. Every university wants that information, and the ones with money pay for costly subscriptions which attempt to give this information. If Wikipedia had a scholarly catalog, then we could refer readers and researchers to all publications by author, university, topic, region, co-author network, funding source, ethical compliance, method used, or any other searchable research criteria. We have not yet even begun the age of universal access to fundamental library catalogs. Even universities that are not persuaded by such idealism are often persuaded to contribute data to match their researchers to grant opportunities, or to evaluate department publication output, or to identify peer reviewers or collaborators for research. A 2025 survey of Scholia collected user feedback that there is a base of power users who find value in this approach.
WikiCite is the project to curate scholarly metadata in Wikidata. It includes the editing project, the community of editors and conferences, and outreach efforts through which institutions contribute their data, such as the WikiProject Program for Cooperative Cataloging project which recruited 50 universities to index their research in Wikidata. There are only a handful of projects in the Wikimedia Movement which have hundreds of editors and a portfolio of institutional partnerships. Although there are multiple reasons why editors come to WikiCite, a unique connection that the project has is that universities index their faculty and research publications in Wikidata both for Wikimedia community curation, and also because Wikidata is a good value investment for any research institution to circulate its research output as linked open data in all other Internet services and AI which index research.
Scholia is a friendly web interface for accessing WikiCite collections. It is friendly in the sense that it has more than 400 scholarly queries already formatted, for example, list of a researcher's publications, list of people and research at a university, or profile of research on a topic. This sort of service is "scholarly profiling", and to sort this data, one needs the "scholarly graph of metadata" as Linked Open Data connecting topics to scholarly articles to authors to their institutions, co-authors, software, datasets, grants, and everything else. Scholia and WikiCite are the Wikimedia projects for scholarly profiling, and alternatives to services including Google Scholar, Web of Science, or OpenAlex. I am part of the Scholia team, and I am biased, but I think the WikiCite approach to connecting Wikimedia projects to a global scholarly database is one of the best and most popular project ideas that the Wikimedia Movement has developed.
Exceeding the limits of Wikidata
In May 2024, The Signpost shared my story that Wikidata would soon split as the sheer volume of information overloads the infrastructure. Disclosure: I am a Wikimedian in Residence who develops Wikidata content as a university researcher, so please note that I have an employer conflict of interest in this op-ed and in Wikidata's perpetual growth.
The split divided WikiCite content, which was 1/3 of the content of Wikidata, from everything else in Wikidata. The Wikimedia Foundation and Wikimedia community actually did discuss this, a lot. I really appreciate the Wikimedia Foundation staff who did me many favors by meeting with me monthly since 2024 by video, email, at conferences, and through referrals. Copied from the 2024 Signpost article, here again are the major discussion reports. The insight to gain from these reports is that there was long-term recognition of a major challenge, while all the while Wikidata was at reduced growth with no planned year in which we would increase capacity. No one did anything incorrectly, and delaying the decision always made sense at the time.
- 2018 d:Wikidata:WikiCite/Roadmap
- 2019 d:Wikidata:WikiProject Limits of Wikidata
- 2021 wikitech:User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis
- 2021 d:Wikidata:SPARQL query service/WDQS backend update/Blazegraph failure playbook
- 2021 WikiCite panel discussion (WikidataCon 2021 recording) (video)
- 2023 WikiCite talk page discussion
- 2023 meta:WikiCite/Roadmap 2023
- 2024 d:Wikidata:SPARQL query service/WDQS graph split/WDQS Split Refinement
In retrospect, I see parts of the Wikimedia Movement that invest heavily in growing the editor community, and other parts of the Wikimedia community where I feel that technical challenges are incompatible with editor recruitment. I never expected Wikidata to be closed and in limbo for 10 years, but no community group ever organized to make a leadership statement of when Wikidata might update, and how we should make multi-year plans. There were thousands of hours of user time spent talking about the problem. We were unable to establish a governance plan to evaluate the cost of delay versus the scheduling of a decision.
Wikidata Graph Split
While WikiCite is a major Wikidata project, Wikidata is such a large platform that most Wikidata users do not curate citations, and will not notice the Wikidata Graph Split. Individuals and institutions who do metadata curation can update their processes as suggested in the Graph Split FAQ. As Wikidata is not designed to split or federate, all queries which seek both a citation and anything else need to look in two separate databases. Besides Wikimedia editors as stakeholders, I see great value in growing multi-year Wikimedia collaborations with universities. A lot of universities have contributed scholarly data to Wikidata. I regret that when we have lots of high-level institutional buy-in, where universities pay faculty and librarians to train Wikidata editors, we have not acknowledged this external investment into our ecosystem with internal investments to expand our capacity to accommodate new users, data uploads, and institutional partnerships. The split is not primarily a WikiCite issue, but WikiCite was the most prominent community project to recruit Wikidata contributions from staff at many institutions, and it makes me anxious to think that we have for years communicated that Wikidata is an unstable project and not ready for partnerships.
Although the Wikidata Graph Split affects me greatly, and although it has been a massive burden to me and my colleagues in the WikiCite project, we WikiCite contributors recognize that software updates are necessary and decisions are hard to make. The bigger issue to me is that the WikiCite community has had major challenges for 10 years, but lacked communication and governance processes to confirm which year we could anticipate turning Wikidata back on. Because of both the technical limits of Wikidata and the lack of Wikimedia user governance processes for discussing our challenges, we have a 10-year generation where we have turned away new users and institutional collaborations even while editor recruitment is a top Wikimedia Foundation priority as named in the 2025-26 annual plan and elsewhere. Wikidata is unique among Wikimedia projects in that it spontaneously attracts institutional partnerships including universities, museums, and research institutes, where the organizations recognize the value in paying their staff to curate Linked Open Data and send it into the Internet through Wikidata. Despite this attractiveness, Wikidata lacks the technical capacity to host the contemporary standard of Linked Open Data projects which institutions intuitively expect Wikidata to be able to handle.
Even so, the story here is social, and not technical. I am grateful for the Wikidata Graph Split because it advances a solution to the problem. What I want is to have more confidence that if we in the Wikimedia Movement face a great challenge, then we can all have faith that our global governance systems will identify it and plan a response with fewer years of uncertainty.
Before the graph split, anyone could query all of Wikidata through the Wikidata Query Service. After the graph split, there is now an additional endpoint which contains only citation metadata of scholarly articles. To perform a query which accesses both kinds of data, users and tool developers must now write a query which looks in both endpoints.
We created Wikidata Query Service graph split documentation to describe how anyone should respond to the Wikidata graph split. The major technical issue is that if anyone wants citation data through the Wikidata Query Service, then they have to write a two-part query in which they seek some data from the Wikidata main graph, then get citation data from the Wikidata scholarly graph. The broken queries include those in affected tools which query or process citation data, so anyone who notices a bug in a tool should please report it.
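As a rough illustration of what a two-part query can look like, here is a minimal Python sketch which only builds the query text and does not send it anywhere. It assumes that the two parts are joined with the standard SPARQL 1.1 `SERVICE` keyword for federated queries. The function name `build_two_part_query`, the placeholder endpoint address, and the choice of example (people employed by an institution, then the articles which list them as authors) are my own assumptions for illustration. Please take the real endpoint addresses and the recommended query patterns from the graph split documentation.

```python
import re

# Standard Wikidata prefixes, declared explicitly so the query text does not
# depend on any endpoint predefining them.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def build_two_part_query(institution_qid: str, scholarly_endpoint: str, limit: int = 50) -> str:
    """Return query text with a main-graph part and a scholarly-graph part.

    Hypothetical helper for illustration only: `scholarly_endpoint` must be
    supplied by the caller, because this sketch does not know the real address.
    """
    if not re.fullmatch(r"Q[1-9]\d*", institution_qid):
        raise ValueError(f"not a Wikidata item ID: {institution_qid!r}")
    if not scholarly_endpoint.startswith(("http://", "https://")):
        raise ValueError("scholarly_endpoint must be an absolute URL")
    if re.search(r"[<>\s]", scholarly_endpoint):
        raise ValueError("scholarly_endpoint contains characters not allowed in an IRI")
    if limit < 1:
        raise ValueError("limit must be a positive integer")

    return PREFIXES + f"""SELECT ?author ?article WHERE {{
  # Part 1, main graph: people whose employer (P108) is the institution.
  ?author wdt:P108 wd:{institution_qid} .
  # Part 2, scholarly graph: articles listing those people as author (P50).
  SERVICE <{scholarly_endpoint}> {{
    ?article wdt:P50 ?author .
  }}
}}
LIMIT {limit}
"""


if __name__ == "__main__":
    # Both arguments are placeholders, not a real institution or endpoint.
    print(build_two_part_query("Q123", "https://scholarly.example.org/sparql", limit=10))
```

The point of the sketch is the shape of the query: the part outside the `SERVICE` block runs against whichever endpoint receives the query, and the part inside the block is forwarded to the other one. A tool which used to run a single query now has to decide which patterns belong in which part.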
The Scholia team hosts virtual hackathons where anyone can put issues or problems in the queue for the volunteer developer team to address in the next round. The April, November, and December events from 2025 all have documentation on what volunteers had to organize to prepare for the January 2026 graph split. It is extraordinary that volunteers put these events and labor together, but it is also common across Wikimedia projects that volunteers organize responses and adaptations to keep tools functional in response to Wikimedia Foundation platform changes.
Blazegraph migration
Wikidata was established in 2012 as the linked data complement to Wikipedia's prose, and was part of our strategy to keep Wikimedia projects technologically advanced. The software backend of Wikidata is the scrappy Blazegraph, which is free and open-source software. At the time of Wikidata adopting it, it already had its own independence, development team, and funding to sustain it. While no one can buy or close open-source software, companies can do acqui-hiring of every developer and expert on the software. Amazon acquired the Blazegraph team soon after Wikidata had committed to Blazegraph as its SPARQL engine for queries. Consequently, Wikidata's SPARQL engine backend has not had a significant update since Wikidata established its SPARQL endpoint in 2015.
While the Wikidata graph split relieves the Wikimedia Foundation servers of the intense computation required of a larger dataset, the graph split is not intended as a solution. If Wikidata users were allowed, they would upload datasets to fill Wikidata to capacity again. With current restrictions in place, and at the current rate of allowed Wikidata user editing, the Wikidata main graph should again reach data holding capacity in 2028. Blazegraph is now abandoned technology and inferior to alternatives. The planned solution to ready Wikidata for next-generation editing is to migrate Wikidata's SPARQL engine from Blazegraph to another, as yet undecided, database backend by summer 2027. Speculating on future capacity is only wishful thinking, but realistic estimates for the next backend range from 3 to 15 times the capacity that we get from Blazegraph.
In September 2025, the Wikimedia Foundation announced a schedule for a Wikidata Query Service backend update. It is good news for Wikidata editors that there is a newly appointed Wikidata Platform WMF staff team doing these changes. Everyone should support them and wish them all success. They are available to meet during scheduled office hours. Although I am an active WikiCite content contributor, I know nothing about options for database architecture or estimating computational capacity. I do not think it is worth speculating whether we now have a Wikidata Platform team because the technological environment has progressed to the point that we now have a ripe opportunity for migration, or whether now is the time to migrate because Wikidata is in existential crisis and migration is our desperation and compulsion. Wikimedia projects are a marvelous technological and social environment, and we face many threats, and it is a normal part of our reality that we repeatedly face major threats only to escape because some free and open-source software community has developed and freely provided the miraculous fix we need to survive. Software solutions which the Wikidata Platform team is testing include QLever and Virtuoso.
Another major change which is timely now is that when Wikidata migrates to a new SPARQL engine, we will migrate to standard SPARQL 1.1. The Wikidata Query Service has been using a customized version of SPARQL only for Wikidata. It was not adapted to be compliant with the latest version of SPARQL used elsewhere, so anyone using both Wikidata and other SPARQL applications would need to know the variations. Bringing Wikidata to conform with the SPARQL that everyone else uses will make SPARQL queries more reusable.
Selection of next-generation SPARQL engine
If all goes well, we should have a revived Wikidata by mid 2027 with greatly expanded capability for processing data and inviting institutional partnerships.
How we talk about challenges
The solution that I want for the graph split, and for many other existing Wikimedia Movement challenges, is simply to be able to see that there is some group of Wikimedians somewhere who have active communication about our challenges. I want to get public communication from leaders who acknowledge challenges and who have the social standing to freely discuss possible solutions. I want to see that someone is piloting the ship upon which we all sail, and which no one would replace if it ever failed and sank. For lots of issues at the intersection of technical development and social controversy (data management, software development, response to AI, adapting to changes in political technology regulation) I would like to see Wikimedia user leadership in development, and instead I get anxious about all the communication disfluency that we experience. Ten thousand of us or so participated in the 2018-2020 Wikimedia Movement Strategy, which had the goal of improving our governance infrastructure such that if we ever had a major problem, then we would quickly identify it and discuss it without fear. The Wikidata Graph Split is not the story here. The story here is that so much in the Wikimedia Movement is fragile, and that when we have major challenges, networks like WikiCite are unable to create chains of decision making to address them.
I appreciate all the effort that Wikimedia Foundation staff put into collaborating with the WikiCite community for the transition. As always with Wikimedia Foundation staff, they are friendly and committed for as long as their manager gives them a resource allocation to care for a project. It is a funny dynamic to have a group of passionate users who choose to work on a project talk with line workers who are supremely savvy at fixing problems, but who are also on command from a powerful and mysterious cloud of corporate software development, and who act after negotiation with both the Wikimedia Foundation and Wikimedia Deutschland. Communicating across a chain of staff to get to decision makers, over a 10-year conversation, and where the decision makers never quite enter the discussion with a schedule or a plan, is an odd thing to experience. Also, as I understand it, this is just normal for Internet tech platform development anywhere, and is the way that user communities experience software updates.
What you can do
- If you have a problem with something in Wikimedia platforms, be at peace with longer-term solutions, because sometimes a conversation happens for 10 years and then a solution comes.
- Participate in on-wiki conversations to make decisions.
- If you want to talk with the Wikimedia Platform team, then there are Migration office hours.
- Wikidata is currently having its boldest discussion on notability criteria at d:Wikidata:Requests for comment/Notability policy reform. Is WikiCite in scope? What about locations in OpenStreetMap? Should we graph split biographies? Can we do WikiCite, but for Internet Archive holdings instead of scholarly publications? Is it finally time to import all proteins and all astronomical objects?
- The Wikimedia Foundation and Wikimedia Deutschland agreed off-wiki that even after migration from Blazegraph, the split graphs will not be rejoined, even if the new platform has capacity. In Wikimedia projects, all kinds of decisions get made, and maybe they make sense or maybe not, but regardless they only get discussed in public if people ask. The tech decisions that get made affect the social relationships we grow in Wikimedia as a virtual space, and what kinds of collaborators we can invite in. I have no technical understanding of whether Wikidata should be federated into a series of graphs, but this is another decision that I know gave a lot of people anxiety over the years of discussion. If anyone has insight into the costs/benefits of federation, then it is needed in Wikidata Graph Split conversations.
- The Wikimedia Foundation operates Wikidata's API, and Wikimedia Deutschland operates everything else in Wikidata. They share power and money with each other. I do not know anyone in authority for Wikidata issues at either place, but right now is Valentine's Day and I think they could be better pals. If anyone can, get interviews with representatives from both and get them to say publicly that each one wants the other to perpetually have all the power and control and money that they currently do. If either objects, then get them to talk it through.
- Please sign to support meta:WikiCite (3), which is a proposal to establish WikiCite the citation database as an official Wikimedia project.










