Wikipedia:Using neural network language models on Wikipedia
This idea is in the brainstorming stage. Feel free to add new ideas; improve, clarify and classify the ideas already here; and discuss the merits of these ideas on the talk page.

With the rise of machine learning, discussions about Wikipedia and AI models are becoming more and more heated. As of December 2022, with the release of ChatGPT for free to the public, AI has shown its potential to either massively improve or disrupt Wikipedia. It is clear that research is needed to inform discussions surrounding potential AI policies, so I made this page to catalog my observations about ChatGPT and its potential uses based on its capabilities.
tl;dr: Don't use neural networks to generate content; use them to assist you in creating content. Especially in the neural network context, confidence in the result does not mean validity.
The chat transcripts here are donated to the public domain by me, and the bot's text here is not copyrightable, as there is no author who can claim copyright over it, at least under the current (2022) law. It is worth noting that OpenAI has used Wikipedia as a training dataset. I am open to removing these texts from the page if they are found to be a copyright violation.
Proposed guidelines
Based on my research below, here are my proposed guidelines on how to align neural network models to our purpose of building an encyclopedia. Some of the guidelines are obvious from common sense, but I think it's worth it to write them down.
- You may not ask neural networks to write original content or find sources, as these neural networks don't know what is right and what is wrong. Adding this kind of content would jeopardize compliance with Wikipedia's WP:OR and WP:RS policies. Even if the content is heavily edited by humans, seek alternatives that don't use the neural network's original content.
- You may use these neural networks as a writing advisor, i.e. asking for outlines, asking how to improve a paragraph, asking for criticism of the text, etc. However, you should be aware that the information they give you can be unreliable and flat-out wrong. Use due diligence and common sense when choosing whether to incorporate the neural network's suggestions.
- You may use these neural networks for copyediting and paraphrasing, but note that they may not properly detect grammatical errors or keep the key information intact. Use due diligence and heavily edit the response from the neural network.
- Use due diligence when crafting prompts for neural networks. Prompts designed for Wikipedia should use natural sentences, be as descriptive as possible, and include keywords such as "encyclopedic", "keep the meaning intact", etc. to discourage the AI from adding original content.
- You are responsible for making sure that your use of a neural network is not disruptive to Wikipedia. Therefore, you must state in the edit summary whether the edit used a neural network and what you used it for.
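The last two guidelines can be sketched as a pair of small helpers. This is only an illustration: the keyword list comes from the examples in the guideline, while the function names and the edit-summary wording are hypothetical and not an established convention.

```python
# Hypothetical helpers illustrating the proposed guidelines.
# Function names and the summary wording are assumptions.

# Keywords suggested in the guideline for Wikipedia-oriented prompts.
SUGGESTED_KEYWORDS = ("encyclopedic", "keep the meaning intact")


def missing_keywords(prompt: str, keywords=SUGGESTED_KEYWORDS) -> list[str]:
    """Return the suggested keywords that do not appear in the prompt."""
    lowered = prompt.lower()
    return [kw for kw in keywords if kw.lower() not in lowered]


def disclose_in_summary(summary: str, tool: str, purpose: str) -> str:
    """Append a disclosure of neural network use to an edit summary."""
    return f"{summary} (used {tool} for {purpose})"
```

For example, `missing_keywords("Could you copyedit this paragraph in an encyclopedic tone?")` reports that "keep the meaning intact" is still absent, and `disclose_in_summary("Copyedit lead", "ChatGPT", "copyediting suggestions")` produces a summary that states both the tool and its purpose.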
Planning an article

It is no surprise that the bot can give coherent answers, since it is based on the earlier GPT-3 model. As many have noted, original content from AI models should not be imported directly to Wikipedia due to sourcing and accuracy concerns. I am very impressed, however, by the fact that the bot knows about our internal policies and gives a reasonable outline of how a Wikipedia article may be structured. It seems ChatGPT uses Wikipedia's policy pages in addition to articles for its dataset.
Based on the results, AI models seem to be a very powerful brainstorming tool, and with prompt engineering, they allow an impressive amount of refinement to the plan. AI can also be a great tool for pointing to potential sources and can remind editors of Wikipedia's content policies (NPOV, RS, etc.). Even though original content from AI is not suitable for import into Wikipedia, it can be used by editors as inspiration for research ideas. In the future, when Abstract Wikipedia becomes a thing, AI tools can be a massive help for organizing information in the planning stage of an article. This research is somewhat limited by the fact that the article SpaceX Starship already existed when the AI was trained.
Transcript
- Ok, let's try it out then. Could you make an outline for a Wikipedia article about Starship, a rocket by SpaceX?
Sure! Here is a potential outline for a Wikipedia article about Starship, a rocket developed by SpaceX:
- Introduction
- Brief overview of Starship and its purpose
- Overview of SpaceX and its history
- Development and design of Starship
- Description of the development process and timeline
- Overview of the design and features of Starship
- Any notable updates or changes to the design
- Testing and launches of Starship
- Details of any test flights or launches that have occurred
- Any notable successes or failures
- Plans for future launches and missions
- Future plans and potential uses for Starship
- Description of any future plans or goals for the rocket
- Potential uses for Starship, such as crewed missions to the Moon or Mars
- Comparison to other rockets and space technologies
- Conclusion
- Summary of the main points of the article
- Future prospects for Starship and SpaceX
This outline is just a suggestion and could be modified or expanded upon as needed. It is also important to consider the specific focus and audience of the article when deciding on the content to include.
- I want the article to focus more on providing technical details about the rocket, while still being easy for non-technical readers to read the article. Could you refine the outline a bit more please?
Certainly! Here is a revised outline for a Wikipedia article about Starship, with a focus on providing technical details about the rocket in a way that is easy for non-technical readers to understand:
- Introduction
- Brief overview of Starship and its purpose
- Overview of SpaceX and its history
- Technical details of Starship
- Description of the rocket's design and features, including its size, materials, engines, and other key components
- Explanation of how the rocket functions and operates
- Any notable updates or changes to the design
- Testing and launches of Starship
- Details of any test flights or launches that have occurred
- Any notable successes or failures
- Plans for future launches and missions
- Potential uses and future plans for Starship
- Description of any future plans or goals for the rocket
- Potential uses for Starship, such as crewed missions to the Moon or Mars
- Comparison to other rockets and space technologies
- Conclusion
- Summary of the main points of the article
- Future prospects for Starship and SpaceX
This outline is just a suggestion and could be modified or expanded upon as needed. It is important to consider the specific focus and audience of the article when deciding on the content to include.
- That's a great outline! Now, could you try to make a short paragraph about the rocket's statistics
- Mass: 5000 metric tons when fueled
- Height: 120 meters
- Diameter: 9 meters
- Thrust: 75.9 meganewtons
- Capacity to low Earth orbit: 100 metric tons
- Add comparison with the Saturn V rocket, which is 110 meters tall, 2800 tons in weight, and has a thrust of 33 meganewtons
Sure! Here is a short paragraph about the rocket's statistics:
Starship is a massive rocket, weighing in at 5000 metric tons when fueled. It stands at a towering height of 120 meters, with a diameter of 9 meters. With a thrust of 75.9 meganewtons, it has the capacity to lift 100 metric tons to low Earth orbit. For comparison, the Saturn V rocket, which was used during the Apollo program, was 110 meters tall, weighed 2800 tons, and had a thrust of 33 meganewtons. This demonstrates the impressive capabilities of Starship and its potential for future space exploration.
- That's a nice paragraph, but could you try to tone down the paragraph a bit and make it more in line with Wikipedia's encyclopedic voice?
Certainly! Here is a revised version of the paragraph that is written in a more encyclopedic style:
Starship is a spacecraft developed by SpaceX that has a mass of 5000 metric tons when fueled. It has a height of 120 meters and a diameter of 9 meters. Its thrust capacity is 75.9 meganewtons. Its intended payload capacity is 100 metric tons to low Earth orbit. For comparison, the Saturn V rocket, which was used during the Apollo program, had a height of 110 meters, a mass of 2800 tons, and a thrust of 33 meganewtons. This demonstrates the impressive capabilities of Starship and its potential for future space exploration.
- I like this paragraph more than the last one. Now, could you make the paragraph more coherent?
Certainly! Here is a revised version of the paragraph that is more coherent and flows more smoothly:
Starship is a spacecraft developed by SpaceX with impressive capabilities for space exploration. It has a mass of 5000 metric tons when fueled, stands at a height of 120 meters, and has a diameter of 9 meters. Its thrust capacity is 75.9 meganewtons, and it has the capacity to lift 100 metric tons to low Earth orbit. For comparison, the Saturn V rocket, which was used during the Apollo program, had a height of 110 meters, a mass of 2800 tons, and a thrust of 33 meganewtons. This shows the significant advancements in technology and capabilities between the two rockets.
- Alright, I like this paragraph the most! What sources should I reference and cite when I publish this paragraph to Wikipedia?
It is important to properly cite your sources when publishing content on Wikipedia to ensure that the information is accurate and can be verified by other readers. When citing sources for the information in the paragraph about Starship, you could consider including the following:
- A reference to SpaceX's website or other official documentation about the Starship spacecraft
- A reference to a reliable news article or report about the rocket's development and testing
- A reference to a technical publication or report that provides more detailed information about the rocket's design and capabilities
You can use the citation style guidelines for Wikipedia to format your citations properly. It is also a good idea to include multiple sources for each piece of information, to ensure that the information is accurate and can be easily verified.
- Agreed. Thanks a lot for your useful feedback!
You're welcome! I'm glad that I could help. If you have any other questions or need further assistance, don't hesitate to ask.
Copyediting paragraphs

AI copyediting of Wikipedia text as of 2022 can slightly reduce the work copyeditors need to do. However, human supervision is critical when using such tools. This task relies heavily on prompt engineering for the AI to give satisfactory results. I settled on the prompt "Can you copyedit this paragraph from Wikipedia while still keeping the tone and the information as intact as possible:" followed by the paragraph without citations in plain text. There seems to be room for improvement in the prompt, as ChatGPT may occasionally give text with run-on sentences or grammatical errors, but other than that, the text is usually clearer after a run through the AI.
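The prompt preparation described above (plain text, citations removed, fixed instruction in front) can be sketched as follows. The assumption here is that citations appear as bracketed numeric markers such as "[1]"; the function names are hypothetical, and the resulting string is meant to be pasted into the chat by hand.

```python
import re

# The instruction quoted in the text above.
COPYEDIT_INSTRUCTION = (
    "Can you copyedit this paragraph from Wikipedia while still keeping "
    "the tone and the information as intact as possible:"
)

# Assumption: citations show up as bracketed numbers such as [1] or [23].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citations(paragraph: str) -> str:
    """Remove bracketed numeric citation markers and tidy leftover spaces."""
    without_markers = CITATION_MARKER.sub("", paragraph)
    return re.sub(r"\s{2,}", " ", without_markers).strip()


def build_copyedit_prompt(paragraph: str) -> str:
    """Put the fixed instruction in front of the citation-free paragraph."""
    return f"{COPYEDIT_INSTRUCTION}\n\n{strip_citations(paragraph)}"
```

Other citation styles, such as named footnotes or "[citation needed]" tags, are not handled by this pattern and would need their own rules.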
Even though the AI is conservative about removing information and details, the text's length usually decreases by quite a bit as it removes redundant phrases. The AI is also good at reordering phrases to make the text more coherent, but at the cost of grammatical errors and obscured meaning. In more developed articles, the AI seems to make more minor fixes to the text and is less inclined to slash content. In my opinion, ChatGPT can be used on Wikipedia as a coherence checker, as long as care is taken to make sure that no critical information is lost.
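One way to support the "no critical information is lost" check is to compare the original and the AI-edited text mechanically before reading them side by side. The sketch below is a crude heuristic, not a substitute for human review: it reports the length ratio and lists numbers and capitalized words that appear in the original but not in the edited text. The function name and the choice of what counts as a "detail" are assumptions.

```python
import re

# Assumption: numbers and capitalized words are a rough proxy for key details.
DETAIL_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\b[A-Z][a-z]+\b")


def compare_versions(original: str, edited: str) -> dict:
    """Report how much the text shrank and which details vanished."""
    original_details = set(DETAIL_PATTERN.findall(original))
    edited_details = set(DETAIL_PATTERN.findall(edited))
    return {
        # Ratio of edited length to original length, in characters.
        "length_ratio": len(edited) / len(original) if original else 0.0,
        # Details present in the original but absent from the edited text.
        "missing_details": sorted(original_details - edited_details),
    }
```

A non-empty `missing_details` list only says where to look first; a detail may have been legitimately cut as redundant, and a rephrased fact can go missing without tripping this check at all.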
I've published the AI-generated texts below on Wikipedia after very heavily modifying them. Overall, I think that ChatGPT does reduce the copyediting work quite a bit, but not as much as a lot of people make it out to be. Think of the AI response as a "second opinion" about what to cut, not as an authoritative answer.
Transcripts
—Land
There are several methods of irrigation. They vary in how the water is supplied to the plants. The goal is to apply the water to the plants as uniformly as possible, so that each plant has the amount of water it needs, neither too much nor too little. Irrigation can also be understood whether it is supplementary to rainfall as happens in many parts of the world, or whether it is 'full irrigation' whereby crops rarely depend on any contribution from rainfall. Full irrigation is less common and only happens in arid landscapes experiencing very low rainfall or when crops are grown in semi-arid areas outside of any rainy seasons.
Surface irrigation, also known as gravity irrigation, is the oldest form of irrigation and has been in use for thousands of years. Micro-irrigation, sometimes called localized irrigation, low volume irrigation, or trickle irrigation is a system where water is distributed under low pressure through a piped network, in a pre-determined pattern, and applied as a small discharge to each plant or adjacent to it. Micro-irrigation uses less pressure and water flow than overhead irrigation. Drip irrigation trickles out at the root zone. In sprinkler or overhead irrigation, water is piped to one or more central locations within the field and distributed by overhead high-pressure sprinklers or guns. Subirrigation has been used for many years in field crops in areas with high water tables. It is a method of artificially raising the water table to allow the soil to be moistened from below the plants' root zone.
There are several methods of irrigation that differ in how water is supplied to plants. The aim is to provide each plant with the necessary amount of water, neither too much nor too little, as uniformly as possible. Irrigation can be supplementary to rainfall, which is common in many parts of the world, or it can be full irrigation, where crops rarely rely on any contribution from rainfall. Full irrigation is less common and only occurs in arid landscapes with very low rainfall or when crops are grown in semi-arid areas outside of rainy seasons.
Surface irrigation, also known as gravity irrigation, is the oldest form of irrigation and has been in use for thousands of years. It involves the application of water to the surface of the soil. Micro-irrigation, also known as localized irrigation, low-volume irrigation, or trickle irrigation, is a system that distributes water under low pressure through a piped network and applies it as a small discharge to each plant or near it. It uses less pressure and water flow than overhead irrigation. Drip irrigation delivers water directly to the root zone of plants. In sprinkler or overhead irrigation, water is piped to one or more central locations within the field and distributed by high-pressure sprinklers or guns. Subirrigation has been used in field crops in areas with high water tables for many years. It involves artificially raising the water table to moisten the soil below the root zone of plants.
—Irrigation, some paragraphs are removed
Shorten a bloated section

Based on User:JPxG's optimistic results using ChatGPT to condense plot summaries, I tried my hand at condensing bloated sections in general, a problem that some articles in Wikipedia:Vital articles are guilty of. I found ChatGPT to be prone to the "garbage in, garbage out" problem: if the text contains a lot of junk and not enough useful details, then it may try to repackage that junk in the result, even though you have explicitly told it not to do so.
New York City
I first tried to shorten New York City's lead using this descriptive prompt: "Could you try to shorten this opening section of Wikipedia's article New York City to about 3/4 the length so that it is encyclopedic, neutral, non-promotional and easy to understand for the reader? Do remove trivia such as surveys if necessary and do not make up details." I then pasted in the paragraphs one by one. This first pass was not meant to shorten the text drastically, but to clear out junk so that the text would be easier to shorten later. Keep in mind that the original lead is basically a big steaming pile of junk and trivia. Here's what it spat out:
New York City is the most populous city in the United States, with a population of over 8 million people. Located at the southern tip of New York State, it is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass. New York City is a major cultural, financial, and media center, with a significant influence on commerce, healthcare, research, technology, education, politics, tourism, art, fashion, and sports. It is home to the headquarters of the United Nations and is known for its international diplomacy and as a safe haven for global investors. The city is also known for its diverse population and linguistic diversity, with over 800 languages spoken. New York City is home to the New York Stock Exchange and Nasdaq, the two largest stock exchanges in the world by total market capitalization, and is a major hub for the art market, with many art galleries and auction houses hosting half of the world's art auctions.
New York City is located in the state of New York and is made up of five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. It is known for its diverse population, with over 800 languages spoken, and for having the largest foreign-born population of any city in the world. The city's metropolitan area has a strong economy, with an estimated gross metropolitan product of over $2.1 trillion in 2022. New York City is home to a high number of billionaires and millionaires and was the wealthiest city in the world in 2017. New York City was founded as a trading post by Dutch colonists on the southern tip of Manhattan Island in the early 17th century. It was named New Amsterdam and later became a chartered city. The English took control of the city in 1664 and renamed it New York after the Duke of York. The city served as the capital of the United States from 1785 to 1790 and has been the largest U.S. city since that time. The Statue of Liberty is an iconic symbol of the city and the country, and has welcomed millions of immigrants to the U.S. over the years. New York City is known for its creativity, entrepreneurship, and sustainability, and is a symbol of cultural diversity. The New York Times is a renowned newspaper based in the city that has won numerous Pulitzer Prizes.
New York City is a popular global tourist destination, known for its iconic landmarks and entertainment industry. Times Square, located in the Broadway Theater District, is a major hub for tourists and the entertainment industry. The city is known for its fast pace and many famous landmarks, skyscrapers, and parks.
New York City is home to numerous landmarks and cultural institutions, including the Stonewall Inn in Greenwich Village, which is considered the epicenter of LGBTQ+ culture and the birthplace of the modern gay rights movement. The city is known for its expensive real estate market and its 24/7 subway system, the largest single-operator rapid transit system in the world. It is also home to over 120 colleges and universities, including Columbia University, New York University, and the City University of New York system. Wall Street, located in the Financial District of Lower Manhattan, has made New York City a global financial center and a hub for the art market, with many galleries and auction houses hosting half of the world's art auctions.
Integrate as a Wikipedia assistant

Optimistic about the result, I started using ChatGPT as an assistant of sorts to help me rewrite the article about OpenStreetMap. As the article stood before my edits, it was a mess patched together from primary sources. Of course, I reviewed all AI output thoroughly before submitting it to Wikipedia, but most of my queries will probably be about brainstorming and advice anyway. I won't delve too much into the transcripts here unless they are relevant to the discussion.
Planning stage
Before actually querying ChatGPT for anything, I rewrote the lead based on my knowledge of the project to help with my prompt ideas. Even knowing that I could just ask ChatGPT for a lead, I found that doing so helps a lot in reorienting yourself to improve individual sections.
Then, I asked ChatGPT for a potential outline for the article, and this is what it spits out:
- Introduction
- Brief overview of OpenStreetMap and its purpose
- Definition of key terms, such as "open source" and "map data"
- History of OpenStreetMap
- Founding of OpenStreetMap and early development of the project
- Key events and milestones in the history of OpenStreetMap
- Growth and evolution of the OpenStreetMap community
- Features and capabilities of OpenStreetMap
- Description of the types of map data available on OpenStreetMap (e.g. roads, buildings, points of interest)
- Discussion of the data sources used by OpenStreetMap, including satellite imagery and user contributions
- Overview of the tools and technologies used by OpenStreetMap contributors
- Community and governance
- Description of the OpenStreetMap community, including its size, demographics, and geographic distribution
- Overview of the governance and management of the project, including the role of the OpenStreetMap Foundation
- Discussion of the role of volunteers in the maintenance and development of OpenStreetMap
- Use cases and applications
- Discussion of the various ways in which OpenStreetMap is used, such as in mapping applications, navigation systems, and disaster response
- Mention of any notable companies or organizations that use OpenStreetMap in their products or services
- Controversies and challenges
- Overview of any controversies or challenges faced by OpenStreetMap, such as issues with data quality or copyright
- Conclusion
- Summary of the main points of the article
- Discussion of the significance of OpenStreetMap and its impact on the field of mapping and geographic information
The outline is a very good reference. It showed me aspects of OpenStreetMap (OSM) that I hadn't considered before and organized them in a way that is at least wiki-like (except for the conclusion). The general gist of the outline made its way into the article, without point 6 of course, since OSM is not controversial enough to warrant such a section.
Out of curiosity, I asked ChatGPT for potential sources. This part requires careful prompt engineering to make sure that it only gives general pointers for further research, like this:
To research OpenStreetMap for the article, you may want to start by visiting the OpenStreetMap website (https://www.openstreetmap.org/) and reading about the project's history and features. You can also browse the OpenStreetMap wiki (https://wiki.openstreetmap.org/) for more detailed information about the project. In addition, you should look for reliable sources such as news articles, academic papers, and other published works about OpenStreetMap to use as references in your article. Finally, you may want to consider reaching out to members of the OpenStreetMap community for additional insights and perspectives on the project.
and does not list specific non-existent but very credible-sounding sources like these:
- "The Making of OpenStreetMap: An Analysis of Community and Crowdsourced Geographic Information" by Schuyler Erle and Jo Walsh (2014)
- "OpenStreetMap: User-Generated Street Maps" by Schuyler Erle (2006)
- "Volunteered Geographic Information and OpenStreetMap: Mapping the Developing World" by Michael J. Widener and Jennifer S. Light (2012)
- "Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice" edited by Michael F. Goodchild, Mike F. Goodchild, and Jan Elbæk Pedersen (2014)
- "OpenStreetMap: A Review and Future Challenges" by Michael J. Widener and Jennifer S. Light (2016)
As a bonus point, it gives good advice for writing Wikipedia articles along the way, such as this:
You may also want to search for more recent articles and papers on the subject to get a more up-to-date understanding of OpenStreetMap and its role in the field of mapping and geographic information. It's important to use reliable sources when writing a Wikipedia article, so be sure to carefully evaluate the credibility and relevance of any sources you use.
Detecting generated text

One of the main concerns about using these language models is that a person cannot tell whether a text is original or written by AI. However, in my opinion, this can be easily solved by designing an algorithm or AI that does just that. In a demo by Hugging Face at [1] (based on RoBERTa), even with a heavily edited paragraph (such as those in § Copyediting paragraphs), the detector can tell AI text from human-written text with extremely high confidence (>99%); make sure to remove reference notes such as "[1], [2]" beforehand. Such a model could be extremely useful for ORES, a MediaWiki machine learning API primarily used to detect vandalism in Special:RecentChanges. Over time, however, these detection models will likely have a harder time finding "abnormalities" as AI text generation becomes more sophisticated.
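A check of this kind could be wired into a review workflow roughly as below. This is a sketch under stated assumptions: `detector` stands for any callable that returns the probability that a text is machine-generated (for instance a wrapper around a RoBERTa-based classifier), and both that interface and the 0.99 threshold, taken from the ">99%" figure above, are illustrative rather than part of any existing tool. Reference notes are stripped first, as recommended above.

```python
import re
from typing import Callable

# Reference notes such as [1] or [2] are removed before detection.
REFERENCE_NOTE = re.compile(r"\[\d+\]")


def flag_generated(
    text: str,
    detector: Callable[[str], float],
    threshold: float = 0.99,
) -> bool:
    """Return True when the detector is at least `threshold` sure the text is AI-written.

    `detector` is a hypothetical callable returning a probability in [0, 1].
    """
    cleaned = REFERENCE_NOTE.sub("", text)
    return detector(cleaned) >= threshold
```

Keeping the detector behind a plain callable means the model can be swapped out as text generation improves without changing the calling code; a flagged edit would still need a human to look at it.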
See also
- User:JPxG/LLM demonstration, a similar experiment done by a different user