Wikipedia talk:Large language models/Archive 5

This is an archive of past discussions on Wikipedia:Large language models. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.


AI Generated Content: (Ban, Limit or Allow)?

(Moving this from Village Pump's policy board)

Prompted by some responses in the Wikimedia Community Discord to a query I had about using AI tools (such as LLMs or Alpha) to generate content for wikis.

I present three choices of wording for a policy/guideline on the inclusion or use of AI-generated content, such as that from LLMs:

1. "Wikipedia is entirely a work of collaborative human authorship. Use or contribution of material generated wholly or mostly in part from non human sources (such as LLM-based generation) is prohibited."

2. "Wikipedia is primarily a work of human authorship. Content generated by AIs (such as LLM-based generation) should be used sparingly, and content generated with their assistance should be clearly identifiable as such, with full attribution of the tools or models used."

3. "Wikipedia is a collaborative work, and users may make use of appropriate tools such as LLMs (with appropriate attribution), in order to further this aim."

This of course assumes that the generated content meets all other considerations for content that would apply irrespective of human vs AI generation.

I'm not going to argue for any specific position, but my concerns about AI-generated content are the lack of clarity and transparency about usage rights under a compatible license, and the possibility of copyrighted material 'leaking' into an otherwise 'freely' licensed wiki.

English Wikipedia should have a clearly documented policy, on what 'AI/machine-generated' content can or cannot be included.

I also appreciate that there are plenty of passive bots on Wikipedia that assist skilled users in performing tasks that would be time consuming to do manually. ShakespeareFan00 (talk) 12:29, 11 April 2023 (UTC)

It's not merely a matter of copyright "leaking" (i.e., chunks of material from the model's training set coming out in the material it generated.) At least some of the providers of AI tools are claiming copyright in the output. While there is much question as to the legal viability of such a move, to the best of my not-a-lawyerly knowledge this has not yet been settled in court. As such, we may discover at a later date that query results that have been integrated into the text are actually the copyright of an AI-generating firm that did not consent to its inclusion, even if they are not currently claiming that right. One could in theory have an AI system trained only on public domain texts and issuing its results with a Creative Commons license, but that is not the default, if it exists at all. As such, it may behoove us to wait until there is better established law that clears these concerns. --Nat Gertler (talk) 14:25, 11 April 2023 (UTC)

I like the "work of collaborative human authorship" language. Another option would be to treat LLMs as bots and require them to go through the bot approval process. –dlthewave ☎ 18:15, 11 April 2023 (UTC)

How would that work with individual (human) authors using LLM-drafted content? Perhaps a policy that, to use LLM-derived content, you must have a bot flag for those edits? ShakespeareFan00 (talk) 19:28, 11 April 2023 (UTC)

I also like work of collaborative human authorship. The bots that operate at all in the mainspace are maintenance-focused at best, rescuing sources and auto-reverting the most blatant vandalism. I don't even like the thought of whitelisting LLMs for use because of how it could spiral out of control, given their lack of human constraints.

While I have a great gratitude for editors arguing the copyright side of AI-generated content, my focus lies in the logistics of maintaining veracity and due-attention in the content of articles itself. Large language models are by their very nature and development, made to be believed, not to be true. They are grown to replicate human language and be able to pass for the work of a person, not to furnish truth or to adhere to good research practices. If you developed an LLM on content or writings exclusively written in British English, the algorithm would swear up and down the garden path that the word color can only be spelled colour. The algorithm cannot discern, and that is a problem. They are primarily concerned with responding to prompts with an answer that is both plausible and delivered like a human because these are what makes them convincing.
TL;DR Even in the hands of experienced editors, LLMs will deliver misinformation to their users and will, by their very design, be as convincing with these falsehoods as possible. GabberFlasted (talk) 18:59, 11 April 2023 (UTC)

Another concern. LLMs cannot necessarily infer bad intentions on the part of their users. If you ask a "wrong" question, the right way, or a "right" question the wrong way, you might get something unexpected. ShakespeareFan00 (talk) 19:37, 11 April 2023 (UTC)

+1 - you wrote exactly what I was going to say. LLMs are designed to be the very best at producing plausible content to fool humans, and care about truth only to the extent that truth can help fool humans. If there isn't already an essay explaining this there should be one. I think the only tenable option for the survival of Wikipedia as a reliable work is banning LLMs. Galobtter (talk) 19:58, 11 April 2023 (UTC)

I agree with an approach based on how programs are used, rather than the specific underlying technologies, which are evolving and becoming increasingly embedded into many programs. Ultimately, though, if there is a significant increase in poorly written submissions, the community needs to figure out a way to handle them. A policy alone isn't going to prevent them. isaacl (talk) 21:41, 11 April 2023 (UTC)

"becoming increasingly embedded into many programs"

Indeed, when LLMs are by default completing everything you write in MS Word and Google Docs, a policy isn't going to do anything anymore. PopoDameron ⁠talk 23:05, 11 April 2023 (UTC)

I would be opposed to a complete ban on several grounds; it's impractical and doesn't recognize the wide range of ways LLMs might be used. The key thing is that we have to make sure everything added adheres to our existing core content policies; enforcing those properly will make most of the problems go away (aside from maybe the more hand-wavy copyright issues, which I feel are speculative and would oppose writing policy around today - as opposed to when LLMs spit out concrete copyvios, of course, which fall under current policy.) The one thing I feel we might want to consider is tighter rules on automated or semi-automated mass-creations, which LLMs might enable and which we should probably require clear prior approval for on a case-by-case basis. --Aquillion (talk) 16:49, 12 April 2023 (UTC)
If there is a particular way LLMs are useful, then that use can be allowed, but I don't see any issue with banning use until that is shown. My issue with allowing use is that it makes creating content much much quicker than verifying it, especially since it seems LLMs are very adept at creating fake references. It also allows for good-faith users to create lots of false, unverifiable information inadvertently - this is something that is much harder without LLMs. Even if mass creation is not done, this still is a big issue, as we rely on the fact that good faith users generally add verifiable content.

Enforcing this policy is going to be hard, but my goal is that at least good-faith editors aren't misled into thinking LLMs are a useful or endorsed way to write articles. Galobtter (talk) 08:57, 13 April 2023 (UTC)
Some other concerns I had thought of from reading the article are WP:SYNTH and WP:REFLOOP, if an LLM has been trained on wiki-based sites. I'm also concerned that, without significant oversight, "hallucinated" content about a BLP could be inserted. (Aside: I am reminded that one of A.P. Herbert's 'Misleading Cases' was about suing a computer for defamation.) ShakespeareFan00 (talk) 09:36, 13 April 2023 (UTC)
I don't think there's an LLM whose dataset doesn't include Wikipedia, so yeah I think almost by definition LLM additions violate our core content policies. Galobtter (talk) 21:02, 13 April 2023 (UTC)
I assume that what you mean here is circular citation. While this is certainly true, I don't think this requires anything close to a complete prohibition: previous discussion here (and the several demo pages linked to from here) have given a litany of constructive uses. I agree that citing content directly to the language model as a source is unbelievably dumb and bad, and that blindly pasting output from the model into Wikipedia is also dumb and bad (which is an opinion shared by many, hence its inclusion on most if not all guidance pages that have been written thus far). jp×g 01:34, 14 April 2023 (UTC)
There are many reasons not to use LLM-generated prose, but I don't think citogenesis would be an issue as long as everything is cited and verified. If an LLM happened to output a direct copy of a Wikipedia article, for example, it would be no different from Copying within Wikipedia. The only thing to worry about would be proper attribution. –dlthewave ☎ 15:51, 15 April 2023 (UTC)

@Galobtter: There are examples of their use in the linked transcripts here (to wit: User:JPxG/LLM demonstration, User:JPxG/LLM demonstration 2, User:Fuzheado/ChatGPT, User:DraconicDark/ChatGPT, and Wikipedia:Using neural network language models on Wikipedia/Transcripts). The issues with fake references are indeed bad. Any time a person types "Write a Wikipedia article about XYZ" into ChatGPT and pastes the output straight into mainspace, it is trash and should be deleted (and I think WP:G3 should be expanded to this effect), but there are many other ways to use these models. jp×g 01:39, 14 April 2023 (UTC)
For the cases where it is actually used for creating content in Wikipedia, which is what I care about, what I'm seeing is mostly evaluations of the like "almost all incorrect" etc. Stuff like giving suggestions is not really what I care about, and the use cases shouldn't be conflated. Galobtter (talk) 03:24, 14 April 2023 (UTC)
It seems to me that every few weeks, a new discussion about this subject is started, and more or less the same points are raised as during the previous ones. In this case, the specific issues of whether LLM output should be covered under WP:BOTS (i.e. require BAG approval) and whether it inherently violates copyright were discussed at great length here and at the village pump in January. The product of these discussions, more or less, exists at WP:LLM and WP:LLMCOPY. In my opinion, it would be quite useful (and perhaps necessary) to work towards adopting or rejecting an existing proposal, or at least toward addressing whether or not existing proposals are good, versus having the same discussions a priori each time. jp×g 01:31, 14 April 2023 (UTC)
- I think we should start speeding things up toward initiating the policy proposal in the next few days. —Alalch E. 14:03, 14 April 2023 (UTC)
- Maybe start a pre-VPP RfC on this talk page with the options: A—this draft is finished enough to be proposed as the 'Large language models policy', B—this draft is not finished enough to be proposed as the 'Large language models policy', C—there should not be a new policy about this. It would be purely consultative in nature, and would not actually prevent someone from proposing it as a policy, if, for example, C gets the most support. —Alalch E. 14:21, 14 April 2023 (UTC)
  This discussion was originally at VPP. I was asked to read the draft policy, and the thread seemed better here.
  
  I think the VPP RfC should be of the Ban, Limit, Allow style of debate.
  
  ShakespeareFan00 (talk) 14:30, 14 April 2023 (UTC)
  The VPP RfC is going to be adopt / don't adopt new policy (this is a draft for that policy). —Alalch E. 14:44, 14 April 2023 (UTC)
  It may be better to get a view first on what general bounds the community agrees upon for use of writing assistant tools, and then revise the draft policy accordingly. This would increase the likelihood of the policy lining up with community consensus. isaacl (talk) 15:06, 14 April 2023 (UTC)
  isaacl, I strongly agree that we should get consensus on 3 or 4 "big questions" before presenting the entire page for adoption. We don't want one unpopular point to derail the whole thing. –dlthewave ☎ 16:57, 14 April 2023 (UTC)
  I also agree that we should start with an RfC with the overarching questions, like "Should LLMs be banned?" and such. PopoDameron ⁠talk 18:19, 14 April 2023 (UTC)
  Such an RfC was already held at VPP, essentially: Wikipedia:Village pump (policy)/Archive 179#Crystallize chatbot discussions into a policy?. At the very least, there was no consensus for a blanket ban. While more editors have become aware of the problem due to the ANI threads that have appeared in the meantime, little changed—everything that has been happening was predicted at some point, so there's not much new information that can be expected to influence someone to change their mind.
  What are the other 2-3 "big questions"? —Alalch E. 10:15, 15 April 2023 (UTC)
  Since there doesn't seem to be appetite for an outright ban, questions like "Should LLM use be allowed for minor edits but banned from use in content creation?" and "Should LLM users be required to obtain bot approval per WP:MEATBOT?" would help establish limits around the most potentially disruptive uses. –dlthewave ☎ 13:00, 15 April 2023 (UTC)
  Would the first question refer to WP:MINOR edits, or does it refer to non-major edits, or just edits that don't add new content? The second question addresses a novel idea, that can maybe be discussed here and worked out within the draft, before proposing to the wider community. —Alalch E. 13:16, 15 April 2023 (UTC)
  Edits that don't add new content. I'm not quite sure how to define it, but the idea is to allow straightforward tasks like changing the color scheme of a table while restricting anything with the potential for "hallucinations". –dlthewave ☎ 15:37, 15 April 2023 (UTC)
  So maybe a three-option RfC: A—blanket ban; B—ban on all use except for (enumerated?) uses that don't involve the risk of adding hallucinations (straightforward tasks like changing the color scheme of a table); C—no blanket ban and not B either? —Alalch E. 16:36, 15 April 2023 (UTC)
  Yes, that's what I had in mind, thanks. Another option I had in mind would be something like "consensus required", where an editor wishing to use a LLM for a certain task would have to demonstrate its reliability and gain approval through either an RfC or Bot Approvals Group. Would this add too much complication? –dlthewave ☎ 12:48, 16 April 2023 (UTC)
  Personally I think it might trigger a lot of discussion that might be better directed at changing the general guidance rather than dealing with an individual exception. Do you have some examples in mind to help illustrate the kinds of tasks you are thinking of? isaacl (talk) 17:11, 16 April 2023 (UTC)
  If we agree the first option (RfC as described above) is a good idea, we might as well stick with that first option. Another thing to do before asking if the draft (or... a draft) on this page is something that should be proposed as the policy is getting a view on what general bounds the community agrees upon for use of writing assistant tools, which would be a separate RfC (or perhaps not an actual RfC). So there are approximately four things to do before VPP: (1) the "blanket ban or something close to it RfC", (2) the "writing assistant discussion", (3) revising the draft, (4) the "ready to go?" pre-VPP RfC. Is that about right? —Alalch E. 20:01, 16 April 2023 (UTC)
  I'm not sure if we have the same understanding about what you called the "writing assistant" discussion. I think if there is support for defining specific uses of tools (regardless of underlying technology), then there will have to be a discussion to agree upon those uses. (Roughly speaking, working out the details for a position between "ban all uses" and "allow all uses".) Is this what you have in mind? I don't think a formal RfC to establish if the draft is ready to proceed to an RfC is needed, but certainly soliciting more viewpoints from editors other than those who participated in creating the draft can help. isaacl (talk) 21:02, 16 April 2023 (UTC)
  When you had mentioned writing assistant tools, I interpreted that in the context of how people have been bringing up that LLM-powered applications are becoming ubiquitous: from Microsoft Editor to grammar checkers (some of which have even branched off into full-fledged text-generating apps like GrammarlyGo). So, as in the question of: What is even meant when a policy would refer to an "LLM"? But that's not what you meant, I get it.—Alalch E. 21:19, 16 April 2023 (UTC)
  It's related to what I meant. As someone mentioned previously, I don't think anyone is concerned about the technology underlying grammar checkers. Rather than have a policy about technology, I think having a policy about the extent to which tools can be used to help with writing versus doing the writing (even if based on some direction from humans) would better align with the concerns of the community. isaacl (talk) 21:37, 16 April 2023 (UTC)
  So something like "Should editors be prohibited from using writing assistant tools (ranging across grammar checkers, tools that offer writing suggestions, all the way up to the tools that generate new text or source code) to make their contributions in whole or in part, and if no (not prohibited), to which extent / in which cases / under which circumstances/modalities should the use of said tools be allowed?" —Alalch E. 22:18, 16 April 2023 (UTC)
  Continuing discussion below at § Focus on types of uses... isaacl (talk) 22:11, 17 April 2023 (UTC)
  
  The discussion was not a formal RfC and branched off into many directions, and thus it's not clear to me that there was sufficient focused discussion to be considered a consensus viewpoint of the community. isaacl (talk) 15:39, 15 April 2023 (UTC)
  In a section dedicated to this matter at a project-wide venue that is VPP, approximately 27 editors made formatted comments ("support/oppose blanket ban" along with a few non-explicitly-advocating "comment" comments) about the blanket ban idea. Consensus for a blanket ban was obviously not reached. Should someone really start a blanket ban RfC now hoping for a productive outcome? Seems like a probable waste of time. I might be wrong.—Alalch E. 16:25, 15 April 2023 (UTC)
  Discussions with a single question get better focus, and I suspect having formal RfC notifications will generate a larger sampling of interested parties. Given that the small handful of editors on this talk page haven't really come to any consensus, as the page goes through cycles of expansion and trimming, I think there's a reasonable probability that there's a significant divergence from community consensus, and it might be better to get a more definitive view on high-level consensus, to smooth the way for a policy to be approved. isaacl (talk) 02:48, 16 April 2023 (UTC)
As I've said before, I think an outright ban is not possible because all variety of autocompletion, grammar checking and suggestion, optical character recognition, and such (the list is probably much longer) may use some form of language model and be considered "AI" in the broadest sense. The scan of a historical book into text that one uses may have been accomplished with some form of AI... This will increase in the future. —DIYeditor (talk) 10:22, 15 April 2023 (UTC)
Rewording one of the positions I gave initially: "Wikipedia is overwhelmingly a work of human authorship. Content generated by AIs (such as LLM-based generation) should be used only to meet specific defined goals or narrow technical requirements, under close scrutiny of the Wikipedia community, and subject to compliance with any existing policy, guideline or customary practice by the contributor using the AI tool concerned." ShakespeareFan00 (talk) 10:34, 15 April 2023 (UTC)
Have you read the draft (Wikipedia:Large language models)? —Alalch E. 10:38, 15 April 2023 (UTC)
Out of curiosity, I told GPT-4 to write an article, with citations, about the Belleclaire Hotel in NYC (which doesn't have an article yet). While the AI got a lot of things right, there were a few things I noticed immediately:

- The AI wrote an article that is a little promotional in tone. For example: "The hotel's location on the Upper West Side of Manhattan makes it a convenient destination for visitors to the city. It is located just steps away from Central Park and many popular museums and attractions, including the American Museum of Natural History and the Metropolitan Museum of Art." If I were a new editor and I submitted this draft to AFC, it would be declined.
- While we're at it, the hotel is actually two blocks (not "steps") away from Central Park, and it's across the park from the Metropolitan Museum of Art, so that's also factually wrong.
- The AI cited Tripadvisor as a source. Again, this is not really ideal if I were a new editor submitting a draft.

On the whole, the AI got most of the facts correct, but these facts are presented in such a way that the article would need significant revisions. I do not think that a total ban is warranted, but, at the very least, we would have to be very judicious with the use of AI. – Epicgenius (talk) 19:01, 20 April 2023 (UTC)

How do you feel about the sources? Do they directly support what was written by ChatGPT? –dlthewave ☎ 01:55, 21 April 2023 (UTC)
Nope, not a single one of them actually supports the text. The NY Times source doesn't even mention the article's subject at all. Given how this is the most reliable of the four sources that GPT gave, it's definitely a red flag. The Historic Hotels of America source doesn't even exist anymore. The NYC Architecture source is about a different building entirely (the Century (apartment building)). And I have no idea what to even say about this Tripadvisor source about a hotel in West Virginia.
Funnily, there are sources that support the text, like this. However, I can definitely say that ChatGPT creates fictitious references, so it's of no use if you're trying to find actual sources. – Epicgenius (talk) 02:21, 21 April 2023 (UTC)

In response to some plans by the EU to regulate generative AI and foundation models much more tightly, potentially in ways that make it far harder for smaller and open-source implementers, I'm changing my viewpoint to:

Total Ban on the use of Generative AI and LLM derived content on all Wikimedia sites, until the regulatory framework is certain, and individual providers are completely transparent about what their models can and cannot do, and what mitigation measures they have taken to ensure appropriate compliance with regulatory requirements. ShakespeareFan00 (talk) 23:08, 20 April 2023 (UTC)

@ShakespeareFan00, how would you implement it? As for me, I see a great potential for AI, as it would be possible to run it through unsourced or expandable content. I personally am on your side on the use of AI content, but my experience is that it will be rather difficult to find consensus to stop or even regulate semiautomated editing. For regulation, WP:MEATBOT and WP:BOTUSE already exist, but so far no one could show me an editor who applied for permission at the BRFA as mentioned at MEATBOT. Paradise Chronicle (talk) 06:42, 21 April 2023 (UTC)

Could you be more specific about what "some plans by the EU" are? -- Zache (talk) 08:15, 21 April 2023 (UTC)

https://www.theregister.com/2023/04/18/eu_lawmakers_want_ai_regulation/ ShakespeareFan00 (talk) 08:26, 21 April 2023 (UTC)

https://www.reuters.com/technology/eu-lawmakers-call-political-attention-powerful-ai-2023-04-17/ ShakespeareFan00 (talk) 08:27, 21 April 2023 (UTC)

Total ban for me. Even if an LLM is adept, it's not perfect. Limiting its usage to require checking all citations means that, mostly, just as much work is required to do fact-checking by both the editor using the LLM, and other editors who have to be highly suspicious as to the assisted edits. We can and should put more trust in human edits to be accurate over LLMs. I would at minimum require more scrutiny, but that takes effort. A flat ban is the best option as I see it. (I am not watching this page, so please ping me if you want my attention.) SWinxy (talk) 03:27, 3 May 2023 (UTC)

I see LLMs as being a big accessibility tool for allowing users to create better prose than they may be able to otherwise. For this reason I support the use of LLMs. They can't be trusted for facts but they are useful for creating readable text. Immanuelle ❤️💚💙 (talk to the cutest Wikipedian) 16:20, 3 May 2023 (UTC)

Oppose ban, per Aquillion and Immanuelle. The case for a ban is far from weak, but I think LLMs can be judiciously used in ways that benefit the encyclopaedia. The main potential harms are mass LLM edits and article creations, which should (be clarified to) fall under the meatbot policy. I wish this had been kept at VPP and reworded, because there's no way we can reach a local consensus on such a big question. At least, with a central well-attended RfC, we would have a formal close we could point to and say: "there's consensus to ban LLM", or "no consensus", or "consensus not to ban". But there, I think Alalch E. is right that it might be pointless to start a "ban/no ban" RfC since I doubt it would go differently from the last WP:VPP discussion. It just feels like we're headed nowhere right now. DFlhb (talk) 16:36, 3 May 2023 (UTC)
Oppose the ban and would like to quote colleague User:DFlhb from a month ago:

we've likely overreacted and thrown everything but the kitchen sink into that draft; ChatGPT's been out for months, and the LLMpocalypse hasn't happened. And after checking the long WP:VPP discussion on LLMs, I'm not even sure where we got that "mandatory disclosure" idea from, because I'm not seeing any community consensus for it.

I think I couldn't agree more with that. Ain92 (talk) 18:34, 15 May 2023 (UTC)

Focus on types of uses

Continuing the discussion on types of uses: I suggest not having one combined question as in this comment, but separately asking about different categories of tools. For example, there could be analysis tools (such as spell checkers, grammar checkers, reading level analyzers), text generators (such as tools generating text from human prompts), and conversion tools (such as voice-to-text tools, optical character recognition tools, translators). Alternatively, since the text generator category is of most interest, the question could just be about that category: do not use versus use with restrictions, with a non-exhaustive list of potential restrictions. isaacl (talk) 22:11, 17 April 2023 (UTC)

We already allow use of machine translation as long as people fix the issues, so I don't think there's anything new to discuss there. There's been a lot of discussion conflating the various use cases, but they are very distinct. I'm not sure why people are bringing up grammar checkers and voice-to-text tools in a discussion that's primarily and obviously about text generation. None of the use cases mentioned create material that is "wholly or mostly in part from non human sources" except for text generation. Text generation is the only use case that's dramatically changed with the new LLMs, and that's the one that matters and should be discussed (and relatedly is use of images created through e.g. DALL-E though that should be a separate RfC).

There is a bit of a fuzzy line in terms of LLM autocomplete tools and such, but I think that falls into text generation. Galobtter (talk) 04:03, 18 April 2023 (UTC)

Yes, I'm aware of the current guidance for translation. People bring up other uses for technology X because the discussion has been framed as a discussion about technology X (witness the name of this page). Personally I agree on focusing on text generation. I'm not sure what types of autocompletion tools you are considering; if it's more akin to a thesaurus then I'd see it as an analysis tool. I think there may be a divergence in community views for code generation, as I think there are many who see it as a way to extend their ability to write code. isaacl (talk) 07:22, 18 April 2023 (UTC)

About translations: I translated the PinePhone Pro article into the Finnish article fi:PinePhone Pro, and the prompts used are on the talk page. A substantial difference between translating pages using ChatGPT-style software and using, for example, Google Translate is that the translator can do "translate + summarise + restore references + convert links and references to local wiki" instead of a direct translation, which is more useful. Note: The original article was also mainly written by me. -- Zache (talk) 08:14, 18 April 2023 (UTC)

"Micro-hallucinations"

Something I posted at Wikipedia talk:Using neural network language models on Wikipedia, but perhaps better said here:

There are a lot of stories of rather large-scale "hallucinations" (lying/fiction) on the part of ChatGPT, etc., but it's become clear to me just experimenting a bit that every single alleged fact, no matter how trivial or plausible-looking, has to be independently verified. I asked the current version of ChatGPT at https://chat.openai.com/ to simply generate a timeline of Irish history (i.e., do nothing but a rote summarization of well-established facts it can get from a zillion reliable sources) and it produced the following line item:

1014 CE: Battle of Clontarf in Ireland, in which High King Brian Boru defeats a Viking army and secures his rule over Ireland

That's patent nonsense. Brian Boru and his heirs died in the Battle of Clontarf, though his army was victorious. In the larger timeline, it also got several dates wrong (off by as much as 5 or so years).

We need to be really clear that nothing an AI chatbot says can be relied upon without independent human fact-checking. — SMcCandlish ☏ ¢ 😼 06:55, 6 May 2023 (UTC)

The hallucinations are pretty common in the existing chatbots from what I have seen. Sometimes it'll be coherent and seem to stick to facts, but at any point it could wander off into fantasy/lies/fiction. —DIYeditor (talk) 07:31, 6 May 2023 (UTC)

This kind of thing is why the language in the draft like "Large language models can be used to copy edit or expand existing text, to generate ideas for new or existing articles, or to create new content" is being too kind. Permission shouldn't come first, followed by qualifications and caveats. The hazards come first.

Too much of this draft policy is analogous to saying, "You can go ahead and do Original Research or play journalist in Biographies of Living Persons, as long as you pinkie-promise to be extra careful." XOR'easter (talk) 20:23, 16 May 2023 (UTC)

I feel like many of these concerns presuppose that AI text is riddled with errors (which it often is) while human text is largely correct (which it is not always). I fear that lambasting generative tools as blatant OR while also ignoring the fact that human editors paraphrase/pick-and-choose which facts to report when they edit (more subtle forms of OR, imho) will result in a skewed understanding of this issue.--Gen. Quon_[Talk] 18:55, 17 May 2023 (UTC)

Well, we have WP:CIR for human editors. The big difference between humans and LLMs with regard to this is that human errors usually follow a visible pattern of incompetent behavior (or else are accidents that can be quickly recognized by the person when pointed out). An LLM is not able to actually recognize, identify and correct past errors, even if it can mimic the process of apology fairly well. There's humans that do that too; for the most part, they're indefinitely blocked. _signed,Rosguill ^talk 21:41, 17 May 2023 (UTC)

That sentence is not a blanket permission. The meaning of "can" is as follows: "It is a given—arising from the current state of technology—that large language models can /objectively/ be used to copy edit or expand existing text, to generate ideas for new or existing articles, or to create new content"—Alalch E. 21:32, 17 May 2023 (UTC)

I think this has come up before; I too was initially confused by this sentence until it was explained to me. The intent is good but the wording could use improvement. It's a stretch even to say that LLMs are technically capable of writing article content; it would be more accurate to say that they can generate text that has the appearance of a Wikipedia article.

A different approach would be to open with a detailed description of how LLMs work and explain in the same paragraph that although they have the appearance of intelligence, they don't actually "understand" what they're writing and often produce false information that's difficult to distinguish from fact (AKA "vaguely plausible bullshit"). Most editors on this page take this for granted, but we need to write for folks who have heard amazing things about AI or played around with ChatGPT and are eager to use it on Wikipedia. –dlthewave ☎ 02:09, 18 May 2023 (UTC)

I don't think that's even an accurate statement of what the current technology makes possible. I mean, one might also try "to generate ideas for new or existing articles" using a Ouija board and a bottle of tequila, but to say that the combination "can objectively be used" for the purpose stretches the word can beyond the point of meaningfulness. At any rate, if the sentence needs this much explication, it's not a good line to put in a policy. XOR'easter (talk) 22:41, 18 May 2023 (UTC)

Circling back to getting this into a presentable state

Pinging the major contributors to the current page per Sigma's tool. @DFlhb, Alalch E., and Phlsph7: I think we should return to DFlhb's suggestion in early April and just trim the damn thing down to the absolute minimum so that we can have an RfC and there can be something. I'm going to look at all the LLM- and AI-related policy pages and see if I can come up with some scheme that makes sense (there are about a million of them and they all overlap in bizarre ways). jp×g 09:56, 18 May 2023 (UTC)

~~I support trimming back to something similar to DFlhb's trim.~~ Actually I'm unsure. Maybe start that RfC that was discussed, independent from this as a potential proposal. That's what isaacl suggested and I think DFlhb agreed.—Alalch E. 10:15, 18 May 2023 (UTC)
I still stand by that diff. We need an "allow vs ban" RfC to give us the community's assent. I'd also rather we hold that RfC now & take advantage of there being no LLM-related controversy at WP:ANI, so the !votes are representative and not a heat-of-the-moment under/overreaction. Afterwards we can ask WP:VPI what they think of the current draft, my trim, jpxg's potential rework, or anything else. DFlhb (talk) 21:06, 18 May 2023 (UTC)
Please start the RfC. —Alalch E. 21:40, 18 May 2023 (UTC)
XOR'easter, you said you're on the Ban side. Want to take a stab at it? Feels unfair for me to start an RfC that I'd (likely) oppose. DFlhb (talk) 22:17, 18 May 2023 (UTC)
No. I am, frankly, exhausted by trying to deal with one disaster after another after another while trying to keep an eye on discussions that should have been closed months ago. To be blunt, the last thing I want is to be the whipping-person for an RfC that will have half the people here wanting to run me out on a rail for Luddism. XOR'easter (talk) 22:54, 18 May 2023 (UTC)
To sum up what I wrote earlier, personally I think the RfC should ask if the use of programs to generate text for inclusion in an article is supported. This would include text generated using existing content as an input. I think this is the key area of concern for most people, so I think it would be better to tailor the question to the specific category of tools that is a potential problem. isaacl (talk) 22:32, 18 May 2023 (UTC)
I agree, people including AI-generated texts directly in articles is probably the central issue for most editors. Phlsph7 (talk) 07:50, 19 May 2023 (UTC)
There are so many potential variations and aspects that I don't think this would be well-suited for an RfC. We couldn't even agree on how to phrase it in the last 2 discussions.

If this is where we're at, then let's just trim & go straight to WP:VPI, which is where these nuances are best discussed. If a majority favours a ban, they'll let us know anyway. I now think I was wrong to believe an RfC would help. DFlhb (talk) 10:47, 19 May 2023 (UTC)
To illustrate how the trim makes this dispute easier to resolve, I've tweaked the trim's first bullet point. Only one sentence (in italics) would be in contention, instead of many paragraphs:
LLMs can make things up, produce biased outputs, include copyrighted content, or generate fake or unreliable citations. Never paste LLM outputs directly into Wikipedia. You must rigorously scrutinize all your LLM-assisted edits before hitting "Publish", and you are responsible for ensuring that these edits comply with all policies.

Without the trim, we can hold RfCs on abstract questions, but we'll still struggle to figure out how to turn the RfC results into specific wording. With the trim, anyone who wants to tweak the policy can just put an alternative sentence to an RfC. Removes one layer of abstraction.

Another benefit is that it doesn't explicitly condone any use cases, unlike the current draft. Saves us the need to gauge consensus on which use cases are okay by putting the locus on policy-compliance rather than use cases. I think that makes sense, since I doubt there's such a thing as a "low risk" use case. DFlhb (talk) 11:48, 19 May 2023 (UTC)

The issue is that there hasn't been agreement on having a minimal proposed draft, in part because different editors want to clarify many different aspects based on what they assume the community wants. If community consensus is reached on disallowing text generated by programs, for example, then the resulting policy will be very short with respect to this aspect. isaacl (talk) 15:38, 19 May 2023 (UTC)

There has been very little progress in terms of consensus on how to proceed in the last weeks/months. Trimming it down wouldn't be my first choice. But it's better than keeping the draft in an indefinite state of limbo. As I see it, the important point would be to find a path forward, one way or the other. Phlsph7 (talk) 10:34, 18 May 2023 (UTC)

Taken literally, "allow" vs. "ban" are the two extremes (total green light and total ban) and IMO not good choices for an RFC. What's happening here is a lot of work to refine a proposal for an RFC where I'd guess the RFC would be whether or not to make it a guideline. IMO it should be a separate later question on whether to upgrade it from a guideline to a policy. Sincerely, North8000 (talk) 15:58, 19 May 2023 (UTC)

I suggest a third way forward beside the trim and allow-ban-RfC. Does the current draft have parts with which several editors strongly disagree? If yes, we can consider an alternative version of each controversial part, have a short discussion on it, and then an RfC. We do that until all strong disagreements are sorted out. Then we can propose the draft at WP:VPI. This probably works best if each discussed alternative focuses only on one core issue, for example, by taking a couple of sentences and suggesting how they should be changed.

Compared to the trim, it has the advantage that it does not involve a radical change to what we were already working on all these months. It has the disadvantage that the result will probably be longer and more difficult to manage. Compared to the allow-ban RfC, it has the advantage that we have clearly defined alternatives for the RfCs and therefore a good chance to get a consensus one way or the other. It has the disadvantage that the people participating in the RfCs have less influence since each one affects only a small portion. Phlsph7 (talk) 20:05, 20 May 2023 (UTC)

"We do that until all strong disagreements are sorted out": After the trim, I was invited to do that, but refused, for reasons later expressed by XOR'easter better than I could have. So far, we've been ineffective at defining problems precisely, and have not just disagreed on the best options, but on what these options even meant. I suggest that we make the "next step" WP:VPI, not further discussions here. Though maybe there are downsides to "rushing over there" that I'm not seeing, so please scrutinise my reasoning. DFlhb (talk) 22:41, 22 May 2023 (UTC)

Note that trimming isn't required before going to VPI. Let's just see what they think; many of the things I disliked have been addressed. DFlhb (talk) 23:41, 22 May 2023 (UTC)

Personally, I'm with you. I just had another look at the draft. It's not perfect but it seems to me to do a decent job at describing how LLMs should not be used, how they can be used, and what dangers this involves. If the proposal fails we may still learn important things on how to improve it and either repropose it or change it into a guideline. Phlsph7 (talk) 09:04, 23 May 2023 (UTC)

I suggest we take a few weeks to finalize a draft that has pretty strong support here, and put a notice / invitation to participate at the pump that we are doing that. Then we should do a prominent RFC to accept it as a guideline. (leave the policy idea for later) Also indicate that the RFC is just to put it in, not lock it all in...that further evolution is planned. If possible, folks here should support the result even if it is only 90% how you want it. If the RFC changes into a brainstorming page with a zillion versions, it will die under its own weight. What do y'all think about that idea? North8000 (talk) 21:35, 22 May 2023 (UTC)

I agree that compromises are necessary if we want to move forward. I'm not sure about the issue of policy vs guideline. So far, the draft was treated as a draft of a policy. We could try to get it accepted as a policy and use the guideline approach as a plan B in case that fails. Phlsph7 (talk) 09:13, 23 May 2023 (UTC)

I have no strong opinions, I was just trying to crystalize something. Also, in the current "crowd source" architecture of Wikipedia, with the usual approaches, something of this scale has about a 2% chance of getting accomplished. I think that the sequence of a lot of work and input to crystalize a single proposal, and then the drafters agreeing to support the outcome even if you only like 90% of it is a way of raising those 2% odds to 90%. North8000 (talk) 13:19, 24 May 2023 (UTC)

Add a header to this page

Should we add "This article is about the use of Large Language Models in Wikipedia. For the article about Large Language Models (general), see https://en.wikipedia.org/wiki/Large_language_model instead"? It would help clear up confusion, especially for people who Google "LLMs" and find this page. Thegamebegins25 (talk) 23:36, 22 May 2023 (UTC)

The idea makes sense. But I checked a few other policies and guidelines, like Wikipedia:Copyrights, Wikipedia:Plagiarism, and Wikipedia:Public domain: they don't have a disambiguation link to the corresponding mainspace articles. Phlsph7 (talk) 09:19, 23 May 2023 (UTC)