
User:Gnomingstuff/AI experiment


Loosely following "Why Does ChatGPT Delve So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models" by Tom S. Juzek and Zina B. Ward: comparing human-written wiki articles with probably-AI-written ones to see which words have spiked in usage.

Methodology

Sourcing text

"AI" text is selected via the following methods:

  • Drafts: Articles in draftspace, written by one contributor in consecutive edits, containing the words "large language model" (which catches the AfC decline tag).
  • Articles by one contributor, confirmed to be AI-generated via disclosures.
  • Articles tagged as AI-generated, where the initial version of the article was AI-generated.
  • Text from userpages/sandboxes that seems unambiguously intended to be article drafts.

All articles are manually reviewed by me to make sure the tag isn't bullshit. The earliest "complete" version was used in all cases.

"Human" articles are selected from:

  • Random non-stub articles
  • Articles tagged with peacock, promotional, or essay tags, as AI writing often has this tone by default
  • Articles that contain the aforementioned AI "focal words," to counteract over-weighting on them (and because we don't need extra evidence AI uses them)
  • The ~3-5 articles I primarily wrote, because why not

Only article versions prior to mid-2022 are used, to be near-certain the text isn't AI.

Finally, the text includes excerpts from articles where:

  • The article contains a diff adding several uninterrupted paragraphs of new AI-generated text (copyedits don't count).
  • A pre-2022 version of the article contains a passage comparable in length, prose density, and subject matter.

Processing text

All text is sorted into folders by category, to make including or excluding types of text easier in the future. The creation date is appended to the filename, in anticipation of running this only on AI articles from certain years or spans of LLM release dates.
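
For illustration, a minimal sketch of the layout and date filtering this implies; the folder names and the exact filename format below are assumptions, not the ones actually used:

    # Hypothetical layout (names are illustrative):
    #   ai/drafts/some_article_2025-03-14.txt
    #   human/random/some_article_2019-07-02.txt
    from pathlib import Path

    def files_from_years(root, years):
        """Yield .txt files whose appended YYYY-MM-DD creation date falls in the given years."""
        for path in Path(root).rglob("*.txt"):
            date = path.stem.rsplit("_", 1)[-1]   # e.g. "2025-03-14"
            if date[:4] in years:
                yield path

    # Example: only AI drafts created in 2024 or 2025
    ai_recent = list(files_from_years("ai/drafts", {"2024", "2025"}))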

The wikitext of the AI and human articles is lightly cleaned to remove AfC boilerplate and some non-indicative syntax. No other manipulation of the words was done, such as removing punctuation, normalizing capitalization, or tokenizing beyond splitting on whitespace. This causes some data issues (see Limitations) but is deliberate, as there are known wiki-syntactical differences between AI and human articles.
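
A rough sketch of that cleaning step, assuming typical AfC template names; the actual patterns stripped are not listed on this page, so treat these as placeholders:

    import re

    def light_clean(wikitext):
        """Strip AfC boilerplate and hidden comments; leave punctuation,
        capitalization, and the rest of the wiki markup untouched."""
        text = re.sub(r"\{\{AfC submission[^}]*\}\}", "", wikitext, flags=re.I)  # AfC banner (assumed pattern)
        text = re.sub(r"\{\{AfC comment[^}]*\}\}", "", text, flags=re.I)         # reviewer comments (assumed pattern)
        text = re.sub(r"<!--.*?-->", "", text, flags=re.S)                       # hidden HTML comments
        return text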

Then, the text of the AI articles and the human articles is compared with code similar to the code in the original study, extended to also analyze two-, three-, and four-word phrases. Non-indicative or coincidental syntax -- e.g., "2025", "|access-date=November" -- is excluded from the results.
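
A minimal sketch of the phrase counting this describes -- splitting on whitespace only, consistent with the processing above; this is an illustration, not the study's or my actual code:

    from collections import Counter

    def ngram_counts(text, n):
        """Count n-word phrases after splitting on whitespace only (no other tokenization)."""
        tokens = text.split()
        return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # e.g. counts for single words through four-word phrases:
    # counts = {n: ngram_counts(corpus_text, n) for n in (1, 2, 3, 4)}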

Code

Adapted from brute_force_div_py on the study authors' GitHub.

No attempt was made by me to make this efficient, Pythonic, or good code.
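
The adapted script itself is not reproduced here; the sketch below only illustrates the general brute-force idea (per-million rates in each corpus and their ratio). The minimum count and the +1 smoothing are my assumptions, not necessarily the study's choices.

    def compare(ai_counts, human_counts, min_count=5):
        """Rank phrases by how over-represented they are in the AI corpus."""
        ai_total = sum(ai_counts.values())
        human_total = sum(human_counts.values())
        rows = []
        for phrase, ai_n in ai_counts.items():
            if ai_n < min_count:
                continue  # skip rare phrases that are almost certainly flukes
            ai_pm = ai_n / ai_total * 1_000_000
            human_pm = (human_counts.get(phrase, 0) + 1) / (human_total + 1) * 1_000_000  # +1 avoids division by zero
            rows.append((ai_pm / human_pm, ai_pm, human_pm, phrase))
        return sorted(rows, reverse=True)  # largest AI/human ratio first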

Results

Note: This is a somewhat small dataset so far (~1,000,000 tokens each for human and AI text), so there are undoubtedly many flukes. Nevertheless, you will see some old friends.

Only the top of each list is shown. The bottom half (words more common in human text) is not currently useful data as there are too many false negatives and syntax coincidences. Sorting by AI frequency may be most indicative.

Results (two-word phrases)

Results (three-word phrases)

Results (four-word phrases)

This seems to be the limit of what we can check. Some results here are not statistically significant.

Grokipedia

Grok's distribution of words is similar in many ways to that of AI in general, but over-emphasizes certain topics.

The same code was run on datasets of Grokipedia articles versus their pre-Q3 2022 Wikipedia counterparts. Only original Grokipedia articles were used, not articles scraped from Wikipedia. The datasets currently consist of 96 articles each from Grokipedia and Wikipedia, with ~1 million and ~870,000 tokens respectively.

Since Grokipedia's articles are not publicly editable, the rendered text of the articles was used rather than the wikisource (which would be different anyway; Grokipedia appears to use Markdown). A limitation here is that Grokipedia articles are text-only, while Wikipedia articles contain image captions, "See also" headers, portal text, and similar cruft. They have been cleaned somewhat, but more cleaning is needed.
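
One way to approximate that cleanup (a sketch of an assumed approach, not what was actually run): cut the rendered Wikipedia text at its first end-matter heading, since the Grokipedia text has no equivalent sections.

    import re

    # Headings after which Wikipedia's rendered text is mostly cruft for this comparison.
    END_HEADINGS = re.compile(r"^(See also|References|External links|Further reading|Notes)\s*$",
                              re.MULTILINE)

    def strip_end_matter(rendered_text):
        """Return the article text up to the first end-matter heading, if any."""
        match = END_HEADINGS.search(rendered_text)
        return rendered_text[:match.start()] if match else rendered_text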

Another limitation: Grokipedia articles are much longer than their Wikipedia counterparts, which means the human token set is substantially smaller.

One more limitation: Articles were not chosen randomly; Grokipedia has no random-article feature, and few random Wikipedia articles have Grok-written counterparts, so it is unclear how to randomize the dataset. Good and Featured Articles on Wikipedia were prioritized.

Four-word phrases are not included here -- there are too few articles for the results to be conclusive.

One word

Two words

Three words

Limitations

Out the wazoo:

  • I am not a statistician, a computational linguist, or any kind of academic at all.
  • Wikipedia articles cover a much broader range of subject matter and verbiage than research abstracts, making overlap less likely and meaning any one article may overwhelmingly influence the results.
  • The training set is small compared to the original study.
  • The training set is not unbiased:
      • All of the AI samples being drafts introduces many issues:
          • Since drafts may be auto-deleted after 6 months, most of the samples are from 2025, which may over-represent newer chatbots.
          • They are almost all rejected AfC submissions, and thus may not represent AI-generated text that "passes" better.
          • They likely over-represent common subject matter that gets rejected for non-notability. For example, the word "AI" tops the list in part because many deleted AfC drafts are about AI startups. (The other reason is that AI is talked about more in 2025 than in 2022.)
          • Since human AfC submissions from the same period are not included due to the 2022 cutoff, text may be listed as characteristic of AI that is actually just characteristic of following the AfC guidelines.
      • All of the AI-tagged samples are of new articles, and thus do not represent article subjects that already had articles.
      • All of the AI-tagged samples are text that wasn't deleted, and thus may not represent AI-generated text that is older or more obvious.
      • Many of the tagged articles were tagged by me, which means I may have over-focused on certain tells or subjects.
      • All the human articles were either manually chosen or curated.
  • Most human articles have accumulated much more wiki-syntax, templating, and such than newer articles, which means the human token set may be functionally much smaller (as more of the tokens are irrelevant). It also means some words may be under-represented in the human text, since [[Dog]] and Dog are two different tokens.

Policy note

I believe this is acceptable per WP:NOTALAB: "Research that analyzes articles, talk pages, or other content on Wikipedia is not typically controversial, since all of Wikipedia is open and freely usable."

For fun: The words/phrases with the largest decreases

Don't take this seriously; it is not statistically useful and may be explained by wiki boilerplate, dataset over- or under-representation, or other coincidences.

  • One word: "Because"; "box"; "dead"; "defeated"; "east"; "finished"; "however"; "motorcycle"; "normal"; "opposed"; "poopoo"; "precipitation"; "probably"; "quite"; "refugees"; "said"; "satellites"; "seating"; "sentenced"; "shall"; "ski"; "sun"; "Telugu"; "tries"; "Tunisia"
  • Two words: "and will"; "building was"; "charge of"; "in order"; "opposed to"; "said that"; "south of"; "the average"; "the whole"; "There were"; "to start"; "was announced"; "was built"; "was given"; "was very"; "went to"; "were to"; "win the"; "with two"
  • Three words: "a number of"; "as a result"; "because of the"; "can also be"; "in order to"; "is in the"; "Most of the"; "release of the"; "result of the"; "the name of"; "to be the"; "was based on"; "was promoted to"
  • Four words: "a result of the"; "as a result of"; "at the beginning of"; "in the eyes of"; "is evident in the"; "often referred to as"; "tens of thousands of"; "the launch of the"; "took part in the"; "was a member of"