Wikipedia talk:Edit check/Tone Check


Summary of discussions so far


On Tuesday, June 10, 2025, ~15 volunteers who are active at en.wiki joined WMF Staff in a Discord voice chat to discuss Tone Check.

What follows is an initial attempt to document the concerns we understood volunteers to be raising (and opportunities they might present) in this Discord chat, as well as those raised in the on-wiki discussion that prompted it and followed it.

We need your help!

Please boldly edit the table you will see below to make it a more accurate and exhaustive representation of the ideas you raised and/or saw other people raise about Tone Check so far. We recognize there are likely things we have missed and/or misunderstood!

Note: as with all new explorations, we are approaching this ongoing conversation with y'all about Tone Check a) seeking to learn and b) open to not yet knowing what other solutions this learning might lead us to.

Tone Check

As a reminder, Tone Check is meant to address two issues:

  1. Newcomers publishing edits to Wikipedia that contain promotional, derogatory, or otherwise subjective language because they lack the awareness that this kind of editing is not aligned with Wikipedia policies.
  2. Patrollers/moderators being burdened by patrolling preventable damage made in good faith. This can come at the expense of identifying and addressing more subtle and complex forms of vandalism.

Tone Check is designed to address these two issues by:

  1. Offering new(er) volunteers feedback while they are editing so they can avoid unintentionally publishing edits that are likely to violate policies
  2. Offering patrollers/moderators deeper insight into the edits they are reviewing (and the intentions of the people publishing them) by logging the moderation feedback new(er) editors are being presented with, and the actions they do/do not take in response.

Tone Check is still under active development. You can learn more about the current state of the project here.

Potential scenarios


Please note: these are scenarios we are trying to understand so that we can brainstorm how we might A) mitigate them and B) track if/when they do occur.

Scenario A
Potential scenario: Tone Check could nudge people away from more obvious peacock words (e.g. "iconic") and towards subtler forms of biased writing that are more difficult for the model and for people to detect.
Details: People acting in bad faith might repeatedly reword what they've written until they find language that is promotional, derogatory, or otherwise subjective, while also being subtle/ambiguous enough that Tone Check does NOT activate.
Potential outcomes:
  1. Tone Check inadvertently increases the amount of biased / non-encyclopedic content on Wikipedia.
  2. Tone Check could help people editing in bad faith obfuscate COI editing.
Opportunities: [i]
  1. Introduce a hidden tag that gets appended to edits Tone Check would activate on, before making the feature visible to people editing. This way, volunteers could see the edits (and the people) Tone Check would activate for, were it made available. Note: this functionality is actively being worked on. (See the sketch after this table for one way patrollers might query such a tag.)
  2. Enable volunteers to prevent Tone Check from activating on certain pages/categories where people tend to edit in bad faith more frequently, e.g. contentious topics, companies, public figures, conflicts, etc. Please add ideas/thoughts to T393820.
  3. Limit how many times Tone Check can activate within a given edit (or for a given user within a certain time period) so people acting in bad faith can't use the feature to craft edits that circumvent it.
  4. Enable volunteers to see the reason someone provided for declining to revise the tone of what they have written when Tone Check asks them to consider doing so. Please add ideas/thoughts to T395175.
  5. Enable volunteers to see the specific words/phrases the model detects as promotional, derogatory, or otherwise subjective, along with the model's confidence in those predictions, so that volunteers can consider integrating these patterns into existing moderation tools and workflows.
  6. ???
  7. ???
  8. ???

Scenario B
Potential scenario: Moderators/patrollers/reviewers could become less effective at catching people attempting to add spam to Wikipedia, because Tone Check could discourage those people from inadvertently "outing" the mal-intent they are editing with.
Details: People editing with a conflict of interest (or other mal-intent) could leverage Tone Check to write in a promotional, derogatory, or otherwise subjective tone that is subtle enough for moderators/patrollers to miss. This would, in effect, remove one of the key signals (tone) volunteers currently depend on to prioritize investigating potential COI and other forms of bad-faith editing.

Scenario C
Potential scenario: By building a model like the one underlying Tone Check, and making it available under an open source license, people could run the model locally and, in turn, use AI to generate non-neutral text that on-wiki tools might not detect.

i. Note: These are the ideas mentioned so far. We need your help to identify additional ideas and, ultimately, to evaluate how effective the "final" set of ideas will be at supporting both experienced editors and newcomers.
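To make opportunity 1 above a bit more concrete, here is a minimal sketch of how a patroller might list recent edits carrying such a hidden tag, using the standard MediaWiki Action API (list=recentchanges with the rctag parameter). The tag name "editcheck-tone" is an assumption made purely for illustration; no such tag is confirmed to exist yet, and the eventual name (if any) may differ.

  # Minimal sketch: list recent edits carrying a hypothetical hidden Tone Check tag.
  # Assumption: the tag name "editcheck-tone" is a placeholder, not a confirmed tag.
  import requests

  API_URL = "https://en.wikipedia.org/w/api.php"

  def recent_tone_check_edits(tag="editcheck-tone", limit=50):
      """Return recent changes that carry the given change tag."""
      params = {
          "action": "query",
          "list": "recentchanges",
          "rctag": tag,  # only changes carrying this tag
          "rcprop": "title|ids|user|timestamp|comment|tags",
          "rclimit": limit,
          "format": "json",
      }
      response = requests.get(API_URL, params=params, timeout=30)
      response.raise_for_status()
      return response.json()["query"]["recentchanges"]

  if __name__ == "__main__":
      for change in recent_tone_check_edits():
          print(f"{change['timestamp']}  {change['user']}  {change['title']}  (rev {change['revid']})")

A list like this is essentially the "running list of recent edits where Tone Check would have activated" raised in the discussion below; it could also be fed into existing dashboards or patrolling bots.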

Discussion


Pinging everyone who participated in the two main on-wiki discussions:[i][ii] Tamzin, Hammersoft, AndyTheGrump, Sohom Datta, Berchanhimez, Mike Christie, Shushugah, CaptainEek, Tactica, Fuzheado, Donald Albury, Loki, Isaacl, SD0001, WereSpielChequers, NebY, Novem Linguae, Rhododendrites, Gnomingstuff, Polygnotus, QEDK, Barkeep49, Toadspike, Chess, Phil Bridger.

Note 1: I'll also be posting in Discord to ensure the people who were present there are aware of this conversation.

Note 2: I'll be on vacation for a short period of time. Trizek (WMF) will monitor this thread and engage with the discussion.

- PPelberg (WMF) (talk) 22:13, 1 July 2025 (UTC)[reply]

Thanks for the summary; that sounds great! Another idea I briefly suggested during discussions, which might have some merit, would be to add ToneCheck activations into the Wikipedia:AbuseFilter workflow as a possible trigger (adding to the action options documented here).
Also, just to clarify – if a use of ToneCheck is logged, would the pre-ToneCheck text also be accessible? On the one hand, I understand people might not want to make their drafts public, or be judged on them, but on the other hand, it could also be a useful tool to help catch spammers abusing it, so there is again a balance to be found here. Chaotic Enby (talk · contribs) 05:58, 9 July 2025 (UTC)[reply]
Another idea...add ToneCheck activations into the Wikipedia:AbuseFilter workflow as a possible trigger (adding to the action options documented here).
Oh, interesting, @Chaotic Enby. A couple of resulting questions:
  1. Were we to move forward with adding, "...ToneCheck activations into the Wikipedia:AbuseFilter workflow.", in what – if any – ways is what I've described below misaligned with what you were thinking?
    1. Each time Tone Check is presented to someone during an edit session, pass that information along to AbuseFilter as an action, as defined here.
    2. Hypothetically, with Tone Check activations now being logged within AbuseFilter, volunteers could treat said activations as a condition they could use to trigger any other action/effect (there might be a better term here) that AbuseFilter supports (e.g. tag the edit, show a subsequent message/warning, etc.)
  2. Building on the question above, might you be able to share what you could imagine the effect(s) of implementing this idea to be? Asked another way: what scenario(s) documented in the table above do you think this idea could be helpful with? In asking, I do not mean to suggest that the scenario(s) listed above are exhaustive or that this idea needs to be related to one.
...if a use of ToneCheck is logged, would the pre-ToneCheck text also be accessible?
Assuming it would be accurate for me to understand the idea above as enabling volunteers (TBD what roles/permissions they would need) to see on-wiki the exact text someone wrote that caused a Tone Check to be shown, then no. Currently [i], this functionality is not implemented.
A couple of follow up questions...
  1. @Sohom Datta, I'm understanding the idea you shared below as similar to (if not the same as) the idea @Chaotic Enby shared above (see the sentence beginning with, "Also, just to clarify..."). Are you thinking the same? I ask this wanting to be sure I'm not mistakenly conflating the two, and with it, missing/excluding nuances that might distinguish them.
  2. @Chaotic Enby + @Sohom Datta, assuming it's accurate for me to understand these two ideas as one and the same, could you please say a bit more about how you can imagine patrollers/reviewers using the text someone wrote that prompted them to be shown a Tone Check?
One idea: maybe y'all are thinking along the lines of the following...
"As an patroller/reviewer motivated to ensure that Tone Check is //not// inadvertently enabling people acting in bad faith to become more effective at disguising destructive/vandalistic edits, I would value seeing the text someone wrote that caused Tone Check to become activated so that I can use this information to make a more informed assessment about the intentions with which someone published the edit I am in the process of reviewing."
In parallel, I've created T399511.
----
i. Emphasis on "currently" as Tone Check is still evolving with conversations like this serving as a key input into that process! PPelberg (WMF) (talk) 22:18, 14 July 2025 (UTC)[reply]
Thanks a lot @PPelberg (WMF)! What you described is pretty much exactly what I had in mind. My idea is that logging the text of every single Tone Check edit, and making it visible to all users, might be seen as too much of an invasion of privacy, but incorporating it into the edit filter workflow can serve two purposes:
  • Logging uses of Tone Check and making the content visible to edit filter helpers and administrators
  • Using a Tone Check hit in combination with other conditions, some of which are implemented in existing filters (for example, usernames in Special:AbuseFilter/54 or Special:AbuseFilter/148) to target specific uses that might be deliberate promotional editing, rather than a new user unfamiliar with Wikipedia's tone
In the latter case, these filters could, for example, make the content of Tone Check activations visible to all editors for more transparency, and/or tag the edits for patrollers.
This is not necessarily the specific way to implement it; for instance, we could instead tag all Tone Check activations by default, or make their contents more public and use specific edit filters to outright block edits on more "contentious" pages. Rather, it is an example of a general framework that edit filters can allow us to build. Crucially, the specifics of this framework can then be tweaked on-wiki by edit filter managers (with community consensus, similarly to IA actions), meaning the community won't need to bother WMF developers to adjust the specifics once Tone Check is live. Chaotic Enby (talk · contribs) 22:49, 14 July 2025 (UTC)[reply]
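For illustration only, the kind of combination described above could look roughly like the following sketch. It is written in Python rather than actual AbuseFilter rule syntax, and the inputs (a Tone Check activation flag, the user's edit count, a username pattern similar in spirit to the filters mentioned above), thresholds, and resulting actions are all assumptions, not an existing implementation.

  # Conceptual sketch only (NOT AbuseFilter syntax): illustrates combining a
  # hypothetical Tone Check activation signal with other conditions to decide
  # what a filter might do. All names, thresholds, and patterns are assumptions.
  import re

  LIKELY_PROMO_USERNAME = re.compile(r"(media|marketing|agency|official)", re.IGNORECASE)

  def filter_decision(tone_check_activated: bool, user_editcount: int, user_name: str) -> str:
      """Return an illustrative action, mirroring how an edit filter might be configured."""
      if not tone_check_activated:
          return "no action"
      # Tone Check fired AND the account name looks like a role/promotional account.
      if LIKELY_PROMO_USERNAME.search(user_name):
          return "tag edit and log contents for edit filter helpers/admins"
      # Tone Check fired for a very new account: surface the edit for patrollers.
      if user_editcount < 10:
          return "tag edit for patroller review"
      # Otherwise, just log the activation.
      return "log only"

  print(filter_decision(True, 3, "AcmeMarketingTeam"))  # -> tag edit and log contents ...

As noted above, the actual conditions and actions would be configured on-wiki by edit filter managers rather than hard-coded by developers.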
There is currently a glaring lack of trust between the WMF and the English Wikipedia community, both in general, and about anything that looks like a language/AI model broadly construed. I get that this isn't a generative AI model, but I would advise a lot of caution here to avoid a backlash that would waste a lot of everyone's time. I think the concerns about making problematic edits harder to spot in this case are huge and real. When we were cleaning up the Content Translation tool with X2, we explicitly decided not to polish up articles that might have a big translation error so that we didn't hide that error. This tool encourages editors to polish up edits that we inherently think might have a problem, in this case with neutrality. I think a system that flags problematic edits for human review is good. I also think it should pretty firmly stop there. Tazerdadog (talk) 15:41, 9 July 2025 (UTC)[reply]
Hi @Tazerdadog – thank you for thinking diligently about the Tone Check work.
I think we are aligned with you in thinking:
  1. The concerns about making problematic edits harder to spot are real and
  2. We need to be deliberate in how we discuss, design, and develop any intervention that seeks to delegate to a machine an action and/or choice that people have historically performed.
With the above in mind, a few follow up questions for you in response to the points you raised…
1. "I think the concerns about making problematic edits harder to spot in this case are huge and real." To be doubly sure we're holding a shared understanding of what you're referring to here, could you please share what might be missing from/inaccurate about the interpretation I've drafted below?
"Obvious policy infractions/errors (like biased tone) are one of the key signals I, Tazerdadog, and other patrollers/reviewers depend on to prioritize investigating potential COI and other forms of bad faith editing. As such, an intervention like Tone Check – which makes people aware of potential tone issues and invites them to reconsider the tone of the text they are adding to Wikipedia – could cause me to miss out on spotting problematic edits."
2. With regard to "...a system that flags problematic edits for human review...", might you be able to say a bit more here? When you think about the current processes/workflows for discovering and patrolling/reviewing edits, what is challenging/inefficient/not possible/etc. that you think the system you're imagining could potentially help with? Further, would it be accurate for me to think you would consider a new edit tag – one that gets "silently" appended to edits a small language model trained on past Wikipedia edits "thinks" could contain a tone issue – to be a constructive step in the direction of the system you're describing?
3. More broadly, your use of the phrase "...pretty firmly stop there" leads me to think this work is intersecting with – what you consider to be – a boundary you feel strongly committed to. Assuming this is an accurate perception, might you be able to bring some language to this boundary? Asked another way: what deeper principle/value do you see this Tone Check work as encroaching upon? PPelberg (WMF) (talk) 21:57, 23 July 2025 (UTC)[reply]
@PPelberg (WMF):
1) I think your statement in 1 is accurate, but limited. I am also frequently tipped off to copyvio by out of place promotional language. I can then copy/paste a sentence into a Google search and often find both the source and the extent of the problem. If the user was prompted to smooth out their text first, it would interfere with both my ability to get the gut feeling that there's something worth taking a look at here, and my ability to confirm it to a high standard.
2) I'm imagining that a machine learning tool (and probably not an LLM) might be helpful for turning a firehose such as recent changes into some kind of orderly list. ClueBot is a good example of machine learning done right on Wikipedia in the anti-vandalism space - it's able to catch a fraction of obvious vandalism and undo it, but it leaves the history of that vandalism in place for any human follow-ups and doesn't seem to encourage the vandalism to become more subtle. Even something as simple as a running list of recent edits where Tone Check would have activated, or a similar list for edits near, but below, ClueBot's revert threshold, would be a very valuable tool for patrollers. A silent tag sounds to me like a reasonably good way to implement this; no comment on the best technical way.
3) The issue with how Tone Check was initially presented was that it could paper over problems instead of fixing them properly. If all Tone Check did was silently alert a human editor, that would likely be useful. If it went further and reverted or filtered out edits in their entirety when it was very confident they were promotional, that's probably also fine. What it shouldn't do is change, or prompt the human into changing, the edit into one that superficially looks better while not actually fully fixing the NPOV/PROMO/COPYVIO issue. That's the "boundary" that I think you were identifying earlier. Tazerdadog (talk) 23:46, 23 July 2025 (UTC)[reply]
@PPelberg (WMF), @Trizek (WMF) Would it be possible to implement a system that allows editors to see the text that triggered the warning in the first place? Sohom (talk) 16:55, 9 July 2025 (UTC)[reply]
(For context, like edit filters) Sohom (talk) 16:56, 9 July 2025 (UTC)[reply]
@Sohom Datta assuming the idea you're raising here is sufficiently similar to the idea Chaotic Enby raised above, let's continue talking about this in the thread above?
In the meantime, might you be able to share link(s) to edit filter(s) that Tone Check could draw inspiration from were we to collectively think implementing this behavior would be worthwhile? PPelberg (WMF) (talk) 22:21, 14 July 2025 (UTC)[reply]