
Wikipedia:Bots/Requests for approval/DYK-Tools-Bot

This is an old revision of this page, as edited by RoySmith (talk | contribs) at 20:06, 18 January 2023 (Cron running: Reply). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Operator: RoySmith (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 00:35, Thursday, December 15, 2022 (UTC)

Function overview: A bot to assist in various tasks related to WP:DYK maintenance.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: https://github.com/roysmith/dyk-tools

Links to relevant discussions (where appropriate):

Edit period(s): Hourly

Estimated number of pages affected: Category:Pending DYK nominations currently has 268 entries

Namespace(s): Template

Exclusion compliant (Yes/No): No

Function details: This is a proposal for a new bot to help out at WP:DYK. A big part of the back-end work of DYK is building prep sets. Each set consists of 8 "hooks", which are chosen from those proposed in nominations. The selection of hooks needs to comply with an absurdly large number of rules. These rules include:

  • The hook must be previously approved, indicated by a checkmark icon on the nomination template.
  • Once approved, a hook can be unapproved by somebody raising an objection, requiring that it be re-approved
  • If you are the author of a hook or have approved it, you can't promote it to a set yourself
  • The first hook in a set must include an image (which in turn must be approved)
  • Within a set, it is strongly discouraged to run two hooks that are biographies next to each other
  • It is similarly strongly discouraged to run two hooks about American topics next to each other
  • The total number of biography and/or American topic hooks in a set is capped
  • Between sets, it is discouraged to have the lead hooks be of similar types
  • Certain hooks are tagged to be run on particular dates
  • And so on
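A few of the rules above are mechanical enough to check in code. The following is a minimal sketch only, not the tool's actual implementation: the `Hook` fields, the `set_problems` name, and the cap values (`max_bios`, `max_american`) are all assumptions made for illustration, since the real caps are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hook:
    """Hypothetical, simplified view of an approved hook."""
    title: str
    is_biography: bool = False
    is_american: bool = False
    has_image: bool = False


def set_problems(hooks, max_bios=4, max_american=4):
    """Return human-readable problems with a proposed prep set.

    The cap defaults are placeholders, not the real DYK limits.
    """
    problems = []
    # The lead hook must carry an image.
    if hooks and not hooks[0].has_image:
        problems.append("lead hook has no image")
    # Discourage two biographies or two American hooks in a row.
    for prev, cur in zip(hooks, hooks[1:]):
        if prev.is_biography and cur.is_biography:
            problems.append(f"adjacent biographies: {prev.title} / {cur.title}")
        if prev.is_american and cur.is_american:
            problems.append(f"adjacent American hooks: {prev.title} / {cur.title}")
    # Per-set caps on each type.
    if sum(h.is_biography for h in hooks) > max_bios:
        problems.append("too many biographies")
    if sum(h.is_american for h in hooks) > max_american:
        problems.append("too many American hooks")
    return problems
```

An empty result only means none of these particular checks fired; the human building the set still makes the final call.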

In the current process, people building prep sets scan the list of pending hooks looking for ones that meet all the requirements. It would be good to have a tool which automates as much of this as possible and presents to the human a list of potential hooks that might fit a given slot. It would then be up to the human to confirm the suitability and pick from the suggestions presented (or ignore them completely).

A proof-of-concept (POC) implementation of the evaluation system is currently running on Toolforge. Source is available on GitHub.

The next step is to repackage the nomination evaluation code as a bot which runs under cron on toolforge. This would:

  • Run at some reasonable interval. Hourly seems like a good starting point. Based on some initial measurements, I estimate a run will take a couple of minutes to complete.
  • Iterate over the articles in Category:Pending DYK nominations to find nominations to examine.
  • For each unassessed nomination, evaluate it to determine if it's a biography and/or an American topic.
  • Add Category:Pending DYK biographies and/or Category:Pending DYK American hooks to the nomination template as appropriate. The edit summary will include a link back to the bot's user page. A human can override the automatic assignments by adding or deleting classification templates manually.
  • Keep track of which nominations it has processed so it doesn't keep reprocessing the same ones. Any nomination which already has any of the classification templates will be automatically skipped. Thus, if a human does a manual evaluation, the bot will never override the human.
  • Iterate over Category:Pending DYK biographies and Category:Pending DYK American hooks to find any templates which are not (or are no longer) in Category:Pending DYK nominations, and remove the classification categories from them.
    • An alternative would be to have the bot edit the {{DYKsubpage}} which is on every nomination, adding new parameters to indicate the categories. That would clean up the cats automatically when the {{DYKsubpage}} is processed during the nomination close process.
  • I'll implement some kind of emergency button so anybody can stop it if it goes haywire.
  • Assert will be used to prevent logged-out editing (I need to figure out how that works in pywikibot).
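The classification and cleanup steps above can be sketched as pure set logic, with no wiki access at all. This is an illustrative sketch under assumptions: `plan_run` and its arguments are hypothetical names, and the two classifier callables stand in for whatever heuristics the bot really uses. Tracking of nominations that were examined but got no category is deliberately left out here.

```python
BIO_CAT = "Pending DYK biographies"
US_CAT = "Pending DYK American hooks"


def plan_run(pending, classified, is_biography, is_american):
    """Plan one bot run without touching the wiki.

    pending:    set of titles in Category:Pending DYK nominations
    classified: dict mapping category name -> set of titles currently in it
    Returns (to_add, to_remove), each a dict of title -> list of categories.
    """
    # Anything that already has a classification is skipped, so a manual
    # assignment by a human is never overridden.
    already = set().union(*classified.values())
    to_add = {}
    for title in sorted(pending - already):
        cats = []
        if is_biography(title):
            cats.append(BIO_CAT)
        if is_american(title):
            cats.append(US_CAT)
        if cats:
            to_add[title] = cats

    # Classified pages that are no longer pending need their categories removed.
    to_remove = {}
    for cat, members in classified.items():
        for title in sorted(members - pending):
            to_remove.setdefault(title, []).append(cat)
    return to_add, to_remove
```

Separating the planning from the editing like this would also make it easy to log what a run intends to do before any page is saved.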

Future work will be to build a tool that a user can run (probably as part of the existing toolforge web service) to filter based on these categories and/or other criteria. I could also see additional classification categories being added in the future if needed.

The code that touches the wiki is pywikibot. The web app is Flask.

I don't anticipate the need to persist much data. For what little state I do need, I'll probably use Redis to keep things simple.
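As a rough idea of how little state that is, here is a sketch of a "processed nominations" set. It is an assumption-laden illustration: the key name `dykbot:processed` and the helper names are made up, and `FakeRedis` is an in-memory stand-in so the example runs without a server. It mirrors only the `sadd` and `sismember` set commands that a redis-py client also provides.

```python
class FakeRedis:
    """In-memory stand-in exposing the two set commands used below."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        # Like Redis SADD: returns the number of members actually added.
        members = self._sets.setdefault(name, set())
        before = len(members)
        members.update(values)
        return len(members) - before

    def sismember(self, name, value):
        return value in self._sets.get(name, set())


PROCESSED_KEY = "dykbot:processed"  # hypothetical key name


def mark_processed(client, title):
    """Record a nomination; True if it was not recorded before."""
    return client.sadd(PROCESSED_KEY, title) == 1


def was_processed(client, title):
    return bool(client.sismember(PROCESSED_KEY, title))
```

Swapping `FakeRedis()` for a real client object should need no other changes, provided the client exposes the same two methods.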

I've created User:DYK-Tools-Bot.

Discussion

So to clarify, this BRFA is about the addition and removal of Category:Pending DYK biographies and/or Category:Pending DYK American hooks to pages in Category:Pending DYK nominations? How does it make this assessment? I presume by the associated article/article talk containing certain categories (like biographies or america-related wikiprojects)? Or some other heuristic? ProcrastinatingReader (talk) 23:24, 15 December 2022 (UTC)[reply]

The code is Article.is_biography() and Article.is_american(). The gist is:
These are probably not perfect, but they seem to be working. The heuristics can always be tweaked. Errors (in either direction) are not critical, since this is just an aid to a human who makes the final decision. -- RoySmith (talk) 23:48, 15 December 2022 (UTC)[reply]
Will the bot also be differentiating between approved and non-approved noms, by the way? theleekycauldron (talkcontribs) (she/her) 03:06, 16 December 2022 (UTC)[reply]
The existing code certainly has the ability to figure out if a nomination is approved. Ultimately I envision a front-end where you can say, for example, "Show me all the non-American biographies that are approved". But that's not something the bot part of this needs to know about when it's assigning categories. -- RoySmith (talk) 03:24, 16 December 2022 (UTC)[reply]
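A query like "show me all the non-American biographies that are approved" could look roughly like the sketch below. The `Nomination` record and the `select` function are hypothetical; they only illustrate filtering on three independent yes/no properties, where `None` means "don't care".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nomination:
    """Hypothetical summary of one nomination."""
    title: str
    approved: bool
    is_biography: bool
    is_american: bool


def select(noms, *, approved=None, biography=None, american=None):
    """Return nominations matching every criterion that is not None."""
    def ok(value, wanted):
        return wanted is None or value == wanted

    return [
        n for n in noms
        if ok(n.approved, approved)
        and ok(n.is_biography, biography)
        and ok(n.is_american, american)
    ]
```

For the example query, the call would be `select(noms, approved=True, biography=True, american=False)`.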

Approved for trial (7 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. ProcrastinatingReader (talk) 13:36, 16 December 2022 (UTC)[reply]

OK, thanks. I haven't actually written the bot code yet; I assume the 7 days runs from whenever I turn it on? -- RoySmith (talk) 14:27, 16 December 2022 (UTC)[reply]
Yeah ProcrastinatingReader (talk) 16:52, 16 December 2022 (UTC)[reply]
2022-12-18 21:49:35,789 INFO dykbot Done. Processed 242 nominations in 0:40:22.708095
but additional runs should take a lot less time since they will just be working on the new nominations. -- RoySmith (talk) 21:59, 18 December 2022 (UTC)[reply]

@RoySmith: I'm concerned because when you have a list that looks like this:

: something
:: something
::: something
:::: something
{{DYK-Tools-Bot was here}}
::::: something

we end up with a LISTGAP problem for screenreaders. Moving it outside {{DYKsubpage}} would theoretically prevent users from talking around it. theleekycauldron (talkcontribs) (she/her) 06:18, 26 December 2022 (UTC)[reply]

That makes sense. But it's at odds with your last statement: "Seems that it absolutely needs to go inside the DYKsubpage template." -- RoySmith (talk) 15:06, 26 December 2022 (UTC)[reply]
@RoySmith: that would be because I erred in offering that solution, despite the LISTGAP concerns – when a nomination is closed, the {{DYK-Tools-Bot was here}} persists in otherwise-blank transclusions, which is not great. theleekycauldron (talkcontribs) (she/her) 23:12, 27 December 2022 (UTC)[reply]
I think part of the problem is that {{DYK-Tools-Bot was here}} is producing user-visible text to begin with. My original intent was that it would have no visible text, and I'm planning to go back to that. It was intended as just a boolean marker that the bot would use to keep track of whether it had processed a nomination yet. I'll add something to the template documentation explaining what it is.
So, I think we're in agreement now that I'll put {{DYK-Tools-Bot was here}} after {{DYKsubpage}} and before the "Please do not write below this line" comment?
Template:Did you know nominations/E. Daniel Cherry is an interesting case. In this edit, ONUnicorn ignored the "do not write below this line" and put the DYK checklist below the line. I'm not sure if anything really cares about that. Which is another way of saying I'm not sure why we even have that comment line. -- RoySmith (talk) 23:57, 27 December 2022 (UTC)[reply]
@RoySmith: It's because of the aforementioned problem – if it's not in the {{DYKsubpage}} at time of close, it won't be against the pale blue and will be transcluded in places it shouldn't be. People do routinely ignore that line, it'd be nice to do something about it. theleekycauldron (talkcontribs) (she/her) 00:00, 28 December 2022 (UTC)[reply]
I think that works, yes :) theleekycauldron (talkcontribs) (she/her) 00:01, 28 December 2022 (UTC)[reply]
@RoySmith Oops. ~ ONUnicorn(Talk|Contribs)problem solving 03:50, 28 December 2022 (UTC)[reply]

I'm wary of bloating up the DYK page with that long message, honestly. Is it really required? Can this not just be stated on the bot/bot talk page (which someone will probably look at once the bot re-adds the template, if only to complain), or in a comment in the wikitext which doesn't show but will be visible to someone removing the template? ProcrastinatingReader (talk) 00:33, 28 December 2022 (UTC)[reply]

@ProcrastinatingReader @Theleekycauldron Leeky and I just had a quick zoom conversation where we cleared up a lot of confusion. The bottom line is I'll get rid of the message completely. And all the DYK-Tools-Bot stuff will go at the very end of the page, after the HTML comment.
Most of my confusion had to do with what the HTML comment means. While it says, "Place comments above this line", what it really means is "Place comments inside the DYKsubpage template". If you're thinking of the page as a sequence of lines of text, those two interpretations lead to the same meaning. But I was thinking of the page as a structured tree of nodes, which led me to think what I should be doing is putting my stuff after the {{DYKsubpage}} but before the start of the comment. -- RoySmith (talk) 00:58, 28 December 2022 (UTC)[reply]
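Treating the page as a sequence of lines, "at the very end of the page, after the HTML comment" reduces to a plain append. The sketch below assumes exactly that and nothing more; `add_marker` is a hypothetical helper name, and a real bot would more likely work on parsed wikitext than on raw strings.

```python
MARKER = "{{DYK-Tools-Bot was here}}"


def add_marker(wikitext):
    """Append the marker as the last line of the page, at most once."""
    if MARKER in wikitext:
        return wikitext  # already marked; leave the page untouched
    return wikitext.rstrip("\n") + "\n" + MARKER + "\n"
```

Because the function returns the text unchanged when the marker is already present, running it twice is harmless.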

Cron running

I've now got this running as an hourly cron job:

(venv) tools.dyk-tools@tools-sgebastion-11 [~] toolforge-jobs list
Job name:    Job type:             Status:
-----------  --------------------  ---------------------------
dykbot-cron  schedule: 43 * * * *  Last schedule time: unknown

I believe I've incorporated all of the comments above. Please let me know if there's anything of concern, and of course, feel free to block DYK-Tools-Bot if it starts doing something stupid. Now that this is running in automated mode, I'm assuming the 7-day trial clock is now running. -- RoySmith (talk) 04:09, 2 January 2023 (UTC)[reply]
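For anyone unfamiliar with cron syntax, the schedule `43 * * * *` shown above means "at minute 43 of every hour". A small sketch of what that implies for the next run time follows; `next_run` is a made-up helper for illustration and is not part of toolforge-jobs.

```python
from datetime import datetime, timedelta


def next_run(now, minute=43):
    """Return the first time strictly after `now` that falls on the given minute."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed (or is right now); use the next hour.
        candidate += timedelta(hours=1)
    return candidate
```

So a nomination created just after minute 43 would wait close to an hour before the bot first sees it.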

I pushed a few changes out today. There are some improvements to the "American" detection heuristics, and I've moved to the {{Pending DYK biographies}} scheme instead of the raw Category:Pending DYK biographies tags. -- RoySmith (talk) 03:50, 6 January 2023 (UTC)[reply]
@ProcrastinatingReader The 7-day trial period is up in a couple of hours. What's next? -- RoySmith (talk) 02:15, 9 January 2023 (UTC)[reply]
Oh, I see. I'm supposed to say Trial complete. -- RoySmith (talk) 13:42, 9 January 2023 (UTC)[reply]
Is {{DYK-Tools-Bot was here}} really required? Can you track state within the bot, using some kind of database, or even a buffered flatfile if you want to avoid spinning up new infrastructure? On Toolforge it might be easier to spin up a db (see wikitech:Help:Toolforge/Database#Steps_to_create_a_user_database_on_tools.db.svc.wikimedia.cloud), which saves you having to deal with on-disk logic. The bot seems to be editing far more pages than necessary ('necessary' being cases where it has to add a category). ProcrastinatingReader (talk) 15:58, 18 January 2023 (UTC)[reply]
Strictly speaking, yes, it would be possible to track the state managed by {{DYK-Tools-Bot was here}}, but keeping everything on the DYK pages seemed more logical. The intent is that the categories can be overridden by humans (either adding or removing). {{DYK-Tools-Bot was here}} provides a human-visible indication that the bot has done its thing. Lacking that, a human would have no way to know if the bot mis-categorized the nomination as not belonging to any of its managed categories, or simply hasn't been there yet. -- RoySmith (talk) 16:25, 18 January 2023 (UTC)[reply]
Why not run the bot more frequently? Any reason it can't run every 5 minutes or so - I suspect it's not really intensive on the API? (Or webhooks, if the API offers that, though I think it's probably overkill.) My thinking is: If it runs frequently enough then by the time a human DYK editor starts to wonder if they need to add categories manually, the answer will be yes, as the bot has very likely visited it.
If we really do want to make it obvious and possible for other users to verify state, you could maintain the flatfile onwiki under the bot's userspace, though IMO that's probably overkill and adds some overhead. ProcrastinatingReader (talk) 16:33, 18 January 2023 (UTC)[reply]
I agree with Proc, it would be nice if we didn't need the template. The main advantage of keeping state on-wiki tends to be that humans can "reset" the bot without needing maintainer assistance. But since you've already suggested having a web tool, maybe that tool could also surface whether an article has been processed or not?
Alternatively, I also like leeky's suggestion above of having it be a parameter to the existing DYKsubpage template, though I'd suggest something like |categorized=yes, rather than being named exclusively to the bot. Legoktm (talk) 18:53, 18 January 2023 (UTC)[reply]
One other suggestion, you could alternatively just check the page history to see if the bot has edited the nomination before. Since you're on Toolforge and have access to the database replicas, it shouldn't be too difficult to have a query that gives you a list of pages in the pending nominations category that the bot hasn't edited yet. Legoktm (talk) 18:55, 18 January 2023 (UTC)[reply]
That can't differentiate between "looked at and decided not to add any categories" and "hasn't been looked at yet". -- RoySmith (talk) 20:06, 18 January 2023 (UTC)[reply]
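The distinction at issue in this thread is a three-way state, which the marker template makes readable from the page text alone. The sketch below is illustrative only: the `State` names and `nomination_state` are hypothetical, and the name {{Pending DYK American hooks}} is assumed by analogy with {{Pending DYK biographies}}.

```python
from enum import Enum

MARKER = "{{DYK-Tools-Bot was here}}"
CATEGORY_TEMPLATES = (
    "{{Pending DYK biographies}}",
    "{{Pending DYK American hooks}}",  # assumed name, by analogy
)


class State(Enum):
    NOT_VISITED = "bot has not examined this nomination yet"
    VISITED_NO_CATEGORIES = "bot examined it and assigned nothing"
    CATEGORIZED = "at least one classification is present"


def nomination_state(wikitext):
    """Classify a nomination page from its raw text."""
    if any(template in wikitext for template in CATEGORY_TEMPLATES):
        return State.CATEGORIZED
    if MARKER in wikitext:
        return State.VISITED_NO_CATEGORIES
    return State.NOT_VISITED
```

Without the marker (or some equivalent record kept elsewhere), the last two branches collapse into one, which is the ambiguity described above.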