Wikipedia:Bots/Requests for approval/DYKToolsAdminBot
Operator: RoySmith
Time filed: 15:34, Wednesday, March 1, 2023 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): Python
Source code available: https://github.com/roysmith/dyk-tools/tree/main/dyk_tools/bot
Function overview: Applies move protection to DYK articles which are on the main page or in the queue to be placed on the main page soon.
Links to relevant discussions (where appropriate): Wikipedia talk:Did you know/Archive 188#Move protection
Edit period(s): Continuous
Estimated number of pages affected: 10 per day
Exclusion compliant (Yes/No): No
Already has a bot flag (Yes/No): No
Adminbot (Yes/No): Yes
Function details:
First, some definitions:
Nomination: A nomination template, i.e. a subpage of {{Did you know nominations}}.
Hook: A string starting with "..." and ending with "?". Optionally includes a tag such as "ALT1".
Target: An article referenced from a hook using a bolded wikilink. All hooks have one or more targets.
Hookset: A template containing a collection of hooks along with other metadata. One of {{Did you know}} (i.e. the current hookset), the 7 numerically named subpages of {{Did you know/Queue}}, or the 7 numerically named {{Did you know/Preparation area 1}}, etc.
DYKToolsBot is already approved for a different task, but does not have admin rights. This new account (DYKToolsAdminBot) will handle tasks that require admin rights. They share the same code.
There are two distinct tasks proposed here, protect and unprotect. Both tasks are run as scheduled toolforge jobs. Currently both tasks run every 10 minutes, offset by a few minutes. The exact timing is not critical.
The protect task does:
Parse the main page + queue hooksets, extracting all the hooks. From the hooks, extract the targets which need protecting ("protectable targets"). These titles are indicated by wikilinks set in bold. There is typically one target per hook, but there can be more than one. For each protectable target, indefinite move=sysop protection will be applied. The page protection log messages will include a link to a page in the bot's userspace explaining the process.
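A minimal sketch of the target extraction described above, assuming hooks are available as plain wikitext strings and that the bold markup sits directly around the wikilink. The regular expression and the function names `extract_targets` and `protectable_targets` are illustrative assumptions, not the bot's actual code (which lives in Hook.targets() in the linked repository).

```python
import re

# Matches a wikilink wrapped in bold markup, e.g. '''[[Title|label]]'''.
# Assumption: the bold quotes sit directly around the link. Section anchors
# (#...) and display labels (|...) are skipped so only the title is captured.
BOLD_LINK = re.compile(r"'''\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]'''")


def extract_targets(hook_text):
    """Return the titles of bolded wikilinks in one hook, in order of appearance."""
    return [m.group(1).strip() for m in BOLD_LINK.finditer(hook_text)]


def protectable_targets(hooks):
    """Collect the unique targets across all hooks of the hooksets being scanned."""
    targets = set()
    for hook in hooks:
        targets.update(extract_targets(hook))
    return targets
```

Links that are not bolded are ignored, and a hook with several bolded links contributes several targets. Targets written as templates are not handled by this sketch.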
The unprotect task does:
Queries the bot's user log with type=protect for the previous N days, where N is long enough to account for any hooks which have progressed through the normal promotion process plus extra time to account for intra-queue hook swapping. It's currently set to 9, but might need to be increased. The exact value is not critical. These are the "unprotectable targets". The current list of protectable targets is acquired as in the protect task. Any targets in the unprotectable set which are not also in the protectable set are unprotected.
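The selection logic of the unprotect task can be sketched as a set difference. This is an illustration only: the log is modelled as hypothetical `(title, timestamp)` pairs rather than a real API response, and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def recently_protected(log_entries, now, days=9):
    """Titles the bot protected within the last `days` days.

    `log_entries` is an iterable of (title, timestamp) pairs, a hypothetical
    stand-in for the bot's type=protect log entries.
    """
    cutoff = now - timedelta(days=days)
    return {title for title, timestamp in log_entries if timestamp >= cutoff}


def targets_to_unprotect(log_entries, protectable, now, days=9):
    """Unprotectable targets that are no longer in the protectable set."""
    return recently_protected(log_entries, now, days) - set(protectable)


# Example: "A" is still in a queue, "B" has left, "C" is outside the window.
now = datetime(2023, 3, 10, tzinfo=timezone.utc)
log = [
    ("A", now - timedelta(days=1)),
    ("B", now - timedelta(days=3)),
    ("C", now - timedelta(days=30)),
]
to_unprotect = targets_to_unprotect(log, {"A"}, now)
```

A target outside the window, such as "C" here, is never considered, so it would be missed if it were still protected. Widening `days` costs little, because targets that are already unprotected can simply be skipped.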
I considered computing an expiration date and only protecting until then. The problem is that the expiration date is a moving target. Hooks often get shuffled around when problems are discovered. Sometimes hooks get unpromoted entirely after hitting a queue (or even when they're on the main page). Sometimes the queue processing schedule is disrupted by failure of the bot which manages that process (this has happened a couple of times in the past few weeks). A few times a year, queue processing toggles between 1 per day and 2 per day. Keeping track of all these possibilities and updating the expiration time would add significant complexity for no benefit. It's far simpler to use a declarative approach, in the style of Puppet: periodically figure out what state each target should be in right now and make it so, regardless of history.
This is currently running on testwiki. See https://test.wikipedia.org/wiki/Special:Log/DYKToolsAdminBot. Reviewers should feel free to exercise the bot by editing the DYK queues on testwiki.
Known problems
On rare occasions, hook targets are written as templates such as one of the (many) {{Ship}} variants. The current code does not recognize these properly (github bug). This happens infrequently enough, and it's difficult enough to do correctly (it requires a call out to Parsoid), and the consequences are mild enough (a page doesn't get the move protection it should), that I'm not going to make it a blocker for an initial deployment.
If a target was already move protected before entering the DYK pipeline, it will have that protection removed when it transitions out of DYK. The probability of this happening is so low, I'm going to ignore it. The alternative would be to maintain a database of pre-existing protections so they could be restored properly, which seems like more trouble than it's worth.
If enough protection log history isn't examined, it's possible to miss unprotecting a target which spent an abnormally long time in the DYK queues. If it happens, the target can be manually unprotected and the history window size increased.
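For the second problem above, the alternative of recording pre-existing protections could look roughly like the following. This is a hedged sketch, not part of the proposed bot: a plain dict stands in for whatever key-value store would be used, and `remember_and_protect` and `level_after_dyk` are hypothetical names.

```python
def remember_and_protect(store, title, existing_level):
    """Record the move protection seen on first contact, then return the level to apply.

    `existing_level` is the current move protection (e.g. "autoconfirmed") or
    None. setdefault() keeps only the first observation, so later runs that
    see the bot's own "sysop" protection do not overwrite the original value.
    """
    store.setdefault(title, existing_level)
    return "sysop"


def level_after_dyk(store, title):
    """Return the level to restore when a target leaves DYK (None means unprotect)."""
    return store.pop(title, None)


store = {}
remember_and_protect(store, "Article A", "autoconfirmed")  # first run
remember_and_protect(store, "Article A", "sysop")          # later run, no overwrite
remember_and_protect(store, "Article B", None)
```

When "Article A" leaves DYK, `level_after_dyk` returns "autoconfirmed", so the original protection can be reapplied instead of being dropped. A fuller version would also need to record the original expiry.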
Discussion
If a target was already move protected before entering the DYK pipeline, it will have that protection removed when it transitions out of DYK.
Why would this be the case, as you say it "[q]ueries the bot's user log with type=protect for the previous N days"? Would it not unprotect only the pages that were protected by it? – SD0001 (talk) 03:05, 2 March 2023 (UTC)
- You set move=autoconfirmed, then the bot changes that to move=sysop. It'll lose your original protection when it migrates off the main page and the bot unprotects. But this is enough of a corner case, I'm not going to worry about it. -- RoySmith (talk) 14:12, 2 March 2023 (UTC)
- Every article going through DYK losing its move protection seems like a problem to be worried about. While I understand new articles are rarely protected, recently promoted GAs could be. Can this be fixed? If a database is too much trouble, you can use redis since the data here is easily represented as key-value pairs. – SD0001 (talk) 17:13, 7 March 2023 (UTC)
- I need to think a bit on this. I had previously assumed existing move protection was such a rare thing, it wasn't worth worrying about much. But I just did a quick scan of WP:Recent additions and found:
- Recent additions/2022/January, protect_count=28
- Recent additions/2022/February, protect_count=14
- Recent additions/2022/March, protect_count=6
- Recent additions/2022/April, protect_count=12
- Recent additions/2022/May, protect_count=12
- Recent additions/2022/June, protect_count=9
- Recent additions/2022/July, protect_count=20
- Recent additions/2022/August, protect_count=15
- Recent additions/2022/September, protect_count=17
- Recent additions/2022/October, protect_count=13
- Recent additions/2022/November, protect_count=21
- Recent additions/2022/December, protect_count=9
- The protect_counts are how many targets had any move protection in their page protection log at all. The ones I spot-checked either had that protection already expired by the time they got to DYK, or applied after DYK was over, but it's still more than I had expected to see. I'm working on some ideas of how to deal with this. -- RoySmith (talk) 13:57, 8 March 2023 (UTC)
On rare occasions, hook targets are written as templates such as one of the (many) {{Ship}} variants.
You could parse the HTML instead of wikitext. Scanning the HTML for <a> tags leading to article namespace can be easier than parsing wikitext and doesn't require parsoid. – SD0001 (talk) 03:07, 2 March 2023 (UTC)
- I'm not totally following you. The output of parsoid is HTML (sort of) so calling out to parsoid is indeed parsing HTML. But it would add complexity which isn't justified for an initial rollout. The logic for this is contained in Hook.targets(), so at least plugging it in later wouldn't be too disruptive. -- RoySmith (talk) 14:40, 2 March 2023 (UTC)
- Well, it turns out this was easier to do than I thought it would be. Pywikibot's Site.expand_text() handles everything. I suspect it's calling Parsoid under the covers, but haven't gone digging to verify that. In any case, it works just fine. -- RoySmith (talk) 04:26, 7 March 2023 (UTC)
- mw:API:Parse is what I meant, it gets you the HTML without going through parsoid. There's also mw:API:Expandtemplates which might be what Site.expand_text() uses under the hood. – SD0001 (talk) 15:03, 7 March 2023 (UTC)
- Well, in any case, it's working. Can this be approved for a trial? -- RoySmith (talk) 15:09, 7 March 2023 (UTC)
If enough protection log history isn't examined, it's possible to miss unprotecting a target which spent an abnormally long time in the DYK queues.
Why not set a liberal value for N, say 25, since at the processing step it will anyway skip the pages that no longer have protection? I'm assuming the unprotect task only needs to be run at a much lower frequency than every 10 minutes. – SD0001 (talk) 03:12, 2 March 2023 (UTC)
- Yeah, there's very little downside to making the history window longer. Setting it to 25 wouldn't be a problem. You're also correct that the unprotect task could run at a lower frequency, and that's easy to change. -- RoySmith (talk) 14:45, 2 March 2023 (UTC)
- Has WP:AN been notified of this bot task per WP:ADMINBOT? Primefac (talk) 10:30, 8 March 2023 (UTC)
- Ooops, I didn't realize that was required. I just dropped a notification on WP:AN. -- RoySmith (talk) 13:48, 8 March 2023 (UTC)
- The discussion linked looks like a local consensus to me. Policy is against pre-emptive protection and move-protecting DYKs looks like a solution in search of a problem. As far as I can tell, only one example was given of an article being moved while on DYK and that was generally considered a good move. HJ Mitchell | Penny for your thoughts? 14:00, 8 March 2023 (UTC)