
Wikipedia:Bots/Requests for approval/CitationCleanerBot

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Headbomb (talk | contribs) at 05:43, 7 September 2011 (CitationCleanerBot: :*Gonna do roughly ~500 edits :**[http://en.wikipedia.org/w/index.php?title=Agaricus_bisporus&diff=448888604&oldid=443561028 Screwed up here], this was fixed. :**[more to come]? :~~~~). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Operator: Headbomb (talk · contribs)

Time filed: 01:09, Monday September 5, 2011 (UTC)

Automatic or Manual: Automatic

Programming language(s): AWB

Source code available: On request (will evolve over time)

Function overview: Cleaning citation templates.

Links to relevant discussions (where appropriate): N/A, kind of a "duh" thing.

Edit period(s): Will usually edit after dumps on articles likely to contain fixes.

Estimated number of pages affected: 10-20K?

Exclusion compliant (Y/N):

Already has a bot flag (Y/N): No

Function details: This mostly concerns fixes like the following

  • |url=http://www.jstor.org/stable/123456789 → |jstor=123456789
  • {{cite news}}/{{cite web}} → {{cite journal}} when appropriate
  • Removing accessdates when no URL is present
  • Convert certain bare URLs to cite journals. <ref>[http://www.jstor.org/stable/123456789]</ref> → <ref>{{cite journal|jstor=123456789}}</ref> (similar to Citation bot 8) so Citation bot can expand them. (let's leave this for another BRFA)
  • AWB genfixes
  • Removal of rarely used empty parameters (on a per-template basis). For example, all empty |laysummary= in {{cite journal}} are leftover clutter from a copy-paste of the template documentation. |oclc= is likely to have a use for a book, but for journal articles they are leftover clutter from a copy-paste. |issn= can be useful for a journal, but is just silly for a book.
  • Some other minor fixes, like issn hyphenation, would be bundled with it over time
  • The bot does not add or remove information, it just cleans up the existing stuff
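As a rough illustration of three of the fixes listed above (JSTOR URL to |jstor=, dropping |accessdate= when no URL is present, and removing empty rarely used parameters), here is a minimal Python sketch. It is not the bot's actual AWB logic, which is not shown on this page. The function names, the `REMOVABLE_IF_EMPTY` table, and the simplification that each citation is a flat template (no nested templates or piped links inside values, and only |url= counted as a URL) are all assumptions made for the example.

```python
import re

JSTOR_URL = re.compile(r"https?://www\.jstor\.org/stable/(\d+)/?")

# Hypothetical per-template table of parameters treated as clutter when empty.
REMOVABLE_IF_EMPTY = {
    "cite journal": {"laysummary", "oclc"},
    "cite book": {"issn"},
}


def split_template(wikitext):
    """Split a flat '{{name|k=v|...}}' string into (name, [(key, value), ...]).

    Assumes no nested templates or piped wikilinks inside the values.
    """
    inner = wikitext.strip()[2:-2]
    name, *parts = inner.split("|")
    params = []
    for part in parts:
        key, sep, value = part.partition("=")
        params.append((key.strip(), value.strip()) if sep else ("", part.strip()))
    return name.strip(), params


def join_template(name, params):
    """Rebuild the template text from its name and (key, value) pairs."""
    pieces = [name] + [f"{k}={v}" if k else v for k, v in params]
    return "{{" + " |".join(pieces) + "}}"


def clean_citation(wikitext):
    name, params = split_template(wikitext)
    keys = {k for k, _ in params}

    # Pass 1: |url=<JSTOR stable URL> becomes |jstor=<id>, unless |jstor= already exists.
    converted = []
    for key, value in params:
        match = JSTOR_URL.fullmatch(value) if key == "url" else None
        if match and "jstor" not in keys:
            key, value = "jstor", match.group(1)
        converted.append((key, value))

    # Pass 2: drop |accessdate= when no non-empty |url= remains,
    # and drop the listed parameters when they are empty.
    has_url = any(k == "url" and v for k, v in converted)
    removable = REMOVABLE_IF_EMPTY.get(name.lower(), set())
    kept = [
        (k, v)
        for k, v in converted
        if not (k == "accessdate" and not has_url)
        and not (k in removable and not v)
    ]
    return join_template(name, kept)
```

For example, `clean_citation("{{cite journal |title=X |url=http://www.jstor.org/stable/123456789 |accessdate=2011-09-05 |laysummary= |oclc=}}")` gives `{{cite journal |title=X |jstor=123456789}}`. Real wikitext is messier than this, so a production version would need a proper template parser.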

Discussion

  • For the bare url conversions, see Wikipedia:Bots/Requests for approval/Citation bot 8. I would get a list of articles with such conversions done at the end of each run, and then run citation bot on them myself, which would bring them in line with the majority use of {{citation}} vs {{cite xxx}}. If another style is used, filled out citations are easier to convert than bare urls, and will be better than bare urls in the meantime.
  • ISSNs are always hyphenated as XXXX-XXXX (unlike ISBNs, which can be unhyphenated, although this is not recommended officially), and after cleaning up several thousand articles, I've yet to come across an unhyphenated ISSN that was inserted manually (rather than by bots/scripts) or which was not at odds with the rest of the article. The bot could do ISBN hyphenation, but I specifically left it out since that's a legitimate stylistic alternative actually found in the outside world (e.g. Google Books does not hyphenate ISBNs, many books have unhyphenated ISBNs on their information page, but all journal databases do hyphenate ISSNs, and no journal features an unhyphenated ISSN on its cover/information page).
  • Other minor fixes would include whitespace stripping in citation parameters (something like |title=How  the   West \n Was     Won → |title=How the West Was Won, once I figure out how to implement it), or journal disambiguation (|journal=[[Nature]] → |journal=[[Nature (journal)|Nature]]), etc.
Headbomb {talk / contribs / physics / books} 16:28, 5 September 2011 (UTC)[reply]
  • I'm OK with ISSNs then, though I'm not an expert, so leaving this up to someone who knows better. Fine on minor fixes, as long as you notify about these on WT:BRFA or something.
  • Citation Bot is a manual tool; this is an automated one. Citation Bot was approved as a manual tool, and that doesn't imply consensus for automating that. It's one thing to have an edit initiated manually, it is another to have a batch of pages edited automatically. While I'm all for citation templates and couldn't agree more that they are better than bare urls, that's not everyone's opinion. You are bound to stomp on someone's garden eventually and we'll be on our merry way to AN (again) :) So I can only suggest advertising this more broadly, imposing thresholds, or asking other BAGgers. —  HELLKNOWZ  ▎TALK 17:35, 5 September 2011 (UTC)[reply]
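The ISSN hyphenation and whitespace-stripping fixes mentioned above could look roughly like the following Python sketch. These are illustrative regexes, not the bot's AWB rules; the function names are made up for the example, and the patterns assume flat templates whose values contain no `|` or `}` characters.

```python
import re

# An unhyphenated ISSN (8 characters, last may be X) as the whole |issn= value.
ISSN_PARAM = re.compile(r"(\|\s*issn\s*=\s*)(\d{4})(\d{3}[\dXx])(?=\s*(?:\||\}\}))")

# The value of |title=, up to the next parameter or the end of the template.
TITLE_PARAM = re.compile(r"(\|\s*title\s*=)([^|}]*)")


def hyphenate_issn(wikitext):
    """Rewrite |issn=XXXXXXXX as |issn=XXXX-XXXX; hyphenated values are left alone."""
    return ISSN_PARAM.sub(r"\1\2-\3", wikitext)


def collapse_title_whitespace(wikitext):
    """Collapse runs of spaces and newlines inside |title= into single spaces."""

    def fix(match):
        value = match.group(2)
        # Keep the whitespace that separates the value from the next parameter.
        trailing = value[len(value.rstrip()):]
        return match.group(1) + " ".join(value.split()) + trailing

    return TITLE_PARAM.sub(fix, wikitext)
```

With these, `hyphenate_issn("{{cite journal |issn=00280836 |title=X}}")` yields `{{cite journal |issn=0028-0836 |title=X}}`, and a title spread over several lines with repeated spaces comes back on one line with single spaces.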

Seeing as part of this is mostly uncontroversial, Approved for trial (100 edits) Please provide a link to the relevant contributions and/or diffs when the trial is complete. for the following:

with genfixes enabled. Try to balance the different tasks throughout the 100 edits. I'm leaving the "convert bare url to cite journals" task open to discussion/advertising. — The Earwig (talk) 19:47, 5 September 2011 (UTC)[reply]

Trial complete. I'm currently reviewing the edits, see if there's anything wrong with any of them. Headbomb {talk / contribs / physics / books} 21:21, 5 September 2011 (UTC)[reply]

In [1] |id=ISBN-x is converted to |isbn=-x. This is GIGO stuff (the article isn't made any worse by the bot).
In [2], there's a bad handling of a bad use of doi. That's GIGO, but avoidable GIGO. This has been fixed.
In [3], it unlinked two PDFs. It shouldn't have done that, and I tweaked the logic accordingly.
And that's pretty much that. Headbomb {talk / contribs / physics / books} 21:35, 5 September 2011 (UTC)[reply]
For 1, it would be useful if the bot could tell that it is garbage, and post about it somewhere, add a cleanup category, or whatever. I think ISBNs that do not match /((\d-?){9}|(\d-?){12})[\dX]/ are invalid; perhaps there's a better regex around. Ucucha (talk) 00:05, 6 September 2011 (UTC)[reply]
Would probably be a better job for a database/toolserver report than for this bot. I'm not against incorporating some ISBN checker with the bot, but I would rather have an established solution / cleanup template for this before doing so. Headbomb {talk / contribs / physics / books} 00:18, 6 September 2011 (UTC)[reply]
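For reference, the shape check suggested above can be tried out with a few lines of Python. This is only a sketch of how such a report might flag values: the regex is the one quoted in the comment, but applying it with `fullmatch` (so the whole value must match) is an assumption, as are the function names. It checks the shape only and does not verify ISBN check digits.

```python
import re

# Regex quoted in the discussion: 10 or 13 characters, optional hyphens,
# with the last character a digit or X.
ISBN_SHAPE = re.compile(r"((\d-?){9}|(\d-?){12})[\dX]")

# The value of |isbn=, up to the next parameter or the end of the template.
ISBN_PARAM = re.compile(r"\|\s*isbn\s*=\s*([^|}]*)")


def looks_like_isbn(value):
    """True if the whole value has the shape of an ISBN-10 or ISBN-13."""
    return ISBN_SHAPE.fullmatch(value.strip()) is not None


def suspect_isbns(wikitext):
    """Return non-empty |isbn= values that do not have an ISBN shape."""
    values = (m.group(1).strip() for m in ISBN_PARAM.finditer(wikitext))
    return [v for v in values if v and not looks_like_isbn(v)]
```

Run on the GIGO case from the trial, `suspect_isbns("{{cite book |title=A |isbn=-x}}")` returns `["-x"]`, which could then feed a report or a cleanup category. Values written with spaces instead of hyphens would also be flagged by this pattern.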

I made some other tweaks to the bot (added a few more urls to recognize and clean, fixed a few regexes, and made it skip articles it was likely to mess up after discovering an issue). I've tested them semi-automatically on a variety of articles, but it would probably be a good idea to trial them. So could I get another trial? Headbomb {talk / contribs / physics / books} 16:02, 6 September 2011 (UTC)[reply]

Given that you chopped out that one buggy part, I'd like to see another/extended trial just to make sure everything works okay now. =) --slakrtalk / 04:42, 7 September 2011 (UTC)[reply]
I still note a few bugs that I would like to see resolved. For example, this should not happen; it is duplicating |pmc= with the same exact data. Yes, it's relatively minor, and supposedly a problem with AWB and not the bot itself, but it should be looked into. Approved for extended trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Trial until you think everything's been sufficiently tested. — The Earwig (talk) 04:59, 7 September 2011 (UTC)[reply]
  • Gonna do roughly ~500 edits
Headbomb {talk / contribs / physics / books} 05:43, 7 September 2011 (UTC)[reply]