Wikipedia:Bots/Requests for approval/SQLBot-AmazonAffiliateRemoval
Operator: SQL (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:11, Tuesday, December 11, 2018 (UTC)
Function overview: Per Wikipedia:Spam#Affiliate_links, removes the affiliate (tag=) portion from amazon links.
Automatic, Supervised, or Manual: Automatic
Programming language(s): PHP
Source code available: Once complete, yes.
Edit period(s): Daily
Estimated number of pages affected: 673 in the first run
Namespace(s): Mainspace
Exclusion compliant (Yes/No): No
Function details: Looks for tag= in amazon URLs, using the following query to find the affected pages and URLs:
SELECT page_title, el_to
FROM externallinks
JOIN page ON el_from = page_id
WHERE ( el_to LIKE "%amazon.%tag=%"
        OR el_to LIKE "%amzn.%tag=%" )
  AND page_namespace = 0;
Once found, removes the tag= parameter from every amazon URL matched above.
The function used to strip the tag= portion of the URL is:
$p_url = parse_url( $url );
parse_str( $p_url['query'], $out );
unset( $out['tag'] );
// Rebuild the remaining parameters; http_build_query() re-encodes keys and
// values, avoiding the URL-encoding issues of naive string concatenation.
$q = http_build_query( $out );
$stripped = "https://" . $p_url['host'] . ( $p_url['path'] ?? '' );
if ( $q !== '' ) {
    $stripped .= "?" . $q;
}
Discussion
Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — xaosflux Talk 01:28, 11 December 2018 (UTC)
- I note that the SQL query is not very efficient. There's no way to really avoid the table scan, but you should be able to greatly improve subsequent runs by remembering the maximum el_id value from just before the current run and only looking at rows with higher values next time. Anomie⚔ 18:28, 11 December 2018 (UTC)
- Also you might consider batching the query: select with AND el_id BETWEEN X AND X+10000 or something like that, process whichever rows you got from that, and repeat with increasing X until it's greater than the current MAX(el_id). Ideally adjust the "10000" there so each batch doesn't take more than a second or two to return. Anomie⚔ 18:36, 11 December 2018 (UTC)
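A minimal sketch of the batched selection described above, assuming the externallinks table's el_id key as Anomie references; the window bounds are illustrative:
-- Illustrative only: scan externallinks in fixed-size el_id windows,
-- advancing both bounds on each pass until past MAX(el_id).
SELECT page_title, el_to
FROM externallinks
JOIN page ON el_from = page_id
WHERE el_id BETWEEN 1 AND 10000
  AND ( el_to LIKE "%amazon.%tag=%" OR el_to LIKE "%amzn.%tag=%" )
  AND page_namespace = 0;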
Trial complete. I ended up implementing batching as suggested above: 100,000 ids at a time, and each query takes between 1 and 2 seconds to run. In the beginning there were some issues with URL encoding, but those have been resolved. SQLQuery me! 23:09, 11 December 2018 (UTC)
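A hedged sketch of that batching loop; runBatchQuery() and processLink() are hypothetical helpers standing in for the bot's actual code:
// Hypothetical driver loop for the batching described above; $db is assumed
// to be a PDO connection to the replica database.
$max = $db->query( "SELECT MAX(el_id) FROM externallinks" )->fetchColumn();
for ( $x = 0; $x <= $max; $x += 100000 ) {
    foreach ( runBatchQuery( $x, $x + 100000 ) as $row ) { // 100,000 ids per batch
        processLink( $row['page_title'], $row['el_to'] ); // hypothetical helper
    }
}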
- Q: Is 673 pages the estimated total that have a tag, or just a small trial set? Asking because if it's a large number, it should be aware of archive URLs. Changing an archive URL will break the URL. We have many millions of archive URLs. There are a couple fairly simple ways to avoid archive URLs I can pass along if you would like. -- GreenC 23:42, 11 December 2018 (UTC)
- GreenC, that should be everything, and that's a very good catch. I could check the "host" portion of parse_url to make sure it contains either "amazon" or "amzn". I believe that would be sufficient.
- By the way - I imagine that we could probably reduce this to a one-time run as well, if someone wanted to make an edit filter or spam blacklist entry to stop these before they get started. SQLQuery me! 23:50, 11 December 2018 (UTC)
- Not sure how it's parsing the article for URLs; might it still pick up URLs in the query or path portion of another URL? The two main types lead with either a "/" or "?url=", like archive.org/web/20181210010101/http://... or webcitation.org/456hsdus?url=http://. If it back-checked for those leading character(s), it should be safe. User:Headbomb came up with a regex for this I can try to track down if you'd like. -- GreenC 00:08, 12 December 2018 (UTC)
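An illustrative pattern in that spirit (not Headbomb's actual regex, which is not reproduced here), rejecting amazon/amzn URLs preceded by a "/" or "url=" and therefore embedded in an archive URL:
// Illustrative only: skip amazon/amzn links embedded in another URL's
// path ("/http://amazon...") or query ("?url=http://amazon...").
$pattern = '%(?<!/)(?<!url=)\bhttps?://(?:www\.)?(?:amazon|amzn)\.\S+%i';
preg_match_all( $pattern, $wikitext, $matches );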
- Don't really know what I can do to help here with regexes. However, I'll comment on blocking affiliates with an edit filter. Most often, those are just good faith copy-pastes of URLs. That shouldn't really be blocked, although an edit summary tag might be appropriate. Headbomb {t · c · p · b} 01:35, 12 December 2018 (UTC)
- I've updated it to require that the hostname returned by parse_url contains either "amazon." or "amzn.", which will handle any issues surrounding archives. SQLQuery me! 04:25, 13 December 2018 (UTC)
- User:SQL: question: if the wiki source contains http://archive.org/web/20181210010101/http://amazon.com/.., a regex for an amazon URL would match on the http://amazon.com/.. portion, and parse_url would OK it since it has an amazon hostname. -- GreenC 04:41, 13 December 2018 (UTC)
- Don't really know what I can do to help here with regexes. However, I'll comment on blocking affiliates with an edit filter. Most often, those are just good faith copy-pastes of URLs. That shouldn't really be blocked, although an edit summary tag might be appropriate. Headbomb {t · c · p · b} 01:35, 12 December 2018 (UTC)
- In that case, parse_url returns array ( 'scheme' => 'http', 'host' => 'archive.org', 'path' => '/web/20181210010101/http://amazon.com/' ), of which 'archive.org' would return false on a strpos( "archive.org", "amazon." ) call, skipping that URL. SQLQuery me! 04:46, 13 December 2018 (UTC)
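A minimal sketch of that host check as described above; the function name is illustrative:
// Returns true only when the URL's own hostname is an amazon/amzn host,
// so archive wrappers like archive.org/.../http://amazon.com/ are skipped.
function is_amazon_host( $url ) {
    $host = parse_url( $url, PHP_URL_HOST );
    if ( !is_string( $host ) ) {
        return false; // malformed URL or no host component
    }
    return strpos( $host, "amazon." ) !== false
        || strpos( $host, "amzn." ) !== false;
}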
- As this appears to be a single-purpose account, please build out the user page a bit more to explain why this bot is doing what it does, so that people looking at it will understand that Wikipedia isn't in some us-vs-Amazon fight. — xaosflux Talk 03:16, 12 December 2018 (UTC)
- Xaosflux, Doing..., but while I'm doing so - I thought it might be appropriate to rename the bot as well, removing the 'amazon' bit: both for the reason you point out above, and because I might use it later for other affiliate link removals. What do you think? SQLQuery me! 04:16, 13 December 2018 (UTC)