Wikipedia:Bots/Requests for approval/SQLBot-AmazonAffiliateRemoval
Operator: SQL (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:11, Tuesday, December 11, 2018 (UTC)
Function overview: Per Wikipedia:Spam#Affiliate_links, removes the affiliate (tag=) portion from amazon links.
Automatic, Supervised, or Manual: Automatic
Programming language(s): PHP
Source code available: Once complete, yes.
Edit period(s): Daily
Estimated number of pages affected: 673 in the first run
Namespace(s): Mainspace
Exclusion compliant (Yes/No): No
Function details: Looks for tag= in amazon URLs, using the following query to find the affected pages and URLs:
SELECT page_title, el_to
FROM externallinks
JOIN page ON el_from = page_id
WHERE ( el_to LIKE "%amazon.%tag=%"
        OR el_to LIKE "%amzn.%tag=%" )
  AND page_namespace = 0;
Once found, removes the tag= parameter from every amazon URL matched above.
The function used to strip the tag= portion of the URL is:
$p_url = parse_url( $url );
parse_str( $p_url['query'], $out );
unset( $out['tag'] );
// Rebuild the remaining parameters; http_build_query() re-encodes keys and
// values, avoiding the URL-encoding issues of naive string concatenation.
$q = http_build_query( $out );
$stripped = "https://" . $p_url['host'] . ( $p_url['path'] ?? '' );
if ( $q !== '' ) {
    $stripped .= "?" . $q;
}
Discussion
Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — xaosflux Talk 01:28, 11 December 2018 (UTC)
- I note that the SQL query is not very efficient. There's no way to really avoid the table scan, but you should be able to greatly improve subsequent runs by remembering the maximum el_id value from just before the current run and only looking at rows with higher values next time. Anomie⚔ 18:28, 11 December 2018 (UTC)
- Also you might consider batching the query: select with AND el_id BETWEEN X AND X+10000 or something like that, process whichever rows you got from that, and repeat with increasing X until it's greater than the current MAX(el_id). Ideally adjust the "10000" there so each batch doesn't take more than a second or two to return. Anomie⚔ 18:36, 11 December 2018 (UTC)
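A minimal sketch of the batched selection described above, assuming the externallinks table's el_id key as Anomie references; the window bounds are illustrative:
-- Illustrative only: scan externallinks in fixed-size el_id windows,
-- advancing both bounds on each pass until past MAX(el_id).
SELECT page_title, el_to
FROM externallinks
JOIN page ON el_from = page_id
WHERE el_id BETWEEN 1 AND 10000
  AND ( el_to LIKE "%amazon.%tag=%" OR el_to LIKE "%amzn.%tag=%" )
  AND page_namespace = 0;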
Trial complete. I ended up implementing batching as suggested above: 100,000 ids at a time, and each query takes between 1 and 2 seconds to run. In the beginning there were some issues with URL encoding, but those have been resolved. SQLQuery me! 23:09, 11 December 2018 (UTC)
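A hedged sketch of that batching loop; runBatchQuery() and processLink() are hypothetical helpers standing in for the bot's actual code:
// Hypothetical driver loop for the batching described above; $db is assumed
// to be a PDO connection to the replica database.
$max = $db->query( "SELECT MAX(el_id) FROM externallinks" )->fetchColumn();
for ( $x = 0; $x <= $max; $x += 100000 ) {
    foreach ( runBatchQuery( $x, $x + 100000 ) as $row ) { // 100,000 ids per batch
        processLink( $row['page_title'], $row['el_to'] ); // hypothetical helper
    }
}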
- Q: Is 673 pages the estimated total that have a tag, or just a small trial set? Asking because if it's a large number, it should be aware of archive URLs. Changing an archive URL will break the URL. We have many millions of archive URLs. There are a couple fairly simple ways to avoid archive URLs I can pass along if you would like. -- GreenC 23:42, 11 December 2018 (UTC)
- GreenC, that should be everything, and that's a very good catch. I could check the "host" portion of parse_url to make sure it contains either "amazon" or "amzn". I believe that would be sufficient.
- By the way - I imagine that we could probably reduce this to a one-time run as well, if someone wanted to make an edit filter or spam blacklist entry to stop these before they get started. SQLQuery me! 23:50, 11 December 2018 (UTC)
- Not sure how it's parsing the article for URLs; might it still pick up URLs in the query or path portion of another URL? The two main types lead with either a "/" or "?url=", like archive.org/web/20181210010101/http://... or webcitation.org/456hsdus?url=http://. If it back-checked for those leading character(s), it should be safe. User:Headbomb came up with a regex for this I can try to track down if you'd like. -- GreenC 00:08, 12 December 2018 (UTC)
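An illustrative pattern in that spirit (not Headbomb's actual regex, which is not reproduced here), rejecting amazon/amzn URLs preceded by a "/" or "url=" and therefore embedded in an archive URL:
// Illustrative only: skip amazon/amzn links embedded in another URL's
// path ("/http://amazon...") or query ("?url=http://amazon...").
$pattern = '%(?<!/)(?<!url=)\bhttps?://(?:www\.)?(?:amazon|amzn)\.\S+%i';
preg_match_all( $pattern, $wikitext, $matches );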
- Don't really know what I can do to help here with regexes. However, I'll comment on blocking affiliates with an edit filter. Most often, those are just good faith copy-pastes of URLs. That shouldn't really be blocked, although an edit summary tag might be appropriate. Headbomb {t · c · p · b} 01:35, 12 December 2018 (UTC)
- I've updated it to require that the hostname returned by parse_url contains either "amazon." or "amzn.", which will handle any issues surrounding archives. SQLQuery me! 04:25, 13 December 2018 (UTC)
- User:SQL: question: if the wiki source contains http://archive.org/web/20181210010101/http://amazon.com/.., a regex for an amazon URL would match on the http://amazon.com/.. portion, and parse_url would OK it since it has an amazon hostname. -- GreenC 04:41, 13 December 2018 (UTC)
- Don't really know what I can do to help here with regexes. However, I'll comment on blocking affiliates with an edit filter. Most often, those are just good faith copy-pastes of URLs. That shouldn't really be blocked, although an edit summary tag might be appropriate. Headbomb {t · c · p · b} 01:35, 12 December 2018 (UTC)
- In that case, parse_url returns array ( 'scheme' => 'http', 'host' => 'archive.org', 'path' => '/web/20181210010101/http://amazon.com/' ), of which 'archive.org' would return false on a strpos( "archive.org", "amazon." ) call, skipping that URL. SQLQuery me! 04:46, 13 December 2018 (UTC)
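A minimal sketch of that host check as described above; the function name is illustrative:
// Returns true only when the URL's own hostname is an amazon/amzn host,
// so archive wrappers like archive.org/.../http://amazon.com/ are skipped.
function is_amazon_host( $url ) {
    $host = parse_url( $url, PHP_URL_HOST );
    if ( !is_string( $host ) ) {
        return false; // malformed URL or no host component
    }
    return strpos( $host, "amazon." ) !== false
        || strpos( $host, "amzn." ) !== false;
}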
- As this appears to be a single-purpose account, please build out the user page a bit more to explain why this bot is doing what it does, so that people looking at it will understand that Wikipedia isn't in some us-vs-Amazon fight. — xaosflux Talk 03:16, 12 December 2018 (UTC)
- Xaosflux, Doing..., but while I'm doing so - I thought it might be appropriate to rename the bot as well, removing the 'amazon' bit: both for the reason you point out above, and because I might use it later for other affiliate link removals. What do you think? SQLQuery me! 04:16, 13 December 2018 (UTC)