Wikipedia:Bots/Requests for approval/SQLBot-AmazonAffiliateRemoval

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by SQL (talk | contribs) at 23:09, 11 December 2018 (Discussion: Trial complete). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Operator: SQL (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 01:11, Tuesday, December 11, 2018 (UTC)

Function overview: Per Wikipedia:Spam#Affiliate_links, removes the affiliate (tag=) parameter from Amazon links.

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: Once complete, yes.

Edit period(s): Daily

Estimated number of pages affected: 673 in the first run

Namespace(s): Mainspace

Exclusion compliant (Yes/No): No

Function details: Looks for the tag= parameter in Amazon URLs using this query:

SELECT page_title, 
                el_to 
FROM   externallinks 
       JOIN page 
         ON el_from = page_id 
WHERE  ( el_to LIKE "%amazon.%tag=%" 
          OR el_to LIKE "%amzn.%tag=%" ) 
       AND page_namespace = 0;

The query above finds the affected pages and URLs. Once found, the bot removes the tag= parameter from each matching Amazon URL.

The function used to strip the tag= portion of the URL is:

// Parse the URL and drop the affiliate "tag" parameter.
$p_url = parse_url( $url );
parse_str( $p_url['query'] ?? '', $out );
unset( $out['tag'] );
$stripped = "https://" . $p_url['host'] . $p_url['path'];
if ( !empty( $out ) ) {
        // http_build_query re-encodes the remaining parameters,
        // and skipping it when none remain avoids a trailing "?".
        $stripped .= "?" . http_build_query( $out );
}
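Wrapped up as a self-contained function, the stripping logic might look like the sketch below. The function name strip_affiliate_tag is illustrative (not the bot's actual code), and http_build_query is used to re-encode whatever parameters remain after tag= is removed:

```php
<?php
// Illustrative sketch of the tag-stripping step. Removes the affiliate
// "tag" parameter and rebuilds the URL from its remaining parts.
function strip_affiliate_tag( $url ) {
        $p_url = parse_url( $url );
        parse_str( $p_url['query'] ?? '', $out );
        unset( $out['tag'] );                       // drop the affiliate parameter
        $stripped = "https://" . $p_url['host'] . $p_url['path'];
        if ( !empty( $out ) ) {
                // Re-encode the surviving parameters; omit the "?" entirely
                // when tag= was the only parameter.
                $stripped .= "?" . http_build_query( $out );
        }
        return $stripped;
}

// Example with a hypothetical product link carrying an affiliate tag:
echo strip_affiliate_tag( "https://www.amazon.com/dp/B000EXAMPLE?tag=someaffiliate-20&psc=1" );
// https://www.amazon.com/dp/B000EXAMPLE?psc=1
```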

Discussion

Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. xaosflux Talk 01:28, 11 December 2018 (UTC)[reply]
  • I note that the SQL query is not very efficient. There's no way to really avoid the table scan, but you should be able to greatly improve subsequent runs by remembering the maximum el_id value from just before the current run and only looking at rows with higher values next time. Anomie 18:28, 11 December 2018 (UTC)[reply]
    • Also you might consider batching the query: select with AND el_id BETWEEN X AND X+10000 or something like that, process whichever rows you got from that, and repeat with increasing X until it's greater than the current MAX(el_id). Ideally adjust the "10000" there so each batch doesn't take more than a second or two to return. Anomie 18:36, 11 December 2018 (UTC)[reply]
  • Trial complete. I ended up implementing batching as suggested above. 100000 ids at a time, each query takes between 1 and 2 seconds to run. In the beginning, there were some issues with url encoding, but those have been resolved. SQLQuery me! 23:09, 11 December 2018 (UTC)[reply]
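The batched scan suggested above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the batch_ranges helper and the MAX(el_id) value are assumptions, and plugging each window into the main query's "AND el_id BETWEEN ? AND ?" clause is indicated only by a comment:

```php
<?php
// Sketch of the batched scan: split the el_id space into fixed-size
// windows so no single query runs for more than a second or two.
function batch_ranges( $max_id, $batch ) {
        $ranges = array();
        for ( $x = 0; $x <= $max_id; $x += $batch ) {
                $ranges[] = array( $x, $x + $batch - 1 );
        }
        return $ranges;
}

// Example with a hypothetical MAX(el_id) of 250000 and 100000-id batches:
foreach ( batch_ranges( 250000, 100000 ) as $r ) {
        // Run the main query with: AND el_id BETWEEN $r[0] AND $r[1]
        echo $r[0] . "-" . $r[1] . "\n";
}
// 0-99999
// 100000-199999
// 200000-299999
```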