Wikipedia:Bots/Requests for approval/SQLBot-AmazonAffiliateRemoval
Appearance
Operator: SQL (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:11, Tuesday, December 11, 2018 (UTC)
Function overview: Per Wikipedia:Spam#Affiliate_links, removes the affiliate (tag=) portion from amazon links.
Automatic, Supervised, or Manual: Automatic
Programming language(s): PHP
Source code available: Once complete, yes.
Edit period(s): Daily
Estimated number of pages affected: 673 in the first run
Namespace(s): Mainspace
Exclusion compliant (Yes/No): No
Function details: Looks for tag= in amazon urls, this query:
SELECT page_title,
el_to
FROM externallinks
JOIN page
ON el_from = page_id
WHERE ( el_to LIKE "%amazon.%tag=%"
OR el_to LIKE "%amzn.%tag=%" )
AND page_namespace = 0;
To find pages, and URLs. Once found, removes ?tag= from every amazon url as found above.
The function used to strip the tag= portion of the url is:
$p_url = parse_url( $url );
parse_str( $p_url['query'], $out );
unset( $out['tag'] );
$q = array();
foreach( $out as $o=>$t ) {
array_push( $q, "$o=$t");
}
$stripped = "https://" . $p_url['host'] . $p_url['path'] . "?" . implode( "&", $q );
Discussion
Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — xaosflux Talk 01:28, 11 December 2018 (UTC)
- I note that SQL query is not very efficient. There's no way to really avoid the table scan, but you should be able to greatly improve subsequent runs by remembering the maximum
el_id
value from just before the current run and only looking at rows with higher values next time. Anomie⚔ 18:28, 11 December 2018 (UTC)- Also you might consider batching the query: select
el_id BETWEEN X AND X+10000
or something like that, process whichever rows you got from that, and repeat with increasing X until it's greater than the currentMAX(el_id)
. Anomie⚔ 18:34, 11 December 2018 (UTC)
- Also you might consider batching the query: select