
Wikipedia:WikiProject Red Link Recovery/Unlikely links

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Welsh (talk | contribs) at 09:21, 24 October 2012 (Comment on new features). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This page is for the discussion of the Unlikely Links tool, hosted on the toolserver at http://toolserver.org/~tb/unlikely/.


Ideas for future unlikeliness checks

  • Characters in the UTF-16 range may indicate corruption or untranslated foreign-language links
  • Anything that would trigger a rule from MediaWiki:Titleblacklist - if a page cannot be created for the target of a link, that link is suspect.
  • Mixes of language-specific characters - for example Icelandic and Romanian specific characters in the same red link
  • Badly formed template links

- TB (talk) 22:31, 15 December 2010 (UTC)[reply]
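The "mixes of language-specific characters" idea above could be sketched roughly as follows. The two character sets are assumptions chosen for illustration (several of these letters also occur in other languages), and the function name is hypothetical; this is not the tool's actual rule.

```python
import unicodedata

# Assumed, illustrative character sets (not taken from the tool itself).
# Icelandic: eth and thorn, lower and upper case.
ICELANDIC_CHARS = set("\u00f0\u00d0\u00fe\u00de")
# Romanian: a-breve, s-comma-below, t-comma-below, lower and upper case.
ROMANIAN_CHARS = set("\u0103\u0102\u0219\u0218\u021b\u021a")


def mixes_icelandic_and_romanian(title: str) -> bool:
    """Return True if the title contains characters from both assumed sets."""
    # Normalise to composed form so that accented letters compare reliably.
    chars = set(unicodedata.normalize("NFC", title))
    return bool(chars & ICELANDIC_CHARS) and bool(chars & ROMANIAN_CHARS)
```

A real check would probably need one set per language and a rule for which combinations count as suspicious.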

Common Double Letters/Triple Letters

ii is a reasonably common double letter - skiing, Hawaii, various star names - perhaps it should be excluded from uncommon double letters. welsh (talk) 04:54, 24 December 2010 (UTC)[reply]

I've removed double i's for now. The letters in use were chosen by counting all instances of double letters in article titles and selecting the five least common. 'i' was indeed the most commonly present of the five selected. - TB (talk) 19:40, 24 December 2010 (UTC)[reply]
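The selection method described above (count double letters across article titles, keep the least common five) might look something like this minimal sketch. The function name and the tie-breaking by letter are assumptions, and letters that never appear doubled are simply not reported.

```python
import re
from collections import Counter


def least_common_doubles(titles, n=5):
    """Return the n letters that occur doubled least often in the given titles."""
    counts = Counter()
    for title in titles:
        # ([a-z])\1 matches any letter immediately followed by itself.
        # Matches do not overlap, so "iii" counts as one double i.
        for match in re.finditer(r"([a-z])\1", title.lower()):
            counts[match.group(1)] += 1
    # Ascending by count, then alphabetically to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [letter for letter, _ in ranked[:n]]
```

With a whitelist of known-good doubles (such as the double i discussed here), the excluded letters could be filtered out of the result afterwards.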

III is fairly common as well because it is 3 in Roman Numerals. III is used in the naming of films, kings, queens etc. John Cross (talk) 08:19, 9 September 2012 (UTC)[reply]

...and divisions in sports leagues ... John Cross (talk) 08:24, 9 September 2012 (UTC)[reply]

For sure. I've removed triple I's from the rule for now, but on the whole feel that this pattern has never been especially useful. - TB (talk) 15:43, 9 September 2012 (UTC)[reply]

Namespaces

The tool does not display whether a page is in the Portal: or Template: space, but rather leaves it unmarked, which then defaults to Main:. It's easy to see what's going on by doing a what links here? on the redlink, but flagging the namespace would be better. welsh (talk) 14:11, 24 December 2010 (UTC)[reply]

Fixed. - TB (talk) 19:33, 24 December 2010 (UTC)[reply]
Thanks welsh (talk) 12:40, 8 January 2011 (UTC)[reply]

Slow

The suggestions from the tool are taking a long time to display - several minutes in some cases. Is there anything like a tweak to indexes that could fix this? For example, triple letters towards the end of the alphabet. welsh (talk) 12:40, 8 January 2011 (UTC)[reply]

Alas, a simple index won't do the job in this case. Currently, a list of all red links in the English-language Wikipedia is maintained and searched on demand for any matching a particular pattern (the patterns can be seen here). The list is too long to brute-force search quickly, and the patterns too varied to index effectively. The real solution, I suppose, is to store pre-calculated lists, as the RLRL tool does. However, in the longer term I'm hoping to transform the tool into a more generalised 'red-link explorer', hence its simplistic design for now. I'll ponder the matter more - inspiration might strike yet ;) - TB (talk) 10:16, 13 January 2011 (UTC)[reply]
I've adjusted a few things to hopefully improve performance a bit. More to come. - TB (talk) 21:48, 27 May 2011 (UTC)[reply]
I noticed the refresh was faster even without knowing anything had been changed! Well done welsh (talk) 23:32, 27 May 2011 (UTC)[reply]
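For illustration, the "pre-calculated lists" approach mentioned in this thread could be sketched as a single pass over the red-link list that files each title under every pattern it matches, so that later lookups need no regex scan. The pattern names and regular expressions here are hypothetical stand-ins, not the tool's real patterns.

```python
import re
from collections import defaultdict

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "triple_letters": re.compile(r"([A-Za-z])\1\1"),
    "html_entity": re.compile(r"&[A-Za-z]+;"),
}


def precompute(red_links):
    """Scan the list once and bucket each title under every pattern it matches."""
    buckets = defaultdict(list)
    for title in red_links:
        for name, regex in PATTERNS.items():
            if regex.search(title):
                buckets[name].append(title)
    return dict(buckets)
```

The trade-off is that the buckets go stale between rebuilds of the red-link list, which is the same limitation any pre-calculated approach would have.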

List rebuilt

Redlink list rebuilt, and a few tweaks made to the tool to make it deal more sensibly with large numbers of whitelisted entries. - TB (talk) 07:54, 26 April 2011 (UTC)[reply]

New pattern added - 'All uppercase'

New pattern added - 'All uppercase'. This shows red links that are ALL IN UPPER CASE, of course ;) - TB (talk) 17:36, 27 April 2011 (UTC)[reply]
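A minimal sketch of what an 'All uppercase' check might look like, assuming that digits and punctuation are ignored and that at least two letters are required; both of those are assumptions, as is the function name.

```python
def is_all_uppercase(title: str) -> bool:
    """Return True if the title has at least two letters and all are upper case."""
    letters = [c for c in title if c.isalpha()]
    # Requiring two letters avoids flagging titles such as "1906" or "A".
    return len(letters) >= 2 and all(c.isupper() for c in letters)
```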

Cool new set! Lots of whitelist candidates (ships, satellites, asteroids, international standards, radio stations...) but many positives too. welsh (talk) 06:57, 28 April 2011 (UTC)[reply]

New pattern added - 'Offensive words'

New pattern added - 'Offensive words'. This shows red links matching a small selection of offensive English-language words. - TB (talk) 21:21, 21 May 2011 (UTC)[reply]

Sorting lists

Sometimes, maybe just for variety or efficiency of editing, it would be good to see the lists sorted by Containing Article rather than bad link name. This would be particularly useful in the very long ALL UPPERCASE class. welsh (talk) 09:18, 22 May 2011 (UTC)[reply]

I quite agree - in general the facilities for navigating lists of unlikely links are pretty crude. I'll see if I can't graft on a more flexible set of tools, hopefully including the ability to sort and further filter lists. - TB (talk) 11:22, 22 May 2011 (UTC)[reply]

Which way forwards?

Okay, I've tried quite a few approaches to improving this tool but can find nothing that satisfies me, so I'm soliciting input on what folks want. My original intention was that it develop into a 'red link explorer' tool, allowing users to flexibly generate lists of red links of interest, hopefully for the purpose of fixing them. It turns out that there are a few showstoppers making this infeasible:

  1. The way the MediaWiki database is structured makes it time consuming to generate a list of all red links (around 4 hours currently)
  2. Likewise, the database structure makes it very hard to maintain such a list - normally one could run through the hundreds of edits made each minute and add/remove red links to keep the list of all red links up to date. Not possible :(
  3. The list of red links is large enough that waving it past even a simple regular expression takes double-digit seconds. Running arbitrary user-generated queries is likely to be problem-prone.

So, a new vision is needed. Anyone ? - TB (talk) 20:33, 6 July 2011 (UTC)[reply]

New pattern added - 'Double disambiguation'

New pattern added - 'Double disambiguation'. This shows red links ending in two bracketed terms - for example 1906_Australasian_Championships_(tennis)_(tennis) - TB (talk) 14:46, 25 August 2011 (UTC)[reply]
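One way such a pattern could be expressed is a regular expression anchored at the end of the title. This is a sketch under the assumption that the two bracketed terms are separated by nothing more than spaces or underscores; the names are hypothetical.

```python
import re

# Two bracketed terms at the end of the title, optionally separated
# by spaces or underscores (assumed separator set).
DOUBLE_DISAMBIG = re.compile(r"\(([^()]+)\)[ _]*\(([^()]+)\)$")


def double_disambiguation(title: str):
    """Return the two trailing bracketed terms, or None if there are not two."""
    match = DOUBLE_DISAMBIG.search(title)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Returning both terms makes it easy to separate exact repeats, as in the example above, from titles that carry two different qualifiers.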

Target page

How about looking for links to "Target page name"? You get those when you click on the "redirect" icon in the edit box and don't change the text. I've fixed a few of those a few times. ospalh (talk) 19:04, 20 September 2011 (UTC)[reply]

Hi Ospalh. Nice idea - that's a new one by me, I tend to not use the javascripty goodies. The list you're after can be found using the normal "What Links Here" tool. Thinking this over, there are a few other similar "error indicator links" we should probably be checking periodically also:
Can you think of any more ? - TB (talk) 19:42, 20 September 2011 (UTC)[reply]

New set: Sabha constituencies

There are around 550 Lok Sabha constituencies, all of which AFAIK have pages. Spelling variations seem rife; I believe that most of the redlinks in this set should be fixable. - TB (talk) 12:04, 1 February 2012 (UTC)[reply]

New set: Co-ordinates

We have over 2800 red links containing geographic coordinates. The majority of these look to be poorly filtered automatically generated content. - TB (talk) 14:56, 23 April 2012 (UTC)[reply]

New set: Mostly non-English characters

These are links consisting mostly of multibyte Unicode characters: chiefly Cyrillic, Greek, Armenian, Hebrew, Arabic and Syriac lettering, and Korean, Chinese, and Japanese ideographs. Nearly 4500 red links match this at the time of writing; it looks like a mix of transwikied stuff and untranslated or only partly translated source. - TB (talk) 15:03, 23 April 2012 (UTC)[reply]
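A rough sketch of a "mostly multibyte" test, assuming UTF-8 encoding, that spaces and underscores are ignored, and that "mostly" means more than half; the threshold and function names are assumptions rather than the tool's actual settings.

```python
def multibyte_fraction(title: str) -> float:
    """Fraction of non-separator characters that need more than one byte in UTF-8."""
    chars = [c for c in title if not c.isspace() and c != "_"]
    if not chars:
        return 0.0
    multibyte = sum(1 for c in chars if len(c.encode("utf-8")) > 1)
    return multibyte / len(chars)


def mostly_multibyte(title: str, threshold: float = 0.5) -> bool:
    """Return True if more than the threshold share of characters is multibyte."""
    return multibyte_fraction(title) > threshold
```

Accented Latin letters are also multibyte in UTF-8, so a title needs a majority of such characters before it is flagged under this sketch.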

Updated

I've adjusted the way this tool works behind the scenes; it should now identify a larger number of red links in each category. As always, please shout if there are problems. - TB (talk) 10:28, 13 June 2012 (UTC)[reply]

New feature: Check redirect and article titles

It is now possible to use this tool to check for articles and redirects matching the various patterns. So for example, one can search for articles containing mismatched brackets in their titles, or redirects containing HTML entities. N.B.: not all the patterns are particularly relevant to article and redirect titles - as with red links, matching a given pattern does not necessarily make a page or redirect title incorrect. - TB (talk) 20:52, 17 October 2012 (UTC)[reply]
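As an illustration of the mismatched-brackets example, a stack-based check along these lines would flag titles whose brackets are unclosed, unopened, or closed by the wrong kind. The set of bracket pairs and the function name are assumptions.

```python
# Closing bracket -> matching opening bracket (assumed set of pairs).
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def has_mismatched_brackets(title: str) -> bool:
    """Return True if any bracket in the title is unbalanced or wrongly paired."""
    stack = []
    for char in title:
        if char in OPENERS:
            stack.append(char)
        elif char in PAIRS:
            # A closer with no opener, or with the wrong opener, is a mismatch.
            if not stack or stack.pop() != PAIRS[char]:
                return True
    # Anything left on the stack was opened but never closed.
    return bool(stack)
```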

New feature: Caching

The performance of this tool has never been great, and it has seemed particularly poor of late. To help mitigate this, I've added a layer of caching. You may find it slow (perhaps around 2 minutes) to bring up the first set of results for any given query, but it should be pretty quick on the same set for an hour or two after this. As is always the case with caching, oddities may occur - I'll be tidying things up over the next week or two. Cheers. - TB (talk) 18:27, 21 October 2012 (UTC)[reply]
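The behaviour described above (slow first query, fast repeats for an hour or two) is typical of a time-limited cache. The following is a generic sketch of that technique, not the tool's implementation; the class name, the one-hour default and the injectable clock are all assumptions.

```python
import time


class TTLCache:
    """Keep computed results for a fixed time, then recompute on next request."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}    # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]        # still fresh: skip the slow query
        value = compute()          # slow path: run the query
        self._store[key] = (now, value)
        return value
```

A call such as `cache.get_or_compute("all_uppercase", run_query)` (where `run_query` is a hypothetical function performing the slow search) would pay the cost once per expiry period. Stale results within that period are the "oddities" one would expect from this design.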

New feature: Sorting

It is now possible to sort the results of the various unlikely checks alphabetically or by title length. I've also tidied up the tools for scrolling through the lists a bit. - TB (talk) 20:38, 23 October 2012 (UTC)[reply]

New features

Some cool new features! Can I suggest that you add the Replag onto the top of the screen, as that helps understand what's going on? Missing spaces near brackets is a mine of broken links - there were a lot of chemicals and some placenames as false positives, but they have been whitelisted and there are c. 8000 links worth looking at. Just a small bug - when you've used a sort key and mark something as whitelisted, the display reverts to the don't-care sort rather than the one previously selected. Keep up the good work! welsh (talk) 09:21, 24 October 2012 (UTC)[reply]