User:Dispenser/Automatic linking problems

From Wikipedia, the free encyclopedia
Title URL limitations
  • Control characters (0x00-0x1F) are forbidden
  • Spaces ( , 0x20) are not allowed
  • Double quote (", 0x22) needs to be percent-encoded
  • Number sign (#, 0x23) is not allowed
  • Percent sign (%, 0x25) must be followed by two hexadecimal digits, so a title cannot end with it
  • Angle brackets (<>, 0x3C 0x3E) are not allowed
  • Question mark (?, 0x3F) is the query string separator, so it needs to be percent-encoded; such titles should have a corresponding redirect
  • Square brackets ([], 0x5B 0x5D) are not allowed
  • Trailing underscores (_, 0x5F) are stripped
  • Braces ({}, 0x7B 0x7D) are not allowed
  • Vertical bar/pipe (|, 0x7C) is not allowed
  • Non-ASCII characters (0x80 and higher) are normally percent-encoded. Some wikis (ruwiki) decode them for ease of use. We will focus on non-Unicode systems.
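The limitations above can be checked mechanically. The following is a minimal Python sketch, not MediaWiki's own validation: the function name `title_url_problems` and the wording of the messages are made up for illustration, and only the rules from the list are covered.

```python
import re

# Characters the list above marks as "not allowed" in a title URL.
FORBIDDEN = set(' #<>[]{}|')
# Characters the list says must be percent-encoded rather than used literally.
NEEDS_ENCODING = set('"?')


def title_url_problems(title: str) -> list[str]:
    """Return the problems a title runs into, per the list above (hypothetical helper)."""
    problems = []
    for ch in title:
        code = ord(ch)
        if code <= 0x1F:
            problems.append(f"control character 0x{code:02X} is forbidden")
        elif ch in FORBIDDEN:
            problems.append(f"{ch!r} is not allowed")
        elif ch in NEEDS_ENCODING:
            problems.append(f"{ch!r} needs to be percent-encoded")
        elif code >= 0x80:
            problems.append(f"{ch!r} is non-ASCII and is normally percent-encoded")
    # Every "%" must be followed by two hexadecimal digits.
    if re.search(r"%(?![0-9A-Fa-f]{2})", title):
        problems.append("'%' must be followed by two hexadecimal digits")
    if title.endswith("_"):
        problems.append("trailing underscore will be stripped")
    return problems
```

For example, `title_url_problems("100%")` reports the dangling percent sign, while `title_url_problems("Main_Page")` returns an empty list.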

Assuming letters, numbers, underscore, and forward slash (A-Z, a-z, 0-9, _, /) are safe, these are the characters we need to test:

! $ % & ' ( ) * + , - . : ; = ? @ \ ^ ` ~
  1. [0-9A-Za-z\-.:_] are not escaped
  2. [;:@$!*(),/] are converted back in GlobalFunctions.php (as of whenever I wrote this)
! $ ( ) * , - . : ; @ ~
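The two rules can be tried out with a small Python sketch. It is an approximation written from the two character classes above, not a port of GlobalFunctions.php; `encode_title`, `UNESCAPED` and `RESTORED` are made-up names. The tilde is in neither class as written, so this sketch escapes it; add it to `RESTORED` if the target behaviour keeps it literal.

```python
# Rule 1: characters that are never escaped.
UNESCAPED = set(
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-.:_"
)
# Rule 2: characters that are escaped first and then converted back.
RESTORED = set(";:@$!*(),/")


def encode_title(title: str) -> str:
    """Percent-encode a title, then convert the rule 2 characters back."""
    # Step 1: escape every UTF-8 byte that is not in the rule 1 class.
    escaped = "".join(
        chr(b) if chr(b) in UNESCAPED else f"%{b:02X}"
        for b in title.encode("utf-8")
    )
    # Step 2: turn the escapes for the rule 2 characters back into literals.
    for ch in RESTORED:
        escaped = escaped.replace(f"%{ord(ch):02X}", ch)
    return escaped


# Which of the test characters come through unchanged?
TEST_CHARS = "!$%&'()*+,-.:;=?@\\^`~"
survivors = "".join(c for c in TEST_CHARS if encode_title(c) == c)
```

Under these assumptions `survivors` holds the characters that a URL built this way can end in literally; everything else shows up as a `%XX` escape, e.g. `encode_title("Who?")` gives `Who%3F`.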


-- 2> /dev/null; date; echo '
/* Trailing characters survey
 *
 * License: Public Domain
 * Run time: 20 minutes
 */
SELECT
  RIGHT(page_title, 1) AS "Trailing",
  SUM(page_is_redirect  = 0) AS "Articles",
  SUM(EXISTS (SELECT 1
    FROM page AS rd
    JOIN redirect ON rd.page_id = rd_from
    WHERE rd.page_namespace = 0 AND rd_namespace = 0
    AND rd_title = page.page_title
    AND rd.page_title = LEFT(rd_title, CHAR_LENGTH(rd_title) - 1)
  )) AS "Fixed",
  SUM(page_is_redirect != 0) AS "Redirects",
  CONCAT("[[",REPLACE(page_title,"_"," "),"]]") AS "Example"
FROM page
WHERE page.page_namespace = 0
  AND page.page_title REGEXP ".[!$-.:;=?@\\\\^\\`~]$"
GROUP BY 1
;-- ' | sql enwiki_p > ~/public_html/trails.txt; date
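For readers who want to follow the grouping logic without a database replica, here is a rough in-memory Python equivalent of the survey. It is only a sketch: `trailing_survey` is a made-up name, the inputs are plain dicts rather than the `page` and `redirect` tables, and the titles in the usage lines are invented sample data.

```python
from collections import defaultdict

# Same trailing characters as the REGEXP class in the query ("$-." is a range).
PUNCTUATION = set("!$%&'()*+,-.:;=?@\\^`~")


def trailing_survey(pages, redirects):
    """Tally main-namespace titles by trailing punctuation character.

    pages:     dict mapping title -> True if the page is a redirect
    redirects: dict mapping redirect title -> target title
    """
    stats = defaultdict(lambda: {"Articles": 0, "Fixed": 0, "Redirects": 0})
    for title, is_redirect in pages.items():
        # The REGEXP needs at least one character before the trailing one.
        if len(title) < 2 or title[-1] not in PUNCTUATION:
            continue
        row = stats[title[-1]]
        row["Redirects" if is_redirect else "Articles"] += 1
        # "Fixed": the title minus its last character redirects to this page.
        if redirects.get(title[:-1]) == title:
            row["Fixed"] += 1
    return dict(stats)


# Invented sample data, for illustration only.
sample_pages = {"Who?": False, "Who": True, "Yahoo!": False, "Wham!": True}
sample_redirects = {"Who": "Who?", "Wham!": "Wham"}
sample_stats = trailing_survey(sample_pages, sample_redirects)
```

With the sample data, `Who?` counts as an article that is "Fixed" because `Who` redirects to it, while `Yahoo!` is an article with no such redirect.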
Article + Redirect suffixes usage
Trailing  Articles  Fixed  Redirects  Example
!         5186      1389   7200       !!
$         19        0      148        $$
%         24        5      152        %$CLS%
&         6         1      10         &&
'         2329      177    3255       "Flipnote Studio 3D''
(         0         0      42         )'(
)         595329    643    897217     !!! (Chk Chk Chk)
*         41        6      209        (Z/nZ)*
+         222       8      759        (NH4)+
,         5         4      354        "This Is Our Punk-Rock", Thee Rusted Satellites Gather + Sing,
-         93        16     762        ' or 1==1--
.         17475     1661   57474      "Buster" Collier, Jr.
:         7         1      179        (-:
;         1         0      30         (;
=         4         0      28         !=
?         2549      800    3590       !?
@         16        1      34         $@
\         2         0      41         +/'\
^         1         0      17         A^
`         26        1      63         A`
d         151541    91     218950     !Xam Khomani Heartland
s         577484    10866  841081     !Alarma! Records
~         20        0      452        "Yume" ~Mugen no Kanata~
  • Highlight: by default, an automatically linked URL won't end in these characters.
Recommendations
  • Extend the append implementation, which currently tries ) and s, to also try the ! . ? characters
  • Add a new test which removes the last character picked up by greedy linking
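Both recommendations can be sketched together in Python. This is an illustration under assumptions, not the tool's actual code: `repair_candidates` and `first_existing` are made-up names, `exists` stands in for whatever page-existence lookup the tool uses, and the strip set is taken from the REGEXP class in the query under "Previous work".

```python
# Suffixes to try appending: the existing ")" and "s", plus the proposed "!", ".", "?".
APPEND_SUFFIXES = (")", "s", "!", ".", "?")
# Trailing characters a greedy link may have swallowed by mistake.
STRIPPABLE = set("!$()*,-.:;?@~")


def repair_candidates(title):
    """List alternative titles to try when `title` itself does not exist."""
    candidates = [title + suffix for suffix in APPEND_SUFFIXES]
    # New test: drop the last character if it looks like stray punctuation.
    if len(title) > 1 and title[-1] in STRIPPABLE:
        candidates.append(title[:-1])
    return candidates


def first_existing(title, exists):
    """Return `title` or the first repair candidate for which exists() is true."""
    if exists(title):
        return title
    for candidate in repair_candidates(title):
        if exists(candidate):
            return candidate
    return None


# Invented titles, for illustration only.
known_titles = {"Who?", "Yahoo!", "Wham"}
```

With the sample set, `first_existing("Who", known_titles.__contains__)` resolves to `Who?` through the append step, and `first_existing("Wham.", known_titles.__contains__)` resolves to `Wham` through the strip step.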
Previous work
/* Stripped punctuation leading to different page
 *
 * License: Public domain
 * Run time:
 */
SELECT page.page_title, rd.page_title
FROM page
JOIN page AS rd ON rd.page_namespace=0 AND rd.page_is_redirect=1
               AND rd.page_title = LEFT(page.page_title, CHAR_LENGTH(page.page_title)-1)
LEFT JOIN redirect ON rd_from=rd.page_id AND rd_namespace=0 AND rd_title=page.page_title
WHERE page.page_namespace=0
AND page.page_is_redirect=0
AND page.page_title REGEXP ".[!$()*,-.:;?@~]$"
AND  rd_from IS NULL
LIMIT 100;
Resources