User:SchlurcherBot

From Wikipedia, the free encyclopedia

SchlurcherBot

Function overview: Convert links from http:// to https://

Programming language: C#

Source code available: Main C# script: commons:User:SchlurcherBot/LinkChecker

Function details: The link checking algorithm is as follows:

  1. The bot extracts all http-links from the parsed HTML of a page
    • It searches for all href attributes and extracts the links
    • It does not search the wikitext, and thus does not rely on any regular expressions
    • This also avoids problems with templates that modify links (like archiving templates)
    • Links that are substrings of other links are filtered out to minimize search-and-replace errors
  2. The bot checks whether the identified http-links also occur in the wikitext; links that do not are skipped
  3. The bot checks whether both the http-link and the corresponding https-link are accessible
    • This step also uses a blacklist of domains that were previously identified as inaccessible
  4. If both links redirect to the same page, the http-link is replaced by the https-link (the link is not changed to the redirect target; the original link path is kept)
  5. If both links are accessible and return a success code (2xx), the bot checks whether the content is identical
    1. If the content is identical and the link points directly to the host, the http-link is replaced by the https-link
    2. If the content is identical but the link does not point directly to the host, the bot checks whether the content is also identical to that of the host link; the http-link is replaced by the https-link only if the content differs from the host's
      • This step is added because some hosts return the same content for all their pages (like most domain sellers, some news sites, or pages under ongoing maintenance)
    3. If the content is not identical, the bot checks whether it is at least 99.9% identical (calculated via the Levenshtein distance)
      • This step is added because most homepages use dynamic IDs for certain elements, for example for ad containers to circumvent ad blockers
    4. If the content is at least 99.9% identical, the same host check as before is performed
    5. If any of the checked links fails (for example with a 404 status code), the link is left unchanged
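
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the bot's C# code: the names `extract_http_links`, `similar`, `Fetched` and `should_replace` are hypothetical, the similarity is assumed to be 1 minus the Levenshtein distance divided by the length of the longer text (the page does not state the exact formula), and a host page that could not be fetched is assumed to block the replacement. Fetching the pages and the domain blacklist are left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects every http:// value found in an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value and value.startswith("http://"):
                self.links.append(value)


def extract_http_links(html, wikitext):
    """Steps 1-2: http-links in the rendered HTML that also occur in the wikitext."""
    collector = HrefCollector()
    collector.feed(html)
    links = sorted(set(collector.links))
    # Drop links that are substrings of another link, so a plain
    # search-and-replace cannot hit the inside of a longer link.
    links = [a for a in links if not any(a != b and a in b for b in links)]
    return [link for link in links if link in wikitext]


def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similar(a, b, threshold=0.999):
    """Assumed metric: 1 - distance / longer length, compared with 99.9%."""
    if a == b:
        return True
    return 1 - levenshtein(a, b) / max(len(a), len(b)) >= threshold


@dataclass
class Fetched:
    """Hypothetical summary of one HTTP request."""
    status: int        # final status code
    final_url: str     # URL after following redirects
    redirected: bool   # True if at least one redirect was followed
    body: str = ""


def should_replace(http, https, is_host_link, host_body=None):
    """Steps 3-5: decide whether the http-link may be switched to https."""
    if http is None or https is None:           # step 3: not accessible
        return False
    if http.redirected and https.redirected and http.final_url == https.final_url:
        return True                             # step 4: same redirect target
    if not (200 <= http.status < 300 and 200 <= https.status < 300):
        return False                            # step 5.5: e.g. a 404
    if not similar(http.body, https.body):      # steps 5.1 and 5.3
        return False
    if is_host_link:
        return True
    # Steps 5.2 and 5.4: a sub-page must differ from the host page, since
    # some hosts serve the same content for every path.
    return host_body is not None and https.body != host_body
```

A link is then rewritten only when `should_replace` returns `True`, and only the scheme changes. Note that the plain dynamic-programming distance is quadratic in the page length, so a production bot would likely need a faster implementation or a size limit.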

Source for pages: The bot works on the list of pages identified through the external links SQL dump. The list was shuffled so that consecutive edits are not clustered in a specific area.
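
As a rough illustration of the shuffling, the sketch below randomizes a list of page titles before processing. It is not taken from the bot's source; the function name `shuffled_pages` and the optional `seed` parameter are assumptions made for the example.

```python
import random


def shuffled_pages(titles, seed=None):
    """Return the page titles in random order, leaving the input untouched."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    pages = list(titles)
    rng.shuffle(pages)
    return pages
```

Processing the pages in this order spreads consecutive edits across unrelated pages instead of following the order of the dump.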

Further comments: The bot follows API:Etiquette: it sends a user-agent header and respects the maxlag parameter.
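
A minimal sketch of what this can look like on the client side, using only the Python standard library. It is not the bot's C# code: the user-agent string is a placeholder, `build_request` and `retry_delay` are hypothetical helper names, and `maxlag=5` is used as the example value. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"
# Placeholder user agent; a real bot should identify itself and give contact details.
USER_AGENT = "ExampleBot/0.1 (https://example.org/bot; bot@example.org)"


def build_request(params, maxlag=5):
    """Build a POST request that carries a User-Agent header and maxlag."""
    query = dict(params, format="json", maxlag=str(maxlag))
    data = urllib.parse.urlencode(query).encode("utf-8")
    return urllib.request.Request(API_URL, data=data,
                                  headers={"User-Agent": USER_AGENT})


def retry_delay(response_json, retry_after_header, default=5):
    """Seconds to wait if the API reported a maxlag error, else None."""
    error = response_json.get("error", {})
    if error.get("code") != "maxlag":
        return None
    try:
        return int(retry_after_header)
    except (TypeError, ValueError):
        return default  # assumed fallback when no usable Retry-After is present
```

When the servers are lagged, the API answers with a `maxlag` error instead of performing the action; the client then waits for the returned delay and retries.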

Edited page statistics:

Project  Edited pages
commons  6'014'443
dewiki   95'432
enwiki   93'122
eswiki   19
frwiki   20'453
itwiki   17'022
plwiki   25'323
ptwiki   30
Date: 2025-10-22, Source: query/98360.

Status: (CentralAuth)

Project  Request             Pages       Edit Description Used  Status
commons  Approved            31'145'089  Fix http to https      Running…
dewiki   Approved            1'888'381   Bot: http → https      Running…
enwiki   Approved            8'570'327   Bot: http → https      Running…
eswiki   Pending             2'191'542   Bot: http → https      On hold
frwiki   Approved            2'970'187   Bot: http → https      Running…
itwiki   Approved            2'359'233   Bot: http → https      Running…
jawiki   Allows global bots  994'375     Bot: http → https      Working Waiting
plwiki   Approved            1'527'763   Bot: http → https      Running…
ptwiki   Pending             1'214'889   Bot: http → https      On hold
ruwiki   Allows global bots  1'797'992   Bot: http → https      Working Waiting
zhwiki   Allows global bots  1'105'051   Bot: http → https      Working Waiting