Module:Wikitext Parsing
This Lua module is used on approximately 17,800,000 pages, or roughly 28% of all pages. To avoid major disruption and server load, any changes should be tested in the module's /sandbox or /testcases subpages, or in your own module sandbox. The tested changes can be added to this page in a single edit. Consider discussing changes on the talk page before implementing them.
This module can only be edited by administrators because it is transcluded onto one or more cascade-protected pages.
This module provides functions to help with the potentially complex situations faced by modules like Module:Template parameter value, which process the raw wikitext of a page and need to respect nowiki tags and similar constructs reliably. This module is designed only to be called by other modules.
PrepareText
This module is rated as ready for general use. It has reached a mature form and is thought to be relatively bug-free and ready for use wherever appropriate. It is ready to mention on help pages and other Wikipedia resources as an option for new users to learn. To reduce server load and bad output, it should be improved by sandbox testing rather than repeated trial-and-error editing.
This module is subject to page protection. It is a highly visible module in use by a very large number of pages, or is substituted very frequently. Because vandalism or mistakes would affect many pages, and even trivial editing might cause substantial load on the servers, it is protected from editing.
PrepareText(text, keepComments) will run any content within certain tags that disable processing (<nowiki>, <pre>, <syntaxhighlight>, <source>, <math>) through mw.text.nowiki and remove HTML comments to avoid irrelevant text being processed by modules, allowing tricky syntax to be parsed through more basic means such as %b{}.
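The following minimal sketch illustrates why escaping matters for a balanced match such as %b{}. It runs in plain Lua, so escapeForDemo is a hypothetical stand-in for mw.text.nowiki (which is only available inside Scribunto and escapes more characters than the three handled here), and the simple gsub over <nowiki> is only an assumption-laden substitute for what PrepareText does.

```lua
-- Hypothetical stand-in for mw.text.nowiki: escapes only {, | and }
-- as decimal HTML entities. The real function covers more characters.
local function escapeForDemo(s)
    return (s:gsub("[{|}]", function(c)
        return "&#" .. string.byte(c) .. ";"
    end))
end

local raw = "{{Text|<nowiki>}}</nowiki>|done}}"

-- Naive balanced match: %b{} counts the }} inside <nowiki> as the
-- closing braces, so the match ends too early.
local naive = raw:match("%b{}")

-- Escape the nowiki content first (a crude imitation of PrepareText),
-- then run the same balanced match again.
local prepared = raw:gsub("(<nowiki>)(.-)(</nowiki>)", function(open, content, close)
    return open .. escapeForDemo(content) .. close
end)
local careful = prepared:match("%b{}")

print(naive)   --> {{Text|<nowiki>}}
print(careful) --> {{Text|<nowiki>&#125;&#125;</nowiki>|done}}
```

With the content escaped, the balanced match covers the whole template instead of stopping at the braces hidden inside the no-processing tag.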
If the second parameter, keepComments, is set to true, the content of HTML comments will be passed through mw.text.nowiki instead of being removed entirely.
Code that uses this function directly and returns part of the processed text should consider running mw.text.decode on the output at the end to restore the original characters. Note that this will also decode any input that was already encoded but not inside a no-processing tag. This is unlikely to be a significant issue, but is still worth considering.
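The decoding caveat can be seen in a small plain-Lua sketch. decodeForDemo is a hypothetical stand-in for mw.text.decode that only handles decimal entities, and the prepared string is written out by hand as the result PrepareText is expected to give for this input, rather than produced by calling the module.

```lua
-- Hypothetical stand-in for mw.text.decode (Scribunto-only);
-- it decodes decimal entities such as &#124; and nothing else.
local function decodeForDemo(s)
    return (s:gsub("&#(%d+);", function(code)
        return string.char(tonumber(code))
    end))
end

-- Original input: one pipe was already encoded by the editor,
-- the other sits inside a <nowiki> tag.
local original = "A &#124; B <nowiki>|</nowiki>"

-- Hand-written expected result of PrepareText for that input:
-- only the nowiki content has been escaped.
local prepared = "A &#124; B <nowiki>&#124;</nowiki>"

local decoded = decodeForDemo(prepared)

print(decoded)             --> A | B <nowiki>|</nowiki>
print(decoded == original) --> false
```

The pipe that the editor had deliberately encoded comes back as a literal pipe, which is the side effect described above.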
ParseTemplates
This module is rated as beta, and is ready for widespread use. It is still new and should be used with some caution to ensure the results are as expected.
ParseTemplates(InputText, dontEscape) will attempt to parse all {{Templates}} on a page, handling multiple factors such as [[Wikilinks]] and {{{Variables}}} among other complex syntax. Because of this complexity, the function is considerably slow and should be used sparingly. It returns a list of template objects in order of appearance, each of which has the following properties:
- Args: A key-value set of arguments, not in order
- ArgOrder: A list of keys in the order they appear in the template
- Children: A list of template objects that are contained within the existing template, in order of appearance. Only immediate children are listed
- Name: The name of the template
- Text: The raw text of the template
If the second parameter, dontEscape, is set to true, the input text won't be run through the PrepareText function.
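As a rough illustration of how the documented properties might be consumed, the sketch below walks a list of template objects. The sample table is hand-written to match the property list above and is not real output of ParseTemplates; the string key "1" for the positional argument and the helper names collectNames and orderedArgs are assumptions made for this example.

```lua
-- Hand-written sample shaped like the documented template objects.
-- This is illustrative data, not actual ParseTemplates output.
local sample = {
    {
        Name = "Infobox",
        Text = "{{Infobox|name={{Text|A}}}}",
        Args = { name = "{{Text|A}}" },
        ArgOrder = { "name" },
        Children = {
            {
                Name = "Text",
                Text = "{{Text|A}}",
                Args = { ["1"] = "A" }, -- assumed: positional keys are strings
                ArgOrder = { "1" },
                Children = {},
            },
        },
    },
}

-- Collect template names depth-first. Children only lists immediate
-- children, so nested templates are reached by recursing.
local function collectNames(templates, out)
    out = out or {}
    for _, template in ipairs(templates) do
        out[#out + 1] = template.Name
        collectNames(template.Children, out)
    end
    return out
end

-- Args is unordered, so read it through ArgOrder to keep source order.
local function orderedArgs(template)
    local parts = {}
    for _, key in ipairs(template.ArgOrder) do
        parts[#parts + 1] = key .. "=" .. template.Args[key]
    end
    return table.concat(parts, "|")
end

print(table.concat(collectNames(sample), ","))
print(orderedArgs(sample[1]))
```

Iterating ArgOrder rather than calling pairs on Args is what preserves the order in which arguments appear in the template.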
require("strict")
--Helper functions
local function startswith(text, subtext)
return string.sub(text, 1, #subtext) == subtext
end
local function endswith(text, subtext)
return string.sub(text, -#subtext, -1) == subtext
end
local function allcases(s)
return s:gsub("%a", function(c)
return "["..c:upper()..c:lower().."]"
end)
end
--[[ Implementation notes
---- NORMAL HTML TAGS ----
Tags are very strict on how they want to start, but loose on how they end.
The start must strictly follow <[tAgNaMe](%s|>) with no room for whitespace in
the tag's name, but may then flow as they want afterwards, making
<div\nclass\n=\n"\nerror\n"\n> valid
There's no sense of escaping < or >
E.g.
<div class="error\>"> will end at \> despite it being inside a quote
<div class="<span class="error">error</span>"> will not process the larger div
If a tag has no end, it will consume all text instead of not processing
---- NOPROCESSING TAGS (nowiki, pre, syntaxhighlight, source) ----
(In most comments, <source> will not be mentioned. This is because it is the
deprecated version of <syntaxhighlight>)
No-Processing tags have some interesting differences to the above rules.
For example, their syntax is a lot stricter. While an opening tag appears to
follow the same set of rules, a closing tag can't have any sort of extra
formatting at all. While </div a/a> is valid, </nowiki a/a> isn't - only
newlines and spaces are allowed in closing tags (except in <pre> tags, which
follow the rules of a regular html tag for formatting).
Both the content inside the tag pair and the content inside each side of the
pair is not processed. E.g. <nowiki |}}>|}}</nowiki> would have both of the |}}
escaped in practice.
When something in the code is referred to as a "Nowiki Tag", it means a tag
which causes wiki text to not be processed, which includes <nowiki>, <pre>,
and <syntaxhighlight>
Since we only care about these tags, we can ignore the idea of an intercepting
tag preventing processing, and just go straight for the first ending we can find
If there is no ending to find, the tag will NOT consume the rest of the text in
terms of processing behaviour (though <pre> will appear to have an effect).
Even if there is no end of the tag, the content inside the opening half will
still be unprocessed, meaning {{X20|<nowiki }}>}} wouldn't end at the first }}
despite there being no ending to the tag.
Note that there are some tags, like <math>, which also function like <nowiki>
and are included in this as well. Some other tags, like <ref>, have far too
unpredictable behaviour to be handled currently (they'd have to be split and
processed as something separate - it's complicated, but maybe not impossible.)
---- HTML COMMENTS AND INCLUDEONLY ----
HTML Comments are about as basic as it could get for this
Start at <!--, end at -->, no extra conditions. Simple enough
If a comment has no end, it will eat all text instead of not being processed
includeonly tags function mostly like a regular nowiki tag, with the exception
that the tag will actually consume all future text if not given an ending as
opposed to simply giving up and not changing anything. Due to complications and
the fact that this is far less likely to be present on a page, as well as being
something that may not want to be escaped, includeonly tags are ignored during
processing
--]]
local validtags = {nowiki=1, pre=1, syntaxhighlight=1, source=1, math=1}
--This function expects the string to start with the tag
local function TestForNowikiTag(text)
local tagName = (string.match(text, "^<([^\n />]+)") or ""):lower()
if not validtags[tagName] then
return nil
end
local nextOpener = string.find(text, "<", 2) or -1
local nextCloser = string.find(text, ">", 2) or -1
if nextCloser > -1 and (nextOpener == -1 or nextCloser < nextOpener) then
local startingTag = string.sub(text, 1, nextCloser)
--We have our starting tag (E.g. '<pre style="color:red">')
--Now find our ending...
if endswith(startingTag, "/>") then --self-closing tag (we are our own ending)
return {
Tag = tagName,
Start = startingTag,
Content = "", End = "",
Length = #startingTag
}
else
local endingTag
if tagName == "pre" then --Looser restrictions for <pre>
endingTag = --no | so we just use 2 matches
string.match(text, "</[Pp][Rr][Ee]>") or
string.match(text, "</[Pp][Rr][Ee][ \t\n/][^<]*>")
else
endingTag = string.match(text, "</"..allcases(tagName).."[ \t\n]*>")
end
if endingTag then --Regular tag formation
local endingTagPosition = string.find(text, endingTag, nextCloser, true)
local tagContent = string.sub(text, nextCloser+1, endingTagPosition-1)
return {
Tag = tagName,
Start = startingTag,
Content = tagContent,
End = endingTag,
Length = #startingTag + #tagContent + #endingTag
}
else --Content inside still needs escaping (also linter error!)
return {
Tag = tagName,
Start = startingTag,
Content = "", End = "",
Length = #startingTag
}
end
end
end
return nil
end
local function TestForComment(text) --Like TestForNowikiTag but for <!-- -->
if startswith(text, "<!--") then
local commentEnd = string.find(text, "-->", 5, true)
if commentEnd then
return {
Start = "<!--", End = "-->",
Content = string.sub(text, 5, commentEnd-1),
Length = commentEnd+2
}
else --Consumes all text if not given an ending
return {
Start = "<!--", End = "",
Content = string.sub(text, 5),
Length = #text
}
end
end
return nil
end
--[[ Implementation notes
The goal of this function is to escape all text that wouldn't be parsed if it
was preprocessed (see above implementation notes).
Using keepComments will keep all HTML comments instead of removing them. They
will still be escaped regardless to avoid processing errors
--]]
local function PrepareText(text, keepComments)
local newtext = ""
while text ~= "" do
local NextCheck = string.find(text,"<[NnSsPpMm!]") --Advance to the next potential tag we care about
if not NextCheck then --Done
newtext = newtext .. text
break
end
newtext = newtext .. string.sub(text,1,NextCheck-1)
text = string.sub(text, NextCheck)
local Comment = TestForComment(text)
if Comment then
if keepComments then
newtext = newtext .. Comment.Start .. mw.text.nowiki(Comment.Content) .. Comment.End
end
text = string.sub(text, Comment.Length+1)
else
local Tag = TestForNowikiTag(text)
if Tag then
local newTagStart = "<" .. mw.text.nowiki(string.sub(Tag.Start,2,-2)) .. ">"
local newTagEnd =
Tag.End == "" and "" or --Respect no tag ending
"</" .. mw.text.nowiki(string.sub(Tag.End,3,-2)) .. ">"
local newContent = mw.text.nowiki(Tag.Content)
newtext = newtext .. newTagStart .. newContent .. newTagEnd
text = string.sub(text, Tag.Length+1)
else --Nothing special, move on...
newtext = newtext .. string.sub(text, 1, 1)
text = string.sub(text, 2)
end
end
end
return newtext
end
--[=[ Implementation notes
This function is an alternative to Transcluder's getParameters which considers
the potential for a singular { or } or other odd syntax that %b doesn't like to
be in a parameter's value. Also theoretically faster as it does a single pass
through the text instead of multiple gsub runs (though we shall see, as this
slowly grows more complex as I theorise it).
When handling the difference between {{ and {{{, mediawiki will attempt to match
as many sequences of {{{ as possible before matching a {{
E.g.
{{{{A}}}} -> { {{{A}}} }
{{{{{{{{Text|A}}}}}}}} -> {{ {{{ {{{Text|A}}} }}} }}
If there aren't enough triple braces on both sides, the parser will compromise
for a template interpretation.
E.g.
{{{{A}} }} -> {{ {{ A }} }}
Setting dontEscape will prevent running the input text through PrepareText. Avoid
setting this to true if you don't have to set it.
TODO: This entire "bounds" method of exclusion is seeming to be significantly expensive. This needs proper thought to fix
Returned values:
A table of all templates found in string form, listed in chronological order
A table with a key-value link between a template's string form and their data
--]=]
--Helper functions
local function boundlen(pair)
return pair.End-pair.Start+1
end
local function CreateBoundsBlacklist(boundlist)
local blacklist = {}
for _,bounds in next,boundlist do
for _,bound in next,bounds do
for i = bound.Start,bound.End do
blacklist[i] = true
end
end
end
return blacklist
end
local function CreateBoundsBlacklistWithinBounds(container, boundlist) --These names are getting stupid
local blacklist = {}
for _,bounds in next,boundlist do
for _,bound in next,bounds do
if bound.Start > container.Start and bound.End < container.End then
for i = bound.Start,bound.End do
blacklist[i] = true
end
end
end
end
return blacklist
end
local function FindWithBlacklist(text, pattern, init, blacklist, ...)
while true do
local s, e = string.find(text, pattern, init, ...)
if s then
if blacklist[s] then --illegal match
init = e+1
else --legal match
return s, e
end
else --no match
return
end
end
end
local function ClearTableWhitespace(t)
local maxindex = 0
for i,_ in next,t do
maxindex = math.max(i, maxindex)
end
local nexti = 1
for i = 1,maxindex do
local d = t[i]
if d then
t[i] = nil
t[nexti] = d
nexti = nexti + 1
end
end
return t
end
--Main function
local function ParseTemplates(_text, dontEscape)
--Setup
if not dontEscape then
_text = PrepareText(_text)
end
local function finalise(text)
if not dontEscape then
return mw.text.decode(text)
else
return text
end
end
--Step 1: Find and escape the content of all wikilinks on the page, which are stronger than templates (see implementation notes)
local scannerPosition = 1
local wikilinks = {}
local openWikilinks = {}
while true do
local NextOpen = string.find(_text, "%[%[", scannerPosition) or 9e9
local NextClose = string.find(_text, "%]%]", scannerPosition) or 9e9
if NextOpen == NextClose then --Done (both 9e9)
break
end
scannerPosition = math.min(NextOpen, NextClose)+2 --+2 to pass the [[ / ]]
if NextOpen < NextClose then --Add a [[ to the pending wikilink queue
table.insert(openWikilinks, NextOpen)
else --Pair up the ]] to any available [[
if #openWikilinks >= 1 then
local start = table.remove(openWikilinks) --Pop the latest [[
table.insert(wikilinks, {Start=start, End=NextClose+1}) --Note the pair
end
end
end
local WikilinkBlacklist = CreateBoundsBlacklist({wikilinks})
--Step 2: Find the bounds of every valid template, figuring out if a set should be treated as {{ or {{{ as needed
local scannerPosition = 1
local templates = {}
local openBrackets = {}
while true do
local NextOpen, OEnd = FindWithBlacklist(_text, "{{+", scannerPosition, WikilinkBlacklist)
local NextClose, CEnd = FindWithBlacklist(_text, "}}+", scannerPosition, WikilinkBlacklist)
NextOpen = NextOpen or 9e9
NextClose = NextClose or 9e9
if NextOpen == NextClose then --Done (both 9e9)
break
end
local BoundStart = math.min(NextOpen, NextClose) --Skip to next notable block
local BoundEnd = math.min(OEnd or 9e9, CEnd or 9e9)
scannerPosition = BoundStart --Get to the {{ / }} set
if NextOpen < NextClose then --Add the {{+ set to the queue
local BracketCount = #string.match(_text, "^{+", scannerPosition)
table.insert(openBrackets, {Start=BoundStart, End=BoundEnd})
else --Pair up the }} to any available {{, accounting for {{{ / }}}
local BracketCount = #string.match(_text, "^}+", scannerPosition)
while BracketCount >= 2 and #openBrackets >= 1 do
local OpenSet = table.remove(openBrackets)
if boundlen(OpenSet) >= 3 and BracketCount >= 3 then --Dump the {{{Var}}} from the list
BracketCount = BracketCount - 3
OpenSet.End = OpenSet.End - 3
scannerPosition = scannerPosition + 3
else --We have a template (both sides have 2 spare, but at least one side doesn't have 3 spare)
templates[OpenSet.End-1] = {Start=OpenSet.End-1, End=scannerPosition+1} --Done like this to ensure chronological order
BracketCount = BracketCount - 2
OpenSet.End = OpenSet.End - 2
scannerPosition = scannerPosition + 2
end
if boundlen(OpenSet) >= 2 then --Still has enough data left, leave it in
table.insert(openBrackets, OpenSet)
end
end
end
scannerPosition = BoundEnd --Now move past the bracket set
end
ClearTableWhitespace(templates) --Fix into chronological ordering
--Step 3: Re-trace the templates using their known bounds, collecting our parameters with (slight) ease
local function HandleArgLogic(data, text, blacklist)
if data.Name then
local equals = FindWithBlacklist(text, "=", 1, blacklist)
if equals then
data.Args[finalise(mw.text.trim(string.sub(text, 1, equals-1)))] = finalise(mw.text.trim(string.sub(text, equals+1)))
else
data.Args[tostring(data._NextIndex)] = finalise(text) --not trimmed
data._NextIndex = data._NextIndex + 1
end
else
data.Name = mw.text.trim(text)
end
end
local AllTemplates = {}
local templateData = {}
for _,template in ipairs(templates) do
local InnerBlacklist = CreateBoundsBlacklistWithinBounds(template, {wikilinks, templates})
local innerText = string.sub(_text, template.Start, template.End)
local TemplateString = finalise(innerText)
table.insert(AllTemplates, TemplateString)
if not templateData[TemplateString] then
local data = {Args = {}, Text = TemplateString, _NextIndex = 1}
local scannerPosition = 3
while true do
local NextPipe = FindWithBlacklist(innerText, "|", scannerPosition, InnerBlacklist)
if not NextPipe then
HandleArgLogic(data, string.sub(innerText, scannerPosition, -3), InnerBlacklist)
break
end
HandleArgLogic(data, string.sub(innerText, scannerPosition, NextPipe-1), InnerBlacklist)
scannerPosition = NextPipe + 1
end
data._NextIndex = nil
templateData[TemplateString] = data
end
end
--Finished, return
return AllTemplates, templateData
end
local p = {}
--Main entry points
p.PrepareText = PrepareText
p.ParseTemplates = ParseTemplates
--Extra entry points, not really required
p.TestForNowikiTag = TestForNowikiTag
p.TestForComment = TestForComment
return p
--[==[ console tests
local s = [=[Hey!{{Text|<nowiki | ||>
Hey! }}
A</nowiki>|<!--AAAAA|AAA-->Should see|Shouldn't see}}]=]
local out = p.PrepareText(s)
mw.logObject(out)
local s = [=[<!--
Hey!
-->A]=]
local out = p.TestForComment(s)
mw.logObject(out); mw.log(string.sub(s, 1, out.Length))
local a,b = p.ParseTemplates([=[
{{User:Aidan9382/templates/dummy
|A|B|C
|<nowiki>D</nowiki>
|<pre>E
|F</pre>
|G|=|a=|A = [[}}]]{{Text|1==<nowiki>}}</nowiki>}}|A B=Success}}
]=])
mw.logObject(a); mw.logObject(b)
]==]