Module:Text

Module documentation

This Lua module is used on approximately 1,780,000 pages, or roughly 3% of all pages.
To avoid major disruption and server load, any changes should be tested in the module's /sandbox or /testcases subpages, or in your own module sandbox. The tested changes can be added to this page in a single edit. Consider discussing changes on the talk page before implementing them.

This module depends on the following other modules:

Module:Yesno (sandbox)

Text – Module containing methods for the manipulation of text, wikimarkup and some HTML.

Functions for templates

All methods have an unnamed parameter containing the text.

The return value is an empty string if the parameter does not meet the conditions. When the condition is matched or some result is successfully found, strings of at least one character are returned.

char

Creates a string from a list of character codes.

1: Space-separated list of character codes
*: Number of repetitions of the list in parameter 1 (default: 1).
errors: 0 – Silence errors
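
As a rough illustration of the parameters above, the following stand-alone Lua sketch builds a string from a space-separated list of codes and repeats it. The name charFromCodes is hypothetical, and because it relies on string.char it only accepts byte values 0–255; it does not reproduce the module's Unicode support or its error handling.

```lua
-- Hypothetical sketch, not the module's implementation.
-- codes: space-separated list of decimal byte values (0-255 only)
-- times: number of repetitions (defaults to 1)
local function charFromCodes(codes, times)
    local chars = {}
    for code in codes:gmatch("%S+") do
        chars[#chars + 1] = string.char(tonumber(code))
    end
    return string.rep(table.concat(chars), times or 1)
end

print(charFromCodes("72 105", 2))  --> HiHi
```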

concatParams

Combines any number of elements into a single string, like table.concat() in Lua.

From a template:

1: First element; missing and empty elements are ignored.
2 3 4 5 6 …: Further list elements

From Lua:

args: table (sequence) of the elements
apply: Separator between elements; defaults to |
adapt: optional formatting, which will be applied to each element; must contain %s.
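
The interplay of args, apply and adapt can be sketched in plain Lua as follows. This is a simplified stand-in rather than the module's code: the helper trim is hypothetical, and the sketch assumes a gap-free sequence (it iterates with ipairs) and uses string.format in place of the MediaWiki string library.

```lua
-- Hypothetical helper: strip leading and trailing whitespace.
local function trim(s)
    return s:match("^%s*(.-)%s*$")
end

-- Simplified sketch of the joining logic described above.
local function concatParams(args, apply, adapt)
    local collect = {}
    for _, v in ipairs(args) do
        v = trim(v)
        if v ~= "" then                 -- empty elements are ignored
            if adapt then
                v = string.format(adapt, v)
            end
            collect[#collect + 1] = v
        end
    end
    return table.concat(collect, apply or "|")
end

print(concatParams({ "a", "  ", "b" }, ", ", "[%s]"))  --> [a], [b]
print(concatParams({ "x", "y" }))                      --> x|y
```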

containsCJK

Returns whether the input string contains any CJK characters.

Returns nothing if there are no CJK characters.
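
The module source builds its CJK pattern from two code point ranges (13312–40959 and 131072–178207). The same range test applied to a single code point might look like the sketch below; the function name isCJKCodepoint is hypothetical, and decoding a string into code points is left out.

```lua
-- Hypothetical sketch: range test on one Unicode code point,
-- using the two ranges that appear in the module's pattern.
local function isCJKCodepoint(cp)
    return (cp >= 13312 and cp <= 40959)       -- U+3400 to U+9FFF
        or (cp >= 131072 and cp <= 178207)     -- U+20000 to U+2B81F
end

print(isCJKCodepoint(0x4E2D))  --> true  (a CJK unified ideograph)
print(isCJKCodepoint(0x41))    --> false (Latin capital A)
```

Note that, with these ranges, kana such as U+3042 fall outside the test.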

removeDelimited

Remove all text between delimiters, including the delimiters themselves.

getPlain

Remove wikimarkup (except templates): comments, tags, bold, italic, nbsp

isLatinRange

Returns some content, unless the string contains a character that would not normally be found in Latin text.

Returns nothing if the string contains any non-Latin character.

isQuote

Returns some content if the parameter passed is a single character, and that character is a quote, such as '.

Returns nothing for multiple characters, or if the character passed is not a quote.

listToText

Formats list elements analogously to mw.text.listToText().

The elements are separated by a comma and a space; the word "and" appears between the last two elements.

Unnamed parameters become the list items.

Optional parameters for #invoke:

format – Every list element is first formatted with this format string, which follows the conventions of Lua's string.format. The string must contain at least one %s sequence.
template=1 – List elements should be taken from the calling template.

Returns the resulting string.
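
The joining rule can be sketched in plain Lua as below. The separators ", " and " and " are hard-coded English assumptions here; on a wiki, mw.text.listToText() takes them from the content language. The function name joinList is hypothetical.

```lua
-- Hypothetical sketch of the joining rule, with English defaults.
local function joinList(list, separator, conjunction)
    separator = separator or ", "
    conjunction = conjunction or " and "
    local n = #list
    if n == 0 then
        return ""
    elseif n == 1 then
        return list[1]
    end
    -- All but the last element are comma-separated; the last is
    -- attached with the conjunction.
    return table.concat(list, separator, 1, n - 1) .. conjunction .. list[n]
end

print(joinList({ "a", "b", "c" }))  --> a, b and c
print(joinList({ "a", "b" }))       --> a and b
```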

quote

Wrap the string in quotes; quotes can be chosen for a specific language.

1: Input text (will be automatically trimmed); may be empty.
2: (optional) the ISO 639 language code for the quote marks; should be one of the supported languages (the list is documented in German)
3: (optional) 2 for second level quotes. This means the single quote marks in a statement such as: Jack said, “Jill said ‘fish’ last Tuesday.”

quoteUnquoted

Wrap the string in quotes; quotes can be chosen for a specific language. Will not quote an empty string, and will not quote if there is a quote at the start or end of the (trimmed) string.

1: Input text (will be automatically trimmed); may be empty.
2: (optional) the ISO 639 language code for the quote marks; should be one of the supported languages (the list is documented in German)
3: (optional) 2 for second level quotes. This means the single quote marks in a statement such as: Jack said, “Jill said ‘fish’ last Tuesday.”

removeDiacritics

Removes all diacritical marks from the input.

1: Input text

sentenceTerminated

Is this sentence terminated? Should work with CJK, and allows quotation marks to follow.

Returns nothing if the sentence is unterminated.
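
A reduced, ASCII-only version of this check can be written with a single Lua pattern. The sketch below is an assumption-laden simplification: it omits the CJK terminators and the typographic quotation marks that the module's pattern also accepts, and the function name is hypothetical.

```lua
-- Hypothetical ASCII-only sketch: a terminator (! . ?) optionally
-- followed by quotes or closing brackets, anchored at the end.
local function isTerminatedAscii(s)
    return s:find("[!%.%?][\"'%]]*$") ~= nil
end

print(isTerminatedAscii('He said "stop."'))  --> true
print(isTerminatedAscii("No end"))           --> false
```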

ucfirstAll

The first letter of every recognized word is converted to upper case. This contrasts with the parser function {{ucfirst:}} which changes only the first character of the whole string passed.

A few common HTML entities are protected; as a side effect of the implementation, numerical entities passed in (e.g. &#38;) may be returned in their named form (e.g. &amp;).
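
The core idea, capitalizing every lower-case letter that follows a non-word character, can be sketched for ASCII text as follows. This is not the module's code: the name ucfirstAllAscii is hypothetical, the sketch is not Unicode-aware, and it skips the entity protection described above.

```lua
-- Hypothetical ASCII-only sketch. A leading space is added so that
-- the first word is also preceded by a non-word character.
local function ucfirstAllAscii(s)
    local r = (" " .. s):gsub("(%W)(%l)", function(sep, letter)
        return sep .. letter:upper()
    end)
    return r:sub(2)  -- drop the helper space again
end

print(ucfirstAllAscii("hello wide world"))  --> Hello Wide World
print(ucfirstAllAscii("jean-paul sartre"))  --> Jean-Paul Sartre
```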

uprightNonlatin

Takes a string. Italicized non-Latin characters are un-italicized, unless they are a single Greek letter.

zip

Combines a tuple of lists by convolution. This is easiest to explain by example: given two lists, list1 = "a,b,c" and list2 = "1,2,3", then
zip(list1, list2, sep = ",", isep = "-", osep = "/")
outputs
a-1/b-2/c-3

1, 2, 3, … – Lists to be combined
sep – A separator (a Lua pattern) used to split the lists. If empty, the lists are split into individual characters.
sep1, sep2, sep3, … – Allows a different separator to be used for each list.
isep – Output separator; placed between elements which were at the same index in their lists.
osep – Output separator; placed between elements which had different original indices; i.e. between the groups joined with isep
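
The example above can be reproduced with a stand-alone Lua sketch. It makes several simplifying assumptions: a single non-empty, plain-text separator is used for all lists (no sep1, sep2, … and no patterns), and the names splitPlain and zipLists are hypothetical.

```lua
-- Hypothetical helper: split s at a non-empty plain-text separator.
local function splitPlain(s, sep)
    assert(sep ~= "", "this sketch requires a non-empty separator")
    local parts, start = {}, 1
    while true do
        local i, j = s:find(sep, start, true)
        if not i then
            parts[#parts + 1] = s:sub(start)
            return parts
        end
        parts[#parts + 1] = s:sub(start, i - 1)
        start = j + 1
    end
end

-- Sketch of the zip logic: elements with the same index are joined
-- with isep, and the resulting groups are joined with osep.
local function zipLists(lists, sep, isep, osep)
    local split, maxLen = {}, 0
    for i, list in ipairs(lists) do
        split[i] = splitPlain(list, sep)
        maxLen = math.max(maxLen, #split[i])
    end
    local groups = {}
    for i = 1, maxLen do
        local group = {}
        for j = 1, #split do
            group[j] = split[j][i] or ""  -- shorter lists yield ""
        end
        groups[i] = table.concat(group, isep)
    end
    return table.concat(groups, osep)
end

print(zipLists({ "a,b,c", "1,2,3" }, ",", "-", "/"))  --> a-1/b-2/c-3
```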

split

Splits a string into chunks at the specified delimiter, and returns the first (or user-specified) chunk. This is a non-Unicode-aware implementation of mw.text.split which, for ASCII-only text, can be up to 60 times faster.

1 (or text) – the text to be split
2 (or pattern) – the pattern to use when splitting the text. By default, it is interpreted as a Lua string library pattern.
3 (or plain) – if set to "true", pattern will be interpreted as plain text, not a pattern.
4 (or index) – The chunk to return. If omitted, the first chunk will be returned. Can be set to a negative number to count from the end (e.g. -1 will return the last chunk).
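
How the four parameters fit together can be sketched with the byte-oriented string library. The function below is an illustration under assumptions, not the module's implementation: the name splitChunk is hypothetical, returning an empty string for an out-of-range index is assumed from the general rule stated at the top of this page, and splitting stops at the first empty match.

```lua
-- Hypothetical sketch: split text and return one chunk by index.
local function splitChunk(text, pattern, plain, index)
    local chunks, start = {}, 1
    while true do
        local i, j = string.find(text, pattern, start, plain)
        if not i or j < i then
            -- No further match (or an empty match): keep the rest.
            chunks[#chunks + 1] = text:sub(start)
            break
        end
        chunks[#chunks + 1] = text:sub(start, i - 1)
        start = j + 1
    end
    index = index or 1
    if index < 0 then
        index = #chunks + index + 1  -- negative values count from the end
    end
    return chunks[index] or ""
end

print(splitChunk("a.b.c", ".", true, 2))    --> b
print(splitChunk("a.b.c", ".", true, -1))   --> c
print(splitChunk("a1b22c", "%d+", false, 3))  --> c
```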

Examples and test page

There are tests available (in German) to illustrate this in practice.

Use in another Lua module

All of the above functions can be called from other Lua modules. Use require(); the code below checks for errors while loading the module:

local lucky, Text = pcall( require, "Module:Text" )
if type( Text ) == "table" then
    Text = Text.Text()
else
    -- In the event of errors, Text is an error message.
    return "<span class=\"error\">" .. Text .. "</span>"
end

You may then call:

Text.char( apply, again, accept )
Text.concatParams( args, separator, format )
Text.containsCJK( s )
Text.removeDelimited( s )
Text.getPlain( s )
Text.isLatinRange( s )
Text.isQuote( c )
Text.listToText( table, format )
Text.quote( s, lang, mode )
Text.quoteUnquoted( s, lang, mode )
Text.removeDiacritics( s )
Text.sentenceTerminated( s )
Text.split( text, pattern, plain ) – non-Unicode version of mw.text.split
Text.gsplit( text, pattern, plain ) – non-Unicode version of mw.text.gsplit
Text.ucfirstAll( s )
Text.uprightNonlatin( s )

Usage

This is a general library; use it anywhere.

Dependencies

Module:Yesno
Module:Text/data – Lua patterns and information about quotes

--[=[ 2014-09-27
Text utilities
]=]



local Text = { }
local patternCJK        = false
local patternLatin      = false
local patternTerminated = false



Text.concatParams = function ( args, apply, adapt )
    -- Concat list items into one string
    -- Parameter:
    --     args   -- table (sequence) with numKey=string
    --     apply  -- string (optional); separator (default: "|")
    --     adapt  -- string (optional); format including "%s"
    -- Returns: string
    local collect = { }
    for k, v in pairs( args ) do
        if type( k ) == "number" then
            v = mw.text.trim( v )
            if v ~= "" then
                if adapt then
                    v = mw.ustring.format( adapt, v )
                end
                table.insert( collect, v )
            end
        end
    end
    return table.concat( collect,  apply or "|" )
end -- Text.concatParams()



Text.containsCJK = function ( analyse )
    -- Is any CJK code within?
    -- Parameter:
    --     analyse  -- string
    -- Returns: true, if CJK detected
    local r
    if not patternCJK then
        patternCJK = mw.ustring.char( 91,
                                       13312, 45,  40959,
                                      131072, 45, 178207,
                                      93 )
    end
    if mw.ustring.find( analyse, patternCJK ) then
        r = true
    else
        r = false
    end
    return r
end -- Text.containsCJK()



Text.listToText = function ( args, adapt )
    -- Format list items similar to mw.text.listToText()
    -- Parameter:
    --     args   -- table (sequence) with numKey=string
    --     adapt  -- string (optional); format including "%s"
    -- Returns: string
    local collect = { }
    for k, v in pairs( args ) do
        if type( k ) == "number" then
            v = mw.text.trim( v )
            if v ~= "" then
                if adapt then
                    v = mw.ustring.format( adapt, v )
                end
                table.insert( collect, v )
            end
        end
    end
    return mw.text.listToText( collect )
end -- Text.listToText()



Text.sentenceTerminated = function ( analyse )
    -- Is string terminated by dot, question or exclamation mark?
    --     Trailing quotation marks, closing link brackets and so on are allowed
    -- Parameter:
    --     analyse  -- string
    -- Returns: true, if sentence terminated
    local r
    if not patternTerminated then
        patternTerminated = mw.ustring.char( 91,
                                             12290,
                                             65281,
                                             65294,
                                             65311 )
                            .. "!%.%?…][\"'%]‹›«»‘’“”]*$"
    end
    if mw.ustring.find( analyse, patternTerminated ) then
        r = true
    else
        r = false
    end
    return r
end -- Text.sentenceTerminated()



Text.ucfirstAll = function ( adjust )
    -- Capitalize all words
    -- Precondition:
    --     adjust  -- string
    -- Returns: string with all first letters in upper case
    local r = " " .. adjust
    local i = 1
    local c, j, m
    if adjust:find( "&" ) then
        r = r:gsub( "&amp;",      "&#38;" )
             :gsub( "&lt;",       "&#60;" )
             :gsub( "&gt;",       "&#62;" )
             :gsub( "&nbsp;",    "&#160;" )
             :gsub( "&thinsp;", "&#8201;" )
             :gsub( "&zwnj;",   "&#8204;" )
             :gsub( "&zwj;",    "&#8205;" )
             :gsub( "&lrm;",    "&#8206;" )
             :gsub( "&rlm;",    "&#8207;" )
        m = true
    end
    while i do
        i = mw.ustring.find( r, "%W%l", i )
        if i then
            j = i + 1
            c = mw.ustring.upper( mw.ustring.sub( r, j, j ) )
            r = string.format( "%s%s%s",
                               mw.ustring.sub( r, 1, i ),
                               c,
                               mw.ustring.sub( r, i + 2 ) )
            i = j
        end
    end -- while i
    r = r:sub( 2 )
    if m then
        r = r:gsub(     "&#38;", "&amp;" )
             :gsub(     "&#60;", "&lt;" )
             :gsub(     "&#62;", "&gt;" )
             :gsub(    "&#160;", "&nbsp;" )
             :gsub(   "&#8201;", "&thinsp;" )
             :gsub(   "&#8204;", "&zwnj;" )
             :gsub(   "&#8205;", "&zwj;" )
             :gsub(   "&#8206;", "&lrm;" )
             :gsub(   "&#8207;", "&rlm;" )
             :gsub( "&#X(%x+);", "&#x%1;" )
    end
    return r
end -- Text.ucfirstAll()



Text.uprightNonlatin = function ( adjust )
    -- Ensure non-italics for non-latin text parts
    --     A single Greek letter on its own is left unchanged
    -- Precondition:
    --     adjust  -- string
    -- Returns: string with non-latin parts enclosed in <span>
    local r
    if not patternLatin then
        patternLatin = mw.ustring.char(   94, 91,
                                           7, 45,  591,
                                        8194, 45, 8250,
                                          93, 42, 36 )
    end
    if mw.ustring.match( adjust, patternLatin ) then
        -- latin only, horizontal dashes, quotes
        r = adjust
    else
        local c
        local j    = false
        local k    = 1
        local m    = false
        local n    = mw.ustring.len( adjust )
        local span = "%s%s<span style='font-style:normal'>%s</span>"
        local flat = function ( a )
                -- isLatin
                return  a <= 591   or   ( a >= 8194  and  a <= 8250 )
              end -- flat()
        local form = function ( a )
                return string.format( span,
                                      r,
                                      mw.ustring.sub( adjust, k, j - 1 ),
                                      mw.ustring.sub( adjust, j, a ) )
              end -- form()
        r = ""
        for i = 1, n do
            c = mw.ustring.codepoint( adjust, i, i )
            if c > 64  or  c == 38  or  c == 60 then    -- '&' '<'
                if flat( c ) then
                    if j then
                        if m then
                            if i == m then
                                -- single greek letter.
                                j = false
                            end
                            m = false
                        end
                        if j then
                            local nx = i - 1
                            local s  = ""
                            for ix = nx, 1, -1 do
                                c = mw.ustring.sub( adjust, ix, ix )
                                if c == " "  or  c == "(" then
                                    nx = nx - 1
                                    s  = c .. s
                                else
                                    break -- for ix
                                end
                            end -- for ix
                            r = form( nx ) .. s
                            j = false
                            k = i
                        end
                    end
                elseif not j then
                    j = i
                    if c >= 880  and  c <= 1023 then
                        -- single greek letter?
                        m = i + 1
                    else
                        m = false
                    end
                end
            elseif m then
                m = m + 1
            end
        end -- for i
        if j  and  ( not m  or  m < n ) then
            r = form( n )
        else
            r = r .. mw.ustring.sub( adjust, k )
        end
    end
    return r
end -- Text.uprightNonlatin()



-- Export
local p = { }

function p.concatParams( frame )
    local args
    local template = frame.args.template
    if type( template ) == "string" then
        template = mw.text.trim( template )
        template = ( template == "1" )
    end
    if template then
        args = frame:getParent().args
    else
        args = frame.args
    end
    return Text.concatParams( args,
                              frame.args.separator,
                              frame.args.format )
end

function p.containsCJK( frame )
    return Text.containsCJK( frame.args[ 1 ] or "" ) and "1" or ""
end

function p.listToText( frame )
    local args
    local template = frame.args.template
    if type( template ) == "string" then
        template = mw.text.trim( template )
        template = ( template == "1" )
    end
    if template then
        args = frame:getParent().args
    else
        args = frame.args
    end
    return Text.listToText( args, frame.args.format )
end

function p.sentenceTerminated( frame )
    return Text.sentenceTerminated( frame.args[ 1 ] or "" ) and "1" or ""
end

function p.ucfirstAll( frame )
    return Text.ucfirstAll( frame.args[ 1 ] or "" )
end

function p.uprightNonlatin( frame )
    return Text.uprightNonlatin( frame.args[ 1 ] or "" )
end

function p.zip(frame)
	local lists = {}
	local seps = {}
	local defaultsep = frame.args["sep"] or ""
	local innersep = frame.args["isep"] or ""
	local outersep = frame.args["osep"] or ""
	
	-- Parse the parameters
	for k, v in pairs(frame.args) do
		local knum = tonumber(k)
		if knum then lists[knum] = v else
			if string.sub(k, 1, 3) == "sep" then
				local sepnum = tonumber(string.sub(k, 4))
				if sepnum then seps[sepnum] = v end
			end
		end
	end
	-- If no explicit separators are given, use the default separator
	for i = 1, math.max(#seps, #lists) do
		if not seps[i] then seps[i] = defaultsep end
	end

	-- Split the lists
	local maxListLen = 0
	for i = 1, #lists do
		lists[i] = mw.text.split(lists[i], seps[i])
		if #lists[i] > maxListLen then maxListLen = #lists[i] end
	end

	local result = ""
	for i = 1, maxListLen do
		if i ~= 1 then result = result .. outersep end
		for j = 1, #lists do
			if j ~= 1 then result = result .. innersep end
			result = result .. (lists[j][i] or "")
		end
	end
	return result
end

-- Removes all diacritics from the input string by decomposing the characters, removing the combining diacritical marks and recomposing the remaining characters
function p.removeDiacritics(frame)
	local combiningDiacriticalMarks = "[" .. mw.ustring.char(0x0300) .. "-" .. mw.ustring.char(0x036F) .. "]"
	return mw.ustring.toNFC(mw.ustring.gsub(mw.ustring.toNFD(frame.args[1] or ""), combiningDiacriticalMarks, ""))
end

p.Text = function ()
    return Text
end -- p.Text

return p
