Module talk:Plain text
This module was considered for deletion on 2018 May 5. The result of the discussion was "no consensus".
strip_apostrophe_markup
@Galobtter: The function string.gsub() is quite forgiving, so you don't need to test for each case. Also ' doesn't need to be escaped when used as a search pattern. You can't sensibly export the strip_apostrophe_markup function, so it should be local, or could just go inline. You can simplify strip_apostrophe_markup to
local function strip_apostrophe_markup(txt)
	txt = txt:gsub("'''''", ""):gsub("''''", ""):gsub("'''", ""):gsub("''", "")
	return txt
end
In the main function, text should be a local variable:
local text = frame.args[1]
I don't like altering code while others are developing it, so I'll leave you to update it as you see fit. --RexxS (talk) 19:56, 14 April 2018 (UTC)
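A minimal sketch of how the simplified function behaves, runnable in plain Lua 5.1 outside Scribunto. The sample strings and the `leaky` helper are made up for illustration. It also shows one reason to assign the result to `txt` before returning: gsub returns two values (the new string and the substitution count), and returning the call directly would pass both on to the caller.

```lua
-- Simplified version from the comment above. The longest runs are removed
-- first; removing "''" first would leave a stray apostrophe from a run of five.
local function strip_apostrophe_markup(txt)
	txt = txt:gsub("'''''", ""):gsub("''''", ""):gsub("'''", ""):gsub("''", "")
	return txt
end

-- Illustrative input (not taken from the module's test cases).
local sample = "'''''bold italic''''' and ''it's italic''"
assert(strip_apostrophe_markup(sample) == "bold italic and it's italic")

-- Hypothetical variant that returns the gsub call directly:
-- the caller receives the string AND the substitution count.
local function leaky(txt)
	return txt:gsub("''", "")
end
assert(select("#", strip_apostrophe_markup("''a''")) == 1)
assert(select("#", leaky("''a''")) == 2)
```

The single apostrophe in "it's" survives because every pattern needs at least two consecutive apostrophes.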
- RexxS, on the second point: yeah, I forgot to localize. Regarding strip_apostrophe_markup(txt), I was also wondering why there were so many ifs etc., but I was too lazy to look it over (as you can see, I just copied it from Module:Citation/CS1/COinS). I wonder if the same change should be made at Module:Citation/CS1/COinS; pinging Trappist the monk on that. Galobtter (pingó mió) 20:05, 14 April 2018 (UTC)
- It's best to use the ustring library (as you have done), mainly because the module is likely to be reused in other languages, so your new code ends up not quite as simple, but is still fine. Nice work! --RexxS (talk) 20:36, 14 April 2018 (UTC)
- Thanks! Yeah, it is good to allow easy reuse; plus, I believe we ourselves occasionally use unicode characters for places. Galobtter (pingó mió) 20:43, 14 April 2018 (UTC)
- We do sometimes use unicode characters for places, but interestingly string.gsub copes perfectly well with all of the Latin diacriticals and Greek or Cyrillic script that I've tried: Module talk:RexxS #Test stripApost. There will almost certainly be some characters that trip it up, but they won't be common. --RexxS (talk) 21:16, 14 April 2018 (UTC)
- That is rather interesting. I think that as long as nothing is being done to the unicode characters themselves, it may be OK. Galobtter (pingó mió) 06:06, 15 April 2018 (UTC)
- I suspect that it's a case of not using a function that makes use of absolute positioning within the string, because the byte count that the string library uses much of the time is obviously going to be incorrect with unicode characters. We probably just struck lucky with gsub. --RexxS (talk) 20:55, 15 April 2018 (UTC)
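The point about byte counts and absolute positions can be checked in plain Lua 5.1. This is only a sketch: the word "été" is an arbitrary example, written with decimal byte escapes so that it does not depend on how the source file is encoded.

```lua
-- "été" in UTF-8: é is the two bytes 195 169.
local word = "\195\169t\195\169"

-- The string library counts bytes, not characters (3 characters, 5 bytes).
assert(#word == 5)

-- Position-based functions can cut a character in half:
-- the first "character" here is only the lead byte of é.
assert(word:sub(1, 1) == "\195")

-- A pattern made only of ASCII bytes is safe, because UTF-8 never reuses
-- ASCII byte values inside a multi-byte character.
local stripped, n = ("''" .. word .. "''"):gsub("''", "")
assert(stripped == word and n == 2)

-- "." matches one byte, so it sees é as two separate items.
local _, dots = ("\195\169"):gsub(".", "x")
assert(dots == 2)
```

So gsub is safe with ASCII-only patterns by design rather than by luck; the risk starts when a pattern addresses single characters (".", positions, or non-ASCII character classes).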
I replaced the mw.ustring.gsub with plain gsub because ustring is a lot slower than gsub and is not needed in this module. The optimization is not necessary, but since people are looking at the code I thought it worth mentioning that wikitext always uses UTF-8, which means Lua gsub will work well with the patterns in this module. Lua gsub works in any language with a pattern like '[12]' ('1' or '2'), but mw.ustring.gsub would be needed for a pattern like '[১২]' (which might be used at the Bengali Wikipedia to search for their equivalent). In the first case (Lua gsub), the pattern finds the first location matching any of the bytes between [ and ]. In the Bengali case, each digit is three bytes in UTF-8, so there are six bytes between the square brackets. If Lua gsub were used, it would look for any of those bytes individually rather than for either digit. Johnuniq (talk) 09:47, 18 April 2018 (UTC)
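A small sketch of the byte-set behaviour described above, in plain Lua 5.1 (mw.ustring is only available inside Scribunto, so it is not used here). The variable names and the `strip_one_two` helper are illustrative assumptions; the Bengali digits are written as decimal byte escapes.

```lua
-- Bengali digits as UTF-8 bytes.
local one   = "\224\167\167"  -- ১ (U+09E7)
local two   = "\224\167\168"  -- ২ (U+09E8)
local three = "\224\167\169"  -- ৩ (U+09E9)

-- An ASCII class is safe in any UTF-8 text.
local s = "a1" .. one .. "b2"
assert((s:gsub("[12]", "")) == "a" .. one .. "b")

-- A class built from ১ and ২ is really the byte set {224, 167, 168}.
local class = "[" .. one .. two .. "]"

-- It damages ৩, which shares its first two bytes with ১ and ২:
-- two bytes are removed and a lone trailing byte is left behind.
local out, n = three:gsub(class, "")
assert(out == "\169" and n == 2)

-- Byte-safe alternative without ustring: match each whole character in turn.
local function strip_one_two(txt)
	txt = txt:gsub(one, ""):gsub(two, "")
	return txt
end
assert(strip_one_two(one .. three .. two) == three)
```

Matching whole byte sequences works because UTF-8 is self-synchronizing: the bytes of one character cannot appear as a match straddling two other characters.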
Could remove indentations
This can be combined with stripping leading spaces: gsub("^[:;%s]+", "")
— 𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 (talk) 20:31, 24 May 2021 (UTC)
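A quick sketch of what the suggested pattern does, in plain Lua. The helper name `strip_leading_indent` is hypothetical and the inputs are made up.

```lua
-- Pattern from the suggestion above: any mix of colons, semicolons and
-- whitespace at the very start of the text.
local function strip_leading_indent(txt)
	-- Parentheses drop gsub's second return value (the match count).
	return (txt:gsub("^[:;%s]+", ""))
end

assert(strip_leading_indent(":: indented reply") == "indented reply")
assert(strip_leading_indent(" ; term") == "term")

-- "^" anchors to the start of the whole string, not to each line,
-- so indentation on later lines is left alone.
assert(strip_leading_indent(":a\n:b") == "a\n:b")
```

If per-line stripping were wanted, the text would have to be processed line by line, since Lua patterns have no multi-line mode.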
Performance improvements (and other) in the sandbox
I made a few performance (and other) improvements to this module in the sandbox based on the work with Module:User scripts table (for which I started using Module:Plain text, and ended up forking and customising it for the needs there). The two performance improvements are:
- Use greedy [^x]+x instead of ungreedy .-x whenever possible; and
- Use a single gsub for all File:, Category:, Media:, etc, instead of a gsub for each.
— 𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 ☎ 13:48, 21 June 2021 (UTC)
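The two ideas above can be sketched as follows. This is not the sandbox code: the function name, the `dropped` table and its entries are assumptions for illustration. Lua patterns have no alternation, so one way to cover several prefixes in a single gsub is to capture the prefix and decide in a replacement function.

```lua
-- Greedy negated class versus lazy ".-": same result on simple links.
local link = "x [[Category:Foo]] y [[Category:Bar]] z"
assert((link:gsub("%[%[.-%]%]", "")) == (link:gsub("%[%[[^%]]+%]%]", "")))

-- Hypothetical set of prefixes whose links are dropped entirely.
local dropped = { file = true, image = true, category = true, media = true }

local function strip_namespace_links(text)
	-- One pass over the text: capture the prefix before the colon.
	return (text:gsub("%[%[%s*(%a+)%s*:[^%]]*%]%]", function(ns)
		if dropped[ns:lower()] then
			return ""
		end
		-- Returning nothing keeps the original match unchanged.
	end))
end

local input = "a [[File:X.png|thumb|cap]] b [[Category:Foo]] c [[Help:Bar]]"
assert(strip_namespace_links(input) == "a  b  c [[Help:Bar]]")
```

A limitation of this sketch: a file caption that itself contains a link, such as [[File:X.png|see [[Y]]]], ends at the first "]" and would not be matched as a whole.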
nowiki text removed?
The documentation has in its example: <nowiki>?</nowiki> (question mark in nowiki tag).
The module removes this wikitext altogether, including the question mark. Why is this "other stuff" to be removed? -DePiep (talk) 13:41, 2 September 2021 (UTC)
Tag stripping
Currently, this module strips both the tags and their contents for all HTML-style tags, except for <span>, <i>, <b>, <em>, and <strong> (and the last three only because I just added cases for them). However, there are a variety of other tags which are valid in wikitext and whose contents arguably should be kept after discarding the tags themselves, e.g. <h2>, <dfn>, <sup>, <u>. These could continue to be added here individually, but I think it's probably simpler to reverse the module's behavior: discard the contents of tags only for a curated list, and otherwise keep the contents.
The main issue I can see with that would be for <sub> and <sup>, where just stripping them often results in confusing text, e.g. stripping "2<sup>32</sup>" would produce "232", or "v<sub>e</sub>" producing "ve"; in these cases it might be better to replace the tags with "^"/"_" (resulting, for the aforementioned examples, in output of "2^32" and "v_e") or other appropriate characters (though the suggested characters, I believe, are the ones most often used for indicating super/subscript when formatting options are limited). 「ディノ奴千?!」☎ Dinoguy1000 04:13, 6 October 2021 (UTC)
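A rough sketch of the reversed behaviour proposed above, in plain Lua 5.1. It is not the module's code: the function name, the `discard_contents` entries and the `marker` table are assumptions chosen for illustration.

```lua
-- Hypothetical curated list: tags dropped together with their contents.
local discard_contents = { ref = true, style = true }
-- Replacement markers for super/subscript, as suggested above.
local marker = { sup = "^", sub = "_" }

local function strip_tags(text)
	-- Self-closing tags have no contents; remove them first so they cannot
	-- be paired with a later closing tag of the same name.
	text = text:gsub("<%a[^>]*/>", "")
	-- Paired tags: %f[%W] forces the whole tag name to be captured and
	-- %1 refers back to it. Repeat until nothing matches, so that tags
	-- nested inside kept contents are processed too.
	local count
	repeat
		text, count = text:gsub("<(%a%w*)%f[%W][^>]*>(.-)</%1>", function(tag, inner)
			tag = tag:lower()
			if discard_contents[tag] then
				return ""
			end
			return (marker[tag] or "") .. inner
		end)
	until count == 0
	-- Any unpaired tag left over: drop the tag itself.
	return (text:gsub("</?%a[^>]*>", ""))
end

assert(strip_tags("2<sup>32</sup>") == "2^32")
assert(strip_tags("v<sub>e</sub>") == "v_e")
assert(strip_tags("<h2>Title</h2>") == "Title")
assert(strip_tags("a<ref>cite</ref>b") == "ab")
assert(strip_tags("<b>2<sup>32</sup></b>") == "2^32")
```

Known gaps in this sketch: the back-reference is case-sensitive, so <SUP>x</sup> is not treated as a pair (both tags are simply dropped), and nested tags of the same name on the discard list are paired with the nearest closing tag.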