Help:Searching/Regex

Regular expressions

This page covers enough regular expression syntax to get started answering questions about wikitext contents on the wiki. A regex uses metacharacters to build patterns that match literal characters. The pattern you give is matched against the target character by character. To let a position match several possibilities, metacharacters are needed, and they are drawn from the same keyboard characters that also appear in the wikitext.

Metacharacters

The left curly bracket is a metacharacter, so a regexp pattern intended to match a template in the wikitext must "escape" each opening curly bracket "{" in the target by writing \{. All target text (all wikitext) is literal text, but we can backslash "escape" the regex metacharacters \. \? \+ \* \| \{ \[ \] \( \) \" \\ \# \@ \< \~ when we refer to them as literal characters in the wikitext we are interested in mining. Search ignores the backslash wherever it is meaningless or unnecessary: \n matches n, and so on. So although you don't need to backslash escape & or > or }, it is safe to do so. An unnecessary backslash will not cause your pattern to fail; what will cause it to fail is using any of these characters literally without escaping: [ ] . * + ? | { ( ) " \ # @ < ~ .
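The escaping rule above can be sketched as a small helper. This is only an illustration: `escape_literal` and `METACHARACTERS` are hypothetical names, not part of any wiki tooling, and the character set is copied from the list in the paragraph above.

```python
# Hypothetical helper: backslash-escape the metacharacters listed above so
# that a literal wikitext snippet can be pasted into a regex search pattern.
METACHARACTERS = set('.?+*|{[]()"\\#@<~')

def escape_literal(text: str) -> str:
    """Return text with each listed metacharacter prefixed by a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

print(escape_literal("{{Infobox}}"))  # \{\{Infobox}}
print(escape_literal("a.b@c"))        # a\.b\@c
```

Characters such as & or > or } are left alone here, since the text says escaping them is optional.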

[0-9] will match any digit, [a-y] any lowercase letter except z, [zZ] any z, and so on. So square brackets mean "character class".
Dot . will match any character in the targeted position, including a newline.
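As a rough check of these two rules, here is a sketch using Python's `re` module as a stand-in for the wiki search engine. That the two agree on these simple patterns is an assumption. In Python the dot only matches a newline when `re.DOTALL` is set, so the hypothetical helper `matches` sets that flag to mimic the behaviour described above.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the whole text matches; DOTALL lets the dot match a newline."""
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches(r"[0-9]", "7"))     # True: any digit
print(matches(r"[a-y]", "z"))     # False: the range stops at y
print(matches(r"[zZ]", "Z"))      # True: either case of z
print(matches(r"a.b", "a\nb"))    # True: the dot stands for the newline
```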

The number of sequential digits or characters these symbols match is expressed by following them with a quantifying metacharacter:

* means zero or more
+ means one or more
? means zero or one

of the character it follows. The number of times it matches can also be given as a count or a range: a{2}, a{2,} and a{2,5} match exactly 2, 2 or more, or 2 to 5 a's respectively. So curly brackets mean "quantifier".
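The quantifiers above can be tried out with a short sketch. Python's `re` module is used here only as a stand-in for the wiki search engine (an assumption), and `whole_match` and `CASES` are illustrative names.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# pattern -> (strings that should match, strings that should not)
CASES = {
    r"ba*":    (["b", "ba", "baaa"], ["a"]),        # zero or more a's
    r"ba+":    (["ba", "baaa"], ["b"]),             # one or more a's
    r"ba?":    (["b", "ba"], ["baa"]),              # zero or one a
    r"a{2}":   (["aa"], ["a", "aaa"]),              # exactly 2
    r"a{2,}":  (["aa", "aaaaaa"], ["a"]),           # 2 or more
    r"a{2,5}": (["aa", "aaaaa"], ["a", "aaaaaa"]),  # 2 to 5
}

for pattern, (good, bad) in CASES.items():
    assert all(whole_match(pattern, s) for s in good)
    assert not any(whole_match(pattern, s) for s in bad)
```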

Parentheses are a grouping mechanism, so we can quantify more than just the previous character, and so we can make boundaries for a set of alternative matches. (See alternation below.)
The quotation marks are an escape mechanism, like square brackets or the backslash.
The angle brackets stand for numerals, not digits. For example, <5-799> matches the numbers 5 to 799, in one to three positions. Compare this with the alternative: [0-9]{1,3} could match ones, tens, or hundreds, as 0-999 or 00-999 or 000-999.
Tilde ~ looks ahead and negates the next character. In other words, if the pattern matches at this position, it is un-matched if the next character is the one written after the tilde.
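The difference between "numerals" and "digits" in the angle-bracket item above can be illustrated with a sketch. Python's `re` module has no <5-799> syntax, so the hypothetical helper `in_numeric_range` emulates the idea by comparing integer values. How the wiki search engine treats leading zeros or longer digit runs is not covered here.

```python
import re

def in_numeric_range(text: str, low: int, high: int) -> bool:
    """Emulate the idea of <low-high>: compare the value, not the digits."""
    if re.fullmatch(r"[0-9]+", text) is None:
        return False
    return low <= int(text) <= high

def is_one_to_three_digits(text: str) -> bool:
    """The digit-based alternative: [0-9]{1,3}."""
    return re.fullmatch(r"[0-9]{1,3}", text) is not None

for sample in ["4", "5", "799", "800"]:
    print(sample, in_numeric_range(sample, 5, 799), is_one_to_three_digits(sample))
# 4 False True
# 5 True True
# 799 True True
# 800 False True
```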

It is not safe to search for a lone @ because that single metacharacter matches literally everything; you can use \@ to find all pages that use an "at" symbol.

Similarly, to find all pages that use the number zero, note that Search returns an error for a lone 0; use one of the three escape mechanisms for 0 or @:

"0"
\0
[0]

or find a larger pattern around the zero you seek. Although zero is not a metacharacter, these escape mechanisms work.

The rest of wiki regex is pretty straightforward. Characters stand for themselves unless they are metacharacters. Metacharacters must be escaped when they appear outside of a character class.

Character classes

A character class means "literal characters", plural. Because it means "literal", you normally don't have to escape a metacharacter inside a character class; the square brackets already escape it. The /slash delimiters/ mean we must of course escape any slash character, even inside a character class. Slash is the only character that always needs escaping in a character class. Because ] and - have special meaning (as metacharacters) to a character class, they must sometimes be escaped: each is literal when it is the first character in the class, but anywhere else it must be escaped. Only the backslash works as an escape mechanism inside a character class.

A character class can serve to escape metacharacters, so [-|*\/.{\]] or []|*\/.{\-] means "either a dash OR pipe OR star OR slash OR dot OR left curly bracket OR right square bracket". So [][.?+*|\/{}()\-] or [-[.?+*|\/{}()\]] works to find all the metacharacters in the wikitext, all of them except the backslash. Neither [\] nor [\\] allows us to OR a literal backslash. To OR a backslash character, there's alternation with the pattern \\ to handle that case. (See below.)
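The two equivalent classes above can be compared with a short sketch. Python's `re` module stands in for the wiki search engine here; that both accept ] or - unescaped in first position is an assumption about the wiki side and is true of Python. `CLASS_A`, `CLASS_B` and `LISTED` are illustrative names.

```python
import re

CLASS_A = r"[][.?+*|\/{}()\-]"   # ] first (literal), dash escaped at the end
CLASS_B = r"[-[.?+*|\/{}()\]]"   # dash first (literal), ] escaped at the end
LISTED = "][.?+*|/{}()-"         # the characters both classes should accept

for ch in LISTED:
    # Every listed character is a single-position match for both classes.
    assert re.fullmatch(CLASS_A, ch) and re.fullmatch(CLASS_B, ch)

# The backslash only escapes here; it is not itself a member of either class.
assert re.fullmatch(CLASS_A, "\\") is None
assert re.fullmatch(CLASS_B, "\\") is None
```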

A character class understands the "inverse" of itself: [^abc] is "not a or b or c". A character class stands for a single character in a targeted position, so it's not really an inverse of a set, but rather a NOT of a character.
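A brief sketch of the point that a negated class is still one character in one position, again with Python's `re` module as an assumed stand-in for the wiki search engine (`whole_match` is an illustrative name):

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole_match(r"x[^abc]z", "xdz"))  # True: d is not a, b or c
print(whole_match(r"x[^abc]z", "xaz"))  # False: a is excluded
print(whole_match(r"x[^abc]z", "xz"))   # False: the position must still be filled
```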

Currently character classes are limited to an expansion of four characters, so [0-9] would require three searches [0-3], [4-7], and [8-9]. The alphabet would require seven searches. This is to guarantee regex will work without overloading the search engine. See task T106685.
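The arithmetic in the paragraph above can be sketched as follows. `split_range` is a hypothetical helper, and the limit of four characters per class is taken from the text, not read from the search engine.

```python
def split_range(first: str, last: str, size: int = 4):
    """Split the range first-last into classes of at most `size` characters."""
    classes = []
    start, end = ord(first), ord(last)
    while start <= end:
        stop = min(start + size - 1, end)
        classes.append(f"[{chr(start)}-{chr(stop)}]")
        start = stop + 1
    return classes

print(split_range("0", "9"))       # ['[0-3]', '[4-7]', '[8-9]']
print(len(split_range("a", "z")))  # 7
```

Each returned class would then be used in its own search.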

Note that constructs such as \d (digit) or \a (alphabetic), used in some other regex implementations, are not accepted.

Alternation

Finally, alternation is a class of regex that contains alternative possibilities for a match, say an AA or a BB, or a CC:

"AA" OR "BB" OR "CC" to Word search an entire page
AA|BB|CC to regexp search a two-character position
(AA|BB|CC) where used within a larger regexp. An alternation finds the longest pattern, so the parentheses define its boundary; you don't need that boundary if the alternation is the entire regexp pattern.
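The effect of the parentheses can be sketched with Python's `re` module as an assumed stand-in for the wiki search engine. The surrounding x and y are made-up context characters, and `whole_match` is an illustrative name.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# As the entire pattern, the alternation needs no parentheses.
print(whole_match(r"AA|BB|CC", "BB"))        # True

# Inside a larger pattern, the parentheses bound the alternatives.
print(whole_match(r"x(AA|BB|CC)y", "xBBy"))  # True

# Without them the pattern reads as "xAA" or "BB" or "CCy".
print(whole_match(r"xAA|BB|CCy", "xBBy"))    # False
print(whole_match(r"xAA|BB|CCy", "BB"))      # True
```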