Help:Searching/Regex/Sandboxing
Regular expressions are little computer programs, so regex searches must always be tested to achieve their potential precision and thoroughness. But only a few of these intensive searches are technically able to run at a time against the database. A sandbox minimizes your footprint and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that.
Both search links and regex require testing in a sandbox. Although developing a search link may target the entire wiki, a regexp should target as few pages as possible.
Search link templates like {{Regex}} are the best, quickest, and safest way to hone regex searches of the entire wiki[1] because
- you just type in the regexp
- it starts by searching only one page, by adding prefix:{{FULLPAGENAME}} for you
- you get an informational link you can share to make points on talk pages
For regex targeting template usage, the equivalent is {{tlre}}.
Regex sandboxing uses an ad hoc sandbox made from any already-saved page with real content that contains the target data. The search link on that page then searches its own page using prefix:{{FULLPAGENAME}}. You use Show preview there to test a temporary search link template instead of using the Wikipedia Search box.[2] The search results page can be used to modify the query further.[3]
Use of a sandbox enables the smallest possible footprint by using filters to limit the search domain. The first domain it targets is its own page in an ad hoc sandbox. Once your regexp pattern is honed, you can safely increase the search domain.
Sandboxing procedure
Regexp searches are restricted on the server, so the sandbox method reduces the regex search footprint by starting with the prefix:{{FULLPAGENAME}} filter every time. The prefix: filter can also filter a namespace by specifying that only page names that start with given letters are searched.
When using insource:/regex/, always use a filter. The order of items entered in the query does not matter. (They are first optimized.) Filters include the parameters
- intitle:
- incategory:
- hastemplate:
- prefix:
- linksto:
- insource:
Filters also include any bare word search terms.
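The "always use a filter" rule can be checked mechanically before a query is run. The following is only a minimal sketch in Python: the helper name has_filter is hypothetical, and it assumes that anything in the query other than an insource:/regex/ term (a filter parameter or a bare word) counts as a filter, as described above. It does not validate the filter itself.

```python
import re

# Matches one insource:/regex/ term. Escaped characters such as \/ are
# allowed inside the slashes. (Hypothetical helper, not part of MediaWiki.)
REGEX_TERM = re.compile(r"insource:/(?:\\.|[^/\\])*/")


def has_filter(query):
    """Return True if the query holds any term besides insource:/regex/ terms."""
    # Blank out every regex term, then see whether anything is left over.
    remainder = REGEX_TERM.sub(" ", query)
    return bool(remainder.split())


if __name__ == "__main__":
    print(has_filter("insource:/ul?=ft/"))
    print(has_filter("insource:/ul?=ft/ prefix:Help:Searching/Regex/Sandboxing"))
```

Because the order of items in the query does not matter, the check ignores where the filter appears.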
Fullpagename is namespace:pagename. Knowing this, you can adjust your prefix: parameter. Although prefix: starts by filtering down to one page, it accepts up to a namespace:, and it also accepts the beginning letter(s) of a set of pagenames if you want to reduce the namespace search domain. It doesn't accept All:.
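One way to picture how the prefix: value can be widened step by step is sketched below in Python. The function name widening_prefixes is hypothetical, and the sketch assumes the fullpagename has an explicit namespace and uses "/" to separate subpages. The stages it produces are only candidate values to type after prefix:, from one page up to a whole namespace.

```python
def widening_prefixes(fullpagename):
    """List prefix: values from narrowest (one page) to widest (namespace)."""
    # Assumption: the text before the first colon is the namespace.
    namespace, _, pagename = fullpagename.partition(":")
    parts = pagename.split("/")
    # Drop one trailing subpage segment at a time.
    stages = [
        namespace + ":" + "/".join(parts[:count])
        for count in range(len(parts), 0, -1)
    ]
    # The widest value that prefix: accepts is the bare namespace.
    stages.append(namespace + ":")
    return stages


if __name__ == "__main__":
    for value in widening_prefixes("Help:Searching/Regex/Sandboxing"):
        print("prefix:" + value)
```

Each later stage searches more pages than the one before it, so a regexp should only be moved to the next stage after it has been honed on the current one.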
The procedure here is an iterative, read-evaluate-modify cycle.
- Find an existing fullpagename with the wikitext instances you are interested in mining. Or create one yourself, and save it to the database so the query will find it.
- Open the wikitext, and enter a search link with the specific insource:/regexp/ and prefix:fullpagename directives.
- Show preview, and check the query as displayed by the search link. Activate the search link. Note the bold text in each match.
- Go back in your browser. Modify the regexp. Cycle. (If you don't go back, you may need to reset the query at the search box.)
- Expand the search domain, and test the accuracy of those results. You can trim or expand the number of the results using prefix:.
Caveat emptor: if you change the target for an immediate retesting, you'll have to save and purge, but not if you just change the regexp.
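The query that the procedure builds in the edit box can also be assembled outside the wiki, for example to keep a record of each iteration. This is a rough sketch in Python, not how {{regex}} or {{search link}} are implemented. The function names are hypothetical, and the base URL with its title and search parameters is an assumption about the usual English Wikipedia search address.

```python
from urllib.parse import urlencode

# Assumed search endpoint; adjust for another wiki.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def sandbox_query(regexp, fullpagename):
    """Combine a regexp with the one-page prefix: filter used for sandboxing."""
    return "insource:/" + regexp + "/ prefix:" + fullpagename


def search_url(query, base=BASE_URL):
    """Percent-encode the query into a search URL (hypothetical helper)."""
    return base + "?" + urlencode({"title": "Special:Search", "search": query})


if __name__ == "__main__":
    query = sandbox_query("ul?=ft", "Help:Searching/Regex/Sandboxing")
    print(query)
    print(search_url(query))
```

Changing only the regexp argument corresponds to one turn of the modify cycle; changing the fullpagename argument corresponds to expanding the search domain.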
Examples
As an ad hoc sandbox, you can show the wikitext of a section like this, already saved in the database, modify some of the patterns in the regex-search-link template calls on here, do a Show Preview, and see what matches when you click on the newly formed "search the database" link, all quite safely, and without changing a thing in the database.
The template calls that produce "1 ft/s, 2 sq ft, 3 m/s, 4 m*s-2, 5 ft.s-2, 6 °C/J, and 7 J/C" appear in the wikitext of this section like this:
- {{val|1|ul=ft/s|fmt = commas}}
- {{val|2|u=ft2}}
- {{val|3|u=m/s| fmt =commas }}
- {{val|4|u=m*s-2}}
- {{val|5|u=ft.s-2}}
- {{val|6|u=C/J}}
- {{val|7|ul=J/C}} → 7 J/C
Note how the above targets are |numbered|, then click on these links.
Query | Search link | Answer |
---|---|---|
Q1 Using {{search link}}, does this page employ template Val ? | {{search link|hastemplate: Val}} → hastemplate: Val
|
A. Yes, because its fullpagename shows on the search results. |
Q2 Using {{search link}} correctly, does this page use Val's fmt parameter? | {{search link|insource:/\{[Vv]al\{{!}}[^}]*fmt/ prefix:{{FULLPAGENAME}}}} →
insource:/\{[Vv]al\|[^}]*fmt/ prefix:Help:Searching/Regex/Sandboxing |
A2.1. Look for 1 and 3 in the search results in bold text. (Uses an appropriate filter.) |
Using {{regex}} instead... | {{regexp|\{[Vv]al\{{!}}[^}]*fmt}} →
insource:/\{[Vv]al\|[^}]*fmt/ prefix:Help:Searching/Regex/Sandboxing |
A2.2 Less typing than {{search link}}. |
Using {{tlre}} instead... | {{tlre|Val|pattern=fmt}} →
|
A2.3 Easiest for templates. |
Q3. Who uses u=ft OR ul=ft? (one-letter differs) | {{regex|pattern = ul?=ft}} →
|
A. Look for 1, 2, and 5 in bold text. |
Using {{tlre}}... | {{tlre|val|pattern = ul?=ft}} →
| |
Q4. AND of these, who also uses fmt=commas after that? | {{regex|pattern = ul?=ft.*commas}} →
insource:/ul?=ft.*commas/ prefix:Help:Searching/Regex/Sandboxing |
A. No context shown, but article title is shown. A half a Bug? |
Who has one space before the word "commas"? | {{regex|. commas}} → insource:/. commas/ prefix:Help:Searching/Regex/Sandboxing
|
A. 1 but not 2.
|
Q5. Who uses either u or ul with "ft" OR uses "fmt=commas"? Allow for
all the possible usage of spaces between the named arguments and their values. |
{{regex|(ul? *=*ft|fmt *= *commas)}} → insource:/regexp/ prefix:Help:Searching/Regex/Sandboxing
|
A. 1, 2, 3, and 5.
|
Q6. Who uses ft or m, in |u= or |ul= ?
|
{{regex|ul? *=*(ft|m)}} → insource:/m)/ prefix:Help:Searching/Regex/Sandboxing
|
A. 1, 2, 3, 4, and 5.
|
Q7. Who uses . or * in the unit code? | {{tlre|val|pattern=u *= *(\.|\*)/}} → Testing u *= *(\. on this page
|
A. 4 and 5. |
Who uses a pipe? | {{regex|\|}} → insource:/\/ prefix:Help:Searching/Regex/Sandboxing
|
All of them |
Q8. Who uses / or - within the |u= or |ul= parameter?
|
{{tlre|ul? *=*[^|}]+(\/|-)}} → Testing -) on this page
|
A. 1,3,4,5,6 and 7.
|
Q9. Where is Val used in the template namespace with no units
parameters, that is no u or ul, or up, or upl. |
{{tlre|val|~(u[lp].)}} → Testing ~(u[lp].) on this page
|
A. In the 15 or so articles listed.
|
Q10 | {{tlre|[^.0-9][0-9]\|-\| prefix::|Who converts single digits using a dash?}} → Who converts single digits using a dash?
|
A Around 11. |
In Q2, notice how the MediaWiki software ignores the spaces around parameters, but how in Q4 the same MediaWiki software processes the spaces inside parameters. Q2 might have been solved with a plain insource:val fmt search because "fmt" and "val" are whole words, and fmt is rarely seen apart from inside Val. How about hastemplate:val insource:fmt?
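Several of the answers in the table can be cross-checked offline before any search is run. The sketch below is in Python and uses its re module only as a stand-in: the wiki's search engine has its own regex dialect, and operators such as the ~ used in Q9 have no direct equivalent in re, so only the simpler patterns from Q2, Q3, Q5, Q6 and Q8 are included. The names TARGETS, PATTERNS and matching_targets are hypothetical.

```python
import re

# The seven numbered template calls from this section's wikitext.
TARGETS = [
    "{{val|1|ul=ft/s|fmt = commas}}",
    "{{val|2|u=ft2}}",
    "{{val|3|u=m/s| fmt =commas }}",
    "{{val|4|u=m*s-2}}",
    "{{val|5|u=ft.s-2}}",
    "{{val|6|u=C/J}}",
    "{{val|7|ul=J/C}}",
]

# Patterns copied from the table, written with a literal pipe instead of {{!}}.
PATTERNS = {
    "Q2": r"\{[Vv]al\|[^}]*fmt",
    "Q3": r"ul?=ft",
    "Q5": r"(ul? *=*ft|fmt *= *commas)",
    "Q6": r"ul? *=*(ft|m)",
    "Q8": r"ul? *=*[^|}]+(\/|-)",
}


def matching_targets(pattern):
    """Return the numbers (1 to 7) of the targets in which the pattern is found."""
    compiled = re.compile(pattern)
    return [
        number
        for number, text in enumerate(TARGETS, start=1)
        if compiled.search(text)
    ]


if __name__ == "__main__":
    for label, pattern in PATTERNS.items():
        print(label, matching_targets(pattern))
```

A local check like this costs the server nothing, but it is no substitute for the sandboxed search itself, because the two regex dialects can disagree.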
- ^ This is true even though the values of template arguments don't handle the pipe character or the equals character very well. The pipe character needs to be escaped using {{!}} unless it is in a wikilink. The equals sign needs to be escaped using {{=}}, unless you use the named argument
parameter name = value with = sign
to disambiguate the equals sign. (The software can never do it.)
- ^ The reasons to save a real sandbox to the database are either to transclude it or to search for it.
- ^ If you are concerned about leaving edit boxes vulnerable to unwanted saving, you can back over it in the browser and then run over the intervening history by choosing a different line of traversal by clicking any link.