Help:Searching/Regex/Sandboxing
Regular expressions are little computer programs, so regex searches must always be tested to achieve their potential precision and thoroughness. But only a few of these intensive searches are technically able to run at a time against the database.[1] A sandbox minimizes your footprint, and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that.
Both search links and regex require testing in a sandbox. Although developing a search link may target the entire wiki, a regexp should target as few pages as possible.
The order of filters and other items entered in the query does not determine the order in which they are applied. (The query is first optimized.) Filters include the parameters
- intitle:
- incategory:
- hastemplate:
- prefix:
- linksto:
- insource:
Filters also include any bare word search terms.
Fullpagename is namespace:pagename. Knowing this, you can adjust your prefix parameter. Although prefix starts by filtering down to one page, it can be broadened up to a whole namespace:, and it also accepts the beginning letter(s) of a set of pagenames if you want to reduce the namespace search domain. It doesn't accept All:.
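As a rough illustration of how the prefix domain grows, here is a minimal Python sketch. It assumes a deliberately simplified model in which a page is in the domain when its fullpagename starts with the prefix text; the real search engine's matching rules may differ. The helper name `in_prefix_domain` and the list of page titles are hypothetical, used only for illustration.

```python
def in_prefix_domain(fullpagename: str, prefix: str) -> bool:
    """Simplified model (an assumption): a page is in the search domain
    when its fullpagename (namespace:pagename) starts with the prefix."""
    if prefix.lower().startswith("all:"):
        # The help text states that prefix does not accept All:
        raise ValueError("prefix: does not accept All:")
    return fullpagename.startswith(prefix)


# Hypothetical sample titles, for illustration only.
pages = [
    "Help:Searching/Regex/Sandboxing",
    "Help:Searching/Regex",
    "Help:Template",
    "Template:Val",
]

# From one page, to the first letters of a set of pagenames, to a namespace.
for prefix in ("Help:Searching/Regex/Sandboxing", "Help:Searching", "Help:"):
    domain = [p for p in pages if in_prefix_domain(p, prefix)]
    print(prefix, "->", len(domain), "page(s)")
```

Each shorter prefix admits more pages, which is why development starts from the full page name and only widens later.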
Regex sandboxing uses an ad hoc sandbox made from any already-saved, real-content page containing the target data. The query is developed there by searching the page itself with prefix:{{FULLPAGENAME}}, using Show preview to test a temporary search link template instead of the Wikipedia search box.[2] The search results page can be used to modify the query further.[3]
Use of a sandbox enables the smallest possible footprint by using filters to limit the search domain. The first domain it targets is its own page in an ad hoc sandbox. Once your regexp pattern is honed, you can safely increase the search domain, but a regex search is best run with other items on the line, not alone. When using insource:/regex/, always use a filter, even on a polished regexp.
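The rule of never running insource:/regex/ without a filter can be sketched as a small Python helper that refuses to build an unfiltered query. The function name `sandboxed_query` is hypothetical and not part of MediaWiki; placing the prefix: term last simply mirrors the queries shown on this page.

```python
def sandboxed_query(regex: str, fullpagename: str, *filters: str) -> str:
    """Compose a regex query that is always limited by a prefix: filter.

    Hypothetical helper for illustration; not a MediaWiki API.
    """
    if not fullpagename:
        raise ValueError("refusing to build an unfiltered regex query")
    # Extra filters (for example "hastemplate:val") go between the regex
    # and the prefix: term, which is kept last as in this page's examples.
    terms = [f"insource:/{regex}/", *filters, f"prefix:{fullpagename}"]
    return " ".join(terms)


page = "Help:Searching/Regex/Sandboxing"
print(sandboxed_query(r"ul?=ft.*commas", page))
print(sandboxed_query(r"fmt", page, "hastemplate:val"))
```

Widening the search later means changing only the `fullpagename` argument, for example to a namespace such as `Help:`, while the tested regex stays untouched.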
Sandboxing procedure
In the search box, entering an equals sign, a pipe character, and spaces is a straightforward matter. Even so, it is easiest to use a search link template ({{sl}}, {{slre}}, or {{tlre}}) on the page with the sample data, so that you can focus on the sample data and on typing in the regexp (with template character caveats). It is characteristic of regex development that you must have a sample of the data you target so that you can study it while writing its pattern.
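The template character caveats can be illustrated with a short Python sketch. It assumes two conventions seen in this page's examples: a pipe inside the regexp is written as {{!}}, and a regexp containing an equals sign is passed through the named parameter pattern. The helper name `regex_template_call` is hypothetical, and the default template name is taken from the examples here.

```python
def regex_template_call(regex: str, template: str = "regex") -> str:
    """Wrap a regexp in a search link template call (illustrative only)."""
    # A bare pipe would be read as a parameter separator, so it is
    # written as {{!}} instead.
    safe = regex.replace("|", "{{!}}")
    # An equals sign would turn the text before it into a parameter name,
    # so such a regexp is passed explicitly as pattern = ...
    if "=" in safe:
        return "{{" + template + "|pattern = " + safe + "}}"
    return "{{" + template + "|" + safe + "}}"


print(regex_template_call(r"\{[Vv]al\|[^}]*fmt"))
print(regex_template_call(r"(ul? *=*ft|fmt *= *commas)"))
```

The sketch does not handle every case, for instance a regexp that ends in a closing brace next to the template's own closing braces.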
Regexp searches are restricted on the server, so the sandbox method always begins the development of a new query by running it on only one page: prefix:{{FULLPAGENAME}}.
The procedure here is an iterative, read-evaluate-modify cycle.
- Navigate to a page with the wikitext instances you are interested in mining. Or create one yourself, and save it to the database so the query will find it.
- Open the wikitext, and enter a search link with the specific insource:/regexp/ and prefix:fullpagename directives.
- Show preview, and check the query as displayed by the search link. Activate the search link. Note the bold text in each match.
- Go back in your browser. Modify the regexp. Cycle. (Or don't go back, in which case you may need to reset the query at the search box.)
- Expand the search domain, and test the accuracy of those results. You can trim or expand the number of the results using prefix:.
Caveat emptor: if you change the target for immediate retesting, you'll have to save and purge the page, but not if you just change the regexp.
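Activating a search link amounts to opening a search results URL that carries the query. The Python sketch below builds such a URL under the assumption that the wiki's search page is reached through index.php with title=Special:Search and a search parameter; the helper name `search_url` and the default base address are illustrative, so adjust them for your wiki.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(query: str,
               base: str = "https://en.wikipedia.org/w/index.php") -> str:
    """Return a search results URL for the query (assumed URL shape)."""
    params = {"title": "Special:Search", "search": query, "fulltext": "1"}
    # urlencode percent-encodes characters such as ?, / and : in the query.
    return base + "?" + urlencode(params)


query = "insource:/ul?=ft/ prefix:Help:Searching/Regex/Sandboxing"
url = search_url(query)
print(url)

# Round trip: decoding the URL gives back exactly the query that was typed.
assert parse_qs(urlsplit(url).query)["search"] == [query]
```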
Examples
As an ad hoc sandbox, you can open the wikitext of a section like this one, already saved in the database, modify some of the patterns in the regex search link template calls here, do a Show preview, and see what matches when you click on the newly formed "search the database" link. All of this is quite safe and changes nothing in the database.
The template calls that produce "1 ft/s, 2 sq ft, 3 m/s, 4 m*s-2, 5 ft.s-2, 6 °C/J, and 7 J/C" appear in the wikitext of this section like this:
- {{val|1|ul=ft/s|fmt = commas}}
- {{val|2|u=ft2}}
- {{val|3|u=m/s| fmt =commas }}
- {{val|4|u=m*s-2}}
- {{val|5|u=ft.s-2}}
- {{val|6|u=C/J}}
- {{val|7|ul=J/C}} → 7 J/C
Note how the above targets are |numbered|, then click on these links.
Query | Search link | Answer |
---|---|---|
Q1. Using {{search link}}, does this page employ template Val? | {{search link|hastemplate: Val}} → hastemplate: Val | A. Yes, because its fullpagename shows on the search results. |
Q2. Using {{search link}} correctly, does this page use Val's fmt parameter? | {{search link|insource:/\{[Vv]al\{{!}}[^}]*fmt/ prefix:{{FULLPAGENAME}}}} → insource:/\{[Vv]al\|[^}]*fmt/ prefix:Help:Searching/Regex/Sandboxing | A2.1. Look for 1 and 3 in the search results in bold text. (Uses an appropriate filter.) |
Using {{regex}} instead... | {{regexp|\{[Vv]al\{{!}}[^}]*fmt}} → insource:/\{[Vv]al\|[^}]*fmt/ prefix:Help:Searching/Regex/Sandboxing | A2.2. Less typing than {{search link}}. |
Using {{tlre}} instead... | {{tlre|Val|pattern=fmt}} → | A2.3. Easiest for templates. |
Q3. Who uses u=ft OR ul=ft? (one letter differs) | {{regex|pattern = ul?=ft}} → | A. Look for 1, 2, and 5 in bold text. |
Using {{tlre}}... | {{tlre|val|pattern = ul?=ft}} → | |
Q4. AND of these, who also uses fmt=commas after that? | {{regex|pattern = ul?=ft.*commas}} → insource:/ul?=ft.*commas/ prefix:Help:Searching/Regex/Sandboxing | A. No context shown, but article title is shown. A half a Bug? |
Who has one space before the word "commas"? | {{regex|. commas}} → insource:/. commas/ prefix:Help:Searching/Regex/Sandboxing | A. 1 but not 3. |
Q5. Who uses either u or ul with "ft" OR uses "fmt=commas"? Allow for all the possible usage of spaces between the named arguments and their values. | {{regex|(ul? *=*ft|fmt *= *commas)}} → insource:/regexp/ prefix:Help:Searching/Regex/Sandboxing | A. 1, 2, 3, and 5. |
Q6. Who uses ft or m, in |u= or |ul=? | {{regex|ul? *=*(ft|m)}} → insource:/m)/ prefix:Help:Searching/Regex/Sandboxing | A. 1, 2, 3, 4, and 5. |
Q7. Who uses . or * in the unit code? | {{tlre|val|pattern=u *= *(\.|\*)/}} → Testing u *= *(\. on this page | A. 4 and 5. |
Who uses a pipe? | {{regex|\|}} → insource:/\/ prefix:Help:Searching/Regex/Sandboxing | A. All of them. |
Q8. Who uses / or - within the |u= or |ul= parameter? | {{tlre|ul? *=*[^|}]+(\/|-)}} → Testing -) on this page | A. 1, 3, 4, 5, 6, and 7. |
Q9. Where is Val used in the template namespace with no units parameters, that is, no u or ul, or up, or upl? | {{tlre|val|~(u[lp].)}} → Testing ~(u[lp].) on this page | A. In the 15 or so articles listed. |
Q10. | {{tlre|[^.0-9][0-9]\|-\| prefix::|Who converts single digits using a dash?}} → Who converts single digits using a dash? | A. Around 11. |
In Q2, notice how the MediaWiki software ignores the spaces around parameters, but how in Q4 the same MediaWiki software processes the spaces inside parameters. Q2 might have been solved with a plain insource:val fmt search because "fmt" and "val" are whole words, and fmt is rarely seen apart from inside Val. How about hastemplate:val insource:fmt?
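Before running anything on the server, the answers can be sanity-checked offline against the seven numbered Val targets. The sketch below uses Python's re module, which is not the same regex engine as the wiki search; it assumes the patterns from Q3, Q4, Q5, Q6, and Q8 use only syntax that behaves the same in both. The helper name `matching_targets` is hypothetical.

```python
import re

# The seven numbered template calls from this section's sample data.
TARGETS = {
    1: "{{val|1|ul=ft/s|fmt = commas}}",
    2: "{{val|2|u=ft2}}",
    3: "{{val|3|u=m/s| fmt =commas }}",
    4: "{{val|4|u=m*s-2}}",
    5: "{{val|5|u=ft.s-2}}",
    6: "{{val|6|u=C/J}}",
    7: "{{val|7|ul=J/C}}",
}


def matching_targets(pattern: str) -> list[int]:
    """Return the numbers of the targets in which the pattern is found."""
    rx = re.compile(pattern)
    return [n for n, text in TARGETS.items() if rx.search(text)]


print(matching_targets(r"ul?=ft"))                      # [1, 2, 5]
print(matching_targets(r"ul?=ft.*commas"))              # [1]
print(matching_targets(r". commas"))                    # [1]
print(matching_targets(r"(ul? *=*ft|fmt *= *commas)"))  # [1, 2, 3, 5]
print(matching_targets(r"ul? *=*(ft|m)"))               # [1, 2, 3, 4, 5]
print(matching_targets(r"ul? *=*[^|}]+(\/|-)"))         # [1, 3, 4, 5, 6, 7]
```

A local check like this only narrows down the pattern; the live search still has to be tested on the sandbox page, because the server matches against the whole page source, not just these seven lines.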
- [1] See, for example, Searches that kill
- [2] The reasons to save a real sandbox to the database are either to transclude it or to search for it.
- [3] If you are concerned about leaving an ad hoc sandbox vulnerable to unintended changes, you can go backwards in your browser history all the way behind it, then start a new history. Clicking any link will then remove access to the intervening history by choosing a different line of traversal.