Wikipedia:Request a query/Tips and schemas
Tips
If you don't know SQL or you are rusty, SQLBolt has a great set of interactive exercises.
Speed
enwiki is a massive database. For example, the revisions table contains one billion rows. SQL queries that would execute in a fraction of a second on a smaller database may take over an hour and then time out when run on Quarry and enwiki. Here are some tips to make your SQL queries as efficient as possible.
- Test your query on a smaller wiki, such as simplewiki (612,000 articles) or sawiki (11,000 articles).
- Use a LIMIT.
- Omit ORDER BY if you don't need it.
- Be aware that the WHERE, GROUP BY, and ORDER BY fields are the most likely to slow down a query. Try to use primary key fields for these whenever possible. For example, WHERE log_id > 90267267 will be faster than WHERE log_timestamp > 20200415012345, because log_id is a primary key field.
- SQL optimizer tool - Run https://sql-optimizer.toolforge.org/ on your query. It will suggest special tables that you can use to speed up your query.
- Alternative views - The replica databases (used by Quarry) have copies of certain tables that are exactly the same, except they have a different primary key. Using one of these tables and properly using its corresponding primary key can speed up queries. See the "Alternative views" section below. Example: using logging_userindex instead of logging.
- Sub-views - The replica databases (used by Quarry) have copies of certain tables that omit unnecessary rows. Look into using these tables instead of the main table when joining certain tables, to speed up your queries. Example: When joining actor and recentchanges, use actor_recentchanges.
- If you only need to check recent page revisions (last 30 days), use the recentchanges table instead of the revision table.
- If a query scans millions of rows, consider breaking it into multiple queries. Query time versus number of rows increases exponentially, not linearly.
Other
- Timestamps - Be aware that date/time fields are in an unusual format. They are BINARY(14) integers in the format YYYYMMDDHHMMSS. You can use code such as WHERE rev.rev_timestamp > DATE_FORMAT(DATE_ADD(NOW(), INTERVAL -1 YEAR), "%Y%m%d%H%i%s") to do conversions.
- Be aware that there is a table called actor, with fields actor_id and actor_name (username). Instead of JOINing the user table, you will often need to JOIN the actor table instead.
- Timeouts - A query will time out after awhile. The timeout is often 30 minutes, but sometimes it will go for longer. You may want to hit "Submit query", and then revisit the page later.
- When selecting page_title, you will usually also want to select page_namespace, so you can figure out the exact page name. page_title does not include namespace prefixes such as Talk:, User:, etc.
Databases
Databases follow the pattern: [languageCode][typeOfWiki]
- Language codes
- en - English
- See the rest of the languages codes here, in the "Wiki" column of the table.
- Types of Wikis
- wiki - Wikipedia (encyclopedia)
- wikibooks
- wikinews
- wikiquote
- wikisource
- wikiversity
- wikivoyage
- wiktionary
- Wikis that aren't encyclopedias but still use MediaWiki software tend to end in "wiki":
- commonswiki
- mediawikiwiki
- metawiki
- wikidatawiki
Tables
https://www.mediawiki.org/w/index.php?title=Manual:Database_layout/diagram&action=render
Accessible tables
- abuse_filter
- abuse_filter_action
- abuse_filter_history
- abuse_filter_log
- actor
- actor_archive
- actor_filearchive
- actor_image
- actor_ipblocks
- actor_logging
- actor_mat
- actor_oldimage
- actor_recentchanges
- actor_revision
- actor_user
- archive
- babel
- category
- categorylinks
- change_tag
- change_tag_def
- comment
- comment_filearchive
- comment_image
- comment_ipblocks
- comment_logging
- comment_oldimage
- comment_protected_titles
- comment_recentchanges
- comment_revision
- content
- content_models
- externallinks
- filearchive
- flaggedimages
- flaggedpage_config
- flaggedpage_pending
- flaggedpages
- flaggedrevs
- flaggedrevs_promote
- flaggedrevs_statistics
- flaggedrevs_stats
- flaggedrevs_stats2
- flaggedrevs_tracking
- flaggedtemplates
- geo_tags
- global_block_whitelist
- image
- imagelinks
- ip_changes
- ipblocks
- ipblocks_restrictions
- iwlinks
- l10n_cache
- langlinks
- linter
- logging
- module_deps
- oldimage
- ores_classification
- ores_model
- page
- page_assessments
- page_assessments_projects
- page_props
- page_restrictions
- pagelinks
- pagetriage_log
- pagetriage_page
- pagetriage_page_tags
- pagetriage_tags
- pif_edits
- protected_titles
- recentchanges
- redirect
- revision
- site_identifiers
- site_stats
- sites
- slot_roles
- slots
- templatelinks
- transcode
- updatelog
- user
- user_former_groups
- user_groups
- user_properties
- user_properties_anon
- wbc_entity_usage
- wikilove_log
Hidden tables
Not all tables are available in Quarry. Certain tables are intentionally omitted. This is an incomplete list.
- bot_passwords
- interwiki
- job
- log_search
- objectcache
- page_props
- page_restrictions
- querycache
- querycachetwo
- querycache_info
- searchindex
- text
- uploadstash
- user_newtalk
- watchlist
- watchlist_expiry
Hidden fields
Fields in certain tables are also unavailable.
- user_properties
Alternative views
These tables have the same data as their parent tables, but different primary keys, which makes using them faster for some types of SQL queries.
- archive_userindex
- filearchive_userindex
- ipblocks_ipindex
- logging_logindex
- logging_userindex
- oldimage_userindex
- recentchanges_userindex
- revision_userindex
Sub-views
These tables have only the data needed to JOIN with another specific table. The rest of the data has been deleted. This is for faster querying when JOINing with those particular tables.
- actor_filearchive
- actor_image
- actor_ipblocks
- actor_logging
- actor_oldimage
- actor_protected_titles
- actor_recentchanges
- actor_revision
- comment_filearchive
- comment_image
- comment_ipblocks
- comment_logging
- comment_oldimage
- comment_protected_titles
- comment_recentchanges
- comment_revision
Tables left over from "actor" migration
"compat" might stand for "compatibility". These tables are compatible with the pre-actor versions of MediaWiki, and will eventually be removed.
- archive_compat
- filearchive_compat
- image_compat
- ipblocks_compat
- logging_compat
- oldimage_compat
- page_compat
- protected_titles_compat
- recentchanges_compat
- revision_actor_temp
- revision_compat