Wikipedia data dump usage guide

This dump of data from Wikipedia contains all of the article content from the main namespace as it stood on the date the dump was taken. It does not contain any of the additional content hosted on Wikipedia (user pages, edit histories, talk pages and so on), nor any of the images and other multimedia content hosted on Wikimedia Commons. It is intended to provide a copy of Wikipedia's textual encyclopedic content for offline and archival use, e.g. in institutions where heavy use of Wikipedia justifies local hosting of the content, or in situations where internet access is poor or non-existent. It is available for download and use by anybody, subject to the Wikipedia Terms of Use.

Note: Non-English versions of this usage guide may have been machine-translated and contain errors or abnormal language. Examples given in this usage guide may refer to data from older Wikipedia data dumps, which may not match the data in this dump; following the examples with the data in this dump may therefore yield unexpected results.

People downloading these data dumps are asked to please consider seeding the torrents wherever it is lawful and possible for them to do so, and to download future dumps via BitTorrent whenever they are available through that avenue. This helps to preserve Wikipedia content and keep it readily accessible to others, facilitates faster downloads, and eases the data service burden and cost for the Wikimedia Foundation and the mirror sites (primarily academic institutions) presently undertaking the lion's share of that burden. For anybody who runs a torrent appliance as a matter of routine, this burden can be shouldered with nothing more than setting Wikipedia torrents to seed indefinitely, and seeding them in the same way as one would any other torrent.

A list of available torrents for the Wikipedia dumps in most European languages can be found Here and Here.

ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking · Contact/Talk Page · Contributions) September 2025CE

Data dump contents

This data dump contains the following:

  • The original md5 checksum file for the dump process, obtained directly from Wikipedia,
  • The original sha1 checksum file for the dump process, obtained directly from Wikipedia,
  • This usage guide, null-padded to place the end of the index file on a 2 MiB cluster boundary, for technical reasons related to data handling in BitTorrent and the use of web seeds for torrents having multiple files,
  • The index file LLwiki-YYYYMMDD-pages-articles-multistream-index.txt.bz2, and
  • The data file LLwiki-YYYYMMDD-pages-articles-multistream.xml.bz2

In the above, LL refers to the ISO 639 language code for the language of the dump, and YYYYMMDD refers to the date the dump was taken, in four-digit year, two-digit month, two-digit day format. For example, iswiki-20250901-pages-articles-multistream.xml.bz2 is the data file for the Icelandic (is) dump taken on 1 September 2025.

Straightforward approach for general users

The most straightforward way to use the data dump is through a dedicated offline reader application, such as Kiwix (mentioned again in the implementation considerations below).

Note that such reader applications are separate projects that are not endorsed by Wikipedia or the Wikimedia Foundation.

Web browser extensions compatible with Wikimedia data dumps are also available, and generally present the dump content in a manner that mirrors Wikipedia itself, minus images and other media. This may be the most straightforward approach for people wishing to keep an offline copy of Wikipedia on a single computer with the Firefox or Chrome/Chromium browsers installed.

Users wishing to develop their own solutions can find more technical information in the following sections.

Data dump layout - A quick primer

The data file is a large XML database, heavily compressed using bzip2 in multistream mode. In a nutshell, this means that the compression algorithm is reset periodically during compression, allowing small parts of the data to be extracted quickly without decompressing the entire file; the dump therefore does not need to be decompressed before use. In most use cases the index file (which is also compressed using bzip2) is essential and should be kept alongside the data file to which it relates.

The index file is a (compressed) text file containing an entry for every article stored in the data file, one entry per line, expressed as follows:

  • [Decompression start offset]:[Article number]:[Article name]

And an example of this (taken from the index file for the French Wikipédia dump of 2025-09-01)...

  • 1257148823:868143:Borgia (Italie)

Which indicates that the French-language article about Borgia (Italy), article number 868,143, can be retrieved by seeking to byte offset 1,257,148,823 in the French Wikipédia dump file, then decompressing data from that point until article 868,143 appears in the decompressed output stream, where it can be handled as desired.

It should be noted that, unlike the data file, the index file may not employ multistream compression, and therefore must be decompressed (generally on-the-fly) from the start when searching for article locations in the data file. It should also be noted that article numbers are not sequential, articles are not necessarily listed in alphabetical order, and one should expect to find (and should develop software to sanely accommodate) articles and text with any number or name (including names with characters from extended Unicode ranges, e.g. Telugu) occurring at any point in either file.

Finally, it should be noted that the multistream method compresses a fixed count of articles (nominally 100) per block of compressed data, not a fixed length of compressed data. This means that the compressed sections of the data file are of varying lengths and, crucially, that the data file cannot be treated like a disk image in which data is arranged in sectors of even length: it is not possible to compute and jump directly to the position where a specific article is stored, so the index file must be consulted instead.
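
For readers planning to write their own tooling, the short Python sketch below illustrates the index layout just described by scanning the compressed index file for a given title fragment and printing the byte offset, article number and title of each match. The file name and the search term are placeholders, to be substituted for whichever dump is actually in use.

import bz2

# Placeholder name - substitute the index file from the dump in use.
INDEX_FILE = "LLwiki-YYYYMMDD-pages-articles-multistream-index.txt.bz2"

def find_in_index(title_fragment):
    """Yield (offset, article number, title) for every index entry whose
    title contains title_fragment. Entries are offset:number:title, one
    per line; titles may themselves contain colons, hence the split limit."""
    with bz2.open(INDEX_FILE, mode="rt", encoding="utf-8") as index:
        for line in index:
            offset, number, title = line.rstrip("\n").split(":", 2)
            if title_fragment.lower() in title.lower():
                yield int(offset), int(number), title

if __name__ == "__main__":
    # Placeholder search term, comparable to the grep examples below.
    for offset, number, title in find_in_index("krone"):
        print(offset, number, title)

Note that this still decompresses the index from the beginning for every search, which is acceptable for occasional look-ups only.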

Finding and extracting desired article content

On Linux/Mac computers with bzip2 and grep available, simple and rudimentary searches can be performed from the terminal. This is demonstrated below by searching the Icelandic-language dump from 2025-09-01 for the word krone (the common Nordic word for crown, also used as the name of several Nordic currencies), looking for the entry about the Danish krone (the currency):

user@host:~/wikipedia$ bzip2 -dc ./iswiki-20250901-pages-articles-multistream-index.txt.bz2 | grep -i krone

2964631:4057:Leopold Kronecker
21385214:58494:Kronecker δ
37627242:107628:Norsk krone
37627242:107630:Krone
41364632:121564:Kronecker táknið
41454632:122120:Dansk krone
41962563:124165:Die im Reichsrat vertretenen Königreiche und Länder und die Länder der heiligen ungarischen Stephanskrone

And then selectively decompressing the relevant part of the iswiki-20250901-pages-articles-multistream.xml.bz2 file to obtain the desired article by its number (122120) using grep (also passed through cat -n to add line numbers to the terminal output for ease of reference):

user@host:~/wikipedia$ dd if=./iswiki-20250901-pages-articles-multistream.xml.bz2 bs=1 skip=41454632 count=1048576 2>>/dev/null | bzip2 -dc | cat -n | grep -iC10 122120

1016	      <origin>1454975</origin>
1017	      <model>wikitext</model>
1018	      <format>text/x-wiki</format>
1019	      <text bytes="26" sha1="5ckksk6qexmqasl4jw0hl2mx96ekcvx" xml:space="preserve">#Redirect [[Júanveldið]]</text>
1020	      <sha1>5ckksk6qexmqasl4jw0hl2mx96ekcvx</sha1>
1021	    </revision>
1022	  </page>
1023	  <page>
1024	    <title>Dansk krone</title>
1025	    <ns>0</ns>
1026	    <id>122120</id>
1027	    <redirect title="Dönsk króna" />
1028	    <revision>
1029	      <id>1454976</id>
1030	      <timestamp>2014-03-29T22:16:33Z</timestamp>
1031	      <contributor>
1032	        <username>Werddemer</username>
1033	        <id>35666</id>
1034	      </contributor>
1035	      <comment>Tilvísun á [[Dönsk króna]]</comment>
1036	      <origin>1454976</origin>

Manipulating the previous command somewhat, we can see that, when skipping the first 41,454,632 bytes and decompressing from that point (remember: the offset value in the index is the number of bytes to be skipped over), the article begins at line 1023 and ends at line 1042 of the decompressed output, allowing us to effect a cleaner extract with the following command:

user@host:~/wikipedia$ dd if=./iswiki-20250901-pages-articles-multistream.xml.bz2 bs=1 skip=41454632 count=1048576 2>>/dev/null | bzip2 -dc | head --lines=1042 | tail --lines=$(( 1042 - 1022 ))

  <page>
    <title>Dansk krone</title>
    <ns>0</ns>
    <id>122120</id>
    <redirect title="Dönsk króna" />
    <revision>
      <id>1454976</id>
      <timestamp>2014-03-29T22:16:33Z</timestamp>
      <contributor>
        <username>Werddemer</username>
        <id>35666</id>
      </contributor>
      <comment>Tilvísun á [[Dönsk króna]]</comment>
      <origin>1454976</origin>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="27" sha1="sqgdn80at5yg6ud1fume8j6niha9azj" xml:space="preserve">#Redirect [[Dönsk króna]]</text>
      <sha1>sqgdn80at5yg6ud1fume8j6niha9azj</sha1>
    </revision>
  </page>

This presents the XML source for a redirect to Dönsk króna (the equivalent title in Icelandic), along with the date/time of the latest revision, the user who created that revision, and some other useful information such as data integrity hashes (SHA-1 values stored in a base-36 encoding). The full article text (in MediaWiki markup) is held in the <text> element, which will be the primary focus in the majority of use cases.
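
For programmatic processing of a page extracted in this way, the standard Python XML parser can pull out the title and the wikitext. The sketch below is a minimal illustration only, using a truncated copy of the page shown above as a literal string; it assumes the <page> block has been extracted on its own, without the enclosing <mediawiki> root element, so no XML namespace handling is required.

import xml.etree.ElementTree as ET

# A single <page> block as extracted above, truncated here for brevity.
PAGE_XML = """<page>
  <title>Dansk krone</title>
  <ns>0</ns>
  <id>122120</id>
  <revision>
    <id>1454976</id>
    <text bytes="27" xml:space="preserve">#Redirect [[Dönsk króna]]</text>
  </revision>
</page>"""

page = ET.fromstring(PAGE_XML)
print(page.findtext("title"))            # article title
print(page.findtext("id"))               # article number
print(page.findtext("revision/text"))    # the MediaWiki markup of the article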

Of course, this approach is only appropriate for very occasional look-ups where manual searches are reasonable, e.g. when the Wikipedia dumps are employed as a fall-back for internet connection failure, or in situations where online access is unreasonably costly. However, following the example given above with comparable searches in this dump will help users to understand the process for extracting article data from Wikipedia dumps, and to consider their own solutions for automating this process if they wish to.

In closing, it should be noted that the process outlined above will prove slow and resource-intensive with the larger dumps (English, German, French) and on lower-resource equipment. In particular, the use of grep to search decompressed output from the index file means that the entire index file is decompressed through memory every time a search is made.
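
As a starting point for automating the process, the following Python sketch seeks to a given offset in the data file, decompresses the single bzip2 stream found there, and returns the <page> block containing the given article number. The file name and the example values are placeholders taken from the Icelandic example above; the page matching is a naive string search kept short for illustration, and a real implementation would use a proper XML parser.

import bz2

# Placeholder name - substitute the data file from the dump in use.
DATA_FILE = "LLwiki-YYYYMMDD-pages-articles-multistream.xml.bz2"

def extract_page(offset, article_number):
    """Decompress one bzip2 stream starting at 'offset' bytes into the data
    file and return the <page>...</page> block containing article_number."""
    decompressor = bz2.BZ2Decompressor()
    chunks = []
    with open(DATA_FILE, "rb") as data:
        data.seek(offset)                    # skip 'offset' bytes, as with dd
        while not decompressor.eof:          # stop at the end of this stream
            compressed = data.read(65536)
            if not compressed:
                break
            chunks.append(decompressor.decompress(compressed))
    xml = b"".join(chunks).decode("utf-8")
    # Naive match on the article number within each <page> block.
    for block in xml.split("</page>"):
        if "<id>" + str(article_number) + "</id>" in block:
            return block[block.find("<page>"):] + "</page>"
    return None

if __name__ == "__main__":
    # Offset and article number for "Dansk krone" in the Icelandic example.
    page = extract_page(41454632, 122120)
    if page is not None:
        print(page)

Like the grep approach, any search of the index for a title still requires decompressing the index file in full; the re-indexing suggestion in the next section addresses this.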

Implementation considerations

How best to implement an offline copy of the Wikipedia database will vary widely depending on scope and intended use cases, meaning there is no single or "right" way to implement an offline or local Wikipedia repository. However, a good starting point is to consider the scope, user base and devices that the repository will be intended to serve:

  • For a repository serving a single device (a dedicated encyclopedia PC, for example): storing the dump and installing a Wikipedia-capable reader application on the PC will be the quickest and easiest approach.
  • If you are implementing a repository for yourself or for family use, and it is possible to store the dump(s) on a NAS accessible by every device that will be using it: a single copy of the dump stored on the NAS and made available through a network share to readers (Kiwix etc.) installed on each device is likely to be the easiest option, though these will usually only be accessible when the device is on the same network as the NAS.
  • More technical users might wish to expand on the previous suggestion by implementing a web server with a (sub)domain/dynamic DNS to extend accessibility of that shared resource to family/friends through the Internet. Such implementations should give careful consideration to security and to whether client authentication should be employed (in particular, never implement a web server on a device where confidential information is stored), as web servers/shares/IoT devices with read-only content are still vulnerable to DDoS and other forms of cyberattack.
  • Implementations for small organisations (schools, hobbyist and academic groups) may wish to consider a web-focussed reader application installed on a server to render the content as HTML, making access available to any unmodified/stock device with a web browser. For outreach projects such an implementation could be made portable through the use of a small form-factor PC (Intel NUC or similar) or a modified mobile device, plus a connected wireless router, possibly battery driven if required.
  • Larger implementations, or those where access to the dumps is going to be frequent/intensive, are likely to need more optimised approaches to data access. In particular, content rendering and service is almost certainly going to need to be placed on the server side (top tip: rendering output to HTML 4.01 Transitional makes the content accessible to any web browser produced after 1998, eliminating any need for dedicated client applications), and the dump will need to be stored on high-performance equipment. Consideration should be given to decompressing the dump, re-indexing its contents in a suitable database (MySQL or similar; see the sketch following this list) to eliminate the decompression stage, and having these items served from high-performance enterprise SSDs or similar low-latency storage.
  • Implementations approaching commercial or public-service query levels may benefit from decompressing the database and storing individual articles uncompressed in nearline filesystems/containers employing sector lengths large enough to accommodate most articles in a single sector each. This eliminates decompression and significantly reduces filesystem overheads at the cost of some storage inefficiency.
  • For off-grid implementations in remote locations: the dumps can be stored and transported on most forms of digital media, and will be accessible on most computers built after the year 2000 with an appropriate and compatible Linux installation, either capable of providing an X display and a web browser, or employing a terminal-level reader application. It might be easiest to implement such installations as live USB keys/disk images with the OS and dump mounted read-only, reducing user and incremental damage to the OS and supporting operational longevity in environments where technical support is not readily available or is a considerable distance away. Depending on the implementation and the remoteness/accessibility of the location, it may be wise to consider support for dump updates delivered on (micro)SD cards transported by carrier pigeon or similar unmanned methods, providing updates to the site without necessitating human transport.
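
As a rough illustration of the re-indexing suggestion above, the sketch below loads the index file into an SQLite database (chosen here only because it is self-contained; MySQL or any other database would serve equally), so that title look-ups no longer require decompressing the index. The file names and the example title are placeholders.

import bz2
import sqlite3

# Placeholder names - substitute those from the dump in use.
INDEX_FILE = "LLwiki-YYYYMMDD-pages-articles-multistream-index.txt.bz2"
DB_FILE = "wiki-index.sqlite3"

def build_index_db():
    """Load every index entry into an SQLite table keyed on the title."""
    db = sqlite3.connect(DB_FILE)
    db.execute("CREATE TABLE IF NOT EXISTS articles ("
               "title TEXT PRIMARY KEY, number INTEGER, byte_offset INTEGER)")
    with bz2.open(INDEX_FILE, mode="rt", encoding="utf-8") as index:
        rows = ((title, int(number), int(offset))
                for offset, number, title in
                (line.rstrip("\n").split(":", 2) for line in index))
        db.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
    db.commit()
    db.close()

def lookup(title):
    """Return (byte_offset, number) for an exact title, or None if absent."""
    db = sqlite3.connect(DB_FILE)
    row = db.execute("SELECT byte_offset, number FROM articles WHERE title = ?",
                     (title,)).fetchone()
    db.close()
    return row

if __name__ == "__main__":
    build_index_db()
    print(lookup("Dansk krone"))   # example title from the Icelandic dump above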

v1.01 en-GB/US: September 2025

Goin' loco, en Reino Unido...