Manual:importDump.php

From mediawiki.org
Recommended method for general use, but slow for very big data sets. See "Importing English Wikipedia or other large wikis", below.

importDump.php is a maintenance script that imports XML dump files into the current wiki. It reads pages from an XML file, as produced by Special:Export or dumpBackup.php, and saves them into the current wiki. Like MediaWiki's other maintenance scripts, it is located in the maintenance folder of your MediaWiki installation.

Description of operation

The script reports progress in 100-page increments (by default), giving the number of pages and revisions imported per second for each increment, so you can monitor its activity and see that it hasn't hung. It can take 30 seconds or more between increments.

The script is robust: it skips previously loaded pages rather than overwriting them, so it can pick up where it left off fairly quickly after being interrupted and restarted. It still displays progress increments while skipping, and these go by quickly.

Pages are imported preserving the timestamp of each edit. As a result, if a page being imported is older than the existing page, the import only populates the page history; it does not replace the most recent revision with an older one. If that behavior is not desired, delete the existing pages before the import, or edit them afterwards to revert to the last imported revision found in the page history.

The wiki is usable during the import.

While the import is running, the wiki may look odd, with most templates missing and many red links, but this improves as the import proceeds.

Examples

If you have shell access, you can call importDump.php from within the maintenance folder like this (add paths as necessary):

php importDump.php --conf ../LocalSettings.php /path_to/dumpfile.xml.gz --username-prefix=""

or this:

php importDump.php < dumpfile.xml

where dumpfile.xml is the name of the XML dump file. If the file is compressed and has a .gz or .bz2 file extension (but not .tar.gz or .tar.bz2), it is decompressed automatically.

Due to a bug, it may be necessary to specify --username-prefix="" when importing files.

Afterwards, use importImages.php to import the images:

php importImages.php ../path_to/images

Running importDump.php can take quite a long time. For a large Wikipedia dump with millions of pages, it may take days, even on a fast server. Add --no-updates for a faster import. The information in Import about merging histories, etc. also applies.

After running importDump.php, you may want to run rebuildrecentchanges.php in order to update the content of your Special:Recentchanges page.

If you imported a dump with the --no-updates parameter, you'll need to run rebuildall.php to populate all the links, templates and categories.
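
The two follow-up paths described above can be sketched as shell functions. This is only a minimal sketch, assuming it is run from the maintenance folder; the function names and the PHP variable are illustrative and not part of MediaWiki. Setting PHP=echo prints each script name and its arguments instead of running anything.

```shell
# PHP can be overridden (for example PHP=echo) to preview the calls.
PHP="${PHP:-php}"

# Import with link table updates disabled, then rebuild links,
# templates and categories. $1 is the dump file.
import_without_updates() {
    "$PHP" importDump.php --no-updates "$1" &&
    "$PHP" rebuildall.php
}

# Normal import, followed by a refresh of Special:Recentchanges.
import_with_updates() {
    "$PHP" importDump.php "$1" &&
    "$PHP" rebuildrecentchanges.php
}
```

For example, `import_without_updates /path_to/dumpfile.xml.gz` runs the import and, only if it succeeds, rebuildall.php.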

Options

| Option/Parameter | Description |
|---|---|
| --report | Report position and speed after every n pages processed. |
| --namespaces | Import only the pages from namespaces belonging to the list of pipe-separated namespace names or namespace indexes. |
| --dry-run | Parse dump without actually importing pages. |
| --debug | Output extra verbose debug information. |
| --uploads | Process file upload data if included (experimental). |
| --no-updates | Disable link table updates. This is faster but leaves the wiki in an inconsistent state. Run rebuildall.php after the import to correct the link tables. |
| --image-base-path | Import files from a specified path. |
| --skip-to | Start from the given page number, by skipping the first n-1 pages. |
| --username-prefix | Adds a prefix to usernames. Due to a bug, it may be necessary to specify --username-prefix="" when importing files. |
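
As an illustration of combining options, the sketch below checks a dump with --dry-run before a real import. It assumes it is run from the maintenance folder; the function name check_dump, the PHP variable and the report interval of 1000 are illustrative choices, not values from this manual.

```shell
# PHP can be overridden (for example PHP=echo) to preview the call.
PHP="${PHP:-php}"

# Parse the dump without writing to the database,
# reporting position and speed every 1000 pages. $1 is the dump file.
check_dump() {
    "$PHP" importDump.php --dry-run --report=1000 "$1"
}
```

Calling `check_dump dumpfile.xml` should surface an unreadable or malformed dump before any pages are written.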

FAQ

How do I enable debug mode?

Use the command-line option --debug.

How do I make a dry run (no data added to the database)?

Use the command-line option --dry-run.

Error messages

Failed to open stream

If you get the error "failed to open stream: No such file or directory", make sure that the specified file exists and that PHP has access to it.

Error while running importImages

Typed

roots@hello:~# php importImages.php /maps gif bmp PNG JPG GIF BMP

Error

> PHP Deprecated:  Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/mcrypt.ini on line 1 in Unknown on line 0
> Could not open input file: importImages.php

Cause

Before running importImages.php, change to the maintenance folder, which contains the importImages.php script.

Error while running MAMP

DB connection error: No such file or directory (localhost)

Solution

Use specific database credentials:

$wgDBserver         = "localhost:/Applications/MAMP/tmp/mysql/mysql.sock";
$wgDBadminuser      = "XXXX";
$wgDBadminpassword  = "XXXX";

Importing English Wikipedia or other large wikis

For very large data sets, importDump.php may take a long time (days or weeks). There are alternative methods that can be much faster for full site restoration; see Manual:Importing XML dumps.

If you can't get the other methods to work, here are some pointers for reducing import time as much as possible when using importDump.php on large wikis.

Parallelizing the import

You could try running importDump.php multiple times simultaneously on the same dump, using the --skip-to option.

In an experiment on Ubuntu, the script was run (on a decompressed dump) multiple times in separate windows simultaneously using the --skip-to option. On a quad-core laptop computer, running the script in 4 windows sped up the import by a factor of 4. In the experiment, the --skip-to parameter was set 250,000 to 1,000,000 pages apart per instance, and the import was monitored (checked on from time to time) to stop each instance before it caught up to another.

Note: The experiment did not try running multiple instances without the --skip-to parameter, to avoid potential clashes. If you try this without --skip-to, or you let the instances catch up to each other, please post your findings here. In this experiment, 2 of the windows caught up with each other, and no error messages resulted. The instances of the script appeared to jump past each other.

Using --skip-to differs from normal operation in that progress increments are not displayed during the skip; only the (blinking) cursor is shown. After a few minutes, the increment reports begin to display.
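
The experiment above used separate windows. A rough equivalent with background jobs might look like the sketch below. It assumes it is run from the maintenance folder on a decompressed dump; the function name parallel_import, its arguments and the PHP variable are illustrative and not part of MediaWiki.

```shell
# PHP can be overridden (for example PHP=echo) to preview the calls.
PHP="${PHP:-php}"

# Start several instances on the same dump, each starting further in.
# $1 = dump file, $2 = number of instances, $3 = pages between start points.
parallel_import() {
    dump=$1; instances=$2; stride=$3
    i=0
    while [ "$i" -lt "$instances" ]; do
        if [ "$i" -eq 0 ]; then
            # The first instance starts at the beginning of the dump.
            "$PHP" importDump.php "$dump" &
        else
            # --skip-to=N starts at page N by skipping the first N-1 pages.
            "$PHP" importDump.php "--skip-to=$(( i * stride + 1 ))" "$dump" &
        fi
        i=$(( i + 1 ))
    done
    wait   # Return only when every instance has finished.
}
```

For example, `parallel_import dumpfile.xml 4 250000` starts four instances 250,000 pages apart. Unlike the monitored experiment, this sketch does not stop an instance when it reaches pages another instance has already imported, and the progress reports of all instances are interleaved in one terminal.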

Data segmentation

It may be a good idea to segment the data first with an XML splitter before importing it in parallel. Then run importDump.php on each segment in a separate window, which would avoid potential clashes. (If you successfully split a dump so that it works with this process, please post how you did it here.)

Import the most useful namespaces first

To speed up import of the most useful parts of the wiki, use the --namespaces parameter. Import templates first, because articles without working templates look awful. Then import articles. Alternatively, do both at the same time in multiple windows, as described above, starting the templates first: they import faster, so the article window(s) won't catch up.

Note: The main namespace doesn't have a prefix, so it must be specified using 0. The names "Main" and "Article" fail to run and return errors.

Once this is complete, run importDump.php again to import the pages in all the other namespaces.
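
The order described above could be sketched as follows, assuming the script is run from the maintenance folder. Namespace index 10 is the Template namespace and 0 is the main namespace; the function name import_by_priority and the PHP variable are illustrative. The final pass has no --namespaces filter and relies on the script skipping pages that were already loaded.

```shell
# PHP can be overridden (for example PHP=echo) to preview the calls.
PHP="${PHP:-php}"

# Import templates first, then articles, then all remaining namespaces.
# $1 is the dump file.
import_by_priority() {
    "$PHP" importDump.php --namespaces=10 "$1" &&
    "$PHP" importDump.php --namespaces=0 "$1" &&
    "$PHP" importDump.php "$1"
}
```

Running the three passes one after another is the simplest variant; the first two could instead be started in separate windows, as described above.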

Estimating how long it will take

To estimate how long an import will take, first find out how many pages in total are in the wiki you are importing. That number is displayed at Special:Statistics in each wiki. As of October 2023, the English Wikipedia had over 59,000,000 pages, including all page types such as talk pages, redirects, etc., but not including pictures ("files").

To see how fast the import is going, go to Special:Statistics in the wiki you are importing into. Note the time and jot down the total pages. Then come back later and see by how much that number has changed. Convert that to pages per day, and then divide the total pages of the wiki you are importing by that figure to see how many days the import will take.

For example, in the experiment mentioned above, which used parallelization, the total pages shown in Special:Statistics grew by about 1,000,000 pages per day. At that rate, it would take around 59 days to import the 59,000,000 pages (as of October 2023) in the English Wikipedia (not including pictures).
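
The arithmetic above can be scripted. The sketch below is only an illustration: the function name estimate_days and its arguments are made up for this example, and the two page counts are readings you take yourself from Special:Statistics.

```shell
# Estimate total import time from two Special:Statistics readings.
# $1 = total pages at the first reading, $2 = total pages at the second,
# $3 = hours between the readings, $4 = total pages in the source wiki.
estimate_days() {
    awk -v a="$1" -v b="$2" -v h="$3" -v total="$4" 'BEGIN {
        per_day = (b - a) * 24 / h          # pages imported per day
        printf "%.0f pages/day, about %.0f days in total\n", per_day, total / per_day
    }'
}

# 250,000 pages gained in 6 hours, 59,000,000 pages to import:
estimate_days 0 250000 6 59000000
# prints: 1000000 pages/day, about 59 days in total
```

The result is the time for the whole import at the measured rate, so pages that are already imported are included in the total.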

Notes

Since MediaWiki 1.29 (task T144600), importDump.php doesn't update statistics. You should run initSiteStats.php manually after the import to update page and revision counts.

Troubleshooting

See Also: meta:Data dumps/ImportDump.php#Common problems and solutions

If errors occur when importing files, it may be necessary to use the --username-prefix option.
