User:Arizcraf
Data import guide
This guide has been created for anyone wishing to import data into Wikidata. The main focus is to show how to prepare the data to be ingested so that it conforms to the Wikidata format. The guide contains four phases, and each phase contains multiple steps. The phases are:
- Data preparation
- Data Synching
- Data Ingestion
- Review Import
This guide expects the user to already own the data. If you do not have data to import but still want to help the community, many alternatives are available. Here are two examples:
- Play games like the Wikidata Game or The Distributed Game.
- Help other users import their data sets. The data set import page shows the data sets that are in progress (Browse data sets imports).
Introduction
Importing data into Wikidata requires many skills. However, the process can be broken down into four phases containing individual steps, so that the Wikidata community can work together to import data. The prerequisite skills to get started importing data are:
- Creating and editing wiki pages including interacting with Wikidata community members.
- Moving information into a spreadsheet and duplicating sheets within a spreadsheet
Introduction to Wikidata
If you don’t have an account on Wikidata, create one (create an account). You don’t need one for simple inserts or edits, but an account has benefits: you can run batch processes, and your changes are linked to your account rather than to your IP address.
Make yourself comfortable with the community. The community is always happy to help you in case you need any assistance.
Wikidata Chat
IRC Channel
Facebook
Development team support
It is always recommended to use tools that make it easy to share your data, because this makes it easier for the community to help.
Tools
Sharing non-ingested data: Google Sheets
Automated Tools: OpenRefine
Manual Tools: Mix'n'match
Insert Data: QuickStatements
Data structure
Wikidata stores its data in an RDF data structure. This means it is represented as semantic triples. A semantic triple contains a subject, a predicate, and an object, and states a fact. An example for Douglas Adams (Q42) is:
“Douglas Adams place of birth Cambridge”
Here, “Douglas Adams” is the subject, “place of birth” is the predicate, and “Cambridge” is the object. This predicate points in one direction only. This means that if you look up the item Cambridge, you won’t find a predicate pointing back to Douglas Adams.
Some relations are bi-directional, which means they are stated as two separate triples, one in each direction. Another example for Douglas Adams is:
“Douglas Adams father Christopher Douglas Adams”
and
“Christopher Douglas Adams son Douglas Adams”
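The triples above can be sketched in a few lines of Python. This is only an illustration of the subject, predicate, object idea. The `Triple` class and the `statements_about` helper are hypothetical names, and labels are used instead of Q and P numbers to keep the example readable.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The three example statements from the text, written with labels.
triples = [
    Triple("Douglas Adams", "place of birth", "Cambridge"),
    Triple("Douglas Adams", "father", "Christopher Douglas Adams"),
    Triple("Christopher Douglas Adams", "son", "Douglas Adams"),
]


def statements_about(subject: str) -> list[Triple]:
    """Return the triples whose subject is the given item."""
    return [t for t in triples if t.subject == subject]


# Cambridge is only ever an object here, so nothing is stored on it.
print(statements_about("Cambridge"))
# The father relation exists in both directions as two separate triples.
print(statements_about("Christopher Douglas Adams"))
```

Looking up "Cambridge" as a subject returns no triples, which mirrors the one-directional "place of birth" statement.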
Phase 1: Data preparation
The goal of this phase is to have the data to be imported in a format that can be mapped to the data structure of Wikidata.
Step 1: Data definition
In the first step, we check whether this new data import can be embedded into an existing WikidataProject or data set import in Wikidata. The WikidataProjects and data sets can be found on the following two pages.
If you find that your data can be embedded in a WikidataProject or data set, please contact the owner. This helps track what kind of data has been ingested into Wikidata.
Step 2: License
In this step, the question of ownership needs to be answered. Unlike other Wikimedia projects, data in Wikidata is licensed under the CC0 License (CC0 License definition). Therefore, imported data must be available under the same license. The licensing of the data has to be defined by the data owner.
Step 3: Data source reference
To enhance the reliability of the data, it is always recommended to have references to the source of the data. This means you should be able to answer where the data comes from. Are the sources for the data available? Ideally, the data is available online for cross-referencing.
Step 4: Cumulate data
Import your data into a spreadsheet. Create a separate sheet for every data type you have. For example, create one sheet for all your operas and another sheet for all the roles.
Step 5: Data cleaning
The text data in your spreadsheet should be concise, meaning that the labels of the items should not contain filler words or links. This preparation step makes the data matching easier.
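As a minimal sketch of this kind of cleaning, the following Python function removes links, filler words, and extra whitespace from a label. The filler word list and the sample label are assumptions made for illustration; what counts as filler depends on your own data.

```python
import re

# Hypothetical filler words; adjust this set to your own data.
FILLER_WORDS = {"opera:", "role:"}

# Matches http and https links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_label(raw: str) -> str:
    """Strip links, filler words, and extra whitespace from a label."""
    without_links = URL_PATTERN.sub(" ", raw)
    words = [w for w in without_links.split() if w.lower() not in FILLER_WORDS]
    return " ".join(words)


print(clean_label("Opera:  La Traviata http://example.org/traviata"))
```

In a spreadsheet, the same result can be reached with find-and-replace; the script only pays off for larger data sets.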
Step 6: Data set
This step can be skipped if a WikidataProject or a data set was found in step 1.
Register that the data is available to be imported.
When the data is publicly available, create a data set entry on the Dataset Import page. This helps the community keep track of the data sets that are in the process of being imported. Keep your newly created data set import page up to date, as this makes it easier for the community to help.
Step 7: Structure data
Format the data
For better data matching and import, the spreadsheets should be structured. Wikidata provides a template for what the structure could look like (Template).
As an example, consider a relational database in which every table is normalized to the third normal form. This makes every row unique and ensures that the attributes linked to a row are independent of each other.
Here is a bad example of a table.
ID | Name | Description | Import description | Place of birth | Inception |
---|---|---|---|---|---|
1 | Douglas Adams | British author and humorist | British author and humorist | Cambridge | 1st century |
This example is bad because of the last column, “Inception”. This column relates to the city Cambridge and not to the author Douglas Adams. Such a table needs to be split into two tables.
The first table represents the author Douglas Adams.
ID | Name | Description | Import description | Place of birth |
---|---|---|---|---|
1P | Douglas Adams | British author and humorist | British author and humorist | Cambridge |
The second table represents the city.
ID | Name | Description | Import description | Inception |
---|---|---|---|---|
1C | Cambridge | City in Cambridgeshire, England | City in Cambridgeshire, England | 1st century |
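The split shown above can be sketched in Python as follows. This is an illustration under simplifying assumptions: the rows are plain dictionaries, the function name `split_rows` is hypothetical, and only the columns needed to show the idea are included.

```python
# Denormalized row from the "bad" example: Inception belongs to the city.
rows = [
    {
        "Name": "Douglas Adams",
        "Description": "British author and humorist",
        "Place of birth": "Cambridge",
        "Inception": "1st century",
    }
]


def split_rows(rows):
    """Separate person rows from attributes that describe the birthplace."""
    people = []
    cities = {}  # keyed by city name so that each city appears only once
    for row in rows:
        people.append({
            "Name": row["Name"],
            "Description": row["Description"],
            "Place of birth": row["Place of birth"],
        })
        cities.setdefault(row["Place of birth"], {
            "Name": row["Place of birth"],
            "Inception": row["Inception"],
        })
    return people, list(cities.values())


people, cities = split_rows(rows)
print(people)
print(cities)
```

Keying the cities by name means that several authors born in the same city produce a single city row.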
Step 8: Property check
Important! This step can be quite technical. Here the data structure is matched with the ontology in Wikidata. If you're unsure how to do this, contact someone from the Wikidata community through the Partnership page.
The properties defined in the previous step should exist in Wikidata so that you can import them correctly. To check this, you can take the following steps.
- Use the search function to search for items that may hold similar information stored in a way that could be copied for this data set.
- Check the list of properties to find properties for your data.
- Look at the showcase items, which provide examples of items with very rich levels of data within Wikidata.
If properties of your items are not yet in Wikidata, you can propose a new property. Before creating such a proposal, double-check the Wikidata list of properties and the special page for the list of properties. Instructions on what to do before creating a proposal and how to create one can be found on the property proposal page.
Phase 2: Data Synching
The goal of this phase is to remove duplication from the data to be imported. This means replacing values in the source data with links to the matching data in Wikidata.
Note: To help improve the quality of Wikidata, it is recommended to take notes on all types of issues encountered. Examples are:
- Duplicates
- Wrong properties
- Wrong graph setup
- Logical issues
- Etc.
Step 1: Data item matching
The data to be imported now needs to be synchronized with the data in Wikidata. Depending on the type of data available, it makes sense to use a tool that matches the data either automatically or manually. If you expect your data to fit directly into the existing Wikidata data structure, an automated approach makes sense. If you expect a lot of new data or manual intervention, it is easier to start with a manual approach. Ideally, in the end, every item in your data that already exists in Wikidata should have its Q number as an identifier.
Proposed tool: OpenRefine
Both approaches will most probably need manual interaction.
Example:
ID | Name | Description | Import description | Place of birth |
---|---|---|---|---|
Q42 | Douglas Adams | British author and humorist | British author and humorist | Q350 |
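To make the result of the matching concrete, here is a small Python sketch that replaces labels with Q numbers. It assumes a local lookup table, `KNOWN_ITEMS`, which is hypothetical: in practice the Q numbers come from a reconciliation tool or from searching Wikidata manually. Labels that cannot be matched are collected for manual review.

```python
# Hypothetical lookup table; real matches come from a reconciliation
# tool or from a manual search on Wikidata.
KNOWN_ITEMS = {
    "Douglas Adams": "Q42",
    "Cambridge": "Q350",
}


def match_row(row):
    """Replace the local ID and the birthplace label with Q numbers."""
    matched = dict(row)
    unmatched = []

    # The item itself: its Q number replaces the local ID.
    item_id = KNOWN_ITEMS.get(row["Name"])
    if item_id is not None:
        matched["ID"] = item_id
    else:
        unmatched.append(row["Name"])

    # The birthplace is an item too, so its label becomes a Q number.
    place_id = KNOWN_ITEMS.get(row["Place of birth"])
    if place_id is not None:
        matched["Place of birth"] = place_id
    else:
        unmatched.append(row["Place of birth"])

    return matched, unmatched


row = {"ID": "1P", "Name": "Douglas Adams", "Place of birth": "Cambridge"}
matched, unmatched = match_row(row)
print(matched)
print(unmatched)
```

An exact label lookup like this cannot tell apart items that share a label, which is one reason manual interaction is usually still needed.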
Step 2: Properties matching
In this step, the existing properties and attributes in Wikidata are matched with the ones in your data.
All properties in your spreadsheet must be replaced with the corresponding P number. If step 7 or 8 of Phase 1 was skipped, some properties may not yet exist in Wikidata. Attributes of your data items that are items themselves need to be replaced by the corresponding Q number. This step can be done automatically or manually.
Proposed tool: OpenRefine
You will most probably need manual interaction in the end.
ID | Name | Description | Import description | P19 |
---|---|---|---|---|
Q42 | Douglas Adams | British author and humorist | British author and humorist | Q350 |
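The header replacement in the table above can be sketched as a simple mapping. The mapping only contains P19, the one property used in this guide. The names `PROPERTY_IDS`, `NON_PROPERTY_COLUMNS`, and `map_headers` are hypothetical, and you would extend the mapping with the properties found in Phase 1.

```python
# P19 is the property used in the table above; extend for your own columns.
PROPERTY_IDS = {"Place of birth": "P19"}

# Columns that describe the item itself and are not statements.
NON_PROPERTY_COLUMNS = {"ID", "Name", "Description", "Import description"}


def map_headers(headers):
    """Replace property names with P numbers and report unknown columns."""
    mapped = []
    missing = []
    for header in headers:
        if header in NON_PROPERTY_COLUMNS:
            mapped.append(header)
        elif header in PROPERTY_IDS:
            mapped.append(PROPERTY_IDS[header])
        else:
            # Keep the name so the gap stays visible in the sheet.
            mapped.append(header)
            missing.append(header)
    return mapped, missing


headers = ["ID", "Name", "Description", "Import description", "Place of birth"]
print(map_headers(headers))
```

Any column reported as missing is a candidate for the property check or a property proposal described in Phase 1.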
Step 3: Item Cleansing
Newly created items should follow the Wikidata guidelines. For labels, Wikidata lists these pointers:
- A label is like a page title that describes what the item is about. It should be as short as possible (e.g. Earth, not Planet Earth)
- Labels do not have to be unique as they are disambiguated by descriptions—more on this later
- Use the most common name (e.g. cat not Felis catus) and only capitalize proper nouns (like London, Jupiter, or Hillary Clinton—but not city, planet, or politician)
For descriptions, Wikidata lists the following guidelines:
- Keep it short—descriptions are not sentences.
- Try to be as accurate and as neutral as possible—avoid using information that will change over time or that is considered controversial and biased.
- Descriptions should not normally begin with initial articles like "the" or "a".
- If you're stuck, Wikipedia is a good resource for coming up with descriptions for items—often the first two sentences of the item's article will provide enough information.
“…descriptions are used to disambiguate labels by providing more details about an item …. It's ok to have multiple items with the same label as long as each item has a different description.”
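Some of these guidelines can be checked mechanically before the import. The following Python sketch flags descriptions that start with an article and finds items that share both label and description. The trailing-period check is a rough heuristic of our own for "descriptions are not sentences", and both function names are hypothetical.

```python
from collections import Counter

ARTICLES = ("the ", "a ", "an ")


def check_description(description: str) -> list[str]:
    """Return a list of guideline problems found in one description."""
    problems = []
    if description.lower().startswith(ARTICLES):
        problems.append("starts with an article")
    # Heuristic (an assumption): a trailing period suggests a full sentence.
    if description.rstrip().endswith("."):
        problems.append("ends with a period, so it may be a full sentence")
    return problems


def duplicate_pairs(items):
    """Return (label, description) pairs that occur more than once."""
    counts = Counter(items)
    return [pair for pair, count in counts.items() if count > 1]


print(check_description("The British author."))
print(duplicate_pairs([
    ("Cambridge", "city in Cambridgeshire, England"),
    ("Cambridge", "city in Cambridgeshire, England"),
    ("Cambridge", "city in Massachusetts, United States"),
]))
```

Checks like these only catch formal problems; accuracy and neutrality still need a human reader.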
Phase 3: Data Ingestion
After this phase, your data has been ingested into Wikidata.
Step 1: Tool selection
When the data has been synced with the data structure available in Wikidata, the data needs to be imported. Depending on the available structure, automated tools can be used. If you need help creating bots, you can contact the development team.
Tip: Depending on which tools were used in the earlier phases, they may also be able to insert the data. If the data has been cleaned up thoroughly, creating QuickStatements from a spreadsheet such as Excel is recommended. To use QuickStatements, your account needs to be autoconfirmed. To achieve this status, the Wikidata account must be at least four days old and must already have made 50 edits or inserts.
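As a rough sketch of how spreadsheet rows can be turned into QuickStatements commands, the function below writes one tab-separated line per statement in the form item, property, value. It assumes the rows already use Q and P numbers as in Phase 2 and that all values are items. The function name is hypothetical, and the QuickStatements help page remains the authority on the exact syntax, including labels, descriptions, and references.

```python
def to_quickstatements(rows):
    """Build one tab-separated command line per statement column."""
    lines = []
    for row in rows:
        for column, value in row.items():
            # Only columns named with a P number hold statements.
            if column.startswith("P") and column[1:].isdigit():
                lines.append("\t".join([row["ID"], column, value]))
    return "\n".join(lines)


rows = [{"ID": "Q42", "Name": "Douglas Adams", "P19": "Q350"}]
print(to_quickstatements(rows))
```

The resulting text can be pasted into the QuickStatements input box; columns such as Name are skipped here because they are not statements.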
Step 2: Import data
Import the data using the tools selected. Keep in mind that data ingestion takes time. Inserting a single entry can take up to a few seconds, so a load of several thousand entries can take a few hours. It is recommended to create several batches, which can be loaded independently and checked after each load.
It is highly recommended to test the import with a small sample first and review the imported data for any potential issues.
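Splitting the prepared commands into batches takes only a few lines. This is a generic sketch: the batch size is an arbitrary assumption, and `batches` is a hypothetical helper name.

```python
def batches(commands, size):
    """Yield consecutive slices of at most `size` commands."""
    for start in range(0, len(commands), size):
        yield commands[start:start + size]


# Placeholder commands standing in for real import lines.
commands = [f"command {n}" for n in range(5)]

for number, batch in enumerate(batches(commands, 2), start=1):
    print(f"Batch {number}: {len(batch)} commands")
```

Loading the first, small batch on its own and reviewing the result is a natural way to do the sample test recommended above.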
Phase 4: Review Import
After this final phase, the community has been informed that your data is embedded in Wikidata.
Step 1: Check data import
After the data has been ingested, you need to check whether the import was successful. This can be done either with the Wikidata search function or by building a query on the Wikidata query service. The latter is more complex but works faster. If you need help creating a query, you can find help in the chat or request that a query be built (Query Request).
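A check query can be generated from the same rows that were imported. The sketch below only builds the SPARQL text. It assumes that all values are items and relies on the `wd:` and `wdt:` prefixes that the Wikidata query service predefines. The function name is hypothetical. Paste the output into the query service: every statement that was imported successfully should come back as a result row.

```python
def build_check_query(statements):
    """Build a query returning those statements that exist in Wikidata.

    `statements` is an iterable of (item, property, value) identifiers,
    for example ("Q42", "P19", "Q350").
    """
    values = " ".join(
        f"(wd:{item} wdt:{prop} wd:{value})"
        for item, prop, value in statements
    )
    return (
        "SELECT ?item ?prop ?value WHERE {\n"
        f"  VALUES (?item ?prop ?value) {{ {values} }}\n"
        "  ?item ?prop ?value .\n"
        "}"
    )


print(build_check_query([("Q42", "P19", "Q350")]))
```

Comparing the number of result rows with the number of imported statements shows quickly whether anything is missing.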
Step 2: Close Data set
The last step of the import is to mark the data set page created in Phase 1, Step 6 as completed. This helps the community track which data sets have been imported. This is done by changing the property “progress status” to “complete” when editing the page.