Wikipedia:Wikipedia Signpost/2015-12-09/Op-ed
Wikidata: Knowledge from different points of view
This is the third in a series of recent Signpost op-eds about Wikidata, including "Wikidata: the new Rosetta Stone" and "Whither Wikidata?".
Wikidata recently celebrated its third birthday. In these three short years it has become one of the most active Wikimedia projects, won prizes, and started to show its true potential for improving Wikipedia. It is being used more and more both inside and outside Wikimedia every day. At the core of Wikidata is the desire to give more people more access to more knowledge. This is what we should be held accountable for. And I am the first to admit that we still have a long way to go.
What beliefs are at the core of Wikidata? Is it a database like any other?
We built Wikidata with a few core beliefs in mind that shine through everywhere. The most fundamental one is that the world is complicated and there is no single truth--especially in a knowledge base that is supposed to serve many cultures. This belief is expressed in many decisions, big and small:
- Wikidata allows you to express many different points of view about the same data point and they can live side-by-side. It allows you to express much more nuance than any other database I know.
- Wikidata is not about the truth but about what other sources say. When different sources claim different things, we can record them and expose them to the reader to interpret and decide.
- Wikidata doesn’t restrict you. You can say that a city has a cat as a mayor. (Because doh! This really happened.)
All this comes at a cost. My life would be a lot easier if we decided to just build a simple yet stupid database ;-) However, we went this way to allow for a more pluralistic worldview, as we believe it is crucial in a knowledge base that supports all Wikimedia projects and more. Here are some examples where we are starting to show this potential:
- Jerusalem having several values for country and capital of
- Jesus Christ having several values for father
- Chelsea Manning having several values for sex and gender
The goal here is to describe the world in a useful way. Even with the possibilities we have built into Wikidata, it will not be possible to truly represent the whole complexity of the world. Natural language, and thus Wikipedia, is much more suited for that and will continue to be. But there is value in a knowledge base for the many pieces of information we encounter every day that do not require that level of nuance. Today already a lot of great things are being built using data from Wikidata. Here are just a few of them:
- Inventaire, a website that uses data from Wikidata to build author profiles among other things.
- Genewiki improving gene-related articles on English Wikipedia and more
- Wikipedia gender indicator giving us a detailed analysis of our content gaps and biases with the help of Wikidata
- AskPlatypus answering our natural language questions based on data in Wikidata
- Histropedia allowing us to build beautiful timelines powered by Wikidata
- Games making it very easy to meaningfully contribute to our projects
Structured data is changing the world around us right now. And I am working towards having a free and open project at the center of it that is more than a dumb database.
Is Wikidata’s data bad? Is Wikipedia’s data better? Does it matter?
For Wikidata to truly give more people more access to more knowledge, the data in Wikidata needs to be of high quality. Right now, no one denies that the quality of the data in Wikidata is not as good as we would like it to be and that there is still a lot of work to do. Where opinions differ is how to get there. Some say adding more data is the way to go, as that will lead to more use and thereby more contributions. Others say removing data and re-adding it with more scrutiny is the only way to go. Others say let’s improve what we have and make usage more attractive. All of them have merit depending on where you are coming from. At the end of the day what will decide is action based on community consensus. Data quality is a topic close to my heart, so I have been thinking a lot about this. We are tackling the topic from many different angles:
More eyes on the data: The belief behind this is that the more people are exposed to data from Wikidata the better the quality will become. To achieve this, we have already done quite some work including improving the integration of Wikidata’s changes in the watchlist and recent changes on Wikipedia and the other Wikimedia projects. Next we are building the ArticlePlaceholder extension and automated list articles for Wikipedia based on the data in Wikidata. We will additionally make it easier to re-use the data in Wikidata for third parties. We will also look into building more streamlined processes for allowing data-reusers to report issues easily to create good feedback loops.
Automatically find and expose issues: The belief behind this is that to handle a large amount of data in Wikidata, we need tools to support the editors in their work. These automatic tools help detect potential issues and then make editors aware of them, so they can look into them and fix them as appropriate. To achieve this, we already have internal consistency checks (to easily spot issues like people who are older than 150 years or an identifier for an external database that has the wrong format). We have also worked on checking Wikidata’s data against other databases and flagging inconsistencies for the editors to investigate. Furthermore, more and more visualizations turn up that make it easier to get an overview of a larger part of the data and spot outliers and gaps. And probably the most important part is machine-learning tools like ORES that help us find bad edits and other issues. We have made great progress in this area in 2015 and will realize more of this potential in 2016. Overall the fact that Wikidata consists of structured data makes it much easier to automatically find and fix issues than on Wikipedia.
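The two internal consistency checks mentioned above (implausible ages and malformed external identifiers) can be illustrated with a minimal sketch. It is not Wikidata's actual constraint system: the item layout, the field names `date_of_birth`, `date_of_death` and `identifiers`, and the `id_patterns` mapping are all assumptions made for illustration.

```python
import re
from datetime import date

# Threshold taken from the example in the text; everything else is hypothetical.
MAX_PLAUSIBLE_AGE_YEARS = 150


def age_in_years(born, until):
    """Whole years elapsed between two dates."""
    years = until.year - born.year
    if (until.month, until.day) < (born.month, born.day):
        years -= 1
    return years


def find_issues(item, id_patterns, today):
    """Return human-readable flags for an editor to review.

    `item` is a plain dict standing in for a Wikidata item (assumed layout),
    `id_patterns` maps an identifier name to the regex its values must match.
    """
    issues = []

    born = item.get("date_of_birth")
    died = item.get("date_of_death")
    if born is not None:
        end = died or today  # a person with no death date is treated as living
        if age_in_years(born, end) > MAX_PLAUSIBLE_AGE_YEARS:
            issues.append("implausible age")

    for name, value in item.get("identifiers", {}).items():
        pattern = id_patterns.get(name)
        if pattern and not re.fullmatch(pattern, value):
            issues.append(f"bad format: {name}")

    return issues


# Usage with made-up data: a person born in 1800 with no recorded death date,
# and an identifier that should consist of exactly four digits.
example_item = {
    "date_of_birth": date(1800, 1, 1),
    "identifiers": {"example_db": "12AB"},
}
example_patterns = {"example_db": r"\d{4}"}
flags = find_issues(example_item, example_patterns, today=date(2015, 12, 9))
```

The function only flags potential issues. As described above, deciding whether a flagged value is really wrong stays with the editors.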
Raise the number of references: The belief behind this is that we should have references for many of the statements in Wikidata, so people can verify them as needed. This is also important to stay true to our initial goal of stating what other sources say. We have just recently made it easier to add references, which will hopefully lead to more people adding them. More will be done in this area. The primary sources tool helps by suggesting references for existing statements. And the recently accepted IEG grant for StrepHit will boost this even further. And last but not least, there is a rather active group of editors working on WikiProject Source MetaData. All this will help us raise the number of referenced statements in Wikidata. We have already seen it increase massively from 12.7% to 20.9% over the past year because of these measures as well as a change in attitude.
Encourage great content: Wikidata as a project needs to build processes that lead to great content. It starts with valuing high-quality contributions more and highlighting our best content. We have had showcase items for a while now, which are supposed to put a spotlight on our best items. The process is currently undergoing a change to make it run more smoothly and encourage more participation.
Make quality measurable: We are working on various metrics to meaningfully track the quality of Wikidata’s data. So far the easiest and most-used metric is the number of references Wikidata has and how many of those refer to a source outside Wikimedia. We should however take into account that Wikidata also has a very significant amount of trivial, self-evident, or editorial statements that do not need a reference. One example of this is the link to the image on Wikimedia Commons. More than three million statements are "instance of: human"! The percentage of references to other Wikimedia projects is especially high for these trivial statements. On the other hand, the percentage of references to better sources is much higher for non-trivial statements like population data. The existing metric is too simplistic to truly capture what quality on Wikidata means. We need to dive deeper and look at quality from many more angles. This will include things like regular checks of a small random subset of the data.
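To make the limits of the simple reference metric concrete, here is a small sketch of how it could be computed with and without trivial statements. The statement layout, the `is_wikimedia` flag and the set of "trivial" properties are assumptions for illustration only, not an existing Wikidata report.

```python
def reference_metrics(statements, trivial_properties=frozenset()):
    """Share of referenced statements, and share of those citing an outside source.

    Each statement is a dict with a "property" name and a list of "references";
    each reference carries an "is_wikimedia" flag (hypothetical layout).
    Statements whose property is in `trivial_properties` are left out.
    """
    considered = [s for s in statements if s["property"] not in trivial_properties]
    referenced = [s for s in considered if s["references"]]
    external = [
        s for s in referenced
        if any(not ref["is_wikimedia"] for ref in s["references"])
    ]
    return {
        "total": len(considered),
        "referenced_share": len(referenced) / len(considered) if considered else 0.0,
        "external_share": len(external) / len(referenced) if referenced else 0.0,
    }


# Made-up sample: two trivial statements and two non-trivial ones.
sample = [
    {"property": "instance of", "references": [{"is_wikimedia": True}]},
    {"property": "image", "references": []},
    {"property": "population", "references": [{"is_wikimedia": False}]},
    {"property": "date of birth", "references": []},
]

naive = reference_metrics(sample)
adjusted = reference_metrics(sample, trivial_properties={"instance of", "image"})
```

In this invented sample the share of referenced statements that cite a source outside Wikimedia changes once the trivial statements are excluded, which is the kind of effect the simple metric hides.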
All of those building blocks are being worked on or are already in place. Already today in its arguably imperfect state, Wikidata is helping Wikipedia raise its quality by finding longstanding issues on Wikipedia that only became apparent because of Wikidata, like a Wikipedia having two articles about the same topic without being aware of it. Or two Wikipedias having different data about a person without any useful reference. Wikidata gives a good way to finally expose and correct these mistakes. Once we have a data point and a good reference for it on Wikidata, it can be scrutinised more thoroughly and then used much more widely than before.
Trust and believing in ourselves
Do we trust our own model and way of working? Wikipedia started just the same way as Wikidata. It didn’t have high-quality data and it certainly didn’t have a lot of references for its articles. But with a lot of dedicated work this changed and today Wikipedias (at least the biggest ones!) are of fairly high quality. I see no reason why we can’t do this for Wikidata once again--with an amazing community, better tools at our hands, and the lessons we have learned in Wikipedia. But let’s also not fall into the trap of demanding perfection.
What do we do now?
- Encourage more re-users of Wikidata’s data to give their users a way back to Wikidata. Histropedia and Inventaire are two examples of re-users doing that already and it is a mutually beneficial partnership.
- Make it easier to use Wikidata’s data inside and outside of Wikimedia.
- Improve existing quality tools around Wikidata and make more use of them.
- Make existing knowledge diversity tools easier to use, promote them more and make more use of them.
- Make the outside world more aware of knowledge diversity and plurality.
- Increase the diversity in our contributor base to cover more cultures and worldviews.
At the end of the day, Wikidata is a chance to raise the quality bar across all our projects together. Let’s make it reality. That’s how we give more people more access to more knowledge every day.
- Lydia Pintscher is the Product Manager for Wikidata at Wikimedia Deutschland.
Discuss this story
Compatibility of Wikidata's CC0 licence with Wikipedia's CC-BY-SA licence
Lydia, Wikidata imports large amounts of data from Wikipedia infoboxes, templates etc., which it then republishes under the CC0 licence.
Unlike Wikipedia's CC BY-SA licence, which retains the right to attribution and imposes on re-users the obligation to ShareAlike, assuring that Wikipedia will be named as the source, CC0 waives all authors' rights. This means that content compiled in Wikipedia under the CC BY-SA licence is republished by Wikidata under a licence that does away with the rights contributors here were told they possessed when they contributed to Wikipedia.
Given this background – and the information on database rights provided by the WMF Legal Team in https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights – could you comment on these recent mailing list posts [2][3]? Originally people were told that Wikidata did "not plan to extract content out of Wikipedia at all", but would "provide data that can be reused in the Wikipedias".
Note also this post on de:WP calling for an RfC to determine whether the community would be willing to agree to waive its CC BY-SA rights for Wikidata, and that DBpedia, engaged in a similar endeavour of data extraction from Wikipedia, made a conscious choice to use the same licence as Wikipedia. Thank you. Andreas JN466 21:23, 12 December 2015 (UTC) expanded by --Andreas JN466 16:52, 13 December 2015 (UTC)
Usefulness of unsourced Wikidata imports to Wikipedia
Lydia, most Wikipedias do not allow users to cite one Wikipedia article as a source in another. It's a basic principle of Verifiability. If Wikidata imports unsourced content from various Wikipedia language versions' infoboxes, it seems to me that these imported Wikidata contents can only be used to populate infoboxes in another Wikipedia if they contain an external source (something that is generally not the case in Wikidata today). Moreover, this source would have to be visible in every Wikipedia that draws on the relevant Wikidata content, as otherwise the content would be unverifiable for the reader. Does that match your vision? Andreas JN466 21:23, 12 December 2015 (UTC)
I am concerned that some might view the Wikidata material as an acceptable "reliable source" whether directly or indirectly for actual articles. Such a position, I fear, would be a wondrous Pandora's box indeed. Instead, I suggest that a wall be placed here at the outset - let those who are completely outside Wikipedia be able to use Wikidata, but estop those within this walled area from using it. Collect (talk) 13:01, 13 December 2015 (UTC)
Building a multi-lingual repository of useful data
Speaking as a Wikidatan, I enjoy working on Dutch 17th-century paintings and last year I joined the "d:Wikidata:WikiProject sum of all paintings" group. Wikidata is constantly increasing the number of properties that can be used on paintings, and I have helped propose and model the usage of these. Today I add reference statements; in the beginning I didn't know how. If all statements were from the same reference (like a museum website), I just used the "Described at url" property to link back to the website entry, because I was used to working that way on Commons. We are still searching for ways to describe paintings in terms of movement, style, and period. I have re-used a number of references added by others on Wikipedia projects, which were very beneficial to me as a Wikipedian, so now it's my way of giving back to add these to Commons images or Wikidata items where appropriate. Each project has its own community of volunteers and there don't seem to be many who venture out into the others like I do. I have been a member of the Wikipedia:Visual arts group on Wikipedia as well, but on Wikidata the interaction with like-minded people happens less on talk pages, because we don't speak a common language. Wikidata enables data-sharing by leveling the playing field for all mono-lingual players. I am used to working on image files of paintings on Wikimedia Commons, where we have similar language issues, but there we also have lots of complicated discussions about copyright problems. On Wikidata it doesn't matter whether you are working on a painting collection of modern art or 17th-century art, because the data model is basically the same. We may not be able to show you an image, but we can tell you where it is and all sorts of other things about copyrighted images. In some cases we can link out to a picture of it somewhere.
As far as data quality goes, what I think a lot of people don't understand is that if one Wikipedia is in disagreement with another Wikipedia on any issue (such as a painting attribution to its creator), on Wikidata both statements can reside side by side with the "Normal" rank. Currently Wikipedia only has two ranks for "points of view" in statements, namely published and deleted. As it is now, everything deleted from Wikipedia just disappears, whether it is an alternate point of view or pure vandalism. On the other hand, everything that is published has a ring of "truth" to it, whether or not it's under discussion. The Wikidata item, like any wiki, has these published ("Normal") and deleted states for statements, but it also allows two extra states ("Preferred" and "Deprecated"). We see the deprecated state used a lot for past designations, for example for buildings from the Wiki Loves Monuments project, where old buildings are re-purposed over the years. The "Preferred" status is used to indicate the value that is considered as coming from the latest, or "most reliable", source. Choosing which source that is may be seen as a controversial issue, and one can argue that this is of course a matter of opinion, but it is much more often a matter of consilience. I have used the "Preferred" value several times for painting attributions, using catalogs by leading art historians as references. When in doubt, I allow multiple statements to reside side by side with the "Normal" rank. We see this occasionally on Wikipedia with a lead statement such as "...is a painting by XXX or associated workshop". Wikidata can add precision to this statement by actually naming the individuals of that workshop to whom the painting has been attributed in the past. Jane (talk) 16:44, 13 December 2015 (UTC)