Wikidata talk:Mismatch Finder
Geo-data related questions:
- How can I register a database for 1:1 matching?
- What is the plan for visualizing the detected extreme distances? ( GPS data - P625 )
context:
I am helping with the https://www.naturalearthdata.com/ - Wikidata concordances; Natural Earth is a public-domain geo-database ( see also https://whosonfirst.org/ ).
There are a lot of data errors on both sides. ( as usual )
My ideal use case would be:
- I export the "Natural Earth" locality data to CSV and host it at a public URL
- I register the join columns for the 1:1 mapping
- I register the matching columns, for example:
- ne:"name" should equal wd:"English label"
- text difference categories: exact match; unaccent match; partial match; low Jaro-Winkler distance; high Jaro-Winkler distance
- GPS
- distance categories: | < 5 km "Ideal" | 5-20 km "Maybe" | >= 20 km "Error" | >= 500 km "High Priority Error" |
- ne:"ISO" should equal the Wikidata country code
Text matching is hard, so some custom data cleaning/customisation function would be useful. For example, when matching "rivers" I use some regexp precleaning for these words: "(river|rivire|rio|le|de|saint|st.|creek|cr.|fork|fk.)"
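A minimal Python sketch of that kind of precleaning plus match categorization might look like this. The word list and the category names come from the list above; the helper names, the slightly adapted regex (optional trailing dot instead of literal "st."), and the omission of the Jaro-Winkler step (which would need a library such as jellyfish) are my own assumptions.

```python
import re
import unicodedata

# Generic river words from the list above, adapted so the optional
# trailing dot (st., cr., fk.) is handled outside the word boundary.
RIVER_WORDS = re.compile(
    r"\b(river|rivire|rio|le|de|saint|st|creek|cr|fork|fk)\b\.?",
    re.IGNORECASE,
)

def unaccent(s: str) -> str:
    # Decompose accented characters and drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))

def preclean(name: str) -> str:
    # Strip the generic words, collapse whitespace, lowercase.
    return re.sub(r"\s+", " ", RIVER_WORDS.sub("", name)).strip().lower()

def match_category(ne_name: str, wd_label: str) -> str:
    a, b = preclean(ne_name), preclean(wd_label)
    if a == b:
        return "exact match"
    if unaccent(a) == unaccent(b):
        return "unaccent match"
    if a and b and (a in b or b in a):
        return "partial match"
    # A Jaro-Winkler distance (e.g. from the jellyfish library) would
    # separate "low" from "high" distance here; omitted to stay stdlib-only.
    return "no match"

print(match_category("Rio Grande", "Grande"))    # exact after precleaning
print(match_category("Sao Paulo", "São Paulo"))  # unaccent match
```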
The GPS comparison is not so easy either - you have to handle:
- point vs point (P625)
- polygon vs point (P625)
- line vs point (P625)
- line vs multiple points ( rivers )
- ...
Multiple distance categories would be ideal:
- POI distance ( like museums .. >1km distance is not ideal )
- city distance
- country distance ( as a point )
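To make the thresholds concrete, here is a small Python sketch using a haversine great-circle distance (my choice of formula; the category labels and cut-offs are the ones proposed above):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres,
    # using a mean Earth radius of 6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def distance_category(km: float) -> str:
    # Cut-offs taken from the proposal above.
    if km >= 500:
        return "High Priority Error"
    if km >= 20:
        return "Error"
    if km >= 5:
        return "Maybe"
    return "Ideal"

# Berlin vs Paris: far enough apart to be flagged as a high-priority error.
print(distance_category(haversine_km(52.52, 13.405, 48.8566, 2.3522)))
```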
Thank you for working on this!
--ImreSamu (talk) 08:41, 21 June 2021 (UTC)
- In the first versions the tool will not be as sophisticated. Initially it'll just be able to import mismatches that have been found by an external process and present them for review. So you could write a small script comparing naturalearthdata with Wikidata, generate a file containing the mismatching statements and then load that file into the Mismatch Finder. --Lydia Pintscher (WMDE) (talk) 10:37, 7 July 2021 (UTC)
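Such a comparison script could be as small as the following Python sketch. Everything here is illustrative: the sample dictionaries stand in for a Natural Earth export and a Wikidata query result, and the column names are invented rather than the Mismatch Finder's actual import format (see the tool's documentation for that).

```python
import csv
import io

# Hypothetical sample data: external names and Wikidata labels, keyed by
# an already-matched QID. Real code would read a CSV and query SPARQL.
natural_earth = {"Q64": "Berlin", "Q90": "Parris"}
wikidata_labels = {"Q64": "Berlin", "Q90": "Paris"}

def find_mismatches(external, wikidata):
    # Yield one row for every value that differs between the two sources.
    for qid, ext_value in external.items():
        wd_value = wikidata.get(qid)
        if wd_value is not None and wd_value != ext_value:
            yield {"item_id": qid,
                   "wikidata_value": wd_value,
                   "external_value": ext_value}

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["item_id", "wikidata_value", "external_value"])
writer.writeheader()
writer.writerows(find_mismatches(natural_earth, wikidata_labels))
print(out.getvalue())
```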
The need to take care of errors at the sources
@Mohammed_Sadat_(WMDE): Problems we have seen:
- Dictionary of Swedish National Biography ID (P3217)
- excellent research, but it has a very simple website with just text: no linked data, no version management
- Larske found that they had > 300 errors caused by cut-and-paste mistakes - it would be good to track those errors in Mismatch Finder, and also the fact that we have told them about the errors
- I tested tracking them with categories, see Category:P3217_Error
- I think we reported 300 errors and it took them 3 years to fix them...
- Eionet bathing Water ID (P9616): we got coordinates from the Swedish agency but many were not good enough for Wikipedia
- I used deprecated rank and created issues on GitHub; see SPARQL with links to GitHub
- I have tried to get feedback on those errors but it's nearly impossible...
- Europeana entity (P7704) had very bad quality, see T243764 --> we need to "warn", or maybe use deprecated rank with reason for deprecated rank (P2241), for those external identifiers
- for more problems/errors I have found, see "One way to design a system to be a good external identifier in Wikidata"
- Lack of error reporting: I think Wikidata:Mismatch Finder should also be a ticket system so we can reference an issue. I have connected more than 15 sources with Wikidata and done 1 million edits, and I guess I have found 1 external source that has given me a helpdesk ID --> to make this work, the external source with an error needs to be tracked in our system with a unique ID. I think GitHub Issues is perfect, see the example github.com/salgo60/EuropeanBathingWater
- Salgo60 (talk) 16:05, 21 June 2021 (UTC)
- Yes there will be a way to indicate that the error is in the external source. Given how different the various institutions are in accepting feedback we'll probably have something super simple at the beginning and then see how we can expand it as people are using it. --Lydia Pintscher (WMDE) (talk) 10:43, 7 July 2021 (UTC)
- How about using Wikidata's standard approach for this: add the incorrect but referenced value with deprecated rank. --- Jura 07:19, 19 July 2021 (UTC)
WikiTree person ID (P2949), WikiTree+ and the WikiTree Data Doctors Project
Maybe a good candidate to sync Mismatch Finder with...
Email sent to @Mohammed_Sadat_(WMDE), Lesko987a: ... Lesko987a, please correct me if something is wrong, as I haven't been part of WikiTree since 2017....
- Lesko987a is the lead designer of a project/system that since 2016 has been checking > 22 million WikiTree profiles weekly against external sources like Wikidata. The differences are reported back to the end-users of WikiTree in a report (example) and also on every WikiTree profile (example: Douglas Adams (Q42) = Adams-32825 has a suggestion list with 23 suggestions on 282 related profiles..)
- Video: what WikiTree+ is and why it was created
- Lesko987a updates WikiTree person ID (P2949) see change log
- Example weekly reports with suggestions 2021-07-11.... first report April 2016
- Page with all Wikidata Suggestions and Wikidata Suggestions Videos
- Examples of Suggestions related to Wikidata
- Suggestion 541 Wikidata - Clue for Father - Examples / new
- Suggestion 542 WikiData - Possible Father on WikiData - Examples / new (https://www.softdata.si/wt/Err_20210711/542_New_0.htm)
- Suggestion 543 Wikidata - Clue for Mother - Examples / new
- Suggestion 544 WikiData - Possible Mother on WikiData - Examples / new
- Suggestion 546 Wikidata - Possible spouse - Examples / new
- Suggestion 551 Wikidata - Missing gender - Examples Err_20210711/551_0000-0000_0 / new
- Suggestion 552 Wikidata - Different gender - Examples Err_20210711/552_60_0 / new
- Suggestion 553 Wikidata - Empty Birth Date - Examples Err_20210711/553_0000-0000_0 / Open / new
- Suggestion 554 Wikidata - Imprecise birth date - Examples Open / Hidden / New
- Suggestion 555 Wikidata - Different birth date / new
- Suggestion 556 Wikidata - Empty Death Date / new
- Suggestion 557 Wikidata - Imprecise death date / new
- Suggestion 558 Wikidata - Different death date / new
- Data Error 559 Wikidata - Missing birth location / new
- Suggestion 561 Wikidata - Missing death location / new
- Suggestion 563 Wikidata - Possible duplicate by father / new
- Suggestion 564 Wikidata - Possible father / new
- Suggestion 565 Wikidata - Possible duplicate by mother / new
- Suggestion 566 Wikidata - Possible mother / new
- Suggestion 567 WikiData Double entry - Duplicate / new
- ...
- WikiTree members working on suggestions are organized in a WikiTree project called Data Doctors Project, and all WikiTree members are encouraged to work on suggestions for their managed profiles.
- more than 200 rules are defined, see Space:Wikidata_Suggestions, Space:FindAGrave_Suggestions ....
- Example Wikidata George Washington (Q23) = WikiTree Washington-11
- This profile has suggestions in WikiTree link
- as an active WikiTree member you could also see suggestions on all profiles you manage example Suggestions managed by User Aleš Trtnik
- When a suggestion is corrected, all WikiTree members are encouraged to comment, and Data Doctors Project members must leave a comment on the Suggestion Status Page based on the action taken, as described in the Suggestion Status Page instructions. The various reports and actions used in correcting suggestions are explained on the Suggestions Reports and Suggestions Status help page. The "Comment Hints" change with each suggestion, based on frequency of use, and reflect the most common action taken for that suggestion across the WikiTree database.
- In WikiTree, correcting suggestions is hardcore: weekly ranking lists updated every 5 minutes, videos explaining suggestions and how to fix them, YouTube hangouts, and a module for statistics on progress
- Discussion in WikiTree related to Wikidata see G2G Tag Wikidata; example "Errors on Wikidata that originate on WikiTree"
- Agreed-upon recommendations related to Wikidata usage on WikiTree.
- "The feedback process to Wikidata"
- I don't know whether Wikidata is updated with what is fixed in WikiTree
- I guess Wikidata:Database_reports/Constraint_violations/P2949 may indicate that
- Salgo60 (talk) 02:40, 17 July 2021 (UTC)
- Not sure if this is a great testcase: Wikitree is a wiki and Wikidata may not have found yet the optimal way to keep multiple diverging parent/child relations in sync. --- Jura 07:53, 19 July 2021 (UTC)
- Hi, the data validation is designed to improve data on WikiTree. I could also check things in the other direction, to inform Wikidata of data mismatches or missing data. I will see how you design things on the Wikidata end and whether those errors will actually be corrected. But it is often very hard to resolve a data difference, since you must go back to the actual sources to decide what is correct. Most of the time WikiTree users don't correct Wikidata, since they are not users here, but lately they comment on the data differences on the profiles, since invalid data on the internet tends to be copied around and to come back to WikiTree over and over again.
I was thinking of populating Wikidata with dates/relations from WikiTree, but I decided not to, since I didn't find an easy way to keep the data up to date. I don't like one-time data dumps, like the one The Peerage did last year. It caused a lot of problems on WikiTree, since they have many mistakes. Lesko987a (talk) 09:48, 19 July 2021 (UTC)
- Lesko987a I think a step 1 could be that when a user has checked your suggestion, that decision, if it impacts Wikidata, should be fed into Wikidata so that hopefully someone fixes it.... I guess we will see a lot of lessons learned, BUT if a skilled person says something is wrong in Wikidata, then fixing it should be a high priority....
- the problem I have seen with Wikidata <-> WikiTree diffs is that they are mostly on profiles in areas I have no skills in --> I can't tell whether they are correct or not... we have a lot to learn
- I agree about the "The Peerage dump" problems; we should not add more frustration, but instead create an interaction that builds trust....
- WikiTree discussion Dec 6, 2019 "Bot import of The Peerage data in Wikidata - Good or Bad News?"
- Salgo60 (talk) 12:49, 19 July 2021 (UTC)
- If the reverse of the report could easily be generated and regularly refreshed, that could be interesting. Personally, I would be ok with getting regular imports of dates and places from WikiTree. Either they would be new for Wikidata or get appropriate ranks. It's a bit more complicated with family relationships, but maybe someone has figured out in the meantime a reasonable way to keep multiple possibilities consistent. --- Jura 10:40, 19 July 2021 (UTC)
Support in the API for external reviewers' decisions
- see Phabricator T285849#7220398
- Salgo60 (talk) 10:03, 19 July 2021 (UTC)
- I spoke a little bit about this yesterday at LD4, see the video - Salgo60 (talk) 04:38, 22 July 2021 (UTC)
Bad link
I think your placeholder link is meant to be: https://mismatch-finder.toolforge.org/ - Fuzheado (talk) 20:03, 21 June 2021 (UTC)
- ah, yes. Thanks! -Mohammed Sadat (WMDE) (talk) 07:16, 22 June 2021 (UTC)
A few questions
Would the mismatches store have an API that I (or someone else) could submit potential mismatches to? Would the mismatches system automatically determine which values failed to match between Wikidata and an external database? E.g. would I write a script to query Wikidata and an external database (let's say IGDB) to find any video games that have a different release date on one vs the other, or would I write a script to dump every video game on IGDB with its release date, and then the mismatches system finds mismatches itself?
My other question: I run vglist.co, which is a website that pulls a lot of data for video games from Wikidata. Would you expect that I could implement a "Report Issue" feature on my site where the user would be able to report a data problem that'd forward it to this mismatches system? e.g. a user sees that the release date for Super Mario Bros is wrong, reports that, and then my site would forward that report into the mismatches system (although I'm not sure how useful that'd be, since the report would only have the release date from vglist, which would be identical to the one in Wikidata)?
Thanks for all your work as usual, this tool looks like it could be very useful :) Nicereddy (talk) 04:47, 30 June 2021 (UTC)
- For your first question: you'll have a way to upload a CSV with the mismatches. I'm not yet sure if we'll be able to have that in the first version or if we'll go with opening a ticket in phabricator and then we upload it initially. Either way there will be a way to do that in the future.
- For your second question: that's an interesting use case. I think it should be possible with a few hacks. We could for example abuse the mismatching value and make it say "user reported a mistake on vglist.co" or so. Would be cool! --Lydia Pintscher (WMDE) (talk) 10:49, 7 July 2021 (UTC)
Interesting approach. Looks promising. Just a few points:
- Formatting tweaks:
- If "184746" is the QID (e.g. Q184746), the "Q" should be included.
- Property labels should match those on Wikidata ("date of birth", not "Date of birth", see Property:P569)
- It's unclear where the dates in the sample would link to
- Missing info:
- The key used to match Wikidata and the external reference should be included
- provide a link to the external resource (based on a formatter URL, see phab:T285851#7220098)
- If mismatches based on different catalogues are presented, the catalogue should be identified too
- Note that we could have multiple catalogues with "VIAF ID" as key, and those don't necessarily contain data from VIAF
- Status:
- About various options for "status", see Wikidata:Project_chat#Fun_with_Mismatches:_typology.
- clickable icons to select some status might help: e.g.
- next to values from Wikidata: incorrect, preferred, conflation
- next to values from External source: incorrect, preferred, conflation
- between the two: both are equally correct
- next to key: key mismatch
- For "upload" to work:
- each catalogue should have information associated with it to populate references (see phab:T285851#7220098).
- a way to map values of the external source to Wikidata items should be provided (e.g. every value "male" from a catalogue → Q6581097). The screen should have a link to go there.
- There should probably be also a screen to view mismatches from a given catalogue only
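The value-mapping point above (every external "male" mapped to Q6581097) could be as simple as a per-catalogue lookup table. A tiny Python sketch; the function name is my own, and the map just extends the example given above:

```python
from typing import Optional

# Per-catalogue value map, following the example above: raw external
# strings are translated to Wikidata item IDs before values are compared.
GENDER_MAP = {"male": "Q6581097", "female": "Q6581072"}

def map_external_value(raw: str, value_map: dict) -> Optional[str]:
    # Normalize and look up; None signals a value that still needs a mapping.
    return value_map.get(raw.strip().lower())

print(map_external_value("Male", GENDER_MAP))  # Q6581097
```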
Detection of circular mismatches and historical edits with detected mismatches
I'm looking forward to Mismatch Finder being available for further testing. One question I have is: how will Mismatch Finder help in monitoring for circular or past conflicting changes caused by two or more external databases having conflicting values for the same Wikidata item, each changing the Wikidata value but not their own? Since not all changes are made through Mismatch Finder, will there be any analysis of an item's change history (including merged-item history), with checks for past values and sources that conflict? It is important to bring those to users' attention so that a more complex situation can be highlighted for further investigation, preventing circular changes due to system-assisted editing. Wolfgang8741 (talk) 14:33, 29 July 2021 (UTC)
- At least initially the tool will unfortunately not be able to deal very well with this. But I will note it down as something to figure out. Thanks! -- Lydia Pintscher (WMDE) (talk) 16:13, 18 October 2021 (UTC)
Mismatch from Wikipedia
Hi! Firstly, thank you for working on this. I have no idea how I missed this project for so long in the weekly updates. I see the latest update on the site is from June; maybe it's time for an overhaul of the pages? :)
Now, down to business: I see in the sections above that I can import mismatches from a script. As a matter of fact, I have such a script identifying dob/dod differences between Wikidata and RO.wp. My questions are:
- Are Wikipedia articles a valid datasource for this project?
- Can I already import my data somewhere? If yes, could you point me to some docs?
Thank you. Strainu (talk) 18:46, 16 October 2021 (UTC)
- Yeah the page needs an overhaul by now :D
- I would say Wikipedias can definitely be considered, yes.
- Right now you unfortunately cannot upload your data yet, because we are not quite ready for it. But the instructions are already here: https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md -- Lydia Pintscher (WMDE) (talk) 16:15, 18 October 2021 (UTC)
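For orientation, an import file is a plain CSV along these lines. The exact column set is defined in the UserGuide linked above and has changed over time, so treat the columns below as illustrative only; the values are invented:

```csv
item_id,statement_guid,property_id,wikidata_value,external_value,external_url
Q184746,Q184746$7200d1ab-feb9-4a8e-ab41-a5ed9fb16893,P569,1934-10-23,1934-10-22,https://example.org/record/123
```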
Can we start using the tool?
I have some mismatches between Wikidata and Nobelprize.org, in data about Nobel Prize winners, that I would like to use this tool to track. Please let me know how to do that.
- errors found and documented see T300428#7661347
- Salgo60 (talk) 23:10, 4 February 2022 (UTC)
- We are ironing out a few remaining issues but should be ready to go in the next days. Lydia Pintscher (WMDE) (talk) 10:57, 5 February 2022 (UTC)
Paraphrasing concerns from Tagishsimon
Some interesting points are made in this Twitter thread, by User:Tagishsimon; to paraphrase:
- where can we see which catalogues Mismatch Finder checks against?
- [bug] - "if I throw 275 items at it, I get a 431 error. If I throw 500 items at it ... does nothing at all."
- no integration with tools [like] Petscan, QuickStatements, Mix'n'Match, WDQS, or pagepile, etc.
But do read the whole thread. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:11, 23 February 2022 (UTC)
- Thanks Andy,
- There is definitely still a lot that can be improved about the Mismatch Finder. We wanted to get this out now to hear where additional work makes sense and is most needed. So far I've definitely heard discoverability of the data. I'm thinking of potentially solving this with a button to get random mismatches and/or a clearer list of the external sources that are currently in the store. You can see the most recent uploads of mismatches here: https://mismatch-finder.toolforge.org/store/imports but that can certainly be improved more.
- As for integration in other tools: I am not yet sure what that would look like, but I'd love to hear if someone has concrete ideas/requests.
- As for it doing nothing: that shouldn't happen, and we'll look into it.
- In general the tool has a bit of a cold-start problem because initially we don't have a ton of mismatches uploaded into the tool. We hope that will change over the next days and weeks. (We're already in touch with a few people who said they'd be willing and are able to provide us with mismatches for the Mismatch Finder.)
- Cheers Lydia Pintscher (WMDE) (talk) 15:49, 23 February 2022 (UTC)
- Does this work now?
- I got an email from alan.ang@wikimedia.de, Partner Manager (Wikidata) "in case you are having mismatch issues between your data and that of Wikidata’s, you may wish to check out the Mismatch Finder tool (see attached). With Mismatch Finder, you will be able to inform Wikidata editors of the mismatches between your data and that of Wikidata’s. Editors will then be able to reconcile these mismatches that eventually improve the quality of the data in your projects."
- But https://mismatch-finder.toolforge.org/random says "There are currently no mismatches available for review"
- Can I submit mismatches? Is it at https://mismatch-finder.toolforge.org/store/imports?
- A lot of the CSVs at https://mismatch-finder.toolforge.org/store/imports are red
- To sum up, the tool still appears to be problematic. (cc @Pigsonthewing @Salgo60)
- Cheers! Vladimir Alexiev (talk) 20:59, 26 January 2023 (UTC)
- Sorry for only getting back to you now.
- Yes Alan reached out to various people who might be in a position to provide mismatches because they for example have internal quality assurance processes for Wikidata's data.
- When you looked at it all previously uploaded mismatches had expired. We have new uploads now and they are going to be monthly. We are working on getting more.
- The imports page was mostly red because there was an upload that wasn't yet adapted to the new upload CSV format. It has now been adjusted as well.
- Yes you can upload mismatches. I'll need to add your account to the allow list first. (We put that in place for the beginning to have a bit more control over what goes into the system initially.) Would you like me to? Alternatively you can also send the CSV to me and I will handle the upload. I'll also document it a bit better that this is necessary. Lydia Pintscher (WMDE) (talk) 15:30, 9 February 2023 (UTC)
What imports are set up already?
Is there a list of imports that are done one-off or on a continuous basis?
For example, I am sitting on a huge amount of potential mismatches, has anyone imported those? Magnus Manske (talk) 09:37, 6 December 2023 (UTC)
- Hi @Magnus Manske,
- You can see the latest uploads at https://mismatch-finder.toolforge.org/store/imports. As you can see, only Mike Peel has so far set up regular uploads, covering mismatches between English Wikipedia and Wikidata. I think your additional mismatches would be very useful. Are you interested in uploading them yourself? Alternatively, we have a student team starting to work on getting more mismatches early next year, and it might be a good starter task for them to get your data into the right format and upload it. Lydia Pintscher (WMDE) (talk) 16:56, 8 December 2023 (UTC)
- I made some views in the mix'n'match DB, and expose them as JSON:
- duplicate_items is a list of potentially duplicate items
- mismatched_items is a list of, well, more potentially duplicate items
- time_mismatch is a list of items that have different time values (usually birth/death) with a source
There is one more (multiple values for an external ID property on WD), but that can be found better with SPARQL these days... Let me know if this works for you, and whether you prefer the Toolforge DB views instead. --Magnus Manske (talk) 11:18, 17 January 2024 (UTC)
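For reference, the multiple-values check Magnus mentions is a standard SPARQL aggregation; a sketch against WikiTree person ID (P2949):

```sparql
# Items with more than one value for an external-ID property (here P2949).
SELECT ?item (COUNT(?value) AS ?count) WHERE {
  ?item wdt:P2949 ?value .
}
GROUP BY ?item
HAVING (?count > 1)
```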
- @Magnus Manske Thank you! We'll have a look. Lydia Pintscher (WMDE) (talk) 11:32, 17 January 2024 (UTC)
connection to Global Fact Sync?
Did this work synchronize watches with GlobalFactSync? Sj (talk) 18:02, 1 August 2024 (UTC)
Wikimedia news article
I was looking at the report on unveiling discrepancies at https://tech-news.wikimedia.de/en/2024/08/13/unveiling-discrepancies-first-experiences-with-finding-mismatches-on-wikidata-and-how-you-can-too/
The report is interesting, but only slightly, because there is no attempt to determine what the causes of the discrepancies are. Given that several of the sources are about people, I postulated that most of the mismatches were the result of incorrectly conflating information about two different people, due to over-eager matching between identifying information about a person in Wikidata and in an external source. I have seen this problem in several places before, including DBpedia. Given that bots are an important source of information in Wikidata, I further postulated that many of these mismatches would be the result of bots.
Before attempting to contact the people who did the work for more information I looked into the sole example on the page - Athiel Mbaha (Q446773), who has a FIDE ID in Wikidata and a birth date in Wikidata that does not match the FIDE birth date for the chess player with that FIDE ID. Athiel Mbaha (Q446773) also has a lot of football-related information, which certainly does not increase the probability of being a chess player. What I found is that Athiel Mbaha (Q446773) is both a footballer and a chess player but that the B-date (2008) on his FIDE page at https://ratings.fide.com/profile/15201759 appears to be wrong.
So this example doesn't support my postulations. It does, however, point to a problem that tools like the Mismatch Finder have: it is tempting to assume that sources like FIDE are correct and Wikidata is wrong, but this is not always the case. Instead of comparing Wikidata to a single source, the tool should be set up to compare Wikidata with a variety of sources. That would allow for a much better determination of where the error is. The mismatch tool should also show any reference information from Wikidata.
I spent quite a bit of time finding that Wikidata appears to be correct and FIDE appears to be incorrect. I don't have the ability to fix FIDE data. Is there a way to record the full results of my investigation so that the next time a mismatch on this fact shows up my results also show up? I tried to find a mismatch for Athiel Mbaha (Q446773) to see how the tool works but https://mismatch-finder.toolforge.org/results?ids=Q446773 shows no mismatches and there doesn't appear to be any way to see the history of mismatches for Q446773.
The tool also does not appear to have any way of saying exactly what information in Wikidata is wrong, if the problem is in Wikidata. Users should instead be able to point to edits made to fix the problem on Wikidata, which would, for example, distinguish between wrong birth-date information, wrong external-ID information, and other causes of the kind of mismatch found for Athiel Mbaha (Q446773). Peter F. Patel-Schneider (talk) 19:52, 19 August 2024 (UTC)