A Machine Learning Approach For Data Quality Control of Earth Observation Data Management System
semi-supervised, unsupervised, and reinforcement learning are utilized to monitor, manage, and improve data quality. They can help check data completeness, validate data, identify latency, standardize data, remove duplicates, and offer suggestions for system configuration [12], and have been developed in commercial data quality software such as IBM InfoSphere® QualityStage® and Syncsort Trillium.

SCDR includes many processes to check the quality of earth observation data in various formats, structures, and sizes derived from multiple sources [9]. The current method is time-consuming, inefficient, and requires human intervention. For example, when an exception of missing data occurs, the system administrator needs to identify the root causes, contact the data provider, fix the issues, backlog the missing files, and notify data users. If the issue is not reported and handled in a timely manner, it causes longer data latency and may even result in the missing data being lost permanently. Therefore, a machine learning method is put forward to automate and improve the set of defined processes for data quality control.

As seen in Figure 1, the Random Forest algorithm is leveraged to build a machine learning model. The model takes the outputs of the data quality rule checks as its input features, and is trained and tested against known results before deployment. In addition, the model generates quality reports, raises alarms for new quality issues, and activates the defined processes.
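As a rough illustration of this setup, the sketch below trains a Random Forest classifier on the outputs of the quality-rule checks, using known issue categories as labels, and evaluates it on a held-out test set before deployment. It assumes scikit-learn; the feature names, label values, and input file are hypothetical, since the paper does not specify them.

```python
# Minimal sketch (not the authors' implementation): a Random Forest
# trained on the outputs of the data quality rule checks.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each row is one ingested file; columns are rule-check outputs (hypothetical names).
rules_output = pd.read_csv("rule_check_results.csv")  # hypothetical file
feature_cols = ["completeness_flag", "checksum_ok",
                "zero_byte", "latency_hours", "is_duplicate"]
features = rules_output[feature_cols]
labels = rules_output["issue_category"]  # e.g. "missing", "bad", "late", "none"

# Train and test against known results before deployment.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```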
Domain knowledge of each earth observation data product is required to create data quality rules. The commonly encountered data quality issues and the built-in quality rules include the following (a simplified sketch of these checks is given after the list):
1) Data completeness
The completeness check verifies whether any files are missing according to the observation continuity, channels, or other information. Data gaps caused by satellite maintenance or anomalies are ignored.
2) Bad data
In the download process, files that fail checksum validation or have zero-byte size are labelled as bad and quarantined in a specified location. The same applies to files for which metadata retrieval fails, or whose metadata information is wrong, during ingestion.
3) Data latency
System outages or maintenance of either the data providers or SCDR delay data availability. The latency threshold of each collected product is set based on its average latency.
4) Data duplication
The checksum value stored in the database is used to check whether the data already exists in the system. For different versions of the same data, a version control policy determines which ones are kept.
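For concreteness, the following sketch shows how a few of these built-in checks might look in code. The thresholds, checksum algorithm, and function names are assumptions, not details given in the paper.

```python
# Illustrative rule checks; thresholds, paths, and the checksum store
# are hypothetical stand-ins for details not given in the paper.
import hashlib
import os
from datetime import datetime, timedelta

def is_bad_file(path: str, expected_md5: str) -> bool:
    """Bad data: zero-byte files or files that fail checksum validation."""
    if os.path.getsize(path) == 0:
        return True
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() != expected_md5

def is_late(observed_at: datetime, received_at: datetime,
            threshold_hours: float) -> bool:
    """Data latency: compare a product's latency against its threshold,
    which the paper sets from the product's average latency."""
    return received_at - observed_at > timedelta(hours=threshold_hours)

def is_duplicate(md5: str, known_checksums: set) -> bool:
    """Data duplication: the checksum already exists in the database."""
    return md5 in known_checksums
```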
When data quality issues occur, the machine learning model reports them to the data quality dashboard, notifies the responsible person, and activates the related processes to handle them (a minimal dispatch sketch follows the list). The defined processes are given in order of importance:
1) Correcting data: correct bad files, such as a wrong date and time in the file name;
2) De-duplicating data: remove duplicated data from the data repository;
3) Re-pulling data: obtain data from the original data providers or other sources to fill gaps caused by outages, bad data, or other reasons;
4) Restoring ingestion: restore data ingestion interrupted by outages;
5) Reconfiguring system: use backup system configurations to balance the load of data pulling and ingestion among hosts and disks when a system outage causes long data latency;
6) Restarting system: restart the database and the routine jobs of data collection and ingestion if necessary.
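The sketch below shows one plausible way the model's predicted issue category could activate these processes in the order listed; the category names, dispatch table, and example file name are illustrative assumptions, not the system's actual interfaces.

```python
# Hypothetical dispatch from a predicted issue category to the defined
# processes, in the priority order given above. Process names are
# placeholders for the system's actual remediation jobs.
HANDLERS = {
    "bad_data":  ["correct_data", "re_pull_data"],
    "duplicate": ["de_duplicate_data"],
    "missing":   ["re_pull_data"],
    "outage":    ["restore_ingestion", "reconfigure_system", "restart_system"],
}

def handle_issue(issue_category: str, file_id: str) -> None:
    """Report the issue, notify the responsible person, and activate
    the related processes for one affected file."""
    print(f"dashboard: {issue_category} detected for {file_id}")  # report
    print(f"alert: notifying data steward about {file_id}")       # notify
    for process in HANDLERS.get(issue_category, []):
        print(f"running process '{process}' on {file_id}")        # activate

handle_issue("bad_data", "example_granule_20220915T2233.nc")  # hypothetical file
```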
The built-in rules cannot cover all quality issues. When an issue is marked as new, a new quality rule needs to be built manually, and the model is retrained to learn how to handle the new issue. The machine learning method greatly reduces the complexity and manual effort of data quality control in the SCDR system.
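A plausible (assumed) retraining step is sketched below: once the new rule produces labelled examples, they are appended to the existing training set and the Random Forest is refit. File and column names mirror the earlier hypothetical sketch.

```python
# Assumed retraining step after a new rule yields labelled examples.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.read_csv("rule_check_results.csv")    # existing labelled data (hypothetical)
new_cases = pd.read_csv("new_issue_examples.csv")  # output of the new rule (hypothetical)
training = pd.concat([history, new_cases], ignore_index=True)

feature_cols = ["completeness_flag", "checksum_ok",
                "zero_byte", "latency_hours", "is_duplicate"]
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(training[feature_cols], training["issue_category"])
```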
3. CONCLUSIONS

Applying machine learning in the management system of earth observation data is an emerging topic, and the challenges are constantly evolving. This paper presents a method of applying machine learning to complement built-in quality rules and automate data quality control with fewer manual and time-consuming tasks. Implementation of an automated and intelligent management system for big earth observation data still requires a long-term plan. We will investigate applications of machine learning to automate and optimize the functions of data cataloging, metadata management, and data dissemination.
4. ACKNOWLEDGEMENTS
5. REFERENCES

[3] L. Franchina and F. Sergiani, "High Quality Dataset for Machine Learning in the Business Intelligence Domain", in Proceedings of SAI Intelligent Systems Conference, Cham, 2019, pp. 391-401.
[4] T. Cagala and D. Bundesbank, "Improving data quality and closing data gaps with machine learning", IFC Bulletins chapters, 46, 2017.
[5] A. L. Ali and F. Schmid, "Data quality assurance for volunteered geographic information", in International Conference on Geographic Information Science, Cham, 2014, pp. 126-141.
[6] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biessmann, and A. Grafberger, "Automating large-scale data quality verification", Proceedings of the VLDB Endowment, 2018, pp. 1781-1794.
[7] A. Parthy, L. Silberstein, E. Kowalczyk, J. P. High, A. Nagarajan, and A. Memon, "Using machine learning to recommend correctness checks for geographic map data", in Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, 2019, pp. 223-232.
[8] W. Han and J. Brust, "Central satellite data repository supporting research and development", AGU Fall Meeting, 2015.
[9] W. Han and M. Jochum, "Near real-time satellite data quality monitoring and control", 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, 2016, pp. 206-209.
[10] W. Han and M. Jochum, "Assessing a central satellite data repository and its usage statistics", 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, 2018, pp. 6528-6531.
[11] W. Han and M. Jochum, "Latency analysis of large volume satellite data transmissions", 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, 2017, pp. 383-387.
[12] W. Dai, K. Yoshigoe, and W. Parsley, "Improving data quality through deep learning and statistical models", in Information Technology-New Generations, 2018, pp. 515-522.