A Machine Learning Approach For Data Quality Control of Earth Observation Data Management System
semi-supervised, unsupervised, and reinforcement learning are utilized to monitor, manage, and improve data quality. They can help check data completeness, validate data, identify latency, standardize data, remove duplicates, and offer suggestions for system configuration [12], and have been developed in commercial data quality software such as IBM InfoSphere® QualityStage® and Syncsort Trillium.

SCDR includes many processes to check the quality of earth observation data in various formats, structures, and sizes derived from multiple sources [9]. The current method is time-consuming, inefficient, and requires human intervention. For example, when an exception of missing data occurs, the system administrator needs to identify the root causes, contact the data provider, fix the issues, backlog the missing files, and notify data users. If the issue is not reported and handled in a timely manner, it causes longer data latency and may even result in the missing data being lost permanently. Therefore, a machine learning method is put forward to automate and improve the set of defined processes for data quality control.

As seen in Figure 1, the Random Forest algorithm is leveraged to build a machine learning model. The model takes the outputs of the data quality rule checks as its input features, and is trained and tested against known results before deployment. In addition, the model generates quality reports, raises alarms for new quality issues, and activates the defined processes.
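As a rough illustration of this setup, the sketch below trains a Random Forest classifier on the outputs of the quality-rule checks, using known issue categories as labels, and evaluates it on a held-out test set before deployment. It assumes scikit-learn; the feature names, label values, and input file are hypothetical, since the paper does not specify them.

```python
# Minimal sketch (not the authors' implementation): a Random Forest
# trained on the outputs of the data quality rule checks.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each row is one ingested file; columns are rule-check outputs (hypothetical names).
rules_output = pd.read_csv("rule_check_results.csv")  # hypothetical file
feature_cols = ["completeness_flag", "checksum_ok",
                "zero_byte", "latency_hours", "is_duplicate"]
features = rules_output[feature_cols]
labels = rules_output["issue_category"]  # e.g. "missing", "bad", "late", "none"

# Train and test against known results before deployment.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```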
Domain knowledge of each earth observation data product is required to create data quality rules. The commonly encountered data quality issues and the built-in quality rules include the following (a simplified sketch of these checks is given after the list):
1) Data completeness
The completeness check verifies whether any files are missing according to the observation continuity, channels, or other information. Data gaps caused by satellite maintenance or anomalies are ignored.
2) Bad data
In the download process, files that fail checksum validation or have zero-byte size are labelled as bad and quarantined in a specified location. The same applies to files for which metadata retrieval fails, or whose metadata information is wrong, during ingestion.
3) Data latency
System outages or maintenance of either the data providers or SCDR delay data availability. The latency threshold of each collected product is set based on its average latency.
4) Data duplication
The checksum value stored in the database is used to check whether the data already exists in the system. For different versions of the same data, a version control policy determines which ones are kept.
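For concreteness, the following sketch shows how a few of these built-in checks might look in code. The thresholds, checksum algorithm, and function names are assumptions, not details given in the paper.

```python
# Illustrative rule checks; thresholds, paths, and the checksum store
# are hypothetical stand-ins for details not given in the paper.
import hashlib
import os
from datetime import datetime, timedelta

def is_bad_file(path: str, expected_md5: str) -> bool:
    """Bad data: zero-byte files or files that fail checksum validation."""
    if os.path.getsize(path) == 0:
        return True
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() != expected_md5

def is_late(observed_at: datetime, received_at: datetime,
            threshold_hours: float) -> bool:
    """Data latency: compare a product's latency against its threshold,
    which the paper sets from the product's average latency."""
    return received_at - observed_at > timedelta(hours=threshold_hours)

def is_duplicate(md5: str, known_checksums: set) -> bool:
    """Data duplication: the checksum already exists in the database."""
    return md5 in known_checksums
```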
When data quality issues occur, the machine learning model reports them to the data quality dashboard, notifies the responsible person, and activates the related processes to handle them (a minimal dispatch sketch follows the list). The defined processes are given in order of importance:
1) Correcting data: correct bad files, such as a wrong date and time in the file name;
2) De-duplicating data: remove duplicated data from the data repository;
3) Re-pulling data: obtain data from the original data providers or other sources to fill gaps caused by outages, bad data, or other reasons;
4) Restoring ingestion: restore data ingestion interrupted by outages;
5) Reconfiguring system: use backup system configurations to balance the load of data pulling and ingestion among hosts and disks when a system outage causes long data latency;
6) Restarting system: restart the database and the routine jobs of data collection and ingestion if necessary.
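The sketch below shows one plausible way the model's predicted issue category could activate these processes in the order listed; the category names, dispatch table, and example file name are illustrative assumptions, not the system's actual interfaces.

```python
# Hypothetical dispatch from a predicted issue category to the defined
# processes, in the priority order given above. Process names are
# placeholders for the system's actual remediation jobs.
HANDLERS = {
    "bad_data":  ["correct_data", "re_pull_data"],
    "duplicate": ["de_duplicate_data"],
    "missing":   ["re_pull_data"],
    "outage":    ["restore_ingestion", "reconfigure_system", "restart_system"],
}

def handle_issue(issue_category: str, file_id: str) -> None:
    """Report the issue, notify the responsible person, and activate
    the related processes for one affected file."""
    print(f"dashboard: {issue_category} detected for {file_id}")  # report
    print(f"alert: notifying data steward about {file_id}")       # notify
    for process in HANDLERS.get(issue_category, []):
        print(f"running process '{process}' on {file_id}")        # activate

handle_issue("bad_data", "example_granule_20220915T2233.nc")  # hypothetical file
```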
The built-in rules cannot cover all quality issues. When an issue is marked as new, a new quality rule needs to be built manually, and the model is retrained to learn how to handle the new issue. The machine learning method greatly reduces the complexity and manual effort of data quality control in the SCDR system.
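A plausible (assumed) retraining step is sketched below: once the new rule produces labelled examples, they are appended to the existing training set and the Random Forest is refit. File and column names mirror the earlier hypothetical sketch.

```python
# Assumed retraining step after a new rule yields labelled examples.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.read_csv("rule_check_results.csv")    # existing labelled data (hypothetical)
new_cases = pd.read_csv("new_issue_examples.csv")  # output of the new rule (hypothetical)
training = pd.concat([history, new_cases], ignore_index=True)

feature_cols = ["completeness_flag", "checksum_ok",
                "zero_byte", "latency_hours", "is_duplicate"]
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(training[feature_cols], training["issue_category"])
```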
3. CONCLUSIONS

Applying machine learning in the management system of earth observation data is an emerging topic, and the challenges are constantly evolving. This paper presents a method of applying machine learning to complement built-in quality rules and automate data quality control with fewer manual and time-consuming tasks. Implementation of an automated and intelligent management system for big earth observation data still requires a long-term plan. We will investigate applications of machine learning to automate and optimize the functions of data cataloging, metadata management, and data dissemination.
4. ACKNOWLEDGEMENTS
5. REFERENCES

[3] L. Franchina and F. Sergiani, "High Quality Dataset for Machine Learning in the Business Intelligence Domain", in Proceedings of SAI Intelligent Systems Conference, Cham, 2019, pp. 391-401.
[4] T. Cagala and D. Bundesbank, "Improving data quality and closing data gaps with machine learning", IFC Bulletins chapters, 46, 2017.
[5] A. L. Ali and F. Schmid, "Data quality assurance for volunteered geographic information", in International Conference on Geographic Information Science, Cham, 2014, pp. 126-141.
[6] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biessmann, and A. Grafberger, "Automating large-scale data quality verification", Proceedings of the VLDB Endowment, 2018, pp. 1781-1794.
[7] A. Parthy, L. Silberstein, E. Kowalczyk, J. P. High, A. Nagarajan, and A. Memon, "Using machine learning to recommend correctness checks for geographic map data", in Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, 2019, pp. 223-232.
[8] W. Han and J. Brust, "Central satellite data repository supporting research and development", AGU Fall Meeting, 2015.
[9] W. Han and M. Jochum, "Near real-time satellite data quality monitoring and control", 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, 2016, pp. 206-209.
[10] W. Han and M. Jochum, "Assessing a central satellite data repository and its usage statistics", 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, 2018, pp. 6528-6531.
[11] W. Han and M. Jochum, "Latency analysis of large volume satellite data transmissions", 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, 2017, pp. 383-387.
[12] W. Dai, K. Yoshigoe, and W. Parsley, "Improving data quality through deep learning and statistical models", in Information Technology-New Generations, 2018, pp. 515-522.