Data Curation and Management
College of Informatics
Department of Information Science
Data Science and Analytics Postgraduate Program
Prepared By:
1. Mohammed Seid
Table of Contents
Introduction
Significance
Key components of data curation and management
Data governance:
Benefits of Data Governance
Data storage
Data quality
Data integration
Data preservation
Data Preservation Planning:
Data sharing and reuse
Data ownership and intellectual property
Data curation:
Data security
Some common challenges in data curation and management include:
Tools and technologies
Real-world examples for data curation and management framework
The National Oceanic and Atmospheric Administration (NOAA)
The Cancer Genome Atlas (TCGA)
The Human Cell Atlas (HCA)
The European Open Science Cloud (EOSC)
Related work Review Summary
Conclusion
Reference
Data Curation And management 2015
E.C
Introduction
A data curation and management framework combines two concepts: data curation and data
management. On the first, “Data curation is the active and on-going
management of data through its lifecycle of interests and usefulness to scholarship, science, and
education. Data curation activities enable data discovery and retrieval, maintain its quality, add
value, and provide re-use over time, and this new field includes authentication, archiving,
management, preservation, retrieval, and representation” [1]. Data curation is the process by
which data are put into a state to be managed so that they can be understood and used by parties
across disciplines and organizations. The passage of time should not prohibit the use of the data.
This requires that appropriate measures be undertaken to ensure data infrastructure, searchability,
availability, and preservation [2].
On the other hand, BusinessDictionary.com defines data management as the “administrative
process by which the required data is acquired, validated, stored, protected, and processed, and
by which its accessibility, reliability, and timeliness is ensured to satisfy the needs of data users”
(Business Dictionary.com, 2012). Data must also be continuously extended, updated, and made
secure for reuse through consultation (Otlet, 1903; Bush, 1945; Rayward, 1998) and “discursive
formation” (Foucault, 1972) to remain useful to science [1].
A data curation and management framework is a structured approach that guides the
organization, storage, preservation, and sharing of data throughout its lifecycle. It comprises
policies, procedures, and technologies that let an organization control its data and keep it
accurate, reliable, secure, and available when needed, while protecting its integrity and
confidentiality. The framework makes data discoverable, accessible, and intelligible, and
supports data reuse. This, in turn, helps organizations to make better decisions, improve their
operations, and achieve their goals [3].
Significance
A data curation and management framework offers several benefits that can substantially affect
an organization's operations and success.
1. Improved Data Quality: A data curation and management framework can help to
improve the quality of data by ensuring that it is accurate, complete, and consistent. This
can help to ensure that decisions are made based on reliable data, leading to better
outcomes and improved organizational performance.
2. Increased Efficiency: A data curation and management framework can help to
increase organizational efficiency by making data more easily accessible and reducing
the time and effort required to find and analyze data.
3. Cost Reduction: A data curation and management framework can help to reduce costs
by streamlining data management processes, reducing data redundancy, and
improving data quality. This can lead to cost savings in areas such as storage, data
management, and data analysis.
4. Improved Decision-Making: A data curation and management framework can help to
improve decision-making by providing accurate and relevant data in a timely manner.
This can help organizations to make informed decisions that are based on data-driven
insights.
5. Enhanced Data Security: A data curation and management framework can help to
enhance data security by implementing appropriate data security measures such as access
controls, encryption, and data backup and recovery. This can help to prevent data
breaches and other security incidents that can lead to significant financial, legal,
and reputational damage.
6. Compliance with Regulations: A data curation and management framework can help
organizations to comply with relevant regulations such as data privacy laws, data security
standards, and industry-specific regulations. This can help to avoid legal and regulatory
penalties and ensure that the organization is operating in a responsible and ethical
manner.
Key components of data curation and management
Data governance:
Data governance is the process of establishing policies, procedures, and standards to guide data
management activities. It specifies a cross-functional framework for managing
data as a strategic enterprise asset. In doing so, data governance specifies decision rights and
accountabilities for an organization’s decision making about its data. Furthermore, data
governance formalizes data policies, standards, and procedures and monitors compliance.
Data storage
Data storage means implementing appropriate storage solutions to ensure data availability,
security, and scalability. It is a key component of any data curation and management
framework. Some of the main considerations for data storage include:
1. Storage media: The medium on which the data is kept. Some options include:
o Hard drives: Conventional spinning hard drives are cheap but slower. Solid state
drives are faster but more expensive. For large data storage, hard drive arrays (RAID)
are often used.
o Tape: Tape storage is very cheap but slow to access. It suits archival
storage of large amounts of data that are not accessed frequently.
o Cloud storage: Cloud storage services like Amazon S3 and Google Cloud Storage are
inexpensive, scalable, and redundant. However, data stored in the cloud may raise
security and privacy concerns.
2. Storage architecture: The storage infrastructure can be designed for high performance,
scalability, redundancy, etc. Some options include:
o Direct attached storage: Storage directly attached to a server. Simple but limited
in scalability.
o Storage area network: A separate storage network with storage arrays and storage
devices. Provides more flexibility and scalability.
o Object storage: A distributed storage architecture optimized for storing and
managing large amounts of unstructured data. Scalable and redundant. Used by
cloud storage services.
3. Storage redundancy: To ensure high availability and prevent data loss, storage
redundancy is important. This can include:
o RAID: Uses multiple hard drives to duplicate and spread data across the drives.
Protects against drive failures.
o Replication: Duplicating data across multiple storage systems in different
locations. Protects against system failures.
o Erasure coding: A method of breaking up data into fragments, expanding and
encoding the data with redundant fragments, and storing them across different
locations. Also provides redundancy.
o Snapshots: Used to create point-in-time copies of data that can be restored if the
current data gets corrupted or lost.
4. Storage management: Software tools are needed to manage and monitor storage
systems and capacity. These include solutions for backup, archiving, hierarchical
storage management, and storage optimization.
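The redundancy idea behind RAID parity and erasure coding can be illustrated with a minimal
sketch. It assumes a single XOR parity fragment computed over equally sized data fragments
(similar in spirit to RAID 5, and much simpler than the codes real systems use). The function
names xor_parity and recover_fragment are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_parity(fragments: list[bytes]) -> bytes:
    """Compute one parity fragment over equally sized data fragments."""
    return reduce(xor_bytes, fragments)


def recover_fragment(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing fragment from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Three data fragments, as if striped across three drives (illustrative data).
fragments = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_parity(fragments)

# Simulate losing the second drive and rebuilding its contents.
rebuilt = recover_fragment([fragments[0], fragments[2]], parity)
```

With one parity fragment, any single lost fragment can be rebuilt. Tolerating more simultaneous
failures requires more redundant fragments, which is the trade-off erasure coding schemes tune.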
There are several common data storage strategies used for data curation, and the choice of
strategy depends on the type of data, the volume of data, and the intended use of the data. Some
of the most common data storage strategies for data curation include:
NoSQL Databases: NoSQL databases are highly scalable and can handle large volumes of data,
making them an excellent choice for big data applications.
Data Warehouses: Data warehouses are specialized databases that store large volumes
of structured data for analytical purposes. They are optimized for complex
queries and data analysis, and they typically integrate data from multiple sources to
provide a comprehensive view of an organization's data.
Data Lakes: A data lake is a newer data storage strategy that stores data in its raw,
often unstructured, form. It provides a flexible and scalable data storage solution that can
handle large volumes of data from disparate sources. Data lakes are ideal for big data
applications where data is collected for later analysis.
Cloud-Based Storage: Cloud-based storage is a popular option for data curation due to
its scalability, flexibility, and cost-effectiveness. Cloud-based storage solutions allow
organizations to store and manage large volumes of data without the need for expensive
hardware or infrastructure. Additionally, cloud-based storage solutions offer high
availability, data redundancy, and disaster recovery capabilities.
Data quality
Data quality means ensuring the accuracy, consistency, and reliability of data through validation,
cleaning, and enrichment processes. It is a critical aspect of data curation and management.
Inaccurate or incomplete data can lead to incorrect analysis and inappropriate decision-making.
A data curation and management framework should include strategies to ensure data
quality throughout the data lifecycle, from data acquisition to data dissemination. Some common
strategies for ensuring data quality in data curation and management include:
1. Data Profiling: Data profiling involves analyzing the data to identify patterns,
inconsistencies, and errors. This process can help to identify data quality issues and allow
for corrective action to be taken.
2. Data Cleaning: Data cleaning involves correcting or removing errors and inconsistencies
in the data. Data cleaning can be done manually or through automated tools, such as data
quality software.
3. Data Standardization: Data standardization involves ensuring that data is consistent and
conforms to specific standards. Standardizing data can help to improve data accuracy and
reduce errors.
4. Data Validation: Data validation involves checking the data against predefined rules or
constraints to ensure that it meets certain criteria. Data validation can help to identify and
correct errors early in the data lifecycle.
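The four strategies above can be sketched in a few lines of plain Python. This is a minimal
illustration, not a full data quality tool: the sample records, the field names (id, email, age),
and the validation rules (a simple e-mail pattern and an age range of 0 to 120) are all
assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative records; the field names are assumptions for this sketch.
records = [
    {"id": 1, "email": "Abebe@Example.com ", "age": "34"},
    {"id": 2, "email": "sara@example.com", "age": ""},
    {"id": 2, "email": "sara@example.com", "age": ""},  # exact duplicate
    {"id": 3, "email": "not-an-email", "age": "210"},
]


def profile(rows):
    """Profiling: count missing values per field."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value in ("", None):
                missing[field] += 1
    return dict(missing)


def clean(rows):
    """Cleaning: drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def standardize(row):
    """Standardization: trim and lowercase e-mail, convert age to int or None."""
    age = row["age"].strip()
    return {
        "id": row["id"],
        "email": row["email"].strip().lower(),
        "age": int(age) if age.isdigit() else None,
    }


EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(row):
    """Validation: return a list of rule violations for one row."""
    errors = []
    if not EMAIL_RE.match(row["email"]):
        errors.append("invalid email")
    if row["age"] is not None and not 0 <= row["age"] <= 120:
        errors.append("age out of range")
    return errors


missing_counts = profile(records)
cleaned = [standardize(r) for r in clean(records)]
report = {r["id"]: validate(r) for r in cleaned}
```

Here profiling runs on the raw records, so that missing values are counted before cleaning and
standardization change the data. Validation runs last, on the standardized rows.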
Data integration
Data integration is a critical component of data curation and management frameworks. It
involves the process of combining data from multiple sources and formats to create a unified
dataset. The goal of data integration is to provide users with a complete and coherent view of the
data relevant to their needs.
In a data curation and management framework, data integration typically involves several steps.
These may include:
1. Data discovery: This involves identifying the sources of data that need to be integrated.
This can include data from internal databases, external sources, or third-party providers.
2. Data mapping: This involves creating a mapping between the data elements in the
different data sources to identify common elements that can be used to integrate the data.
3. Data transformation: This involves converting data from one format to another to
ensure that it can be integrated with other data sources.
4. Data quality assurance: This involves ensuring that the data is accurate, complete, and
consistent across all integrated sources.
5. Data synchronization: This involves ensuring that the integrated data is kept up to date
and synchronized with the original data sources.
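The mapping, transformation, and merging steps above can be sketched as follows. The two
sources, their field names, and the rule that the first source wins on conflicts are hypothetical
choices made for illustration. A real integration would usually rely on an ETL tool or a
data-frame library.

```python
# Two hypothetical sources describing the same stations with different schemas.
source_a = [
    {"station_id": "S1", "temp_c": 21.5},
    {"station_id": "S2", "temp_c": 18.0},
]
source_b = [
    {"StationCode": "S1", "TempF": 70.7, "Humidity": 40},
    {"StationCode": "S3", "TempF": 59.0, "Humidity": 55},
]

# Data mapping: source field name -> unified field name.
MAPPING_B = {"StationCode": "station_id", "TempF": "temp_c", "Humidity": "humidity"}


def transform_b(row):
    """Data transformation: rename fields and convert Fahrenheit to Celsius."""
    out = {MAPPING_B[k]: v for k, v in row.items()}
    out["temp_c"] = round((out["temp_c"] - 32) * 5 / 9, 1)
    return out


def integrate(rows_a, rows_b):
    """Merge both sources on station_id; source A wins on conflicting fields."""
    unified = {}
    for row in map(transform_b, rows_b):
        unified[row["station_id"]] = dict(row)
    for row in rows_a:
        unified.setdefault(row["station_id"], {}).update(row)
    return unified


unified = integrate(source_a, source_b)
```

Station S1 appears in both sources, so its unified record carries the temperature from source A
and the humidity from source B, while S2 and S3 each come from a single source.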
Data preservation
Data preservation means safeguarding data against loss, degradation, or corruption and ensuring
its long-term accessibility. It is an essential component of a data curation and management
framework. It involves ensuring that data is stored and maintained over a long period of time and
remains accessible, usable, and understandable. Data preservation is necessary to ensure that
data can be used for future research, analysis, and decision-making.
The following are some of the key strategies for data preservation in a data curation and
management framework:
2. Data Storage: Data storage involves choosing appropriate storage technologies that can
accommodate the volume and type of data being preserved. It is essential to choose
storage technologies that are scalable, reliable, and secure.
3. Data Backup: Data backup involves creating copies of the data and storing them in
multiple locations to ensure that data is not lost due to hardware failures, natural
disasters, or other unforeseen events. It is essential to have a backup and recovery plan in
place to ensure that data can be restored in the event of a disaster.
4. Data Migration: Data migration involves transferring data from one storage technology
to another as technology evolves or becomes obsolete. It is crucial to ensure that data
remains accessible and usable during the migration process.
5. Data Access: Data access involves providing access to the data to authorized users over a
long period of time. It is essential to have appropriate access controls in place to ensure
that the data is not accessed or used inappropriately.
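One common way to check that a backup or migrated copy is still intact is to compare checksums
of the original and the copy (often called a fixity check). The sketch below assumes SHA-256 as
the checksum and uses a temporary directory with illustrative file names. The helper names
sha256_of and backup_with_fixity are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_fixity(source: Path, backup_dir: Path) -> bool:
    """Copy a file to backup_dir and confirm the copy is bit-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return sha256_of(source) == sha256_of(target)


# Demonstration in a temporary directory (paths are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "dataset.csv"
    original.write_text("id,value\n1,42\n")
    backup_ok = backup_with_fixity(original, Path(tmp) / "backup")

    # Simulate silent corruption of the backup copy and re-check.
    copy_path = Path(tmp) / "backup" / "dataset.csv"
    copy_path.write_text("id,value\n1,43\n")
    corrupted_detected = sha256_of(original) != sha256_of(copy_path)
```

Storing the digest alongside the data and re-computing it periodically lets a repository detect
corruption long after the original copy operation.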
Data sharing and reuse
The following are some key strategies for data sharing and reuse in a data curation and
management framework:
2. Data Access: Data access should be provided to authorized users to ensure that the data
can be shared and reused. Access controls should be put in place to ensure that data is not
accessed or used inappropriately.
3. Data Standardization: Standardizing data can help to ensure that data is consistent and
can be easily shared and reused across different platforms and
applications. Standardization can include using common file formats, data structures,
and metadata standards.
4. Data Discovery: Data discovery involves making data easy to find and access by others.
This can involve publishing data in repositories or data portals, providing search tools,
and using appropriate metadata standards.
5. Data Licensing: Data licensing involves specifying the terms and conditions under
which data can be shared and reused. Licensing can include specifying the allowable uses
of the data, any restrictions on data sharing, and any attribution requirements.
6. Data Citation: Data citation involves providing a way to acknowledge and credit the
original data creators when data is shared and reused. Data citation can help to ensure that
data creators receive appropriate credit for their work.
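Standardized metadata, licensing, and citation can be brought together in a single descriptive
record. The sketch below is a minimal illustration: the field set is an assumption loosely
modelled on common repository metadata, the citation layout is one simple convention among
many, and the identifier is a placeholder rather than a real DOI.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetRecord:
    """Minimal descriptive metadata for a shared dataset (illustrative fields)."""
    title: str
    creators: list
    year: int
    publisher: str
    identifier: str   # e.g. a DOI; the value used below is a placeholder
    license: str      # terms of reuse, e.g. "CC-BY-4.0"
    file_format: str  # a common, open format eases reuse

    def citation(self) -> str:
        """Format a simple data citation: Creators (Year). Title. Publisher. ID."""
        names = "; ".join(self.creators)
        return f"{names} ({self.year}). {self.title}. {self.publisher}. {self.identifier}"

    def to_json(self) -> str:
        """Serialize the record so a repository or portal can index it."""
        return json.dumps(asdict(self), indent=2)


record = DatasetRecord(
    title="Example Rainfall Observations",
    creators=["Doe, J.", "Roe, A."],
    year=2023,
    publisher="Example Data Repository",
    identifier="doi:10.0000/example",
    license="CC-BY-4.0",
    file_format="text/csv",
)
```

Calling record.citation() yields a ready-made credit line, and record.to_json() yields a
machine-readable description that supports discovery.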
Data ownership and intellectual property
The following are some key strategies for addressing data ownership and intellectual property in
a data curation and management framework:
1. Data Ownership: Data ownership refers to the legal right to control, access, and use the
data. Data ownership may reside with individuals, organizations, or governments,
depending on the circumstances of data creation and collection. It is essential to establish
clear ownership of data to ensure that it is used appropriately.
Data curation:
Data curation means managing and maintaining data, including metadata, to enhance its
discoverability, accessibility, and usefulness for users. It is a key component of data curation
and management frameworks and involves selecting, organizing, and maintaining data to ensure its
accuracy, completeness, and usability. The goal of data curation is to ensure that data is reliable
and relevant for its intended use, and that it is easily accessible to those who need it.
In a data curation and management framework, data curation typically involves several steps.
These may include:
1. Data selection: This involves identifying the data that is most relevant to the project or
research question at hand.
2. Data acquisition: This involves obtaining the data from various sources, such as
databases, data repositories, or external sources.
3. Data cleaning: This involves removing any errors, duplicates, or inconsistencies in the
data to ensure its accuracy.
4. Data integration: This involves combining data from multiple sources and formats to
create a unified dataset.
Data security
Data security is an essential aspect of the data curation and management framework. Data
security involves protecting data from unauthorized access, modification, or destruction. It is
essential to implement appropriate data security measures to ensure that data is not accessed or
used inappropriately.
The following are some key strategies for data security in a data curation and management
framework:
1. Access Controls: Access controls ensure that only authorized users can access the data.
Access controls can include authentication, authorization, and encryption. It is essential
to implement appropriate access controls to prevent unauthorized access to the data.
2. Data Encryption: Data encryption involves converting data into a coded format that can
only be read by authorized users with the correct decryption key. Data encryption can
help to ensure that data is not accessed or used inappropriately.
3. Data Classification: Data classification involves categorizing data into different levels of
sensitivity or confidentiality. Different levels of data classification can have different
access controls and encryption requirements based on their sensitivity.
4. Data Backup and Recovery: Data backup and recovery involves creating copies of the
data and storing them in multiple locations to ensure that data is not lost due to hardware
failures, natural disasters, or other unforeseen events. It is essential to have a backup
and recovery plan in place to ensure that data can be restored in the event of a disaster.
5. Data Auditing: Data auditing involves monitoring data access and use to detect
unauthorized access or use. Data auditing can help to identify and mitigate potential
security threats to the data.
6. Data Retention and Destruction: Data retention and destruction involves establishing
policies for how long data is kept and when it should be destroyed. Data retention and
destruction policies can help to ensure that data is not retained longer than necessary and
is properly disposed of when no longer needed.
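Access controls, data classification, and auditing can be combined in a small sketch. The roles,
classification levels, and dataset names below are hypothetical. A production system would
delegate authentication to an identity provider and write the audit trail to protected,
append-only storage.

```python
from datetime import datetime, timezone
from enum import IntEnum


class Sensitivity(IntEnum):
    """Data classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


# Hypothetical mapping: role -> highest classification that role may read.
ROLE_CLEARANCE = {
    "guest": Sensitivity.PUBLIC,
    "analyst": Sensitivity.INTERNAL,
    "steward": Sensitivity.CONFIDENTIAL,
}

audit_log = []  # kept in memory here purely for illustration


def can_read(role: str, level: Sensitivity) -> bool:
    """Authorization: unknown roles get no access at all."""
    clearance = ROLE_CLEARANCE.get(role)
    return clearance is not None and clearance >= level


def read_dataset(user: str, role: str, dataset: str, level: Sensitivity) -> bool:
    """Check access and record every attempt for later auditing."""
    allowed = can_read(role, level)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


read_dataset("hana", "analyst", "sales_2023", Sensitivity.INTERNAL)
read_dataset("hana", "analyst", "patient_records", Sensitivity.CONFIDENTIAL)
denied = [entry for entry in audit_log if not entry["allowed"]]
```

Logging denied attempts as well as granted ones is what makes the audit trail useful for
spotting attempted unauthorized access.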
Some common challenges in data curation and management include:
1. Lack of time and resources: Data curation requires time and effort, but researchers often
lack sufficient time or funding to devote to curation activities. This can lead to data that is
poorly organized, documented, and preserved.
2. Lack of expertise: Researchers do not always have the expertise required for good data
curation. Activities like metadata creation, digital preservation, and data
documentation require specific knowledge and skills.
3. Data heterogeneity: Research data comes in many different forms, formats, and
structures. Curation solutions that work for one type of data may not work for another.
This heterogeneity makes automation and standardization difficult.
4. Lack of incentives: Researchers are often incentivized to publish papers, but not
necessarily to share or curate their data. This can lead to a lack of motivation to invest in
good data practices.
5. Evolving technology: Technology used in research is constantly changing. This means
data, formats, storage solutions, software, etc. are also changing. Keeping data usable and
accessible over time requires ongoing curation efforts to address technological changes.
6. Privacy and ethical concerns: Some data cannot be shared or reused due to privacy,
confidentiality, or other ethical concerns. Determining how to handle sensitive data in an
ethical way can be challenging.
7. Lack of tools and infrastructure: Good tools and infrastructure for activities like metadata
generation, format migration, storage, access provision, and digital preservation do not
always exist, especially for highly specialized or heterogeneous data types. This makes
curation difficult.
8. Unsustainable practices: Short-term, ad hoc solutions are common in research, but these
approaches do not support long-term data access and reuse. Developing sustainable
practices and infrastructure requires a significant shift in culture and mindset.
9. Lack of standards: The lack of community standards and best practices for data in some
domains makes establishing a reasonable and consistent approach to curation
challenging. Standards help enable sharing, interoperability and reuse.
Tools and technologies
Examples
o NoSQL DBMS: A NoSQL DBMS (Not Only SQL Database Management System) is a type
of database management system that does not use the traditional relational model of
data storage. Instead of using tables with rows and columns, NoSQL databases use
different data models, such as document-oriented, key-value, graph, or column-family
models, to store and manage data. NoSQL DBMSs are designed to handle large volumes
of unstructured or semi-structured data, which may not fit well into the rigid structure
of a relational database. They are also designed to be highly scalable, able to handle
large amounts of data and high levels of traffic, and can be distributed across multiple
servers to improve performance and availability.
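The contrast between the relational model and a document-oriented model can be sketched with
the Python standard library alone. The example uses an in-memory SQLite database for the
relational side and a plain nested dictionary, serialized as JSON, to stand in for a document
store. The table names, field names, and sample order are assumptions made for illustration,
and no particular NoSQL product is implied.

```python
import json
import sqlite3

# Relational model: the order is split across two tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER REFERENCES orders(id),
                         product TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'Sara');
    INSERT INTO items  VALUES (1, 'notebook', 2), (1, 'pen', 5);
""")
rows = conn.execute(
    "SELECT o.customer, i.product, i.qty "
    "FROM orders o JOIN items i ON i.order_id = o.id "
    "ORDER BY i.product"
).fetchall()
conn.close()

# Document model: the same order is one self-contained, nested record,
# as a document-oriented store would hold it (shown here as plain JSON).
document = {
    "_id": 1,
    "customer": "Sara",
    "items": [
        {"product": "notebook", "qty": 2},
        {"product": "pen", "qty": 5},
    ],
    "notes": "gift wrap",  # an extra field needs no schema change
}
serialized = json.dumps(document)
total_qty = sum(item["qty"] for item in document["items"])
```

The relational version needs a join to reassemble the order, while the document version keeps
related data together and tolerates extra fields, at the cost of weaker schema guarantees.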
2. Data Warehouses: Data warehouses store large amounts of structured and semi-
structured data from various sources, supporting the efficient querying and analysis of
data. Examples include:
o Amazon Redshift
o Google BigQuery
o Snowflake
o Microsoft Azure Synapse Analytics
3. Data Integration Tools: Data integration tools help to extract, transform, and load (ETL)
data from multiple sources and formats into a unified data store. Examples include:
o Apache NiFi
o Talend
o Microsoft SQL Server Integration Services (SSIS)
o Informatica PowerCenter
4. Data Catalogs: Data catalogs help to organize and discover data by providing metadata
management and data lineage tracking. Examples include:
o Alation
o Collibra
o AWS Glue Data Catalog
o Google Cloud Data Catalog
5. Data Quality Tools: These tools ensure data accuracy, consistency, and reliability by
identifying and resolving data issues such as duplicates, missing values, and incorrect
formats. Examples include:
o IBM InfoSphere Information Analyzer
o Informatica Data Quality
o Talend Data Quality
o Trifacta
6. Data Governance Platforms: Data governance platforms provide a holistic approach to
managing data policies, standards, and processes to ensure the availability, usability,
integrity, and security of data. Examples include:
o Collibra Data Governance Center
o Informatica Axon Data Governance
o IBM Watson Knowledge Catalog
o SAP Data Intelligence
7. Data Visualization and Reporting Tools: These tools enable users to visually explore,
analyze, and share data insights. Examples include:
o Tableau
o Microsoft Power BI
o QlikView
o D3.js
8. Big Data Processing Frameworks: Big data processing frameworks enable the
handling, processing, and analysis of large and complex datasets. Examples include:
o Apache Hadoop
o Apache Spark
o Apache Flink
o Google Cloud Dataflow
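Real-world examples for data curation and management framework
Here are some real-world examples of organizations using data curation and management
frameworks:
The National Oceanic and Atmospheric Administration (NOAA)
The National Oceanic and Atmospheric Administration (NOAA) is a US government agency that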
collects and manages a vast amount of environmental data, including data on weather, oceans,
and fisheries. To improve the management of this data, NOAA implemented a data curation and
management framework based on the Data Management Maturity Model (DMM).
The DMM provided a comprehensive framework for assessing and improving NOAA's data
management practices across multiple domains, including data governance, metadata,
preservation, and sharing. NOAA used the DMM to evaluate its existing data management
practices, identify gaps and opportunities for improvement, and develop a roadmap for
implementing best practices.
As part of this effort, NOAA established a Data Management Integration Team (DMIT) to lead
the implementation of the DMM and coordinate data management activities across the agency.
The DMIT worked with NOAA's data stakeholders to develop and implement policies and
procedures for data management, establish data quality standards, and improve data sharing and
interoperability.
The Cancer Genome Atlas (TCGA)
TCGA is a research program funded by the National Cancer Institute (NCI) that aims to improve
our understanding of cancer biology and develop new treatments for cancer. To manage the vast
amount of genomic data generated by the program, TCGA developed a data curation and
management framework based on the FAIR Data Principles. The framework includes policies
and procedures for data sharing, metadata management, and data quality control, and has enabled
researchers to access and analyze TCGA data more easily and efficiently.
The Human Cell Atlas (HCA)
The HCA is a global research initiative that aims to create a comprehensive map of all human
cells to understand how they interact and contribute to health and disease. To manage the large
amount of data generated by the project, the HCA developed a data curation and management
framework based on the Data Management Maturity Model (DMM). The framework includes
policies and procedures for data governance, metadata management, and data sharing, and has
enabled researchers to access and analyze HCA data more effectively.
The European Open Science Cloud (EOSC)
The EOSC is a pan-European initiative that aims to provide researchers with seamless access to
research data and services. To achieve this, the EOSC developed a data curation and
management framework based on the Research Data Alliance (RDA) guidelines and
recommendations. The framework includes policies and procedures for data sharing,
interoperability, and reuse, and has enabled researchers to access and share research data across
different disciplines and domains.
emphasizes simplicity and ease of use, providing researchers with practical guidelines for
organizing, documenting, and sharing their data. By adopting this lightweight framework,
researchers can enhance data reproducibility, collaboration, and long-term preservation without
overwhelming administrative burdens.
Article [7], titled "Provenance-Driven Data Curation Workflow Analysis", focuses on the analysis
of data curation workflows using provenance information. The study explores how provenance,
which records the history and origin of data, can be leveraged to improve data curation
processes. By analyzing the provenance data, researchers can gain insights into the workflow
patterns, identify bottlenecks, and optimize the curation process. The article highlights the
significance of incorporating provenance-driven approaches in data curation to enhance
efficiency, quality, and reliability in managing and preserving research data.
Article [8], titled "Data Curation with a Focus on Reuse", explores the importance of data
curation and its role in enabling data reuse. The article emphasizes that effective data curation
involves not only preserving and organizing data but also ensuring its accessibility and usability
for future research. It highlights the challenges faced in data curation, including the need for
standardized metadata, data documentation, and long-term preservation strategies. The article
also discusses the benefits of data reuse, such as promoting scientific advancements, enabling
interdisciplinary research, and facilitating reproducibility. It concludes by emphasizing the need
for continued investment in data curation efforts to maximize the value and impact of research
data.
Article [9], titled "Medical Data Quality Assessment: On the Development of an Automated
Framework for Medical Data Curation", focuses on the development of an automated framework
for assessing the quality of medical data. The article highlights the challenges faced in ensuring
the accuracy, completeness, and consistency of medical data, which is crucial for reliable clinical
research and decision-making. The proposed framework aims to automate the process of data
curation by leveraging machine learning algorithms and data mining techniques. It discusses the
various components of the framework, including data preprocessing, feature extraction, and
quality assessment algorithms. The article concludes by highlighting the potential benefits of the
automated framework, such as improved efficiency, consistency, and reliability in medical data
curation processes.
Conclusion
Reference
[1] Plato L. Smith II, Exploring Data Curation and Management Programs, Projects, and
Services through Metatriangulation, Communication and Information.
[2] Vasily Bunakov, Brian Matthews, Data Curation Framework for Facilities Science, In
Proceedings of the 2nd International Conference on Data Technologies and Applications, 2013,
DOI: 10.5220/0004593302110216.
[3] Whyte, A., Emerging infrastructure and services for research data management and curation
in the UK and Europe. In G. Pryor (Ed.), Managing Research Data, 2012.
[4] Andrey Kosinov, Adilbek Erkimbaev, Geirgy Kobzev, and Vladimir Zitserman, Data
Curation Approach to Management of Research Data, Joint Institute for High Temperatures,
Russian Academy of Sciences, Russia, 2019.
[5] Yin Zhang, Chen, Data Management and Curation Practices: The Case of Using DSpace and
Implications, 2015.
[8] Maria Esteva, Robert McLay, Weijia Xu, Sivakumar Kulasekaran, Data Curation with a
Focus on Reuse, 2016, DOI: http://dx.doi.org/10.1145/2910896.2910906