Chapter 01 - Introduction To DM and Curation - 01
Credit Hours: 3
Year I, Semester I
BASIC CONCEPTS
Petabytes of data are generated every day by human and machine activities.
❖ Those data may be lost just as quickly if they are not captured, curated
(carefully chosen and thoughtfully organized or presented), and marked up in ways
that allow for discovery and reuse by others.
KEY COMPONENTS OF DATA MANAGEMENT
Data Security & Privacy: Protecting sensitive data from unauthorized access through
encryption, authentication, and regulatory compliance (e.g., GDPR, HIPAA).
The US Health Insurance Portability and Accountability Act (HIPAA), enacted in 1996, was established
to safeguard patient privacy and secure health information.
GDPR stands for General Data Protection Regulation, a law in the European Union aimed at
safeguarding the data and privacy of EU residents. CCPA stands for California Consumer Privacy Act,
which is a law in the United States specifically for protecting the data and privacy of California
residents.
Example: A financial institution encrypts customer banking information to prevent breaches.
Data Backup & Recovery: Regular backups are essential to prevent data loss due to hardware
failures, cyberattacks, or accidental deletions.
Example: A research laboratory performs daily cloud backups of experiment data to ensure recovery in
case of system crashes.
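A backup is only useful if the copy can be trusted. The following is a minimal sketch, not a prescribed procedure, of copying a file and verifying the copy with a checksum. The function names `sha256_of` and `backup_file` and the directory layout are illustrative assumptions; a real setup would add scheduling, versioning, and off-site or cloud storage.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum.

    Hypothetical helper for illustration only.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} failed checksum verification")
    return target
```

Comparing checksums after the copy catches silent corruption, which a simple "file exists" check would miss.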
Data Governance & Compliance: Establishing policies and guidelines on how data is
accessed, used, and shared within an organization.
Data Management fundamentals for…
• Good data management is fundamental
– For research excellence
– For high quality data
– For data sharing, replication and reuse
– For long-term sustainability and accessibility
– For data security
– For reputational benefit
DATA MANAGEMENT PRACTICES
Researchers should consider all of the following data management practices
when designing research:
• Selection
• Collection
• Analysis
• Storage
• Ownership
DATA SELECTION
It is important to determine the data type and source when the project is initiated.
It is equally important to decide which instruments are suitable for collecting the
data needed to answer the research questions.
The following should be considered during the “data selection” process:
Appropriate type and source of data in order to answer the research question
Qualitative data collection is used for textual data
(e.g., observational research)
Quantitative data collection is used for numerical data
(e.g., recording biochemical markers)
Procedures needed to obtain a representative sample
Proper instruments used to collect the data
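One common procedure for obtaining a representative sample is stratified sampling: the population is split into groups and the same fraction is drawn from each. The sketch below is only one possible approach, and the function name `stratified_sample`, the record layout, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group so each group is represented.

    `records` is a list of dicts, `key` names the grouping field.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Take at least one record so small groups are never dropped.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample
```

With 80 urban and 20 rural participants and a fraction of 0.1, this draws 8 urban and 2 rural records, so the sample keeps the population's proportions.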
DATA COLLECTION
STORAGE OF DATA
Data storage refers to the maintenance of electronic and non-electronic data and issues
related to confidentiality, security, and preservation.
Non-electronic data can exist in paper files, journals, and laboratory notebooks.
Electronic data can exist in the form of an electronic file, videotapes, DVDs, etc.
Storage of Data - Integrity
Data integrity should be considered from the start, as part of initiating the project.
- Whether the storage capacity is sufficient for the data
- Whether the storage solution is reliable
- How long the data will be kept
- Who will input the data and who will check its accuracy
- Whether data handling processes or procedures are needed
DATA OWNERSHIP
It is often thought that the person who conducts the research and produces the data
owns the data.
However, funding agencies, institutions, and the sources from which the data was
obtained can determine otherwise.
What is Data Curation?
➢ Data curation goes beyond management by ensuring that data is well-documented,
preserved, and made reusable for future purposes. It includes: content creation, selection,
classification, transformation, validation, and preservation.
➢ Data curation refers to the active and ongoing process of managing data to ensure it
remains useful, accessible, and reliable over time.
❖ It involves organizing, maintaining, and enhancing data for future use, ensuring proper
documentation, and improving data quality.
➢ It is the active management of data over its life cycle to ensure it meets the data
quality requirements for effective use.
➢ Data curation is performed by expert curators (data annotators) who are responsible for
improving the accessibility and quality of data.
KEY COMPONENTS OF DATA CURATION
1. Data Cleaning & Standardization: Removing duplicate records, fixing errors, and
ensuring consistency.
Example: A climate research team standardizes temperature data collected from
different weather stations by converting all values to Celsius.
2. Metadata Creation & Documentation: Adding structured information (metadata) that
describes the dataset, including the source, collection method, format, and keywords.
Example: A digital library adds metadata to scanned historical documents, including
author names, publication dates, and summaries, making them searchable.
3. Data Transformation & Integration: Converting data into formats compatible with
analysis tools and integrating multiple datasets.
Example: A healthcare system merges patient records from different hospitals using
FHIR (Fast Healthcare Interoperability Resources) standards to ensure data
consistency.
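The cleaning and standardization step from the climate example can be sketched as follows. This is an illustrative assumption about how such a step might look, not the method of any particular team: the input layout (station, value, unit) and the names `fahrenheit_to_celsius` and `clean_readings` are hypothetical.

```python
def fahrenheit_to_celsius(value):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def clean_readings(readings):
    """Drop exact duplicate rows and express every reading in Celsius.

    `readings` is an iterable of (station, value, unit) tuples, where
    unit is "C" or "F" in any letter case. Hypothetical helper.
    """
    seen = set()
    cleaned = []
    for station, value, unit in readings:
        row = (station, float(value), unit.strip().upper())
        if row in seen:
            continue  # duplicate record, skip it
        seen.add(row)
        if row[2] == "F":
            celsius = fahrenheit_to_celsius(row[1])
        elif row[2] == "C":
            celsius = row[1]
        else:
            raise ValueError(f"Unknown temperature unit: {unit!r}")
        cleaned.append({"station": station, "temp_c": round(celsius, 1)})
    return cleaned
```

Rejecting unknown units with an error, rather than guessing, keeps a bad record from silently entering the curated dataset.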
KEY COMPONENTS OF DATA CURATION (continued)
4. Long-term Data Preservation: Ensuring that valuable data remains accessible despite
technological changes by storing it in sustainable formats (e.g., CSV, XML, PDF).
Example: A national archive converts old government reports from proprietary formats
into open-standard file formats to ensure future readability.
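Preservation and metadata creation often go together: the data is written in an open format, and a small metadata file describing it is stored alongside. The sketch below assumes a CSV data file with a JSON metadata file next to it. The function name `preserve`, the file naming scheme, and the metadata fields are illustrative assumptions, not an established standard.

```python
import csv
import json
from pathlib import Path


def preserve(rows, fieldnames, metadata, out_dir, stem):
    """Write rows as CSV and store descriptive metadata next to them.

    `rows` is a list of dicts, `metadata` a dict with fields such as
    source, collection method, and keywords. Hypothetical helper.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data_path = out_dir / f"{stem}.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    meta_path = out_dir / f"{stem}.metadata.json"
    meta_path.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return data_path, meta_path
```

Both files are plain text, so they stay readable without the software that originally produced the data.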
IMPORTANCE OF DATA MANAGEMENT AND CURATION
1. Enhancing Data Quality & Integrity: Well-managed and curated data ensures
reliability, accuracy, and trustworthiness.
Example: A pharmaceutical company must ensure accurate drug trial data to meet
regulatory approvals.
2. Improving Efficiency & Decision-Making: Structured and curated data allows
organizations to extract insights faster.
Example: A logistics company uses cleaned and structured data to optimize delivery
routes and reduce fuel costs.
Example: A hospital anonymizes patient data before sharing it with researchers to
comply with privacy laws.
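One building block of the hospital example is replacing direct identifiers with keyed hashes before sharing. The sketch below is a hedged illustration: strictly speaking it is pseudonymization, and meeting a given privacy law usually requires further steps (for instance handling dates, locations, and rare conditions). The function name `pseudonymize`, the field names, and the 16-character token length are assumptions.

```python
import hashlib
import hmac


def pseudonymize(records, secret_key, id_field="patient_id", drop_fields=("name",)):
    """Replace the identifier with a keyed hash and drop direct identifiers.

    `secret_key` is bytes and must stay with the data owner; without it
    the tokens cannot be recomputed. Hypothetical helper.
    """
    output = []
    for record in records:
        token = hmac.new(
            secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        cleaned = {
            field: value
            for field, value in record.items()
            if field != id_field and field not in drop_fields
        }
        cleaned["pseudonym"] = token[:16]  # shortened token for readability
        output.append(cleaned)
    return output
```

Because the same identifier always maps to the same token, researchers can still link several records of one patient without learning who the patient is.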
BEST PRACTICES FOR DATA MANAGEMENT AND CURATION
1. Develop a Data Management Plan (DMP) – Outline how data will be collected,
stored, secured, and shared.
2. Use Standardized Formats & Metadata – Follow established schemas like principles
for data documentation.
3. Ensure Regular Data Backup – Implement cloud and backups.
4. Implement Robust Security Measures – Use encryption, access controls, and
authentication methods.
5. Promote Data Reusability & Open Access – Share data through trusted repositories.
CHALLENGES IN DATA MANAGEMENT AND CURATION
Despite its benefits, data management and curation come with challenges, including:
1. Data Overload & Storage Costs – Organizations generate massive amounts of data that
require costly storage solutions.
2. Data Security Threats – Cyberattacks and data breaches pose risks to sensitive
information.
3. Interoperability Issues – Integrating data from different systems with varying formats
and standards can be complex.
4. Metadata Inconsistency – Lack of proper documentation can make data difficult to find
and interpret.
5. Long-Term Accessibility – Technologies become outdated, making some data formats
unreadable over time.
Example: A government agency using an outdated database system struggles to retrieve historical
census data due to software incompatibility.
THE WORLD OF DATA AROUND US
[Figure: information in petabytes worldwide, 2005 to 2010 (vertical axis 0 to 1,000,000), with an annotation marking transient information or unfilled demand for storage.]
THE WORLD OF DATA AROUND US: DATA LOSS
Data Management vs. Data Curation:
Aspect    | Data Management                                | Data Curation
Focus     | Handling and organizing data                   | Enhancing data for future use
Processes | Collection, storage, security                  | Selection, annotation, archiving
Goal      | Efficient data handling                        | Long-term usability & accessibility
Example   | A company manages sales records in a database  | A digital archive curates old manuscripts for research
➢ Data management and curation are critical for ensuring that data remains high-quality, well-
organized, accessible and reusable over time.
➢ While data management focuses on the technical aspects of handling data, data curation
emphasizes its long-term usability and value.
➢ Together, these practices play a vital role in research, business, healthcare, and many other
fields.
➢ Example: By adopting a structured approach to data storage, security, cleaning,
documentation, and long-term preservation, organizations can maximize the value of their data
and support innovation in various fields.
DATA MANAGEMENT FOR AI AND MACHINE LEARNING (ML)
1. Pick a topic of interest to you and come up with an interesting question that could potentially
be answered with existing data.
2. Search the literature for five peer-reviewed research papers that use data that would be
potentially helpful for answering your question.
3. Be prepared to discuss with classmates
(a) difficulties in accessing data from the literature to address your question and
(b) ways that data accessibility could be improved.