
Chapter 01 - Introduction To DM and Curation - 01

The document outlines a course on Advanced Data Management and Curation, focusing on the principles of data management and curation, including data collection, storage, processing, and security. It emphasizes the importance of effective data practices for research, compliance, and decision-making, while also addressing challenges such as data overload and security threats. The course includes various evaluation methods and encourages students to engage in discussions and article reviews related to data management.

Uploaded by

dine mohammed

DtSc 5120: Advanced Data Management and Curation

Credit Hours: 3
Year I, Semester I

Gebremedhin Gebreyohans (PhD)


April, 2025
Mode of Delivery: Lecture, discussions, article reviewing, term paper, seminar presentation, etc.

Evaluation Method: Continuous Assessment (
✓ Test: 10%,
✓ project work/term paper: 15%,
✓ article review: 15%,
✓ individual/group assignment: 10%) and
Chapter one

Introduction to Data Management and Curation

BASIC CONCEPTS

✓ What is data, information and knowledge?


✓ What is data curation and data management?
 Data is a new factor of production in this era.

❖ The huge volume of data being produced is leading to data overflow.

 Petabytes of data are generated every day by human and machine activities.

❖ Those data may be lost just as quickly if they are not captured, curated
(carefully chosen and thoughtfully organized or presented), and marked up in ways
that allow for discovery and reuse by others.

 The challenge is how to manage such data.


WHAT IS DATA MANAGEMENT?
 Data management refers to the process of collecting, storing, organizing, maintaining,
and protecting data to ensure its accuracy, availability, and security.
❖ It involves a range of practices, tools, and policies that help organizations and
researchers effectively manage data throughout its lifecycle.
 Data management refers to a set of practices, technologies, and policies used to collect,
store, organize, protect, and process data efficiently.
 The goal is to ensure that data remains accurate, consistent, and readily available when
needed.
Example of Data Management:
A university library manages research data by collecting digital publications, storing them in a
repository, indexing them for easy retrieval, and ensuring they are protected from unauthorized
access.
Data Management Concepts

 Why manage data?

 Without data and the ability to process the data:


 An organization could not successfully complete most business activities

 Data consists of raw facts

 To transform data into useful information:


 It must first be organized in a meaningful way, i.e., in a database

 Database Management System (DBMS)


 A collection of programs that enables users to store, modify, and extract
information from a database
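
The three DBMS operations named above (store, modify, extract) can be sketched with SQLite, which ships with Python. This is only an illustration: the table name, column names, and sample rows are hypothetical and do not come from the course material.

```python
import sqlite3


def demo_dbms():
    """Store, modify, and extract records using a small in-memory database."""
    # An in-memory database keeps the sketch self-contained (no file is created).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publications (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
    )

    # Store: insert raw facts as rows (hypothetical sample data).
    conn.executemany(
        "INSERT INTO publications (title, year) VALUES (?, ?)",
        [("Soil survey report", 2021), ("Rainfall dataset notes", 2023)],
    )

    # Modify: correct a value that was entered wrongly.
    conn.execute(
        "UPDATE publications SET year = ? WHERE title = ?",
        (2022, "Soil survey report"),
    )

    # Extract: query the organized data to obtain information.
    rows = conn.execute(
        "SELECT title, year FROM publications WHERE year >= ? ORDER BY year",
        (2022,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for title, year in demo_dbms():
        print(year, title)
```

The parameter placeholders (`?`) keep values separate from the SQL text, which is the usual way to pass user-supplied data to a query.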
KEY COMPONENTS OF DATA MANAGEMENT
 Data Collection: Gathering raw data from different sources (e.g., surveys, sensors, databases
and online forms). This is the first and foremost process.
 Example: An e-commerce company collects customer purchase data from online
transactions, mobile apps, and physical stores.
 Data Storage: Data must be stored in reliable systems such as databases, cloud platforms,
or data warehouses.
 Example: A healthcare institution stores patient records in an Electronic Health Record
(EHR) system with backup servers to prevent data loss.
 Data Processing: Cleaning, transforming, and analyzing data to extract meaningful insights.
 Data Sharing: Ensuring proper access to data for collaboration while maintaining security.
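
As a rough illustration of the data processing step, the sketch below cleans purchase records gathered from several channels and then summarizes revenue per channel. The field names (`channel`, `amount`) and the cleaning rules are assumptions made for this example, not a prescribed method.

```python
from collections import defaultdict


def clean_purchases(raw_rows):
    """Drop incomplete or unparseable rows; normalize channel names and amounts."""
    cleaned = []
    for row in raw_rows:
        channel = (row.get("channel") or "").strip().lower()
        amount = row.get("amount")
        if not channel or amount in (None, ""):
            continue  # incomplete record
        try:
            cleaned.append({"channel": channel, "amount": float(amount)})
        except (TypeError, ValueError):
            continue  # amount is not a number
    return cleaned


def revenue_by_channel(rows):
    """Aggregate cleaned rows into total revenue per sales channel."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["channel"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    raw = [
        {"channel": " Online ", "amount": "19.50"},
        {"channel": "store", "amount": 5},
        {"channel": "online", "amount": "0.50"},
        {"channel": "", "amount": "3"},       # missing channel, dropped
        {"channel": "app", "amount": "n/a"},  # unparseable amount, dropped
    ]
    print(revenue_by_channel(clean_purchases(raw)))
```

Whether bad rows should be dropped, corrected, or flagged for review is a policy decision; dropping them here simply keeps the sketch short.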

KEY COMPONENTS OF DATA MANAGEMENT
 Data Security & Privacy: Protecting sensitive data from unauthorized access through
encryption, authentication, and regulatory compliance (e.g., GDPR, HIPAA).
 The US Health Insurance Portability and Accountability Act (HIPAA), enacted in 1996, was established
to safeguard patient privacy and secure health information.
 GDPR stands for General Data Protection Regulation, a law in the European Union aimed at
safeguarding the data and privacy of EU residents. CCPA stands for California Consumer Privacy Act,
which is a law in the United States specifically for protecting the data and privacy of California
residents.
 Example: A financial institution encrypts customer banking information to prevent breaches.

 Data Backup & Recovery: Regular backups are essential to prevent data loss due to hardware
failures, cyberattacks, or accidental deletions.
 Example: A research laboratory performs daily cloud backups of experiment data to ensure recovery in
case of system crashes.
 Data Governance & Compliance: Establishing policies and guidelines on how data is
accessed, used, and shared within an organization.
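
The backup and recovery component above can be sketched as a copy followed by a checksum comparison, so that a silently corrupted copy is detected straight away. This is a minimal local-disk sketch; the function names are hypothetical, and a real setup would typically add scheduling, versioning, and off-site or cloud storage.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} failed verification")
    return target
```

Storing the digest alongside the backup would also allow the copy to be re-checked later, before it is relied on for recovery.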
Data Management fundamentals for…
• Good data management is fundamental
– For research excellence
– For high quality data
– For data sharing, replication and reuse
– For long-term sustainability and accessibility
– For data security
– For reputational benefit
DATA MANAGEMENT PRACTICES

 Researchers should consider all of the following data management practices
when designing research:
• Selection
• Collection
• Analysis
• Storage
• Ownership

DATA SELECTION

 It is important to determine the data type and source when the project is initiated,
along with the instruments suitable for collecting the data needed to answer the
research questions.
 The following should be considered during the “data selection” process:
 The appropriate type and source of data for answering the research question
 Qualitative data collection is used for textual data
(e.g., observational research)
 Quantitative data collection is used for numerical data
(e.g., recording biochemical markers)
 Procedures needed to obtain a representative sample
 Proper instruments used to collect the data
DATA COLLECTION

 Data collection is the process of systematically gathering information about variables
of interest in order to answer the research questions or test hypotheses.
 If data is not collected properly:
 the research questions cannot be answered,
 the study cannot be repeated,
 the findings are distorted,
 other researchers are misled,
 public policy decisions can be compromised, and
 harm to humans and animals can result.

STORAGE OF DATA
 Data storage refers to the maintenance of electronic and non-electronic data and issues
related to confidentiality, security, and preservation.
 Non-electronic data can exist in paper files, journals, laboratory notebooks
 Electronic data can exist in the form of an electronic file, videotapes, DVDs, etc.
 Storage of Data - Integrity
 Data integrity should be considered from the start, as part of initiating the project:
- Whether the storage capacity is sufficient for the data
- Whether the storage solution is reliable
- How long the data will be kept
- Who will input the data and who will check its accuracy
- Whether data handling processes or procedures are needed
DATA OWNERSHIP

 Data is a product resulting from research.

 It is often thought that the person conducting and producing the data owns the
data.

 However, funding agencies, institutions and sources where the data was obtained
can determine otherwise.

What is Data Curation?
➢ Data curation goes beyond management by ensuring that data is well-documented,
preserved, and made reusable for future purposes. It includes: content creation, selection,
classification, transformation, validation, and preservation.

➢ Data curation refers to the active and ongoing process of managing data to ensure it
remains useful, accessible, and reliable over time.

❖ It involves organizing, maintaining, and enhancing data for future use, ensuring proper
documentation, and improving data quality.

➢ Active management of data over its life cycle to ensure it meets the necessary data quality
requirements for its effective usage.

➢ Data curation is performed by expert curators (data annotators) who are responsible for
improving the accessibility and quality of data.
KEY COMPONENTS OF DATA CURATION
 1. Data Cleaning & Standardization: Removing duplicate records, fixing errors, and
ensuring consistency.
 Example: A climate research team standardizes temperature data collected from
different weather stations by converting all values to Celsius.
 2. Metadata Creation & Documentation: Adding structured information (metadata) that
describes the dataset, including the source, collection method, format, and keywords.
 Example: A digital library adds metadata to scanned historical documents, including
author names, publication dates, and summaries, making them searchable.
 3. Data Transformation & Integration: Converting data into formats compatible with
analysis tools and integrating multiple datasets.
 Example: A healthcare system merges patient records from different hospitals using
FHIR (Fast Healthcare Interoperability Resources) standards to ensure data
consistency.
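
The cleaning and standardization example (converting all temperature readings to Celsius and removing duplicate records) could look roughly like the sketch below. The tuple layout, the unit codes, and the rule that a repeated station and timestamp counts as a duplicate are all assumptions made for illustration.

```python
def to_celsius(value, unit):
    """Convert a temperature to Celsius from Celsius, Fahrenheit, or Kelvin."""
    unit = unit.strip().upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unknown unit: {unit!r}")


def standardize(readings):
    """Return de-duplicated (station, timestamp, celsius) tuples.

    `readings` is an iterable of (station, timestamp, value, unit) tuples.
    A reading is treated as a duplicate when the same station and timestamp
    have already been seen (an assumption made for this sketch).
    """
    seen = set()
    result = []
    for station, timestamp, value, unit in readings:
        key = (station, timestamp)
        if key in seen:
            continue
        seen.add(key)
        result.append((station, timestamp, round(to_celsius(value, unit), 2)))
    return result
```

Raising an error on an unknown unit, rather than guessing, keeps bad values from entering the curated dataset unnoticed.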
KEY COMPONENTS OF DATA CURATION

 4. Long-term Data Preservation: Ensuring that valuable data remains accessible despite
technological changes by storing it in sustainable formats (e.g., CSV, XML, PDF).

 Example: A national archive converts old government reports from proprietary formats
into open-standard file formats to ensure future readability.

 5. Data Sharing & Reusability: Providing open-access data repositories to enable
collaboration and innovation.
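
Metadata creation and long-term preservation can be combined in one small step: write the data in an open format (CSV) and place a machine-readable metadata record next to it. The sketch below is one possible layout; the metadata keys and the `.metadata.json` naming are hypothetical and do not follow any particular standard.

```python
import csv
import json
from pathlib import Path


def preserve_dataset(rows, fieldnames, metadata, out_dir, name):
    """Write `rows` (a list of dicts) as CSV plus a JSON metadata sidecar file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / f"{name}.csv"
    meta_path = out_dir / f"{name}.metadata.json"

    # Open, plain-text format for the data itself.
    with open(data_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Descriptive metadata supplied by the curator, plus a few derived facts.
    record = dict(
        metadata,
        format="text/csv",
        columns=list(fieldnames),
        row_count=len(rows),
    )
    meta_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return data_path, meta_path
```

A call such as `preserve_dataset(rows, ["title", "year"], {"source": "scanned archive"}, "archive", "reports")` would produce `reports.csv` and `reports.metadata.json` in the `archive` folder.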

IMPORTANCE OF DATA MANAGEMENT AND CURATION

 1. Enhancing Data Quality & Integrity: Well-managed and curated data ensures
reliability, accuracy, and trustworthiness.
 Example: A pharmaceutical company must ensure accurate drug trial data to meet
regulatory approvals.
 2. Improving Efficiency & Decision-Making: Structured and curated data allows
organizations to extract insights faster.
 Example: A logistics company uses cleaned and structured data to optimize delivery
routes and reduce fuel costs.
IMPORTANCE OF DATA MANAGEMENT AND CURATION

 3. Supporting Research & Knowledge Sharing: Properly curated datasets allow
scientists to replicate experiments and build upon previous findings.
 Example: Open data repositories like Kaggle and Dryad provide well-documented
datasets for data scientists and researchers.
 4. Ensuring Compliance with Data Regulations: Following data management policies
helps organizations comply with laws like GDPR (Europe), HIPAA (USA), and
Ethiopia's Personal Data Protection Proclamation.
 Example: A hospital anonymizes patient data before sharing it with researchers to
comply with privacy laws.
 Example in Real Life:
• In healthcare, patient records are curated and managed to ensure accurate diagnosis and treatment history.
• In scientific research, curated datasets allow researchers to reproduce experiments and validate findings.
• In business intelligence, properly managed data helps companies make data-driven decisions.
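
The hospital example of anonymizing patient data before sharing can be approximated by removing direct identifiers and replacing the patient ID with a keyed hash. Strictly speaking this is pseudonymization rather than full anonymization, and whether it satisfies a given privacy law depends on that law and on what other fields remain. The field names below are hypothetical.

```python
import hashlib
import hmac

# Fields removed outright before sharing (hypothetical names).
DIRECT_IDENTIFIERS = {"name", "phone"}


def pseudonymize(record, secret_key, id_field="patient_id"):
    """Return a copy of `record` that is safer to share with researchers.

    Direct identifiers are dropped and the ID is replaced by a keyed hash,
    so records of the same patient still link together without exposing
    the original ID. `secret_key` (bytes) must stay with the data owner.
    """
    token = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()[:16]
    shared = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != id_field
    }
    shared["pseudonym"] = token
    return shared
```

A keyed hash (HMAC) is used instead of a plain hash so that someone without the key cannot rebuild the mapping by hashing every possible patient ID.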
BEST PRACTICES FOR EFFECTIVE DATA MANAGEMENT AND CURATION

 1. Develop a Data Management Plan (DMP) – Outline how data will be collected,
stored, secured, and shared.
 2. Use Standardized Formats & Metadata – Follow established schemas and principles
for data documentation.
 3. Ensure Regular Data Backup – Implement regular backups, including cloud backups.
 4. Implement Robust Security Measures – Use encryption, access controls, and
authentication methods.
 5. Promote Data Reusability & Open Access – Share data through trusted repositories.

CHALLENGES IN DATA MANAGEMENT AND CURATION
 Despite its benefits, data management and curation come with challenges, including:
 1. Data Overload & Storage Costs – Organizations generate massive amounts of data that
require costly storage solutions.
 2. Data Security Threats – Cyberattacks and data breaches pose risks to sensitive
information.
 3. Interoperability Issues – Integrating data from different systems with varying formats
and standards can be complex.
 4. Metadata Inconsistency – Lack of proper documentation can make data difficult to find
and interpret.
 5. Long-Term Accessibility – Technologies become outdated, making some data formats
unreadable over time.
 Example: A government agency using an outdated database system struggles to retrieve historical
census data due to software incompatibility.
THE WORLD OF DATA AROUND US
[Figure: "Information" and "Available Storage" in petabytes worldwide (0 to 1,000,000), 2005 to 2010, with a region labeled "Transient information or unfilled demand for storage".]
THE WORLD OF DATA AROUND US: DATA LOSS
Data Management vs. Data Curation:
Aspect    | Data Management                                | Data Curation
Focus     | Handling and organizing data                   | Enhancing data for future use
Processes | Collection, storage, security                  | Selection, annotation, archiving
Goal      | Efficient data handling                        | Long-term usability & accessibility
Example   | A company manages sales records in a database  | A digital archive curates old manuscripts for research
➢ Data management and curation are critical for ensuring that data remains high-quality, well-
organized, accessible and reusable over time.
➢ While data management focuses on the technical aspects of handling data, data curation
emphasizes its long-term usability and value.
➢ Together, these practices play a vital role in research, business, healthcare, and many other
fields.
➢ Example: By adopting a structured approach to data storage, security, cleaning,
documentation, and long-term preservation, organizations can maximize the value of their data
and support innovation in various fields.
DATA MANAGEMENT FOR AI AND MACHINE LEARNING (ML)

 Data management needs AI and machine learning, and, just as important,
AI/ML needs data management. The two are now connected, with the
path to successful AI intrinsically linked to modern data management
practices.
 Many business processes rely on AI, which is the science of training
systems to emulate human tasks through learning and automation.
 With AI and ML, it is more important than ever to have well-managed
data that you understand and trust, because if bad data feeds
algorithms that adapt based on what they learn, mistakes can
multiply quickly.
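
One simple way to keep bad data away from a learning algorithm is to validate records before training and to stop if problems are found. The sketch below checks for missing values and out-of-range numbers. The field names and limits are hypothetical, and real pipelines usually check much more (types, duplicates, label balance, drift).

```python
def validate_rows(rows, required, ranges):
    """Return a list of (row_index, field, problem) tuples; empty means clean.

    `required` lists the fields that must be present and not None.
    `ranges` maps a field name to an inclusive (low, high) pair.
    """
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                problems.append((index, field, "missing"))
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((index, field, "out of range"))
    return problems


if __name__ == "__main__":
    rows = [
        {"age": 34, "label": 1},
        {"age": -5, "label": 0},     # impossible age
        {"age": 51, "label": None},  # missing label
    ]
    issues = validate_rows(rows, ["age", "label"], {"age": (0, 120)})
    if issues:
        print("Do not train yet; fix these first:", issues)
```

Reporting every problem with its row index, rather than stopping at the first one, makes it easier to trace errors back to the data source.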
Article Review
Instructions:

1. Pick a topic of interest to you and come up with an interesting question that could potentially
be answered with existing data.

2. Search the literature for five peer-reviewed research papers that use data that would be
potentially helpful for answering your question.
3. Be prepared to discuss with classmates
(a) difficulties in accessing data from the literature to address your question and
(b) ways that data accessibility could be improved.
THANK YOU
