
Intro To Data Management PDF

This document provides an overview of data management. It defines data management and describes its key components such as data policy, ownership, documentation, quality, standardization, and the data lifecycle. Effective data management ensures quality, compliance, efficiency and security of data. It occurs in phases from proposal to publication, including collection, quality assurance, description, submission, preservation, integration, analysis and insights. Data management is important for organizations conducting research, programs and policy work that rely on data-driven decisions and reporting.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/335062781

Introduction to Data Management

Presentation · August 2019


1 author:

Victor Olajide
Teaching at the Right Level

All content following this page was uploaded by Victor Olajide on 09 August 2019.



S.A.L.T ANALYTICS TRAINING/WORKSHOP

DATA MANAGEMENT AND DATA ANALYSIS
(using XLSForms, ODK Suite, KoboToolbox, STATA, Google Docs)
Data Quality Management
This course introduces Data Management. It begins by explaining what Data Management is and why it matters for job seekers who want to offer their skills to NGOs, CSOs, MDAs and private-sector organizations involved in industry, humanitarian and behavioral research, programs, projects and policy advocacy. The course concludes with a discussion of the overall syllabus.
Data Management. The What!

Data Management, as defined by the Data Management Association (DAMA), is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets. [1]
A more expanded definition, advanced by the Intra-governmental Group on Geographic Information (IGGI), is that “Data Management is a group of activities relating to the planning, development, implementation and administration of systems for the acquisition, storage, security, retrieval, dissemination, archiving and disposal of data.” [2]
The above definition makes Data Management a composite of some basic activities, which include: Data Policy; Data Ownership and responsibility for ensuring legislative compliance; Data Documentation and Metadata Compilation; Data Quality, Standardization and Harmonization; Data Life Cycle Control; and Data Audit. [3]
Data Policy
This function of Data Management refers to the broad framework of guidelines that governs the entire data management process. It is set by high-level officials in the organization that needs data management.
Data Ownership
This function of Data Management, as the name suggests, refers to the clear identification of the owner of the data. Ownership determines who has rights over the management and subsequent use of the data.
Data Documentation and Metadata Compilation
Documentation is the provision of the information necessary for identifying data. Metadata (data that describes other data) refers to summaries of datasets, which also play a key role in identifying a dataset, e.g. DateTime stamps.
Data Quality, Standardization and Harmonization
Data quality is concerned with whether datasets satisfy the criteria of completeness, accuracy (precision), reliability (consistency), validity, timeliness, integrity, and the confidentiality and ethics of handling data, particularly data relating to human subjects.
Standardization relates to the dataset meeting specified minimum levels of quality. Each dimension of data quality is given a minimum requirement, and standardization requires that these minimum requirements are met.
Harmonization applies where multiple datasets have to be integrated. This process requires that each individual dataset meets the minimum quality requirements in the specified standards.
Data Life Cycle Control
This is concerned with ensuring that the processes governing the evolution of a dataset, starting from the ideas that necessitate collecting new data rather than relying on existing data, are adhered to according to the laid-out data policy.
These processes include: the specification of the data, data modeling, data processing, data maintenance and security, audit, the use of data, the archiving of data, and the probable destruction of the data when it is no longer needed or no longer economical to retain.
Data Audit
This function of Data Management is concerned with ensuring that resources committed to Data Management throughout the life cycle of the data are expended appropriately and in accordance with the Data Policy and the Data Management plans and procedures.
Benefits of Data Management
NGOs, CSOs, the private sector and government MDAs that make data-driven policies and decisions derive benefits from Data Management. These benefits make Data Management very important to the successful operation of these organizations.
Data Management ensures the following benefits:
• Quality of data, which ensures that data made available for research or decision making meets the criteria of completeness, validity, reliability, accuracy, timeliness, confidentiality, integrity and research ethics. [5]
• Compliance with the requirements agreed with the user of the data, which could be a donor agency, a research institute or an NGO that requires the monitoring of the implementation of programs, projects and/or policies. [5]
• Efficiency gains in the handling of data from its collection to its preparation for use, measured in terms of the duration from data management planning to data usage, the monetary cost of getting data ready for use, and the quality of the data. [5]
• Access and security of data. This makes Data Management highly relevant for agencies that work with data on human subjects or other sensitive subjects, where utmost security is necessary and access must be strictly aligned with the requirements and statutory obligations of the data user. [5]
Data Management Life Cycle
Data Management occurs in phases that make up what is referred to as the Data Management Life Cycle. These phases, in procedural order, are [5]:
i. Proposal and Planning
ii. Data Collection
iii. Quality Assurance
iv. Data Description
v. Data Submission
vi. Data Preservation
vii. Discovery
viii. Data Integration
ix. Data Analysis
x. Publication of Insights
Figure: Data Management Life Cycle [5]
Data Management Life Cycle: Proposal and Planning
The first phase of the Data Management Life Cycle is concerned with preparing the proposal and planning for data collection. This phase stems from the larger research life cycle, after user requirements and funding issues have been sorted out. Once the proposal and planning phase is concluded, the data collection phase commences.
Data Management Life Cycle: Data Collection
Data can be collected in a variety of ways, depending on the requirements of the data user. Depending on the nature of the research, the data required may be primary (field data), secondary (from databases and data repositories) or tertiary (from articles, journals, etc.). While sourcing data from secondary and tertiary sources is relatively less technical, as the data has already been processed, the collection of primary data can pose unique challenges.
Some of the ways to collect primary data include:
i. Paper-based questionnaires;
ii. Data entry software running on a PC, such as SPSS Data Entry Builder, EpiData and MS Excel-based data entry programs; and
iii. Software and mobile hardware that permit the collection of primary data directly from the source.
It is the third approach that is in popular use by NGOs, government MDAs and the private sector, due to the following benefits:
• Lower incidence of data entry/validation errors (transpositional errors, copying errors, coding errors, routing errors and consistency errors).
• Ease of use, due to the portability of the hardware (hand-held Android devices) and the clarity of the software.
• Ease of use in highly volatile and insecure environments.
• Reduced cost of financing a data collection effort spanning a geographical area.
• Efficient and timely data collection by teams of data collectors working simultaneously.
• Ease of collecting diverse forms of data, including audio, video, pictures, GPS data (longitude, latitude and altitude), text (string) and multiple-choice responses.
• Efficient for interviewer-administered questionnaires.
• Ease of performing on-field procedures, such as randomization of sampled subjects.
• Ease of enforcing validation criteria.
• Efficient monitoring of the location of field staff using GPS.
• Efficient mapping of the geographical location of the sampling frame.
• Efficient exporting of collected data in a relational database format, which makes quality assurance easier.
• Auto-generation of metadata.
• Creation of unique user identification numbers for each subject from or about whom data is collected, and also for repeating data, as in the case of a roster within a single household.
Despite the benefits of the third approach to data collection, there are some challenges associated with it. These include:
• It requires technical expertise in programming forms, setting validation criteria and providing for quality assurance.
• It requires technical expertise in the transmission, extraction (or exporting) and integration of data.
• It requires training interviewers in the use of the data collection software.
• It is not appropriate for self-administered questionnaires, as respondents may also have to be trained in handling the software/hardware for data collection.
Examples of software that enable data collection via mobile devices include SurveyCTO, KoboToolbox and the Open Data Kit (ODK) suite. Some of these services, such as Ona and SurveyCTO, are paid for, while others, like KoboToolbox and ODK, are open source and available free of charge. Due to popular demand by NGOs and research agencies, we shall focus on KoboToolbox and ODK in this tutorial.
ODK, based on information from its documentation page (https://docs.opendatakit.org/), has the following components: ODK Build, Collect, Aggregate, Briefcase and the XLSForm (online/offline) converter.
ODK Build: allows you to build forms using a drag-and-drop graphical user interface, which is quite easy to use but has the drawback of not being able to handle complex form authoring.
ODK Collect: an app that runs on an Android device. It provides an interface corresponding to the authored form and allows data to be collected with that form.
ODK Aggregate: a server that stores form definitions, stores data collected with ODK Collect and provides some visualization of the collected data, among other things.
ODK Briefcase: makes it possible for clients to interact with the ODK Aggregate server and the ODK Collect app. It assists with pulling and exporting structured data in .csv format from a server (which could be ODK Aggregate) or directly from the device running ODK Collect. It also assists with pushing new form definitions to the server or to an Android device running ODK Collect.
XLSForm Converter: converts forms authored in Microsoft Excel spreadsheets or Google Sheets according to the XLSForm standard into equivalent XForms in .xml format, which can be loaded onto an Android device and displayed via the ODK Collect app. The converter comes in both online and offline versions.
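To make the XLSForm standard more concrete, the sketch below builds the rows of a small `survey` sheet in Python and prints them as CSV text. The column names (`type`, `name`, `label`, `constraint`, `required`) and the question types follow the XLSForm convention; the questions themselves (`child_age`, `child_weight` and so on) are invented for illustration and are not taken from the workshop material. A real form would be saved as a spreadsheet before being passed to the converter.

```python
import csv
import io

# Hypothetical questions for illustration; only the column layout follows XLSForm.
survey_rows = [
    {"type": "start", "name": "start_time", "label": "", "constraint": "", "required": ""},
    {"type": "text", "name": "respondent_id", "label": "Respondent ID", "constraint": "", "required": "yes"},
    {"type": "integer", "name": "child_age", "label": "Age of child (years)", "constraint": ". >= 0 and . < 5", "required": "yes"},
    {"type": "decimal", "name": "child_weight", "label": "Weight of child (kg)", "constraint": ". > 0", "required": "yes"},
    {"type": "geopoint", "name": "location", "label": "Record GPS location", "constraint": "", "required": ""},
]

columns = ["type", "name", "label", "constraint", "required"]


def survey_sheet_as_csv(rows):
    """Return the survey sheet as CSV text, one question per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sheet_text = survey_sheet_as_csv(survey_rows)
print(sheet_text)
```

The `start` row is an example of auto-generated metadata, and the `constraint` column is where validation criteria are enforced at the point of entry.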
KoboToolbox, like the ODK suite, also offers a suite of services, which include an online form builder, KoboCollect and online data storage.
Data Management Life Cycle: Quality Assurance
Upon concluding the data collection effort, the next phase in the Data Management Life Cycle is the Data Quality Assurance phase. Data quality assurance requires the following criteria [4]:
• Completeness
• Reliability (Consistency)
• Validity
• Precision (Accuracy)
• Confidentiality and Ethics
• Timeliness
• Data Integrity
Completeness: this criterion requires that the data exported at the end of data collection reflects everything required in the design produced at the proposal and planning phase. E.g. if antenatal care (ANC) and delivery rates are expected to be collected from 20 Primary Health Care Centres (PHCs), but we have ANC rates for 19 PHCs and delivery rates for 18 PHCs, then the completeness criterion has been violated.
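The completeness check in this example can be sketched in Python. The facility names and the figures are made up; the point is only that each of the 20 expected PHCs must have a value for every required indicator.

```python
# Hypothetical export: one dictionary per indicator, mapping PHC name -> rate.
expected_phcs = [f"PHC_{i:02d}" for i in range(1, 21)]  # 20 facilities planned

anc_rates = {phc: 0.6 for phc in expected_phcs[:19]}       # only 19 reported
delivery_rates = {phc: 0.4 for phc in expected_phcs[:18]}  # only 18 reported


def missing_facilities(expected, reported):
    """Return the expected facilities that have no reported value."""
    return [phc for phc in expected if phc not in reported]


missing_anc = missing_facilities(expected_phcs, anc_rates)
missing_delivery = missing_facilities(expected_phcs, delivery_rates)

# The dataset is complete only if nothing is missing for any indicator.
is_complete = not missing_anc and not missing_delivery

print("Missing ANC rates:", missing_anc)
print("Missing delivery rates:", missing_delivery)
print("Complete:", is_complete)
```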
Reliability (Consistency): this criterion ensures that the data collection process (and hence the standard operating protocol for data collection crafted during the planning and proposal phase) is adhered to strictly and consistently. E.g. if we are to collect data from 10 pregnant women (PW) in each of 3 catchment areas delimited by PHCs and, after interviewing 20 PW in 2 PHCs, the set of questions used previously is adjusted to collect slightly different data from the remaining 10 PW, then the reliability criterion is violated.
Validity: this criterion stipulates that the data collected actually measures what was intended during the planning and proposal phase. E.g. if it was concluded that the HIV/AIDS rate is to be measured in a catchment area served by a PHC, then asking the question “How many patients have been diagnosed with HIV/AIDS this month?” would make the data collected invalid, as the number of HIV/AIDS patients is not a rate! A valid question could be: “Out of every 100 patients that visit this PHC in a month, how many were diagnosed with HIV/AIDS?”
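The difference between a count and a rate is easy to see with numbers. In the sketch below (the figures are invented), two PHCs diagnose the same number of patients in a month, yet their rates differ because they serve different numbers of patients.

```python
def rate_per_100(diagnosed, visits):
    """Return the number of diagnosed cases per 100 patient visits."""
    if visits <= 0:
        raise ValueError("visits must be positive")
    return 100 * diagnosed / visits


# Hypothetical monthly figures: same count, different patient loads.
phc_a = rate_per_100(diagnosed=12, visits=400)
phc_b = rate_per_100(diagnosed=12, visits=150)

print(f"PHC A: {phc_a:.1f} per 100 patients")
print(f"PHC B: {phc_b:.1f} per 100 patients")
```

Reporting only the count (12 in both cases) would hide the fact that the second facility has the higher rate.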
Precision (Accuracy): this criterion stipulates that the collected data must be accurate to the degree determined at the planning and proposal phase. E.g. if we are to measure the weight of children under the age of 5 to the nearest tenth of a kilogram (say, 10.5 kg) but the measurements are taken to the nearest whole number (say, 11 kg), then the precision criterion has been violated. This can also happen for questions that do not necessarily require measurements.
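A precision check of this kind can be sketched as follows. It assumes the weights are available as the raw text that was entered (numeric storage would lose trailing zeros), and the sample values are invented.

```python
from decimal import Decimal


def decimal_places(raw_value):
    """Number of digits recorded after the decimal point in the raw entry."""
    exponent = Decimal(raw_value).as_tuple().exponent
    return max(0, -exponent)


required_places = 1  # weights must be recorded to the nearest tenth of a kg
raw_weights = ["10.5", "11", "9.8", "12.25"]  # hypothetical entries

# Entries recorded more coarsely than the plan requires.
too_coarse = [w for w in raw_weights if decimal_places(w) < required_places]
print("Entries violating the precision criterion:", too_coarse)
```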
Timeliness: this criterion requires that data be collected with due consideration of the relevance of the data with respect to the timing of the research. E.g. research whose findings are to be put to use in May 2019 would require that its data collection phase be concluded well before May 2019. If, for one reason or another, the data collection effort runs past May 2019, then the timeliness criterion has been violated.
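As a minimal sketch, a timeliness check compares the date stamps of submissions against the date by which the findings are needed. The submission dates below are invented, and the first day of May 2019 is used as an assumed cut-off.

```python
from datetime import date

deadline = date(2019, 5, 1)  # assumed cut-off: findings are needed in May 2019

# Hypothetical submission dates recorded by the data collection tool.
submission_dates = [date(2019, 4, 12), date(2019, 4, 28), date(2019, 5, 6)]

# Any submission on or after the cut-off violates the timeliness criterion.
late = [d for d in submission_dates if d >= deadline]
is_timely = not late

print("Late submissions:", late)
print("Timely:", is_timely)
```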
Data Integrity: this criterion maintains that the collected data must be free of manipulation and based on standard operating protocols, and that the pace of the data collection effort must not be induced by monetary factors.
Confidentiality and Ethics: human subject research, as well as other forms of research on sensitive subjects, requires the protection of the life, information, privacy and wellbeing of the subject. Hence this criterion requires that data collection be conducted in a manner that does not result in loss of any kind to the subjects.
To protect the privacy of the subject (confidentiality), data collection should gather anonymous data and avoid requesting personally identifying information; where that is unavoidable, it must provide assurance of the security of the information.
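One way to act on this at the data-handling stage is sketched below: identifying fields are dropped and the respondent ID is replaced with a salted hash before the data is shared. The field names, the sample record and the salt are assumptions made for illustration, and this kind of pseudonymization is only one part of securing the information.

```python
import hashlib

PII_FIELDS = {"name", "phone"}        # assumed identifying fields
SALT = "replace-with-a-secret-value"  # placeholder; the real salt stays private


def pseudonymize(record):
    """Drop identifying fields and replace the ID with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    raw_id = (SALT + str(record["respondent_id"])).encode("utf-8")
    cleaned["respondent_id"] = hashlib.sha256(raw_id).hexdigest()[:12]
    return cleaned


# Hypothetical record as exported from the collection tool.
record = {"respondent_id": 17, "name": "A. Respondent", "phone": "0800000000", "anc_visits": 3}
safe_record = pseudonymize(record)
print(safe_record)
```

Because the same ID always maps to the same hash, records for one respondent can still be linked across datasets without exposing who the respondent is.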
The ethics of data collection also require that the voluntary informed consent of the respondent be obtained before data collection commences. Violating these requirements makes the data collection effort, and indeed the research, unethical and could attract legal penalties.
Data Management Life Cycle: Data Description
The next phase in the Data Management Life Cycle is the description of the data. At this stage the concern is with constructing the documentation needed to make the data more informative, more meaningful and easier to analyze further. Variables are labeled, the values of categorical variables are coded and/or labeled, documentation of the inventory of data collected is provided, and missing values are coded appropriately (99 “I don't know”, 98 “No value supplied”, 97 “Missing data”, etc.).
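The reason for coding missing values explicitly is that they must be kept out of any calculation. The sketch below uses the three codes listed above on an invented variable and shows both the labeled tally of missing values and a mean computed on valid values only.

```python
# Missing-value codes as listed in the text above.
MISSING_CODES = {99: "I don't know", 98: "No value supplied", 97: "Missing data"}

# Hypothetical coded variable: number of ANC visits per respondent.
anc_visits = [4, 99, 2, 97, 3, 98, 5]

# Keep only real measurements for analysis.
valid_values = [v for v in anc_visits if v not in MISSING_CODES]

# Tally how often each missing-value code occurs, by label.
missing_summary = {}
for v in anc_visits:
    if v in MISSING_CODES:
        label = MISSING_CODES[v]
        missing_summary[label] = missing_summary.get(label, 0) + 1

mean_valid = sum(valid_values) / len(valid_values)
print("Valid values:", valid_values)
print("Missing summary:", missing_summary)
print("Mean of valid values:", mean_valid)
```

If the codes were left in, the mean of this variable would be inflated by the values 97, 98 and 99.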
Data Management Life Cycle: Data Submission
Once the data quality assurance checks, the cleaning of the data and the necessary description of the data are concluded, the data becomes ready for submission to a data repository, and it is at this juncture that access to the data can be effected.
Data Management Life Cycle: Data Preservation
At the planning and proposal phase of the Data Management Life Cycle, issues concerning how long the data is to be stored for current and possibly future uses, the integrity of the data over time and the format in which the data is stored are deliberated. These fall under the purview of Data Preservation.
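One common way to monitor the integrity of stored data over time, offered here as an illustration rather than as part of the original material, is to record a checksum when the dataset is deposited and recompute it whenever the dataset is retrieved. The dataset contents below are invented.

```python
import hashlib


def checksum(data_bytes):
    """SHA-256 digest used as a fingerprint of the stored dataset."""
    return hashlib.sha256(data_bytes).hexdigest()


# Hypothetical dataset contents at the time of deposit.
deposited = b"phc,anc_rate\nPHC_01,0.61\nPHC_02,0.58\n"
recorded_digest = checksum(deposited)

# Later: recompute the digest on retrieval and compare with the record.
retrieved_ok = deposited
retrieved_changed = deposited.replace(b"0.58", b"0.85")

unchanged = checksum(retrieved_ok) == recorded_digest
tampered = checksum(retrieved_changed) != recorded_digest
print("Intact copy verified:", unchanged)
print("Altered copy detected:", tampered)
```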
Data Management Life Cycle: Discovery
Based on the requirements of the research or the data user, it may be necessary to collect data that is not currently available and also to discover data that is already available from other research efforts, thereby avoiding reinventing the wheel and saving resources for other uses.
Issues to note at the discovery phase of the Data Management Life Cycle include data accessibility and the existence of metadata providing sufficient information about the discovered data.
Data Management Life Cycle: Data Integration
After concluding the discovery phase and determining the secondary datasets to be combined with the primary data from the data collection phase, the next phase is the integration of all datasets into a unified dataset. This process could involve understanding methodological differences, transforming data into common representations, re-coding, and the reconstruction of metadata for the new master dataset.
This phase requires an understanding of how to apply data joins, exact merges, fuzzy merges based on string and numeric keys, and appends. In addition, the Data Description phase may need to be revisited for the new master dataset resulting from data integration.
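The three operations named above can be sketched in plain Python. All records and facility names are invented, and the fuzzy match uses `difflib` from the standard library with an assumed similarity cutoff of 0.8; statistical packages offer their own commands for the same tasks.

```python
import difflib

# Hypothetical parent (household) and child (member) records.
households = [
    {"hh_id": "H1", "phc": "North PHC"},
    {"hh_id": "H2", "phc": "South PHC"},
]
members = [
    {"hh_id": "H1", "member_id": "H1-01", "age": 34},
    {"hh_id": "H1", "member_id": "H1-02", "age": 3},
    {"hh_id": "H2", "member_id": "H2-01", "age": 27},
]

# Exact merge: attach household fields to each member via the shared key.
by_hh = {h["hh_id"]: h for h in households}
merged = [{**by_hh[m["hh_id"]], **m} for m in members if m["hh_id"] in by_hh]

# Append: stack a second batch with the same columns below the first.
second_batch = [{"hh_id": "H3", "phc": "East PHC"}]
appended = households + second_batch

# Fuzzy merge on a string key: match misspelt names to a reference list.
reference_names = [h["phc"] for h in appended]


def fuzzy_match(name, candidates, cutoff=0.8):
    """Return the closest candidate at or above the cutoff, or None."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


secondary = [{"phc": "North P.H.C", "staff": 12}, {"phc": "South PHC ", "staff": 7}]
fuzzy_merged = [
    {**row, "matched_phc": fuzzy_match(row["phc"], reference_names)}
    for row in secondary
]

print(merged)
print(appended)
print(fuzzy_merged)
```

Fuzzy matches should be reviewed by hand before they are accepted, since a cutoff that is too low will pair records that do not belong together.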
Data Management Life Cycle: Data Analysis
At the Data Analysis phase, the master dataset is subjected to statistical methods (inferential or exploratory) with the aim of extracting insights that provide the basis for the formation of hypotheses, the testing of hypotheses, the confirmation of assumptions and the validation of the theories that spurred the research in the first place. The insights generated from analysis can be displayed as text, numbers, tables and graphics.
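As a small illustration of exploratory analysis, the sketch below computes basic descriptive statistics with Python's standard `statistics` module. The weights are invented, and the variance and standard deviation shown are the sample versions.

```python
import statistics

# Hypothetical ratio-scale variable: weights of children under 5, in kg.
weights_kg = [10.5, 11.2, 9.8, 12.0, 10.5, 13.1]

summary = {
    "mean": statistics.mean(weights_kg),
    "median": statistics.median(weights_kg),
    "mode": statistics.mode(weights_kg),
    "variance": statistics.variance(weights_kg),  # sample variance
    "stdev": statistics.stdev(weights_kg),        # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```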
The use of graphics to display insights from structured or unstructured data is broadly defined as Data Visualization, and it is an essential component of data analysis.
Data Management Life Cycle: Publication
Though not considered necessary by some, the Publication phase, which concludes the Data Management Life Cycle, is quite important: it furthers research and the evolution of data, lends credence to research based on the published data, provides a basis for reproducing research findings, provides a basis for research in other fields and interests, and provides greater impetus for further funding of research by donor organizations.
Syllabus for the Training/Workshop
The syllabus for this training/workshop is based on components of the Data Management Life Cycle and is detailed below:
• Data Collection:
  ◦ Overview of Form Design, Data Collection, Data Retrieval and Exporting Technologies
  ◦ Form Programming with Google Forms
  ◦ Form Programming with KoboToolbox
  ◦ Form Programming with the XLSForm standard and the XLSForm-XForm converter
  ◦ Data Retrieval and Exporting using Dropbox, ODK Briefcase and a Compression App
• Data Quality Assurance
  ◦ Introduction to Data Quality Assurance
  ◦ Getting to know your data (data types)
  ◦ Understanding validation criteria
  ◦ Looking out for validation loopholes and documenting them
  ◦ Introduction to Stata 12 for Data Quality Checking with Do files
  ◦ Practical application of 70 commands/functions including: append, assert, by/bys, cd, clear, collapse, count, data types, datetime, describe, destring, drop, ds, duplicates, egen, encode, erase, expand, export, foreach, forvalues, functions (38 of them), generate, global, gsort, import, input, joinby, label, list, local, merge, missing values, order, rename, reshape, save, sort, split, summarize/tabulate, tostring, use, while.
  ◦ Quality Checks with Stata 12 Do files
  ◦ Data Cleaning with Stata 12 Do files
  ◦ Design of reporting system using spreadsheets (MS Excel/Google Sheets):
    ▪ Some valuable spreadsheet formulas
    ▪ Quality checks/cleaning reports
    ▪ Inventory reports
• Data Summary and Metadata Generation
  ◦ Labeling variable names in Stata 12
  ◦ Labeling the values of categorical variables (encoding, recoding)
  ◦ Labeling the dataset
  ◦ Generating indicator variables from categorical variables
  ◦ Generating the summary statistics of ratio scale variables
  ◦ Generating the frequency distributions of nominal scale variables
• Data Integration Techniques
  ◦ Exact merging of datasets based on parent-child unique-user identification keys
  ◦ Appending a dataset to another
  ◦ Fuzzy merges of datasets (possibly by groups) based on relations between numeric keys
  ◦ Fuzzy merges of datasets (possibly by specified groups) based on similarities in a string variable
• Data Analysis
  ◦ Introduction to Statistics
  ◦ Descriptive Statistics
    ▪ Mean, Median, Mode, Standard deviation, Variance, Skewness, Kurtosis, Normality with the aid of Stata 12
  ◦ Exact sampling statistics: Student's t, Fisher's F and Chi-square tests with Stata 12
  ◦ Categorical regressions using Stata 12
    ▪ Logit regression
    ▪ Multinomial Logit regressions
  ◦ Least squares regressions with indicator variables as regressors
  ◦ Regressions with interactions between regressors
• Data Visualization
  ◦ Graphical analysis using MS Excel
  ◦ Graphical analysis using Stata 12
References

1. Information Management and Technology Assurance Section (AICPA) (2013) An Overview of Data Management, accessed 1st January 2019 at: https://www.aicpa.org/InterestAreas/InformationTechnology/Resources/DataAnalytics/DownloadableDocuments/Overview_Data_Mgmt.pdf
2. Intra-governmental Group on Geographic Information (IGGI) (2000) The principles of good data management, 1st Edition, Department of the Environment, Transport and the Regions, UK
3. Intra-governmental Group on Geographic Information (IGGI) (2005) The principles of good data management, 2nd Edition, Department of the Environment, Transport and the Regions, UK
4. PACT (2014) Field guide for data quality management, module 2, Monitoring, Evaluation, Results and Learning Series Publications, Washington DC.
5. Soler A. S., Ort M., Steckel J. & Nieschullze J. (2016) An introduction to data management, accessed 1st January 2019 at: https://www.gfbio.org/documents/10184/22817/Reader_GFBio_BefMate_20160222/1ca43f24-2550-44b3-a05e-e180c3e544c0
The End

Thank you
