Chapter II Data Collection and Management
Data collection and management is the process of gathering and organizing data so that it
can be used to answer questions, make decisions, or solve problems. It is an essential part
of many research projects, business operations, and government initiatives. A typical
process involves the following steps:
1. Define the research question or problem. What do you want to learn from the data?
2. Identify the data sources. Where will you get the data?
3. Choose the data collection methods. How will you collect the data?
4. Collect the data. This may involve conducting surveys, interviews, or experiments.
5. Clean and prepare the data. This involves removing errors and inconsistencies
from the data.
6. Store and manage the data. This involves organizing the data so that it can be
easily accessed and analyzed.
7. Analyze the data. This involves using statistical methods to draw insights from the
data.
8. Communicate the results. This involves sharing the findings of the data analysis
with others.
Data management is the ongoing process of organizing, storing, and protecting data. It is
important to ensure that the data is accessible to authorized users, and that it is protected
from unauthorized access, use, or disclosure.
Data collection and management is a complex and challenging process, but it is essential
for organizations to make informed decisions and achieve their goals.
There are many different methods of data collection, each with its own advantages and
disadvantages. Some of the most common methods include:
Surveys: Surveys are a popular way to collect data from a large number of people. They
can be conducted online, by mail, or in person.
Interviews: Interviews are a more in-depth way to collect data from a smaller number of
people. They can be conducted face-to-face, over the phone, or by video chat.
Experiments: Experiments are used to test cause-and-effect relationships. They involve
manipulating one variable and observing the effects on another variable.
Observational studies: Observational studies are used to collect data without interfering
with the subjects being studied. They can be conducted in a natural setting or in a
laboratory.
The best data collection method for a particular project will depend on the research
question, the budget, and the time constraints.
Data collection and management is a critical part of many research projects, business
operations, and government initiatives. By following the steps outlined above,
organizations can ensure that they are collecting and managing data in a way that is
efficient, effective, and compliant.
Data Source
A data source is the location where data originates from. It can be internal or external to
an organization. Internal data sources include:
Transactional data: This is data about the day-to-day operations of an organization, such
as sales, orders, and inventory.
Customer data: This is data about customers, such as demographics, purchase history,
and contact information.
Employee data: This is data about employees, such as compensation, performance
reviews, and training history.
Financial data: This is data about the financial performance of an organization, such as
revenue, expenses, and profits.
External data sources include:
Government data: This is data collected by governments, such as census data and
economic data.
Industry data: This is data collected by industry associations, such as market research
data and pricing data.
Academic data: This is data collected by universities and research institutions, such as
scientific data and medical data.
Social media data: This is data collected from social media platforms, such as user
posts, comments, and likes.
The choice of data source will depend on the specific needs of the organization. For
example, if an organization is trying to understand its customer behavior, it might use
customer data from its CRM system. If an organization is trying to forecast demand, it
might use economic data from a government agency.
When evaluating a data source, consider the following criteria:
Relevance: The data should be relevant to the research question or problem that the
organization is trying to solve.
Accuracy: The data should be accurate and reliable.
Timeliness: The data should be up-to-date.
Accessibility: The data should be easy to access and use.
Cost: The cost of obtaining the data should be reasonable.
Data sources can be classified into two main categories: primary and secondary.
Primary data is data that is collected for the first time. It is typically collected through
surveys, interviews, experiments, or observations.
Secondary data is data that has already been collected and is available for reuse. It can
be found in books, journals, government reports, and online databases.
Primary data is often more accurate and reliable than secondary data, but it can be more
expensive and time-consuming to collect. Secondary data is less expensive and time-
consuming to collect, but it may not be as accurate or reliable as primary data.
The best data source for a particular project will depend on the research question, the
budget, and the time constraints.
Data Collection Using APIs
An application programming interface (API) allows a program to request data directly from
a provider. Common sources of data that can be collected through APIs include:
Web services: Web services are websites that provide data through APIs. For example,
the Google Maps API allows you to get information about geographic locations, such as
their latitude and longitude.
Databases: APIs can be used to access data from databases. For example, the MySQL
API allows you to query MySQL databases.
Sensor devices: APIs can be used to collect data from sensor devices, such as
temperature sensors or GPS sensors.
Social media platforms: APIs can be used to collect data from social media platforms,
such as Twitter or Facebook.
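As a small illustration of the Databases item above, the Python sketch below connects to a
database, runs a query, and fetches the results. It uses the built-in sqlite3 module rather
than MySQL so that it runs without a database server (an assumption made purely to keep the
example self-contained); the connect-query-fetch pattern is the same for other databases.

    # Query a database from a program: connect, run SQL, fetch the results.
    # sqlite3 is used here only so the sketch runs without a server.
    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
    conn.execute("INSERT INTO sales VALUES ('pen', 12), ('notebook', 5)")

    rows = conn.execute("SELECT item, qty FROM sales").fetchall()
    print(rows)  # [('pen', 12), ('notebook', 5)]
    conn.close()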
To collect data using an API, follow these general steps:
Find the API that you want to use. There are many APIs available, so you will need to
do some research to find one that meets your needs.
Get an API key. Most APIs require you to get an API key before you can use them. This
key is used to authenticate your requests and to prevent unauthorized access to the data.
Understand the API documentation. The API documentation will tell you how to use the
API to request and receive data.
Make API requests. Once you understand the API documentation, you can start making
API requests to collect data.
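As a rough sketch of these steps in Python using the requests library, the example below
sends an authenticated request and reads the JSON response. The URL, endpoint, and api_key
parameter are hypothetical placeholders; every real API documents its own base URL,
authentication scheme, and parameters.

    # Sketch of collecting data from a web API; adapt the details to the API you choose.
    import requests

    API_KEY = "your-api-key"  # obtained by registering with the API provider
    URL = "https://api.example.com/v1/records"  # hypothetical endpoint

    response = requests.get(URL, params={"api_key": API_KEY, "limit": 100})
    response.raise_for_status()  # stop if the request was rejected
    records = response.json()  # most web APIs return JSON
    print(len(records), "records collected")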
Data collection using APIs can be a quick and easy way to get the data that you need.
However, it is important to be aware of the limitations of APIs. For example, some APIs
may only allow you to access a limited amount of data, or they may charge a fee for
access.
Overall, APIs can be a valuable tool for data collection. However, it is important to weigh
the benefits and challenges before deciding whether or not to use them.
Data Exploration and Data Fixing
Data exploration is the process of understanding the data by summarizing its main
characteristics and identifying patterns and outliers. It is an important first step in data
analysis, as it helps to ensure that the data is clean and ready for further analysis.
There are many different methods of data exploration, but some of the most common
include:
Data profiling: This involves summarizing the main characteristics of the data, such as
the number of records, the number of variables, the data types, and the distribution of the
values.
Data visualization: This involves using charts and graphs to visualize the data, which
can help to identify patterns and outliers.
Statistical analysis: This involves using statistical tests to identify relationships between
variables.
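The laboratory activities in this course use Excel, but the same profiling ideas can be
sketched quickly in Python with pandas (an assumption, not a required tool); the tiny
dataset below is made up for illustration.

    # Data profiling: record counts, data types, and value distributions.
    import pandas as pd

    df = pd.DataFrame({
        "score": [88, 92, None, 75, 990],  # includes a missing value and an outlier
        "group": ["A", "B", "A", "B", "A"],
    })

    df.info()  # number of records, number of variables, data types
    print(df.describe())  # distribution of the numeric values
    print(df["group"].value_counts())  # distribution of a categorical variable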
Data fixing is the process of identifying and correcting errors in the data. It is an important
step in data preparation, as it ensures that the data is accurate and reliable.
There are many different methods of data fixing, but some of the most common
include:
Data cleaning: This involves removing errors from the data, such as typos, missing
values, and inconsistent values.
Data imputation: This involves filling in missing values with estimates.
Data transformation: This involves converting the data into a different format, such as
converting categorical data into numerical data.
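Again using pandas purely for illustration, the sketch below applies the three fixing
methods above to a made-up table: cleaning inconsistent text, imputing a missing value, and
transforming data types.

    # Data cleaning, imputation, and transformation on a small made-up table.
    import pandas as pd

    df = pd.DataFrame({
        "age": ["25", "31", None, "40"],
        "city": ["Manila", "manila ", "Cebu", "Cebu"],
    })

    df["city"] = df["city"].str.strip().str.title()  # cleaning: fix inconsistent text
    df["age"] = pd.to_numeric(df["age"])  # transformation: text to numbers
    df["age"] = df["age"].fillna(df["age"].median())  # imputation: fill the missing value
    df = pd.get_dummies(df, columns=["city"])  # transformation: categorical to numerical
    print(df)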
Data exploration and data fixing are essential steps in data analysis. By understanding
the data and fixing any errors, you can ensure that your analysis is accurate and reliable.
Here are some of the things to look for when exploring and fixing data:
Missing values: Are there any missing values in the data? If so, how many? And are they
missing randomly or systematically?
Outliers: Are there any outliers in the data? Outliers are data points that are significantly
different from the rest of the data. They can be caused by errors or by legitimate variation
in the data.
Duplicate values: Are there any duplicate values in the data? Duplicate values can occur
when data is entered incorrectly or when two different records refer to the same entity.
Inconsistent values: Are there any inconsistent values in the data? Inconsistent values
are conflicting values recorded for the same variable or entity, such as two different
spellings of the same city. They often arise from data entry errors or from merging records
from different sources.
Incorrect data types: Are there any data points that are stored in the wrong data type?
For example, a date value might be stored as a string.
Corrupt data: Is there any corrupt data in the file? Corrupt data is data that is damaged
or unreadable.
Once you have identified any problems with the data, you can take steps to fix them. For
example, you can remove missing values, impute missing values, or transform data types.
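The checklist above can also be screened programmatically; the pandas sketch below is one
hedged way to do it (the sample values are invented, and outliers are flagged with the
common 1.5 * IQR rule).

    # Screen a small dataset for missing values, duplicates, data types, and outliers.
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 2, 4], "salary": [30000, 32000, 32000, 990000]})

    print(df.isna().sum())  # missing values per column
    print(df.duplicated().sum())  # number of duplicate rows
    print(df.dtypes)  # incorrect data types show up here

    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
    print(outliers)  # rows far outside the typical range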
Data exploration and data fixing are important steps in data analysis. By taking the time
to explore and fix the data, you can ensure that your analysis is accurate and reliable.
Data Storage Management
Organizations must keep in mind how storage management has changed in recent years.
The COVID-19 pandemic accelerated remote work, the use of cloud services, and
cybersecurity concerns such as ransomware. All of those elements were already surging
before the pandemic, and they remain prominent after it.
This section explores what data storage management is, who needs it, its advantages and
challenges, key storage management software features, security and compliance concerns,
implementation tips, and vendors and products.
Storage management ensures data is available to users when they need it.
The data retention policy is a key element of storage management and a good starting
point for implementation. This policy defines the data an organization retains for
operational or compliance needs. It describes why the organization must keep the data,
the retention period and the process of disposal. It helps an organization determine how
it can search and access data. The retention policy is especially important now as data
volumes continually increase, and it can help cut storage space and costs.
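As a rough illustration only, the Python sketch below flags files that fall outside a
hypothetical seven-year retention period. A real retention policy also covers legal holds
and a documented disposal process, which this sketch does not implement; the folder name is
a placeholder.

    # Flag exported files older than a hypothetical retention period for review.
    from pathlib import Path
    from datetime import datetime, timedelta

    RETENTION = timedelta(days=7 * 365)  # assumed seven-year retention period
    cutoff = datetime.now() - RETENTION

    for f in Path("exports").glob("*.csv"):  # hypothetical folder of data exports
        modified = datetime.fromtimestamp(f.stat().st_mtime)
        if modified < cutoff:
            print("eligible for disposal:", f)  # review before actually deleting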
The task of data storage management also includes resource provisioning and
configuration, unstructured and structured data, and evaluating how needs might change
over time.
To help with implementation, a management tool that meets organizational needs can
ease the administrative burden that comes with large amounts of data.
Data storage management has both advantages and challenges. On the plus side, it
improves performance and protects against data loss. With effective management,
storage systems perform well across geographic areas, time and users. It also ensures
that data is safe from outside threats, human error and system failures. Proper backup
and disaster recovery are pieces of this data protection strategy.
An effective management strategy provides users with the right amount of storage
capacity. Organizations can scale storage space up and down as needed. The storage
strategy accommodates constantly changing needs and applications.
Benefits of data storage management include more efficient operations and optimized
resource utilization.
Distributed and complex systems present a hurdle for data storage management. Not
only are workers spread out, but systems run both on premises and in the cloud. An on-
premises storage environment could include HDDs, SSDs and tapes. Organizations often
use multiple clouds. New technologies, such as AI, can benefit organizations but also
increase complexity.
Unstructured data -- which includes documents, emails, photos, videos and metadata --
has surged, and this also complicates storage management. Unstructured
data challenges include volume, new types and how to gain value. Although some
organizations might not want to spend the time to manage unstructured data, in the end
it saves money and storage space. Vendors such as Aparavi, Dell EMC, Pure Storage
and Spectra Logic offer tools for this type of management.
Object storage can provide high performance but also has challenges, such as the
infrastructure's scale-out nature and potentially high latency. Organizations must address
issues with metadata performance and cluster management.
Here are some general methods and services for data storage management:
• storage resource management software
• consolidation of systems
• multiprotocol storage arrays
• storage tiers
• strategic SSD deployment
• hybrid cloud
• scale-out systems
• archive storage of infrequently accessed data
• elimination of inactive virtual machines
• deduplication (see the sketch after this list)
• disaster recovery as a service
• object storage
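To make one of these methods concrete, the sketch below performs simple file-level
deduplication by content hash in Python; the data folder name is a hypothetical example, and
production deduplication systems typically work at the block level rather than whole files.

    # Find duplicate files by hashing their contents.
    import hashlib
    from pathlib import Path

    seen = {}
    for f in Path("data").rglob("*"):  # hypothetical folder to scan
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            if digest in seen:
                print(f, "duplicates", seen[digest])  # candidate for removal or a link
            else:
                seen[digest] = f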
In terms of media type, it is tempting to go all-flash because of its performance. However,
to save money, consider a hybrid drive option that combines high-capacity HDD and
high-speed SSD technology.
Organizations also must choose among object, block and file storage. Block storage is
the default type for HDDs and SSDs, and it provides strong performance. File storage
places files in folders and offers simplicity. Object storage efficiently organizes
unstructured data at a comparatively low cost. NAS is another worthwhile option for
storing unstructured data because of its organizational capabilities and speed.
Storage security
With threats both internal and external, storage security is as important as ever to a
management strategy. Storage security ensures protection and availability by enabling
data accessibility for authorized users and protecting against unauthorized access.
A storage security strategy should have tiers. Security risks are so varied, from
ransomware to insider threats, that organizations must protect their data storage in a
number of ways. Proper permissions, monitoring and encryption are key to cyberthreat
defense.
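As one small illustration of encryption at rest, the sketch below uses the Fernet interface
of the Python cryptography package (an assumption; any vetted encryption library or
storage-level encryption feature serves the same role).

    # Encrypt and decrypt data with a symmetric key.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # in practice, keep this key in a secrets manager
    cipher = Fernet(key)

    ciphertext = cipher.encrypt(b"quarterly sales figures")
    print(cipher.decrypt(ciphertext))  # only holders of the key can read the data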
Offline storage -- for example, in tape backup -- that isn't connected to a network is a
strong way to keep data safe. If attackers can't reach the data, they can't harm it. While
it's not feasible to keep all data offline, this type of storage is an important aspect of a
strong storage security strategy.
Another aspect is off-site storage, one form of which is cloud storage. Organizations
shouldn't assume that this keeps their data entirely safe. Users are responsible for their
data, and cloud storage is still online and thus open to some risk.
The surge in remote workers produced a new level of storage security complications,
including the following risks:
• less secure home office environments;
• use of personal devices for work;
• misuse of services and applications;
• less formal work habits;
• adjustments to working from home; and
• more opportunities for malicious insiders.
Endpoint security, encryption, access controls and user training help protect against these
new storage security issues.
Storage compliance
Regulations such as the General Data Protection Regulation (GDPR) can spur enterprises into
adopting practices that deliver long-term competitive advantages.
Data storage management helps organizations understand where they have data, which
is a major piece of compliance. Compliance best practices include documentation,
automation, anonymization and use of governance tools.
Immutable data storage also helps achieve compliance. Immutability ensures retained
data -- for example, legal holds -- doesn't change. Vendors such as AWS, Dell EMC and
Wasabi provide immutable storage. However, organizations should still retain more than
one copy of this data, as immutability doesn't protect against physical threats, such as
natural disasters.
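A hedged sketch of writing an immutable (write-once) object with the AWS boto3 SDK is shown
below; it assumes AWS credentials are configured and that the bucket was created with S3
Object Lock enabled, and the bucket and key names are made up.

    # Store an object that cannot be altered or deleted until the retention date passes.
    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="example-compliance-bucket",  # hypothetical bucket with Object Lock enabled
        Key="legal-hold/report-2024.pdf",
        Body=b"report contents",
        ObjectLockMode="COMPLIANCE",  # retention cannot be shortened or removed
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )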
Recent trends among vendors include services for cloud storage and the container
management platform Kubernetes. Top storage providers can support a range of different
platforms. And though Kubernetes is more specialized, it has gained traction: Vendors
such as Diamanti, NetApp and Pure Storage provide Kubernetes services.
Some form of cloud management is essentially table stakes for storage vendors. A few
vendors, including Cohesity and Rubrik, have made cloud data management a hallmark
of their platforms. Many organizations use more than one cloud, so multi-cloud data
management is crucial. Managing data storage across multiple clouds is complex, but
vendors such as Ctera, Dell EMC, NetApp and Nutanix can help.
In addition, admins must be aware of other new and emerging technologies that help
storage management, from automation to machine learning.
Lesson Overview
Lecture Outline
Each topic is covered with a theoretical lecture followed by practical laboratory activities.
The activities focus on applying the concepts in Microsoft Excel, allowing students to
work with datasets related to their fields (Mathematics or Social Science).
Theoretical Lecture
• Objective: Understand why data collection and preparation are crucial in data
science.
• Key Points:
o Foundation of Data Science: Data collection is the first step in data
analysis; the quality of collected data determines the reliability of results.
o Minimizing Errors: Proper preparation helps identify and correct errors
early in the analysis process.
o Efficiency: Well-organized data saves time during the analysis stage.
o Consistency: Consistent data allows for accurate comparisons across
different datasets or studies.
Theoretical Lecture
• Objective: Learn how to import data from different formats into Excel.
• Key Points:
o Data Formats: Understanding common formats such as CSV, JSON, and
Excel workbooks.
o Data Import Techniques: Using Excel's Get Data feature for importing
data from various sources.
o Combining Data: Methods for combining multiple datasets into one using
Excel tools like Power Query.
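The laboratory itself uses Excel's Get Data feature and Power Query, but for comparison the
equivalent import-and-combine flow in Python pandas might look like the sketch below; the
file names are hypothetical placeholders.

    # Import data from several formats and combine the tables into one.
    import pandas as pd

    sales_csv = pd.read_csv("sales_jan.csv")  # CSV file
    sales_json = pd.read_json("sales_feb.json")  # JSON file
    sales_xlsx = pd.read_excel("sales_mar.xlsx")  # Excel workbook

    combined = pd.concat([sales_csv, sales_json, sales_xlsx], ignore_index=True)
    print(combined.head())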
Laboratory Activities
Description: This dataset includes sales information with some inconsistencies and
missing values. It helps students practice importing data and performing basic data
quality assessments.
Example Data:
Description: This dataset includes customer feedback with various issues such as
extra spaces, inconsistencies in text, and varying formats. It is used for data cleaning
and preprocessing.
Example Data:
3 | 17/01/2024 | Average experience | 3
• Open the Excel file and clean the data using functions like TRIM(), CLEAN(), and
SUBSTITUTE().
• Standardize the feedback text (e.g., remove extra spaces, correct
inconsistencies).
• Normalize the rating scale if necessary.
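For reference, the same TRIM()/SUBSTITUTE()-style cleanup can be sketched in Python pandas
as shown below; the laboratory activity itself is done in Excel, and the sample rows are
invented.

    # Standardize feedback text and normalize a 1-5 rating to a 0-1 scale.
    import pandas as pd

    fb = pd.DataFrame({
        "Feedback": ["  Average   experience ", "GREAT service!!", "ok"],
        "Rating": [3, 5, 4],
    })

    fb["Feedback"] = (fb["Feedback"]
                      .str.strip()  # like TRIM(): drop leading/trailing spaces
                      .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
                      .str.lower())  # standardize case
    fb["Rating_0_1"] = (fb["Rating"] - 1) / 4  # normalize the 1-5 scale to 0-1
    print(fb)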
Description: This dataset includes employee performance metrics with some missing
values and potential outliers. Students will practice handling missing data and
identifying outliers.
Example Data: