
IT INST3 – Data Science Analytics Chapter II

Subject Teacher: Edward B. Panganiban, Ph.D.

Republic of the Philippines


Isabela State University
Echague, Isabela

College of Computing Studies, Information and Communication Technology

Chapter II - Data Collection and Management

Data collection and management is the process of gathering and organizing data so that
it can be used to answer questions, make decisions, or solve problems. It is an
essential part of many research projects, business operations, and government initiatives.

The data collection process typically involves the following steps:

1. Define the research question or problem. What do you want to learn from the data?
2. Identify the data sources. Where will you get the data?
3. Choose the data collection methods. How will you collect the data?
4. Collect the data. This may involve conducting surveys, interviews, or experiments.
5. Clean and prepare the data. This involves removing errors and inconsistencies
from the data.
6. Store and manage the data. This involves organizing the data so that it can
be easily accessed and analyzed.
7. Analyze the data. This involves using statistical methods to draw insights from the
data.
8. Communicate the results. This involves sharing the findings of the data analysis
with others.

Data management is the ongoing process of organizing, storing, and protecting data. It is
important to ensure that the data is accessible to authorized users, and that it is protected
from unauthorized access, use, or disclosure.

Data collection and management is a complex and challenging process, but it is essential
for organizations to make informed decisions and achieve their goals.

Here are some of the benefits of data collection and management:

Improved decision-making: Data can help organizations make better decisions by
providing insights into their operations, customers, and markets.
Increased efficiency: Data can help organizations identify areas where they can improve
efficiency, such as by reducing costs or increasing productivity.
Enhanced innovation: Data can help organizations identify new opportunities for
innovation by providing insights into customer needs and trends.
Improved compliance: Data can help organizations comply with regulations by providing
a record of their activities.
Enhanced security: Data can help organizations protect themselves from security
threats by providing a way to track and monitor access to sensitive data.


There are many different methods of data collection, each with its own advantages and
disadvantages. Some of the most common methods include:

Surveys: Surveys are a popular way to collect data from a large number of people. They
can be conducted online, by mail, or in person.
Interviews: Interviews are a more in-depth way to collect data from a smaller number of
people. They can be conducted face-to-face, over the phone, or by video chat.
Experiments: Experiments are used to test cause-and-effect relationships. They involve
manipulating one variable and observing the effects on another variable.
Observational studies: Observational studies are used to collect data without interfering
with the subjects being studied. They can be conducted in a natural setting or in a
laboratory.

The best data collection method for a particular project will depend on the research
question, the budget, and the time constraints.

Data management is an ongoing process that involves organizing, storing, and
protecting data. There are many different data management tools and techniques
available, and the best choice will depend on the specific needs of the organization.

Data collection and management is a critical part of many research projects, business
operations, and government initiatives. By following the steps outlined above,
organizations can ensure that they are collecting and managing data in a way that is
efficient, effective, and compliant.

Data Source

A data source is the location where data originates. It can be internal or external to
an organization. Internal data sources include:

Transactional data: This is data about the day-to-day operations of an organization, such
as sales, orders, and inventory.
Customer data: This is data about customers, such as demographics, purchase history,
and contact information.
Employee data: This is data about employees, such as compensation, performance
reviews, and training history.
Financial data: This is data about the financial performance of an organization, such as
revenue, expenses, and profits.

External data sources include:

Government data: This is data collected by governments, such as census data and
economic data.
Industry data: This is data collected by industry associations, such as market research
data and pricing data.


Academic data: This is data collected by universities and research institutions, such as
scientific data and medical data.
Social media data: This is data collected from social media platforms, such as user
posts, comments, and likes.

The choice of data source will depend on the specific needs of the organization. For
example, if an organization is trying to understand its customer behavior, it might use
customer data from its CRM system. If an organization is trying to forecast demand, it
might use economic data from a government agency.

When choosing a data source, it is important to consider the following factors:

Relevance: The data should be relevant to the research question or problem that the
organization is trying to solve.
Accuracy: The data should be accurate and reliable.
Timeliness: The data should be up-to-date.
Accessibility: The data should be easy to access and use.
Cost: The cost of obtaining the data should be reasonable.

Data sources can be classified into two main categories: primary and secondary.

Primary data is data that is collected for the first time. It is typically collected through
surveys, interviews, experiments, or observations.
Secondary data is data that has already been collected and is available for reuse. It can
be found in books, journals, government reports, and online databases.

Primary data is often more accurate and reliable than secondary data, but it can be more
expensive and time-consuming to collect. Secondary data is less expensive and
time-consuming to collect, but it may not be as accurate or reliable as primary data.

The best data source for a particular project will depend on the research question, the
budget, and the time constraints.

Data Collection and APIs

APIs, or application programming interfaces, are a way for software applications to
communicate with each other. They provide a set of rules and instructions that allow
applications to request and receive data from each other.

APIs can be used to collect data from a variety of sources, including:

Web services: Web services are websites that provide data through APIs. For example,
the Google Maps API allows you to get information about geographic locations, such as
their latitude and longitude.
Databases: APIs can be used to access data from databases. For example, the MySQL
API allows you to query MySQL databases.


Sensor devices: APIs can be used to collect data from sensor devices, such as
temperature sensors or GPS sensors.
Social media platforms: APIs can be used to collect data from social media platforms,
such as Twitter or Facebook.

To collect data using an API, you will need to:

1. Find the API that you want to use. There are many APIs available, so you will need to
do some research to find one that meets your needs.
2. Get an API key. Most APIs require you to get an API key before you can use them. This
key is used to authenticate your requests and to prevent unauthorized access to the data.
3. Understand the API documentation. The API documentation will tell you how to use the
API to request and receive data.
4. Make API requests. Once you understand the API documentation, you can start making
API requests to collect data. (A minimal Python sketch follows this list.)

Data collection using APIs can be a quick and easy way to get the data that you need.
However, it is important to be aware of the limitations of APIs. For example, some APIs
may only allow you to access a limited amount of data, or they may charge a fee for
access.

Here are some of the benefits of using APIs to collect data:

Speed: APIs can be used to collect data quickly and easily.
Efficiency: APIs can automate the data collection process, which can save time and
resources.
Flexibility: APIs can be used to collect data from a variety of sources.
Scalability: APIs can be scaled to meet the needs of large data sets.

Here are some of the challenges of using APIs to collect data:

Cost: Some APIs may charge a fee for access.
Security: APIs can be a security risk if they are not properly secured.
Complexity: APIs can be complex to use, especially if you are not familiar with them.

Overall, APIs can be a valuable tool for data collection. However, it is important to weigh
the benefits and challenges before deciding whether or not to use them.

Exploring and Fixing Data

Data exploration is the process of understanding the data by summarizing its main
characteristics and identifying patterns and outliers. It is an important first step in data
analysis, as it helps to ensure that the data is clean and ready for further analysis.

There are many different methods of data exploration, but some of the most common
include:


Data profiling: This involves summarizing the main characteristics of the data, such as
the number of records, the number of variables, the data types, and the distribution of the
values.
Data visualization: This involves using charts and graphs to visualize the data, which
can help to identify patterns and outliers.
Statistical analysis: This involves using statistical tests to identify relationships between
variables. (A short pandas sketch of data profiling follows this list.)
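To make data profiling concrete, the short pandas sketch below summarizes a dataset's main characteristics. The file name survey.csv is a placeholder for whatever dataset you are exploring.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # placeholder file name

print(df.shape)         # number of records and variables
print(df.dtypes)        # data type of each column
print(df.describe())    # distribution of the numeric values
print(df.isna().sum())  # missing values per column
```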

Data fixing is the process of identifying and correcting errors in the data. It is an important
step in data preparation, as it ensures that the data is accurate and reliable.

There are many different methods of data fixing, but some of the most common
include:

Data cleaning: This involves removing errors from the data, such as typos, missing
values, and inconsistent values.
Data imputation: This involves filling in missing values with estimates.
Data transformation: This involves converting the data into a different format, such as
converting categorical data into numerical data. (A pandas sketch of these methods follows.)
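Assuming the same placeholder dataset as above, the sketch below illustrates each fixing method in pandas; the column names score and category are hypothetical.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # placeholder file name

# Data cleaning: drop exact duplicate records.
df = df.drop_duplicates()

# Data imputation: fill missing numeric values with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())  # hypothetical column

# Data transformation: convert categorical data into numerical codes.
df["category_code"] = df["category"].astype("category").cat.codes  # hypothetical column
```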

Data exploration and data fixing are essential steps in data analysis. By understanding
the data and fixing any errors, you can ensure that your analysis is accurate and reliable.

Here are some of the things to look for when exploring and fixing data:

Missing values: Are there any missing values in the data? If so, how many? And are they
missing randomly or systematically?
Outliers: Are there any outliers in the data? Outliers are data points that are significantly
different from the rest of the data. They can be caused by errors or by legitimate variation
in the data.
Duplicate values: Are there any duplicate values in the data? Duplicate values can occur
when data is entered incorrectly or when two different records refer to the same entity.
Inconsistent values: Are there any inconsistent values in the data? Inconsistent values
are data points that have different values for the same variable. They can occur when
data is entered incorrectly or when two different records refer to the same entity.
Incorrect data types: Are there any data points that are stored in the wrong data type?
For example, a date value might be stored as a string.
Corrupt data: Is there any corrupt data in the file? Corrupt data is data that is damaged
or unreadable.

Once you have identified any problems with the data, you can take steps to fix them. For
example, you can remove missing values, impute missing values, or transform data types.

Data exploration and data fixing are important steps in data analysis. By taking the time
to explore and fix the data, you can ensure that your analysis is accurate and reliable.


Data storage management: What is it and why is it important?


Effective data storage management is more important than ever, as security and
regulatory compliance have become even more challenging and complex over
time.

Enterprise data volumes continue to grow exponentially. So how can organizations
effectively store it all? That's where data storage management comes in.

Effective management is key to ensuring organizations use storage resources
effectively, and that they store data securely in compliance with company policies and
government regulations. IT administrators and managers must understand what
procedures and tools encompass data storage management to develop their own
strategy.

Organizations must keep in mind how storage management has changed in recent years.
The COVID-19 pandemic accelerated remote work, the use of cloud services and
cybersecurity concerns such as ransomware. Even before the pandemic, all of those
elements saw major surges, and they remain prominent.

This section explores what data storage management is, who needs it, its advantages and
challenges, key storage management software features, security and compliance
concerns, implementation tips, and vendors and products.

What data storage management is, who needs it and how to implement it

Storage management ensures data is available to users when they need it.

Data storage management is typically part of the storage administrator's job.
Organizations without a dedicated storage administrator might use an IT generalist for
storage management.

The data retention policy is a key element of storage management and a good starting
point for implementation. This policy defines the data an organization retains for
operational or compliance needs. It describes why the organization must keep the data,
the retention period and the process of disposal. It helps an organization determine how
it can search and access data. The retention policy is especially important now as data
volumes continually increase, and it can help cut storage space and costs.

The task of data storage management also includes resource provisioning and
configuration, handling both structured and unstructured data, and evaluating how needs
might change over time.

To help with implementation, a management tool that meets organizational needs can
ease the administrative burden that comes with large amounts of data. Features to look
for in a management tool include storage capacity planning, performance monitoring,
compression and deduplication.

Advantages and challenges of data storage management

Data storage management has both advantages and challenges. On the plus side, it
improves performance and protects against data loss. With effective management,
storage systems perform well across geographic areas, time and users. It also ensures
that data is safe from outside threats, human error and system failures. Proper backup
and disaster recovery are pieces of this data protection strategy.
An effective management strategy provides users with the right amount of storage
capacity. Organizations can scale storage space up and down as needed. The storage
strategy accommodates constantly changing needs and applications.

Storage management also makes it easier on admins by centralizing administration so
they can oversee a variety of storage systems. These benefits lead to reduced costs as
well, as admins are able to better utilize storage resources.

Benefits of data storage management include more efficient operations and optimized
resource utilization.

Challenges of data storage management include persistent cyberthreats, data
management regulations and a distributed workforce. These challenges illustrate why it's
so important to implement a comprehensive plan: a storage management strategy should
ensure organizations protect their data against data breaches, ransomware and other
malware attacks; lack of compliance could lead to hefty fines; and remote workers must
know they'll have access to files and applications just as they would in a traditional office
environment.


Distributed and complex systems present a hurdle for data storage management. Not
only are workers spread out, but systems run both on premises and in the cloud. An
on-premises storage environment could include HDDs, SSDs and tapes. Organizations often
use multiple clouds. New technologies, such as AI, can benefit organizations but also
increase complexity.

Unstructured data -- which includes documents, emails, photos, videos and metadata --
has surged, and this also complicates storage management. Unstructured
data challenges include volume, new types and how to gain value. Although some
organizations might not want to spend the time to manage unstructured data, in the end
it saves money and storage space. Vendors such as Aparavi, Dell EMC, Pure Storage
and Spectra Logic offer tools for this type of management.

Object storage can provide high performance but also has challenges, including the
infrastructure's scale-out nature and potentially high latency, for example. Organizations
must address issues with metadata performance and cluster management.

Data storage management strategies


Storage management processes and practices vary, depending on the technology,
platform and type.

Here are some general methods and services for data storage management:
• storage resource management software
• consolidation of systems
• multiprotocol storage arrays
• storage tiers
• strategic SSD deployment
• hybrid cloud
• scale-out systems
• archive storage of infrequently accessed data
• elimination of inactive virtual machines
• deduplication
• disaster recovery as a service
• object storage

Organizations may consider incorporating standards-based storage management
interfaces as part of their management strategy. The Storage Management Initiative
Specification and the Intelligent Platform Management Interface are two veteran models,
while Redfish and Swordfish have emerged as newer options. Interfaces offer
management, monitoring and simplification.

As far as media type, it's tempting to go all-flash because of its performance. However,
to save money, try a hybrid drive option that incorporates high-capacity HDD and
high-speed SSD technology.


Organizations also must choose among object, block and file storage. Block storage is
the default type for HDDs and SSDs, and it provides strong performance. File storage
places files in folders and offers simplicity. Object storage efficiently organizes
unstructured data at a comparatively low cost. NAS is another worthwhile option for
storing unstructured data because of its organizational capabilities and speed.


Storage security
With threats both internal and external, storage security is as important as ever to a
management strategy. Storage security ensures protection and availability by enabling
data accessibility for authorized users and protecting against unauthorized access.

A storage security strategy should have tiers. Security risks are so varied, from
ransomware to insider threats, that organizations must protect their data storage in a
number of ways. Proper permissions, monitoring and encryption are key to cyberthreat
defense.

Offline storage -- for example, in tape backup -- that isn't connected to a network is a
strong way to keep data safe. If attackers can't reach the data, they can't harm it. While
it's not feasible to keep all data offline, this type of storage is an important aspect of a
strong storage security strategy.


Another aspect is off-site storage, one form of which is cloud storage. Organizations
shouldn't assume that this keeps their data entirely safe. Users are responsible for their
data, and cloud storage is still online and thus open to some risk.

The surge in remote workers produced a new level of storage security complications,
including the following risks:
• less secure home office environments;
• use of personal devices for work;
• misuse of services and applications;
• less formal work habits;
• adjustments to working from home; and
• more opportunities for malicious insiders.

Endpoint security, encryption, access controls and user training help protect against these
new storage security issues.

Data storage compliance


Compliance with regulations has always been important, but the need has increased in
the last few years with laws such as the General Data Protection Regulation (GDPR) and
the California Consumer Privacy Act. These laws specifically address data and storage,
so it's incumbent on organizations to comprehend them and ensure compliance.


GDPR can spur enterprises into adopting practices that deliver long-term competitive
advantages.

Data storage management helps organizations understand where they have data, which
is a major piece of compliance. Compliance best practices include documentation,
automation, anonymization and use of governance tools.

Immutable data storage also helps achieve compliance. Immutability ensures retained
data -- for example, legal holds -- doesn't change. Vendors such as AWS, Dell EMC and
Wasabi provide immutable storage. However, organizations should still retain more than
one copy of this data, as immutability doesn't protect against physical threats, such as
natural disasters.

Data storage technology, vendors and products


Key features for overall data storage management providers include resource
provisioning, process automation, load balancing, capacity planning and management,
predictive analytics, performance monitoring, replication, compression, deduplication,
snapshotting and cloning.

Recent trends among vendors include services for cloud storage and the container
management platform Kubernetes. Top storage providers can support a range of different
platforms. And though Kubernetes is more specialized, it has gained traction: Vendors
such as Diamanti, NetApp and Pure Storage provide Kubernetes services.


Some form of cloud management is essentially table stakes for storage vendors. A few
vendors, including Cohesity and Rubrik, have made cloud data management a hallmark
of their platforms. Many organizations use more than one cloud, so multi-cloud data
management is crucial. Managing data storage across multiple clouds is complex, but
vendors such as Ctera, Dell EMC, NetApp and Nutanix can help.

Cloud management components include automation and orchestration; security;
governance and compliance; performance monitoring; and cost management.

The future of data storage management


Data storage administrators must be ready for a consistently evolving field. Cloud storage
was trending up before the pandemic and has skyrocketed since -- and once
organizations go to the cloud, they typically stay there. As a result, admins must
understand the various forms of cloud storage management, including multi-cloud, hybrid
cloud, cloud-native data and cloud data protection.


Hyper-convergence, composable infrastructure and computational storage are also
popular frameworks.

In addition, admins must be aware of other new and emerging technologies that help
storage management, from automation to machine learning.

Lesson Overview

1. Importance of Data Collection and Preparation
2. Basic Data Quality Assessment
3. Ethical Considerations in Data Collection
4. Importing Data from Various Sources
5. Data Cleaning and Preprocessing in Excel
6. Handling Missing Data and Outliers

Lecture Outline

Each topic is covered with a theoretical lecture followed by practical laboratory activities.
The activities focus on applying the concepts in Microsoft Excel, allowing students to
work with datasets related to their fields (Mathematics or Social Science).

1. Importance of Data Collection and Preparation

Theoretical Lecture

• Objective: Understand why data collection and preparation are crucial in data
science.
• Key Points:
o Foundation of Data Science: Data collection is the first step in data
analysis; the quality of collected data determines the reliability of results.
o Minimizing Errors: Proper preparation helps identify and correct errors
early in the analysis process.
o Efficiency: Well-organized data saves time during the analysis stage.
o Consistency: Consistent data allows for accurate comparisons across
different datasets or studies.

Laboratory Activities

• Task: Create a Data Collection Plan using Excel.
• Steps:
1. Define Data Requirements: Identify the type of data needed for analysis
based on the student's major (e.g., test scores for Mathematics or survey
responses for Social Science).


2. Design a Data Collection Template in Excel: Set up a spreadsheet with
columns for data sources, collection methods, data types, and validation
rules.
3. Document the Collection Process: Write a step-by-step plan for
collecting the data, including tools and procedures to be used.

2. Basic Data Quality Assessment

Theoretical Lecture

• Objective: Learn to assess data quality using dimensions such as
completeness, accuracy, and consistency.
• Key Points:
o Completeness: Ensures all necessary data is present.
o Accuracy: Data must be correct and free from errors.
o Consistency: Data should follow the same format and standards.
o Validity: Data should represent what it is supposed to measure.
o Uniqueness: Avoids duplicate records that can skew analysis results. (A
pandas sketch of these checks follows.)
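Although the laboratory work uses Excel, the same quality dimensions can be checked programmatically. The sketch below is an illustration in Python with pandas; the file name grades.csv, the 0-100 score rule, and the column names are assumptions, not part of the lab.

```python
import pandas as pd

df = pd.read_csv("grades.csv")  # hypothetical dataset

# Completeness: share of non-missing cells in each column.
print(df.notna().mean())

# Uniqueness: count duplicate records.
print("duplicate rows:", df.duplicated().sum())

# Validity: flag scores outside an assumed 0-100 range.
invalid = df[(df["score"] < 0) | (df["score"] > 100)]
print("invalid rows:", len(invalid))

# Consistency: standardize text labels to a single format.
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female"})
```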

Laboratory Activities

• Task: Perform a Data Quality Assessment using Excel.
• Steps:
1. Import Data into Excel: Students import a dataset they have collected.
2. Check for Completeness:
▪ Use COUNTA and IF formulas to identify missing data.
3. Validate Data Accuracy:
▪ Use Data Validation to set rules for data entry (e.g., numerical
ranges, text length).
4. Ensure Consistency:
▪ Use Find and Replace to standardize text entries (e.g., "Male"
and "Female" vs. "M" and "F").
5. Check for Duplicates:
▪ Use the Remove Duplicates feature to identify and remove
duplicates.

3. Ethical Considerations in Data Collection

Theoretical Lecture

• Objective: Understand ethical implications when collecting and handling data.
• Key Points:
o Informed Consent: Participants must be informed about the data
collection process.
o Data Privacy and Security: Protecting sensitive information is critical.


o Anonymity and Confidentiality: Ensure that participants' identities
remain confidential.
o Avoiding Bias: Collect data in a manner that avoids bias and
misrepresentation.

Laboratory Activities

• Task: Develop an Ethical Data Collection Plan.
• Steps:
1. Define the Purpose of the Data Collection: Document the objectives
and intended use of the data.
2. Create a Consent Form Template in Excel: Include sections for study
information, participant rights, and signature fields.
3. Design Data Storage and Access Control Measures: Plan how to
secure data in Excel (e.g., using password protection and hidden sheets).

4. Importing Data from Various Sources

Theoretical Lecture

• Objective: Learn how to import data from different formats into Excel.
• Key Points:
o Data Formats: Understanding common formats such as CSV, JSON, and
Excel workbooks.
o Data Import Techniques: Using Excel's Get Data feature for importing
data from various sources.
o Combining Data: Methods for combining multiple datasets into one using
Excel tools like Power Query. (A pandas equivalent is sketched below.)
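For comparison with Excel's Get Data feature, the sketch below performs the same kinds of imports with pandas. The file names and the shared OrderID key are placeholders.

```python
import pandas as pd

# Import from a CSV file (placeholder name).
jan = pd.read_csv("sales_jan.csv")

# Import from a JSON file (placeholder name).
feedback = pd.read_json("feedback.json")

# Append rows from another file -- analogous to appending queries in Power Query.
feb = pd.read_csv("sales_feb.csv")
sales = pd.concat([jan, feb], ignore_index=True)

# Merge two tables on a shared key column -- analogous to merging queries.
combined = sales.merge(feedback, on="OrderID", how="left")  # hypothetical key
print(combined.head())
```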

Laboratory Activities

• Task: Import Data from Various Sources.
• Steps:
1. Import from CSV Files:
▪ Go to Data > Get Data > From Text/CSV and select a file.
2. Import from Web Pages:
▪ Use Get Data > From Web to import data directly from a URL.
3. Combine Multiple Datasets:
▪ Use Power Query to merge or append datasets from different files.

5. Data Cleaning and Preprocessing in Excel

Theoretical Lecture

• Objective: Learn essential data cleaning and preprocessing techniques.
• Key Points:
o Removing Duplicates: Ensures unique data points.
o Handling Missing Data: Using techniques like mean imputation or data
interpolation.
o Standardizing Data Formats: Ensuring consistency in data types (e.g.,
date formats).
o Data Validation: Setting up rules to prevent incorrect data entry.

Laboratory Activities

• Task: Clean and Preprocess Data in Excel.
• Steps:
1. Remove Duplicates:
▪ Select the data range and go to Data > Remove Duplicates.
2. Handle Missing Data:
▪ Use IF and ISBLANK functions to fill in missing values.
3. Standardize Data Formats:
▪ Use TEXT functions and Data Validation to ensure consistent
formatting.
4. Create Data Validation Rules:
▪ Go to Data > Data Validation to set up entry restrictions (e.g., only
allowing numbers between 0 and 100).

6. Handling Missing Data and Outliers

Theoretical Lecture

• Objective: Learn methods to handle missing data and outliers.
• Key Points:
o Handling Missing Data: Options include deletion, mean/mode
imputation, and interpolation.
o Identifying Outliers: Use visual tools like box plots and scatter plots.
o Dealing with Outliers: Deciding whether to remove, cap, or transform
outliers based on their impact on the analysis. (A pandas sketch of these
methods follows.)
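To show these methods outside Excel, here is a sketch in pandas using the Lab Activity 3 dataset; the file name employee_performance.csv is assumed, and the 1.5 x IQR rule is one common convention for flagging outliers.

```python
import pandas as pd

df = pd.read_csv("employee_performance.csv")  # assumed file name for the Lab 3 data

# Handle missing data: mean imputation for the numeric score column.
df["PerformanceScore"] = df["PerformanceScore"].fillna(df["PerformanceScore"].mean())

# Identify outliers with the interquartile range (IQR) rule.
q1 = df["PerformanceScore"].quantile(0.25)
q3 = df["PerformanceScore"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df["PerformanceScore"] < lower) | (df["PerformanceScore"] > upper)])

# Deal with outliers: cap values to the IQR fences instead of removing them.
df["PerformanceScore"] = df["PerformanceScore"].clip(lower, upper)
```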

Laboratory Activities

• Task: Handle Missing Data and Outliers in Excel.
• Steps:
1. Identify Missing Data:
▪ Use Conditional Formatting to highlight missing values.
2. Impute Missing Data:
▪ Apply the AVERAGEIF function to replace missing values.
3. Identify Outliers:
▪ Use Excel's Box and Whisker chart (box plot) to visualize and
detect outliers.
4. Transform or Remove Outliers:
▪ Use IF statements to cap outliers or remove them based on a
defined threshold.

Summary and Assessment

1. Review Key Concepts: Summarize the importance of data collection, quality
assessment, ethical considerations, importing data, cleaning, and handling
missing data and outliers.
2. Quiz: Conduct a short quiz using Excel to test understanding.
3. Assignment: Each student collects a dataset, performs a quality assessment,
cleans the data, handles missing data and outliers, and submits a report detailing
the steps and insights.

Lab Activity 1: Importing and Assessing Data Quality

Dataset: Sales Data (CSV)

Description: This dataset includes sales information with some inconsistencies and
missing values. It helps students practice importing data and performing basic data
quality assessments.

Example Data:

OrderID  Date        Product   Quantity  Price
1001     15/01/2024  Widget A  10        20.5
1002     16/01/2024  Widget B  25
1003     17/01/2024  Widget C  5         15
1004     18/01/2024            10
1005     19/01/2024  Widget A  8         21
1006     20/01/2024  Widget B  12        24
1007                 Widget C  7         16

Lab Activity Instructions:

• Import the CSV file into Excel.
• Assess the quality of the data by calculating summary statistics (mean, median,
mode) for numerical columns.
• Use conditional formatting to highlight missing values and inconsistencies.


Lab Activity 2: Data Cleaning and Preprocessing

Dataset: Customer Feedback (Excel)

Description: This dataset includes customer feedback with various issues such as
extra spaces, inconsistencies in text, and varying formats. It is used for data cleaning
and preprocessing.

Example Data:

CustomerID  Date        Feedback            Rating
1           15/01/2024  Great service!      5
2           16/01/2024  Poor service        2
3           17/01/2024  Average experience  3
4           18/01/2024  Excellent service   4
5           19/01/2024  Great service!      5
6           20/01/2024  Poor service
7           21/01/2024  Excellent service   4

Lab Activity Instructions:

• Open the Excel file and clean the data using functions like TRIM(), CLEAN(), and
SUBSTITUTE().
• Standardize the feedback text (e.g., remove extra spaces, correct
inconsistencies).
• Normalize the rating scale if necessary. (A rough Python equivalent of these
functions is sketched below.)
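For reference, the sketch below is a rough Python equivalent of TRIM(), CLEAN(), and SUBSTITUTE() using pandas string methods; the file name customer_feedback.xlsx is assumed.

```python
import pandas as pd

df = pd.read_excel("customer_feedback.xlsx")  # assumed file name

# TRIM()/CLEAN() equivalent: strip leading/trailing whitespace and
# collapse repeated internal spaces.
df["Feedback"] = df["Feedback"].str.strip().str.replace(r"\s+", " ", regex=True)

# SUBSTITUTE() equivalent: remove stray punctuation so identical
# feedback entries match exactly.
df["Feedback"] = df["Feedback"].str.replace("!", "", regex=False)

df.to_excel("customer_feedback_clean.xlsx", index=False)
```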

Lab Activity 3: Handling Missing Data and Outliers

Dataset: Employee Performance (CSV)

Description: This dataset includes employee performance metrics with some missing
values and potential outliers. Students will practice handling missing data and
identifying outliers.


Example Data:

EmployeeID  Age  Department  PerformanceScore
E001        29   Sales       85
E002        34   HR          78
E003        28   Sales
E004        45   IT          92
E005        31   HR          80
E006        29   Sales       115
E007             IT          88
E008        40   HR          72
E009        35   IT          89
E010        50   Sales       105

Lab Activity Instructions:

• Import the CSV file into Excel.
• Identify missing data and decide on appropriate methods for imputation (mean,
median, etc.).
• Use a boxplot to visualize potential outliers in the PerformanceScore column.
• Handle outliers based on the chosen method (e.g., transformation or removal).
