Chapter II Data Collection and Management
Data collection and management is the process of gathering and organizing data so that it
can be used to answer questions, make decisions, or solve problems. It is an essential part
of many research projects, business operations, and government initiatives. A typical
process involves the following steps:
1. Define the research question or problem. What do you want to learn from the data?
2. Identify the data sources. Where will you get the data?
3. Choose the data collection methods. How will you collect the data?
4. Collect the data. This may involve conducting surveys, interviews, or experiments.
5. Clean and prepare the data. This involves removing errors and inconsistencies
from the data.
6. Store and manage the data. This involves organizing the data so that it can be
easily accessed and analyzed.
7. Analyze the data. This involves using statistical methods to draw insights from the
data.
8. Communicate the results. This involves sharing the findings of the data analysis
with others.
Data management is the ongoing process of organizing, storing, and protecting data. It is
important to ensure that the data is accessible to authorized users, and that it is protected
from unauthorized access, use, or disclosure.
Data collection and management is a complex and challenging process, but it is essential
for organizations to make informed decisions and achieve their goals.
There are many different methods of data collection, each with its own advantages and
disadvantages. Some of the most common methods include:
Surveys: Surveys are a popular way to collect data from a large number of people. They
can be conducted online, by mail, or in person.
Interviews: Interviews are a more in-depth way to collect data from a smaller number of
people. They can be conducted face-to-face, over the phone, or by video chat.
Experiments: Experiments are used to test cause-and-effect relationships. They involve
manipulating one variable and observing the effects on another variable.
Observational studies: Observational studies are used to collect data without interfering
with the subjects being studied. They can be conducted in a natural setting or in a
laboratory.
The best data collection method for a particular project will depend on the research
question, the budget, and the time constraints.
Data collection and management is a critical part of many research projects, business
operations, and government initiatives. By following the steps outlined above,
organizations can ensure that they are collecting and managing data in a way that is
efficient, effective, and compliant.
Data Source
A data source is the location where data originates from. It can be internal or external to
an organization. Internal data sources include:
Transactional data: This is data about the day-to-day operations of an organization, such
as sales, orders, and inventory.
Customer data: This is data about customers, such as demographics, purchase history,
and contact information.
Employee data: This is data about employees, such as compensation, performance
reviews, and training history.
Financial data: This is data about the financial performance of an organization, such as
revenue, expenses, and profits.
External data sources include:
Government data: This is data collected by governments, such as census data and
economic data.
Industry data: This is data collected by industry associations, such as market research
data and pricing data.
Academic data: This is data collected by universities and research institutions, such as
scientific data and medical data.
Social media data: This is data collected from social media platforms, such as user
posts, comments, and likes.
The choice of data source will depend on the specific needs of the organization. For
example, if an organization is trying to understand its customer behavior, it might use
customer data from its CRM system. If an organization is trying to forecast demand, it
might use economic data from a government agency.
When evaluating a data source, consider the following criteria:
Relevance: The data should be relevant to the research question or problem that the
organization is trying to solve.
Accuracy: The data should be accurate and reliable.
Timeliness: The data should be up-to-date.
Accessibility: The data should be easy to access and use.
Cost: The cost of obtaining the data should be reasonable.
Data sources can be classified into two main categories: primary and secondary.
Primary data is data that is collected for the first time. It is typically collected through
surveys, interviews, experiments, or observations.
Secondary data is data that has already been collected and is available for reuse. It can
be found in books, journals, government reports, and online databases.
Primary data is often more accurate and reliable than secondary data, but it can be more
expensive and time-consuming to collect. Secondary data is less expensive and time-
consuming to collect, but it may not be as accurate or reliable as primary data.
The best data source for a particular project will depend on the research question, the
budget, and the time constraints.
Data Collection Using APIs
An application programming interface (API) allows a program to request data directly from
a provider. Common sources of data that can be collected through APIs include:
Web services: Web services are websites that provide data through APIs. For example,
the Google Maps API allows you to get information about geographic locations, such as
their latitude and longitude.
Databases: APIs can be used to access data from databases. For example, the MySQL
API allows you to query MySQL databases.
Sensor devices: APIs can be used to collect data from sensor devices, such as
temperature sensors or GPS sensors.
Social media platforms: APIs can be used to collect data from social media platforms,
such as Twitter or Facebook.
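As a small illustration of the Databases item above, the Python sketch below connects to a
database, runs a query, and fetches the results. It uses the built-in sqlite3 module rather
than MySQL so that it runs without a database server (an assumption made purely to keep the
example self-contained); the connect-query-fetch pattern is the same for other databases.

    # Query a database from a program: connect, run SQL, fetch the results.
    # sqlite3 is used here only so the sketch runs without a server.
    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
    conn.execute("INSERT INTO sales VALUES ('pen', 12), ('notebook', 5)")

    rows = conn.execute("SELECT item, qty FROM sales").fetchall()
    print(rows)  # [('pen', 12), ('notebook', 5)]
    conn.close()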
To collect data using an API, follow these general steps:
Find the API that you want to use. There are many APIs available, so you will need to
do some research to find one that meets your needs.
Get an API key. Most APIs require you to get an API key before you can use them. This
key is used to authenticate your requests and to prevent unauthorized access to the data.
Understand the API documentation. The API documentation will tell you how to use the
API to request and receive data.
Make API requests. Once you understand the API documentation, you can start making
API requests to collect data.
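As a rough sketch of these steps in Python using the requests library, the example below
sends an authenticated request and reads the JSON response. The URL, endpoint, and api_key
parameter are hypothetical placeholders; every real API documents its own base URL,
authentication scheme, and parameters.

    # Sketch of collecting data from a web API; adapt the details to the API you choose.
    import requests

    API_KEY = "your-api-key"  # obtained by registering with the API provider
    URL = "https://api.example.com/v1/records"  # hypothetical endpoint

    response = requests.get(URL, params={"api_key": API_KEY, "limit": 100})
    response.raise_for_status()  # stop if the request was rejected
    records = response.json()  # most web APIs return JSON
    print(len(records), "records collected")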
Data collection using APIs can be a quick and easy way to get the data that you need.
However, it is important to be aware of the limitations of APIs. For example, some APIs
may only allow you to access a limited amount of data, or they may charge a fee for
access.
Overall, APIs can be a valuable tool for data collection. However, it is important to weigh
the benefits and challenges before deciding whether or not to use them.
Data Exploration and Data Fixing
Data exploration is the process of understanding the data by summarizing its main
characteristics and identifying patterns and outliers. It is an important first step in data
analysis, as it helps to ensure that the data is clean and ready for further analysis.
There are many different methods of data exploration, but some of the most common
include:
Data profiling: This involves summarizing the main characteristics of the data, such as
the number of records, the number of variables, the data types, and the distribution of the
values.
Data visualization: This involves using charts and graphs to visualize the data, which
can help to identify patterns and outliers.
Statistical analysis: This involves using statistical tests to identify relationships between
variables.
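The laboratory activities in this course use Excel, but the same profiling ideas can be
sketched quickly in Python with pandas (an assumption, not a required tool); the tiny
dataset below is made up for illustration.

    # Data profiling: record counts, data types, and value distributions.
    import pandas as pd

    df = pd.DataFrame({
        "score": [88, 92, None, 75, 990],  # includes a missing value and an outlier
        "group": ["A", "B", "A", "B", "A"],
    })

    df.info()  # number of records, number of variables, data types
    print(df.describe())  # distribution of the numeric values
    print(df["group"].value_counts())  # distribution of a categorical variable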
Data fixing is the process of identifying and correcting errors in the data. It is an important
step in data preparation, as it ensures that the data is accurate and reliable.
There are many different methods of data fixing, but some of the most common
include:
Data cleaning: This involves removing errors from the data, such as typos, missing
values, and inconsistent values.
Data imputation: This involves filling in missing values with estimates.
Data transformation: This involves converting the data into a different format, such as
converting categorical data into numerical data.
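Again using pandas purely for illustration, the sketch below applies the three fixing
methods above to a made-up table: cleaning inconsistent text, imputing a missing value, and
transforming data types.

    # Data cleaning, imputation, and transformation on a small made-up table.
    import pandas as pd

    df = pd.DataFrame({
        "age": ["25", "31", None, "40"],
        "city": ["Manila", "manila ", "Cebu", "Cebu"],
    })

    df["city"] = df["city"].str.strip().str.title()  # cleaning: fix inconsistent text
    df["age"] = pd.to_numeric(df["age"])  # transformation: text to numbers
    df["age"] = df["age"].fillna(df["age"].median())  # imputation: fill the missing value
    df = pd.get_dummies(df, columns=["city"])  # transformation: categorical to numerical
    print(df)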
Data exploration and data fixing are essential steps in data analysis. By understanding
the data and fixing any errors, you can ensure that your analysis is accurate and reliable.
Here are some of the things to look for when exploring and fixing data:
Missing values: Are there any missing values in the data? If so, how many? And are they
missing randomly or systematically?
Outliers: Are there any outliers in the data? Outliers are data points that are significantly
different from the rest of the data. They can be caused by errors or by legitimate variation
in the data.
Duplicate values: Are there any duplicate values in the data? Duplicate values can occur
when data is entered incorrectly or when two different records refer to the same entity.
Inconsistent values: Are there any inconsistent values in the data? Inconsistent values
are conflicting values recorded for the same variable or entity, such as two different
spellings of the same city. They often arise from data entry errors or from merging records
from different sources.
Incorrect data types: Are there any data points that are stored in the wrong data type?
For example, a date value might be stored as a string.
Corrupt data: Is there any corrupt data in the file? Corrupt data is data that is damaged
or unreadable.
Once you have identified any problems with the data, you can take steps to fix them. For
example, you can remove missing values, impute missing values, or transform data types.
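The checklist above can also be screened programmatically; the pandas sketch below is one
hedged way to do it (the sample values are invented, and outliers are flagged with the
common 1.5 * IQR rule).

    # Screen a small dataset for missing values, duplicates, data types, and outliers.
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 2, 4], "salary": [30000, 32000, 32000, 990000]})

    print(df.isna().sum())  # missing values per column
    print(df.duplicated().sum())  # number of duplicate rows
    print(df.dtypes)  # incorrect data types show up here

    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
    print(outliers)  # rows far outside the typical range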
Data exploration and data fixing are important steps in data analysis. By taking the time
to explore and fix the data, you can ensure that your analysis is accurate and reliable.
Data Storage Management
Organizations must keep in mind how storage management has changed in recent years.
The COVID-19 pandemic accelerated remote work, the use of cloud services, and
cybersecurity concerns such as ransomware. All of those elements were already surging
before the pandemic, and they remain prominent after it.
This section explores what data storage management is, who needs it, its advantages and
challenges, key storage management software features, security and compliance concerns,
implementation tips, and vendors and products.
Storage management ensures data is available to users when they need it.
The data retention policy is a key element of storage management and a good starting
point for implementation. This policy defines the data an organization retains for
operational or compliance needs. It describes why the organization must keep the data,
the retention period and the process of disposal. It helps an organization determine how
it can search and access data. The retention policy is especially important now as data
volumes continually increase, and it can help cut storage space and costs.
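As a rough illustration only, the Python sketch below flags files that fall outside a
hypothetical seven-year retention period. A real retention policy also covers legal holds
and a documented disposal process, which this sketch does not implement; the folder name is
a placeholder.

    # Flag exported files older than a hypothetical retention period for review.
    from pathlib import Path
    from datetime import datetime, timedelta

    RETENTION = timedelta(days=7 * 365)  # assumed seven-year retention period
    cutoff = datetime.now() - RETENTION

    for f in Path("exports").glob("*.csv"):  # hypothetical folder of data exports
        modified = datetime.fromtimestamp(f.stat().st_mtime)
        if modified < cutoff:
            print("eligible for disposal:", f)  # review before actually deleting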
The task of data storage management also includes resource provisioning and
configuration, unstructured and structured data, and evaluating how needs might change
over time.
To help with implementation, a management tool that meets organizational needs can
ease the administrative burden that comes with large amounts of data.
Data storage management has both advantages and challenges. On the plus side, it
improves performance and protects against data loss. With effective management,
storage systems perform well across geographic areas, time and users. It also ensures
that data is safe from outside threats, human error and system failures. Proper backup
and disaster recovery are pieces of this data protection strategy.
An effective management strategy provides users with the right amount of storage
capacity. Organizations can scale storage space up and down as needed. The storage
strategy accommodates constantly changing needs and applications.
Benefits of data storage management include more efficient operations and optimized
resource utilization.
Distributed and complex systems present a hurdle for data storage management. Not
only are workers spread out, but systems run both on premises and in the cloud. An on-
premises storage environment could include HDDs, SSDs and tapes. Organizations often
use multiple clouds. New technologies, such as AI, can benefit organizations but also
increase complexity.
Unstructured data -- which includes documents, emails, photos, videos and metadata --
has surged, and this also complicates storage management. Unstructured
data challenges include volume, new types and how to gain value. Although some
organizations might not want to spend the time to manage unstructured data, in the end
it saves money and storage space. Vendors such as Aparavi, Dell EMC, Pure Storage
and Spectra Logic offer tools for this type of management.
Object storage can provide high performance but also has challenges, such as the
infrastructure's scale-out nature and potentially high latency. Organizations must address
issues with metadata performance and cluster management.
Here are some general methods and services for data storage management:
• storage resource management software
• consolidation of systems
• multiprotocol storage arrays
• storage tiers
• strategic SSD deployment
• hybrid cloud
• scale-out systems
• archive storage of infrequently accessed data
• elimination of inactive virtual machines
• deduplication (see the sketch after this list)
• disaster recovery as a service
• object storage
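To make one of these methods concrete, the sketch below performs simple file-level
deduplication by content hash in Python; the data folder name is a hypothetical example, and
production deduplication systems typically work at the block level rather than whole files.

    # Find duplicate files by hashing their contents.
    import hashlib
    from pathlib import Path

    seen = {}
    for f in Path("data").rglob("*"):  # hypothetical folder to scan
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            if digest in seen:
                print(f, "duplicates", seen[digest])  # candidate for removal or a link
            else:
                seen[digest] = f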
In terms of media type, it is tempting to go all-flash because of its performance. However,
to save money, consider a hybrid drive option that combines high-capacity HDD and
high-speed SSD technology.
Organizations also must choose among object, block and file storage. Block storage is
the default type for HDDs and SSDs, and it provides strong performance. File storage
places files in folders and offers simplicity. Object storage efficiently organizes
unstructured data at a comparatively low cost. NAS is another worthwhile option for
storing unstructured data because of its organizational capabilities and speed.
Storage security
With threats both internal and external, storage security is as important as ever to a
management strategy. Storage security ensures protection and availability by enabling
data accessibility for authorized users and protecting against unauthorized access.
A storage security strategy should have tiers. Security risks are so varied, from
ransomware to insider threats, that organizations must protect their data storage in a
number of ways. Proper permissions, monitoring and encryption are key to cyberthreat
defense.
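As one small illustration of encryption at rest, the sketch below uses the Fernet interface
of the Python cryptography package (an assumption; any vetted encryption library or
storage-level encryption feature serves the same role).

    # Encrypt and decrypt data with a symmetric key.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # in practice, keep this key in a secrets manager
    cipher = Fernet(key)

    ciphertext = cipher.encrypt(b"quarterly sales figures")
    print(cipher.decrypt(ciphertext))  # only holders of the key can read the data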
Offline storage -- for example, in tape backup -- that isn't connected to a network is a
strong way to keep data safe. If attackers can't reach the data, they can't harm it. While
it's not feasible to keep all data offline, this type of storage is an important aspect of a
strong storage security strategy.
Another aspect is off-site storage, one form of which is cloud storage. Organizations
shouldn't assume that this keeps their data entirely safe. Users are responsible for their
data, and cloud storage is still online and thus open to some risk.
The surge in remote workers produced a new level of storage security complications,
including the following risks:
• less secure home office environments;
• use of personal devices for work;
• misuse of services and applications;
• less formal work habits;
• adjustments to working from home; and
• more opportunities for malicious insiders.
Endpoint security, encryption, access controls and user training help protect against these
new storage security issues.
Storage compliance
Regulations such as the General Data Protection Regulation (GDPR) can spur enterprises into
adopting practices that deliver long-term competitive advantages.
Data storage management helps organizations understand where they have data, which
is a major piece of compliance. Compliance best practices include documentation,
automation, anonymization and use of governance tools.
Immutable data storage also helps achieve compliance. Immutability ensures retained
data -- for example, legal holds -- doesn't change. Vendors such as AWS, Dell EMC and
Wasabi provide immutable storage. However, organizations should still retain more than
one copy of this data, as immutability doesn't protect against physical threats, such as
natural disasters.
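A hedged sketch of writing an immutable (write-once) object with the AWS boto3 SDK is shown
below; it assumes AWS credentials are configured and that the bucket was created with S3
Object Lock enabled, and the bucket and key names are made up.

    # Store an object that cannot be altered or deleted until the retention date passes.
    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="example-compliance-bucket",  # hypothetical bucket with Object Lock enabled
        Key="legal-hold/report-2024.pdf",
        Body=b"report contents",
        ObjectLockMode="COMPLIANCE",  # retention cannot be shortened or removed
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )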
Recent trends among vendors include services for cloud storage and the container
management platform Kubernetes. Top storage providers can support a range of different
platforms. And though Kubernetes is more specialized, it has gained traction: Vendors
such as Diamanti, NetApp and Pure Storage provide Kubernetes services.
Some form of cloud management is essentially table stakes for storage vendors. A few
vendors, including Cohesity and Rubrik, have made cloud data management a hallmark
of their platforms. Many organizations use more than one cloud, so multi-cloud data
management is crucial. Managing data storage across multiple clouds is complex, but
vendors such as Ctera, Dell EMC, NetApp and Nutanix can help.
In addition, admins must be aware of other new and emerging technologies that help
storage management, from automation to machine learning.
Lesson Overview
Lecture Outline
Each topic is covered with a theoretical lecture followed by practical laboratory activities.
The activities focus on applying the concepts in Microsoft Excel, allowing students to
work with datasets related to their fields (Mathematics or Social Science).
Theoretical Lecture
• Objective: Understand why data collection and preparation are crucial in data
science.
• Key Points:
o Foundation of Data Science: Data collection is the first step in data
analysis; the quality of collected data determines the reliability of results.
o Minimizing Errors: Proper preparation helps identify and correct errors
early in the analysis process.
o Efficiency: Well-organized data saves time during the analysis stage.
o Consistency: Consistent data allows for accurate comparisons across
different datasets or studies.
Theoretical Lecture
• Objective: Learn how to import data from different formats into Excel.
• Key Points:
o Data Formats: Understanding common formats such as CSV, JSON, and
Excel workbooks.
o Data Import Techniques: Using Excel's Get Data feature for importing
data from various sources.
o Combining Data: Methods for combining multiple datasets into one using
Excel tools like Power Query.
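The laboratory itself uses Excel's Get Data feature and Power Query, but for comparison the
equivalent import-and-combine flow in Python pandas might look like the sketch below; the
file names are hypothetical placeholders.

    # Import data from several formats and combine the tables into one.
    import pandas as pd

    sales_csv = pd.read_csv("sales_jan.csv")  # CSV file
    sales_json = pd.read_json("sales_feb.json")  # JSON file
    sales_xlsx = pd.read_excel("sales_mar.xlsx")  # Excel workbook

    combined = pd.concat([sales_csv, sales_json, sales_xlsx], ignore_index=True)
    print(combined.head())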
Laboratory Activities
Description: This dataset includes sales information with some inconsistencies and
missing values. It helps students practice importing data and performing basic data
quality assessments.
Example Data:
Description: This dataset includes customer feedback with various issues such as
extra spaces, inconsistencies in text, and varying formats. It is used for data cleaning
and preprocessing.
Example Data:
3 | 17/01/2024 | Average experience | 3
• Open the Excel file and clean the data using functions like TRIM(), CLEAN(), and
SUBSTITUTE().
• Standardize the feedback text (e.g., remove extra spaces, correct
inconsistencies).
• Normalize the rating scale if necessary.
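For reference, the same TRIM()/SUBSTITUTE()-style cleanup can be sketched in Python pandas
as shown below; the laboratory activity itself is done in Excel, and the sample rows are
invented.

    # Standardize feedback text and normalize a 1-5 rating to a 0-1 scale.
    import pandas as pd

    fb = pd.DataFrame({
        "Feedback": ["  Average   experience ", "GREAT service!!", "ok"],
        "Rating": [3, 5, 4],
    })

    fb["Feedback"] = (fb["Feedback"]
                      .str.strip()  # like TRIM(): drop leading/trailing spaces
                      .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
                      .str.lower())  # standardize case
    fb["Rating_0_1"] = (fb["Rating"] - 1) / 4  # normalize the 1-5 scale to 0-1
    print(fb)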
Description: This dataset includes employee performance metrics with some missing
values and potential outliers. Students will practice handling missing data and
identifying outliers.
Example Data: