DSV-S7 Data Collection and Data Pre Processing Overview

This document provides an overview of data collection and preprocessing in the context of data science and visualization, emphasizing their importance in deriving meaningful insights from raw data. It outlines various data collection strategies, including defining objectives, assessing data quality, and ensuring data privacy, as well as detailing preprocessing steps like data cleaning, integration, transformation, and reduction. The document also addresses data security issues and ethical considerations in data handling.

Uploaded by

1730303sivakartheek


Department of AI&DS

COURSE NAME: DATA SCIENCE AND VISUALIZATION

COURSE CODE: 22AD3206A
Topic: Data Collection and Data Preprocessing Overview

Session - 07

AIM OF THE SESSION
The primary aim of data analytics and visualization is to transform raw data into meaningful insights.

INSTRUCTIONAL OBJECTIVES

This Session is designed to discuss

1. Data Collection Strategies
2. Data Pre-Processing Overview

LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Understand concepts related to data analytics, such as data types, variables, and data structures.
2. Understand different types of data (e.g., structured, unstructured) and their significance in analytics.
3. Analyse data and draw meaningful insights.
4. Understand probability theory and its application in data analysis.

Data Collection Strategies

• Data collection is a fundamental step in the data analytics and visualization process.

• The quality and relevance of the collected data significantly impact the insights and decisions derived from the analysis.

• Effective data collection and visualization strategies are essential for extracting valuable insights and empowering data-driven decision-making.

• It is a dynamic process that requires continuous refinement based on user feedback and changing business needs.

Data Collection Strategies in the Context of Data Analytics and Visualization

1) Define Clear Objectives

2) Identify Relevant Data Sources
3) Data Quality Assessment
4) Consider Structured and Unstructured Data
5) Real-time Data Collection
6) Data Privacy and Ethics
7) Sampling Techniques
8) Surveys and Questionnaires
9) Collaboration with Stakeholders
10) Data Integration


1) Define Clear Objectives:

• Clearly outline the goals and objectives of your data analytics and visualization project.
• Identify the questions you want to answer; the insights you aim to derive will guide your data collection efforts.

2) Identify Relevant Data Sources:

• Determine the sources of data that are relevant to your objectives.
• These can include databases, spreadsheets, APIs, external datasets, or a combination of these.
• Determine the key performance indicators (KPIs) and metrics relevant to your analysis and visualization goals.
• These metrics will drive the selection of data sources and variables.

3) Data Quality Assessment:
• Assess the quality of available data. Check for completeness, accuracy, consistency, and relevance.
• Cleaning and pre-processing may be necessary to address any issues.

4) Consider Structured and Unstructured Data:
• Depending on your objectives, collect both structured data (e.g., databases) and unstructured data (e.g., text, images) for a more comprehensive analysis.

5) Real-time Data Collection:
• If your analysis requires real-time insights, consider implementing systems for collecting and processing data in real time.
• This is especially important for dynamic datasets.
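
The completeness and consistency checks mentioned under Data Quality Assessment can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the field names (id, age, city), and the helper names completeness and duplicate_ids are assumptions made for the example, not part of the session material.

```python
from typing import Any


def completeness(rows: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Return, for each field, the fraction of rows holding a non-missing value."""
    total = len(rows)
    report = {}
    for field in fields:
        # Treat None and the empty string as "missing" (an assumption of this sketch).
        present = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = present / total if total else 0.0
    return report


def duplicate_ids(rows: list[dict[str, Any]], key: str) -> set:
    """Return key values that occur more than once (a simple consistency check)."""
    seen, dupes = set(), set()
    for row in rows:
        value = row.get(key)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


# Hypothetical sample records used only for illustration.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 2, "age": 29, "city": ""},
    {"id": 3, "age": 41, "city": "Chennai"},
]

quality = completeness(records, ["id", "age", "city"])
dupes = duplicate_ids(records, "id")
print(quality)
print(dupes)
```

A low completeness score or a non-empty set of duplicate keys signals that cleaning is needed before analysis.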

6) Data Privacy and Ethics:
• Ensure compliance with data privacy regulations.
• Obtain necessary permissions for data collection, especially when dealing with personal or sensitive information.

7) Sampling Techniques:
• Use sampling methods if working with large datasets.
• This involves selecting a representative subset of data for analysis, which can save time and resources.

8) Surveys and Questionnaires:
• Design and deploy surveys or questionnaires to gather specific information directly from users or relevant stakeholders.
• Ensure that the questions align with your objectives.
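
As an illustration of the sampling techniques in point 7, the sketch below draws a simple random sample and a stratified sample using only the Python standard library. The population, the segment field, and the 10% sampling fraction are hypothetical values chosen for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Pick n rows uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(rows, n)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so small groups are still represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 rows in segment A, 20 rows in segment B.
population = [{"id": i, "segment": "A" if i < 80 else "B"} for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, "segment", 0.1)
print(len(srs), len(strat))
```

Stratified sampling preserves the 80/20 split between segments in the sample, whereas a simple random sample only does so on average.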


9) Collaboration with Stakeholders:

• Collaborate with domain experts and stakeholders to gain insights into the context of the data.
• Their input can help refine data collection strategies.

10) Data Integration:
• Integrate data from different sources to create a unified dataset.
• Ensure compatibility and consistency when combining data from various platforms.

Data Security Issues

Data security is a critical concern in the field of data analysis and visualization. As organizations
collect and analyze large volumes of data to gain insights and make informed decisions, they also
face significant challenges related to the security and privacy of this data.
1. Data Breaches
2. Data Privacy
3. Data Access Control
4. Data Encryption
5. Data Integrity
6. Secure Data Sharing
7. Data Masking and Redaction
8. Compliance with Regulations
9. Awareness and Training
10. Data Lifecycle Management


1) Data Breaches:
• One of the most significant concerns is the potential for data breaches.
• If unauthorized individuals gain access to sensitive data, it can lead to financial losses, reputational damage, and legal consequences for organizations.

2) Data Privacy:
• Protecting the privacy of individuals is crucial, especially when dealing with personally identifiable information (PII).
• Analyzing and visualizing data while preserving privacy is a complex task.
• Techniques such as anonymization and differential privacy are used to mitigate these concerns.
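
One simple privacy-preserving step can be sketched as follows: direct identifiers are dropped and the linking ID is replaced with a keyed hash. Note that this is pseudonymization, which is weaker than full anonymization or differential privacy. The record fields, the key value, and the helper names pseudonymize and strip_pii are assumptions for illustration only.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def strip_pii(record: dict, direct_identifiers: set, link_field: str) -> dict:
    """Drop direct identifiers and pseudonymize the field used to link records."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned[link_field] = pseudonymize(str(record[link_field]))
    return cleaned


# Hypothetical record used only for illustration.
patient = {"patient_id": "P-1001", "name": "A. Kumar", "phone": "555-0100", "age": 47}
safe = strip_pii(patient, {"name", "phone"}, "patient_id")
print(safe)
```

The same input always maps to the same pseudonym, so analysis across tables still works, but anyone holding the key can recompute the mapping.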


3) Data Access Control:

• Organizations need to implement strict access controls to ensure that only authorized personnel can access specific datasets.
• Role-based access control (RBAC) and other access management protocols help in regulating who can view, edit, or analyze sensitive data.

4) Data Encryption:
• Data should be encrypted both in transit and at rest.
• Encryption ensures that even if data is intercepted or the storage media is compromised, the data remains unreadable without the proper decryption keys.
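
The idea behind role-based access control can be sketched as a lookup from roles to permitted actions. The role names, the actions (view, analyze, edit), and the dataset name below are hypothetical; real systems usually delegate this to a database or identity platform.

```python
# Hypothetical role-to-permission mapping for illustration.
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "analyst": {"view", "analyze"},
    "admin": {"view", "analyze", "edit"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role is known and grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def access_dataset(role: str, action: str, dataset: str) -> str:
    """Raise PermissionError unless the role may perform the action."""
    if not is_allowed(role, action):
        raise PermissionError(f"role '{role}' may not {action} {dataset}")
    return f"{action} granted on {dataset}"


print(access_dataset("analyst", "analyze", "sales_2024"))
print(is_allowed("viewer", "edit"))
```

Unknown roles receive no permissions by default, which follows the principle of denying access unless it is explicitly granted.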


5) Data Integrity:
• Ensuring the integrity of data is essential.
• Data should not be tampered with during analysis or visualization processes.
• Implementing checksums and digital signatures can help detect unauthorized changes to data.

6) Secure Data Sharing:
• Organizations often need to share data with external partners or third-party vendors.
• Secure data sharing mechanisms, such as secure FTP, secure APIs, or blockchain technology, can help in ensuring the safe transfer of data.
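
A minimal sketch of the checksum idea from point 5 is shown below, using SHA-256 from the Python standard library. The second helper uses an HMAC, which is a keyed integrity tag shared between two parties; it is a stand-in here and not a true digital signature, which would require asymmetric keys. The sample data and key are hypothetical.

```python
import hashlib
import hmac


def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sign(data: bytes, key: bytes) -> str:
    """Return a keyed integrity tag (HMAC-SHA256) for the data."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, expected: str) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(data, key), expected)


# Hypothetical dataset and shared key used only for illustration.
original = b"region,sales\nnorth,120\nsouth,95\n"
key = b"shared-secret"

checksum = sha256_checksum(original)
tag = sign(original, key)

tampered = original.replace(b"95", b"950")
print(sha256_checksum(tampered) == checksum)
print(verify(original, key, tag), verify(tampered, key, tag))
```

Any change to the data, however small, produces a different checksum, so comparing checksums before and after a transfer reveals tampering or corruption.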


7) Data Masking and Redaction:

• In situations where sharing data is necessary, techniques like data masking and redaction can be employed.
• This involves replacing, encrypting, or removing sensitive information to protect privacy while still allowing analysis and visualization on a subset of the data.

8) Compliance with Regulations:
• Organizations must comply with data protection regulations like GDPR (General Data Protection Regulation) in the European Union or HIPAA (Health Insurance Portability and Accountability Act) in the United States.
• Non-compliance can result in hefty fines and legal consequences.
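
Masking and redaction can be sketched with a few small helpers. The masking rules (keep the last four card digits, keep the first letter of an email) and the field names are assumptions chosen for the example; actual rules depend on the applicable policy or regulation.

```python
import re


def mask_card(number: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = re.sub(r"\D", "", number)  # drop spaces and dashes
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def redact(record: dict, fields: set) -> dict:
    """Replace the listed fields with a fixed placeholder."""
    return {k: ("[REDACTED]" if k in fields else v) for k, v in record.items()}


# Hypothetical customer record used only for illustration.
customer = {"name": "Priya S", "email": "priya@example.com", "city": "Hyderabad"}
print(mask_card("4111 1111 1111 1234"))
print(mask_email(customer["email"]))
print(redact(customer, {"name"}))
```

Masking keeps part of the value so it remains useful for support or matching, while redaction removes the value entirely.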


9) Awareness and Training:

• Human error is a common cause of data breaches.
• Regular training and awareness programs help prevent accidental disclosures and ensure that employees understand their roles and responsibilities in maintaining data security.

10) Data Lifecycle Management:

• Proper management of data throughout its lifecycle, including secure storage, archival, and deletion when it is no longer needed, is crucial.
• Unused or outdated data can become a security risk if not managed appropriately.

Data Pre-Processing Overview in DAV

• Effective data collection is the foundation of meaningful data analytics and visualization.

• Data pre-processing is a crucial step in the data analytics and visualization process.

• It involves cleaning, transforming, and organizing raw data into a format that can be effectively utilized for analysis and visualization.


1) Data Cleaning:
• Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set or database table.
• The cleaning step identifies incomplete, inaccurate, inconsistent, and irrelevant data and then applies techniques to modify or delete it.
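
Two common cleaning steps, removing duplicate records and filling missing values with the column median, can be sketched as follows. The field names (id, income) and the helper name clean are hypothetical, and the sketch assumes the numeric column has at least one known value.

```python
from statistics import median


def clean(rows: list[dict], numeric_field: str, key: str) -> list[dict]:
    """Drop rows with a repeated key, then fill missing numbers with the median."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # keep only the first occurrence of each key
        seen.add(row[key])
        unique.append(dict(row))  # copy so the input list is not modified

    # Median of the known values (assumes at least one value is present).
    known = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = median(known)
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill
    return unique


# Hypothetical raw records: one duplicate id and one missing income.
raw = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
    {"id": 4, "income": 50000},
]
cleaned = clean(raw, "income", "id")
print(cleaned)
```

The median is often preferred over the mean for imputation because it is less affected by extreme values.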


2) Data Integration:
• Data integration focuses on unifying data that resides in different sources and presenting a single, unified view of it.
• Data with different representations is brought together, and any resulting conflicts are resolved.
• This process is vital in many scientific and commercial applications. As data volumes grow rapidly, integration becomes even more significant.
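
The sketch below illustrates integration of two sources that describe the same customers with different field names and units. The source layouts, the field names, and the rule that source A wins when both sources contain the same customer are all assumptions made for the example.

```python
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def normalize_source_a(row: dict) -> dict:
    """Map source A's field names onto the unified schema."""
    return {
        "customer_id": row["cust_id"],
        "name": row["name"].strip().title(),
        "weight_kg": row["weight_kg"],
    }


def normalize_source_b(row: dict) -> dict:
    """Map source B's field names and convert pounds to kilograms."""
    return {
        "customer_id": row["CustomerID"],
        "name": row["full_name"].strip().title(),
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 1),
    }


def integrate(source_a: list[dict], source_b: list[dict]) -> list[dict]:
    """Build one record per customer; source A wins on conflict."""
    unified = {}
    for row in map(normalize_source_a, source_a):
        unified[row["customer_id"]] = row
    for row in map(normalize_source_b, source_b):
        unified.setdefault(row["customer_id"], row)
    return [unified[k] for k in sorted(unified)]


# Hypothetical source data used only for illustration.
a = [{"cust_id": 1, "name": " asha rao ", "weight_kg": 61.0}]
b = [
    {"CustomerID": 1, "full_name": "Asha Rao", "weight_lb": 140.0},
    {"CustomerID": 2, "full_name": "RAVI IYER", "weight_lb": 154.0},
]
merged = integrate(a, b)
print(merged)
```

Choosing which source wins on conflict is a design decision; other reasonable rules include preferring the most recently updated record.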


3) Data Transformation:
• Data transformation converts unprocessed data into a form that is easier to understand and analyze.
• It consists of data normalization, aggregation, and generalization.
• Data normalization rescales attribute values to a consistent range so that variables measured on different scales can be compared. (In database design, normalization instead means arranging columns and tables so that redundancy is minimal, which cuts processing time and complexity.)
• Data aggregation creates a brief summary of the data for a faster overview.
• Data generalization, also known as rolling up data, replaces detailed values with higher-level ones and creates successive layers of summary in an evaluation database.
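
The three transformation steps can be sketched on a tiny dataset. Here normalization is shown as min-max scaling to the range 0 to 1, which is one common meaning of the term in preprocessing. The sales records, the city-to-state hierarchy, and the helper names are hypothetical.

```python
from collections import defaultdict


def min_max(values: list[float]) -> list[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero for constant columns
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(rows: list[dict], group_key: str, value_key: str) -> dict:
    """Sum value_key within each group defined by group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Hypothetical sales records used only for illustration.
sales = [
    {"city": "Pune", "state": "MH", "amount": 100.0},
    {"city": "Mumbai", "state": "MH", "amount": 300.0},
    {"city": "Chennai", "state": "TN", "amount": 200.0},
]

scaled = min_max([r["amount"] for r in sales])   # normalization
by_city = aggregate(sales, "city", "amount")     # aggregation
by_state = aggregate(sales, "state", "amount")   # generalization: roll up city to state
print(scaled, by_city, by_state)
```

Rolling up from city to state gives a coarser summary of the same data, which is what the successive layers of summary refer to.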

4) Data Reduction:
• Data reduction is the process of transforming digital information into an ordered, simplified form.
• The data being reduced is generally derived through empirical and experimental means.
• It involves reducing large amounts of data into smaller, meaningful parts.


5) Data Discretization:
• Data discretization is useful when you have a large amount of numeric data but want to classify it using nominal (categorical) values.
• The continuous data is split into discrete intervals, and each interval is assigned a nominal label. In other words, discretization converts continuous attributes into a finite set of intervals with minimal loss of information.
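
One simple way to discretize is equal-width binning, sketched below. The ages, the number of bins, and the labels are hypothetical, and other schemes (for example equal-frequency bins) are equally valid.

```python
def equal_width_bins(values: list[float], n_bins: int, labels: list[str]) -> list[str]:
    """Assign each value to one of n_bins equally wide intervals."""
    if len(labels) != n_bins:
        raise ValueError("need exactly one label per bin")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    result = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land in bin n_bins, so clamp it to the last bin.
            index = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[index])
    return result


# Hypothetical ages used only for illustration.
ages = [18, 25, 33, 47, 52, 60, 78]
groups = equal_width_bins(ages, 3, ["young", "middle", "senior"])
print(groups)
```

With a minimum of 18 and a maximum of 78, each of the three bins spans 20 years: 18 to 38, 38 to 58, and 58 to 78.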

Summary

Data collection and preprocessing are critical steps in the data analysis process,
laying the foundation for accurate and reliable insights derived from the data.
These steps involve identifying, acquiring, cleaning, and transforming data to
make it suitable for analysis and modelling. Properly executed data collection and
preprocessing enhance the quality and effectiveness of downstream analyses and
machine learning tasks.

SELF-ASSESSMENT QUESTIONS

1. Have I detected and dealt with outliers in the data?
2. Did I use appropriate methods such as visualization, statistical tests, or transformation techniques?
3. Have I normalized the data to ensure consistent ranges for variables?
4. Did I consider the impact of normalization on the performance of different algorithms?
5. Have I split the dataset into training and testing sets to assess model generalization?
6. Did I consider stratified sampling, especially for imbalanced datasets?
7. Why is transactional data important for businesses, and how can it be used for analysis?
8. What measures would you take to ensure the security and privacy of transactional data?

Summary

a) Data Collection: The primary purpose of data collection is to gather raw data for analysis
and decision-making in data science and analytics.

b) Data Sources: Structured data, such as Excel spreadsheets, is organized into a predefined
format, making it suitable for analysis.

c) Data Cleaning: Techniques like replacing missing values with the median of the column
are used to handle missing data in datasets.

d) Feature Engineering: Feature engineering involves creating new features from existing
data to enhance model performance and analysis.

TERMINAL QUESTIONS

• What are the main objectives of data collection in the context of data science and analytics?
• How do you differentiate between structured and unstructured data? Provide examples of each.
• Explain the importance of handling missing values in a dataset during data preprocessing.
• What is feature engineering, and why is it a crucial step in data preprocessing?
• Describe the process of data integration and its significance in data analysis.
• Why are ethical considerations important in data collection and preprocessing? Provide examples of ethical dilemmas in data handling.
• What are the common techniques used for data normalization in data preprocessing? How does normalization improve data analysis?
• Discuss the role of data quality assurance in ensuring reliable and accurate data for analysis.

REFERENCES FOR FURTHER LEARNING OF THE SESSION

Reference Books:
1. Paulraj Ponniah, Data Modeling Fundamentals: A Practical Guide for IT Professionals.

Sites and Web links:
2. http://www.cs.toronto.edu/~sme/CSC340F/slides/11-objects.pdf

THANK YOU

Team – DAV
