DSV-S7 Data Collection and Data Pre Processing Overview

This document provides an overview of data collection and preprocessing in the context of data science and visualization, emphasizing their importance in deriving meaningful insights from raw data. It outlines various data collection strategies, including defining objectives, assessing data quality, and ensuring data privacy, as well as detailing preprocessing steps like data cleaning, integration, transformation, and reduction. The document also addresses data security issues and ethical considerations in data handling.

Department of AI&DS

COURSE NAME: DATA SCIENCE AND VISUALIZATION


COURSE CODE: 22AD3206A
Topic: Data Collection and Data Preprocessing Overview

Session - 07

AIM OF THE SESSION
The primary aim of data analytics and visualization is to transform raw data into meaningful insights.

INSTRUCTIONAL OBJECTIVES

This session is designed to discuss:
1. Data Collection Strategies
2. Data Pre-Processing Overview

LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Understand concepts related to data analytics, such as data types, variables, and data structures.
2. Understand different types of data (e.g., structured, unstructured) and their significance in analytics.
3. Analyse data and draw meaningful insights.
4. Understand probability theory and its application in data analysis.
Data Collection Strategies

• Data collection is a fundamental step in the data analytics and visualization process.

• The quality and relevance of the collected data significantly impact the insights and decisions derived from the analysis.

• Effective data collection and visualization strategies are essential for extracting valuable insights and empowering data-driven decision-making.

• It's a dynamic process that requires continuous refinement based on user feedback and changing business needs.

Data Collection Strategies in the Context of Data Analytics and Visualization

1) Define Clear Objectives
2) Identify Relevant Data Sources
3) Data Quality Assessment
4) Consider Structured and Unstructured Data
5) Real-time Data Collection
6) Data Privacy and Ethics
7) Sampling Techniques
8) Surveys and Questionnaires
9) Collaboration with Stakeholders
10) Data Integration

1) Define Clear Objectives:
• Clearly outline the goals and objectives of your data analytics and visualization project.
• Identify the questions you want to answer; the insights you aim to derive will guide your data collection efforts.

2) Identify Relevant Data Sources:
• Determine the sources of data that are relevant to your objectives.
• These can include databases, spreadsheets, APIs, external datasets, or a combination of these.
• Determine the key performance indicators (KPIs) and metrics relevant to your analysis and visualization goals.
• These metrics will drive the selection of data sources and variables.

3) Data Quality Assessment:
• Assess the quality of available data. Check for completeness, accuracy, consistency, and relevance.
• Cleaning and pre-processing may be necessary to address any issues.
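
As a rough illustration of the completeness and consistency checks in item 3, the Python sketch below reports the share of non-empty values per field and the number of exact duplicate records. The function name quality_report and the sample records are hypothetical, invented for this example, and the checks shown are only two of many possible quality measures.

```python
from collections import Counter


def quality_report(rows, required_fields):
    """Summarise completeness and duplication for a list of dict records."""
    total = len(rows)
    # Completeness: share of records with a non-empty value for each field.
    completeness = {}
    for field in required_fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0
    # Consistency: count records that repeat an earlier record exactly.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {"rows": total, "completeness": completeness, "duplicates": duplicates}


# Hypothetical sample records (assumption: not taken from the session material).
records = [
    {"id": 1, "city": "Pune", "age": 34},
    {"id": 2, "city": "", "age": 29},
    {"id": 3, "city": "Delhi", "age": None},
    {"id": 1, "city": "Pune", "age": 34},
]
report = quality_report(records, ["id", "city", "age"])
```

A report like this only flags problems; deciding whether to fix, fill, or drop the affected records is part of the cleaning step.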

4) Consider Structured and Unstructured Data:
• Depending on your objectives, collect both structured data (e.g., databases) and unstructured data (e.g., text, images) for a more comprehensive analysis.
5) Real-time Data Collection:
• If your analysis requires real-time insights, consider implementing systems for collecting and processing data in real time.
• This is especially important for datasets that change continuously.

6) Data Privacy and Ethics:
• Ensure compliance with data privacy regulations.
• Obtain necessary permissions for data collection, especially when dealing with personal or sensitive information.
7) Sampling Techniques:
• Use sampling methods if working with large datasets.
• This involves selecting a representative subset of data for analysis, which can save time and resources.
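
The sampling idea in item 7 can be sketched in Python as follows. This is a minimal example under stated assumptions: the function names, the fixed seed, and the synthetic population with a "segment" field are all hypothetical. It shows simple random sampling and a stratified variant that keeps each group represented in proportion to its size.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, k, seed=0):
    """Pick k records uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, k)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per group
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 150 records in segment "A", 50 in segment "B".
population = [{"id": i, "segment": "A" if i % 4 else "B"} for i in range(200)]
subset = stratified_sample(population, "segment", 0.1)
```

Stratifying is worth considering when some groups are small, because a simple random sample can under-represent them by chance.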
8) Surveys and Questionnaires:
• Design and deploy surveys or questionnaires to gather specific information directly from users or relevant stakeholders.
• Ensure that the questions align with your objectives.

9) Collaboration with Stakeholders:
• Collaborate with domain experts and stakeholders to gain insights into the context of the data.
• Their input can help refine data collection strategies.
10) Data Integration:
• Integrate data from different sources to create a unified dataset.
• Ensure compatibility and consistency when combining data from various platforms.

Data Security Issues

Data security is a critical concern in the field of data analysis and visualization. As organizations
collect and analyze large volumes of data to gain insights and make informed decisions, they also
face significant challenges related to the security and privacy of this data.
1. Data Breaches
2. Data Privacy
3. Data Access Control
4. Data Encryption
5. Data Integrity
6. Secure Data Sharing
7. Data Masking and Redaction
8. Compliance with Regulations
9. Awareness and Training
10. Data Lifecycle Management

1) Data Breaches:
• One of the most significant concerns is the potential for data breaches.
• If unauthorized individuals gain access to sensitive data, it can lead to financial losses, reputational damage, and legal consequences for organizations.
2) Data Privacy:
• Protecting the privacy of individuals is crucial, especially when dealing with personally identifiable information (PII).
• Analyzing and visualizing data while preserving privacy is a complex task.
• Techniques such as anonymization and differential privacy are used to mitigate these concerns.
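
To make the mention of differential privacy more concrete, here is a minimal Python sketch of the Laplace mechanism applied to a counting query. The function names, the epsilon value, and the list of ages are assumptions made for illustration; a production system would need careful privacy budgeting that this sketch does not attempt.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1),
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical data: one age per person, invented for this example.
ages = list(range(18, 90))
rng = random.Random(42)
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives answers closer to the true count.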

3) Data Access Control:
• Organizations need to implement strict access controls to ensure that only authorized personnel can access specific datasets.
• Role-based access control (RBAC) and other access management protocols help in regulating who can view, edit, or analyze sensitive data.
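
A minimal sketch of role-based access control, assuming three hypothetical roles and the three actions named above (view, edit, analyze). The role names and the permission table are illustrative only and do not come from any particular product.

```python
# Hypothetical role-to-permission table (assumption for illustration).
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "analyst": {"view", "analyze"},
    "admin": {"view", "analyze", "edit"},
}


def is_allowed(role, action):
    """Return True if the role grants the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


def require(role, action):
    """Raise PermissionError unless the role grants the action."""
    if not is_allowed(role, action):
        raise PermissionError(f"role {role!r} may not {action} this dataset")
```

Calling require("viewer", "edit") raises PermissionError, while require("admin", "edit") returns without error. Denying unknown roles by default keeps the check safe when a role is misspelled or missing.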
4) Data Encryption:
• Data should be encrypted both in transit and at rest.
• Encryption ensures that even if data is intercepted or the storage media is compromised, the data remains unreadable without the proper decryption keys.

5) Data Integrity:
• Ensuring the integrity of data is essential.
• Data should not be tampered with during analysis or visualization processes. Implementing checksums and digital signatures can help detect unauthorized changes to data.
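
The checksum idea can be sketched with Python's standard hashlib and hmac modules. Note the assumption: the keyed HMAC tag below is a message authentication code used as a simple stand-in for a digital signature, because public-key signatures need a third-party library. The sample bytes and the key are hypothetical.

```python
import hashlib
import hmac


def checksum(data: bytes) -> str:
    """SHA-256 digest: detects accidental or unkeyed changes to the data."""
    return hashlib.sha256(data).hexdigest()


def sign(data: bytes, key: bytes) -> str:
    """Keyed HMAC-SHA256 tag: only holders of the key can produce it."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(sign(data, key), tag)


# Hypothetical dataset bytes and secret key, invented for this example.
original = b"id,amount\n1,100\n2,250\n"
tampered = b"id,amount\n1,100\n2,950\n"
key = b"example-secret-key"
tag = sign(original, key)
```

Here verify(original, key, tag) is true and verify(tampered, key, tag) is false, so a changed file is detected before it is analyzed.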
6) Secure Data Sharing:
• Organizations often need to share data with external partners or third-party vendors.
• Secure data sharing mechanisms, such as secure FTP, secure APIs, or blockchain technology, can help in ensuring the safe transfer of data.

7) Data Masking and Redaction:
• In situations where sharing data is necessary, techniques like data masking and redaction can be employed.
• This involves replacing, encrypting, or removing sensitive information to protect privacy while still allowing analysis and visualization on a subset of the data.
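
As an illustration of masking (replacing characters) and redaction (removing fields), here is a short Python sketch. The helper names, the choice to keep the first letter of an e-mail address and the last four digits of a number, and the sample customer record are all assumptions made for this example.

```python
def mask_email(email):
    """Keep the first character of the local part and mask the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


def mask_digits(value, visible=4):
    """Replace all but the last `visible` digits with '*', keeping separators."""
    to_mask = max(0, sum(ch.isdigit() for ch in value) - visible)
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append("*")
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def redact(record, sensitive_fields):
    """Return a copy of the record without the sensitive fields."""
    return {k: v for k, v in record.items() if k not in sensitive_fields}


# Hypothetical customer record, invented for this example.
customer = {
    "name": "Asha",
    "email": "asha@example.com",
    "card": "4111-1111-1111-1234",
    "phone": "555-0100",
}
shared = redact(
    {**customer, "email": mask_email(customer["email"]), "card": mask_digits(customer["card"])},
    {"phone"},
)
```

The shared copy still supports analysis by customer, while the full e-mail address, card number, and phone number never leave the organization.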
8) Compliance with Regulations:
• Organizations must comply with data protection regulations like GDPR (General Data Protection Regulation) in the European Union or HIPAA (Health Insurance Portability and Accountability Act) in the United States.
• Non-compliance can result in hefty fines and legal consequences.

9) Awareness and Training:
• Human error is a common cause of data breaches. Regular training and awareness programs for employees can help in preventing accidental disclosures and ensuring that employees understand their roles and responsibilities in maintaining data security.

10) Data Lifecycle Management:
• Proper management of data throughout its lifecycle, including secure storage, archival, and deletion when it's no longer needed, is crucial. Unused or outdated data can become a security risk if not managed appropriately.

Data Pre-Processing Overview in DAV

• Effective data collection is the foundation of meaningful data analytics and visualization.

• Data pre-processing is a crucial step in the data analytics and visualization process.

• It involves cleaning, transforming, and organizing raw data into a format that can be effectively utilized for analysis and visualization.

1) Data Cleaning:
• Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set or database table.
• The cleaning step detects incomplete, inaccurate, inconsistent, and irrelevant data and then modifies or deletes it.
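
A minimal Python sketch of two common cleaning actions: dropping exact duplicate records and filling missing numeric values with the column median. The function name clean, the "income" field, and the sample rows are hypothetical; median imputation is one possible choice, not the only one.

```python
from statistics import median


def clean(rows, numeric_field):
    """Drop exact duplicates, then fill missing numeric values with the median."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:  # keep only the first copy of each record
            seen.add(key)
            unique.append(row)
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = median(observed)  # median of the values that are present
    return [
        {**r, numeric_field: fill if r[numeric_field] is None else r[numeric_field]}
        for r in unique
    ]


# Hypothetical raw records: one duplicate and one missing income.
raw = [
    {"id": 1, "income": 40},
    {"id": 2, "income": None},
    {"id": 3, "income": 60},
    {"id": 1, "income": 40},
    {"id": 4, "income": 55},
]
cleaned = clean(raw, "income")
```

The median is often preferred over the mean for filling gaps because it is less affected by extreme values.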

2) Data Integration:
• Data integration unifies data residing in different sources and presents a single, unified view of it.
• Data with different representations are put together, and any resulting conflicts are resolved.
• This process is vital in many scientific and commercial applications, and it becomes more significant as data volumes grow.
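
The following Python sketch illustrates integration of two sources that describe the same customers with different representations. Everything here is an assumption made for illustration: the two source layouts, the field names, the cents-to-currency conversion, and the rule that the most recently updated e-mail address wins a conflict.

```python
def integrate(crm_rows, billing_rows):
    """Merge two sources into one record per customer."""
    unified = {}
    for row in crm_rows:
        unified[row["customer_id"]] = {
            "customer_id": row["customer_id"],
            "email": row["email"],
            "updated": row["updated"],
            "total_spend": 0.0,
        }
    for row in billing_rows:
        cid = row["cust"]  # same entity, different column name
        rec = unified.setdefault(
            cid, {"customer_id": cid, "email": None, "updated": "", "total_spend": 0.0}
        )
        rec["total_spend"] += row["amount_cents"] / 100  # cents to currency units
        # Conflict resolution: keep the most recently updated e-mail address.
        # ISO dates (YYYY-MM-DD) compare correctly as strings.
        if row.get("email") and row["updated"] > rec["updated"]:
            rec["email"], rec["updated"] = row["email"], row["updated"]
    return sorted(unified.values(), key=lambda r: r["customer_id"])


# Hypothetical source extracts, invented for this example.
crm = [
    {"customer_id": 1, "email": "a@old.example", "updated": "2023-01-10"},
    {"customer_id": 2, "email": "b@example.com", "updated": "2023-03-01"},
]
billing = [
    {"cust": 1, "amount_cents": 1250, "email": "a@new.example", "updated": "2023-06-01"},
    {"cust": 1, "amount_cents": 750, "email": "a@new.example", "updated": "2023-06-01"},
    {"cust": 3, "amount_cents": 500, "email": None, "updated": "2023-02-01"},
]
customers = integrate(crm, billing)
```

Writing the conflict rule down explicitly, as in the comment above, makes it easier to review with stakeholders than leaving it implicit in a join.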

3) Data Transformation:
• Data transformation converts unprocessed data into an understandable form.
• It consists of data normalization, aggregation, and generalization.
• In preprocessing, data normalization usually means rescaling numeric attributes to a common range so that they are comparable. In database design, the same term refers to arranging columns and tables so that redundancy is minimal, which cuts down on processing time and complexity.
• Data aggregation creates a brief summary for a faster overview.
• Data generalization, also known as rolling up data, creates successive layers of summary data in an evaluation database.
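
The three transformations can be sketched in Python as below. This example takes normalization in its preprocessing sense of rescaling values to the range 0 to 1 (min-max scaling), which is an assumption about the intended meaning; the function names and the sales records with a city-to-state hierarchy are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(rows, key, value):
    """Sum `value` for each distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical sales records, invented for this example.
sales = [
    {"city": "Pune", "state": "MH", "amount": 100.0},
    {"city": "Mumbai", "state": "MH", "amount": 300.0},
    {"city": "Chennai", "state": "TN", "amount": 200.0},
    {"city": "Pune", "state": "MH", "amount": 50.0},
]
scaled = min_max_normalize([row["amount"] for row in sales])  # normalization
by_city = aggregate(sales, "city", "amount")    # aggregation
by_state = aggregate(sales, "state", "amount")  # generalization: roll up city to state
```

Rolling up from city to state uses the same summing step at a coarser level of the hierarchy, which is why one helper serves both aggregation and generalization here.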

4) Data Reduction:
• Data reduction is the process of transforming digital information into an ordered and simplified form.
• This information is generally derived through empirical or experimental means.
• It involves reducing large amounts of data into smaller, meaningful parts.
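
One possible reduction strategy, which the session does not name explicitly, is attribute subset selection: dropping attributes that carry no information. The Python sketch below removes columns whose value never changes. The function name and the sensor readings are hypothetical.

```python
def drop_constant_columns(rows):
    """Remove attributes that hold the same value in every record."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if len({row[c] for row in rows}) > 1]
    return [{c: row[c] for c in keep} for row in rows]


# Hypothetical sensor readings: "unit" and "site" never vary.
readings = [
    {"sensor": "s1", "unit": "C", "site": "lab", "value": 21.5},
    {"sensor": "s2", "unit": "C", "site": "lab", "value": 22.1},
    {"sensor": "s1", "unit": "C", "site": "lab", "value": 21.9},
]
reduced = drop_constant_columns(readings)
```

Sampling and aggregation are other ways to reduce data volume; which one fits depends on what the analysis needs to preserve.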

5) Data Discretization:
• Data discretization is useful when you have a large amount of numeric data but want to classify it using nominal values.
• The continuous data is split into a finite set of intervals, and the label of each interval serves as the nominal value. The aim is to convert continuous attributes into intervals with minimal loss of information.
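
A minimal Python sketch of discretization using equal-width binning, one simple way to form the intervals. The function name, the sample ages, and the three labels are assumptions made for illustration; other schemes such as equal-frequency binning would place the boundaries differently.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        idx = int((v - lo) / width) if width else 0
        idx = min(idx, len(labels) - 1)  # the maximum value falls in the last bin
        result.append(labels[idx])
    return result


# Hypothetical ages, split into three intervals of width 20 starting at 18.
ages = [18, 25, 37, 44, 59, 63, 78]
groups = equal_width_bins(ages, ["young", "middle", "senior"])
```

With these values the intervals are 18 to 38, 38 to 58, and 58 to 78, so the first three ages are labelled "young", 44 is "middle", and the rest are "senior".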

Summary

Data collection and preprocessing are critical steps in the data analysis process,
laying the foundation for accurate and reliable insights derived from the data.
These steps involve identifying, acquiring, cleaning, and transforming data to
make it suitable for analysis and modelling. Properly executed data collection and
preprocessing enhance the quality and effectiveness of downstream analyses and
machine learning tasks.

SELF-ASSESSMENT QUESTIONS

1. Have I detected and dealt with outliers in the data?
2. Did I use appropriate methods such as visualization, statistical tests, or transformation techniques?
3. Have I normalized the data to ensure consistent ranges for variables?
4. Did I consider the impact of normalization on the performance of different algorithms?
5. Have I split the dataset into training and testing sets to assess model generalization?
6. Did I consider stratified sampling, especially for imbalanced datasets?
7. Why is transactional data important for businesses, and how can it be used for analysis?
8. What measures would you take to ensure the security and privacy of transactional data?
Summary

a) Data Collection: The primary purpose of data collection is to gather raw data for analysis
and decision-making in data science and analytics.

b) Data Sources: Structured data, such as Excel spreadsheets, is organized into a predefined
format, making it suitable for analysis.

c) Data Cleaning: Techniques like replacing missing values with the median of the column
are used to handle missing data in datasets.

d) Feature Engineering: Feature engineering involves creating new features from existing
data to enhance model performance and analysis.
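
As a small illustration of item d), the Python sketch below derives three new features from an existing order record. The function name, the field names, and the sample order are hypothetical; which derived features are useful depends on the analysis at hand.

```python
from datetime import date


def add_features(order):
    """Return a copy of the order with derived features added."""
    day = date.fromisoformat(order["order_date"])
    return {
        **order,
        "total": order["quantity"] * order["unit_price"],  # combined feature
        "is_weekend": day.weekday() >= 5,  # Saturday is 5, Sunday is 6
        "month": day.month,  # seasonal signal
    }


# Hypothetical order record, invented for this example.
order = {"order_date": "2024-03-09", "quantity": 3, "unit_price": 2.5}
enriched = add_features(order)
```

The original fields are kept alongside the new ones, so later steps can choose which version to use.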

TERMINAL QUESTIONS

• What are the main objectives of data collection in the context of data science and analytics?
• How do you differentiate between structured and unstructured data? Provide examples of each.
• Explain the importance of handling missing values in a dataset during data preprocessing.
• What is feature engineering, and why is it a crucial step in data preprocessing?
• Describe the process of data integration and its significance in data analysis.
• Why are ethical considerations important in data collection and preprocessing? Provide examples of ethical dilemmas in data handling.
• What are the common techniques used for data normalization in data preprocessing? How does normalization improve data analysis?
• Discuss the role of data quality assurance in ensuring reliable and accurate data for analysis.
THANK YOU

Team – DAV